Parallel Corpora


About this Dataset

The translation dataset is a collection of parallel corpora: texts translated from English into other languages. It contains 4 billion units across 16 domains. Several quality levels are available, and the quality level affects pricing. Language pairs available (translation from English): Albanian, Arabic, Armenian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Korean, Kyrgyz, Latvian, Lithuanian, Malay, Maltese, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swedish, Thai, Turkish, Ukrainian, Vietnamese.

This dataset is covered by our standard Data License Agreement. The license is perpetual and allows for the commercialization of all models built on the data.

Download Free Sample

Fill out the form below to get access to a sample of the Parallel Corpora dataset.


By downloading, installing, accessing, and/or using this data sample, you consent to receive communications from Defined.ai and affirm your acceptance of our Privacy Policy, Terms of Use, and Data License Agreement. Consent can be revoked at your discretion.


© 2025 DefinedCrowd. All rights reserved.