February 21, 2025

70+ languages now supported in OCR

By Karan Sampath, Engineer and Jon Fritz, CPO

We're excited to announce support for 70+ languages in DocParse's OCR functionality, including Chinese, Hindi, Japanese, Arabic, and more. Our multi-language support leverages DocParse's leading OCR technology, which we discuss in this blog post.

You can easily set the OCR language using the ocr_language parameter. For a list of supported languages, visit the DocParse documentation.

Let's kick the tires on this new feature! For this example, we'll use this document in Chinese. Next, let's run this through DocParse using the Aryn SDK:

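from aryn_sdk.partition import partition_file

# Placeholder: replace with your own Aryn API key.
aryn_api_key = "YOUR-ARYN-API-KEY"

# Open the PDF in binary mode and partition it with OCR set to Chinese.
# See the DocParse documentation for the exact language values accepted.
with open("path/to/Chinese_(simplified)_PDF.pdf", "rb") as file:
    partitioned_file = partition_file(file, aryn_api_key=aryn_api_key,
        extract_table_structure=True, use_ocr=True,
        ocr_language="Chinese")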

DocParse will return the parsed document, and the extracted Chinese text is in the text_representation field of each element of type Text:

 {...
 	{
            "type": "Text",
            "bbox": [
                0.11757302676930147,
                0.663062577681108,
                0.8859951602711397,
                0.7473425847833807
            ],
            "properties": {
                "score": 0.77675861120224,
                "page_number": 1
            },
            "text_representation": "投诉得到解决是民权司联邦协调和监察处(FCS)一项举措的组成部分,以确保该州法院 符合第六篇对语言服务的要求。联邦协调和监察处通过诸如最近发布的法院语言服务计划 和援助工具等工具,给州法院系统提供政策指导和技术援助,并在全国范围内承担实施 行动。"
        },
        {
            "type": "Text",
            "bbox": [
                0.11769083359662225,
                0.7665277099609376,
                0.8580068072150735,
                0.8044300426136364
            ],
            "properties": {
                "score": 0.701048731803894,
                "page_number": 1
            },
            "text_representation": "新泽西方面的事宜是由迪迪.摩西律师处理的,得到特别法律顾问克里斯」.斯通曼的协 助。"
        },
        {
            "type": "Text",...
}
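
To work with the OCR output, you will usually want just the text from the elements of type Text. Here is a minimal sketch of one way to do that. It assumes the parsed document is a dictionary whose elements sit in a list under a top-level "elements" key (the snippet above elides the outer structure, so check this against your own output). The helper name extract_text and the sample dictionary are ours, for illustration only; the second sample element is hypothetical.

```python
def extract_text(partitioned, element_type="Text"):
    """Collect text_representation from every element of the given type."""
    # Assumption: elements live under a top-level "elements" key.
    elements = partitioned.get("elements", [])
    return [
        el["text_representation"]
        for el in elements
        if el.get("type") == element_type and el.get("text_representation")
    ]


# Small stand-in for the parsed document, reusing one element from the output above.
sample = {
    "elements": [
        {
            "type": "Text",
            "properties": {"score": 0.701048731803894, "page_number": 1},
            "text_representation": "新泽西方面的事宜是由迪迪.摩西律师处理的,得到特别法律顾问克里斯」.斯通曼的协 助。",
        },
        # Hypothetical non-text element, included only to show the filtering.
        {"type": "Image", "properties": {"page_number": 1}},
    ]
}

for text in extract_text(sample):
    print(text)
```

In a real run you would pass partitioned_file instead of sample. Elements without a text_representation are skipped rather than raising an error.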

It's that easy! We'd love to hear what languages you're using with DocParse - drop us an email or join us on Slack.