Release Notes: DocumentCloud Beta now supports making PDFs searchable in 103 languages

Join us Thursday to hear more about upgrading your newsroom to the DocumentCloud Beta

Written by
Edited by Beryl Lipton

DocumentCloud has long offered a number of options when it comes to making documents easier to search, including support for OCR, which “reads” through text that’s saved as images to make it searchable. With the latest upgrades to the DocumentCloud Beta, we’re expanding support from 22 languages to 103, so whether you’re analyzing a document in Afrikaans, Yiddish, or anything in between, we’ve got you covered.

For previous site improvements, check out all of MuckRock’s release notes, and if you’d like updates emailed to you — along with ways to help contribute to the site’s development yourself — subscribe to our developer newsletter here.

Multilanguage support is back and better than ever in the DocumentCloud Beta

We’ve been ticking off the list of key features to port into the upgraded version of DocumentCloud, and today we’re excited to share that we’ve wrapped up the number one request: Multilanguage OCR support. Now, if you upload a document that’s missing searchable text, DocumentCloud can try to parse it in 103 different languages.

https://muckrock.s3.us-east-1.amazonaws.com/files_static/2020/ocr_omniselect.mov

And we’ve architected it so that no matter which language you choose, we still process any document in less than a minute. We’re using the latest and greatest open source OCR technology from Tesseract, which helps improve accuracy and allows us to offer such a broad base of language support.
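For readers curious how a language choice reaches Tesseract, here is a minimal sketch that assembles a command line for the open source tesseract program, which takes a language code through its -l flag (for example "afr" for Afrikaans or "yid" for Yiddish). The function name and file names are hypothetical, and this is not a description of how DocumentCloud itself invokes Tesseract.

```python
from typing import List


def build_tesseract_command(image_path: str, output_base: str, lang: str = "eng") -> List[str]:
    """Assemble (but do not run) a tesseract command line.

    Hypothetical helper for illustration only. The tesseract CLI takes an
    input image, an output base name, and a language code via -l.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


# Example: a scanned page in Afrikaans (file names are made up).
command = build_tesseract_command("page-1.png", "page-1", lang="afr")
print(" ".join(command))
```

Passing the resulting list to subprocess.run would perform the OCR on a machine that has Tesseract and the matching language data installed.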

We’ve also included an advanced option called “Force OCR.” Generally, when DocumentCloud detects that there’s already searchable text within a PDF, we skip OCR, presuming that the PDF’s text is probably an accurate representation of what’s in the file. But sometimes you do want us to replace that text with a fresh read-through. In those cases, Force OCR re-processes the document with our own OCR.
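The decision described above can be sketched in a few lines. This is an illustration of the logic only, assuming a hypothetical function name and a simple "is there any embedded text" check; the actual detection DocumentCloud uses is not described here.

```python
def should_run_ocr(embedded_text: str, force_ocr: bool = False) -> bool:
    """Decide whether a PDF should be sent through OCR.

    Hypothetical sketch: OCR runs when the user forces it, or when the
    PDF carries no searchable text of its own.
    """
    if force_ocr:
        # The user asked for a fresh read-through, so ignore existing text.
        return True
    # Treat empty or whitespace-only text as "no searchable text".
    return not embedded_text.strip()


# A scanned PDF with no text layer gets OCR by default.
print(should_run_ocr(""))
# A PDF with existing text is left alone unless Force OCR is set.
print(should_run_ocr("Existing text layer"))
print(should_run_ocr("Existing text layer", force_ocr=True))
```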

Both of these options appear under “More Options” when you upload documents. We’re working on implementing a way to save language preferences going forward so you don’t always have to select a language if most of your documents aren’t in English.

This update doesn’t offer multi-language support for interface elements, but that is on our road map as well.

Current DocumentCloud users: Join us Thursday for everything you need to know about the coming upgrade!

We’ve migrated roughly a third of newsrooms over to the DocumentCloud Beta, and our goal is to move everyone over before the end of the year. We want to make this as smooth a process as possible, so we encourage all administrators of DocumentCloud accounts that are still using the legacy version to join us Thursday for a webinar walking you through everything you need to know about the upgrades, including key changes to the interface, how you can test it early, and how user management is changing.

We also strongly recommend checking to see if you’re a current API user — we’ve overhauled the API with some major changes that will require tweaking your code, but we’re eager to make this as seamless a change as possible.

If you can’t make Thursday’s session, we’ll record it and have it available to watch later, or you can just email DocumentCloud support at any time.


Image via Wikimedia Commons