Release Notes

The latest release of DocumentCloud introduces OCR injection: OCR’d text from Textract or any of our Add-Ons is now grafted into the underlying PDF, so downloaded files can be searched locally and offline no matter which OCR engine processed them. Additional improvements include updated DocumentCloud API documentation, bug fixes for page-level annotation access, and enhanced search result excerpts for easier text matching.

DocumentCloud

OCR Injection

Documents OCR’d by Textract (built in) or any of our Add-Ons now also get the OCR’d text grafted back into the underlying PDF during processing. Previously, only Tesseract had this grafting functionality; the other OCR engines stored the newly OCR’d text only on DocumentCloud.

Before this update, the locally downloaded file would be missing the newly OCR’d text, so a search within the downloaded copy of the PDF would not find it. The OCR’d text was still available for search from within DocumentCloud, but the underlying PDF was not updated.

Now that OCR injection is applied for all OCR engines, the downloaded file can be searched locally, with all of the newly OCR’d text added back into the original PDF.
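
For readers who want to check a downloaded file themselves, here is a minimal sketch of a local search over a PDF's embedded text layer. It is not part of DocumentCloud: the function names are hypothetical, and reading the PDF assumes the third-party pypdf library is installed. A file with no injected OCR text would typically yield empty page text and no matches.

```python
from typing import List


def find_pages(page_texts: List[str], query: str) -> List[int]:
    """Return 1-based page numbers whose text contains the query (case-insensitive)."""
    needle = query.casefold()
    if not needle:
        return []  # an empty query would otherwise match every page
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


def has_text_layer(page_texts: List[str]) -> bool:
    """True if at least one page carries non-whitespace embedded text."""
    return any(text.strip() for text in page_texts)


def extract_page_texts(pdf_path: str) -> List[str]:
    """Read each page's embedded text with pypdf (assumed installed)."""
    # Imported lazily so the search helpers above work without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

With a downloaded copy saved under a placeholder name such as `downloaded.pdf`, calling `find_pages(extract_page_texts("downloaded.pdf"), "some phrase")` would list the pages where the phrase appears in the embedded text.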