An upside-down stock photo of documents in Russian and manila envelopes.

Release Notes: Making it easier to sort, filter and reprocess document OCR

We also upgraded DocumentCloud logging to help us recover more quickly if Add-Ons stop working

Written by
Edited by Michael Morisy

Since our last Release Notes, we released OCR Tagger, a new Add-On that tags your documents based on the OCR engine used. We also added better logging for when scheduled Add-Ons like Klaxon or Scraper get disabled, which makes it easier to diagnose and correct outages that affect Add-Ons.

DocumentCloud Add-Ons

Add-On Disable Logs

When a scheduled Add-On like Klaxon fails five times in a row, it is disabled and its user is notified by email that the Add-On failed the last five times it ran. In the rare event that DocumentCloud has an outage lasting several hours, many Add-Ons get disabled at once. Previously, there was no dedicated logging that let us easily track and re-enable scheduled Add-Ons that failed during an outage. Now we track when Add-Ons get disabled and have a mechanism to re-enable them in bulk after an outage. This logging also makes it easier for us to help users debug a scheduled Add-On that is giving them trouble.
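
For readers curious how this kind of logic fits together, here is a minimal Python sketch of a consecutive-failure counter with a disable log and a bulk re-enable step. It is an illustration only, not DocumentCloud's actual implementation: the class and function names (ScheduledAddOn, DisableLog, record_run, reenable_after_outage) are hypothetical, and the only detail taken from the description above is the threshold of five failures in a row.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

MAX_CONSECUTIVE_FAILURES = 5  # threshold described in the release notes


@dataclass
class ScheduledAddOn:
    """Hypothetical stand-in for a scheduled Add-On such as Klaxon."""
    name: str
    enabled: bool = True
    consecutive_failures: int = 0


@dataclass
class DisableLog:
    """Hypothetical log of disable events so they can be looked up later."""
    entries: list = field(default_factory=list)

    def record(self, addon, when):
        self.entries.append((addon, when))

    def disabled_between(self, start, end):
        # Return Add-Ons whose disable event falls inside the window.
        return [addon for addon, when in self.entries if start <= when <= end]


def record_run(addon, succeeded, log, now=None):
    """Update the failure streak; disable and log at the threshold."""
    if not addon.enabled:
        return
    if succeeded:
        addon.consecutive_failures = 0
        return
    addon.consecutive_failures += 1
    if addon.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        addon.enabled = False
        log.record(addon, now or datetime.now(timezone.utc))


def reenable_after_outage(log, start, end):
    """Re-enable every Add-On that was disabled during the outage window."""
    restored = []
    for addon in log.disabled_between(start, end):
        if not addon.enabled:
            addon.enabled = True
            addon.consecutive_failures = 0
            restored.append(addon.name)
    return restored
```

The point of the sketch is that recording a timestamp at the moment an Add-On is disabled is what makes a bulk recovery possible: given the start and end of an outage, everything disabled in that window can be found and switched back on in one pass.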

OCR Tagger

The OCR Tagger Add-On tags documents in your collection with the OCR engine that was applied to the document. This allows you to search within your document set for documents that used a particular OCR engine.

Tagging documents makes it easier to sort them and see which OCR engine processed each one. That is particularly helpful if the base OCR is not working well enough and you need to re-run OCR on certain documents to improve the results. In addition to the default Tesseract OCR engine, DocumentCloud now offers alternatives, including free options like docTR and OCRSpace, as well as premium Add-Ons like Google Cloud Vision OCR and Azure Document Intelligence. Each has different strengths, and together they offer a wide array of approaches to even the gnarliest of document collections.
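
As a rough illustration of why engine tags help with reprocessing, the Python sketch below groups documents by the engine recorded on them and picks out the ones that used a given engine. It does not call the DocumentCloud API: the documents are plain dictionaries, and the "ocr_engine" key and the engine labels in the usage example are assumptions made for the example, not the tags the Add-On actually writes.

```python
from collections import defaultdict


def group_by_engine(documents):
    """Group document titles by their (hypothetical) 'ocr_engine' tag."""
    groups = defaultdict(list)
    for doc in documents:
        # Documents without a tag are collected under 'unknown'.
        groups[doc.get("ocr_engine", "unknown")].append(doc["title"])
    return dict(groups)


def candidates_for_rerun(documents, engine):
    """Return titles of documents processed by the given engine."""
    return group_by_engine(documents).get(engine, [])
```

For example, given a list such as [{"title": "Memo", "ocr_engine": "tesseract"}, {"title": "Scan", "ocr_engine": "doctr"}], calling candidates_for_rerun(docs, "tesseract") would return only the documents tagged with that engine, which is the set you might then send through a different OCR Add-On.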


Header image by Quinn Dombrowski and licensed under CC BY-SA 2.0