At NICAR 2024, three University of Texas students shared tips and experiences about using DocumentCloud to make a searchable collection of local campaign finance reports that aren’t gathered by state agencies.
The students – junior Karina Kumar and seniors Athena Hawkins and Carolyn Parmer – were the vanguard of a team I put together last year to develop processes to gather these documents at a larger scale. While we are able to get digital copies of the forms, they often come with multiple challenges: Most are scanned PDFs with no searchable text layer, many have redactions and some are handwritten.
We’ve been using DocumentCloud because it helps us in a number of ways:
- The tags and key:value pairs allow us to create an organization system for our documents. Even though built-in optical character recognition (OCR) is applied when PDFs are uploaded, we wanted to ensure a user could find documents relating to specific candidates, races or elections.
- Premium OCR engines like Azure Document Intelligence, Amazon Textract and Google Cloud Vision do a much better job at transcribing handwriting than the default Tesseract engine.
- While we aren’t in a position to widely share our collections at this time, the user management by project is also an asset for us.
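To illustrate how key:value pairs can support lookups by candidate, race or election, here is a minimal Python sketch. It is an assumption-laden illustration, not the team's actual schema or code: the field names (`candidate`, `race`, `election`), the `FilingRecord` class, the helper functions and the sample names are all hypothetical, and the records are modeled locally rather than through the DocumentCloud API.

```python
from dataclasses import dataclass, field


@dataclass
class FilingRecord:
    """Local stand-in for one uploaded report (hypothetical structure)."""
    title: str
    data: dict = field(default_factory=dict)   # key:value pairs
    tags: list = field(default_factory=list)   # free-form tags


def make_record(title, candidate, race, election):
    """Attach a consistent set of key:value pairs to a report."""
    return FilingRecord(
        title=title,
        data={
            "candidate": candidate.strip(),
            "race": race.strip(),
            "election": election.strip(),
        },
        tags=["campaign-finance"],
    )


def find_records(records, **criteria):
    """Return records whose key:value pairs match every criterion exactly."""
    return [
        r for r in records
        if all(r.data.get(key) == value for key, value in criteria.items())
    ]


# Made-up sample data for illustration only.
records = [
    make_record("Report A", "Jane Doe", "City Council District 1", "2024-05-04"),
    make_record("Report B", "Jane Doe", "City Council District 1", "2022-05-07"),
    make_record("Report C", "John Roe", "Mayor", "2024-05-04"),
]

by_candidate = find_records(records, candidate="Jane Doe")
by_election = find_records(records, election="2024-05-04", race="Mayor")
```

Keeping the keys and their spellings consistent across every report is what makes this kind of exact-match filtering dependable, which full-text OCR search alone cannot guarantee on scanned or handwritten forms.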
The project is one of several experiential learning projects I run through the Media Innovation Group, which is funded through the Dallas Morning News Journalism Innovation Endowment at the Moody College of Communication’s School of Journalism and Media.
MuckRock has been a good partner in this project, helping us get the most out of the platform and even building new features to help overcome challenges we’ve faced.