Automate your beat: Unredact documents, monitor websites and much more with DocumentCloud

Ever had a spreadsheet-turned-PDF you’re stuck untangling? Wish story ideas came right to you? Over the past two years, MuckRock’s DocumentCloud tool has built several ways to automate common journalism and research tasks, taking once-cumbersome processes and breaking them down to just a few clicks — or even letting users set up automations that work without having to do anything at all. Most of these tools are free of charge and available to any verified newsroom, so register today to get started.

In this guide, we’ll run through a host of everyday use cases and workflows you can set up to simplify your day-to-day drudgery. But this is just the start of what DocumentCloud can do: If you’d like to learn more, browse our list of current Add-Ons and register for the DocumentCloud newsletter. If you have questions or want to try building your own Add-On, check out the Add-On template GitHub repository, and then join us on Slack and say hello.

Get alerts for site changes with Klaxon

Want to monitor a specific webpage for new updates? It’s now easy to get alerts when a page — or even just a specific part of a page — has changed thanks to Klaxon Cloud.

First, while logged in, pull up the Klaxon Add-On and then pin it by clicking the thumbtack icon. This will ensure it always shows up in your sidebar for easy access.

Then, make sure you have a bookmarks toolbar showing in your browser. If you’re using Firefox, Safari, Chrome or Opera, drag this link up to your bookmarks bar: Monitor with Klaxon.

If you’re using Internet Explorer or Edge, dragging might not work. Right-click on the bookmarklet instead, and select Add to Favorites at the prompt. Now you’re ready to go.

Now, from a webpage you want to monitor, click on the bookmark and then follow the prompts to highlight the area of the page you want to watch for changes. Klaxon lets you schedule checks hourly, daily or weekly, and you can configure it to send alerts to a Slack channel in addition to the default email alerts. Learn more by watching our tutorial video or reading more about Klaxon Cloud here.
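If you are curious what watching part of a page for changes can look like in code, here is a minimal Python sketch. It is not Klaxon's actual implementation: the function names and the sample HTML snippets are hypothetical, and it assumes you already have the text of the highlighted area from two separate checks.

```python
import hashlib
import re


def fingerprint(fragment: str) -> str:
    """Hash a page fragment, ignoring differences in whitespace."""
    normalized = re.sub(r"\s+", " ", fragment).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_hash: str, fragment: str) -> bool:
    """Compare a stored fingerprint with the fragment as it looks now."""
    return fingerprint(fragment) != previous_hash


# Hypothetical snapshots of the same highlighted area on two different days.
monday = "<div id='agenda'>Next meeting:   June 3</div>"
tuesday = "<div id='agenda'>Next meeting: June 10</div>"

saved = fingerprint(monday)
print(has_changed(saved, monday))   # False
print(has_changed(saved, tuesday))  # True
```

Storing only a hash keeps each check cheap, and collapsing whitespace first avoids false alarms when a page is merely re-indented.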

Automatically archive documents of interest — and get fine-tuned alerts

Klaxon is great if you want to keep tabs on when a web page updates, but DocumentCloud is most useful if you have documents to actually analyze. Fortunately, the Scraper Add-On will fetch all the linked documents on a given page and drop them into your DocumentCloud account for safekeeping. You can optionally specify a project to put them in.
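To make the idea of fetching all the linked documents on a page concrete, here is a rough sketch using only Python's standard library. It is an illustration, not the Scraper Add-On's own code: the extension list, the class and function names, and the example.gov URLs are all assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of file types worth collecting; adjust to taste.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx")


class DocumentLinkParser(HTMLParser):
    """Collect absolute URLs for links that look like documents."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(DOCUMENT_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


def find_document_links(html, base_url):
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content from a public reading room.
page = (
    '<a href="/docs/permit-123.pdf">Permit</a> '
    '<a href="/about">About</a> '
    '<a href="https://example.gov/files/report.PDF?dl=1">Report</a>'
)
print(find_document_links(page, "https://example.gov/permits/"))
```

Relative links are resolved against the page address, so the results can be handed straight to whatever does the downloading.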

It also offers fine-grained alerting tools — you can specify if you want it to alert you to all newly uploaded documents on a page or only ones that mention a set of keywords you define. This is handy if you’re covering a public body and only want to get updates if a certain topic, company or individual is mentioned.
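The alerting choice described above, everything new or only keyword matches, boils down to a small filter. The sketch below shows one way to express it in Python, assuming the document text is already extracted; the function names and sample text are hypothetical and this is not the Add-On's own logic.

```python
def matching_keywords(text, keywords):
    """Return the keywords that appear in the text, ignoring case."""
    lowered = text.lower()
    return [kw for kw in keywords if kw.lower() in lowered]


def should_alert(text, keywords):
    """Alert on every new document if no keywords are set, otherwise only on matches."""
    if not keywords:
        return True
    return bool(matching_keywords(text, keywords))


# Hypothetical text from a newly posted meeting agenda.
agenda = "The board will review the Acme Paving contract for Main Street."

print(should_alert(agenda, []))                         # True
print(should_alert(agenda, ["acme paving", "zoning"]))  # True
print(should_alert(agenda, ["stadium"]))                # False
```

Simple substring matching like this will also catch keywords inside longer words, so short keywords may produce more alerts than you expect.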

As with Klaxon, you can schedule Scraper to check a page hourly, daily or weekly, or have it run just once if you’re trying to grab all the documents off a public reading room quickly. You can also have the Scraper search not only the page you link to but also all the other pages linked from it, which makes it useful for sites that spread their documents across several pages. Try it on a local agency that posts permits, inspections or other regular reports.

Flag any opportunities to peek beneath bad redactions

Black bars are the bane of most requesters, but sometimes there’s valuable information hidden behind them. The Bad Redactions Add-On will analyze a collection of documents and look behind every redaction mark it finds to see if there is hidden text it can unveil.

It will then privately note where there’s text for you to review, and you can flip into DocumentCloud’s text view to see exactly what is hiding under the redactions. If that material should stay redacted, you can even have Bad Redactions automatically remove the underlying text before publication. Bad Redactions is powered by the open source X-Ray library from Free Law Project.

If you set it to Run on a schedule: Upload, Bad Redactions will automatically run every time you upload a new document, keeping an eye out without any extra work.

Identify sensitive information before you publish

If you build up a big enough collection of documents, you’re bound to come across some released information that should stay private. The PII Detector can help here: you can configure it to spot Social Security numbers, email addresses, zip codes, credit card numbers and more. It uses a variety of regular expressions to find potentially sensitive information, adds private annotations wherever it finds a match, and optionally adds those documents to a project for you to review.
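For a sense of how regular expressions can flag this kind of information, here is a small Python sketch. The patterns are simplified assumptions written for illustration, not the ones the PII Detector actually uses, and they will miss some formats and produce some false positives.

```python
import re

# Simplified, assumed patterns; real-world detection needs more care.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "zip_code": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def find_pii(text):
    """Return (label, matched text, position) tuples in document order."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical line from a released record.
sample = "Contact Jane at jane.doe@example.org. SSN 123-45-6789, ZIP 62704."
for label, value, position in find_pii(sample):
    print(f"{label}: {value} (character {position})")
```

Recording the position of each match is what would let a tool attach a note at the right spot in the document instead of only flagging the file as a whole.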

Transcribe audio files

DocumentCloud also includes free transcription assistance, thanks to integration with the open-source Whisper library. Transcribe Audio can be pointed at a Google Drive or Dropbox folder of audio files, or it can be given a YouTube or other video URL. It will then automatically transcribe the audio and save it as a new document into your DocumentCloud account. You can optionally tweak the settings for faster transcription or to go a little slower and improve accuracy.

Spreadsheet extraction

There are few things more tedious than retyping a spreadsheet that was saved as a PDF. Fortunately, DocumentCloud has a few options that can save you the trouble.

The first is the free Tabula Add-On. Simply select the document or documents that have a spreadsheet embedded in them, then run Tabula. The open source software often works well if someone exported an Excel file directly as a PDF, since it can identify the structure of the spreadsheet without too much trouble. It will notify you when it’s done, and you can download the spreadsheet as a CSV.

If that doesn’t work, there are two options available to users with any paid MuckRock subscription — Azure Table Extraction and Textract Table Extraction. Both work well across a wider set of documents.

Document summarization and tagging

Another paid feature that can save you time or help sift through even the largest document sets with ease is the GPT Playground. This lets you easily connect your documents to OpenAI’s APIs and add a custom prompt. You can have the saved responses added as tagged metadata on your documents or just have it generate a spreadsheet with the results.

This can be used as a simple document summarizer — just put in the prompt, “Summarize this document.” Or you can use it to try to tag documents on relatively sophisticated rubrics, such as “Respond if this document is either in support of further building development, opposed, or ambiguous. Only respond with one of these three words: Support, Opposed, Ambiguous” and then have the output saved as a tag so you can quickly sort through large amounts of public comments. You can even use it to try to extract data like the total value of contracts.
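Language models do not always follow a "respond with one word" instruction exactly, so it can help to tidy responses before treating them as tags. The Python sketch below shows one possible cleanup step for the three-word rubric above. The function name and the "Needs review" fallback are assumptions made for illustration, not part of GPT Playground.

```python
from collections import Counter

ALLOWED_TAGS = ("Support", "Opposed", "Ambiguous")


def normalize_tag(response, fallback="Needs review"):
    """Map a raw model response onto an allowed tag, or a fallback label."""
    cleaned = response.strip().strip(".!\"'").lower()
    for tag in ALLOWED_TAGS:
        if cleaned == tag.lower():
            return tag
    return fallback


# Hypothetical raw responses for three public comments.
responses = ["Support", " opposed.\n", "I think the author supports it"]
tags = [normalize_tag(r) for r in responses]
print(tags)           # ['Support', 'Opposed', 'Needs review']
print(Counter(tags))
```

Routing anything unexpected to a separate label, rather than guessing, keeps off-script answers visible so a person can check them.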

Explore other automations or build and share your own

These are some of the ways hundreds of newsrooms and journalists have already started automating their workflows, but there are now over 50 Add-Ons that cover everything from data extraction to visualization. Most are completely free, while Premium Add-Ons let you tap into a wide range of third-party services for more advanced needs. It’s also easy to develop your own Add-Ons by building on our templates with just a few lines of code. You can read more about Add-Ons or join us on Slack, where we’re happy to help you build and share other tools to make everyone’s life a little easier.

This guide is based on the 2024 NICAR presentation, “Automating Your Beat.”