DocumentCloud Add-Ons: Automate data extraction, alerts, ingestion, and much more with our simple, open source plugin system

Have your docs do more with a click, from the command line, or even in your sleep

Written by
Edited by Mitchell Kotler

DocumentCloud’s most powerful feature has always been our users. Every day, that community pushes the boundaries of what can be done with documents, from solo journo-coders extracting data on deadline to the Documenters platform rethinking how to make public meetings more public.

To help drive that community’s impact and collaboration, today we’re launching Add-Ons, an easy way for anyone to launch, maintain, and share new capabilities right within DocumentCloud, ranging from exporting notes to applying machine learning techniques.

To get started, all you need to do is log in to DocumentCloud, select some documents, and then pick an Add-On. It will start running in the background, and then notify you of its progress. Add-Ons can also optionally send you an email, generate files for you, or be configured to integrate with a wide range of external tools, such as Slack, cloud-hosted APIs, or a range of open source packages.

All DocumentCloud users will get easy access to a growing library of new features. For those who are comfortable writing or modifying code, Add-Ons also offer the ability to quickly build powerful custom analysis tools, workflows, and integrations. If you’re at the NICAR 2022 conference, join us Sunday morning for a session (virtually or in person) where we’ll be talking through and demonstrating Add-Ons live, as well as getting your feedback on how to make them more useful.

Familiar tools, fine-tuned

While the entry point for DocumentCloud Add-Ons is a graphical interface, behind the scenes each Add-On is a standalone GitHub repository. It is given temporary access to a set of documents specified by the user, along with permissions such as uploading, editing, and annotating. All of this happens through the DocumentCloud Python wrapper, which gives you full programmatic access to anything you could do through the web interface, as well as some additional features such as bulk upload via URL.

We have a starter “Hello World” Add-On template that covers many of the basic features and provides a structure to build your own projects. Our hope is that if you’ve ever written a script in Python to scratch your own itch, you’ll feel right at home and can turn it into an Add-On with very few changes.

When a user dispatches an Add-On, the execution itself takes place via GitHub Actions, meaning that Add-On creators and users don’t have to worry about managing a server, setting up a development environment, or many of the other common barriers that seek to thwart your work on deadline.

Inspired by Simon Willison’s Git scraping technique, this approach makes developing and maintaining extensions of DocumentCloud simpler while making it incredibly easy for users to iterate and build on each other’s work in exciting ways, whether that’s tweaking a public records exemption citation extractor to work in a different state or mixing and matching Add-Ons to create powerful combinations that work together.

Run Add-Ons with a click, from the command line, or in your sleep

We wanted to make Add-Ons as approachable as possible for everyone, but also easy to tweak, test, iterate on, and remix. To that end, in addition to running through an in-browser click-to-run interface, every Add-On is just as easy to pull locally and run from the command line, so you can adapt existing code to fit your own needs or integrate services with more intensive computing requirements or sensitive information.

But we’re also excited about the potential to not just simplify but automate a wide range of document analysis tasks, from monitoring and alerting to scraping and data analysis. Via GitHub, you can set up Add-Ons to run on a regular schedule. At NICAR this year, we were excited to see all the interest in and creative implementations of Git scraping, including data pipelining with Aadit Tambe and Nick McMillan and an introduction to Git scraping by Iris Lee, Ben Welsh, and Tambe.

Combined with those techniques, you can use Add-Ons to automate everything from keeping an eye on agency reading rooms to turning messy PDFs into structured spreadsheets and visualizations.
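The core of that kind of monitoring is remembering what a previous run saw and reporting only what is new. Here is a minimal sketch of that idea using only the Python standard library. It is not taken from any existing Add-On: the `find_new_ids` helper, the `seen.json` state file, and the sample IDs are all hypothetical, and a real scheduled job would get its IDs from a search or a scrape instead of a hard-coded list.

```python
import json
import tempfile
from pathlib import Path


def find_new_ids(current_ids, state_path):
    """Return IDs not seen on earlier runs, then record them in the state file.

    Hypothetical helper for illustration; not part of DocumentCloud's API.
    """
    path = Path(state_path)
    # Load the IDs recorded by previous runs, if a state file exists yet.
    seen = set(json.loads(path.read_text())) if path.exists() else set()
    new_ids = [doc_id for doc_id in current_ids if doc_id not in seen]
    # Persist the union so the next scheduled run only reports newer items.
    path.write_text(json.dumps(sorted(seen | set(current_ids))))
    return new_ids


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "seen.json"
        # First run: nothing has been recorded, so every ID counts as new.
        print(find_new_ids([101, 102], state))
        # Later run: only the ID that was not recorded before is reported.
        print(find_new_ids([101, 102, 103], state))
```

In a Git scraping setup, the state file would typically be committed back to the repository after each run, so the history of what changed is preserved alongside the code.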

Use Add-Ons to extract data, organize your notes, and classify documents today

In addition to the Hello World Add-On template that demonstrates basic functionality, we have a few Add-Ons live now that can also serve as a base for you to fork and build on:

  • Regex Extractor: Lets you define a regex string to pull specified text matches from a selection of documents into a spreadsheet.
  • PDF Export: Helps you get your PDFs out of DocumentCloud, adding the selected documents into a Zip file that’s then displayed to you.
  • Note Export: Extracts all the notes on selected documents and saves them as text files you can download.
  • Bulk Edit: Lets you update metadata on many documents at once.
  • SideKick Document Classification: Makes it easy to train a machine learning model to classify documents by an arbitrary type, such as identifying if a document is likely to be an email, a resident complaint, or other categories of records.
  • Notification Alerts: A simple example of a scheduled automation that lets you adjust a search query and have DocumentCloud alert you if any new results match it.
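To give a feel for how small the core of such a tool can be, here is a rough sketch of the idea behind a regex extractor: apply one pattern to the text of several documents and collect the matches as spreadsheet rows. It is not the code of the Regex Extractor Add-On itself. It uses only the Python standard library, and the function names, the sample documents, and the email pattern are all made up for illustration.

```python
import csv
import io
import re


def extract_matches(pattern, documents):
    """Return (title, match) rows for every regex match in each document's text."""
    regex = re.compile(pattern)
    rows = []
    for title, text in documents.items():
        for match in regex.finditer(text):
            rows.append((title, match.group(0)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["document", "match"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for the text of the selected documents.
SAMPLE_DOCUMENTS = {
    "Budget memo": "Contact records@example.org or clerk@example.org.",
    "Meeting minutes": "No addresses listed here.",
}
# Deliberately simple email pattern for illustration, not a full validator.
EMAIL_PATTERN = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

if __name__ == "__main__":
    print(rows_to_csv(extract_matches(EMAIL_PATTERN, SAMPLE_DOCUMENTS)))
```

Swapping in a different pattern, say one for statute citations or dollar amounts, is the kind of small tweak that forking an existing Add-On is meant to make easy.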

Currently, running Add-Ons from the web interface requires submitting them through a review process, which gives our team a chance to check in. You can already run Add-Ons from the command line or as scheduled GitHub Actions, and in the coming months we’ll be adding the ability to directly import and run your own Add-Ons from within the web interface, with no review process.

If you have written an Add-On or have an existing DocumentCloud script or other document analysis tooling you’d like to share as an Add-On for everyone to use, fill out this submission form and our team will follow up with you.

If you want a place to talk through your ideas or troubleshoot code, find us in the MuckRock FOIA Slack #public-tech channel (you might have to ask to be added to the tech channel) or shoot us a note at info@documentcloud.org.

Let’s build tools for a more open world together

The DocumentCloud and MuckRock communities have always inspired and driven us in unexpected ways, and we built DocumentCloud Add-Ons to embrace that and reflect our vision of an open, iterative, and collaborative future for both journalism and communities around the world.

We’re excited to work with Prof. Hilke Schellmann and Sr. Research Scientist Dr. Mona Sloane of New York University to bring their pioneering Gumshoe machine learning analysis platform to DocumentCloud through Add-Ons, with support from the Patrick J. McGovern Foundation. We hope to use this as a model for other ways to collaborate with researchers and others who have exciting technologies that they want to put in the hands of those who inform the public.

To help a wide range of organizations tackle big challenges and build solutions that help their own and other communities, we’ll be announcing additional opportunities to work with our team to support your ideas that push the boundaries of how journalists, researchers, and civic organizations collect, analyze, and share documents with the world. To hear about those opportunities as well as trainings, new Add-Ons, and additional resources, register for the DocumentCloud newsletter, which sends a few updates a month.

We’re also going to be growing our own team in the coming months, particularly with software development, project management, and design roles. We don’t have those exact roles ready to announce quite yet, but if you’re interested, please let us know and we’ll reach out when we’re ready to share more.

We’re incredibly excited to find new ways of bringing the amazing community of journalists, open records fans, and open data lovers together, and we can’t wait to see what you do with Add-Ons.