Release Notes: Import and search emails, extract metadata from PDFs and more

DocumentCloud ❤️ 📧

Over the past few weeks, the MuckRock team has been busy with several updates and improvements. Our new Email Archiver Add-On preserves email files (EML/MBOX) and their metadata for long-term storage by converting them to EA-PDF, a new archive-friendly standard that stores each email as a PDF with its metadata recorded in a consistent way.

The Add-On also uploads the converted emails to DocumentCloud and stores metadata about each email as key-value pairs on the document. Once all of your emails have been converted, you can download a compressed archive of their attachments.

The Metadata Extractor Add-On uses exiftool to pull essential metadata such as Author, Creation Date, Creator, Modify Date and Producer out of the underlying PDF and into DocumentCloud key-value pairs.

We’ve also fixed a few bugs across the Regex Extractor, Push to IPFS/Filecoin and Email Conversion Add-Ons, improving overall stability and reliability.

DocumentCloud

Email Archiver

The EA-PDF specification is an archive-friendly, metadata-preserving PDF format for long-term preservation of email (EML/MBOX) files.

The Email Archiver Add-On, developed with support from the University of Illinois Library’s Email Archives: Building Capacity and Community program, allows you to convert emails to EA-PDF and import the converted emails to your DocumentCloud account. Email attachments are presented for download upon completion of the conversion, and metadata about each email is stored as key-value pairs on the uploaded files. EA-PDF is a new draft archival PDF standard developed by the PDF Association in partnership with the Library of the University of Illinois at Urbana-Champaign and others. This Add-On builds on an email conversion library developed by Tom Habing.

This is an early version of this Add-On, and we welcome feedback via email as well as pull requests to its source code.
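
To give a rough idea of what "metadata as key-value pairs" can look like for an email, here is a minimal Python sketch that reads one EML message with the standard library and collects a few header fields into a flat dictionary. This is not the Add-On's actual code: the function name email_metadata, the choice of headers and the key names are all assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.utils import parsedate_to_datetime


def email_metadata(raw: bytes) -> dict[str, str]:
    """Collect selected headers of one EML message as string key-value pairs.

    Hypothetical helper: the keys below are illustrative, not the keys the
    Email Archiver Add-On actually writes.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    data: dict[str, str] = {}

    # Copy simple headers only when they are present in the message.
    for key, header in (("from", "From"), ("to", "To"), ("subject", "Subject")):
        value = msg[header]
        if value is not None:
            data[key] = str(value)

    # Normalize the Date header to ISO 8601 so values sort and compare cleanly.
    if msg["Date"] is not None:
        data["date"] = parsedate_to_datetime(str(msg["Date"])).isoformat()

    # Record attachment file names as one comma-separated value.
    names = [part.get_filename() or "unnamed" for part in msg.iter_attachments()]
    if names:
        data["attachments"] = ", ".join(names)

    return data
```

Every value is kept as a plain string, so the resulting dictionary could be attached to a document as key-value data without further conversion.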

Metadata Extractor Add-On

As part of our collaboration with the Email Archives: BCC, we are also working to enhance how we extract and display document metadata. The Metadata Extractor Add-On uses exiftool, a popular file metadata analysis tool, to pull the Author, Creation Date, Creator, Modify Date and Producer metadata from the underlying PDF, if they are specified, and then saves those as corresponding key-value pairs on the document.
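
The extraction step described above can be approximated outside DocumentCloud with a short Python sketch. It assumes exiftool is installed and on your PATH, and uses its -j option to request JSON output, in which these fields appear under the tag names Author, CreateDate, Creator, ModifyDate and Producer. The function names here are hypothetical, and this is not the Add-On's own implementation.

```python
import json
import subprocess

# exiftool tag names for the five PDF fields mentioned in this post.
TAGS = ("Author", "CreateDate", "Creator", "ModifyDate", "Producer")


def parse_exiftool_json(output: str) -> dict[str, str]:
    """Turn `exiftool -j` output for one file into string key-value pairs.

    Tags the PDF does not specify are simply left out.
    """
    records = json.loads(output)
    record = records[0] if records else {}
    return {tag: str(record[tag]) for tag in TAGS if tag in record}


def extract_pdf_metadata(path: str) -> dict[str, str]:
    """Run exiftool on one PDF (assumes the exiftool CLI is available)."""
    cmd = ["exiftool", "-j", *(f"-{tag}" for tag in TAGS), path]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_exiftool_json(result.stdout)
```

Keeping the parsing separate from the subprocess call makes it possible to check the key-value mapping against saved exiftool output without having the tool installed.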

Bug Fixes

The Regex Extractor, Push to IPFS/Filecoin and Email Conversion Add-Ons all received bug fixes that made them less likely to fail.