Recent Award

Creating Email Archives from PDFs: The Covid-19 Corpus

Columbia University will contribute email archiving solutions at both ends of the email stewardship cycle: acquisition and preservation on one end, and research access on the other. The focus will be on records of government responses to the Covid-19 pandemic that are being released through FOIA requests and made available online by journalists. Researchers face a number of challenges in accessing these records: they cannot easily determine the scope or arrangement of the collections, or find descriptions of the contents of their main components. To address these challenges, Columbia will build an open-source tool and associated library that takes email embedded in PDFs as input and generates an MBOX file as output, thereby making these records compatible with existing email archiving solutions. In addition, the project team will process a large corpus of FOIAed records on Covid-19 to enhance its value to researchers and to develop it into a new collection within the Freedom of Information Archive (FOIArchive), an aggregated database of government records.
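The PDF-to-MBOX conversion described above can be illustrated with a minimal sketch. It is not the project's tool: it assumes the text of each email has already been extracted from the PDF, and the header pattern, the function names (`parse_email_text`, `to_message`, `write_mbox`), and the `X-Original-Date` header are hypothetical choices made for illustration. Only Python's standard `mailbox` and `email` modules are used.

```python
import mailbox
import os
import re
import tempfile
from email.message import EmailMessage

# Hypothetical header labels as they might appear in text extracted from a PDF.
HEADER_RE = re.compile(r"^(From|To|Sent|Date|Subject):\s*(.*)$")


def parse_email_text(text):
    """Split already-extracted email text into a header dict and a body."""
    headers = {}
    lines = text.strip().splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if match:
            headers[match.group(1)] = match.group(2).strip()
        else:
            # The first non-header line ends the header block.
            body_start = i
            break
    body = "\n".join(lines[body_start:]).strip()
    return headers, body


def to_message(headers, body):
    """Build an email message from the parsed fields."""
    msg = EmailMessage()
    for name in ("From", "To", "Subject"):
        if name in headers:
            msg[name] = headers[name]
    # Printed emails often carry free-form dates ("Sent: Monday, ...").
    # Assumption: keep the raw string in a custom header instead of guessing
    # a normalized Date; a real tool would need proper date parsing.
    raw_date = headers.get("Date") or headers.get("Sent")
    if raw_date:
        msg["X-Original-Date"] = raw_date
    msg.set_content(body)
    return msg


def write_mbox(texts, mbox_path):
    """Append one message per extracted text to an MBOX file; return the count."""
    box = mailbox.mbox(mbox_path)
    count = 0
    try:
        for text in texts:
            headers, body = parse_email_text(text)
            box.add(to_message(headers, body))
            count += 1
        box.flush()
    finally:
        box.close()
    return count


if __name__ == "__main__":
    sample = (
        "From: Jane Doe <jane.doe@example.gov>\n"
        "To: team@example.gov\n"
        "Sent: Monday, March 16, 2020 9:15 AM\n"
        "Subject: Testing update\n"
        "\n"
        "Please see the attached figures."
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.mbox")
        print(write_mbox([sample], path), "message(s) written")
```

Real FOIA releases are considerably messier than this sample (redactions, OCR errors, threaded replies within a single page), so the header parsing here should be read as a placeholder for whatever extraction approach the project adopts.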

Home Department: 

Date: Friday, January 1, 2021 to Friday, December 31, 2021

Research Category: 

Amount: $98,630
