Creating Email Archives from PDFs: The Covid-19 Corpus
Columbia University will contribute email archiving solutions on both ends of the email stewardship cycle: acquisition and preservation on one end, and research access on the other. The focus will be on records of government responses to the Covid-19 pandemic that are being released through FOIA requests and made available online by journalists. Because the emails in these releases arrive embedded in PDFs, researchers face a number of challenges in accessing them: they cannot easily determine the scope or arrangement of the collections, or find descriptions of the contents of the main components. To address these challenges, Columbia will build an open-source tool and associated library that takes email embedded in PDFs as input and generates an MBOX file as output, thereby making these records compatible with existing email archiving solutions. In addition, the project team will process a large corpus of FOIAed records on Covid-19 to enhance its value to researchers and to develop it into a new collection within the Freedom of Information Archive (FOIArchive), an aggregated database of government records.
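To make the PDF-to-MBOX idea concrete, the following is a minimal sketch in Python and not the project's actual tool. It assumes the text of each email has already been extracted from the PDF by some other step, and that each email begins with header-style lines such as "From:", "Sent:", "To:", and "Subject:". The function names `text_to_message` and `write_mbox` and the header pattern are hypothetical, chosen only for illustration.

```python
import mailbox
import re
from email.message import Message

# Hypothetical header pattern: assumes each extracted email starts with
# lines like "From: ...", "Sent: ...", "To: ...", "Subject: ...".
HEADER_RE = re.compile(r"^(From|To|Cc|Subject|Date|Sent):\s*(.*)$", re.IGNORECASE)


def text_to_message(text: str) -> Message:
    """Build a message from the plain text of one email taken from a PDF."""
    # The legacy Message class stores header values verbatim, which suits
    # noisy extracted text that may not be valid RFC 5322 syntax.
    msg = Message()
    lines = text.strip().splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line.strip())
        if not match:
            # The first non-header line marks the start of the body.
            body_start = i
            break
        name = match.group(1).title()
        if name == "Sent":
            # Printed emails often label the date "Sent"; map it to "Date".
            name = "Date"
        if name not in msg:
            msg[name] = match.group(2).strip()
    body = "\n".join(lines[body_start:]).strip()
    msg.set_payload(body, charset="utf-8")
    return msg


def write_mbox(texts, path) -> int:
    """Append one message per extracted text to an MBOX file; return the count."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    box.lock()
    count = 0
    try:
        for text in texts:
            box.add(text_to_message(text))
            count += 1
    finally:
        box.close()  # flushes pending writes and releases the lock
    return count
```

A real pipeline would also need to handle OCR errors, redactions, multi-page messages, and threads printed as a single document, none of which this sketch attempts. The resulting file can be opened with `mailbox.mbox(path)` or imported into tools that accept MBOX.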
