Lost in the image records are the steps that involved the data – and there were a lot of them. The archive was text generated by an OCR (optical character recognition) process, and it was incredibly messy. To make matters worse, the file names for each issue were machine-generated and didn’t correspond to the actual date order of the documents. A great deal of our time was spent cleaning up this data and compiling customized datasets (many of which never ended up getting used).