The CALO dataset is perhaps the most widely used dataset and is available for download at http://www.cs.cmu.edu/~enron/. It is a derivative of the FERC dataset, has been referenced in many email research studies, and is used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows:
CALO correctly identified 8 duplicate, misspelled custodians in the FERC dataset, resulting in 150 CALO custodians vs. 158 FERC custodians.
In addition to the above, the CALO dataset differs from the FERC dataset in a number of ways:
- Message-ID: New Message-IDs have been created and are used in place of the original Message-IDs
- Date: Raw dates have been replaced with canonicalized dates
- Headers: Some other headers are missing from the emails
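To check how these header changes show up in an individual message, a minimal sketch using Python's standard email module may help. The sample message, the EXPECTED_HEADERS list, and the inspect_headers function are hypothetical and are not taken from either dataset; the sketch only reads the Message-ID, converts the Date header to UTC, and reports which of the expected headers are absent.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sample message for illustration; not taken from the dataset.
SAMPLE = """\
Message-ID: <12345.example@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: alice@example.com
To: bob@example.com
Subject: Test

Body text.
"""

# Assumed list of headers to look for; adjust to whatever your analysis needs.
EXPECTED_HEADERS = ("Message-ID", "Date", "From", "To", "Subject", "Received")


def inspect_headers(raw: str) -> dict:
    """Summarize the Message-ID, the Date (in UTC), and any missing headers."""
    msg = message_from_string(raw)

    date_header = msg.get("Date")
    date_utc = None
    if date_header is not None:
        # Parse the RFC 2822 style date and convert it to UTC.
        parsed = parsedate_to_datetime(date_header)
        date_utc = parsed.astimezone(timezone.utc).isoformat()

    return {
        "message_id": msg.get("Message-ID"),
        "date_utc": date_utc,
        # Header lookups on a Message object are case-insensitive.
        "missing": [h for h in EXPECTED_HEADERS if h not in msg],
    }


if __name__ == "__main__":
    print(inspect_headers(SAMPLE))
```

Running the same function over a CALO message and its FERC counterpart is one way to compare the Message-ID, the date, and the set of headers side by side.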
The CALO dataset also omits attachments, which makes it much more manageable in size. Mark Dredze has created a version of the CALO dataset with attachment information brought over from the FERC dataset.
K. Krasnow Waterman discusses how these changes affect the emails in "Knowledge Discovery in Corporate Email: The Compliance Bot Meets Enron" (2006).