Although much of the original Enron email came in PST files, it is most commonly available today in MIME format (through CALO / CMU) and as a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST is
To reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, rank, title, etc. Various datasets have generally identified custodians with a string consisting of the last name and first initial.
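As a rough illustration of the last name plus first initial convention, the sketch below builds such a key and groups people who collide on it. The hyphen separator, the function names `custodian_key` and `group_by_key`, and the sample names are assumptions made for this example, not the format or data of any particular dataset.

```python
import re
from collections import defaultdict


def custodian_key(first_name: str, last_name: str) -> str:
    """Build a 'lastname-f' style key (hypothetical helper).

    The hyphen separator is an assumption for illustration only.
    """
    # Lowercase and drop anything that is not a plain letter.
    last = re.sub(r"[^a-z]", "", last_name.lower())
    first = re.sub(r"[^a-z]", "", first_name.lower())
    if not last or not first:
        raise ValueError("both a first and a last name are required")
    return f"{last}-{first[0]}"


def group_by_key(names):
    """Group (first, last) pairs by key so collisions become visible."""
    groups = defaultdict(list)
    for first, last in names:
        groups[custodian_key(first, last)].append((first, last))
    return dict(groups)


if __name__ == "__main__":
    # Made-up names, not actual custodians.
    sample = [("Jane", "Doe"), ("John", "Doe"), ("Mary", "O'Neil")]
    for key, people in group_by_key(sample).items():
        print(key, people)
```

Two different people can map to the same key, which is one reason a canonical custodian list with full names is useful alongside the short identifiers.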
While this project’s initial goal is to create original, non-deduped datasets, the full dataset is often not needed. Sometimes duplicates are not desired and sometimes attachments are not desired. The challenge is to meet these requirements while maintaining a realistic dataset. One of the challenges with deduping is which duplicate do you remove and do
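To make the deduping question concrete, here is a minimal sketch of one possible approach: hash a few headers plus the whitespace-normalized body, and keep the first message seen for each hash. The choice of fields, the "keep the first copy" policy, and the names `dedup_key` and `dedupe` are all assumptions for illustration, not the method used by this project.

```python
import hashlib
from email import message_from_string


def body_text(msg) -> str:
    """Return the plain-text body, joining text/plain parts if multipart."""
    if msg.is_multipart():
        return "".join(
            part.get_payload()
            for part in msg.walk()
            if not part.is_multipart()
            and part.get_content_type() == "text/plain"
        )
    return msg.get_payload()


def dedup_key(raw: str) -> str:
    """Hash selected headers and the normalized body (assumed field set)."""
    msg = message_from_string(raw)
    fields = [msg.get(name, "") for name in ("From", "To", "Subject", "Date")]
    # Collapse whitespace so trivial formatting differences do not matter.
    fields.append(" ".join(body_text(msg).split()))
    joined = "\x1f".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def dedupe(raw_messages):
    """Keep the first message per key; later copies are dropped (assumed policy)."""
    seen = set()
    kept = []
    for raw in raw_messages:
        key = dedup_key(raw)
        if key not in seen:
            seen.add(key)
            kept.append(raw)
    return kept
```

Which copy to keep is a policy decision; keeping the first one seen is only the simplest choice, and it discards information such as the folder each copy lived in.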
The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized. This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails
The CALO dataset is perhaps the most widely used Enron email dataset and is available for download at http://www.cs.cmu.edu/~enron/. It is a derivative of the FERC dataset, has been referenced in many email research studies, and is also used by many commercial e-discovery organizations. The CMU page describes this dataset as follows: This dataset
The FERC Enron Email Data Set may be the second data set users typically find if they look for a more comprehensive data set than the CALO Enron Email Data Set. This is because googling “enron email” will bring up the CMU hosting page for the CALO email data set, which refers to the FERC