The FERC Enron Email Data Set is often the second data set users find when they look for something more comprehensive than the CALO Enron Email Data Set. This is because googling “enron email” brings up the CMU hosting page for the CALO email data set, which refers to the FERC data set.
Using the FERC data set has a few challenges, namely:
- Large size: At 100+ GB, the data set isn’t readily downloadable. An online iCONECT interface, hosted by Lockheed Martin, is available for browsing the data, including attachments.
- iCONECT format: The data comes as static images and as flat-file databases. The latter are “iCONECT24/7 / Concordance databases in delimited record format, with attachments,” not a standard email format such as MIME, PST, or NSF. The format is described in the WMCU0356_UMD_Transmittal.pdf document.
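To give a feel for what reading a delimited record format involves, here is a minimal Python sketch. It assumes the commonly used Concordance default delimiters (ASCII 20 as field separator, þ as text qualifier, ® as in-field newline); the FERC export may use different ones, so check the transmittal document first. The field names `DOCID`, `SUBJECT`, and `BODY` are hypothetical and are not taken from the actual data set.

```python
import csv
import io

# Assumed Concordance default delimiters; the FERC export may differ.
FIELD_SEP = "\x14"  # field separator (ASCII 20)
QUOTE = "\xfe"      # text qualifier (þ, code point 254)
NEWLINE = "\xae"    # in-field newline marker (®, code point 174)


def parse_concordance(text):
    """Yield one dict per record, keyed by the field names in the header row."""
    reader = csv.reader(
        io.StringIO(text, newline=""), delimiter=FIELD_SEP, quotechar=QUOTE
    )
    header = next(reader)
    for row in reader:
        # Restore real line breaks inside multi-line fields such as a body.
        yield {
            name: value.replace(NEWLINE, "\n")
            for name, value in zip(header, row)
        }


# Tiny made-up sample: one header row and one record.
sample = (
    "\xfeDOCID\xfe\x14\xfeSUBJECT\xfe\x14\xfeBODY\xfe\r\n"
    "\xfe0001\xfe\x14\xfeHello\xfe\x14\xfeLine one\xaeLine two\xfe\r\n"
)
records = list(parse_concordance(sample))
```

When reading a real file, the text encoding also has to be chosen so that the delimiter bytes decode to the characters above; which encoding the FERC files use is not stated here, so treat that as something to verify.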
The data set is available in the following formats, which are described in the Aspen Systems document:
- Enron Email database
- Enron Email (re-released) database
- Enron Email (.pst) database
- Enron Email (.pst) (re-released) database
- Scanned Documents database
- Scanned Documents (re-released) database
One of the EnronData Project’s goals is to convert the FERC email into properly formatted PST and NSF files, similar to their original state. A few software vendors have been contacted to see whether iCONECT / Concordance databases can be reconstituted into PST / NSF files with attachments, so far without success. In the absence of an established solution, the EnronData Project is working on its own conversion utilities.
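As an illustration of what one conversion step might look like, the sketch below turns a single parsed record into a MIME message using Python's standard `email` package. This is not the EnronData Project's utility, and it targets MIME rather than PST or NSF, which are proprietary container formats that would need separate tooling. The record field names (`FROM`, `TO`, `SUBJECT`, `BODY`) and the function name `record_to_message` are hypothetical.

```python
from email.message import EmailMessage


def record_to_message(record, attachments=()):
    """Build a MIME message from one parsed record.

    `record` is a dict with hypothetical keys FROM, TO, SUBJECT, BODY.
    `attachments` is an iterable of (filename, bytes) pairs.
    """
    msg = EmailMessage()
    for header, field in (("From", "FROM"), ("To", "TO"), ("Subject", "SUBJECT")):
        if record.get(field):
            msg[header] = record[field]
    msg.set_content(record.get("BODY", ""))
    for filename, payload in attachments:
        # The attachment type is unknown here, so use a generic binary type.
        msg.add_attachment(
            payload,
            maintype="application",
            subtype="octet-stream",
            filename=filename,
        )
    return msg


# Made-up example record with one attachment.
msg = record_to_message(
    {
        "FROM": "alice@example.com",
        "TO": "bob@example.com",
        "SUBJECT": "Hello",
        "BODY": "Body text",
    },
    attachments=[("report.doc", b"abc")],
)
eml_bytes = msg.as_bytes()  # could be written out as an .eml file
```

MIME output of this kind could serve as an intermediate form that existing mail tools can read, though whether it is a practical route to PST or NSF is an open question.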