The Mailbox PST Dataset

Although much of the original Enron email came in PST files, today it is most commonly available in MIME format (through CALO / CMU) and as a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST is

Custodian Names and Titles

To reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for whom email exists. From this canonical list, we can build out user information including actual names, rank, and title. Various datasets have identified custodians by a string generally consisting of last name and first initial.
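
As a rough illustration of the last-name-plus-first-initial convention, here is a minimal Python sketch. The hyphen separator, the function names, and the sample names are assumptions made for this example, not details taken from any particular dataset. It also shows why a canonical custodian list matters: two different people can collapse to the same identifier.

```python
import re


def custodian_id(first_name: str, last_name: str, sep: str = "-") -> str:
    """Build an identifier from last name and first initial.

    The separator is an assumption; datasets may format this differently.
    """
    # Keep only ASCII letters so punctuation and spaces do not leak into the ID.
    last = re.sub(r"[^a-z]", "", last_name.lower())
    first = re.sub(r"[^a-z]", "", first_name.lower())
    if not last or not first:
        raise ValueError("both names must contain at least one letter")
    return f"{last}{sep}{first[0]}"


def group_by_custodian(names):
    """Map each identifier to the (first, last) pairs that produce it."""
    groups = {}
    for first, last in names:
        groups.setdefault(custodian_id(first, last), []).append((first, last))
    return groups


# Hypothetical names, used only to show a collision.
people = [("Jane", "Doe"), ("John", "Doe"), ("Ann", "Smith")]
for ident, owners in group_by_custodian(people).items():
    print(ident, owners)
```

Any identifier that maps to more than one person in the output would need to be resolved by hand against the canonical list.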

Deduplication and Attachment Stripping – Reducing the Dataset

While this project’s initial goal is to create original, non-deduped datasets, the full dataset is often not needed. Sometimes duplicates are not desired and sometimes attachments are not desired. The challenge is to meet these requirements while maintaining a realistic dataset. One of the challenges with deduping is which duplicate do you remove and do

UC Berkeley ANLP User Categorization to CALO Mapping

The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized. This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails

The CALO Enron Email Dataset

The CALO dataset is perhaps the most widely used data set and is available for download at http://www.cs.cmu.edu/~enron/. This dataset is a derivative of the FERC dataset and has been referenced in many email research studies and is also used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows: This dataset

The FERC Enron Email Dataset

The FERC Enron Email Data Set may be the second data set users typically find if they look for a more comprehensive data set than the CALO Enron Email Data Set. This is because googling “enron email” will bring up the CMU hosting page for the CALO email data set which refers to the FERC

The Enron Email Datasets

A lot of work has already been performed on the Enron Email Dataset. K. Krasnow Waterman identifies the following datasets in his 2006 report:

Dataset             Records      Users
FERC / Aspen        1,000,000+   158
CALO                517,431      151
USC                 252,759      161
CMU Intermediate    619,446      158
CMU                 200,399      158
UMass               ?            149
Queens University   ?            ?

He makes note

EnronData.org Introduction

Published on January 3, 2009 in EDO, Enron Data

Welcome to EnronData.org (EDO), the Enron Data Reconstruction Project. The collapse of Enron and the subsequent public release of Enron data by the FERC have resulted in one of the largest and richest publicly available data sets for email research. This data has been widely and successfully used to support many academic research projects and