While this project’s initial goal is to create original, non-deduped datasets, the full dataset is often not needed. Sometimes duplicates are not desired, and sometimes attachments are not desired. The challenge is to meet these requirements while maintaining a realistic dataset. One of the challenges with deduping is which duplicate do you remove and do
The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized. This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails
The CALO dataset is perhaps the most widely used data set and is available for download at http://www.cs.cmu.edu/~enron/. This dataset is a derivative of the FERC dataset and has been referenced in many email research studies and is also used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows: This dataset
The FERC Enron Email Data Set may be the second data set users typically find if they look for a more comprehensive data set than the CALO Enron Email Data Set. This is because googling “enron email” will bring up the CMU hosting page for the CALO email data set which refers to the FERC
A lot of work has already been performed on the Enron Email Dataset. K. Krasnow Waterman identifies the following datasets in his 2006 report:

Dataset             Records      Users
FERC / Aspen        1,000,000+   158
CALO                517,431      151
USC                 252,759      161
CMU Intermediate    619,446      158
CMU                 200,399      158
UMass               ?            149
Queens University   ?            ?

He makes note
Welcome to EnronData.org (EDO), the Enron Data Reconstruction Project. The collapse of Enron and subsequent public release of Enron data by the FERC has resulted in one of the largest and richest publicly available data sets for email research. This data has been widely and successfully used to support many academic research projects and
An increasingly important aspect of email and file management is the issue of open vs. closed file formats. Open formats are gaining popularity and allow organizations to retain control of their own data without the costs often associated with vendor lock-in. The acceptability of high switching costs and sometimes operational costs is giving way to the
While there are many commercial tools that will convert between various email formats, for NSF to PST conversion, John Randall of Randall Consulting provides the following warnings:

Subject: Lotus Notes to .pst
From: John Randall
Date: Mon Jul 28, 2008 10:58 pm
URL: http://www.litigation-support.org/viewtopic.php?t=16502

You should be very careful of any migration tool that converts .NSF to .PST.