
Deduplication and Attachment Stripping – Reducing the Dataset

While this project’s initial goal is to create original, non-deduplicated datasets, the full dataset is often not needed: sometimes duplicates are not desired, and sometimes attachments are not desired. The challenge is to meet these requirements while maintaining a realistic dataset.

One of the challenges with deduping is deciding which duplicates to remove and whether to leave a link behind. For example, if Alice sends a message to Bob, it will typically exist in at least three places: Alice’s Sent folder, Alice’s Inbox, and Bob’s Inbox. If you were to remove two of those, which two would you remove, and how representative would the resulting dataset be?

There seem to be three solutions, depending on which problem you are solving.

  1. Single-Instance Storage (Lossless Storage Reduction): If storage is the problem, email archiving solutions solve it through Single-Instance Storage (SIS), where multiple copies of an email are stored only once. Emails on the mail server can be replaced by stubs: emails whose body consists of a pointer to the full email. This way all records are accounted for, but the storage cost of duplicates is dramatically reduced.
  2. Attachment Elimination (Lossy Storage Reduction): Dataset size can be reduced even further by eliminating attachments entirely while recording attachment information in either the header or the body. This simulates email subjected to attachment stubbing, where the attachments are no longer available.
  3. Journal Email (Duplicate Elimination): If the goal is to eliminate duplicates entirely, maintaining user mailbox folders becomes problematic because some folders will be missing email while others will not. One way to address this is to eliminate per-user folders entirely and move all the email into one or a few global folders, organized by date instead of by user. This is similar to how email archiving via journaling works: a copy of each email sent or received is captured in a central journal, often with duplicates eliminated, making it a natural fit for this requirement. A sketch combining approaches 2 and 3 follows this list.
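
To make this concrete, here is a minimal Python sketch that walks a CMU-style maildir tree, keeps a single copy of each message (keyed on Message-ID, falling back to a body hash), strips attachments while recording their names in a header, and writes the survivors into date-organized journal folders. The paths, the X-Stripped-Attachments header name, and the dedup key are illustrative choices, not part of any dataset specification.

```python
import email
import email.message
import email.policy
import email.utils
import hashlib
import os

SRC_ROOT = "maildir"   # illustrative: a maildir/<custodian>/<folder>/ tree
DST_ROOT = "journal"   # output: one global folder per month, duplicates removed

seen = set()  # dedup keys already written


def message_key(msg):
    """Duplicate key: prefer Message-ID, fall back to a hash of the body text."""
    mid = msg.get("Message-ID")
    if mid:
        return str(mid).strip()
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return hashlib.sha1(text.encode("utf-8", "replace")).hexdigest()


def stub_attachments(msg):
    """Approach 2: drop attachment parts, recording their names in a header."""
    names = [p.get_filename() or "unnamed" for p in msg.iter_attachments()]
    if not names:
        return msg
    stub = email.message.EmailMessage(policy=email.policy.default)
    for h in ("Message-ID", "Date", "From", "To", "Cc", "Subject"):
        if msg[h]:
            stub[h] = str(msg[h])
    body = msg.get_body(preferencelist=("plain",))
    stub.set_content(body.get_content() if body is not None else "")
    stub["X-Stripped-Attachments"] = ", ".join(names)  # illustrative header
    return stub


for dirpath, _, files in os.walk(SRC_ROOT):
    for name in files:
        with open(os.path.join(dirpath, name), "rb") as fh:
            msg = email.message_from_binary_file(fh, policy=email.policy.default)
        key = message_key(msg)
        if key in seen:
            continue  # approach 3: keep only one copy of each message
        seen.add(key)
        msg = stub_attachments(msg)
        # Journal-style layout: group by year-month of the Date header.
        try:
            folder = email.utils.parsedate_to_datetime(
                str(msg["Date"])).strftime("%Y-%m")
        except (TypeError, ValueError):
            folder = "undated"
        outdir = os.path.join(DST_ROOT, folder)
        os.makedirs(outdir, exist_ok=True)
        with open(os.path.join(outdir, f"{len(seen)}.eml"), "wb") as out:
            out.write(msg.as_bytes())
```

A true SIS implementation would instead write each unique message once and leave a stub pointing at it in every other location; the `continue` above is the simpler journal-style choice of dropping the duplicate outright.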

By using the above approaches, we can reduce the number of duplicates and the storage requirements while maintaining the characteristics of real-world email datasets. I’ve added these to the proposed dataset list on the projects page for consideration; please let me know if you think these or other datasets would be of use.

 

UC Berkeley ANLP User Categorization to CALO Mapping

The UC Berkeley ANLP has performed user categorization of about 1700 emails from the CALO email data set. The information provided in the ANLP derivative data set is a subset of the CALO data set and has been reorganized.

This UCB-ANLP to CALO mapping file provides the information to associate the ANLP data with emails in the larger CALO dataset.

 

The CALO Enron Email Dataset

The CALO dataset is perhaps the most widely used Enron data set and is available for download at http://www.cs.cmu.edu/~enron/. A derivative of the FERC dataset, it has been referenced in many email research studies and is used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows:

  1. This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
  2. It contains data from 150 custodians, mostly senior management of Enron, organized into folders.
  3. The corpus contains a total of about 0.5M messages.
  4. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
  5. The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available.
  6. The dataset here
    1. does not include attachments, and
    2. some messages have been deleted “as part of a redaction effort due to requests from affected employees”.
    3. Invalid email addresses were converted to something of the form user@enron.com whenever possible (i.e., recipient is specified in some parse-able format like “Doe, John” or “Mary K. Smith”) and to no_address@enron.com when no recipient was specified.

CALO correctly identified 8 duplicate, misspelled custodians in the FERC dataset, resulting in 150 CALO custodians vs. 158 FERC custodians.

In addition to the above, the CALO dataset has a number of optimizations:

  1. Message-ID: New Message-IDs have been created and used in place of the existing Message-IDs.
  2. Date: Dates have been canonicalized, replacing the raw dates.
  3. Headers: Some other headers are missing from the email.
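
A quick way to see these changes is to open any message from the CMU distribution and inspect its headers. A minimal Python sketch; the path assumes the standard maildir/<custodian>/<folder>/ layout of the tar file and should be adjusted to your local copy.

```python
import email
from email.policy import default

# Illustrative path; the CMU tar unpacks to maildir/<custodian>/<folder>/<n.>
with open("maildir/allen-p/inbox/1.", "rb") as fh:
    msg = email.message_from_binary_file(fh, policy=default)

# The CALO preparation replaced the original Message-IDs and canonicalized
# the dates, so these values differ from what Enron's servers produced.
print("Message-ID:", msg["Message-ID"])
print("Date:", msg["Date"])
print("Headers present:", list(msg.keys()))
```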

Removing the attachments makes the dataset much more manageable in size. Mark Dredze has created a version of the CALO dataset with attachment information brought over from the FERC dataset.

K. Krasnow Waterman discusses how these changes affect the email in Knowledge Discovery in Corporate Email: The Compliance Bot Meets Enron, 2006.

 

The FERC Enron Email Dataset

The FERC Enron Email Data Set may be the second data set users find when looking for something more comprehensive than the CALO Enron Email Data Set: googling “enron email” brings up the CMU hosting page for the CALO email data set, which refers to the FERC data set.

Using the FERC data set has a few challenges, namely:

  1. Large size: At 100+ GB, the dataset isn’t readily downloadable. An online iCONECT interface, hosted by Lockheed Martin, is available for browsing with attachments.
  2. iCONECT format: The data comes as static images and in a flat-file database format. The latter consists of “iCONECT24/7 / Concordance databases in delimited record format, with attachments,” not a standard email form such as MIME, PST, or NSF. The format is described in the WMCU0356_UMD_Transmittal.pdf document.

The dataset is made available in the following formats, which are described in the Aspen Systems document.

  1. Enron Email database
  2. Enron Email (re-released) database
  3. Enron Email (.pst) database
  4. Enron Email (.pst) (re-released) database
  5. Transcripts
  6. Scanned Documents database
  7. Scanned Documents (re-released) database

One of the EnronData Project’s goals is to take the FERC email and convert it into properly formatted PST and NSF files, similar to their original states. A few software vendors have been contacted to see whether iCONECT / Concordance databases can be reconstituted into PST / NSF files with attachments, without success to date. In the absence of an established solution, the EnronData Project is working on its own conversion utilities.
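
As a starting point, the delimited records themselves are straightforward to read. The sketch below assumes the common Concordance default delimiters and a Windows code page; the actual delimiters, encoding, and column names (such as ORIGIN) should be verified against the transmittal document.

```python
import csv

# Common Concordance defaults; confirm against WMCU0356_UMD_Transmittal.pdf.
FIELD_SEP = "\x14"  # DC4 control character between fields
QUOTE = "\xfe"      # thorn (þ), the text qualifier around field values
NEWLINE = "\xae"    # registered sign (®), substituted for embedded newlines


def read_load_file(path, encoding="cp1252"):
    """Yield one dict per record from a Concordance-style delimited file."""
    with open(path, newline="", encoding=encoding) as fh:
        reader = csv.reader(fh, delimiter=FIELD_SEP, quotechar=QUOTE)
        header = next(reader)
        for row in reader:
            yield {col: val.replace(NEWLINE, "\n")
                   for col, val in zip(header, row)}


# Hypothetical usage; the file and column names are illustrative.
# for rec in read_load_file("enron_email.dat"):
#     print(rec.get("ORIGIN"), rec.get("SUBJECT"))
```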

 

The Enron Email Datasets

A lot of work has already been performed on the Enron Email Dataset. K. Krasnow Waterman identifies the following datasets in a 2006 report:

Dataset              Records      Users
FERC / Aspen         1,000,000+   158
CALO                 517,431      151
USC                  252,759      161
CMU Intermediate     619,446      158
CMU                  200,399      158
UMass                ?            149
Queens University    ?            ?

The report notes that different datasets identify different numbers of users. EDRP has identified 158 FERC custodians and 150 CALO users. The FERC list was generated by taking a case-insensitive list of the iCONECT ORIGIN column, and the CALO list was compiled from a directory listing of the CMU-hosted tar file. A quick comparison suggests that some of the users eliminated from the CALO dataset were misspellings.
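
For reference, the CALO custodian count can be reproduced directly from the tar file’s directory listing. A minimal sketch, assuming the archive unpacks to maildir/<custodian>/…; the file name here is illustrative, so use whatever your local download is called.

```python
import tarfile

# Illustrative file name; adjust to the local copy of the CMU download.
with tarfile.open("enron_mail.tar.gz", "r:gz") as tar:
    custodians = {m.name.split("/")[1]
                  for m in tar.getmembers()
                  if m.name.count("/") >= 1}

print(len(custodians), "custodians")  # expect 150 for the current CALO release
```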

In follow-up posts, individual datasets will be discussed regarding their applicability for different purposes.

 

EnronData.org Introduction

Published on January 3, 2009 in EDO, Enron Data

Welcome to EnronData.org (EDO), the Enron Data Reconstruction Project. The collapse of Enron and the subsequent public release of Enron data by FERC have resulted in one of the largest and richest publicly available data sets for email research. This data has been widely and successfully used to support many academic research projects and commercial organizations that require email data; however, much more can be done.

The goals of EnronData.org are to provide alternative derivative data sets and to explain some of the more esoteric aspects of the existing datasets. This project was inspired by (a) examining the data itself, (b) listening to requirements from the community, and (c) observing questions people had about existing data sets. If you’ve ever wondered why the Enron email is the way it is, EDRP may be able to explain it for you.

Projects actively being considered by EDO include:

  • Native PST and NSF Files: reconstituting PST and NSF email in the most original state possible, including attachments
  • Modified Datasets: creating modified datasets for research purposes, e.g. MIME / Maildir with restored headers and attachments if a need is identified
  • Directory Load Files: creating files for LDAP servers, Active Directory, and Domino Directory
  • Metadata Organization: creating EDRM files to associate metadata with the email files

EDO is actively seeking individuals and organizations that wish to contribute to this effort. If you or your organization would like to assist, please contact John Wang at johncwang@gmail.com.

 

Open Archive Formats

An increasingly important aspect of email and file management is the issue of open vs. closed file formats. Open formats are gaining popularity and allow organizations to retain control of their own data without the costs often associated with vendor lock-in. The acceptance of high switching costs, and sometimes high operational costs, is giving way to demand for open standard formats. The file format debate has shifted from open vs. closed to which open standard to choose, as seen in the discussions of ODF vs. OOXML and, to a lesser extent, PDF vs. XPS.

Parallel to the debate over living file formats, the issue of archival formats is just as important and growing in criticality. Its importance is highlighted by the recent changes to the FRCP extending coverage to electronically stored information (ESI), including email and files. Beyond these legal requirements, organizations face ever-increasing amounts of ESI to manage, often reaching into the billions of records and requiring terabytes of storage. As the amount of data increases, so do the negative consequences of storing ESI in a proprietary, closed format. It is time for vendors and organizations to move to open standard archival formats.

An example

To highlight the importance of this issue, it is instructive to look at an example, and Symantec Enterprise Vault (EV) is a useful one. Enterprise Vault is a popular content archiving solution; like many such products, it began life as an email management solution and has evolved into a more general-purpose records management solution, adding file system archiving, categorization, retention management, ILM, and preservation hold capabilities. These capabilities are important, but let’s look at how it stores files. EV archives a variety of content, including native email formats (MS Exchange and Lotus Domino) and file formats such as MS Office, PDF, and ODF. The native formats of these files come in both open and closed forms. When EV archives these files, it creates a proprietary EV Digital Vault Saveset (DVS) file to encapsulate the native file, an HTML copy, and some metadata. The problem is that once a file is in DVS format, the only way to read it is with Symantec tools, and neither the specification nor the tools are freely available. An organization’s open format files (ODF, OOXML, PDF, etc.) suddenly cannot be opened or indexed by other solutions.

What can vendors do?

What can be done about this situation? Vendors can begin the process of moving to open standard formats. To see how this could work, let’s use the DVS file as an example again. The DVS file is a proprietary container that includes the native format file, an HTML conversion copy, and some metadata. HTML is an open standard, and the metadata can be written as XML. The DVS container could then be changed to an open format such as ZIP. This would turn the file into an open standard in much the same way that Microsoft’s OOXML and XPS are designed: simply change the extension of an OOXML file from, say, .docx to .zip and open it in your favorite unzip utility. The technology is available for vendors to move to open standards, and for organizations to add this to their requirements.
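
To make the idea concrete, here is a minimal sketch of such an open container: the same three pieces a DVS file holds, written into a plain ZIP that any tool can open. The internal layout and metadata fields are illustrative, not an existing standard.

```python
import os
import zipfile


def write_open_archive(container_path, native_path, html_path, metadata_xml):
    """Bundle a native file, its HTML rendition, and XML metadata into a ZIP."""
    with zipfile.ZipFile(container_path, "w", zipfile.ZIP_DEFLATED) as z:
        z.write(native_path, "native/" + os.path.basename(native_path))
        z.write(html_path, "rendition/index.html")
        z.writestr("metadata.xml", metadata_xml)


metadata = """<?xml version="1.0"?>
<record>
  <custodian>jsmith</custodian>
  <archived>2009-01-03T12:00:00Z</archived>
</record>
"""
# Hypothetical usage, assuming the input files exist:
# write_open_archive("message-0001.zip", "message.eml", "message.html", metadata)
```

Any unzip utility, indexer, or future migration tool could then read the archive without the original vendor’s software.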

What can you do?

Here are some items organizations should consider when managing their data:

  1. If you are choosing a solution, ensure that it stores files in open standard formats. When running a proof of concept (POC), ask your engineers whether they can copy the files to a separate system and read them without proprietary vendor tools.
  2. If your current vendor does not support open standard formats, ask when support will appear on the roadmap.
  3. If the vendor does not currently support open standard formats and you are uncomfortable with that, consider migrating to a solution that does. The amount of ESI inside organizations grows every day, and the sooner a migration is made, the less time and cost will be incurred.

If a decision is made to migrate away from closed format solutions, some options are available, though at a cost. TransVault and Procedo are two providers that can assist. TransVault makes a product of the same name that can migrate data out of Symantec Enterprise Vault, Autonomy/ZANTAZ EAS, and OpenText IXOS; it also has a growing Partner Network that can provide services, including Instant InfoSystems in the US. Procedo offers a similar solution, the Procedo Archive Migration Manager.

Conclusion

As organizations move their files and email into content archiving and management solutions, it is becoming increasingly important to maintain the data’s readability. The need for open standard document formats has been established, and it is time for the same philosophy to extend to the archiving solutions designed to preserve them. Organizations can maintain control of their content by deploying an open standards solution, encouraging vendors to support open standards, or migrating to solutions that support open standards. In the long run, the open standard approach is the most logical choice.

Photo courtesy of Elliott Brown.

 

NSF to PST Conversion Issues

While there are many commercial tools that will convert between various email formats, for NSF to PST conversion, John Randall of Randall Consulting provides the following warnings:

Subject: Lotus Notes to .pst
From: John Randall
Date: Mon Jul 28, 2008 10:58 pm
URL: http://www.litigation-support.org/viewtopic.php?t=16502

You should be very careful of any migration tool that converts .NSF to .PST. Do not just assume that the tool will convert all e-mails and attachments because the program says so.

With the majority of migration tools converting .NSF to .PST, these are just a few of the problems:

  1. Possibility of bogus duplicates: Lotus Notes databases actually contain different views of the same message, so when converting to Outlook it is quite possible there will be duplicates, and lots of them.
  2. Converting to .PST usually increases the size of the e-mail store, so the client will be charged extra.
  3. The “All Documents” folder does not always contain all documents, yet some migration tools use only the All Documents folder to try to get around the bogus duplicate problem.
  4. If that is not reason enough, how about this one: you will most likely not get all e-mails and attachments. I could go on and on.
  5. We are dealing with electronic discovery, where every e-mail, every attachment, and every embedded document needs to be accounted for.

I would finally ask this question: Why do you need to convert it to a .PST? Can you process it natively?

If using a migration tool to convert .NSF files to .PST files, be careful! Be very careful!

Thanks
John Randall
President
Senior Consultant/Trainer
Randall Consulting
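
Randall’s first warning, bogus duplicates, is easy to test for after a trial conversion. A minimal sketch, assuming the converted mail can be exported to mbox for inspection; it flags Message-IDs that appear more than once.

```python
import mailbox
from collections import Counter

# Illustrative file name; assumes the converted mail was exported to mbox.
counts = Counter()
for msg in mailbox.mbox("converted.mbox"):
    counts[msg["Message-ID"] or "<missing>"] += 1

dupes = {mid: n for mid, n in counts.items() if n > 1}
print(len(dupes), "Message-IDs appear more than once")
print(counts["<missing>"], "messages had no Message-ID at all")
```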

 