A collection of 1073 email messages and 72 documents from the Climatic Research Unit (CRU) at Britain’s University of East Anglia (UEA), related to climate change research, was leaked onto the Internet last week. This collection is currently being widely discussed online and gives a peek into how climate change research has been managed,
The EDRM Enron PST files are now available on the EDRM Data Set website thanks to George Socha, EDRM, and ZL Technologies. I am co-lead of the EDRM Data Set project and personally worked on this data set at ZL Technologies, so I thought I would provide a brief introduction to this data before our
On Tuesday, October 13, I gave a presentation at Oracle OpenWorld on E-Mail Archiving with “Extreme Performance” and “Green Computing” using a ZL+Oracle solution. The presentation discusses proven performance 100x greater than other solutions, achieved by using technologies such as private cloud computing and grid computing. The Extreme Performance theme of the show is especially fitting
I’m happy to report that ZL Unified Archive has been named Trend Setting Product of 2009 by KMWorld. As the disciplines of eDiscovery, Records Management, and Email Management continue to merge, it is becoming more important than ever to proactively and effectively manage information in a scalable manner. Regarding this year’s selected products, KMWorld wrote:
Email servers were never designed to archive email messages for long periods of time, apply organizational retention and disposition policies, or perform fast search across an entire email environment. However, the email landscape has changed considerably and organizations that must contend with these requirements have increasingly turned to archiving solutions to fill this need. With
I’m pleased to announce that an initial version of the EDRM Enron Email Data Set consisting of 40GB of PST files with attachments and folder structure is now available within the EDRM project as of the EDRM 2009-2010 Kick-Off Meeting. The EDRM Data Set Project is now working to make this data set publicly available.
A number of people have contacted me about getting the current PST corpus by alternative means. This is partly due to the bandwidth restrictions that have been in place for the HTTP download. I planned to add some other download methods but haven’t had time yet. Until then, if you will be at
Use of Search Engine Term Black Lists (Stop Words or Noise Words) Can Be Detrimental for Findability
Stop words, or noise words, are black lists of words that search engines choose not to index. These lists are used by some search engines that consider the words of little value; however, the words should still be indexed in eDiscovery, where it is more important to find all responsive documents than to provide just a
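To illustrate how a stop word list can hurt findability, here is a minimal Python sketch of a toy inverted index built with and without a black list. The stop word list, the document IDs, the sample text, and the function names are all hypothetical, and this is not how any particular search engine is implemented.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; real engines ship their own lists.
STOP_WORDS = frozenset({"to", "be", "or", "not", "the", "a", "of", "for"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, stop_words=frozenset()):
    """Map each indexed term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if term not in stop_words:
                index[term].add(doc_id)
    return index


def search_all_terms(index, query, stop_words=frozenset()):
    """Return documents containing every searchable term in the query."""
    terms = [t for t in tokenize(query) if t not in stop_words]
    if not terms:
        # Every query term was black-listed, so nothing can be matched.
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
docs = {
    "memo1": "To be or not to be",
    "memo2": "Quarterly results for the board",
}

full_index = build_index(docs)
filtered_index = build_index(docs, STOP_WORDS)

hits_full = search_all_terms(full_index, "to be or not to be")
hits_filtered = search_all_terms(filtered_index, "to be or not to be", STOP_WORDS)
```

In this sketch the full index finds `memo1`, while the index built with the black list finds nothing, because every term in the query was discarded before indexing.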
Although much of the original Enron email came in PST files, the most common forms in which to get this email today are MIME format (through CALO / CMU) and a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST is
In order to reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, rank, title, etc. Various datasets have used a string consisting generally of last name and first initial to identify custodians.
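As a rough illustration of the last name plus first initial scheme, the Python sketch below builds such an identifier and groups people by it. The separator, the lowercasing, the function names, and the sample names are assumptions made for this example, not details taken from any specific dataset.

```python
from collections import defaultdict


def custodian_key(first_name, last_name, sep="-"):
    """Build an identifier from last name and first initial (assumed format)."""
    first = first_name.strip()
    last = last_name.strip()
    if not first or not last:
        raise ValueError("both first and last name are required")
    return f"{last.lower()}{sep}{first[0].lower()}"


def group_by_custodian(people):
    """Group (first, last) name pairs by their custodian identifier."""
    groups = defaultdict(list)
    for first, last in people:
        groups[custodian_key(first, last)].append((first, last))
    return groups


# Hypothetical names, chosen only to show a collision.
people = [("Jane", "Smith"), ("John", "Smith"), ("Alex", "Doe")]
groups = group_by_custodian(people)

# Identifiers shared by more than one person need manual review.
ambiguous = {key: names for key, names in groups.items() if len(names) > 1}
```

Here two different people map to the same identifier, which is one reason a canonical custodian list with full names is useful alongside the short strings.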