1073 “Global Warming” Emails Leaked

Published on November 23, 2009 by in email

A collection of 1073 email messages and 72 documents related to climate change research from the Climate Research Unit (CRU) at Britain’s University of East Anglia (UEA) was leaked onto the Internet last week. The collection is being widely discussed online and gives a peek into how climate change research has been managed, including the process of peer review. Phil Jones, Director of the CRU and Professor at UEA, told Investigate Magazine’s TGIF Edition that the emails appeared to be genuine and that they may have been retrieved during a recent hacking incident, saying “It was a hacker. We were aware of this about three or four days ago that someone had hacked into our system and taken and copied loads of data files and emails.” Some are now referring to this incident as the “CRUHack,” which is searchable on Twitter.

The collection was first posted to a Russian FTP site before finding its way on to BitTorrent and being published as a web searchable archive.

I won’t get into the specific subject matter as it is already being covered on many sites (some links provided below); however, I will provide an overview of the email in the collection.

  • Email Files: the email is available in 1073 text files each containing one MIME message. Some messages have the Eudora x-flowed tag indicating that Eudora may be the email client used.
  • Email Headers: only common headers such as From/To/Cc/Subject/Date are available.
  • Date Header: the email date header shows a variety of dates including different formats, time zones, and UTC offsets indicating the date field is original and has not been canonicalized.
  • From/To/Cc Headers: These headers all appear to contain SMTP email addresses, with some headers also containing display names. The original collection has full email addresses, while many downstream reposters have anonymized the addresses by removing the server name portion of each email address.
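
The checks described in the list above can be sketched in a few lines of Python using the standard library's email package. This is only an illustration: the sample message, its addresses, and the helper names `anonymize` and `summarize` are invented here, and the sketch assumes the x-flowed tag appears literally in the message body.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message; the names, addresses, and text are invented.
RAW = """\
From: "Alice Example" <alice@example.ac.uk>
To: bob@example.org, "Carol" <carol@example.com>
Subject: sample
Date: Mon, 16 Nov 2009 10:15:00 -0500

<x-flowed>
Body text.
</x-flowed>
"""


def anonymize(address: str) -> str:
    """Drop the server name portion, keeping only the part before the '@'."""
    return address.split("@", 1)[0]


def summarize(raw: str) -> dict:
    """Pull out the common headers and a few properties of one message."""
    msg = message_from_string(raw)
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "from": anonymize(getaddresses([msg["From"]])[0][1]),
        "to": [anonymize(addr) for _name, addr in recipients],
        # The original UTC offset survives only if the date was not canonicalized.
        "utc_offset_hours": sent.utcoffset().total_seconds() / 3600,
        # Assumption: the Eudora tag shows up as literal text in the body.
        "x_flowed": "<x-flowed>" in msg.get_payload(),
    }


summary = summarize(RAW)
print(summary)
```

Running `summarize` over all 1073 files and counting the distinct UTC offsets would be one way to confirm that the Date headers have not been normalized.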

The CRU appears to be considering its options in light of this hack. Some have suggested that the email has already been put into the public domain, but the CRU has not made a statement on this yet. As of now, it does not appear to be taking, or threatening, legal action against the numerous blogs and users that have reposted the email.

For discussion of the contents, including alleged efforts to manipulate climate change data, see the following sites:

Photo courtesy of Victius.


EDRM Enron PST files are now available

The EDRM Enron PST files are now available on the EDRM Data Set website thanks to George Socha, EDRM, and ZL Technologies. I am co-lead of the EDRM Data Set project and personally worked on this data set at ZL Technologies so I thought I would provide a brief introduction to this data before our formal description comes out. In the interests of full disclosure, I created the PST files available at as a precursor to the EDRM PST files which are now available. If you have any questions regarding the data set you would like answered, either in the paper or informally, please post to the EDRM Data Set webpage, here, or the litsupport mailing list thread. Alternately, you can send email to or myself directly at

As with other publicly available Enron email, this data set originates from a FERC distribution. The FERC distribution contains email from Microsoft Exchange and Lotus Domino email environments that have been processed for eDiscovery through IPRO. A challenge with this data is that it is available as a load file and not as email. The EDRM Data Set project’s research into conversion utilities indicated that many eDiscovery tools are available to convert from email format to load file format but not the other way around. Based on this, ZL created conversion tools to migrate IPRO’s load file format back to email format from which the PST files were created.
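
To give a rough idea of what converting a load file back to email involves, here is a minimal Python sketch. It is not the ZL conversion tool and does not use the real IPRO format: the pipe-delimited layout, the column names other than ORIGIN, the sample record, and the X- header names are all assumptions made for illustration.

```python
import csv
import io
from email.message import EmailMessage

# Hypothetical pipe-delimited load file; the real IPRO/FERC layout differs.
LOAD_FILE = """\
DOCID|ORIGIN|FROM|TO|SUBJECT|DATE|BODY
0001|doe-j|jane.doe@example.com|john.roe@example.com|Status|Mon, 14 May 2001 09:30:00 -0500|See attached schedule.
"""


def record_to_message(record: dict) -> EmailMessage:
    """Rebuild a MIME message from one load file record."""
    msg = EmailMessage()
    msg["From"] = record["FROM"]
    msg["To"] = record["TO"]
    msg["Subject"] = record["SUBJECT"]
    msg["Date"] = record["DATE"]
    # Hypothetical custom headers that carry the load file metadata along.
    msg["X-Load-File-DocID"] = record["DOCID"]
    msg["X-Load-File-Origin"] = record["ORIGIN"]
    msg.set_content(record["BODY"])
    return msg


reader = csv.DictReader(io.StringIO(LOAD_FILE), delimiter="|")
messages = [record_to_message(row) for row in reader]
print(messages[0].as_string())
```

A real conversion also has to reattach attachments and restore folder structure, which this sketch leaves out.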

Since the email was processed for eDiscovery, there are varying levels of restoration that can be performed beyond simply converting the load file format to email format. Some of these have been implemented in this data set. Some additional steps such as recreating Notes email have been scheduled for future work. There will be a discussion of this in the description paper.

As mentioned above, please send us your questions on this data set so we can answer in our formal description as well as informally beforehand.


Extreme Performance Archiving Presentation at Oracle OpenWorld (OOW)

On Tuesday October 13, I gave a presentation at Oracle OpenWorld on E-Mail Archiving with “Extreme Performance” and “Green Computing” using a ZL+Oracle solution. The presentation discusses proven performance 100x greater than other solutions by using technologies such as private cloud computing and grid computing. The Extreme Performance theme of the show is especially fitting for E-Mail Archiving as organizations look for ways to solve multiple performance and scalability challenges. While the numbers presented are already orders of magnitude greater than many existing solutions, it will be interesting to see what additional benefits Oracle Exadata 2 can provide.

We had a great discussion, covering a range of topics on eDiscovery and integration with various Oracle products including RAC, Data Guard, UOA, EAS, URM, UCM, SES, etc. That looks like quite the acronym soup but if you’re interested in any of these integrations, just ask.

OOW 2009 was a blast and I hope everyone enjoyed it as much as I did.

Drive Information Governance and Significant Cost Savings with Email Archiving


ZL Unified Archive named Trend Setting Product of 2009

Published on August 29, 2009 by in award, ZL

I’m happy to report that ZL Unified Archive has been named Trend Setting Product of 2009 by KMWorld.

As the disciplines of eDiscovery, Records Management, and Email Management continue to merge, it is becoming more important than ever to proactively and effectively manage information in a scalable manner. Regarding this year’s selected products, KMWorld wrote:

  • They represent what we believe are the solutions that best exemplify the spirit of innovation demanded by the current economy, while providing their customers with the unique tools and capabilities to move and grow beyond the recession.
  • They do represent the ones best suited to meet the needs of KMWorld readers.
  • They all have been designed with a clear understanding of customers’ needs.

At ZL Technologies, we’ve been expanding the high performance capabilities of the Unified Archive cloud computing information management platform with advanced analytics, concept search and governance capabilities. Our customers are a great testament to our success and we are honored to have KMWorld recognize our accomplishments. If you are looking for a scalable email and information management solution, please contact ZL for an introduction to our solutions.


Exchange 2010 Archiving Considerations

Published on August 3, 2009 by in email

Email servers were never designed to archive email messages for long periods of time, apply organizational retention and disposition policies, or perform fast search across an entire email environment. However, the email landscape has changed considerably and organizations that must contend with these requirements have increasingly turned to archiving solutions to fill this need. With Exchange 2010 (E14), Microsoft will be introducing first generation email archiving and there have been many questions on what this will mean for third-party archives, many of which are provided by Microsoft Gold Certified Partners.

As with many software solutions, it usually takes a few versions to work out the kinks and add the basic feature requirements, and Exchange 2010 is no different. Indeed, Microsoft employees discussing Exchange 2010 features have suggested some requirements may still be better addressed by third-party archives and even the continued use of PST files. The following are some key considerations when looking at Exchange 2010 archiving and other third-party solutions:


  1. Limited Enterprise-Wide Search
    • Description: In Exchange 2010, eDiscovery searches are limited by Exchange organization and multi-Org searches cannot be performed. Users who require offline access also will not be covered, as Exchange 2010 archiving will not support offline access (see more below), and PST files have been suggested as a continued solution for these users. Finally, you will not be able to search across other repositories including Windows file shares, SharePoint, and other non-Microsoft repositories.
    • Impact: Exchange 2010 is providing more eDiscovery search capabilities; however, the capabilities still appear to fall short of the ultimate requirements and may require the Exchange data be exported to another eDiscovery solution for more comprehensive search and litigation hold. As eDiscovery needs to cover all ESI within the organization, third-party archives are still ahead in performing full enterprise-wide search of unstructured content by query terms, custodian, and more advanced features such as faceted search and clustering.
  2. No Legal Holds for Public Folders
    • Description: Exchange 2010 supports legal holds for user mailboxes but not for public folders.
    • Impact: All responsive ESI must be preserved when litigation is anticipated. A data map that shows ESI stored in Exchange public folders naturally leads to the question of how that information is collected, preserved, reviewed, and produced. Because Exchange 2010 will not handle public folders, organizations using this feature may wish to consider or stay with a third-party solution.

Costs and Manageability

  1. Increased Primary Exchange Mailbox Database Sizes
    • Description: One of the primary goals of many Exchange administrators for years has been to reduce the sizes of active Exchange stores, primarily by limiting mailbox sizes and having users store archived email in PST files. While moving email off of Exchange to PST files was considered a best practice at one time, this is no longer the case as organizations seek to better manage their email. Exchange 2010 will reverse this process by moving all of a user’s email back to the Exchange server, on to the user’s primary mailbox database.
    • Impact: By moving additional email messages on to the Exchange primary mailbox databases, organizations will have to contend with increased storage costs as well as longer backup and restore times. Organizations that wish to keep their older emails off of Exchange for infrastructure management will want to continue to look to third-party archives.
  2. Increased Exchange Storage Requirements (Elimination of SIS)
    • Description: Single-Instance Storage (SIS), a leading de-duplication technique that has existed in Exchange since 4.0, has been removed. A key reason for this is that Exchange’s design of increasing the number of stores and databases reduces, if not entirely eliminates, the storage benefits afforded by SIS. This occurs because SIS only de-duplicates within an individual Exchange database, so duplicate messages spread across many databases are each stored again. SIS has been replaced by in-store compression which, according to some Microsoft MVPs, will only cover easily compressible email parts such as headers and message bodies. Email attachments, which are often already compressed (e.g. Microsoft Office 2007 files), will see little benefit and are reportedly not covered.
    • Impact: Replacing SIS with a solution that covers only email headers and bodies will not be effective in controlling storage. According to Radicati Group, attachments account for 85% of all email data. As more and more attachments arrive in a pre-compressed state (Office 2007, PDF, ZIP, JPEG, etc. files), it appears unlikely that in-store compression can match the storage savings of a global SIS solution. Some SIS solutions from email archive vendors can single-instance all of an organization’s email, without the per-database limitations imposed by Exchange.
  3. Requirement to Upgrade to Outlook 2010
    • Description: Outlook 2010 will be required to access Exchange 2010 Archives.
    • Impact: Organizations will need to upgrade to Outlook 2010 to manage email using Exchange 2010 archiving; however, this will not support offline access (see below). Many third-party archives will continue to support multiple versions of Outlook in a managed email environment.
  4. No Offline Access to the Archive
    • Description: Road warriors often need access to email offline or in an otherwise disconnected mode. PST files provided a way to achieve this because the email could always be located with the user, whether on a plane, on a train, or in an automobile. With Exchange 2010, there will be no offline access and Outlook users will need to have live access to Exchange 2010’s archive mailboxes. At this time, there are no plans to add this capability.
    • Impact: Some high value users may not find it acceptable to require a live connection to Exchange to access their email. An offline capability will need to exist eventually before these users will be willing to move their email into an Exchange 2010 archive.
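
The storage effect of per-database versus global single-instance storage discussed under item 2 can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not how Exchange or any archive product implements SIS: items are de-duplicated by content hash, and the function name `stored_bytes` and the sample data are invented for illustration.

```python
import hashlib


def stored_bytes(databases: list[list[bytes]], scope: str) -> int:
    """Bytes stored when identical items are kept once per database
    (scope="per-database") or once for the whole organization (scope="global")."""
    if scope == "global":
        # Treat every item in every database as one de-duplication pool.
        pools = [[item for db in databases for item in db]]
    else:
        pools = databases
    total = 0
    for pool in pools:
        # Identical content produces the same digest and is stored only once.
        unique = {hashlib.sha256(item).hexdigest(): len(item) for item in pool}
        total += sum(unique.values())
    return total


# One 1000-byte attachment sent to four mailboxes spread over three databases.
attachment = b"x" * 1000
databases = [[attachment, attachment], [attachment], [attachment]]

no_sis = sum(len(item) for db in databases for item in db)
per_database = stored_bytes(databases, "per-database")
global_sis = stored_bytes(databases, "global")
print(no_sis, per_database, global_sis)  # 4000 3000 1000
```

The more databases the same attachment is spread across, the closer the per-database figure gets to having no SIS at all.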


Email management has become a pressing need for organizations that need to manage that data for retention, disposition, and E-Discovery. Exchange 2010 is a step in the right direction, but as with many first generation products, it has large functionality gaps before it can replace the archive solutions that are in place and fulfilling requirements today. For now, analyze your requirements and decide if Exchange 2010 will meet your requirements or if it still makes more sense to use a purpose-designed archiving system.


EDRM Enron Data Set with Attachments

Published on May 15, 2009 by in Uncategorized

I’m pleased to announce that an initial version of the EDRM Enron Email Data Set consisting of 40GB of PST files with attachments and folder structure is now available within the EDRM project as of the EDRM 2009-2010 Kick-Off Meeting. The EDRM Data Set Project is now working to make this data set publicly available.

This initial data set was created by myself and a team at ZL Technologies; however, more work remains and I think the EDRM Data Set project is an ideal group to head up the effort to publish some industry standard data sets.

Some of the issues that the EDRM Data Set Project will be looking at include addressing privacy concerns, the publishing of smaller data set slices, and distribution methods for large data sets. If you would like to participate in this process, please join EDRM.

John Wang
EDRM Data Set Project Lead


Enron Data at 2009-2010 EDRM Kick-Off Meeting

A number of people have contacted me about getting the current PST corpus via an alternative manner. This is partially due to the bandwidth restrictions that have been in place for the HTTP download. I planned to put in some other download methods but haven’t had time yet. Until then, if you will be at the EDRM Kick-Off meeting and you would still like a copy, bring a 1+ GB USB key and find me at the meeting. If you are interested, please let me know beforehand so I can plan ahead.


Use of Search Engine Term Black Lists (Stop Words or Noise Words) Can be Detrimental for Findability

Stop words, or noise words, are black lists of words that search engines choose not to index. Some search engines use them because they consider the words to be of little value; however, these words should still be indexed in eDiscovery, where it is more important to find all responsive documents than to provide just a selection of results, as is done where false negatives may not pose a large risk (e.g. web search engines).

Disadvantages of Stop Words or Noise Words for eDiscovery

  1. Information Removal (Lower Recall, False Negatives, and Increased Risk): Stop words are often words of little value and interest for search, which is one reason for not indexing them; however, sometimes they can be exactly the words you are looking for. A common example is the phrase “to be or not to be.” By itself, each of these words often exists in a stop word list, but combined they have obvious value. Stop words can also cause problems with terms like C++, which would often not be indexed at all due to the elimination of the “+” symbol and the single letter “c,” rendering this important technology term unfindable.
  2. Increased Noise (Lower Precision, False Positives and Increased Costs): When individual letters are not indexed, a search query like “vitamin a” would be reduced to “vitamin” resulting in many more documents than responsive documents, leading to more review and additional expense. Another area where this is often problematic is with stock symbols.
  3. The Need to Identify the Record’s Language: Stop words differ per language, so the language must be identified before stop words can be removed. If a document’s language is identified incorrectly or if a document has multiple languages, meaningful words may be eliminated, leading to additional problems with false negatives and false positives. When black lists are used, testing must be performed to ensure the correct language is identified and the correct black list is applied.
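
The first two problems above can be reproduced with a toy indexer in Python. This is a sketch under assumptions, not the behavior of any particular search engine: the stop word list is a small invented sample, and the black-list branch also drops punctuation and single letters, which is the combination described for C++ and “vitamin a.”

```python
import re

# Small illustrative black list; real stop word lists are much longer.
STOP_WORDS = {"a", "be", "not", "or", "to"}


def index_terms(text: str, use_black_list: bool) -> list[str]:
    """Return the terms an index would keep for the given text."""
    if not use_black_list:
        # Complete term indexing: keep every whitespace-separated term as-is.
        return text.lower().split()
    # Black-list indexing: strip symbols, then drop single letters and stop words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) > 1 and w not in STOP_WORDS]


for sample in ("to be or not to be", "vitamin a", "C++"):
    print(repr(sample), index_terms(sample, True), index_terms(sample, False))
```

With the black list applied, “to be or not to be” and “C++” produce no terms at all, and “vitamin a” collapses to “vitamin”; without it, every term remains searchable.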


  1. Complete Term Indexing: For eDiscovery, indexing all words ensures that every word can be found and leads to increased findability, no matter which terms are searched.
  2. Partial Term Indexing with Black Lists: When black lists are used, the black listed words cannot be searched on, and if they become important in the course of eDiscovery, the ESI may need to be re-indexed without those words on the black list. If black lists are used in either party’s search engine, it is important to understand which words have been eliminated from the search index and how that will affect the search results. Ask for the list of stop or noise words to evaluate the accessibility of documents with the search queries of interest.

The Mailbox PST Dataset

Although much of the original Enron email came in PST files, the most common ways to get this email today are in MIME format (through CALO / CMU) and as a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST is similar to journal email in that the per-user delineation and folder structure of the user email stores have been removed.

To preserve the user information associated with the email, is now offering the CALO Enron Email Dataset in the form of 148 PST files with folder structure, preserving the information in the CALO dataset. Email for each of the 148 identified custodians is available as an individual per-custodian PST file. A few minor changes were made to correct names and merge duplicate users where both correct and incorrect names existed. Custom X-headers have also been added, including unique IDs to facilitate testing.

The files are currently available for download from the homepage as a 734MB 7z archive. 7z is an archival format similar to ZIP, BZIP2 and RAR that generally achieves higher compression rates. The uncompressed size for this dataset is roughly 8.6 gigabytes.

This dataset is licensed under the Creative Commons Attribution 3.0 United States license. To provide attribution, please cite to “”

Update 1: If you are experiencing difficulties downloading the file, try using wxDFast, a free open source download manager.

Update 2: Bandwidth management has been implemented. This is likely set too conservatively right now and will be adjusted up soon.


Custodian Names and Titles

In order to reconstruct the Enron email dataset accurately it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, rank, title, etc.

Various datasets have used a string consisting generally of lastname and first initial to identify custodians. This ID appears in the FERC dataset as the ORIGIN field and in the CALO dataset as a directory name.

Using this ID, different versions of the Enron email dataset use different numbers of users. Notably, the following numbers of users are used:

  1. 158 users: The FERC dataset identifies 158 unique users using the iCONECT ORIGIN database column.
  2. 150 users: The CALO dataset identifies 150 unique users using the maildir user directory structure.
  3. 149 users: Andrés Corrada-Emmanuel has identified 149 unique users noting that phanis-s is a misspelling of panus-s.
  4. 148 users: has identified 148 unique users noting that whalley-l is a duplicate of whalley-g, both representing Lawrence “Greg” Whalley. has verified the duplicates identified by CALO, identified two more duplicates, and corrected two misspellings as shown in this Enron custodian list. This list was created before analyzing Corrada-Emmanuel’s custodian list, which correctly identified one of the additional duplicates.
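
Merging duplicate and misspelled IDs into a canonical custodian list can be sketched in Python as follows. The two alias pairs come from the list above; the function names, the sample ID list, and the exact “lastname-initial” formatting rule are assumptions for illustration and do not cover every correction that was made.

```python
# Alias pairs named above: misspelled or duplicate ID -> canonical ID.
ALIASES = {
    "phanis-s": "panus-s",
    "whalley-l": "whalley-g",
}


def custodian_id(last_name: str, first_name: str) -> str:
    """Build an ID from the last name and first initial (assumed format)."""
    return f"{last_name}-{first_name[0]}".lower()


def canonical(cid: str) -> str:
    """Map a known alias to its canonical custodian ID."""
    return ALIASES.get(cid, cid)


# Illustrative subset of directory names, not the full custodian list.
raw_ids = ["panus-s", "phanis-s", "whalley-g", "whalley-l", "lay-k"]
custodians = sorted({canonical(cid) for cid in raw_ids})
print(len(raw_ids), "->", len(custodians), custodians)
```

Applied to the 150 CALO directory names, these two merges alone account for the drop from 150 to 148 custodians.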

After identifying custodians by ID, additional information can be associated with the custodians including names, ranks, titles, email addresses, etc. A large part of the work has been performed by Jitesh Shetty and Jafar Adibi in their Ex-Employee Status Report. Combining this data with some additional data allows us to associate this information directly with the custodians in the dataset. The final results are available in this custodian information report. Corrada-Emmanuel has also created a custodian ID to email address mapping. Email addresses will be incorporated by at a future date.
