Home Articles posted by John Wang (Page 4)

1073 “Global Warming” Emails Leaked

Published on November 23, 2009 by John Wang in email

A collection of 1073 email messages and 72 documents from the Climate Research Unit (CRU) at Britain’s University of East Anglia (UEA), related to climate change research, was leaked onto the Internet last week. This collection is currently being widely discussed on the Internet and gives a peek into how climate change research has been managed,

Read More…

 

EDRM Enron PST files are now available

The EDRM Enron PST files are now available on the EDRM Data Set website thanks to George Socha, EDRM, and ZL Technologies. I am co-lead of the EDRM Data Set project and personally worked on this data set at ZL Technologies so I thought I would provide a brief introduction to this data before our

Read More…

 

Extreme Performance Archiving Presentation at Oracle OpenWorld (OOW)

On Tuesday, October 13, I gave a presentation at Oracle OpenWorld on E-Mail Archiving with “Extreme Performance” and “Green Computing” using a ZL+Oracle solution. The presentation discusses proven performance 100x greater than other solutions by using technologies such as private cloud computing and grid computing. The Extreme Performance theme of the show is especially fitting

Read More…

 

ZL Unified Archive named Trend Setting Product of 2009

Published on August 29, 2009 by John Wang in award, ZL

I’m happy to report that ZL Unified Archive has been named Trend Setting Product of 2009 by KMWorld. As the disciplines of eDiscovery, Records Management, and Email Management continue to merge, it is becoming more important than ever to proactively and effectively manage information in a scalable manner. Regarding this year’s selected products, KMWorld wrote:

Read More…

 

Exchange 2010 Archiving Considerations

Published on August 3, 2009 by John Wang in email

Email servers were never designed to archive email messages for long periods of time, apply organizational retention and disposition policies, or perform fast search across an entire email environment. However, the email landscape has changed considerably and organizations that must contend with these requirements have increasingly turned to archiving solutions to fill this need. With

Read More…

 

EDRM Enron Data Set with Attachments

Published on May 15, 2009 by John Wang in Uncategorized

I’m pleased to announce that an initial version of the EDRM Enron Email Data Set consisting of 40GB of PST files with attachments and folder structure is now available within the EDRM project as of the EDRM 2009-2010 Kick-Off Meeting. The EDRM Data Set Project is now working to make this data set publicly available.

Read More…

 

Enron Data at 2009-2010 EDRM Kick-Off Meeting

A number of people have contacted me about getting the current PST corpus by an alternative method. This is partly due to the bandwidth restrictions that have been in place for the HTTP download. I planned to add some other download methods but haven’t had time yet. Until then, if you will be at

Read More…

 

Use of Search Engine Term Black Lists (Stop Words or Noise Words) Can be Detrimental for Findability

Stop words, or noise words, are black lists of words that search engines choose not to index. They are used by some search engines that consider these words of little value; however, the words should still be indexed in eDiscovery, where it is more important to find all responsive documents than to provide just a

Read More…

 

The Mailbox PST Dataset

Although much of the original Enron email came in PST files, the most common forms in which to get this email today are MIME format (through CALO / CMU) and a MySQL database. To recreate the email in PST format, Pete Warden performed an earlier PST conversion of the CALO dataset. Pete’s PST is

Read More…

 

Custodian Names and Titles

To reconstruct the Enron email dataset accurately, it is important to identify the correct number of custodians for which email exists. From this canonical list, we can build out user information including actual names, rank, title, etc. Various datasets have generally used a string consisting of last name and first initial to identify custodians.

Read More…

 