I recently returned from the EDRM VI Kickoff Meeting in Minneapolis and wanted to update everyone on the Data Set Project, which I co-chair. The Data Set Project’s goals have expanded to cover not only projects that make testing and evaluating eDiscovery solutions easier, but also projects that should lower processing costs through better culling and streamline the litigation process through better information on ESI for negotiations and expert witnesses. Our current projects are listed below:
- EDRM ESI Reference Data Sets: EDRM provides a number of reference ESI data sets that can be used for testing and benchmarking. Currently, these include the following:
  - EDRM Enron PST Data Set: 40 GB of Enron e-mail messages and attachments in PST format organized in 32 zipped files, each less than 700 MB in size, containing 168 .pst files.
  - EDRM File Format Data Set: 381 files covering 200 file formats.
  - EDRM Internationalization Data Set: A snapshot of selected Ubuntu localization mailing list archives covering 23 languages in 724 MB of email.
- EDRM Hash Data Sets: Hash data sets for use in culling collections to remove non-user generated files. The hash sets will provide hashes for files to cull on a deterministic and probabilistic basis.
  - EDRM Software Reference Data Set (SRDS): An enhancement of the NSRL or “NIST List,” the EDRM SRDS or “EDRM List” seeks to provide a list of hashes covering popular software as it is installed on a system, along with tools for generating the hashes.
  - EDRM Probabilistic Hash Data Set (PHDS): This project seeks to create a probabilistic approach for determining whether a file is a user file or a system file for culling purposes. With this system, there would be no need to positively identify a file as a known file beforehand, as there is with the EDRM SRDS.
- EDRM Data Set Documentation Projects
  - EDRM ESI Checklist: When litigants prepare for the initial Meet & Confer, the EDRM ESI Checklist will help ensure that they cover potential ESI locations for both the parties they represent and opposing parties.
  - EDRM ESI Guide: The EDRM ESI Guide is designed to be the eDiscovery practitioner’s guide to ESI and the nuances of the ESI types encountered in the eDiscovery process. Expert witnesses, users, and vendors should be able to use the EDRM ESI Guide to ensure they understand how ESI looks and behaves from an eDiscovery perspective.
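
To make the deterministic culling idea behind the hash data sets more concrete, here is a minimal Python sketch. It assumes the hash set is simply a collection of SHA-1 hex digests for known, non-user-generated files. The actual format of the EDRM SRDS is not described in this post, and the function names `sha1_of_file` and `cull_known_files` are hypothetical, chosen only for illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cull_known_files(paths, known_hashes):
    """Split paths into (kept, culled) by membership in known_hashes.

    known_hashes is assumed to be an iterable of SHA-1 hex digests
    (hypothetical format; compared case-insensitively).
    """
    known = {h.lower() for h in known_hashes}
    kept, culled = [], []
    for path in paths:
        if sha1_of_file(path) in known:
            culled.append(path)  # matches a known file: remove from review
        else:
            kept.append(path)    # unknown hash: treat as potential user file
    return kept, culled
```

A collection would be processed by passing the list of collected file paths and the loaded hash set to `cull_known_files`, then sending only the `kept` list on for processing. This exact-match approach only culls files that were positively identified beforehand, which is the limitation the probabilistic hash data set is meant to address.
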
The first two project categories are covered in the EDRM VI Kickoff Presentation for the Data Set Project below, while the documentation projects were only just initiated at the kickoff meeting.
If you are interested in participating in any of these projects, please join EDRM and sign up for the Data Set Project.