Stop words, or noise words, are black lists of words that a search engine chooses not to index. Some search engines use them because they consider these words to be of little value; however, these words should still be indexed in eDiscovery, where it is more important to find all responsive documents than to provide just a selection of results, as is acceptable in settings where false negatives may not pose a large risk (e.g. web search engines).
Disadvantages of Stop Words or Noise Words for eDiscovery
- Information Removal (Lower Recall, False Negatives, and Increased Risk): Stop words are often words of little value and interest for search, which is one reason for not indexing them; however, sometimes they are exactly the words you are looking for. A common example is the phrase “to be or not to be.” Individually, each of these words often appears in a stop word list, but combined they have obvious value. Stop words can also cause problems with terms like C++, which would often not be indexed at all because both the “+” symbol and the single letter “c” are eliminated, rendering this important technology term, which has an obvious meaning, unfindable.
- Increased Noise (Lower Precision, False Positives, and Increased Costs): When individual letters are not indexed, a search query like “vitamin a” is reduced to “vitamin”, returning many more documents than are actually responsive and leading to more review and additional expense. Another area where this is often problematic is stock symbols, which can be very short or identical to common words.
- The Need to Identify the Record’s Language: Stop words differ by language, so a document’s language must be identified before stop words can be removed. If a document’s language is identified incorrectly, or if a document contains multiple languages, meaningful words may be eliminated, leading to additional problems with false negatives and false positives. When black lists are used, testing must be performed to ensure that the correct language is identified and the correct black list is applied.
- Complete Term Indexing: For eDiscovery, indexing all words ensures that every word can be found, which increases findability no matter which terms turn out to be important.
- Partial Term Indexing with Black Lists: When black lists are used, the black listed words cannot be searched on, and if they become important in the course of eDiscovery, the ESI may need to be re-indexed without those words on the black list. If either party in eDiscovery uses black lists, it is important to understand which words have been eliminated from the search index and how that will affect the search results. In that case, ask for the list of stop or noise words to evaluate the accessibility of documents with the search queries of interest.
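
The effects described above can be reproduced with a toy example. The following is a minimal sketch, not the behavior of any particular search engine: the stop list, the four sample documents, and the function names (build_index, search, dropped_terms) are all assumptions made for illustration. The search is a simple AND over terms rather than a true phrase search, and the tokenizer ignores punctuation, so it does not model symbol handling for terms like C++.

```python
import re

# Hypothetical stop list for illustration; real engines ship their own lists.
STOP_WORDS = frozenset({"a", "be", "not", "or", "to"})

# Hypothetical sample documents keyed by document id.
DOCS = {
    1: "Take vitamin A daily",
    2: "Vitamin D supplements",
    3: "To be or not to be",
    4: "Vitamin B12 dosage",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs, stop_words=frozenset()):
    """Map each indexed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in tokenize(text):
            if term not in stop_words:
                index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, stop_words=frozenset()):
    """Return ids of documents that contain every searchable query term."""
    terms = [t for t in tokenize(query) if t not in stop_words]
    if not terms:
        return set()  # the whole query consisted of stop words
    return set.intersection(*(index.get(t, set()) for t in terms))


def dropped_terms(query, stop_words):
    """List the query terms that a given stop list would silently remove."""
    return [t for t in tokenize(query) if t in stop_words]


full_index = build_index(DOCS)                  # complete term indexing
filtered_index = build_index(DOCS, STOP_WORDS)  # partial indexing with a black list

# Increased noise: "vitamin a" degrades to "vitamin" under the stop list.
print(sorted(search(full_index, "vitamin a")))                      # [1]
print(sorted(search(filtered_index, "vitamin a", STOP_WORDS)))      # [1, 2, 4]

# Information removal: every term of the query is a stop word.
print(sorted(search(full_index, "to be or not to be")))                  # [3]
print(sorted(search(filtered_index, "to be or not to be", STOP_WORDS)))  # []

# Auditing a query against a stop list received from the other party.
print(dropped_terms("vitamin a", STOP_WORDS))                       # ['a']
```

A helper like dropped_terms is one way to act on a stop word list obtained from the other party: running each proposed query through it shows which terms would never reach the index, before any results are relied upon.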