Is there something I'm missing? Topic Modeling in eDiscovery

by Herbert L. Roitblat, et al.

In legal eDiscovery, the parties are required to search through their electronically stored information to find documents that are relevant to a specific case. Negotiations over the scope of these searches are often based on a fear that something will be missed. This paper continues an argument that discovery should be based on identifying the facts of a case. If a search process is less than complete (if it has Recall less than 100%), it may still be complete in presenting all of the relevant available topics. In this study, Latent Dirichlet Allocation was used to identify 100 topics from all of the known relevant documents. The documents were then categorized to about 80% Recall (i.e., 80% of the relevant documents were found by the categorizer, designated the hit set, and 20% were missed, designated the missed set). Despite the fact that less than all of the relevant documents were identified by the categorizer, the documents that were identified contained all of the topics derived from the full set of documents. This same pattern held whether the categorizer was a naïve Bayes categorizer trained on a random selection of documents or a Support Vector Machine trained with Continuous Active Learning (which focuses evaluation on the most-likely-to-be-relevant documents). No topics were identified in either categorizer's missed set that were not already seen in the hit set. Not only is a computer-assisted search process reasonable (as required by the Federal Rules of Civil Procedure), it is also complete when measured by topics.
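The comparison described in the abstract can be illustrated with a minimal sketch. It assumes each relevant document has already been assigned a set of topic ids (for example, by a topic model such as Latent Dirichlet Allocation) and that the categorizer's output is a list of retrieved document ids. The function names, the `doc_topics` mapping, and the toy data are hypothetical and are not taken from the paper.

```python
def recall(relevant_ids, hit_ids):
    """Fraction of relevant documents that the categorizer retrieved."""
    relevant = set(relevant_ids)
    return len(relevant & set(hit_ids)) / len(relevant)


def topics_only_in_missed(doc_topics, hit_ids):
    """Topics that occur in missed relevant documents but in no hit document.

    doc_topics maps each relevant document id to the set of topic ids
    assigned to it (a hypothetical input, e.g. topics above some weight).
    """
    hit_ids = set(hit_ids)
    hit_topics, missed_topics = set(), set()
    for doc_id, topics in doc_topics.items():
        if doc_id in hit_ids:
            hit_topics.update(topics)
        else:
            missed_topics.update(topics)
    return missed_topics - hit_topics


# Toy data (invented for illustration): five relevant documents,
# four of which the categorizer found.
doc_topics = {
    "d1": {0, 1},
    "d2": {1, 2},
    "d3": {2, 3},
    "d4": {0, 3},
    "d5": {1, 3},  # missed by the categorizer
}
hit_ids = ["d1", "d2", "d3", "d4"]

print(recall(doc_topics.keys(), hit_ids))           # 0.8
print(topics_only_in_missed(doc_topics, hit_ids))   # set()
```

An empty result from `topics_only_in_missed` corresponds to the outcome the abstract reports: the missed set contains no topic that is absent from the hit set, even though Recall is below 100%.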
