Reliable Access to Massive Restricted Texts: Experience-based Evaluation

03/02/2019
by   Zong Peng, et al.

Libraries are seeing growing numbers of digitized textual corpora that frequently come with restrictions on their content. Large corpora intended for computational analysis, while of interest to scholars, can be cumbersome to manage because of the combination of size, granularity of access, and access restrictions. Efficient management of such a collection for general access, especially under failures, depends on the primary storage system. In this paper, we identify the requirements of managing a massive text corpus for computational analysis and use them as the basis for evaluating candidate storage solutions. The study is based on the 5.9 billion page collection of the HathiTrust digital library. Our findings led to the choice of Cassandra 3.x for the primary back-end store, which is currently in deployment at the HathiTrust Research Center.


