The Anatomy of a Search and Mining System for Digital Archives

03/23/2016
by Martyn Harris, et al.

Samtla (Search And Mining Tools with Linguistic Analysis) is a digital humanities system designed in collaboration with historians and linguists to assist their research by quantifying the content of any textual corpus through approximate phrase search and document comparison. The retrieval engine uses a character-based n-gram language model rather than the conventional word-based one, which gives it great flexibility in language-agnostic query processing. The index is implemented as a space-optimised character-based suffix tree with an accompanying database of document content and metadata. A number of text mining tools are integrated into the system to allow researchers to discover textual patterns, perform comparative analysis, and find out what is currently popular in the research community. Herein we describe the system architecture, user interface, models and algorithms, and data storage of the Samtla system. We also present several case studies of its usage in practice, together with an evaluation of the system's ranking performance through crowdsourcing.
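
To make the idea of a character-based n-gram language model concrete, the following is a minimal sketch of query-likelihood ranking over character trigrams. It is not the Samtla implementation: the class and function names (`CharNgramModel`, `rank`), the add-one smoothing, and the toy documents are all assumptions chosen for illustration, and a plain in-memory counter stands in for the suffix tree index described above.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class CharNgramModel:
    """Hypothetical per-document character n-gram model.

    Add-one smoothing is an assumed choice, not taken from the paper.
    """

    def __init__(self, text, n=3):
        self.n = n
        # Pad with spaces so the first characters also have a context.
        padded = " " * (n - 1) + text.lower()
        self.ngrams = Counter(char_ngrams(padded, n))
        # Count how often each (n-1)-character context precedes a character.
        self.contexts = Counter()
        for gram, count in self.ngrams.items():
            self.contexts[gram[:-1]] += count

    def log_likelihood(self, query, vocab_size):
        """Log-probability of the query under this document's model."""
        padded = " " * (self.n - 1) + query.lower()
        score = 0.0
        for gram in char_ngrams(padded, self.n):
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + vocab_size
            score += math.log(numerator / denominator)
        return score


def rank(query, documents, n=3):
    """Return document ids ordered from most to least likely for the query."""
    alphabet = {ch for text in documents.values() for ch in text.lower()}
    alphabet |= set(query.lower()) | {" "}
    scored = [
        (CharNgramModel(text, n).log_likelihood(query, len(alphabet)), doc_id)
        for doc_id, text in documents.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


if __name__ == "__main__":
    docs = {
        "a": "the king of babylon built a temple",
        "b": "merchants traded grain and silver",
    }
    # A misspelled query still shares most of its trigrams with document "a".
    print(rank("kingg of babylon", docs))
```

Because scoring works on characters rather than words, no tokeniser or stemmer is needed, and a query with a spelling variant still overlaps with the relevant document on most of its n-grams. This is one plausible reading of why a character-level model suits approximate, language-agnostic phrase search.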


research
10/05/2018

Sifaka: Text Mining Above a Search API

Text mining and analytics software has become popular, but little attent...
research
09/05/2020

Analysis and representation of Igbo text document for a text-based system

The advancement in Information Technology (IT) has assisted in inculcati...
research
08/08/2022

Information Extraction from Scanned Invoice Images using Text Analysis and Layout Features

While storing invoice content as metadata to avoid paper document proces...
research
12/03/2018

Comparing Neural- and N-Gram-Based Language Models for Word Segmentation

Word segmentation is the task of inserting or deleting word boundary cha...
research
06/04/2018

History Playground: A Tool for Discovering Temporal Trends in Massive Textual Corpora

Recent studies have shown that macroscopic patterns of continuity and ch...
research
07/07/2017

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Predicting the click-through rate of an advertisement is a critical comp...
research
10/14/2017

On Supporting Digital Journalism: Case Studies in Co-Designing Journalistic Tools

Since 2013 researchers at University College Dublin in the Insight Centr...
