GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

06/02/2023
by Aleksandra Piktus, et al.

Noting the urgent need for fast and user-friendly tools for qualitative analysis of the large-scale textual corpora used in modern NLP, we propose turning to mature and well-tested methods from Information Retrieval (IR), a research field with a long history of tackling TB-scale document collections. We discuss how Pyserini, a widely used toolkit for reproducible IR research, can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts. We leverage the existing functionalities of both platforms while proposing novel features that further facilitate their integration. Our goal is to give NLP researchers tools that allow them to develop retrieval-based instrumentation for their data analytics needs with ease and agility. We include a Jupyter Notebook-based walkthrough of the core interoperability features, available on GitHub at https://github.com/huggingface/gaia. We then demonstrate how the ideas we present can be operationalized to create a powerful tool for qualitative data analysis in NLP. We present GAIA Search, a search engine built following the principles laid out above, which gives access to four popular large-scale text collections. GAIA serves a dual purpose: it illustrates the potential of the methodologies we discuss, and it is a standalone qualitative analysis tool that NLP researchers can use to understand datasets before using them in training. GAIA is hosted live on Hugging Face Spaces at https://huggingface.co/spaces/spacerini/gaia.
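To make the idea of retrieval-based data exploration concrete, here is a minimal sketch of BM25 keyword search over a toy in-memory corpus. It is written in plain Python and does not use the Pyserini or Hugging Face APIs. The class name `TinyBM25`, the whitespace tokenizer, the `k1` and `b` values, and the sample documents are all illustrative assumptions, not taken from the paper or from GAIA Search.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real IR
    # toolkits apply proper analyzers with stemming and stopwords).
    return text.lower().split()


class TinyBM25:
    """Hypothetical toy BM25 index, for illustration only."""

    def __init__(self, docs, k1=0.9, b=0.4):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        # Per-document term frequencies.
        self.tf = [Counter(t) for t in self.tokens]
        self.avg_len = sum(len(t) for t in self.tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tf for term in tf)

    def idf(self, term):
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        tf, length = self.tf[i], len(self.tokens[i])
        # Length normalization: longer documents are penalized.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in tokenize(query)
            if t in tf
        )

    def search(self, query, k=3):
        scored = [(self.score(query, i), i) for i in range(len(self.docs))]
        hits = [(s, i) for s, i in scored if s > 0]
        # Highest score first; ties broken by document position.
        hits.sort(key=lambda x: (-x[0], x[1]))
        return [(i, self.docs[i]) for s, i in hits[:k]]


# Toy corpus standing in for a training dataset (made-up sentences).
corpus = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock markets fell sharply",
]
index = TinyBM25(corpus)
for doc_id, text in index.search("cat"):
    print(doc_id, text)
```

A system operating on TB-scale collections would rely on an inverted index rather than scoring every document per query; this sketch only shows the ranking logic that lets a researcher pull up the documents in a dataset that match a keyword query.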


