Cross-domain Retrieval in the Legal and Patent Domains: a Reproducibility Study

by Sophia Althammer, et al.

Domain-specific search has always been a challenging information retrieval task, owing to the specialized language, the unique task setting, and the lack of accessible queries and corresponding relevance judgements. In recent years, pretrained language models such as BERT have revolutionized web and news search, and the community naturally aims to transfer these advances to retrieval models for domain-specific search. In the context of legal document retrieval, Shao et al. propose the BERT-PLI framework, which models Paragraph-Level Interactions with the language model BERT. In this paper we reproduce the original experiments: we clarify pre-processing steps, add missing scripts for framework steps, and investigate different evaluation approaches; however, we are not able to reproduce the evaluation results. Contrary to the original paper, we find that domain-specific paragraph-level modelling does not appear to improve the performance of BERT-PLI compared to paragraph-level modelling with the original BERT. In addition to our legal-search reproducibility study, we investigate BERT-PLI for document retrieval in the patent domain. We find that the BERT-PLI model does not yet achieve performance improvements for patent document retrieval compared to the BM25 baseline. Furthermore, we evaluate the BERT-PLI model for cross-domain retrieval between the legal and patent domains on individual components, both at the paragraph and the document level. We find that transferring BERT-PLI at the paragraph level yields comparable results between the two domains, as well as first promising results for cross-domain transfer at the document level. For reproducibility and transparency, and to benefit the community, we make our source code and the trained models publicly available.
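As a rough illustration of the paragraph-level interaction idea behind BERT-PLI (this is a sketch, not the authors' code): a fine-tuned BERT model scores every pair of query paragraph and candidate paragraph, the resulting interaction matrix is max-pooled over the candidate axis, and the pooled sequence is aggregated into a document-level score. The original framework performs this aggregation with an RNN and attention; here a simple softmax-weighted sum stands in for it, and `overlap_score` is a hypothetical word-overlap stand-in for the BERT paragraph-pair scorer.

```python
import math

def interaction_matrix(query_paras, cand_paras, pair_score):
    """Score every (query paragraph, candidate paragraph) pair."""
    return [[pair_score(q, c) for c in cand_paras] for q in query_paras]

def document_score(query_paras, cand_paras, pair_score):
    """Aggregate the paragraph-level interactions into one document score."""
    matrix = interaction_matrix(query_paras, cand_paras, pair_score)
    # Max-pool over candidate paragraphs: keep the strongest match
    # for each query paragraph.
    pooled = [max(row) for row in matrix]
    # BERT-PLI feeds this sequence to an RNN with attention; a softmax
    # over the pooled scores serves as a toy attention here.
    weights = [math.exp(s) for s in pooled]
    total = sum(weights)
    return sum(w / total * s for w, s in zip(weights, pooled))

def overlap_score(q, c):
    """Toy stand-in for the fine-tuned BERT pair scorer (Jaccard overlap)."""
    qs, cs = set(q.lower().split()), set(c.lower().split())
    return len(qs & cs) / max(len(qs | cs), 1)
```

In the actual framework the per-pair scorer is the expensive component, which is why both the reproduction and the cross-domain transfer experiments can swap it out and study the paragraph-level and document-level stages separately.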






