Content Reduction, Surprisal and Information Density Estimation for Long Documents

09/12/2023
by Shaoxiong Ji, et al.

Many computational linguistic methods have been proposed to study the information content of languages. We consider two research questions: 1) how is information distributed over long documents, and 2) how does content reduction, such as token selection and text summarization, affect the information density of long documents? We present four criteria for estimating information density in long documents: surprisal, entropy, uniform information density, and lexical density. The first three adopt measures from information theory. We propose an attention-based word selection method for clinical notes and study machine summarization for documents from multiple domains. Our findings reveal systematic differences in the information density of long texts across domains. Empirical results on automated medical coding from long clinical notes show the effectiveness of the attention-based word selection method.
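The four criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulations, so everything below rests on textbook definitions and should be read as an assumption: surprisal is the negative log probability of a token, entropy is the expected surprisal, uniform information density is operationalized here as the variance of per-token surprisal (one common choice among several), and lexical density is the share of content words among all tokens. The unigram model estimated from the document itself, the function names, and the toy `FUNCTION_WORDS` set are hypothetical and chosen only for illustration; a study of this kind would more plausibly take token probabilities from a trained language model.

```python
import math
from collections import Counter

# Toy stop list used only for this illustration (an assumption, not from the paper).
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "has", "is"}


def surprisal(prob):
    """Surprisal in bits of an event with probability `prob`: -log2(p)."""
    return -math.log2(prob)


def unigram_probs(tokens):
    """Relative-frequency unigram probabilities estimated from `tokens`."""
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: count / total for tok, count in counts.items()}


def entropy(probs):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)


def uid_variance(surprisals):
    """Population variance of per-token surprisal.

    Lower variance means information is spread more evenly over the
    sequence; zero means perfectly uniform under this operationalization.
    """
    mean = sum(surprisals) / len(surprisals)
    return sum((s - mean) ** 2 for s in surprisals) / len(surprisals)


def lexical_density(tokens, function_words=FUNCTION_WORDS):
    """Fraction of tokens that are not in the function-word list."""
    content = [t for t in tokens if t.lower() not in function_words]
    return len(content) / len(tokens)


if __name__ == "__main__":
    tokens = "the patient has a fever and the patient has a cough".split()
    probs = unigram_probs(tokens)
    per_token = [surprisal(probs[t]) for t in tokens]
    print("entropy (bits):", entropy(probs))
    print("mean surprisal (bits):", sum(per_token) / len(per_token))
    print("UID variance:", uid_variance(per_token))
    print("lexical density:", lexical_density(tokens))
```

Under these assumptions, comparing the four numbers before and after a content-reduction step (for example, after dropping tokens or summarizing) gives a rough picture of how the reduction changes information density.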


research
04/27/2021

Towards Clinical Encounter Summarization: Learning to Compose Discharge Summaries from Prior Notes

The records of a clinical encounter can be extensive and complex, thus p...
research
10/26/2018

Extractive Summarization of EHR Discharge Notes

Patient summarization is essential for clinicians to provide coordinated...
research
07/03/2022

An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics

Long documents such as academic articles and business reports have been ...
research
07/13/2023

Making the Most Out of the Limited Context Length: Predictive Power Varies with Clinical Note Type and Note Section

Recent advances in large language models have led to renewed interest in...
research
03/15/2022

Long Document Summarization with Top-down and Bottom-up Inference

Text summarization aims to condense long documents and retain key inform...
research
05/01/2020

Attend to Medical Ontologies: Content Selection for Clinical Abstractive Summarization

Sequence-to-sequence (seq2seq) network is a well-established model for t...
research
04/18/2021

Attention-based Clinical Note Summarization

The trend of deploying digital systems in numerous industries has induce...
