On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
In this paper we investigate the impact of simple text preprocessing decisions (particularly tokenizing, lemmatizing, lowercasing and multiword grouping) on the performance of a state-of-the-art text classifier based on convolutional neural networks. Despite potentially affecting the final performance of any given model, this aspect has not received substantial attention in the deep learning literature. We perform an extensive evaluation on standard benchmarks from text categorization and sentiment analysis. Our results show that a simple tokenization of the input text is often enough, but they also highlight the importance of applying the same preprocessing to the evaluation data and to the corpus used for training word embeddings.
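To make the four preprocessing decisions concrete, the sketch below shows one way to produce the corresponding token streams in Python. It uses NLTK purely for illustration; the paper does not prescribe these tools, and the multiword vocabulary and function name `preprocess` are hypothetical. The same function would need to be applied both to the classifier's input and to the corpus used for training word embeddings to keep the two consistent.

```python
# A minimal sketch (assuming NLTK) of the preprocessing variants named in the
# abstract: tokenization, lemmatization, lowercasing and multiword grouping.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)    # tokenizer model
nltk.download("wordnet", quiet=True)  # lemmatizer data

lemmatizer = WordNetLemmatizer()

# Hypothetical multiword expressions to merge into single tokens.
MULTIWORDS = {
    ("new", "york"): "new_york",
    ("machine", "learning"): "machine_learning",
}

def preprocess(text, lowercase=False, lemmatize=False, group_multiwords=False):
    """Return a token list under one combination of preprocessing choices."""
    tokens = word_tokenize(text)  # simple tokenization is the baseline
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    if group_multiwords:
        grouped, i = [], 0
        while i < len(tokens):
            pair = tuple(t.lower() for t in tokens[i:i + 2])
            if pair in MULTIWORDS:
                grouped.append(MULTIWORDS[pair])
                i += 2
            else:
                grouped.append(tokens[i])
                i += 1
        tokens = grouped
    return tokens

print(preprocess("Machine Learning conferences are held in New York.",
                 lowercase=True, group_multiwords=True))
# ['machine_learning', 'conferences', 'are', 'held', 'in', 'new_york', '.']
```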