Page Stream Segmentation with Convolutional Neural Nets Combining Textual and Visual Features
When paper files are digitized via OCR, preserving the document context of each scanned image is a major requirement. Page stream segmentation (PSS) is the task of automatically separating a stream of scanned images into multi-page documents. This can be immensely helpful in the context of "digital mailrooms" or the retro-digitization of large paper archives. In a digitization project with a German federal archive, we developed a novel PSS approach based on convolutional neural networks (CNNs). Our approach combines image and text features to improve document separation results. Evaluation shows that our approach achieves accuracies of up to 93%, which can be regarded as state-of-the-art for this task.
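To make the task definition concrete, the following minimal sketch shows how per-page decisions could be turned into multi-page documents. It assumes a classifier that emits one boolean per page ("this page starts a new document"); the function name `segment_stream` and the flag representation are hypothetical illustrations and are not taken from the paper.

```python
from typing import List, Sequence


def segment_stream(first_page_flags: Sequence[bool]) -> List[List[int]]:
    """Group page indices into documents.

    first_page_flags[i] is True when page i is predicted to start a new
    document (an assumed classifier output, not the authors' exact format).
    """
    documents: List[List[int]] = []
    for index, starts_new in enumerate(first_page_flags):
        # The first page of the stream always opens a document,
        # regardless of what the classifier predicted for it.
        if starts_new or not documents:
            documents.append([index])
        else:
            documents[-1].append(index)
    return documents


# Six scanned pages; pages 0, 3 and 4 are predicted to start new documents.
flags = [True, False, False, True, True, False]
print(segment_stream(flags))  # [[0, 1, 2], [3], [4, 5]]
```

Under this framing, segmentation quality depends entirely on the per-page decision, which is where the combined image and text features would come in.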