On Extracting Data from HTML Tables
The Web provides a great deal of data in user-friendly tabular formats encoded in HTML. Information extractors are intended to extract those data as datasets that can feed business applications. Many proposals exist to implement them, which has motivated several previous surveys. Unfortunately, those surveys are outdated, and we do not think that updating them suffices because they do not provide a good conceptual framework, a taxonomy of web tables, an analysis of the exact tasks involved, or a good comparison framework. This article presents a review of the literature that avoids these problems, which we hope will be useful to both researchers and practitioners.