KBSET -- Knowledge-Based Support for Scholarly Editing and Text Processing

August 29, 2019, by Jana Kittelmann, et al.

KBSET supports a practical workflow for scholarly editing, based on using LaTeX with dedicated commands for semantics-oriented markup and a Prolog-implemented core system. Prolog plays various roles there: as a query language and access mechanism for large Semantic Web fact bases, as a data representation for structured documents, and as a workflow model for advanced application tasks. The core system includes a LaTeX parser and a facility for the identification of named entities. We also sketch future perspectives of this approach to scholarly editing based on techniques of computational logic.


1 Introduction

In the age of Digital Humanities, scholarly editing [10] involves the combination of natural language text with machine-processable semantic knowledge, typically expressed as markup. The best developed machine support for scholarly editing is the XML-based TEI format [12], mainly targeted at rendering for different media and at the extraction of metadata, achieved through semantics-oriented or declarative markup. Recent efforts extend TEI with aspects that are orthogonal to its original ordered hierarchy of content objects (OHCO) text model, through support for entities like names, dates, people, and places, as well as structuring with linking, segmentation, and alignment [12, Chap. 13 and 16]. Ways to combine TEI with Semantic Web techniques, data modeling, and ontologies are also being investigated [2]. Nevertheless, there are various demands in today's practical scholarly editing, as well as with respect to future perspectives, that are not well covered by TEI and the associated XML processing workflow, and which we address here:

  1. An economic workflow for scholarly editing should be supported. Only very few people from the Humanities seem willing to write XML documents, but it should be possible for them to create, review, and validate text annotations as well as fact bases with metadata and knowledge on entities such as persons and places.

  2. It should be possible to generate high-quality print and hypertext presentations in an economic way.

  3. Linking with external knowledge bases should be supported. These include results of other edition projects as well as large fact bases such as authority files like the Gemeinsame Normdatei (GND, http://www.dnb.de/gnd), metadata repositories like Kalliope (http://kalliope-verbund.info), domain-specific bases like GeoNames, or aggregated bases like YAGO [4] and DBpedia [9].

  4. It should be possible to incorporate advanced semantics related techniques such as named entity recognition or statistics-based text analysis.

  5. It should be possible to couple object text with associated information in ways that are more flexible than in-place markup: markup can be by different authors or automatically generated, and can serve some specific purpose. Queries and transformations should remain applicable also after changes to the markup.

  6. It should be possible to associate a proper logic-based semantics with annotations and links. Ontology reasoning alone is not sufficient, as classification does not seem to be the main operation of interest. The GND fact base on persons, institutions, and works, for example, makes do with a quite small ontology.

Our environment KBSET (Knowledge-Based Support for Scholarly Editing and Text Processing) is, on the one hand, a practical workflow that combines different systems and is applied in a large project, the edition of the correspondence of the philosopher and polymath Johann Georg Sulzer (1720–1779) with the author, critic, and poet Johann Jakob Bodmer (1698–1783). The print version will be published as [11, Vol. 10] and, including commentaries and registers, spans about 2000 pages. On the other hand, KBSET is a prototype system that allows experimenting with various advanced features.

As the basic format for scholarly editing, KBSET suggests using LaTeX with a set of newly defined custom commands that provide semantics-oriented markup adequate for the application domain, which currently is the edition of correspondences. This is complemented by a core system written in Prolog, which includes a LaTeX parser, an internal representation of text and annotations, support for the representation of entities like persons, places, and dates, as well as a named entity identifier based on the GND as gazetteer. The core version of KBSET is available as free software from its homepage

http://cs.christophwernhard.com/kbset.

It comes with a demo application, the draft edition of a book from the 19th century. Release of the extended version of KBSET used for the Sulzer/Bodmer correspondence is planned together with the release of the digital edition in the near future. Most importantly, the forthcoming version adds the specification of, and support for, descriptive LaTeX markup for correspondences and supports the generation of an HTML presentation, similar to www.pueckler-digital.de [6]. The 2016 version of KBSET was presented at DHd 2016 [7] and AITP 2016 [8].

The rest of this system description is structured as follows: In Sect. 2 we discuss the practical workflows for digital scholarly editing supported by KBSET. Prolog plays various roles in the environment, which are outlined in Sect. 3. In Sect. 4 the Prolog-implemented core components of the system are described. We conclude in Sect. 5 by sketching future perspectives of scholarly editing and logic-based knowledge processing.

2 Workflows of Scholarly Editing Supported by KBSET

Three phases can be identified for machine assisted scholarly editing:

  1. Creating the object text, enhanced by markup and other statements in formal languages.

  2. Generating intermediate representations for inspection by humans or machines, analogously to debugging.

  3. Generating consumable presentations.

Support for all phases should be of high quality, which implies incorporating existing specialized systems, in our case only free software, in particular the GNU Emacs text editor and the LaTeX document preparation and typesetting system along with various packages.

Inputs
  • Object Text Documents. Format: LaTeX with domain-specific, semantics-based markup, e.g., for letter correspondences. Tool: Emacs.
  • Annotation Documents. Annotations that are maintained outside of the object text. Format and tool: same as for object text documents.
  • Assistance Documents. To configure and adjust KBSET. Format: KBSET-specific, Prolog-readable. Tool: Emacs.
  • Application-Specific Fact Bases. E.g., persons, works, bibliography. Formats: Prolog, LaTeX markup, BibLaTeX. Tools: Emacs, JabRef.
  • Large Imported Fact Bases. E.g., GND, GeoNames, YAGO, DBpedia. Formats: e.g., RDF/XML, CSV.

KBSET Core System
  • Text Combination. Reordering object text fragments (e.g., letters by different writers in chronological order); merging with external annotations; merging with automatically generated annotations.
  • Named Entity Identification. Persons, locations, dates.
  • Consistency Checking. E.g., for void entity identifiers, insufficient or implausible date specifications, duplicate entries in fact bases.
  • Register Generation. Various indexes for print presentations; overview and navigation documents for Web presentation.

Outputs
  • Print-Oriented Presentation. Formats: LaTeX, PDF.
  • Web-Oriented Presentation. Format: HTML.
  • Display of Identified Entities. Tool: Emacs.

Figure 1: KBSET: Overview of inputs, core system functionality, and outputs

Figure 1 shows an overview of KBSET. The basic way to use the system is the standard LaTeX workflow, however with LaTeX commands restricted to elements for semantics-oriented markup according to the application domain: for example, correspondences consist of letters with a sender, a recipient, a date, mentioned persons, works, and locations, as well as scholarly comments that are associated with specific text positions in the letters. The user only has to operate a text editor and to know how to handle the markup elements, which directly reflect tasks of scholarly editing. LaTeX packages implement these markup commands, such that the standard LaTeX workflow immediately provides some validation and yields a formatted PDF document with hyperlinks, realizing support for phase (2) and also for phase (3) with respect to print editions. Expressing some fact bases on entities such as persons, works, locations, and events in LaTeX syntax is also supported, to allow the user to stay in this workflow as far as possible.
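To give an impression of this style of markup, the following sketch shows how a letter might be encoded. The command names are invented here for illustration only; the actual command set is defined by KBSET's LaTeX packages and is not reproduced in this description.

    % Hypothetical semantics-oriented markup for a letter in a correspondence.
    % Command and environment names are illustrative, not KBSET's actual ones.
    \begin{letter}{sulzer-to-bodmer-1750-03-12}
      \sender{person:sulzer}
      \recipient{person:bodmer}
      \placedate{place:berlin}{1750-03-12}

      Ich habe Ihren Brief durch Herrn \persref{person:gleim}{Gleim}
      erhalten ...

      \comment{On the circumstances of this letter, see ...}
    \end{letter}

From such markup, the standard LaTeX run can already check that referenced entity keys exist and produce a hyperlinked PDF, while the same commands carry the information needed later for registers and the HTML presentation.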

Advanced functionality, such as complex consistency validation, re-ordering of document fragments such as letters and commentaries, alignment with large external fact bases such as the GND, automated named entity identification, merging with annotations that are automatically generated or maintained in external documents, and conversion to other output formats like an HTML presentation, is implemented in Prolog and basically invoked through the Prolog interpreter, although this can be hidden behind shell scripts and behind a GNU Emacs interface for named entity identification.

Figure 2 shows a screenshot with the presentation of named entity identification results in Emacs. In the object text buffer the system highlights words or phrases that it assumes to denote a person, place, or date. In the lower buffer, additional information on the selected occurrence of Gleim is displayed, including a rationale for the entity identification and a listing of lower-ranked alternate candidate entities. Further aspects of named entity identification in KBSET are outlined below in Sect. 4.3.

Figure 2: Screenshot: Named entity identification with KBSET

Prolog syntax is used for so-called assistance documents, that is, configuration files in which external fact bases are specified and information is given to bias or override automated inferencing in named entity identification. The idea is that the user, instead of annotating identified entities manually, lets the system do it automatically and mainly gives hints in exceptional cases where the automatic method would otherwise not identify an entity correctly. That method was used in the example document supplied with KBSET. For the Sulzer/Bodmer correspondence, the primary method was more traditional manual annotation, motivated by the fact that the mentioned entities often need to be carefully commented anyway. Like Prolog program files, the assistance documents can be re-loaded, which effects updating of the associated settings.
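As a rough illustration, an assistance document might contain Prolog-readable directives like the following. The predicate names and the GND identifier are hypothetical and only indicate the kind of information such a document can supply: which fact bases to use, and hints that bias or override entity identification.

    %% Hypothetical assistance document (Prolog-readable); the predicate
    %% names are illustrative, not KBSET's actual configuration vocabulary.

    % Declare an external fact base to use as gazetteer.
    use_fact_base(gnd, '/data/gnd/persons.qlf').

    % Bias identification: the bare name "Gleim" in this text denotes a
    % specific person, given by a GND identifier (placeholder value here).
    prefer_entity(word('Gleim'), gnd_person('1234567-8')).

    % Override: never treat this word as a person name in this document.
    exclude_candidate(word('Lange'), type(person)).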

3 Roles of Prolog in KBSET

The implementation language of the KBSET core system is Prolog. Actually, Prolog, and in particular SWI-Prolog [14] with its extension packages for accessing modern formats like XML and RDF, is for KBSET not just a programming language but covers several essential requirements within a single system:

Representation Mechanism for Relational Fact Bases.

We basically use SWI-Prolog's standard indexing facilities. Some relations are supplemented with semantically redundant extracts whose standard indexing supports specific access patterns; we call these caches here.
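The following sketch illustrates the idea with hypothetical predicates: a base relation is accompanied by a semantically redundant extract whose argument order makes first-argument indexing effective for a different access pattern.

    %% Hypothetical sketch of a "cache": a redundant extract of a base
    %% relation, arranged so that standard indexing supports lookup by name.
    :- dynamic person_by_name/2.

    % Base relation, keyed by GND identifier (placeholder value here):
    %   person(GndId, LastName, FirstName, YearOfBirth).
    person('1234567-8', 'Gleim', 'Johann Wilhelm Ludwig', 1719).

    % Cache keyed by last name, for name-based lookup:
    %   person_by_name(LastName, GndId).
    make_person_by_name :-
        forall(person(Id, Last, _, _),
               assertz(person_by_name(Last, Id))).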

Query Language.

The standard predicates findall and setof provide powerful means to specify queries in a declarative manner. Complex tests and constructions can be smoothly incorporated, as query and programming language are identical, without much impedance mismatch. Problems of the interplay between different systems, like difficult debugging and communication overhead, are avoided. Of course, queries written in Prolog cannot rely on an optimizer and have to be designed "manually" such that their evaluation is efficient. A further useful feature of Prolog is sorting based on the standard order of terms. We used this to implement ranked answers, or top-k querying, which seems adequate for tasks such as searching for the entities that are most plausibly denoted by a given name.
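For example, over the hypothetical fact-base predicates sketched above, a query that collects all candidate persons for a last name who were born before the creation date of the edited text can be written directly as a Prolog goal; the predicate names are again only illustrative.

    %% Hypothetical query sketch: candidate persons for a last name,
    %% restricted to persons born before the text's creation year.
    candidates_for_name(LastName, TextYear, Candidates) :-
        findall(Id-Born,
                ( person_by_name(LastName, Id),
                  person(Id, LastName, _, Born),
                  Born < TextYear ),
                Candidates).

    % Example call: ?- candidates_for_name('Gleim', 1750, Cs).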

Representation Mechanism for Structured Documents.

As in Lisp, data structures in Prolog are by default terms that can be printed and read back, a feature that is often retrofitted to "non-AI" languages in the form of XML serialization. In our application context it is particularly useful, as it allows representing XML and HTML documents directly as Prolog data structures.
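With SWI-Prolog's library(sgml), for instance, an HTML or XML document is read as a term built from nested element(Tag, Attributes, Content) structures, which can then be constructed, inspected, and printed like any other Prolog data. The file name and URL below are placeholders.

    :- use_module(library(sgml)).

    % Read an HTML file into a Prolog term (a list of element/3 structures);
    % 'page.html' is a placeholder file name.
    show_dom :-
        load_html('page.html', DOM, []),
        portray_clause(DOM).

    % A small HTML fragment as a Prolog term, constructed directly:
    example_dom(element(p, [],
                        [ 'Letter to ',
                          element(a, [href='https://example.org/bodmer'],
                                  ['Bodmer']) ])).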

Parser for Semantic Web Formats.

SWI-Prolog comes with powerful interfaces to Semantic Web formats, of which we use in particular the XML parser and the RDF parser, which provides a call-back interface that allows processing in succession the triples represented in a large RDF document (the GND has about 160 million triples; the size of its RDF/XML representation is about 2 GB).
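A minimal sketch of this call-back style, assuming the process_rdf/3 interface of SWI-Prolog's RDF parser in library(rdf): the handler receives batches of rdf(Subject, Predicate, Object) triples and can, for example, assert only the triples relevant to the application, so that the full document never has to be held in memory at once. The file name and the selection criterion are placeholders.

    :- use_module(library(rdf)).   % assumed: SWI-Prolog's call-back RDF parser
    :- dynamic gnd_person_name/2.

    % Stream the triples of a large RDF/XML file ('gnd.rdf' is a placeholder).
    load_person_names :-
        process_rdf('gnd.rdf', on_description, []).

    % Called with the triples of one RDF description plus source information.
    on_description(Triples, _Source) :-
        forall(member(rdf(S, P, O), Triples),
               maybe_assert_person_fact(S, P, O)).

    % Hypothetical selection: keep only name triples; the actual criteria
    % depend on the GND schema and the parser's URI representation.
    maybe_assert_person_fact(S, P, O) :-
        sub_atom(P, _, _, 0, 'preferredNameForThePerson'),
        !,
        assertz(gnd_person_name(S, O)).
    maybe_assert_person_fact(_, _, _).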

Workflow Model.

Workflow aspects of experimental AI programming also seem useful in the Digital Humanities: loading and re-loading documents with formal specifications, as well as invoking functionality and running experiments through an interpreter. In AI as well as in DH, all of this should be manageable by the researcher herself instead of requiring further parties.

4 Main Components of the Prolog-Based Core System

The main components of the KBSET core system are a LaTeX parser, an approach to integrating large fact bases for efficient access, and a subsystem for named entity identification that makes use of such fact bases as gazetteers.

4.1 LaTeX Parser

The system includes a LaTeX parser written in Prolog that yields a list of items: terms whose argument is a sequence of characters represented as an atom and whose functor indicates a type such as word, punctuation, comment, command, or begin and end of an environment. A special type opaque is used to represent text fragments that are not further parsed, such as LaTeX preambles. LaTeX commands and environments can be made known to the parser to effect proper handling of their arguments. The parser aims to be practically useful without claiming completeness for LaTeX in full; it does not permit, for example, a single-letter command argument without enclosing braces. The parser is supplemented by conversions of the parsing result back to LaTeX and to plain text.
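To illustrate the kind of output, a parsed fragment might be represented by a list of items roughly like the following; the term shapes are simplified here for illustration and are not the system's exact internal format.

    %% Illustrative only: simplified item shapes a LaTeX parse might yield
    %% for the input "Brief an \persref{gleim}{Gleim}, 1750 % draft".
    example_items(
        [ word('Brief'),
          word(an),
          command(persref, [arg([word(gleim)]), arg([word('Gleim')])]),
          punctuation(','),
          word('1750'),
          comment(' draft')
        ]).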

4.2 Representation of Entities from External Knowledge Bases

KBSET incorporates large fact bases, which are typically available in Semantic Web formats, by converting them in a preprocessing phase to a set of caches, that is, Prolog relations with extracts adapted to the application scope (for example, retaining only data on persons born before the creation of the edited text) and to the access patterns required by queries (for example, accessing a person via last name or via a GND identifier). These caches can be stored in SWI-Prolog's quick-load format, which allows loading them typically in a few seconds when initializing the system with application data. Keeping the data in main memory then does not raise problems with fact bases such as the GND, which includes about 12 million fact triples on persons born before 1850. To access the relations, KBSET provides interfaces with predicates for entity types such as persons and locations.
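A minimal sketch of this preprocessing, with hypothetical predicate names: a cache relation is written out as a Prolog file, compiled once into SWI-Prolog's quick-load (.qlf) format with qcompile/1, and loaded at initialization time like an ordinary Prolog file.

    %% Sketch with hypothetical predicate names: write a cache relation to a
    %% Prolog file and compile it to SWI-Prolog's quick-load format.
    write_person_cache(File) :-
        setup_call_cleanup(
            open(File, write, Out),
            forall(person_by_name(Name, Id),
                   format(Out, '~q.~n', [person_by_name(Name, Id)])),
            close(Out)).

    build_cache :-
        write_person_cache('person_by_name.pl'),
        qcompile('person_by_name.pl').    % produces person_by_name.qlf

    % At initialization, the .qlf file loads quickly even for large relations:
    % ?- load_files('person_by_name.qlf', []).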

4.3 Named Entity Identifier (NEI)

KBSET includes a system for named entity identification, which detects dates by parsing, and persons and locations based on the GND and GeoNames as gazetteers, using additional knowledge from YAGO and DBpedia. In contrast to systems like the Stanford Named Entity Recognizer [3], the KBSET NEI does not just associate entity types such as person or location with phrases but attempts to actually identify the entities. The identification is based on single word occurrences with access to a context representation that includes the text before and after the respective occurrence. Hence an association of word occurrences with entities is computed, which is adequate for indexes of printed documents and for hypertext presentations, but not fully compatible with TEI, where the idea is to enclose a phrase that denotes an entity in markup.

The named entity identification is controlled by rules, which can be specified and configured and which determine the evaluation of syntactic features matched against the considered word, for example, is-no-stopword or is-no-common-substantive, and of semantic features matched against candidate entities, for example, is-in-wikipedia, is-linked-to-others-identified-in-context, has-an-occupation-mentioned-in-context, or date-of-birth-matches-context. Evaluation of these features is done with respect to the mentioned context representation, which includes general information like the date of text creation and inferred information such as the set of entities already identified near the evaluated text position. Features that are cheap to compute and strongly restrict the set of candidate entities are evaluated first. This allows, for example, named entity identification of persons on the demo book provided with the system, which involves several tens of thousands of queries against the underlying fact bases, to run in about 7 seconds on a modern notebook computer.
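The following sketch indicates this evaluation order with hypothetical predicates: cheap syntactic tests prune the word occurrence first, then gazetteer lookup produces candidates, and only the surviving candidates are subjected to the more expensive semantic features.

    %% Hypothetical sketch: evaluate cheap, strongly filtering features first.
    candidate_entities(Word, Context, Candidates) :-
        % Cheap syntactic features on the word itself:
        \+ stopword(Word),
        \+ common_substantive(Word),
        % Gazetteer lookup restricts the candidate set:
        findall(E, person_by_name(Word, E), Es),
        % Expensive semantic features only on the remaining candidates:
        include(plausible_in_context(Context), Es, Candidates).

    % All semantic feature predicates here are hypothetical placeholders.
    plausible_in_context(Context, Entity) :-
        ( linked_to_identified_entity(Entity, Context)
        ; occupation_mentioned(Entity, Context)
        ; date_of_birth_matches(Entity, Context)
        ).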

Feature evaluation results are mapped to Prolog terms whose standard order represents their plausibility ranking, realizing a form of top-k query evaluation. Information about the features that contributed to the selection of a candidate entity is preserved and can be presented to the user in the form of an explanation of why the system believes the entity to be a plausible candidate for being referenced by a word occurrence. The Emacs interface of KBSET allows browsing through these candidate solutions, displaying the explanations as well as hyperlinks to the GND, Wikipedia, and GeoHack, which may help to judge them (see Fig. 2 in Sect. 2). After adapting the assistance document accordingly and re-loading it, the system will produce more accurate results in the next run of named entity identification.
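A minimal sketch of this ranking idea, with hypothetical scoring: each candidate is paired with a key term whose standard order reflects plausibility, so that the best candidates come first after sorting, and the matched features are kept alongside as the explanation.

    %% Hypothetical sketch of top-k ranking via the standard order of terms.
    ranked_candidates(Candidates, Context, K, Ranked) :-
        findall(rank(NegScore, E, Features),
                ( member(E, Candidates),
                  matched_features(E, Context, Features),
                  length(Features, Score),
                  NegScore is -Score ),      % negate so best sorts first
                Keyed),
        msort(Keyed, Sorted),                % standard order of terms
        first_k(K, Sorted, Ranked).

    first_k(0, _, []) :- !.
    first_k(_, [], []) :- !.
    first_k(K, [X|Xs], [X|Ys]) :- K1 is K - 1, first_k(K1, Xs, Ys).

    % Hypothetical: collect the semantic features that hold for a candidate.
    matched_features(E, Context, Features) :-
        findall(F, ( semantic_feature(F), feature_holds(F, E, Context) ),
                Features).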

5 Conclusion

Digital scholarly editing involves the interplay of natural language text with formal code and with knowledge bases in ways that suggest various interesting possibilities related to computational logic in a long-term perspective:

There are parallels between digital scholarly editing and a classical AI scenario, where an agent in an environment makes decisions on actions to perform, which indicates a potential relevance of AI methods to scholarly editing: general background knowledge in the AI scenario corresponds to knowledge bases like GND and GeoNames; the position of the agent in the environment corresponds to a position in the text; the temporal order of events corresponds to the order of word occurrences; the environment, which is only incompletely sensed or understood by the agent, corresponds to incompletely understood natural language text; coming to decisions about actions to take corresponds to decisions about the denotations of text phrases and about the annotations to associate with text components.

A key requirement of a modern system to support scholarly editing is the interplay of knowledge that is inferred by automated and statistics-based techniques, which is inherently incomplete and not fully correct, with manually supplied knowledge. Non-monotonic reasoning should be applicable to provide a systematic logic-based approach to mediate between the two types of knowledge.

KBSET already supports abstract ways to specify positions in text that are used as targets of external annotations. It seems an interesting topic of further research to investigate this more systematically, also taking into account approaches to programming such as the composition of information in aspect-oriented programming (AOP) [5], where items relevant in scholarly editing roughly match concepts from AOP as follows: position in text – join point; set of positions – pointcut; specifier of a set of positions – pointcut designator; action to be performed at all positions in a set – advice; effecting execution of "advices" – weaving.

If queries are written in a suitable fragment of Prolog, they can be automatically optimized, abstracting from caring about indexes (relation caches), the order of subgoals, and the ways in which answer components are combined. Recent approaches to interpolation-based query reformulation might be applicable there [13, 1]. There, the optimized version of a query is extracted as a variant of a Craig interpolant from a proof obtained with a first-order prover. It also seems possible to apply this approach to determine, from a given set of queries, the caches that need to be constructed for their efficient evaluation.

For now, we have seen with KBSET an environment for digital scholarly editing that has proved to be economic and practically workable in serious edition projects. So far, the user from the Humanities applies KBSET mainly in a LaTeX workflow, while advanced functionality is implemented as free software in Prolog, which is used there successfully and efficiently in a variety of roles.

References

  • [1] Benedikt, M., Leblay, J., ten Cate, B., Tsamoura, E.: Generating Plans from Proofs: The Interpolation-based Approach to Query Reformulation. Morgan & Claypool (2016)
  • [2] Eide, O.: Ontologies, data modeling, and TEI. Journal of the Text Encoding Initiative 8 (2015)
  • [3] Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: ACL 2005. pp. 363–370. ACL (2005)
  • [4] Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence 194, 28–61 (2013)
  • [5] Kiczales, G., Lamping, J., Mendhekar, A., Maeda, C., Lopes, C.V., Loingtier, J.M., Irwin, J.: Aspect-oriented programming. In: ECOOP 1997. LNCS, vol. 1241. Springer (1997)
  • [6] Kittelmann, J., Wernhard, C.: Semantik, Web, Metadaten und digitale Edition: Grundlagen und Ziele der Erschließung neuer Quellen des Branitzer Pückler-Archivs. In: Krebs, I., et al. (eds.) Resonanzen. Pücklerforschung im Spannungsfeld zwischen Wissenschaft und Kunst, pp. 179–202. trafo Verlag (2013)
  • [7] Kittelmann, J., Wernhard, C.: Knowledge-based support for scholarly editing and text processing. In: DHd 2016. pp. 178–181. nisaba verlag (2016)
  • [8] Kittelmann, J., Wernhard, C.: Towards knowledge-based assistance for scholarly editing. In: AITP 2016 (Book of Abstracts). pp. 29–31 (2016)
  • [9] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia – A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167–195 (2015)
  • [10] Plachta, B.: Editionswissenschaft: Eine Einführung in Methode und Praxis der Edition neuerer Texte. Reclam (1997)
  • [11] Sulzer, J.G.: Gesammelte Schriften. Kommentierte Ausgabe. Adler, H., Décultot, E. (eds.). Schwabe (2014–2021)
  • [12] The TEI Consortium: TEI P5: Guidelines for Electronic Text Encoding and Interchange, Version 3.5.0. Text Encoding Initiative Consortium (2019), http://www.tei-c.org/Guidelines/P5/
  • [13] Toman, D., Weddell, G.: Fundamentals of Physical Design and Query Compilation. Morgan & Claypool (2011)
  • [14] Wielemaker, J., Schrijvers, T., Triska, M., Lager, T.: SWI-Prolog. Theory and Practice of Logic Programming 12(1-2), 67–96 (2012)