In the context of the authors’ metatools compiler construction toolkit, XML plays a fundamental role in encoding data structures as well as text documents. Related standardized languages are XPath for querying XML objects and XSL-T for transforming them. Constraints on the valid syntactic structure of classes of documents are defined by “Document Type Definitions” (DTDs). It turned out that all these components had to be implemented from scratch to allow modular and compositional usage and satisfactory error diagnosis.
Given a fixed DTD and a fixed XPath expression, the two most important questions in practice are the satisfiability of the expression when applied to a node of a certain kind, and the general type of the resulting nodes. Given an XSL-T program and two DTDs, the question is whether the program is correct, i.e. whether the result will always adhere to the second DTD if the input adheres to the first.
On the theoretical side, much attention has been given to these and related questions as exact problems: in general they turn out to be undecidable. A hierarchy of syntactic restrictions of the two languages involved gives decidable subsets with complexity ranging from P to NEXPTIME; see the exhaustive discussions in [4, 5].
Here we propose the opposite approach, namely to give approximate solutions, in the tradition of type systems, for the full expressiveness of DTD, XPath 1.0 and XSL-T 1.0. Semantically problematic constructs are supported with trivial approximations for graceful degradation of the analysis, rather than rejected categorically. Thus we give a light-weight pragmatic solution which answers the practically most relevant, structural questions.
2 Abstract Interpretation of DTD and XPath
Our proposed analysis of a particular combination of DTD and XPath has the form of a simple, easily understood and just as easily implemented algorithm, which calculates an upper bound of possible results by abstract interpretation based on a relation algebra. The formal presentation given here corresponds almost literally to our concrete implementation. It uses a Java library of declarative finite set-based relational operations that is part of our metatools toolkit.
The chief purpose and obvious interpretation of an XPath expression is to select a subset of the nodes of a concrete XML document. At a higher level of abstraction, the same XPath expression can be interpreted as selecting a subset of the nodes of all XML documents conforming to a fixed DTD. By partitioning such an infinite set of potential nodes into the finite set of node types in the sense of [2, Sect. 5], an abstract interpretation of the XPath language can be given that assigns an upper bound of selectable node types to each XPath expression. This interpretation relates the types of context nodes to the types of nodes potentially selected by the expression from that context.
All of the above general questions can be addressed thus in a uniform way; for instance, the empty set of selected node types implies unsatisfiability of the interpreted expression. XML toolkits can leverage the node type information for the development, analysis, optimization, maintenance and quality assurance of XML processing applications.
Abstract interpretation of the XPath language is far simpler than for the average programming language because of the absence of recursive expressions. No fixpoint computations (cf. [6]) are needed; our interpretation is purely syntax-directed and bottom-up.
2.1 Node Types and XPath Axes
For one fixed DTD we introduce a finite set of node types, which partition the nodes of all documents conforming to that DTD. For each node type, there is a primitive, characteristic XPath expression that selects nodes of this type, but excludes all others. All relations under consideration are binary relations on . Table 1 shows the four generic node types that apply to all DTDs.
| XPath Node Type | Characteristic Expression | Type Symbol |
Besides the four generic node types, each DTD declares a finite number of element and attribute names, each giving rise to a node type. The characteristic XPath expressions for an element named and an attribute named are child:: and attribute::. These can be abbreviated to and @, respectively, which we shall also use to denote the corresponding node type. We write and for the corresponding sets of types, hence .
In XPath, navigation within a document is accomplished by composing steps classified by so-called axes. The specification text defines thirteen axes, which can be reduced to a basis of three (see Table 2): and model the nesting and local order of element nodes, respectively, whereas models the placement of attributes. All three primitive axes are specific to a fixed DTD. The following paragraphs consider them in turn.
| XPath Axis | Relation | XPath Axis | Relation |
2.1.1 Child Axis
The relation is the smallest relation that meets the following requirements:
The root type is related to the types of all elements admissible as document (outermost) elements, and to the comment and processing instruction node type. Since the DTD formalism cannot express constraints on the admissible elements, generally .
Informal normative constraints (such as XHTML 1.0 allowing only html as the root element) can be imposed to improve the analysis.
Any element type is related to the comment and processing instruction node types, so .
The type of any element declared by the DTD with mixed content is related to the text node type. Let be the types of all elements declared in the form <!ELEMENT (#PCDATA | )*>. Then .
The type of any element declared by the DTD is related to the types of elements occurring in its content declaration, whether of mixed or element content (regular) type. Note that the content specifier ANY is not supported, since it defies closed-world static analysis by definition.
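The four rules above can be sketched in Python as follows. The type symbols `ROOT`, `TEXT`, `COMMENT` and `PI`, the toy DTD encoding and the function name `child_relation` are assumptions made for illustration only; they are not taken from the metatools library, which is written in Java.

```python
# Hypothetical symbols for the four generic node types.
ROOT, TEXT, COMMENT, PI = "root", "text", "comment", "pi"

# Toy DTD (illustration only): element name -> (has mixed content, child element names).
TOY_DTD = {
    "doc":   (False, {"sec"}),
    "sec":   (False, {"title", "para"}),
    "title": (True,  set()),
    "para":  (True,  {"em"}),
    "em":    (True,  set()),
}

def child_relation(dtd, root_elements=None):
    """Build the child relation as a set of (parent type, child type) pairs."""
    elements = set(dtd)
    # A DTD alone cannot restrict the document element, so by default
    # every declared element is admissible below the root.
    roots = elements if root_elements is None else set(root_elements)
    rel = {(ROOT, t) for t in roots | {COMMENT, PI}}
    for elem, (mixed, kids) in dtd.items():
        rel.add((elem, COMMENT))          # comments may occur in any element
        rel.add((elem, PI))               # so may processing instructions
        if mixed:
            rel.add((elem, TEXT))         # text only in mixed content
        rel |= {(elem, kid) for kid in kids}
    return rel
```

Passing `root_elements={"doc"}` corresponds to imposing an informal normative constraint on the document element, which removes pairs such as `(ROOT, "em")` from the relation.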
2.1.2 Attribute Axis
The relation is the smallest relation that relates the type assigned to each element declared by the DTD to the types assigned to the attributes declared for that element by the DTD. That is, for every pair of matching declarations
|<!ELEMENT >||<!ATTLIST >|
that declare an attribute of element with value type and default value , there is a pair .
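As a minimal sketch, the attribute relation can be read off the attribute list declarations directly. The dictionary encoding and the name `attribute_relation` are hypothetical; attribute types are written with the `@` prefix, following the abbreviation introduced above. Value types and default values are ignored here because they do not influence the relation.

```python
# Toy attribute declarations (illustration only): element name -> declared attribute names.
TOY_ATTLISTS = {
    "a":   {"href", "id"},
    "img": {"src"},
    "br":  set(),
}

def attribute_relation(attlists):
    """Relate each element type to the types of its declared attributes."""
    return {(elem, "@" + name)
            for elem, names in attlists.items()
            for name in names}
```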
2.1.3 Following-Sibling Axis
The following-sibling axis is by far the most complicated of the primitive XPath axes. The following paragraphs construct a relational interpretation in four conceptual steps:
Define an even more primitive relation (next-proper-sibling) per element by induction on the declared content model.
Define a relation per element that includes both the transitive closure of and potentially intervening non-element node types.
Demonstrate that it is infeasible to handle siblings on a per-element basis.
Conclude by taking the union of all local relations as a reasonable approximate interpretation.
2.1.4 Local Next-Proper-Sibling Relation
We construct a relation per element with the following meaning: iff an element-or-text node of type may be followed immediately by an element-or-text node of type within the content of an element of type .
For mixed content, this construction is trivial: For an element declared in the form <!ELEMENT (#PCDATA | | | )*>, take the set of node types and let , since the only combination of consecutive children forbidden by the XPath data model is a pair of text nodes.
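The mixed-content case amounts to a full Cartesian product minus one pair. A minimal sketch, with the hypothetical symbol `TEXT` for the text node type and a hypothetical function name:

```python
TEXT = "text"  # hypothetical symbol for the text node type

def mixed_next_proper_sibling(child_elements):
    """All pairs of admissible children, except two consecutive text nodes."""
    types = {TEXT} | set(child_elements)
    return {(a, b) for a in types for b in types} - {(TEXT, TEXT)}
```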
For so-called element content, the construction is more complicated. Consider the following mathematical interpretation of DTD content models, as implied in the XML specification: Each content model denotes a content language (set of finite sequences of element types), by the usual interpretation of regular expressions; see Table 3. For compositionality we add the empty specification (), which had been forgotten in the XML specification. As usual, denotes the empty word. Confer also DTD normalization in .
We say that a content language ends with iff is the smallest set of node types such that all sequences end with some .
We calculate the next-sibling relation inductively over the meta-language . To this end, we shall define an auxiliary function , such that for the following invariants hold:
The relation contains precisely the pairs such that, for all content languages ending with , occurs as a contiguous subsequence in . (Since the language of singletons from is a language ending with , at least must stem from .)
For all content languages ending with , ends with .
Note that the two invariants together fix the result of , which depends on only via : We define two equivalence relations as iff (modulo language), and if and only if for all (modulo abstract interpretation), respectively. Then we have the lemma , which will be used below.
Intuitively models the operation of a DTD content model on a set of prefixes, namely updating the set of possible final elements, and extending the set of possible neighbourships. For an element declared as <!ELEMENT >, we then define iff for some .
Table 4 shows the induction rules that define and induce a syntax-directed, deterministic algorithm. To verify that these rules are adequate and complete, we systematically check the invariants given above for all syntactic cases:
The single-element case is trivial.
The nullary sequence case () is equally trivial. Together with the choice rule it defines the rules for the content model iterators ? and *.
The binary cases (, ) and ( | ) formalize the notion of sequence and choice, respectively. Ternary and higher cases follow uniquely because of associativity modulo language.
The most interesting case is the iterator + . It holds asymptotically that
It is easy to show inductively that and hence, by associativity, all sequences of two or more s are equivalent. We conclude
which justifies the rule.
Note that the given rule invokes the body twice, and that these formulas directly represent the operation of our algorithm. Hence nested + and * iterators may cause exponential worst-case complexity. In our experience, practical content models are not nested deeply enough to cause trouble in this regard. The XHTML DTD, for instance, does not contain any irreducible nested iterators at all.
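The following Python sketch shows one way to realize such a syntax-directed function. It is a reconstruction from the two invariants and the case discussion above, not a transcription of Table 4: the tuple encoding of content models and the names `step` and `next_proper_sibling` are assumptions. The state is a pair of the neighbour pairs collected so far and the set of types that may come last.

```python
# Hypothetical encoding of content models as nested tuples:
#   ("elem", name)  ("empty",)  ("seq", m1, m2)  ("alt", m1, m2)
#   ("opt", m)      ("star", m) ("plus", m)

def step(model, state):
    """Update (pairs, last) for prefixes extended by one content model."""
    pairs, last = state
    kind = model[0]
    if kind == "elem":
        name = model[1]
        # Every possible last element may be followed directly by `name`.
        return pairs | {(a, name) for a in last}, frozenset({name})
    if kind == "empty":
        return pairs, last
    if kind == "seq":
        return step(model[2], step(model[1], state))
    if kind == "alt":
        p1, l1 = step(model[1], state)
        p2, l2 = step(model[2], state)
        return p1 | p2, l1 | l2
    if kind == "opt":                      # m?  is  (m | ())
        return step(("alt", model[1], ("empty",)), state)
    if kind == "plus":                     # m+  behaves like  (m | (m, m))
        once = step(model[1], state)
        twice = step(model[1], once)       # the body is invoked a second time
        return once[0] | twice[0], once[1] | twice[1]
    if kind == "star":                     # m*  is  (m+)?
        return step(("opt", ("plus", model[1])), state)
    raise ValueError(f"unknown content model: {model!r}")

def next_proper_sibling(model):
    """Pairs (a, b) such that b may directly follow a in the content."""
    pairs, _ = step(model, (frozenset(), frozenset()))
    return set(pairs)

# Content model (a, b+, c?)
EXAMPLE = ("seq", ("elem", "a"),
           ("seq", ("plus", ("elem", "b")), ("opt", ("elem", "c"))))
```

For `EXAMPLE`, the pair `("a", "c")` is not produced because at least one `b` must intervene. Each level of `plus` calls `step` on its body twice, which is where the exponential worst case for nested iterators comes from.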
2.1.5 Local Following-Sibling Relation
The following-sibling relation can be derived from the next-proper-sibling relation by noting that additionally, transitivity is required, and comments and processing instructions may intervene arbitrarily. Hence we define , where is the set of element node types occurring in and . The analogous case for the children of the root node is . The operator denotes symmetric closure of a relation: .
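One plausible reading of this construction is sketched below; the exact treatment of comments and processing instructions in the formula above may differ in detail. The sketch first takes the transitive closure of the next-proper-sibling relation and only then adds the comment and processing-instruction types in both directions, so that they do not introduce spurious orderings between elements. The symbols `COMMENT` and `PI`, the explicit `child_types` parameter and the function names are assumptions.

```python
COMMENT, PI = "comment", "pi"  # hypothetical type symbols

def transitive_closure(rel):
    """Smallest transitive relation containing rel (naive iteration)."""
    closure = set(rel)
    while True:
        new = {(a, d)
               for (a, b) in closure
               for (c, d) in closure
               if b == c} - closure
        if not new:
            return closure
        closure |= new

def local_following_sibling(nps, child_types):
    """Following siblings within one element type, given its next-proper-sibling pairs."""
    misc = {COMMENT, PI}
    kids = set(child_types)
    result = transitive_closure(nps)
    result |= {(k, m) for k in kids for m in misc}   # a comment or PI may follow any child
    result |= {(m, k) for m in misc for k in kids}   # and any child may follow it
    result |= {(m, n) for m in misc for n in misc}
    return result
```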
2.1.6 Per-Element Following Siblings
Consider for a moment the impact of handling the following-sibling relation separately for each element type : The type of a node is not enough to infer its following siblings. The type of its parent is required as well and, because of the ancestor axis, so are the types of all of its ancestors. Hence we are forced to lift our interpretation from context-free relations on node types to context-sensitive relations on node type paths. On the upside, this can be done in a mathematically straightforward way. For every context-free primitive relation there is a corresponding context-sensitive one
where is any (possibly empty) sequence of ancestor node types. On the downside, this lifting ruins the simple finite representation of relations: since can take on infinitely many values, the structure of composite relations becomes quite complicated; for instance consider
Though finding an effective representation for this kind of relation may be an interesting challenge, we leave it as an open problem for now.
2.1.7 Unified Following-Sibling Relation
Having conceded that there is no obvious solution to the context problem for the following-sibling relations, we simply define
thereby abstracting from the possibility of an element having significantly different potential siblings in different contexts. It remains to be established empirically how much information is lost in this way. Note that XPath subexpressions which do not use the horizontal axes are not affected.
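The loss of precision can be illustrated with a small sketch. The element names and the dictionary encoding (parent element type mapped to its local following-sibling pairs) are invented for this example.

```python
def unified_following_sibling(local_relations):
    """Union of the local relations of all element types."""
    unified = set()
    for rel in local_relations.values():
        unified |= rel
    return unified

# Hypothetical local relations: a title is followed by paragraphs inside a
# section, but by metadata inside a head.
LOCAL = {
    "sec":  {("title", "para"), ("para", "para")},
    "head": {("title", "meta")},
}
UNIFIED = unified_following_sibling(LOCAL)
```

In `UNIFIED`, a `title` may be followed by both `para` and `meta`, regardless of whether its parent is `sec` or `head`. The result is still a valid upper bound, but the per-parent distinction is gone.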
2.2 Abstract Interpretation of XPath Expressions
Table 5 shows the abstract interpretation, which is a partial function from XPath expressions to node type relations, specified as a relation . The details are explained in the following subsections.
2.3 Location Paths
An XPath location path expression takes one of three forms: absolute (/::), relative (::) or recursive (/::); following filter predicates will be considered below. Interpretation is defined as follows:
Compute a base relation : In the absolute case, the context is ignored and replaced by the document root; set . In the relative case, set . In the recursive case, recursively compute the relation assigned to . If undefined, the interpretation of the whole expression is undefined.
Otherwise, assign a relation to the axis (see Sect. 2.1) and a relation to the test (see below).
Assign the relation to the whole expression. In the relative case, this simplifies to .
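The three steps can be sketched with relations represented as sets of pairs. Modelling a node test as a partial identity on the accepted types is an assumption of this sketch, as are all names and the toy child relation; undefined interpretations are represented by `None`.

```python
ROOT = "root"  # hypothetical symbol for the root node type

def compose(r, s):
    """Relational composition: first r, then s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def identity(types):
    return {(t, t) for t in types}

def absolute_base(all_types):
    """Absolute paths ignore the context and start at the document root."""
    return {(t, ROOT) for t in all_types}

def interpret_step(base, axis_rel, test_types):
    """Base relation, then axis, then node test; None means undefined."""
    if base is None:
        return None
    return compose(compose(base, axis_rel), identity(test_types))

# Toy data (illustration only).
TYPES = {ROOT, "html", "body", "p"}
CHILD = {(ROOT, "html"), ("html", "body"), ("body", "p")}

REL_BODY = interpret_step(identity(TYPES), CHILD, {"body"})      # child::body
ABS_HTML = interpret_step(absolute_base(TYPES), CHILD, {"html"}) # /child::html
ABS_BODY = interpret_step(ABS_HTML, CHILD, {"body"})             # /child::html/child::body
```

In the relative case the base relation is the identity, so the composition reduces to the axis relation followed by the node test, as `REL_BODY` shows.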
Table 6 shows the relations assigned to generic node tests. Name tests are mapped to relations as follows:
An explicit name test maps to the relation if the principal node type for the current axis is element, or to if the principal node type is attribute. The latter applies to the attribute axis only, the former to all other axes except the unsupported namespace axis.
A wildcard name test * maps to the relation or , if the principal node type for the current axis is element or attribute, respectively.
| XPath Node Test | Relation | XPath Node Test | Relation |
2.4 Union Expressions
An XPath union expression has the form |. Interpretation is defined as follows:
Assign a relation to the left argument . Likewise, assign a relation to the right argument . If either is undefined, the interpretation of the whole expression is undefined.
Otherwise, assign the relation to the whole expression.
2.5 Filter Expressions
An XPath filter expression has the form . Interpretation is defined as follows:
Assign a relation to the base expression . If undefined, the interpretation of the whole expression is undefined.
Otherwise, assign a relation to the filter predicate (see below). If undefined, assign the relation to the whole expression. This cop-out interpretation is safe as an upper bound because a filter can at most remove node types from the base node set.
Otherwise, assign the relation to the whole expression. This interpretation is safe as an upper bound because a filter predicate that evaluates to a node set in the context of a node by definition selects iff is nonempty. This in turn implies that the type of is related by to the types of the members of , hence .
Filter predicates are mapped to relations as follows:
The default is the ordinary abstract interpretation of , if defined.
Logical operators are treated specially:
A predicate of the form and , where the relations and are assigned recursively to and , respectively, is mapped to the relation .
A predicate of the form or , where the relations and are assigned recursively to and , respectively, is mapped to the relation .
Note that this treatment is not strictly necessary, because if both and evaluate to node sets in the context selected by , then the expressions [ and ] and [ or ] are equivalent to  and [|], respectively.
For all other filter predicates, the abstract interpretation is undefined, and hence does not impose any restriction on the relation assigned to the base expression. Note that filter predicates of the form not() are explicitly not covered, as the anti-monotonicity of negation would break the upper bound property of the abstract interpretation.
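A minimal sketch of union and filter interpretation follows, with `None` standing for an undefined interpretation. The function names and the toy relations are hypothetical. The logical operators are handled through the equivalences noted above (repeated filtering for `and`, union for `or`) rather than through dedicated predicate relations.

```python
def union_expr(r1, r2):
    """Union of two interpretations; undefined if either one is."""
    if r1 is None or r2 is None:
        return None
    return r1 | r2

def filter_expr(base, pred):
    """Keep base pairs whose target type may satisfy the predicate."""
    if base is None:
        return None
    if pred is None:
        return set(base)               # safe fallback: no restriction at all
    satisfiable = {a for (a, _) in pred}   # types from which pred can select something
    return {(a, b) for (a, b) in base if b in satisfiable}

def filter_and(base, p, q):            # e[p and q] treated like e[p][q]
    return filter_expr(filter_expr(base, p), q)

def filter_or(base, p, q):             # e[p or q] treated like e[p|q]
    return filter_expr(base, union_expr(p, q))

# Toy relations (illustration only).
STAR = {("body", "div"), ("body", "p"), ("div", "ol"), ("ol", "li"), ("li", "ol")}
OL = {("div", "ol"), ("li", "ol")}     # child::ol
P = {("body", "p")}                    # child::p
```

With these toy relations, `filter_expr(STAR, OL)` keeps only the pairs ending in `div` or `li`, the two types that can have an `ol` child.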
2.6 Other Expressions
The XPath function id selects elements from the whole context document, regardless of the particular context node, by the value of their identity attribute. A function call expression of the form id() is mapped to the relation , where is the set of all element node types in the DTD that have a declared attribute of value type ID.
Otherwise, our abstract interpretation is undefined for all XPath expressions not covered above. Similar restrictions apply to other semantic models of XPath such as [7] as well, and do not hinder analysis of document structure unduly.
2.7 Semantic Properties and Their Applications
Consider some fixed XPath expression and DTD. If the abstract interpretation assigns a relation on node types to , then the following properties should hold:
For each valid document with respect to the DTD and each context node within the document, either selects a node set from the document or fails to evaluate, but does not evaluate to a string, number, truth value or other data (well typing).
Let be a nonempty set of node types and the image of under the relation .
For absolute , does not depend on .
For both absolute and relative , each node in the set selected by starting from a context node of some type has a type in (completeness).
As a corollary of completeness, if is empty then is unsatisfiable: the expression selects only the empty node set from any valid document and any context node of type . This has important practical implications:
For the root of an XPath expression, it is most likely an error and should be reported to the user of the XML processing tool.
For any expression fragment, it indicates optimization potential in the XPath implementation: The most frequently-used axes are the child and attribute axes, as witnessed by their special abbreviated syntax. Node selection along these axes is usually implemented by recursive tree traversal, for instance using the visitor style pattern. The current state of the traversal is specified by a relative XPath subexpression. Whenever the type of the root node of a subtree is not in the image of the relation associated with the governing subexpression, traversal of the whole subtree can be pruned safely.
Note that there is no soundness property dual to completeness: One might expect that for each type , there is a valid document and context node such that selects a node of type . But since our interpretation is an approximate upper bound, this is not the case in general.
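Both applications reduce to computing the image of a set of context types under the assigned relation. A minimal sketch with hypothetical names and toy data:

```python
def image(rel, context_types):
    """Types reachable from the given context types under rel."""
    return {b for (a, b) in rel if a in context_types}

def is_unsatisfiable(rel, context_types):
    """An empty image means no node can ever be selected."""
    return not image(rel, context_types)

def may_prune(subtree_root_type, rel, context_types):
    """A subtree whose root type is outside the image need not be traversed."""
    return subtree_root_type not in image(rel, context_types)

# Toy relation assigned to the relative expression child::ol (illustration only).
REL_OL = {("li", "ol"), ("div", "ol")}
```

Since the relation is only an upper bound, `is_unsatisfiable` returning `False` does not guarantee that a matching document exists; only the `True` answer is conclusive.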
2.8 Usage of the Command Line Tool
The following inputs to the command line tool show typical questions about a particular DTD, here XHTML 1.0:
make test XPATH="p/ol"
(“Can a p element ever directly contain an ol element?” – “No!”)
make test XPATH="p//ol"
(“Can a p element indirectly contain an ol element?” – “Yes! And the p element itself is contained in a form or an ins or a map …”)
make test XPATH="self::p//*[ol]"
(“Which element under that p element can directly contain that ol?”)
3 XSL-T and Fragmented Validation
When trying to apply the standard open source XSL-T implementation “Xalan” [9], it soon turned out that its error diagnosis is too poor for efficient programming work. So we decided to implement our own XSLT 1.0 processor. It is based on the “tdom” Typed Document Model, which generates a collection of Java classes from a DTD, for serialization, deserialization, construction and inquiry in a strictly typed fashion.
In this setting, an XSL-T program is a collection of trees of two different “colors”, namely the pure XSL-T code, and the sub-trees from the result language, which are interspersed in the code, and which will be combined later, when the code is applied to some input, to construct the output document. In our implementation, the connections between the leaves of a tree of one color and the root of a tree of the other color are realized non-invasively, by an adjoined map, because tdom, being strictly typed, cannot by itself express connections of this mixed nature. Then one merely has to construct a visitor class which respects this map and calls the other two visitors (generated by tdom) accordingly, to gain all the comfort of tdom declarative programming.
It soon turned out that already when parsing the transformation source, i.e. when constructing the internal model of the code, the target fragments can easily be validated against the target DTD. Two kinds of non-determinism come into play:
First: Whenever a fragment starts with a reference to a target DTD element, all positions in all target content models must be considered, because the later context is not known.
Second: Whenever XSL-T code is interspersed into a target DTD fragment, the transitive closure of the sibling relation must be taken for all valid transitions, because the XSL-T code may produce anywhere from none to all of the elements still missing to complete the current content model.
This technique is explained in detail in [11]. It is easily implemented when the parsing process is again based on relations. That this technique comes from the preceding XPath research can be seen clearly when comparing Table 1 there with Table 4 above. It turned out to be very efficient in comparative tests and very helpful in practice.
- [1] Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E., Yergeau, F., Cowan, J.: Extensible Markup Language (XML) 1.1 (Second Edition). W3C, http://www.w3.org/TR/2006/REC-xml11-20060816/ (2006)
- [2] Clark, J., DeRose, S.: XML Path Language (XPath) Version 1.0. W3C, http://www.w3.org/TR/1999/REC-xpath-19991116/ (1999)
- [3] W3C: XSL Transformations (XSLT) Version 1.0. http://www.w3.org/TR/1999/REC-xslt-19991116 (1999)
- [4] Benedikt, M., Fan, W., Geerts, F.: XPath satisfiability in the presence of DTDs. J. ACM 55(2) (2008) 8:1–8:79
- [5] Genevès, P., Layaïda, N., Schmitt, A.: Efficient static analysis of XML paths and types. SIGPLAN Not. 42(6) (2007) 342–351
- [6] Cousot, P., Cousot, R.: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Proceedings 4th POPL, ACM (1977) 238–252
- [7] Marx, M., de Rijke, M.: Semantic characterizations of navigational XPath. SIGMOD Rec. 34(2) (2005) 41–46
- [8] Pemberton, S., et al.: XHTML 1.0 The Extensible HyperText Markup Language (Second Edition). W3C, http://www.w3.org/TR/2002/REC-xhtml1-20020801/ (2002)
- [9] Apache Foundation: Xalan Official Site. xalan.apache.org (2000-2011)
- [10] Trancón y Widemann, B., Lepper, M., Wieland, J.: Automatic construction of XML-based tools seen as meta-programming. Automated Software Engineering 10(1) (2003) 23–38
- [11] Lepper, M., Trancón y Widemann, B.: A simple and efficient step towards type-correct XSLT transformations. In: Proceedings 26th International Conference on Rewriting Techniques and Applications (RTA 2015). LIPIcs, Dagstuhl Publishing (2015) In press.