XML Document Trees
Markup as a tree grammar
XML (Extensible Markup Language, Definition 7.6.1, notes p. 130) is a
framework for defining markup formats. Informally, it is HTML with an
extensible tag set, and the conceptual model is the same: a tree of nested
elements.
Document trees (Definition 7.6.6, notes p. 132): an XML document is a tree
with three kinds of nodes:
1. Element nodes — labeled with a tag name; may have children.
2. Attribute nodes — extra metadata attached to an element.
3. Text nodes — the actual character data (leaves of the tree).
Serialization (Definition 7.6.7): the tree is encoded as a string with:
- Opening tag: <el>
- Closing tag: </el>
- Empty element: <el/>
- Attributes inside the opening tag: <el attr="value">
- Text content as raw characters, with the markup characters < and & escaped as &lt; and &amp;.
There must be a unique document root — the outermost element.
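As an illustration of the tree model and its serialization, the following sketch uses Python's standard xml.etree.ElementTree module to parse a small document, walk its nodes, and serialize it again. The sample document and the helper name describe are made up for this example. Note that ElementTree stores attributes as a dictionary on each element and text in the .text field rather than as separate node objects, so it only approximates the three node kinds listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the notes.
DOC = '<lecture title="SMAI"><slide id="s1">Trees</slide><slide id="s2"/></lecture>'

def describe(elem, depth=0):
    """Return one line per element: tag, attributes, and text content."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + f"{elem.tag} {elem.attrib} {text!r}"]
    for child in elem:  # iterating an Element yields its child elements
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DOC)  # parse the serialized string into a tree
outline = describe(root)
print("\n".join(outline))

# Serialize the tree back into a string.
serialized = ET.tostring(root, encoding="unicode")
```

Parsing the serialized string again should yield an equivalent tree, which is the sense in which the string is just an encoding of the tree.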
Well-formed vs valid:
- A document is well-formed if it parses under the XML grammar
(Definition 7.6.9, notes p. 134 — 81 productions for the character-level syntax).
- A document is valid (Definition 7.6.11) if it is well-formed AND its
parse tree is accepted by a schema written in a schema language.
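The well-formedness half of this distinction can be tested with any XML parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree (a non-validating parser, so it checks well-formedness only); the function name is_well_formed is invented for this example:

```python
import xml.etree.ElementTree as ET

def is_well_formed(source: str) -> bool:
    """True if the string parses as XML; says nothing about validity."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))     # properly nested
print(is_well_formed("<a><b></a></b>"))  # overlapping tags
print(is_well_formed("<a/><b/>"))        # two root elements
```

Checking validity additionally requires a schema and a validator for the chosen schema language, which the Python standard library does not provide.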
Schema languages (Definition 7.6.10): tree grammars that specify which
element structures are allowed:
- DTD (Definition 7.6.12): Document Type Definition; built into the XML
specification but largely superseded, since it has a non-XML syntax and no
support for namespaces.
- XSD (Definition 7.6.13): XML Schema Definition, based on regular tree
grammars; XSDs are themselves XML documents.
- RELAX NG (Definition 7.6.14): Regular Language for XML Next Generation;
has both an XML syntax and a compact non-XML syntax.
XHTML (Definition 7.6.5) is the XML-compliant version of HTML — same tree
model, stricter syntax.
XML's contribution to symbolic AI: a standardized way to transfer trees instead
of strings between systems.
Real XML documents combine multiple vocabularies (e.g. PDF metadata that mixes Dublin Core for bibliographic data with RDF triples). Namespaces disambiguate element names:
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
  <rdf:Description rdf:about="">
    <dc:title>SMAI</dc:title>
    <dc:creator>Kohlhase</dc:creator>
    <pdf:CreationDate>2026-01-29</pdf:CreationDate>
  </rdf:Description>
</rdf:RDF>
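Here rdf:, dc:, and pdf: are namespace prefixes. Each xmlns:prefix attribute binds a prefix to a namespace URI, which serves purely as a globally unique name for a vocabulary; it need not resolve to a schema. Namespace-aware schema languages (XSD, RELAX NG) can then constrain each vocabulary independently; DTDs have no notion of namespaces and treat dc:title as an ordinary name that happens to contain a colon.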
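To see that only the namespace URI matters and not the prefix, the sketch below parses the example with Python's standard xml.etree.ElementTree, which replaces each prefix with its URI in braces. The lookup prefixes r and d are deliberately different from those in the document and are an arbitrary choice for this example.

```python
import xml.etree.ElementTree as ET

DOC = """\
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
  <rdf:Description rdf:about="">
    <dc:title>SMAI</dc:title>
    <dc:creator>Kohlhase</dc:creator>
    <pdf:CreationDate>2026-01-29</pdf:CreationDate>
  </rdf:Description>
</rdf:RDF>
"""

root = ET.fromstring(DOC)
print(root.tag)  # the prefix is replaced by the namespace URI in braces

# Prefixes in this table are local to the lookup; only the URIs must match.
NS = {
    "r": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "d": "http://purl.org/dc/elements/1.1/",
}
title = root.find("r:Description/d:title", NS)
print(title.text)
```

The lookup succeeds even though the document wrote rdf: and dc: where the query writes r: and d:, because both resolve to the same URIs.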
RELAX NG comes in two flavours: a compact non-XML syntax and a full XML syntax using <grammar>, <start>, <element>, <define>.
Compact syntax (what we saw before):
start = element lecture { slide+ }
slide = element slide {
  attribute id { text },
  text
}
The same schema in full XML syntax (with the slide pattern inlined rather than given its own <define>):
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="lecture">
      <oneOrMore>
        <element name="slide">
          <attribute name="id"><text/></attribute>
          <text/>
        </element>
      </oneOrMore>
    </element>
  </start>
</grammar>
The compact syntax is preferred for human authoring; the XML syntax is preferred for tools that emit XML.
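To make concrete what this schema accepts, here is a hand-coded check that mirrors it rule by rule. This is only a sketch under stated assumptions: it is not a RELAX NG validator, the function name is_valid_lecture is invented for this example, and a real setup would hand the schema to a RELAX NG-aware library instead of rewriting it in code.

```python
import xml.etree.ElementTree as ET

def is_valid_lecture(source: str) -> bool:
    """Hand-coded check for: lecture { slide+ }, slide { attribute id, text }."""
    try:
        root = ET.fromstring(source)  # must be well-formed first
    except ET.ParseError:
        return False
    if root.tag != "lecture" or root.attrib or len(root) == 0:
        return False                  # slide+ needs at least one child
    if (root.text or "").strip():
        return False                  # no non-whitespace text inside lecture
    for slide in root:
        if slide.tag != "slide" or set(slide.attrib) != {"id"}:
            return False              # exactly the id attribute
        if len(slide) != 0 or (slide.tail or "").strip():
            return False              # text only, no child elements
    return True

print(is_valid_lecture('<lecture><slide id="a">x</slide></lecture>'))
print(is_valid_lecture('<lecture><slide>x</slide></lecture>'))
```

The second call is well-formed but not valid, because the slide lacks the required id attribute.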
An attribute such as xmlns:dc="..." introduces the prefix dc:, which must then appear in both the opening tag <dc:el> and its closing tag </dc:el>.