Chapter 7: Formal Languages (notes.en.pdf, pp. 130-135)

XML Document Trees

Markup as a tree grammar

Def 7.6.1 XML, Def 7.6.5 XHTML, Def 7.6.11 Valid XML
Concept

XML (Extensible Markup Language, Definition 7.6.1, notes p. 130) is a
framework for defining markup formats. It can be thought of as HTML with a
user-extensible set of tags, and the conceptual model is the same: a tree of
nested elements.

Document trees (Definition 7.6.6, notes p. 132): an XML document is a tree
with three kinds of nodes:
1. Element nodes — labeled with a tag name; may have children.
2. Attribute nodes — extra metadata attached to an element.
3. Text nodes — the actual character data (leaves of the tree).
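
As a rough illustration, the three node kinds can be made visible by parsing a small document with Python's standard xml.etree.ElementTree module. The sample document and the helper name describe are made up for this sketch. ElementTree stores attributes as a dictionary on the element and text as plain strings rather than as separate node objects, so this only approximates the tree model of Definition 7.6.6.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the notes.
DOC = '<lecture><slide id="s1">Trees</slide><slide id="s2">Grammars</slide></lecture>'

def describe(elem, depth=0):
    """Return one line per node: the element, then its attributes, then its text."""
    pad = "  " * depth
    lines = [f"{pad}element {elem.tag}"]
    for name, value in elem.attrib.items():
        lines.append(f"{pad}  attribute {name}={value!r}")
    if elem.text and elem.text.strip():
        lines.append(f"{pad}  text {elem.text.strip()!r}")
    for child in elem:                      # recurse into child elements
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(describe(root)))
```

The printed outline lists the lecture element first, then each slide element with its id attribute and its text leaf indented beneath it.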

Serialization (Definition 7.6.7): the tree is encoded as a string with:
- Opening tag: <el>
- Closing tag: </el>
- Empty element: <el/>
- Attributes inside the opening tag: <el attr="value">
- Text content as raw characters.
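
The encoding rules above can be tried out with a small sketch that builds a tree by hand and serializes it using Python's standard xml.etree.ElementTree. The element and attribute names (lecture, slide, break, id) are illustrative assumptions, not names from the notes.

```python
import xml.etree.ElementTree as ET

# Build a tree by hand; all names here are illustrative only.
root = ET.Element("lecture")
slide = ET.SubElement(root, "slide", {"id": "s1"})  # attribute goes in the opening tag
slide.text = "Trees"                                # text content as raw characters
ET.SubElement(root, "break")                        # no children and no text

# encoding="unicode" returns a str instead of bytes.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)

# Parsing the string again recovers a tree with the same shape.
reparsed = ET.fromstring(serialized)
print([child.tag for child in reparsed])
```

The childless break element is written in the empty-element form, while slide gets an opening tag carrying its attribute, then its text, then a closing tag.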

There must be a unique document root — the outermost element.

Well-formed vs valid:
- A document is well-formed if it parses under the XML grammar G_{XML}
(Definition 7.6.9, notes p. 134; 81 productions for the character-level syntax).
- A document is valid (Definition 7.6.11) if it is well-formed AND its
parse tree is accepted by a schema, i.e. a grammar written in a schema language.
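
A minimal sketch of the well-formedness half of this distinction, assuming Python's standard xml.etree.ElementTree as a stand-in for parsing under G_{XML}. The function name is_well_formed is hypothetical. The parser here is non-validating: it checks syntax only and knows nothing about any schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the string parses as XML. Checks syntax only, not any schema."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))     # properly nested, single root
print(is_well_formed("<a><b></a></b>"))  # tags closed in the wrong order
print(is_well_formed("<a/><b/>"))        # two top-level elements, no unique root
```

Only the first string is accepted; the other two fail because of mismatched nesting and a missing unique root. Validity would need an additional check of the resulting tree against a schema, which the Python standard library does not provide.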

Schema languages (Definition 7.6.10): tree grammars that specify which
element structures are allowed:
- DTD (Definition 7.6.12): Document Type Definition, built into the XML
specification but now largely superseded.
- XSD (Definition 7.6.13): XML Schema Definition, based on regular tree
grammars. XSDs are themselves XML documents.
- RELAX NG (Definition 7.6.14): Regular Language for XML Next Generation.
It has both an XML syntax and a compact non-XML syntax.

XHTML (Definition 7.6.5) is the XML-compliant version of HTML — same tree
model, stricter syntax.

XML's contribution to symbolic AI: a standardized way to transfer trees,
instead of strings, between systems.

Mixing XML vocabularies: namespaces

Real XML documents combine multiple vocabularies (e.g. PDF metadata that mixes Dublin Core for bibliographic data with RDF triples). Namespaces disambiguate element names:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
      <rdf:Description rdf:about="">
        <dc:title>SMAI</dc:title>
        <dc:creator>Kohlhase</dc:creator>
        <pdf:CreationDate>2026-01-29</pdf:CreationDate>
      </rdf:Description>
    </rdf:RDF>

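Here rdf:, dc:, and pdf: are namespace prefixes. Each is bound by its xmlns: declaration to a namespace URI that identifies one vocabulary; the URI serves as a unique name and need not point to a schema. Namespace-aware schema languages such as XSD and RELAX NG can then constrain each vocabulary independently, whereas DTDs predate namespaces and treat a prefixed name as an ordinary element name.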
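
To see how the prefixes resolve, here is a hedged sketch that parses the metadata example above with Python's standard xml.etree.ElementTree. The NS dictionary is a lookup table supplied by the caller for this sketch; its prefixes are shorthand chosen by the reader of the document and only the URIs have to match the document.

```python
import xml.etree.ElementTree as ET

DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
  <rdf:Description rdf:about="">
    <dc:title>SMAI</dc:title>
    <dc:creator>Kohlhase</dc:creator>
    <pdf:CreationDate>2026-01-29</pdf:CreationDate>
  </rdf:Description>
</rdf:RDF>
"""

# Prefix-to-URI map used only for the path lookups below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(DOC.strip())
# ElementTree expands each prefixed name to the form {namespace-URI}local-name.
print(root.tag)

title = root.find("rdf:Description/dc:title", NS)
print(title.tag, title.text)
```

After parsing, the prefix itself is gone from the tag: what identifies an element is the pair of namespace URI and local name, which is why two vocabularies can both define a title element without clashing.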

RELAX NG: full XML syntax

RELAX NG comes in two flavours: a compact non-XML syntax and a full XML syntax using <grammar>, <start>, <element>, <define>.

Compact syntax (what we saw before):

    start = element lecture { slide+ }
    slide = element slide { attribute id { text }, text }

Same schema in full XML syntax:

    <grammar xmlns="http://relaxng.org/ns/structure/1.0">
      <start>
        <element name="lecture">
          <oneOrMore>
            <element name="slide">
              <attribute name="id"><text/></attribute>
              <text/>
            </element>
          </oneOrMore>
        </element>
      </start>
    </grammar>

The compact syntax is preferred for human authoring; the XML syntax is preferred for tools that emit XML.
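
What it means for a tree to be accepted by this schema can be sketched with a hand-coded check in Python. This is not a RELAX NG implementation: the function is_valid_lecture is a hypothetical helper that hard-codes the two rules of the lecture/slide schema, and the Python standard library has no schema validator of its own.

```python
import xml.etree.ElementTree as ET

def is_valid_lecture(text):
    """Hand-coded check for: lecture { slide+ }, slide { attribute id, text }."""
    try:
        root = ET.fromstring(text)             # step 1: must be well-formed
    except ET.ParseError:
        return False
    # The root must be <lecture>, without attributes, with at least one child.
    if root.tag != "lecture" or root.attrib or len(root) == 0:
        return False
    if (root.text or "").strip():              # no text directly inside lecture
        return False
    for slide in root:
        if slide.tag != "slide" or len(slide) != 0:
            return False                       # only childless slide elements
        if set(slide.attrib) != {"id"}:
            return False                       # exactly the id attribute
        if (slide.tail or "").strip():
            return False                       # no stray text between slides
    return True

print(is_valid_lecture('<lecture><slide id="s1">Trees</slide></lecture>'))
print(is_valid_lecture('<lecture/>'))
print(is_valid_lecture('<lecture><slide>Trees</slide></lecture>'))
```

The first document is accepted. The second is well-formed but has no slide, and the third is well-formed but its slide lacks the id attribute, so both are rejected: they illustrate documents that are well-formed yet not valid.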

Practice
Multiple choice
Q1. An XML document tree has nodes of which kinds?
Q2. What does it mean for an XML document to be well-formed?
Q3. An XML document is valid iff…
Q4. XSD is based on…
Q5. The XML namespace declaration xmlns:dc="..." introduces the prefix:
Fill in the blank
Q1. An XML document is well-formed if it parses under the grammar G_{___}.
Q2. XML stands for Extensible ___ Language.
Q3. An XML document is ___ if it is well-formed and accepted by a schema.
Order the steps
Arrange these steps in the correct order.
- Identify text nodes as leaf content
- Identify the root element (must be unique)
- Extract attributes from opening tags
- Pair each opening tag <el> with its closing </el>