Well formed XML documents

The general structure of an XML document is as follows:

Figure 630. XML basic structure

Figure 631. Minimal XML document
<minimum/>

Figure 632. Empty tags
<p></p>

is equivalent to:

<p/>
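
The equivalence of both spellings can be checked with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the variable names are chosen for illustration only:

```python
import xml.etree.ElementTree as ET

# Both spellings of an empty element parse to the same in-memory element.
long_form = ET.fromstring("<p></p>")
short_form = ET.fromstring("<p/>")

for element in (long_form, short_form):
    # Neither variant carries text content or child elements.
    print(element.tag, element.text, len(element))

# Serializing both elements yields identical output.
print(ET.tostring(long_form) == ET.tostring(short_form))
```

From the parser's point of view the choice between both forms is purely a matter of notation.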

We explore a simple XML document representing e-mail-style messages:

Figure 633. Representing messages.
<?xml version="1.0" encoding="UTF-8"?>
<memo>
 <from>M. Goik</from>
 <to>B. King</to>
 <to>A. June</to>
 <subject>Best wishes</subject>
 <content>Hi all, congratulations to your splendid party</content>
</memo>

The very first character sequence <?xml is a terminal symbol acting as a magic number string that indicates the content type. It allows XML documents to be distinguished from other file types, e.g. .gif, .jpg or .zip.

Note:

  • The whole header line is optional with respect to the XML standard.

  • The notion of an XML document type is fairly generic: a document may be categorized further, e.g. as containing SVG data.
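
The magic number idea can be sketched as a simple content sniffer. This is a heuristic illustration only: the function name looks_like_xml and the sample byte strings are made up for this sketch, and because the header line is optional (and may be preceded by a byte order mark) a missing <?xml prefix does not prove that content is not XML.

```python
def looks_like_xml(first_bytes: bytes) -> bool:
    """Heuristic: does the content start with the XML declaration?"""
    return first_bytes.startswith(b"<?xml")

# Hypothetical file beginnings; GIF and ZIP files start with their own magic numbers.
samples = {
    "memo.xml": b'<?xml version="1.0" encoding="UTF-8"?><memo/>',
    "logo.gif": b"GIF89a...",
    "archive.zip": b"PK\x03\x04...",
    "bare.xml": b"<memo/>",  # well-formed XML, but the optional header is omitted
}

for name, data in samples.items():
    print(name, looks_like_xml(data))
```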

The version="1.0" attribute declares that the document conforms to version 1.0 of the XML standard. This allows the standard to evolve, e.g. to a hypothetical version="2.1".

The attribute encoding="UTF-8" declares that the document's content is encoded in UTF-8, a widely accepted Unicode character encoding. This way European, Cyrillic and most Asian scripts may be used simultaneously within one document.

Note: Proper visual rendering requires corresponding fonts to be installed. A document containing Chinese characters is of no use if the underlying rendering system lacks e.g. Chinese TrueType fonts.
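
As a small illustration of mixing scripts in one UTF-8 document, the following sketch encodes a made-up sample text and parses it back using Python's standard xml.etree.ElementTree module. The sample text and variable names are assumptions for this example:

```python
import xml.etree.ElementTree as ET

# Latin, Cyrillic and Chinese characters within a single document.
text = "Grüße, Привет, 你好"
document = f'<?xml version="1.0" encoding="UTF-8"?><p>{text}</p>'

# The bytes on disk are UTF-8; the parser decodes them according to the declaration.
raw = document.encode("utf-8")
parsed = ET.fromstring(raw)

print(parsed.text == text)
# Number of characters versus number of UTF-8 bytes: non-ASCII characters need more than one byte.
print(len(text), len(text.encode("utf-8")))
```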

An XML document must have exactly one top-level element. In contrast to HTML, the nodes of an XML document are commonly referred to as elements rather than tags. In this example <memo> is the top-level (root) element.

Note: The document's root element may still appear as a nested descendant element again:

<memo>
  ...
 <content>Hi all, congratulations to your splendid party</content>
 <memo>Play it again, sam</memo>
</memo>

Each start tag like <from> has a corresponding end tag </from>. In XML terms, each element being opened has to be closed accordingly. Together with the preceding point this means that each XML document represents a tree structure, as shown in the tree graph representation.
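
The tree structure can be made visible by walking a parsed document. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper outline is a hypothetical name introduced for this illustration:

```python
import xml.etree.ElementTree as ET

MEMO = """<memo>
 <from>M. Goik</from>
 <to>B. King</to>
 <to>A. June</to>
 <subject>Best wishes</subject>
 <content>Hi all, congratulations to your splendid party</content>
</memo>"""

def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(MEMO)
print("\n".join(outline(root)))
```

The single root element appears at depth 0 and all other elements are nested below it.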


Figure 634. Just plain XML?
<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply>
    <in/>
    <cn type="complex-cartesian">17<sep/>29</cn>
    <complexes/>
  </apply>
</math>
  • Limited use: Office environments require rendering tools.

  • Database style file system representation.


Figure 635. MI department CLI parser

Checking an XML document for well-formedness:

goik>xmlparse message.xml
Parsing was successful
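
Readers without access to the xmlparse tool may run a comparable check with any XML parser. The following sketch uses Python's standard xml.etree.ElementTree module. It is not the implementation of xmlparse, the function name check_well_formed is made up, and the error messages differ from those of the Java parser:

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text: str) -> str:
    """Parse the text and report success or the parser's complaint."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        # The message contains the line and column of the problem.
        return f"not well-formed: {error}"
    return "Parsing was successful"

print(check_well_formed("<memo><from>M. Goik</from></memo>"))
print(check_well_formed("<memo><from>M. Goik</memo>"))
```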

We deliberately omit the closing tag </from>:

Figure 636. Non-wellformed, missing </from>
<memo>
 <from>M. Goik <!-- missing </from> -->
 <to>B. King</to>
 <to>A. June</to>
 <subject>Best wishes</subject>
 <content>... splendid party</content>
</memo>

goik>xmlparse omitfrom.xml
file://.../omitfrom.xml:7:3:
fatal error org.xml.sax.SAXParseException:
  The element type "from"
must be terminated by the matching end-tag "</from>". parsing error

Experienced HTML authors may be confused, since HTML tolerates some omitted end tags: older HTML versions are not XML applications. Instead they belong to the set of SGML applications. SGML, the Standard Generalized Markup Language, is a much older standard that is now mostly of historic interest.

Even if every XML element has a closing counterpart, the resulting XML may still not be well-formed:

Figure 637. Improperly nested elements
<memo>
 <from>M. Goik<to>B. King</from></to>
 <to>A. June</to>
 <subject>Best wishes</subject>
 <content>Hi all, congratulations to your splendid party</content>
</memo>

This type of error is caused by so-called improper nesting of elements: the element <from> is closed before its inner element <to> has been closed. This contradicts the representation of XML documents as tree-like structures. The parser thus reports:

Figure 638. Improperly nested elements: Result
file:///ma/goik/workspace/Vorlesungen/Input/Memo/nonest.xml:2:29:
fatal error org.xml.sax.SAXParseException: The element type "to" must be
terminated by the matching end-tag "</to>". parsing error

This is resolved by closing <to> before <from>:

...<from>M. Goik<to>B. King</to></from>...
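
Proper nesting amounts to a stack discipline: the most recently opened element must be closed first. The following sketch illustrates this idea with a deliberately simplified tag scanner. It is not a real XML parser (it ignores comments, CDATA sections and processing instructions), and the names TAG and nesting_error are made up for this illustration:

```python
import re

# Simplified pattern: optional "/", an element name, optional attributes, optional "/" before ">".
TAG = re.compile(r"<(/?)([A-Za-z_:][\w:.-]*)[^<>]*?(/?)>")

def nesting_error(xml_text):
    """Return None if all tags nest properly, else a description of the first violation."""
    open_elements = []  # stack of currently open element names
    for match in TAG.finditer(xml_text):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <p/> opens and closes at once
        if not closing:
            open_elements.append(name)
        elif not open_elements:
            return f"</{name}> has no matching start tag"
        elif open_elements[-1] != name:
            return f"<{open_elements[-1]}> must be closed before </{name}>"
        else:
            open_elements.pop()
    if open_elements:
        return f"<{open_elements[-1]}> is never closed"
    return None

print(nesting_error("<from>M. Goik<to>B. King</from></to>"))
print(nesting_error("<from>M. Goik<to>B. King</to></from>"))
```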

We provide two examples illustrating proper and improper nesting of XML elements:

Figure 639. Proper nesting of XML elements

The following example violates the XML proper nesting constraint and thus does not represent a well-formed document:

Figure 640. Improperly nested elements

XML elements may have attributes like date in the following example:

Figure 641. date and priority attributes.
<memo date="10.02.2026" priority="high">
  <from>M. Goik</from>
  <to>B. King</to>
  <to>A. June</to>
  <subject>Best wishes</subject>
  <content>Hi all, congratulations to your splendid party</content>
</memo>

Figure 642. Unique attribute names
<!-- Error: Attribute date must be
            unique within element
            <memo> -->
<memo date = "10.02.2026"
      priority = "high"
      date = "10.02.2026">
  ...

This is analogous to a Java class declaring the same field twice:

public class Memo {
  Date date;
  Priority priority;
  Date date;
  ...
}
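
The uniqueness rule can be observed with a parser. The following sketch uses Python's standard xml.etree.ElementTree module; the exact wording of the error message depends on the parser in use:

```python
import xml.etree.ElementTree as ET

# The attribute date appears twice: not well-formed.
duplicate = '<memo date="10.02.2026" priority="high" date="10.02.2026"/>'
try:
    ET.fromstring(duplicate)
    outcome = "parsed"
except ET.ParseError as error:
    outcome = f"rejected: {error}"
print(outcome)

# With unique names the attributes are available as a dictionary.
unique = ET.fromstring('<memo date="10.02.2026" priority="high"/>')
print(unique.attrib)
```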

Figure 643. Quotes required
  ...
<img valign = 'top'>
  ...
  ...
<!-- Error: Attribute value
            must be quoted -->
<img valign = top>
  ...

exercise No. 2

Single and double attribute value quotes

Q:

We recall the problem of nested quotes yielding XML code that is not well-formed:

<img src="bold.gif" alt="We may use "quotes" here" />

The XML specification defines legal attribute value definitions as:

Literals
[1] EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"' |  "'" ([^%&'] | PEReference | Reference)* "'"  
[2] AttValue ::= '"' ([^<&"] | Reference)* '"' |  "'" ([^<&'] | Reference)* "'"  
[3] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")  
[4] PubidLiteral ::= '"' PubidChar* '"' | "'" (PubidChar - "'")* "'"  
[5] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]  

Find out how it is possible to set the attribute alt's value to the string We may use "quotes" here.

A:

The production rule for attribute values reads:

[2] AttValue ::= '"' ([^<&"] | Reference)* '"' |  "'" ([^<&'] | Reference)* "'"  

This allows us to use either of two alternatives to delimit attribute values:

<img ... alt="..."/>

Constraint: do not use " inside the value string.

<img ... alt='...'/>

Constraint: do not use ' inside the value string.

We may take advantage of the second rule:

<img src="bold.gif" alt='We may use "quotes" here' />

Notice that according to the AttValue production [2] the delimiting quotes must not be mixed. The following code is thus not well-formed:

<img src="bold.gif'/>
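
The three cases of this exercise can be verified with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper parses is a hypothetical name introduced for this illustration:

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the text is accepted by the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Double quotes inside a single-quoted value are fine.
img = ET.fromstring("""<img src="bold.gif" alt='We may use "quotes" here' />""")
print(img.get("alt"))

# Nested double quotes and mixed delimiters are both rejected.
print(parses("""<img src="bold.gif" alt="We may use "quotes" here" />"""))
print(parses("""<img src="bold.gif'/>"""))
```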

exercise No. 3

A graphical representation of a memo.

Q:

Draw a graphical representation, similar to Figure 626, “MathML tree graph representation”, of the memo document given in Figure 641, “date and priority attributes”.

A:

The memo document's structure may be visualized as:

A graphical representation of Figure 641, “date and priority attributes”:

The order of element child nodes is significant in XML and has to be preserved. Only the order of the two attributes date and priority is undefined: attributes belong to the <memo> node, which acts as a dictionary with the attribute names as keys and the attribute values as values.
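
The difference between ordered children and unordered attributes can be sketched with Python's standard xml.etree.ElementTree module, which exposes attributes as a dictionary. The two shortened memo documents are made up for this illustration:

```python
import xml.etree.ElementTree as ET

first = ET.fromstring('<memo date="10.02.2026" priority="high"><from/><to/></memo>')
second = ET.fromstring('<memo priority="high" date="10.02.2026"><to/><from/></memo>')

# Attributes compare as dictionaries: their order in the source does not matter.
print(first.attrib == second.attrib)

# Child elements form a sequence: swapping them yields a different document.
print([child.tag for child in first], [child.tag for child in second])
```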

Attributes and quotes

As stated before, XML attribute values have to be enclosed in single or double quotes. Construct an XML document with mixed quotes like <date day="monday'>. Find the corresponding syntax definition of legal attribute values in the XML standard (W3C Recommendation) to explain your result.

A:

The parser flags a mixture of single and double quotes for a given attribute as an error. The XML standard defines the syntax of attribute values in http://www.w3.org/TR/xml/#NT-AttValue: an attribute value has to be enclosed either in two single quotes or in two double quotes. Mixed quotes are disallowed.

Quotes as part of an attribute value?

Single and double quotes are used to delimit an attribute value. May quotes themselves appear as part of an attribute's value, e.g. in a person's name like Gary "King" Mandelson?

A:

An attribute value may contain literal double quotes if it is enclosed in single quotes and vice versa. Thus an attribute value cannot contain both literal single and literal double quotes at the same time; at least one kind then has to be written as a replacement entity (&quot; or &apos;):

Quotes as part of attribute values.

<?xml version="1.0" encoding="UTF-8"?>
<test>
  <person name='Gary "King" Mandelson'/> <!-- o.k.: Double quotes inside single quotes. -->
  <person name="Gary 'King' Mandelson"/> <!-- o.k.: Single quotes inside double quotes. -->
  <person name="Gary 'King 'S.' "Mandelson"'/> <!-- Oops: Just either of! -->
</test>
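
Choosing suitable delimiters can be left to a library. The following sketch uses quoteattr from Python's standard xml.sax.saxutils module, which picks the quote style and falls back to the &quot; entity when a value contains both kinds of quotes. The third name is a made-up example:

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

names = [
    'Gary "King" Mandelson',
    "Gary 'King' Mandelson",
    """Gary 'King' "S." Mandelson""",  # both kinds of quotes
]

for name in names:
    # quoteattr returns the value including its delimiting quotes.
    element_text = f"<person name={quoteattr(name)}/>"
    parsed = ET.fromstring(element_text)
    print(element_text, parsed.get("name") == name)
```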

Constraints being imposed on XML documents:


These constraints are part of the definition of a well formed document. The specification imposes additional constraints for a document to be well-formed.

Figure 645. XML markup collision I
Wrong:       <p> if (a < b) return true;</p>
Replacement: <p> if (a &lt; b) return true;</p>

  • XML parser assumes opening element <b>.

  • Operator < interferes with XML markup.

Replacement entity &lt;

Figure 646. XML markup collision II
Error:       <p>Smith & Wesson</p>
Replacement: <p>Smith &amp; Wesson</p>

Error:       <img ... alt='a 'good' fellow'/>
Replacement: <img ... alt="a &apos;good&apos; fellow"/>

Error:       <img ... alt="a "good" fellow"/>
Replacement: <img ... alt="a &quot;good&quot; fellow"/>
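
Such replacements are usually applied by a library rather than by hand. The following is a minimal sketch using escape and unescape from Python's standard xml.sax.saxutils module; the sample strings are made up for this illustration:

```python
from xml.sax.saxutils import escape, unescape

code = "if (a < b && c > d) return true;"

# By default escape replaces &, < and >.
escaped = escape(code)
print(escaped)

# unescape reverses the replacement.
print(unescape(escaped) == code)

# Quotes are only replaced on request, via an extra mapping.
print(escape('a "good" fellow', {'"': "&quot;", "'": "&apos;"}))
```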

Figure 647. XML standard replacement entities
Character Replacement
< &lt;
> &gt;
& &amp;
' &apos;
" &quot;

Figure 648. Using CDATA sections
<!-- Avoiding &amp; -->
<h3><![CDATA[HTML & XML]]></h3>

<!-- Display markup code
     «as is» -->
<pre><![CDATA[<ul>
  <li>One</li>
  <li>Two</li>
</ul>]]></pre>

Hint: Possibly useful when exporting e.g. RDBMS data to XML.


Figure 649. Using replacement entities
<h3>HTML &amp; XML</h3>

<pre>&lt;ul&gt;
  &lt;li&gt;One&lt;/li&gt;
  &lt;li&gt;Two&lt;/li&gt;
&lt;/ul&gt;</pre>
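
CDATA sections and replacement entities are two notations for the same character data. The following sketch checks this with Python's standard xml.etree.ElementTree module; the <pre> example is shortened for this illustration:

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<h3><![CDATA[HTML & XML]]></h3>")
with_entity = ET.fromstring("<h3>HTML &amp; XML</h3>")

# Both notations yield the same text content.
print(with_cdata.text == with_entity.text)

# Markup inside a CDATA section is plain text: <pre> has no child elements.
pre_cdata = ET.fromstring("<pre><![CDATA[<ul>\n  <li>One</li>\n</ul>]]></pre>")
print(len(pre_cdata))
print(pre_cdata.text)
```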

exercise No. 4

CDATA usage limitation

Q:

State the obvious limitation of CDATA sections with respect to representing document content. Hint: Is there any content you may not be allowed to use?

A:

The CDATA termination symbol ]]> itself cannot be represented inside a CDATA section:

<h3><![CDATA[A CDATA section is being terminated by «]]>».]]></h3>
xmlparse /tmp/pre.xhtml
file:///tmp/pre.xhtml:1:63: fatal error org.xml.sax.SAXParseException;
systemId: file:///tmp/pre.xhtml; lineNumber: 2; columnNumber: 63;
The character sequence "]]>" must not appear in content unless used to
mark the end of a CDATA section.

Note

A CDATA section's closing terminal is exactly ]]>. Using e.g. ]] > (containing a space) does not cause any parsing problem.
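
A common workaround, not covered by the exercise, is to split the text so that the sequence ]]> is spread over two adjacent CDATA sections. The following sketch assumes Python's standard xml.etree.ElementTree module; the helper as_cdata is a hypothetical name introduced for this illustration:

```python
import xml.etree.ElementTree as ET

def as_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every "]]>" across two sections."""
    # "]]" ends the first section, ">" starts the next one.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

sentence = "A CDATA section is being terminated by «]]>»."
document = f"<h3>{as_cdata(sentence)}</h3>"

print(document)
# The parser joins both sections, restoring the original text.
print(ET.fromstring(document).text == sentence)
```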

Figure 650. Processing instruction (PI)
<?format src="styles.css"?>
  • XML level: Just a comment

  • Application level: Processing parameter
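
An application may read a processing instruction's target and data from the parsed document. The following is a minimal sketch using Python's standard xml.dom.minidom module; the root element <doc/> is made up, since the figure shows the processing instruction only:

```python
from xml.dom import minidom

document = minidom.parseString('<?format src="styles.css"?><doc/>')

# Collect (target, data) pairs of all top-level processing instructions.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
# The processing instruction is not part of the element tree below the root.
print(document.documentElement.tagName)
```

Note that the data part is an uninterpreted string at XML level: splitting it into something like src and its value is entirely up to the application.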