First steps

Our first SAX toy application, sax.stat.v1.ElementCount, simply counts the elements it finds in an arbitrary XML document. In addition, it writes each SAX event to standard output, producing the output sketched in Figure 899, “Parsing a XML document creates a corresponding sequence of events.”. The application's central implementation reads:

Figure 901. Counting XML elements.
package sax.stat.v1;
...

public class ElementCount {

  public void parse(final String uri) {
    try {
      final SAXParserFactory saxPf = SAXParserFactory.newInstance();
      final SAXParser saxParser = saxPf.newSAXParser();
      saxParser.parse(uri, eventHandler);
    } catch (ParserConfigurationException e){
      e.printStackTrace(System.err);
    } catch (org.xml.sax.SAXException e) {
      e.printStackTrace(System.err);
    } catch (IOException e){
      e.printStackTrace(System.err);
    }
  }

  public int getElementCount() {
    return eventHandler.getElementCount();
  }
  private final MyEventHandler eventHandler = new MyEventHandler();
}

This application works for arbitrary well-formed XML documents.


We now explain this application in detail. The first part deals with the instantiation of a parser:

try {
   final SAXParserFactory saxPf = SAXParserFactory.newInstance();
   final SAXParser saxParser = saxPf.newSAXParser();
   saxParser.parse(uri, eventHandler);
} catch (ParserConfigurationException e){
   e.printStackTrace(System.err);
} ...

To keep an application independent of a specific parser implementation, the Java API (javax.xml.parsers) uses the so-called Abstract Factory pattern instead of simply calling the constructor of a vendor-specific parser class.
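The factory approach can be tried in isolation. The following is a minimal sketch, not part of the original application: the class name FactoryDemo and the helper createParser() are hypothetical. It only uses SAXParserFactory.newInstance(), setNamespaceAware(), newSAXParser() and SAXParser.isNamespaceAware(). Which concrete parser class gets printed depends on the JDK and on libraries found on the classpath.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class: obtains a parser without naming any vendor class.
class FactoryDemo {

  // Configure the factory first, then let it create the parser.
  static SAXParser createParser(final boolean namespaceAware) {
    try {
      final SAXParserFactory saxPf = SAXParserFactory.newInstance();
      saxPf.setNamespaceAware(namespaceAware);
      return saxPf.newSAXParser();
    } catch (ParserConfigurationException | SAXException e) {
      throw new IllegalStateException("No usable SAX parser available", e);
    }
  }

  public static void main(String[] args) {
    final SAXParser saxParser = createParser(true);
    // The concrete class is chosen by the factory, not by our code.
    System.out.println("Parser implementation: " + saxParser.getClass().getName());
    System.out.println("Namespace aware: " + saxParser.isNamespaceAware());
  }
}
```

Because the application code refers only to the abstract types SAXParserFactory and SAXParser, a different parser implementation can be substituted without changing the source.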

To be useful, the parser has to be told what to do when an XML document gets parsed. For this purpose our application supplies an event handler instance:

public void parse(final String uri) {
  try {
    final SAXParserFactory saxPf = SAXParserFactory.newInstance();
    final SAXParser saxParser = saxPf.newSAXParser();
    saxParser.parse(uri, eventHandler);
  } catch (org.xml.sax.SAXException e) {
 ...
  private final MyEventHandler eventHandler = new MyEventHandler();
}

What does the event handler actually do? It offers methods that the parser calls during the parsing process:

package sax.stat.v1;
...
public class MyEventHandler extends org.xml.sax.helpers.DefaultHandler {

  public void startDocument()❶ {
    System.out.println("Opening Document");
  }
  public void endDocument()❷ {
    System.out.println("Closing Document");
  }
  public void startElement(String namespaceUri, String localName, String rawName,
                     Attributes attrs) ❸{
    System.out.println("Opening \"" + rawName + "\"");
    elementCount++;
  }
  public void endElement(String namespaceUri, String localName,
    String rawName)❹{
    System.out.println("Closing \"" + rawName + "\"");
  }
  public void characters(char[] ch, int start, int length)❺{
    System.out.println("Content \"" + new String(ch, start, length) + '"');
  }
  public int getElementCount() ❻{
    return elementCount;
  }
  private int elementCount = 0;
}

❶ This method gets called exactly once, namely when the parser starts processing the XML document as a whole.

❷ This method is called last, after the whole document instance has been parsed successfully.

❸ This method gets called each time a new element is parsed. In the given catalog.xml example it will be called three times: once when <catalog> appears and once for each of the two <item ...> elements. The supplied parameters depend on whether or not namespace processing is enabled.

❹ Called each time an element like <item ...> gets closed by its counterpart </item>.

❺ This method is responsible for the treatment of textual content, i.e. handling #PCDATA element content. We will explain its uncommon signature a little bit later.

❻ getElementCount() is a getter method providing read-only access to the private field elementCount, which gets incremented in ❸ each time an XML element opens.
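How the startElement() parameters depend on namespace processing can be observed with a small experiment. The sketch below is an illustration only: the class NamespaceDemo, the helper startEvents() and the sample document with its namespace URI http://example.org/catalog are assumptions and do not belong to the application above. The same document is parsed twice, once with a namespace-aware parser and once without, and the three name-related arguments are recorded for each element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class comparing both parser configurations.
class NamespaceDemo {

  // Assumed sample document using a namespace prefix.
  static final String XML =
      "<c:catalog xmlns:c=\"http://example.org/catalog\"><c:item/></c:catalog>";

  // Parses XML and returns one descriptive line per startElement() call.
  static List<String> startEvents(final boolean namespaceAware) throws Exception {
    final List<String> events = new ArrayList<>();
    final SAXParserFactory saxPf = SAXParserFactory.newInstance();
    saxPf.setNamespaceAware(namespaceAware);
    saxPf.newSAXParser().parse(new InputSource(new StringReader(XML)),
        new DefaultHandler() {
          @Override
          public void startElement(String namespaceUri, String localName,
              String rawName, Attributes attrs) {
            events.add("uri='" + namespaceUri + "' local='" + localName
                + "' raw='" + rawName + "'");
          }
        });
    return events;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("Namespace aware:     " + startEvents(true));
    System.out.println("Not namespace aware: " + startEvents(false));
  }
}
```

With namespace processing enabled the handler receives the namespace URI and the local name catalog. Without it, applications should rely on the raw (qualified) name c:catalog only, since the other two arguments may be empty.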

The call saxParser.parse(uri, eventHandler) actually initiates the parsing process and tells the parser to:

  • Open the XML document referenced by the URI argument.

  • Forward XML events to the event handler instance supplied by the second argument.
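Besides a URI string, SAXParser.parse() also accepts an org.xml.sax.InputSource, which is convenient for experiments without a file. The following minimal sketch counts elements of an in-memory document. The class InMemoryElementCount and its CATALOG constant are hypothetical; the document text is reconstructed from the catalog example's output and may differ in detail from the actual catalog.xml file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: same counting idea, but reading from a String.
class InMemoryElementCount {

  // Assumed content, reconstructed from the catalog example.
  static final String CATALOG =
      "<catalog>\n"
      + "  <item orderNo=\"3218\">Swinging headset</item>\n"
      + "  <item orderNo=\"9921\">200W Stereo Amplifier</item>\n"
      + "</catalog>\n";

  // Parses the given document text and returns the number of elements found.
  static int countElements(final String xml) throws Exception {
    final int[] count = {0}; // one-element array, writable from the anonymous class
    SAXParserFactory.newInstance().newSAXParser().parse(
        new InputSource(new StringReader(xml)), // first argument: the document
        new DefaultHandler() {                  // second argument: the event handler
          @Override
          public void startElement(String namespaceUri, String localName,
              String rawName, Attributes attrs) {
            count[0]++;
          }
        });
    return count[0];
  }

  public static void main(String[] args) throws Exception {
    System.out.println("Document contains " + countElements(CATALOG) + " elements");
  }
}
```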

A driver class containing a main(...) method may start the whole process and print out the desired number of elements upon completion of a parsing run:

package sax.stat.v1;

public class ElementCountDriver {
  public static void main(String argv[]) {
    ElementCount xmlStats = new ElementCount();
    xmlStats.parse("Input/Sax/catalog.xml");
    System.out.println("Document contains " + xmlStats.getElementCount() + " elements");
  }
}

Processing the catalog example instance yields:

Opening Document
Opening "catalog" ❶
Content "
  "
Opening "item" ❷
Content "Swinging headset"
Closing "item"
Content "
  "
Opening "item"  ❸
Content "200W Stereo Amplifier"
Closing "item"
Content "
"
Closing "catalog"
Closing Document
Document contains 3 elements 

❶ Start parsing element <catalog>.

❷ Start parsing element <item orderNo="3218">Swinging headset</item>.

❸ Start parsing element <item orderNo="9921">200W Stereo Amplifier</item>.

After the parsing process has completed, the application prints the number of elements it has counted.

The output contains some lines of seemingly empty content. This content is due to whitespace located between elements: for example, a newline appears between the <catalog> start tag and the first <item> element. The parser passes this whitespace to the characters() method. Applications will typically ignore such calls. XML document instances in a professional context often do not contain any newline characters at all; instead, the whole document is represented as a single line. This hampers human readability, which is not required as long as the processing applications work well. In this case empty content as above will not appear.
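Ignoring whitespace-only content can be sketched as follows. This is one possible approach, not the application's actual code: the names WhitespaceFilterDemo, TextCollector, isWhitespaceOnly() and nonBlankContent() are hypothetical. Note that SAX permits a parser to deliver contiguous text in several characters() calls, so a production handler would usually accumulate text until the enclosing element ends.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects text content, skipping pure whitespace.
class WhitespaceFilterDemo {

  // Handler keeping only chunks containing at least one non-whitespace character.
  static class TextCollector extends DefaultHandler {
    final List<String> contents = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
      if (!isWhitespaceOnly(ch, start, length)) {
        contents.add(new String(ch, start, length));
      }
    }
  }

  // Inspects the slice in place; no String is created for ignored whitespace.
  static boolean isWhitespaceOnly(char[] ch, int start, int length) {
    for (int i = start; i < start + length; i++) {
      if (!Character.isWhitespace(ch[i])) {
        return false;
      }
    }
    return true;
  }

  static List<String> nonBlankContent(final String xml) throws Exception {
    final TextCollector handler = new TextCollector();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler.contents;
  }

  public static void main(String[] args) throws Exception {
    final String xml = "<catalog>\n  <item>Swinging headset</item>\n"
        + "  <item>200W Stereo Amplifier</item>\n</catalog>";
    for (final String content : nonBlankContent(xml)) {
      System.out.println("Content \"" + content + '"');
    }
  }
}
```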

The signature of the characters(char[] ch, int start, int length) method looks somewhat strange with respect to Java conventions; one might expect characters(String s). But this way the SAX API allows efficient parser implementations: a parser may initially allocate a reasonably large char array of, say, 128 bytes, sufficient to hold 64 (Unicode) characters. If this buffer gets exhausted the parser might allocate a second buffer of double size, thus implementing an amortized doubling algorithm.

As an example, suppose the first element content just fits in the first buffer. The second content 200W Stereo Amplifier and a third content Earphone then both fit in the second buffer. Subsequent content may require further buffer allocations. Such a strategy minimizes the number of time-consuming new String(...) constructor calls that would be necessary for the more convenient API variant characters(String s).
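The buffer strategy can be illustrated by a toy model. The sketch below is purely hypothetical and does not show how any real parser is implemented: the class DoublingBufferSketch, its CharSink interface and the initial capacity of 16 characters are assumptions chosen so that the three contents mentioned above behave as described. Each content is copied into a shared char array and handed to the receiver as a (ch, start, length) slice, so no String object is created on the delivering side.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a parser-side text buffer that doubles when exhausted.
class DoublingBufferSketch {

  // Hypothetical receiver mirroring the shape of the SAX characters() callback.
  interface CharSink {
    void characters(char[] ch, int start, int length);
  }

  private char[] buffer;
  private int used = 0;        // number of chars already occupied in the current buffer
  private int allocations = 1; // the initial buffer counts as the first allocation

  DoublingBufferSketch(final int initialCapacity) {
    buffer = new char[initialCapacity];
  }

  // Copies content into the buffer and passes the occupied slice to the sink.
  void deliver(final String content, final CharSink sink) {
    final int length = content.length();
    if (buffer.length - used < length) {
      int newSize = buffer.length * 2;
      while (newSize < length) {
        newSize *= 2;          // keep doubling until the content fits
      }
      buffer = new char[newSize];
      used = 0;
      allocations++;
    }
    content.getChars(0, length, buffer, used);
    sink.characters(buffer, used, length);
    used += length;
  }

  int getAllocations() {
    return allocations;
  }

  public static void main(String[] args) {
    final DoublingBufferSketch parserBuffer = new DoublingBufferSketch(16);
    final List<String> received = new ArrayList<>();
    final CharSink sink = (ch, start, length) ->
        received.add("start=" + start + " length=" + length);
    parserBuffer.deliver("Swinging headset", sink);
    parserBuffer.deliver("200W Stereo Amplifier", sink);
    parserBuffer.deliver("Earphone", sink);
    System.out.println(received);
    System.out.println("Buffers allocated: " + parserBuffer.getAllocations());
  }
}
```

A receiver that only inspects or counts characters can work directly on the slice. Only a receiver that needs a String pays for new String(ch, start, length).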