The set of element names

exercise No. 61

Element lists of arbitrary XML documents.

Q:

We reconsider the simple application that reads arbitrary XML documents and lists the XML elements they contain:

Opening Document
Opening "catalog"
Content "
  "
Opening "item"
Content "Swinging headset"
Closing "item"
Content " ...

If an element such as <item> appears multiple times, it is also written to standard output multiple times.
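For reference, an event-logging handler of this kind could be sketched as follows. This is not the original application: the class name EventTrace and the helper trace() are hypothetical, and the input is parsed from an in-memory string to keep the example self-contained. It only illustrates that every occurrence of an element triggers its own startElement() event.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: logs one line per SAX event. */
public class EventTrace extends DefaultHandler {

  private final StringBuilder log = new StringBuilder();

  @Override
  public void startDocument() {
    log.setLength(0); // Discard the log of a previous run.
    log.append("Opening Document\n");
  }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes attrs) {
    log.append("Opening \"").append(qName).append("\"\n");
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    // A parser is free to deliver one text node in several chunks.
    log.append("Content \"").append(ch, start, length).append("\"\n");
  }

  @Override
  public void endElement(String uri, String localName, String qName) {
    log.append("Closing \"").append(qName).append("\"\n");
  }

  /** Parses the given XML string and returns the event log. */
  static String trace(String xml) throws Exception {
    final XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    final EventTrace handler = new EventTrace();
    reader.setContentHandler(handler);
    reader.parse(new InputSource(new StringReader(xml)));
    return handler.log.toString();
  }

  public static void main(String[] args) throws Exception {
    // <item> occurs twice, so two "Opening" lines for it are logged.
    System.out.print(trace("<catalog><item>A</item><item>B</item></catalog>"));
  }
}
```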

We are now interested in the list of all element names present in an arbitrary XML document. Consider the following example:

<memo>
  <from>
    <name>Martin</name>
    <surname>Goik</surname>
  </from>
  <to>
    <name>Adam</name>
    <surname>Hacker</surname>
  </to>
  <to>
    <name>Eve</name>
    <surname>Intruder</surname>
  </to>
  <date year="2005" month="1" day="6"/>
  <subject>Firewall problems</subject>
  <content>
    <para>Thanks for your excellent work.</para>
    <para>Our firewall is definitely broken!</para>
  </content>
</memo>

The elements <to>, <name>, <surname> and <para> each appear multiple times. Write a SAX application that processes arbitrary XML documents and creates an alphabetically sorted list of the element names they contain, excluding duplicates. The intended output for the above example is:

List of elements: {content date from memo name para subject surname to }

The corresponding handler should be implemented in a reusable way: if different XML documents are handled in succession, the list of elements must be cleared before the current document is processed. Hints:

  • Use a java.util.SortedSet instance to collect element names, thereby excluding duplicates.

  • The method sax.count.ListTagNamesHandler.startDocument() may be used to initialize your handler.
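The first hint can be tried out in isolation. The following minimal sketch uses a hypothetical class SortedSetDemo that is not part of the exercise code. It shows that a java.util.TreeSet, the standard SortedSet implementation, both drops duplicates and keeps its entries in natural String order:

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical demo class, not part of the exercise code. */
public class SortedSetDemo {

  /** Collects the given names in a TreeSet: no duplicates, natural String order. */
  static SortedSet<String> collect(String... names) {
    final SortedSet<String> result = new TreeSet<>();
    for (final String name : names) {
      // add() returns false and leaves the set unchanged for a duplicate.
      result.add(name);
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [from, memo, name, to]
    System.out.println(collect("memo", "from", "name", "to", "name", "to"));
  }
}
```

Note that the natural order of String compares character codes, so upper-case element names sort before lower-case ones.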

A:

A suitable handler reads:

package sax.count;

import java.util.SortedSet;
import java.util.TreeSet;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Collecting the set of element names of an XML document */
public class ListTagNamesHandler extends DefaultHandler {

  // A SortedSet by definition does not contain any duplicates.
  private SortedSet<String> elementNames = new TreeSet<>();

  @Override
  public void startDocument() throws SAXException {
    elementNames.clear(); // May contain elements from a previous run.
  }

  @Override
  public void startElement(String namespaceUri, String localName,
      String rawName, Attributes attrs) {
    // In case the current element name has already been inserted
    // this method call will be silently ignored.
    elementNames.add(rawName);
  }

  /**
   * @return A sorted array of the element names of the currently processed
   *         XML document without duplicates.
   */
  public String[] getTagNames() {
    return elementNames.toArray(new String[0]);
  }
}
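To check the reuse requirement, one handler instance may be fed two documents in succession. The sketch below makes some assumptions: the class ReuseDemo, the nested NameCollector (a compact stand-in for the handler above) and the helper parseAll() are hypothetical names, and the documents are parsed from in-memory strings rather than files.

```java
import java.io.StringReader;
import java.util.SortedSet;
import java.util.TreeSet;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: one handler instance, several documents. */
public class ReuseDemo {

  /** Compact stand-in for the handler above. */
  static class NameCollector extends DefaultHandler {
    final SortedSet<String> names = new TreeSet<>();

    @Override
    public void startDocument() {
      names.clear(); // Forget the names of the previous document.
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
      names.add(qName);
    }
  }

  /** Parses each XML string with the same handler; returns one name list per document. */
  static String[] parseAll(String... documents) throws Exception {
    final XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    final NameCollector handler = new NameCollector();
    reader.setContentHandler(handler);
    final String[] results = new String[documents.length];
    for (int i = 0; i < documents.length; i++) {
      reader.parse(new InputSource(new StringReader(documents[i])));
      results[i] = String.join(" ", handler.names);
    }
    return results;
  }

  public static void main(String[] args) throws Exception {
    // prints "a b" followed by "x y"
    for (final String line : parseAll("<a><b/><b/></a>", "<x><y/></x>")) {
      System.out.println(line);
    }
  }
}
```

Without the clear() call in startDocument() the second result would also contain the names a and b from the first document.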

A complete application requires a driver:

package sax.count;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.XMLReader;

import sax.stat.v2.MyErrorHandler;

public class Driver {

  public static void main(String argv[]) throws Exception {

    final SAXParserFactory saxPf = SAXParserFactory.newInstance();
    final SAXParser saxParser = saxPf.newSAXParser();
    final XMLReader xmlReader = saxParser.getXMLReader();
    final ListTagNamesHandler handler = new ListTagNamesHandler();
    xmlReader.setContentHandler(handler);
    xmlReader.setErrorHandler(new MyErrorHandler());
    xmlReader.parse("Input/Xml/Memo/message.xml");

    System.out.print("List of elements: {");
    for (String elementName : handler.getTagNames()) {
      System.out.print(elementName + " ");
    }
    System.out.println("}");
  }
}