DOM and XPath

Figure 758. Why use XPath?

Figure 759. XPath and Jdom
  • Addressing node sets in XML trees.

  • Conceptual similarity to SQL.

  • Collections representing result sets.
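
The three points above may be illustrated by a minimal sketch. It uses the JDK's built-in javax.xml.xpath API rather than Jdom, so it runs without additional dependencies. The <catalog> document, the class name XpathSqlAnalogy and the method expensiveTitles() are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XpathSqlAnalogy {

  // Hypothetical sample data, not part of the lecture notes.
  static final String CATALOG =
      "<catalog>"
    + "<book price='25'><title>XML basics</title></book>"
    + "<book price='40'><title>XPath in depth</title></book>"
    + "<book price='55'><title>Jdom recipes</title></book>"
    + "</catalog>";

  // Roughly comparable to: SELECT title FROM book WHERE price > 30
  static List<String> expensiveTitles() {
    try {
      final Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(CATALOG)));

      // The XPath expression addresses a node set within the tree.
      final NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate("/catalog/book[@price > 30]/title", doc, XPathConstants.NODESET);

      // The result set is copied into an ordinary Java collection.
      final List<String> titles = new ArrayList<>();
      for (int i = 0; i < hits.getLength(); i++) {
        titles.add(hits.item(i).getTextContent());
      }
      return titles;
    } catch (final Exception e) {
      throw new IllegalStateException("Unable to evaluate the sample query", e);
    }
  }

  public static void main(final String[] args) {
    System.out.println(expensiveTitles());
  }
}
```

The predicate [@price > 30] plays the role of a WHERE clause, while the trailing /title step corresponds to the column list of a SELECT statement.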


Figure 760. XPath on top of Jdom
<dependency>                  <!-- Jdom itself -->
  <groupId>org.jdom</groupId>
  <artifactId>jdom2</artifactId>
  <version>2.0.6</version>
</dependency>

<dependency>                  <!-- XPath support for Jdom -->
  <groupId>jaxen</groupId>
  <artifactId>jaxen</artifactId>
  <version>1.1.6</version>
</dependency> ...

Figure 761. HTML containing <img> tags.
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Picture gallery</title></head>
  <body>
    <h1>Picture gallery</h1>
    <p>Images may appear inline:<img src="inline.gif" alt="none"/></p>
    <table><tbody>
      <tr>
        <td>Number one:</td>
        <td><img src="one.gif" alt="none"/></td>
      </tr>
      <tr>
        <td>Number two:</td>
        <td><img src="http://www.hdm-stuttgart.de/favicon.ico" alt="none"/></td>
      </tr>
    </tbody></table>
  </body>
</html>

Figure 762. Objective: Find contained images
  • (Nearly) arbitrary positions.

  • Possibly additional search restrictions, e.g. searching for <img/> elements missing an alt attribute.
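
A search restriction like the one mentioned above may be expressed as an XPath predicate. The following sketch assumes the JDK's javax.xml.xpath API instead of Jdom. The sample page, the class name MissingAltSearch and its method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class MissingAltSearch {

  static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

  // Hypothetical page: the second image lacks an alt attribute.
  static final String PAGE =
      "<html xmlns='" + XHTML_NS + "'><body>"
    + "<p><img src='inline.gif' alt='none'/></p>"
    + "<p><img src='one.gif'/></p>"
    + "</body></html>";

  // Binds the prefix html to the XHTML namespace for use in XPath expressions.
  static final NamespaceContext HTML_PREFIX = new NamespaceContext() {
    @Override public String getNamespaceURI(final String prefix) {
      return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
    }
    @Override public String getPrefix(final String uri) {
      return null;
    }
    @Override public Iterator<String> getPrefixes(final String uri) {
      return Collections.emptyIterator();
    }
  };

  static List<String> srcOfImagesWithoutAlt(final String xhtml) {
    try {
      final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true); // required for namespace based matching
      final Document doc =
          dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));

      final XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(HTML_PREFIX);

      // not(@alt) keeps only those <img> elements having no alt attribute.
      final NodeList hits = (NodeList) xpath.evaluate(
          "//html:img[not(@alt)]", doc, XPathConstants.NODESET);

      final List<String> result = new ArrayList<>();
      for (int i = 0; i < hits.getLength(); i++) {
        result.add(((Element) hits.item(i)).getAttribute("src"));
      }
      return result;
    } catch (final Exception e) {
      throw new IllegalStateException("Unable to search for images", e);
    }
  }

  public static void main(final String[] args) {
    System.out.println(srcOfImagesWithoutAlt(PAGE));
  }
}
```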


Figure 763. XSL script extracting images.
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:html="http://www.w3.org/1999/xhtml">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//html:img">
      <xsl:value-of select="@src"/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

Result of applying the script to Figure 761, “HTML containing <img> tags.”:

inline.gif one.gif http://www.hdm-stuttgart.de/favicon.ico

Note that the html namespace prefix must be included in the XPath expression of <xsl:for-each select="//html:img">. A plain select="//img" yields an empty node set since it only matches <img> elements not belonging to any namespace. Executing the XSL script yields the list of image references contained in the HTML page, i.e. inline.gif one.gif http://www.hdm-stuttgart.de/favicon.ico.

In preparation for an application checking image accessibility we want to rewrite the above XSL as a Java application. A simple approach would pipe the XSL output to our application, which then performs the readability checks. Instead we implement an XPath-based search within our Java application. To mirror the XSL as closely as possible, our application will search for Element nodes using the XPath expression //html:img.

Figure 764. Setting up the parser
public class DomXpath {
  private final SAXBuilder builder = new SAXBuilder();

  public List<Element> process(final String xhtmlFilename)
                throws JDOMException, IOException {

    final Document htmlInput = builder.build(xhtmlFilename);
     ...
   }
}

Tip

Complete code available here.


Figure 765. Search using XPath //img
static final XPathExpression<Element> xpathSearchImg =
  XPathFactory.instance().compile(
    "//img" ,
    new ElementFilter() /* filter just elements */);

Figure 766. Search and namespace
static final Namespace htmlNamespace  =
  Namespace.getNamespace("html", "http://www.w3.org/1999/xhtml");

static final XPathExpression<Element> xpathSearchImg =
  XPathFactory.instance().compile(
    "//html:img" ,
    new ElementFilter(),
    null ,
    htmlNamespace );

  • Namespace.getNamespace("html", "http://www.w3.org/1999/xhtml") associates the prefix html with the HTML namespace http://www.w3.org/1999/xhtml.

  • "//html:img" searches for <img> elements belonging to the namespace http://www.w3.org/1999/xhtml bound to the html prefix.

  • new ElementFilter() selects only Element instances rather than instances of other subclasses of Content.

  • null indicates that no parameters are used. See [jdom-interest] XPath examples for parameterized queries.

  • htmlNamespace supplies the previously defined namespace. The varargs parameter of compile() supports multiple namespace definitions.
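
The effect of the namespace binding may be observed in a small experiment. The following sketch is not the Jdom code from above: it assumes the JDK's javax.xml.xpath API, where a NamespaceContext takes the role of Jdom's Namespace argument. The class name NamespaceAwareSearch and its count() method are hypothetical, and the page is a condensed version of Figure 761.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class NamespaceAwareSearch {

  static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

  // Condensed version of the gallery page from Figure 761.
  static final String GALLERY =
      "<html xmlns='" + XHTML_NS + "'><body>"
    + "<p><img src='inline.gif' alt='none'/></p>"
    + "<table><tbody>"
    + "<tr><td><img src='one.gif' alt='none'/></td></tr>"
    + "<tr><td><img src='http://www.hdm-stuttgart.de/favicon.ico' alt='none'/></td></tr>"
    + "</tbody></table></body></html>";

  // Associates the prefix html with the XHTML namespace.
  static final NamespaceContext HTML_PREFIX = new NamespaceContext() {
    @Override public String getNamespaceURI(final String prefix) {
      return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
    }
    @Override public String getPrefix(final String uri) {
      return null;
    }
    @Override public Iterator<String> getPrefixes(final String uri) {
      return Collections.emptyIterator();
    }
  };

  // Number of nodes selected by the given XPath expression.
  static int count(final String expression) {
    try {
      final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      final Document doc =
          dbf.newDocumentBuilder().parse(new InputSource(new StringReader(GALLERY)));

      final XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(HTML_PREFIX);
      final NodeList hits =
          (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
      return hits.getLength();
    } catch (final Exception e) {
      throw new IllegalStateException("Unable to evaluate " + expression, e);
    }
  }

  public static void main(final String[] args) {
    System.out.println("//img: " + count("//img"));
    System.out.println("//html:img: " + count("//html:img"));
  }
}
```

The unprefixed expression //img only matches elements outside any namespace and thus selects nothing, whereas //html:img selects all three images.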

Figure 767. Searching for images
public List<Element> process(final String xhtmlFilename)... {
  final Document htmlInput = builder.build(xhtmlFilename);
  return xpathSearchImg.evaluate(htmlInput);
}

Invocation:

new DomXpath().process("src/main/resources/gallery.html").
      stream().
      map(img -> img.getAttributeValue("src")).
      reduce((l, r) -> l.concat(", ").concat(r)).
      ifPresent(System.out::println);

Result:

inline.gif, one.gif, http://www.hdm-stuttgart.de/favicon.ico

exercise No. 19

Verification of referenced images readability

Q:

We want to extend the example given in Figure 767, “Searching for images”, by testing referenced images for existence and readability. The following HTML document contains dead image references:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>External Pictures</title>
  </head>
  <body>
    <h1>External Pictures</h1>
    <p>A local image reference:<img src="inline.gif" alt="none"/></p>
    <p>What about ftp?<img src="ftp://inexistent.com/q.png" alt="none"/></p>
    <table>
      <tbody>
        <tr>
          <td>An existing picture:</td>
          <td><img
             src="https://www.hdm-stuttgart.de/bilder_navigation/laptop.gif"
             alt="none"/></td>
        </tr>
        <tr>
          <td>A non-existing picture:</td>
          <td><img
              src="http://www.hdm-stuttgart.de/rotfl.gif"
              alt="none"/></td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

Write an application which checks the readability of image URL references to external servers starting with either http:// or https://, ignoring other protocol types. Internal image references referring to the current server typically look like <img src="/images/test.gif"/>. In order to distinguish these two types of references we may use the built-in XPath function starts-with() to test for the http or https protocol part of a URL. ftp addresses shall be ignored completely. A possible output corresponding to the above example reads:

xpath.CheckUrl (CheckUrl.java:48) - Protocol 'ftp' not yet implemented
ftp://inexistent.com/q.png, HTTP Status: false
https://www.hdm-stuttgart.de/bilder_navigation/laptop.gif, HTTP Status: true
http://www.hdm-stuttgart.de/rotfl.gif, HTTP Status: false

Caution

Handling HTTP response codes is tricky. Accessing http://www.hdm-stuttgart.de/rotfl.gif actually yields a 302 (Found) status code redirecting to an error page, although the resource is in fact unavailable.

Moreover, a web server may return misleading response codes if it decides that your user agent is unable to handle the content type of the requested resource. You may catch a glimpse of related problems by reading How to check if a URL exists or returns 404 with Java?

For the current exercise we will refrain from digging deeper into the subject: your application shall regard all non-200 responses as unsuccessful, ignoring the possibility of successful redirects completely.

Do not forget to provide unit tests.

Tip

Using XPath expressions in conjunction with namespaces requires appropriate definitions. The following two pages are helpful:

A:

We are interested in the set of images within a given HTML document containing a URL reference starting with either of:

  • http://

  • https://

  • ftp://

This may be achieved by the following XPath expression:

//html:img[starts-with(@src, 'http://') or
        starts-with(@src, 'https://') or starts-with(@src, 'ftp://')]

Checking for reachability happens in:

case "http":
case "https":
  try {
    final HttpURLConnection huc = (HttpURLConnection) url.openConnection();
    huc.setRequestMethod("GET");
    huc.setInstanceFollowRedirects(false); // report 3xx responses rather than following them
    huc.connect();
    return 200 == huc.getResponseCode(); // only 200 counts as success, redirects are ignored
  } catch (final IOException e) {
    log.error("Unable to connect to " + urlRef, e);
  }
  break;

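
The status code policy may be tried out without depending on external servers. The following sketch starts a throwaway server on the local machine using the JDK's com.sun.net.httpserver package and probes three paths. The class name StatusCheckSketch, its methods and the paths /ok.gif, /moved.gif and /missing.gif are assumptions made for this illustration and do not appear in the exercise's solution.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.URI;
import java.util.Arrays;

final class StatusCheckSketch {

  // Mirrors the check above: only a plain 200 counts, redirects are not followed.
  static boolean isReadable(final String urlRef) {
    try {
      final HttpURLConnection huc =
          (HttpURLConnection) URI.create(urlRef).toURL().openConnection();
      huc.setRequestMethod("GET");
      huc.setInstanceFollowRedirects(false);
      try {
        return HttpURLConnection.HTTP_OK == huc.getResponseCode();
      } finally {
        huc.disconnect();
      }
    } catch (final IOException e) {
      return false; // connection problems count as "not readable"
    }
  }

  // Probes an existing, a redirected and a missing resource on a local test server.
  static boolean[] probeLocalServer() {
    try {
      final HttpServer server = HttpServer.create(
          new InetSocketAddress(InetAddress.getLoopbackAddress(), 0), 0);

      // A length of -1 indicates a response without body.
      server.createContext("/ok.gif", exchange -> {
        exchange.sendResponseHeaders(200, -1);
        exchange.close();
      });
      server.createContext("/moved.gif", exchange -> {
        exchange.getResponseHeaders().add("Location", "/ok.gif");
        exchange.sendResponseHeaders(302, -1);
        exchange.close();
      });
      server.createContext("/", exchange -> { // catches all remaining paths
        exchange.sendResponseHeaders(404, -1);
        exchange.close();
      });

      server.start();
      try {
        final String base = "http://127.0.0.1:" + server.getAddress().getPort();
        return new boolean[] {
            isReadable(base + "/ok.gif"),
            isReadable(base + "/moved.gif"),
            isReadable(base + "/missing.gif")};
      } finally {
        server.stop(0);
      }
    } catch (final IOException e) {
      throw new IllegalStateException("Unable to start the local test server", e);
    }
  }

  public static void main(final String[] args) {
    System.out.println(Arrays.toString(probeLocalServer()));
  }
}
```

Under these assumptions only /ok.gif is reported as readable. The redirect from /moved.gif counts as a failure even though its target exists, which is exactly the simplification the exercise asks for.
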
Figure 768. Parameterized search expressions
Map<String, Object> xpathVarsNamespacePrefix = new HashMap<>();
xpathVarsNamespacePrefix.put("cssClass", null) ;
...
XPathExpression<Element> searchCssClass = XPathFactory.instance().compile(
  "//html:*[@class = $cssClass]",
  new ElementFilter(), xpathVarsNamespacePrefix, htmlNamespace);

searchCssClass.setVariable("cssClass", "header");
searchCssClass.evaluate(htmlInput) ...

// Reuse by changing $cssClass
searchCssClass.setVariable("cssClass", "footer");
searchCssClass.evaluate(htmlInput) ...
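
For comparison, the same idea may be sketched using the JDK's javax.xml.xpath API rather than Jdom. An XPathVariableResolver supplies the value of $cssClass at evaluation time, so a single compiled expression serves several queries. The sample page, the class name ParameterizedSearch and its countByClass() method are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParameterizedSearch {

  static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

  // Hypothetical page containing one header and two content elements.
  static final String PAGE =
      "<html xmlns='" + XHTML_NS + "'><body>"
    + "<div class='header'>Top</div>"
    + "<p class='content'>First</p><p class='content'>Second</p>"
    + "</body></html>";

  static final NamespaceContext HTML_PREFIX = new NamespaceContext() {
    @Override public String getNamespaceURI(final String prefix) {
      return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
    }
    @Override public String getPrefix(final String uri) {
      return null;
    }
    @Override public Iterator<String> getPrefixes(final String uri) {
      return Collections.emptyIterator();
    }
  };

  // Number of matching elements for each of the given class values.
  static List<Integer> countByClass(final String... cssClasses) {
    try {
      final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      final Document doc =
          dbf.newDocumentBuilder().parse(new InputSource(new StringReader(PAGE)));

      final Map<String, Object> variables = new HashMap<>();
      final XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(HTML_PREFIX);
      // The resolver must be registered before compiling the expression.
      xpath.setXPathVariableResolver(name -> variables.get(name.getLocalPart()));

      final XPathExpression search = xpath.compile("//html:*[@class = $cssClass]");

      final List<Integer> counts = new ArrayList<>();
      for (final String cssClass : cssClasses) {
        variables.put("cssClass", cssClass); // reuse by changing $cssClass
        final NodeList hits = (NodeList) search.evaluate(doc, XPathConstants.NODESET);
        counts.add(hits.getLength());
      }
      return counts;
    } catch (final Exception e) {
      throw new IllegalStateException("Unable to evaluate the search expression", e);
    }
  }

  public static void main(final String[] args) {
    System.out.println(countByClass("header", "content", "footer"));
  }
}
```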

exercise No. 20

HTML internal reference verification

Q:

Consider the following sample document:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Internal references sample</title>
  </head>

  <body><h1 id="start">Introduction</h1><p>We categorize for <a
  href="#nativeExec">native</a> and <a href="#vmBased">VM based</a> <a
  href="https://en.wikipedia.org/wiki/Runtime_system">runtimes</a>.</p><h1
  id="languages">Languages</h1><dl>
      <dt id="nativeExec">Native execution code</dt>

      <dd><ul>
          <li>C</li>

          <li>C++</li>

          <li>FORTRAN, see <a href="#noexist">new section</a>.</li>
        </ul></dd>

      <dt id="vmBased">Virtual machine based</dt>

      <dd><ul>
          <li>Java</li>

          <li>Python, see <a href="#newSection">second new section</a>.</li>

          <li>C#</li>
        </ul></dd>
    </dl></body>
</html>

This document defines both anchor (target) elements like <h1 id="start"> and local references like <a href="#vmBased">.

Note that <a href="https://en.wikipedia.org/wiki/Runtime_system"> is not a local reference: its href value does not start with a hash # and thus denotes an external address rather than a document-local one.

Some of the local references like <a href="#noexist"> are ill-defined: there is no matching target element <... id="noexist">. Write an application which identifies dead local references:

matching target id 'nativeExec' found
matching target id 'vmBased' found
Error: matching target id 'noexist' not found
Error: matching target id 'newSection' not found

One possible strategy is:

  1. Search for all local <a href="#..."> references.

  2. For each reference search for a corresponding anchor.

Both parts may be implemented using XPath expressions. For the second task you are asked to reuse your XPathExpression using the technique being described in Figure 768, “Parameterized search expressions ”.

A:

With h denoting the HTML namespace prefix, we search for local references using:

//h:*[starts-with(@href, '#')]

This task is quite similar to Verification of referenced images readability. We create a reusable XPath expression searching for targets:

"//h:*[starts-with(@id, $" + ID_VAR_KEY + ")]"

With the constant ID_VAR_KEY holding the value targetId, this string evaluates to //h:*[starts-with(@id, $targetId)]. The query parameter $targetId is set each time prior to evaluating the expression in CheckLocalReferences:

searchTargetId.setVariable(ID_VAR_KEY, id);
final int targetCount = searchTargetId.evaluate(htmlInput).size();
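
As a cross-check, dead references may also be found without a parameterized expression. The following sketch is an alternative strategy and not the solution described above: it collects all id values into a set and then tests each local reference for exact membership. Note that a starts-with() test would also accept a target id merely beginning with the referenced name. The sketch assumes the JDK's javax.xml.xpath API, a condensed copy of the sample document and the hypothetical class name DeadLocalReferences.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DeadLocalReferences {

  static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

  // Condensed version of the exercise's sample document.
  static final String PAGE =
      "<html xmlns='" + XHTML_NS + "'><body>"
    + "<h1 id='start'>Introduction</h1>"
    + "<p><a href='#nativeExec'>native</a> <a href='#vmBased'>VM based</a> "
    + "<a href='https://en.wikipedia.org/wiki/Runtime_system'>runtimes</a></p>"
    + "<h1 id='languages'>Languages</h1><dl>"
    + "<dt id='nativeExec'>Native execution code</dt>"
    + "<dd><a href='#noexist'>new section</a></dd>"
    + "<dt id='vmBased'>Virtual machine based</dt>"
    + "<dd><a href='#newSection'>second new section</a></dd>"
    + "</dl></body></html>";

  static final NamespaceContext H_PREFIX = new NamespaceContext() {
    @Override public String getNamespaceURI(final String prefix) {
      return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
    }
    @Override public String getPrefix(final String uri) {
      return null;
    }
    @Override public Iterator<String> getPrefixes(final String uri) {
      return Collections.emptyIterator();
    }
  };

  // Referenced ids (without the leading #) having no matching target element.
  static List<String> deadReferences(final String xhtml) {
    try {
      final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      final Document doc =
          dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
      final XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(H_PREFIX);

      // Step 1: collect the id values of all potential targets.
      final Set<String> ids = new HashSet<>();
      final NodeList idAttributes =
          (NodeList) xpath.evaluate("//h:*/@id", doc, XPathConstants.NODESET);
      for (int i = 0; i < idAttributes.getLength(); i++) {
        ids.add(idAttributes.item(i).getNodeValue());
      }

      // Step 2: test each local reference for an exactly matching id.
      final List<String> dead = new ArrayList<>();
      final NodeList hrefs = (NodeList) xpath.evaluate(
          "//h:*[starts-with(@href, '#')]/@href", doc, XPathConstants.NODESET);
      for (int i = 0; i < hrefs.getLength(); i++) {
        final String id = hrefs.item(i).getNodeValue().substring(1); // strip '#'
        if (!ids.contains(id)) {
          dead.add(id);
        }
      }
      return dead;
    } catch (final Exception e) {
      throw new IllegalStateException("Unable to check local references", e);
    }
  }

  public static void main(final String[] args) {
    deadReferences(PAGE).forEach(id ->
        System.out.println("Error: matching target id '" + id + "' not found"));
  }
}
```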