DOM and XPath

Note that the XPath expression in <xsl:for-each select="//html:img"> must include the html namespace by prefix. A plain select="//img" yields an empty node set. Executing the XSL script yields the list of image filenames referenced in the HTML page, i.e. inline.gif one.gif two.gif.

In preparation for an application checking image accessibility, we want to rewrite the above XSL as a Java application. A simple approach would pipe the XSL output to our application, which then executes the readability checks. Instead, we implement an XPath-based search within our Java application. To mirror the XSL actions as closely as possible, our application searches for Element nodes using the XPath expression //html:img.

• Associating the prefix html with the HTML namespace http://www.w3.org/1999/xhtml.

• Searching for elements belonging to the namespace http://www.w3.org/1999/xhtml linked by the html prefix.

• Selecting only Element instances rather than other subclassed objects below Content.

• Using no parameters. See [jdom-interest] XPath examples for parameterized queries.

• Using the previously defined namespace. The ellipsis in compile supports multiple namespace definitions.
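The effect of the namespace prefix can be reproduced in a small standalone program. The following is a minimal sketch that uses the JDK's own javax.xml.xpath API instead of JDOM, so it runs without extra dependencies. The class name ImageSearchSketch, the method imageSources and the sample page are assumptions made for illustration and are not part of the original application.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ImageSearchSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample page (assumption), not the original document
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
        + "<p><img src='inline.gif' alt='none'/></p>"
        + "<p><img src='one.gif' alt='none'/><img src='two.gif' alt='none'/></p>"
        + "</body></html>";

    /** Evaluates the expression and returns the src values of all matching elements. */
    static List<String> imageSources(final String xhtml, final String expression) {
        try {
            final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // the default factory is not namespace aware
            final Document doc =
                dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));

            final XPath xpath = XPathFactory.newInstance().newXPath();
            // Associate the prefix html with the XHTML namespace
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(final String prefix) {
                    return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                @Override public String getPrefix(final String uri) {
                    return XHTML_NS.equals(uri) ? "html" : null;
                }
                @Override public Iterator<String> getPrefixes(final String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("html").iterator()
                        : Collections.<String>emptyIterator();
                }
            });

            final NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            final List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("src"));
            }
            return result;
        } catch (final Exception e) {
            throw new IllegalStateException("Unable to evaluate " + expression, e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(imageSources(SAMPLE, "//html:img")); // [inline.gif, one.gif, two.gif]
        System.out.println(imageSources(SAMPLE, "//img"));      // []
    }
}
```

The second call shows why the prefix matters: in XPath 1.0 an unprefixed name only matches elements in no namespace, so //img finds nothing in an XHTML document.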

No. 19

Q:

We want to extend the example given in Figure 800, “Searching for images”, by testing the existence and readability of referenced images. The following HTML document contains dead image references:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<title>External Pictures</title>
<body>
<h1>External Pictures</h1>
<p>A local image reference:<img src="inline.gif" alt="none"/></p>
<table>
<tbody>
<tr>
<td>An existing picture:</td>
<td><img
alt="none"/></td>
</tr>
<tr>
<td>A non-existing picture:</td>
<td><img
src="http://www.hdm-stuttgart.de/rotfl.gif"
alt="none"/></td>
</tr>
</tbody>
</table>
</body>
</html>

Write an application which checks the readability of image URL references to external servers starting with either http:// or https://, ignoring other protocol types. Internal image references referring to the current server typically look like <img src="/images/test.gif"/>. To distinguish these two types of references we may use the XPath built-in function starts-with(), testing for the http or https protocol part of a URL. ftp addresses shall be ignored completely. A possible output corresponding to the above example reads:

xpath.CheckUrl (CheckUrl.java:48) - Protocol 'ftp' not yet implemented
ftp://inexistent.com/q.png, HTTP Status: false
http://www.hdm-stuttgart.de/rotfl.gif, HTTP Status: false

Caution

Handling HTTP response codes is tricky. Accessing http://www.hdm-stuttgart.de/rotfl.gif actually yields a 302 (Found) status code redirecting to an error page, although the resource itself is unavailable.

Moreover, a web server may return misleading response codes if it decides that your user agent is unable to handle the content type of the requested resource. You may catch a glimpse of related problems by reading “How to check if a URL exists or returns 404 with Java?”.

For the current exercise we will refrain from digging deeper into the subject: your application shall regard all non-200 responses as unsuccessful, ignoring the possibility of successful redirects completely.

Do not forget to provide unit tests.

Tip

Using XPath expressions in conjunction with namespaces requires appropriate definitions. The following two pages are helpful:

A:

We are interested in the set of images within a given HTML document whose URL reference starts with one of:

• http://

• https://

• ftp://

This may be achieved by the following XPath expression:

//html:img[starts-with(@src, 'http://') or
starts-with(@src, 'https://') or starts-with(@src, 'ftp://')]
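To see which references the filter selects, the expression can be run against a small page. This is a sketch under assumptions: it uses the JDK's javax.xml.xpath API rather than JDOM, and the class name ExternalImageFilterSketch, the sample page and the https://example.org address are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExternalImageFilterSketch {

    static final String EXTERNAL_IMAGES =
        "//html:img[starts-with(@src, 'http://') or "
        + "starts-with(@src, 'https://') or starts-with(@src, 'ftp://')]";

    // Hypothetical sample page (assumption) mixing local and external references
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
        + "<img src='inline.gif' alt='none'/>"
        + "<img src='/images/test.gif' alt='none'/>"
        + "<img src='https://example.org/logo.png' alt='none'/>"
        + "<img src='http://www.hdm-stuttgart.de/rotfl.gif' alt='none'/>"
        + "<img src='ftp://inexistent.com/q.png' alt='none'/>"
        + "</body></html>";

    /** Returns the src values of all images matching EXTERNAL_IMAGES. */
    static List<String> externalSources(final String xhtml) {
        try {
            final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            final Document doc =
                dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));

            final XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(final String prefix) {
                    return "html".equals(prefix)
                        ? "http://www.w3.org/1999/xhtml" : XMLConstants.NULL_NS_URI;
                }
                // The reverse lookups are not needed for this sketch
                @Override public String getPrefix(final String uri) {
                    throw new UnsupportedOperationException();
                }
                @Override public Iterator<String> getPrefixes(final String uri) {
                    throw new UnsupportedOperationException();
                }
            });

            final NodeList nodes =
                (NodeList) xpath.evaluate(EXTERNAL_IMAGES, doc, XPathConstants.NODESET);
            final List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(((Element) nodes.item(i)).getAttribute("src"));
            }
            return result;
        } catch (final Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(final String[] args) {
        externalSources(SAMPLE).forEach(System.out::println);
    }
}
```

The two local references inline.gif and /images/test.gif do not match any of the three starts-with() tests and are therefore skipped.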

Checking for reachability happens in:

case "http":
case "https":
    try {
        final HttpURLConnection huc = (HttpURLConnection) url.openConnection();
        huc.setRequestMethod("GET");
        huc.setInstanceFollowRedirects(false);
        huc.connect();
        return 200 == huc.getResponseCode(); // ignore redirects
    } catch (final IOException e) {
        log.error("Unable to connect to " + urlRef, e);
    }
    break;
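The fragment above depends on its surrounding switch statement and logger. A self-contained variant of the same idea might look like the following sketch. The class name HttpStatusCheck, the method isOk and the five second timeouts are assumptions, not taken from the original CheckUrl class.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class HttpStatusCheck {

    /** Returns true only if a GET request to urlRef is answered with status 200. */
    static boolean isOk(final String urlRef) {
        final URI uri;
        try {
            uri = URI.create(urlRef);
        } catch (final IllegalArgumentException e) {
            return false; // not a syntactically valid URI
        }
        final String scheme = uri.getScheme();
        if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) {
            return false; // other protocols such as ftp are not checked
        }
        HttpURLConnection huc = null;
        try {
            huc = (HttpURLConnection) uri.toURL().openConnection();
            huc.setRequestMethod("GET");
            huc.setInstanceFollowRedirects(false); // a 302 counts as unsuccessful
            huc.setConnectTimeout(5000);           // assumed timeout in milliseconds
            huc.setReadTimeout(5000);              // assumed timeout in milliseconds
            return HttpURLConnection.HTTP_OK == huc.getResponseCode();
        } catch (final IOException | IllegalArgumentException e) {
            return false; // unreachable hosts are treated like non-200 responses
        } finally {
            if (huc != null) {
                huc.disconnect();
            }
        }
    }

    public static void main(final String[] args) {
        for (final String ref : args) {
            System.out.println(ref + ", HTTP Status: " + isOk(ref));
        }
    }
}
```

Disabling redirect following keeps the check in line with the rule stated above: only a direct 200 response counts as success.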

No. 20

HTML internal reference verification

Q:

Consider the following sample document:

Internal references sample

Introduction

We categorize for native and VM based runtimes.

Languages

Native execution code
Virtual machine based
This document defines both anchor (target) elements carrying an id attribute and local references whose href attribute value starts with a hash #. A reference that is not highlighted does not start with a hash # and is thus not a document-local but an external address.

Some of these local references are ill-defined: there is no matching target element <... id="#vmBased">. Write an application which allows for identifying “dead” local references:

matching target id 'nativeExec' found
matching target id 'vmBased' found
Error: matching target id 'noexist' not found
Error: matching target id 'newSection' not found

One possible strategy is:

• Search for all local references.

• For each reference search for a corresponding anchor.

Both parts may be implemented using XPath expressions. For the second task you are asked to reuse your XPathExpression using the technique described in Figure 801, “Parameterized search expressions”.

A:

The Maven module source code is available in the subdirectory P/Sda1/VerifyInternalReferences below the lecture notes' source code root, see hints regarding import. Online browsing of API and implementation.

With h denoting the HTML namespace prefix, we search for local references using:

//h:*[starts-with(@href, '#')]

This task is quite similar to Verification of referenced images readability. We create a reusable XPath expression searching for targets:

"//h:*[starts-with(@id, $" + ID_VAR_KEY + ")]"

Resolving the variable ID_VAR_KEY, this actually contains //h:*[starts-with(@id,$targetId)]. The query parameter $targetId will be set each time prior to executing the path expression in CheckLocalReferences:

searchTargetId.setVariable(ID_VAR_KEY, id);
final int targetCount = searchTargetId.evaluate(htmlInput).size();
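The two-step strategy can also be sketched with the JDK's javax.xml.xpath API, where an XPathVariableResolver plays the role of JDOM's setVariable. This is not the code from the Maven module: the class name LocalReferenceSketch, the method check and the sample page are hypothetical, and the sample is only modelled on the ids appearing in the expected output.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LocalReferenceSketch {

    // Hypothetical sample page (assumption), not the original document
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
        + "<h2 id='nativeExec'>Native execution code</h2>"
        + "<h2 id='vmBased'>Virtual machine based</h2>"
        + "<p><a href='#nativeExec'>a</a><a href='#vmBased'>b</a>"
        + "<a href='#noexist'>c</a><a href='#newSection'>d</a>"
        + "<a href='http://example.org/'>external</a></p>"
        + "</body></html>";

    /** Returns one report line per local reference found in the document. */
    static List<String> check(final String xhtml) {
        try {
            final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            final Document doc =
                dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));

            final XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override public String getNamespaceURI(final String prefix) {
                    return "h".equals(prefix)
                        ? "http://www.w3.org/1999/xhtml" : XMLConstants.NULL_NS_URI;
                }
                // The reverse lookups are not needed for this sketch
                @Override public String getPrefix(final String uri) {
                    throw new UnsupportedOperationException();
                }
                @Override public Iterator<String> getPrefixes(final String uri) {
                    throw new UnsupportedOperationException();
                }
            });

            // Mutable holder read by the resolver on each evaluation of $targetId
            final String[] targetId = new String[1];
            xpath.setXPathVariableResolver(
                name -> "targetId".equals(name.getLocalPart()) ? targetId[0] : null);

            // Both expressions are compiled once and then reused
            final XPathExpression searchRefs = xpath.compile("//h:*[starts-with(@href, '#')]");
            final XPathExpression searchTargetId =
                xpath.compile("//h:*[starts-with(@id, $targetId)]");

            final List<String> report = new ArrayList<>();
            final NodeList refs = (NodeList) searchRefs.evaluate(doc, XPathConstants.NODESET);
            for (int i = 0; i < refs.getLength(); i++) {
                final String id = ((Element) refs.item(i)).getAttribute("href").substring(1);
                targetId[0] = id; // set the parameter before evaluating
                final int targetCount = ((NodeList)
                    searchTargetId.evaluate(doc, XPathConstants.NODESET)).getLength();
                report.add(targetCount > 0
                    ? "matching target id '" + id + "' found"
                    : "Error: matching target id '" + id + "' not found");
            }
            return report;
        } catch (final Exception e) {
            throw new IllegalStateException("Reference check failed", e);
        }
    }

    public static void main(final String[] args) {
        check(SAMPLE).forEach(System.out::println);
    }
}
```

The sketch keeps the starts-with() comparison from the text. An exact comparison such as @id = $targetId would be stricter, since starts-with() also accepts an id that merely begins with the searched value.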