Cleaning up HTML.

Exercise No. 18

Q:

Consider the following HTML legacy document:

<html xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>A simple image example</title>
  </head>
  <body>
    <img src='a.gif' align='right'/> <!-- Error: pre-HTML5 style -->

    <p>Some inline image without alignment <img src="b.gif"/></p>
    <p>Some inline image with alignment <img src="c.gif" align="bottom"/><!-- Error: pre-HTML5 style --></p>
  </body>
</html>

The pre-HTML5 align='...' attribute is deprecated. Its CSS replacement is style="float: ...;" for the values left and right, and style="vertical-align: ...;" for inline alignment values such as bottom:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>A simple image example</title>
  </head>
  <body>
    <img src="a.gif" style="float: right" />

    <p>Some inline image without alignment <img src="b.gif" /></p>
    <p>Some inline image with alignment <img src="c.gif" style="vertical-align: bottom;" /></p>
  </body>
</html>

Write a JDOM-based filter application which transforms these legacy declarations to their CSS equivalents.

Tip

A possible road map:

  1. Start with an identity transformation: parse your HTML document into a DOM tree and simply serialize this tree to standard output.

  2. Modify the intermediate DOM tree. The recursive descent method from the section called “Visualizing XML document elements” allows for retrieving all <img .../> elements.

  3. The getAttribute(...) method allows for identifying relevant <img align='...'/> elements.

  4. You may then modify <img align='...'/> elements using removeAttribute(...) and setAttribute(...).
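The road map above can be sketched as follows. Note that this is not the JDOM solution the exercise asks for: to stay runnable without extra libraries, the sketch swaps in the JDK's built-in W3C DOM API (org.w3c.dom), whose Element interface offers getAttribute(...), removeAttribute(...) and setAttribute(...) as well. The class name ImgAlignFilter and the helpers toCss, rewrite and filter are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch only: uses the JDK's W3C DOM instead of JDOM (an assumption made
// so that the example runs without third-party jars).
class ImgAlignFilter {

    /** Maps a legacy align value to a CSS declaration (hypothetical helper). */
    static String toCss(String align) {
        String value = align.trim().toLowerCase();
        if (value.equals("left") || value.equals("right")) {
            return "float: " + value + ";";
        }
        return "vertical-align: " + value + ";";
    }

    /** Step 2: recursive descent visiting every node below the given one. */
    static void rewrite(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            // Step 3: identify <img align='...'/> elements.
            if ("img".equals(element.getLocalName()) && element.hasAttribute("align")) {
                String align = element.getAttribute("align");
                // Step 4: replace align by style (overrides an existing style).
                element.removeAttribute("align");
                element.setAttribute("style", toCss(align));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            rewrite(child);
        }
    }

    /** Step 1 plus modification: parse, rewrite the tree, serialize it again. */
    static String filter(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document document = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            rewrite(document.getDocumentElement());
            // A Transformer without a style sheet acts as an identity transformation.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not filter document", e);
        }
    }

    public static void main(String[] args) {
        String legacy = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<img src='a.gif' align='right'/>"
                + "<p>Some text <img src='c.gif' align='bottom'/></p>"
                + "</body></html>";
        System.out.println(filter(legacy));
    }
}
```

A JDOM variant would presumably follow the same four steps, with the JDOM parser and outputter classes taking the place of DocumentBuilderFactory and Transformer.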

Optional: Supply an XSLT style sheet that does the same job as your Java application and compare the two solutions. You may want to read The Identity Transform. This enables you to:

  1. Copy most HTML from input to output like in your Java solution.

  2. Handle relevant <img align='...'/> elements separately by defining an extra template.
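The two XSLT steps above might look like the following sketch. It is not the course's html2html.xsl but an illustrative style sheet written for this example, embedded in a small Java driver that uses the JDK's javax.xml.transform API. The class name Html2HtmlXslt, the method transform and the namespace prefix h are assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Sketch only: an illustrative style sheet, not the original html2html.xsl.
class Html2HtmlXslt {

    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + "    xmlns:h='http://www.w3.org/1999/xhtml'>"
        // Template 1: the identity transform copies everything unchanged.
        + "  <xsl:template match='@*|node()'>"
        + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "  </xsl:template>"
        // Template 2: an align attribute of <img> becomes a style attribute.
        + "  <xsl:template match='h:img/@align'>"
        + "    <xsl:attribute name='style'>"
        + "      <xsl:choose>"
        + "        <xsl:when test=\". = 'left' or . = 'right'\">"
        +            "float: <xsl:value-of select='.'/>;</xsl:when>"
        + "        <xsl:otherwise>"
        +            "vertical-align: <xsl:value-of select='.'/>;</xsl:otherwise>"
        + "      </xsl:choose>"
        + "    </xsl:attribute>"
        + "  </xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the style sheet to an XHTML string and returns the result. */
    static String transform(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String legacy = "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
                + "<img src='a.gif' align='right'/>"
                + "<p>Some text <img src='c.gif' align='bottom'/></p>"
                + "</body></html>";
        System.out.println(transform(legacy));
    }
}
```

The more specific pattern h:img/@align takes precedence over the generic identity template, so only the legacy attribute is treated differently. Since the input declares the XHTML namespace, the match pattern needs a prefix bound to that namespace; a plain img would not match.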

A:

The above solution contains a remarkably small html2html.xsl style sheet.

Caution

Neither solution variant accounts for <img ... align='...' style='...'/> elements that already define a style attribute: any existing value will be overridden. It is, however, straightforward to extend the current solution so that it appends to the style attribute's value rather than overriding it.
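One way to append rather than override is sketched below. The class StyleMerger and its method appendDeclaration are hypothetical names; the helper treats both null and an empty string as "no style present", since DOM APIs differ in what they return for a missing attribute.

```java
// Sketch only: a hypothetical helper for merging a new CSS declaration
// into an existing style attribute value.
class StyleMerger {

    /**
     * Returns the existing style value with the new declaration appended.
     * A missing (null) or blank existing value yields just the declaration.
     */
    static String appendDeclaration(String existingStyle, String declaration) {
        if (existingStyle == null || existingStyle.trim().isEmpty()) {
            return declaration;
        }
        String base = existingStyle.trim();
        if (!base.endsWith(";")) {
            base += ";"; // terminate the last existing declaration
        }
        return base + " " + declaration;
    }

    public static void main(String[] args) {
        System.out.println(appendDeclaration("border: 1px solid", "float: right;"));
        System.out.println(appendDeclaration(null, "vertical-align: bottom;"));
    }
}
```

In the Java filter, the value passed to setAttribute(...) would then be the result of appendDeclaration applied to the element's current style value. The sketch does not detect an existing float or vertical-align declaration; in that case the appended one simply comes last.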