JTidy is a Java implementation of Dave Raggett's HTML Tidy program that "tidies" tag soup to produce well-formed HTML. This article highlights some important integration issues for JTidy, with notes and tips from the Code Style site log.

The JTidy how-to page gives a simple example that processes HTML from a Java InputStream to an OutputStream, using one of several parse methods on the core JTidy class (org.w3c.tidy.Tidy).

JTidy distribution contents

The standard JTidy distribution contains three Java packages as source and compiled classes in %jtidy%/build/Tidy.jar: the JTidy package org.w3c.tidy itself, plus copies of the W3C DOM and SAX packages.

Class compatibility conflict with standard Java distributions

The JTidy package was developed before the W3C DOM and SAX packages were widely available through the JAXP package and the Java Software Development Kit 1.4. Small implementation differences between the DOM and SAX packages in these standard Java distributions and the copies included in the JTidy distribution JAR can cause class compatibility conflicts. To avoid these conflicts, download a recent source snapshot and re-package the org.w3c.tidy directory tree by itself: compile it, copy the important configuration files and create your own JAR file.

The JTidy class Configuration uses the variable name enum, which became a reserved keyword in Java 1.5 and is no longer permitted as an identifier. If necessary, use the -source 1.4 compiler flag to compile without modifying the source code.

C:\dev>D:\java\jdk150\bin\javac -source 1.4 -classpath "c:\dev" -d "c:\dev\classes" C:\dev\org\w3c\tidy\*.java

An important set of Java property files should also be copied into the JTidy class directory before you make a JAR file. The TidyMessages.properties file is the key English language message configuration file; TidyMessages_de.properties and TidyMessages_es.properties are not critical.

copy "C:\dev\org\w3c\tidy\*.properties" "C:\dev\classes\org\w3c\tidy"
      

Change to the classes output directory and archive the contents of the org subdirectory.

cd C:\dev\classes

C:\dev\classes>D:\java\jdk150\bin\jar cf "C:\dev\JTidy.jar" org
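
Once the JAR is built, it can be worth confirming that the message file really made it into the archive. The class below is a minimal sketch of such a check, not part of JTidy: the class name MessageBundleCheck and its method are hypothetical, and it only asks the system class loader whether the resource path described above is visible.

```java
/** Hypothetical helper (not part of JTidy) that looks for a resource on the classpath. */
final class MessageBundleCheck {

    /** Resource path of the English message file copied in the step above. */
    static final String BUNDLE = "org/w3c/tidy/TidyMessages.properties";

    /** Returns true if the named resource is visible to the system class loader. */
    static boolean isOnClasspath(String resourcePath) {
        return ClassLoader.getSystemResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        // Reports whether the message file can be found on the current classpath.
        System.out.println(BUNDLE + (isOnClasspath(BUNDLE) ? " found" : " MISSING"));
    }
}
```

Run it with the new archive on the classpath, for example java -classpath "C:\dev\JTidy.jar;." MessageBundleCheck, assuming the compiled check class is in the current directory.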
      

Ant build target for re-packaging the JTidy JAR

The MKSearch Ant build file includes a target that compiles and packages the org.w3c.tidy package in isolation. The general build.properties and local.properties files define the build variables; the local properties file should be updated with your own file system configuration. The JTidy source is expected to be in a directory named %mksearch%/lib-src/jtidy.

<target
  description="Compile and archive JTidy from source"
  name="jar.jtidy" depends="prepare">
  <mkdir dir="${buildDir}/jtidy"/>
  <javac
    srcdir="${sourceLibDir}/jtidy"
    destdir="${buildDir}/jtidy"
    debug="${debug}"
    deprecation="off"
    optimize="${optimize}"
    verbose="${verbose}"
    source="${source.version}">
    <classpath refid="classpath"/>
    <include name="**/*.java"/>
  </javac>
  <copy
    file="${sourceLibDir}/jtidy/org/w3c/tidy/TidyMessages.properties"
    todir="${buildDir}/jtidy/org/w3c/tidy"
    preservelastmodified="true"
    overwrite="true"
    verbose="false"/>
  <jar
    jarfile="${libDir}/jtidy.jar"
    basedir="${buildDir}/jtidy"/>
</target>
      

Code Style site log entries

The site log entries below note various JTidy developments. JTidy is used as a component in the Metacentric Web Feed Generator system and the MKSearch metadata search engine. In both systems, it cleans up the HTML source and converts it to XHTML so that it can be processed further as XML using XSLT and SAX call-backs.

JTidy wrapper class, 3 April 2004
A JTidy wrapper class can help encapsulate standard clean-up configuration settings and streamline the pre-processing of HTML documents. In this case, a JTidy "driver" class was developed to support thread-based processing.
Custom document type declaration, 6th April 2004
JTidy cannot correct non-standard proprietary elements in HTML, so these are carried through to the tidied output. To process the output further using XML tools, it is advisable to validate the input. JTidy's setFpi(String) method enables you to set your own document type declaration and validate with a custom document type definition. Code Style uses a "Lax" version of the XHTML transitional document type with many non-standard elements and attributes and less rigorous validation of attribute values.
JTidy configuration file option, 15th April 2004
JTidy can take a plain text configuration file instead of multiple Java setter methods, configured with its setConfigurationFromFile(String) method. The string argument is the path to the configuration file.
Repackaged JTidy to avoid conflict with JAXP, 26 May 2004
As detailed above, the Code Style and MKSearch projects have re-packaged JTidy to avoid class compatibility problems with standard Java distributions.
JTidy methods for source broker classes, 2 October 2004
Encapsulating JTidy in a wrapper class makes it much easier to build its features into existing applications by composition. In this case, "un-trusted" HTML source from the Web is passed through JTidy before being handled by other applications.
TidyException handling, 12 March 2005
JTidy's parse method does not throw any exceptions, so the Code Style wrapper class captures its error output stream to detect whether processing was successful.
Upgrade to JTidy CVS version, 11th September 2005
The JTidy project is not very active, but it is worth working from the latest CVS version of the source code because it includes enhancements that may not be present in the standard release version (see software downloads). This update was made to handle poor markup around JavaScript code blocks better.
MKSearch beta 1 release, 2 November 2005
The MKSearch system uses JTidy to clean up HTML and convert to XHTML before extracting and indexing its metadata. This site log entry details the beta release of the system.
UTF-8 output triggers JTidy right single quote bug, 9th February 2006
The Metacentric Web Feed Generator and MKSearch system both process large quantities of source HTML and this can sometimes expose bugs in the JTidy code. This entry identifies a workaround for a problem handling right single quote characters.
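
The configuration file entry above can be illustrated with a small sketch. The option names used here (output-xhtml, numeric-entities, tidy-mark, quiet) are standard HTML Tidy options and should be checked against your JTidy version. The class name TidyConfigSketch is hypothetical. The sketch uses java.util.Properties only to show that "name: value" lines parse as name and value pairs; whether JTidy reads the file in exactly the same way is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical example of a plain text Tidy configuration (not part of JTidy). */
final class TidyConfigSketch {

    /** Example settings, one "name: value" pair per line. */
    static final String CONFIG_TEXT =
            "output-xhtml: yes\n"
          + "numeric-entities: yes\n"
          + "tidy-mark: no\n"
          + "quiet: yes\n";

    /** Parses the text into name/value pairs; Properties accepts ':' as a separator. */
    static Properties parse(String text) {
        Properties settings = new Properties();
        try {
            settings.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return settings;
    }

    public static void main(String[] args) throws IOException {
        // Write the settings to a temporary file and echo what was parsed.
        Path file = Files.createTempFile("tidy", ".properties");
        Files.write(file, CONFIG_TEXT.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("Wrote " + file + " with settings " + parse(CONFIG_TEXT));
        // With JTidy on the classpath, the path would then be passed to the
        // method named in the log entry:
        //   tidy.setConfigurationFromFile(file.toString());
    }
}
```

Keeping the settings in a file like this means the clean-up behaviour can be changed without recompiling the wrapper class.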
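
The wrapper class and TidyException handling entries describe a pattern that can be sketched without reproducing the Code Style source, which is not shown here. Because the parse call does not throw, the wrapper captures the error output in a buffer and inspects it afterwards. Everything named below is hypothetical: the HtmlCleaner interface stands in for the JTidy call so that the sketch compiles without the library, and the "Error:" marker is an assumption about the report text that should be checked against your message file. It needs Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper that turns captured error output into an exception. */
final class CleanupDriver {

    /** Stand-in for the JTidy call; this interface is not part of JTidy. */
    interface HtmlCleaner {
        void clean(InputStream in, OutputStream out, PrintWriter errors) throws IOException;
    }

    /** Raised when the captured report signals that clean-up failed. */
    static final class CleanupException extends RuntimeException {
        CleanupException(String report, Throwable cause) {
            super(report, cause);
        }
    }

    /** Assumed marker for fatal problems in the report text. */
    static final String ERROR_MARKER = "Error:";

    private final HtmlCleaner cleaner;

    CleanupDriver(HtmlCleaner cleaner) {
        this.cleaner = cleaner;
    }

    /** Cleans the given HTML, or throws if the captured report contains the marker. */
    byte[] clean(byte[] html) {
        StringWriter report = new StringWriter();
        PrintWriter errors = new PrintWriter(report);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            // With JTidy, the cleaner body would be roughly:
            //   tidy.setErrout(errors); tidy.parse(in, out);
            cleaner.clean(new ByteArrayInputStream(html), out, errors);
        } catch (IOException e) {
            throw new CleanupException("I/O failure during clean-up", e);
        }
        errors.flush();
        if (report.toString().contains(ERROR_MARKER)) {
            throw new CleanupException(report.toString(), null);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Simulated cleaner that only reports a failure, to exercise the error path.
        CleanupDriver driver = new CleanupDriver((in, out, errors) ->
                errors.println("line 1 column 1 - Error: simulated failure"));
        try {
            driver.clean("<p>tag soup".getBytes(StandardCharsets.UTF_8));
        } catch (CleanupException e) {
            System.out.println("Clean-up failed: " + e.getMessage());
        }
    }
}
```

Callers then deal with a single exception type instead of parsing the report themselves, which is what makes the wrapper easy to build into other applications by composition.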

JTidy with the GNU compiler for Java

The MKSearch project was developed with a view to running on free, open source platforms as well as the Sun Java platform. The project made a number of modifications to the JTidy library to work around known problems with the GNU compiler for Java (GCJ).
