For an R&D project I needed something to help scrape our existing web pages, specifically pages with forms. Preferably it would be a Java library, so it could plug easily into ColdFusion. After trying a few (jTidy, Cobra and two HtmlParsers) I stumbled on Jericho (http://sourceforge.net/projects/jerichohtml/). It did all I needed and then some. It did a great job parsing HTML and getting the values I was after. The project's Javadocs helped a lot.
There is only one .jar file (as of this writing, jericho-html-2.6.jar), so put it somewhere in the CLASSPATH. To check that you put it in the right place, open the ColdFusion 8 Administrator, look in Settings Summary, and see if it's listed under Java Class Path.
I didn't go as far as writing a full-on wrapper for it. I was after form fields, so here is the code that gets what I need.
Here is the code for parseFormValues(), where you can see some of the API Jericho provides in action. As I looked through this, I noticed that where I collect list values with listAppend(), any value that contains a comma would create a problem, because the comma is also the default list delimiter.
So keep it in mind if you plan to use it.
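To make the comma problem concrete, here is a small standalone Java sketch. It is not the code from this project and it does not use Jericho; the class name, method names and sample values are all made up for illustration. It only mimics what appending to a comma-delimited list does, and compares that with keeping each value as its own element.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: shows why comma-delimited lists break
// when a collected value contains a comma itself.
public class CommaListDemo {

    // Mimics appending to a comma-delimited list (assumption: default "," delimiter).
    public static String appendDelimited(String list, String value) {
        return list.isEmpty() ? value : list + "," + value;
    }

    // Collects all values into one comma-delimited string.
    public static String collectDelimited(String[] values) {
        String list = "";
        for (String v : values) {
            list = appendDelimited(list, v);
        }
        return list;
    }

    // Collects the same values into a real list, one element per value.
    public static List<String> collectAsList(String[] values) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            out.add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up form values; the second one contains a comma.
        String[] values = {"red", "Smith, John", "blue"};

        String delimited = collectDelimited(values);
        System.out.println(delimited);                       // red,Smith, John,blue
        System.out.println(delimited.split(",").length);     // 4 items, not 3
        System.out.println(collectAsList(values).size());    // 3 items
    }
}
```

Three values went in, but splitting the delimited string gives four. On the ColdFusion side, the equivalent fix would probably be to collect the values in an array, or to pass listAppend() a delimiter that cannot show up in the form values.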
I am sure there are some other improvements that can be made since it's a first pass at this.
A note on "this" scope usage: the component this code is in extends BaseComponent (thank you, Hal Helms),
with a generic (can you say "lazy" :-) ) set and get implemented with onMissingMethod. You have to use "this" for it to work inside the component.
I did find accidentally later in the project that using this (ooh cool pun) technique is slower than actually creating a setter and a getter, which kind of makes sense.