Nov 23, 2009 at 4:05 AM

If you have any feature requests for this project, submit them here and I'll try to accommodate you.

Apr 8, 2010 at 9:57 AM

Would it be possible to implement the HtmlDocument.Normalize() method to format the html for better reading?

Apr 8, 2010 at 12:23 PM

You mean adding all the tab indentation and whatnot? Originally, the GetHtml method extracted the HTML in a way that preserved the whitespace and line breaks of the original. But if you want it fully tabbed and normalized, that can be arranged.
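
To make "fully tabbed and normalized" concrete, here is a rough sketch of the idea. It is written in Python using the standard library's html.parser, not this project's code, and the names Indenter, normalize and VOID_ELEMENTS are made up for illustration. It puts every tag and every text run on its own line, indented by nesting depth.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of elements that never take a closing tag.
VOID_ELEMENTS = {"br", "hr", "img", "input", "meta", "link"}


class Indenter(HTMLParser):
    """Collects one output line per tag or text run, indented by depth."""

    def __init__(self, indent="\t"):
        super().__init__()
        self.indent = indent
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as it was written.
        self.lines.append(self.indent * self.depth + self.get_starttag_text())
        if tag not in VOID_ELEMENTS:
            self.depth += 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the nesting depth.
        self.lines.append(self.indent * self.depth + self.get_starttag_text())

    def handle_endtag(self, tag):
        self.depth = max(self.depth - 1, 0)
        self.lines.append(self.indent * self.depth + "</" + tag + ">")

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.lines.append(self.indent * self.depth + text)


def normalize(html, indent="\t"):
    parser = Indenter(indent)
    parser.feed(html)
    parser.close()  # flushes any text still buffered at the end of the input
    return "\n".join(parser.lines)
```

Calling normalize on a one-line document returns the same markup spread over several indented lines. Note that this throws away the original whitespace, which is the opposite of what GetHtml was designed to do.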

Apr 8, 2010 at 2:05 PM

Thanks. I created a Normalize method myself and uploaded it as a patch. Hope it can be useful. Also, I made a change to the HtmlFormatter class to avoid unneeded spaces inside html elements that do not have attributes.
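
The spacing change boils down to something like the sketch below. It is Python for illustration only, not the actual patch, and format_start_tag is a hypothetical name: the space after the element name is only written when there is at least one attribute to follow it.

```python
import html


def format_start_tag(name, attributes=None):
    """Render a start tag; hypothetical helper, not the HtmlFormatter API."""
    attributes = attributes or {}
    if not attributes:
        # No attributes: emit <p> rather than <p >.
        return "<" + name + ">"
    rendered = " ".join(
        '{}="{}"'.format(key, html.escape(str(value), quote=True))
        for key, value in attributes.items()
    )
    return "<" + name + " " + rendered + ">"
```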

Thanks for the very useful code!

Apr 8, 2010 at 3:11 PM

Oh, well in that case: Much obliged, I'll take a look at it. Also you're welcome :)

Apr 9, 2010 at 9:47 AM

Hi, I also found an issue with the parser, which seems to lose the last piece of 'text' in the HTML document. I logged an 'issue' about it and included a sample.
From what I've been able to investigate, it seems that when the HtmlParser's Tokenize method does not find another HTML tag after the plain text, it assigns the 'text' to a nonexistent attribute, and the text then gets lost. It happens at the line

_match = _attributeRegEx.Match(html);

in the Tokenize method. You probably need an additional if test to check whether any 'new' tags are left in the document at that point.
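
A rough sketch of the kind of check I mean. It is in Python rather than the project's own code, and tokenize and TAG_RE are hypothetical names, since the real Tokenize method and its regular expressions are not reproduced here. When no further tag is found, whatever remains is emitted as a text token instead of being matched against the attribute pattern.

```python
import re

# Assumption: a deliberately simple tag pattern, not the parser's real one.
TAG_RE = re.compile(r"<[^>]+>")


def tokenize(html):
    """Split html into ('tag', ...) and ('text', ...) tokens."""
    tokens = []
    pos = 0
    while pos < len(html):
        match = TAG_RE.search(html, pos)
        if match is None:
            # No tags left: the rest of the document is plain text, so keep it.
            tokens.append(("text", html[pos:]))
            break
        if match.start() > pos:
            # Plain text sitting between the previous tag and this one.
            tokens.append(("text", html[pos:match.start()]))
        tokens.append(("tag", match.group()))
        pos = match.end()
    return tokens
```

With this, a document that ends in plain text after its last closing tag still produces a final text token.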



Apr 25, 2010 at 2:08 AM

That last issue is fixed as of version 1.3. Enjoy everyone!