Simply reference the library in your project, then create a document, load a page, and save the result as XML:
//Create an instance
HtmlDocument _document = new HtmlDocument(); 

//load html from the web
_document.Load(new Uri("http://www.google.co.jp/search?hl=ja&q=the&lr=&aq=f&oq=")); 

//save the resultant xml data
_document.Save("google_jp.xml");

The transform function HtmlDocument.Transform(string xsltTemplate) takes an XSLT stylesheet as text and returns an HtmlDocument. For example, this stylesheet extracts the title and URL of each search result:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
  <xsl:output method="xml" indent="yes" />
  <xsl:template match="/">
    <GoogleItem>
      <xsl:for-each select="/HTMLDocument[1]/body[1]/div[5]/div[4]/div[1]/h3">
        <Item>
          <Title>
            <xsl:value-of select="a[1]/em[1]/text()" />
          </Title>
          <Url>
            <xsl:value-of select="a[1]/@href" />
          </Url>
        </Item>
      </xsl:for-each>
    </GoogleItem>
  </xsl:template>
</xsl:stylesheet>
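
As a rough sketch, the stylesheet above could be applied like this. It assumes the stylesheet has been saved to a file called google_items.xslt (a made-up name), that _document is the instance loaded earlier, and that the result can be saved the same way as the original document. The library's namespace is not given here, so its using directive is left out.

```csharp
using System.IO;

// Hypothetical file holding the stylesheet shown above.
string xsltTemplate = File.ReadAllText("google_items.xslt");

// Transform takes the XSLT as text and returns a new HtmlDocument.
HtmlDocument _items = _document.Transform(xsltTemplate);

// Save the transformed result (the output file name is also made up).
_items.Save("google_items.xml");
```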


The HtmlDocument.GetHtml() function returns the HTML text of the document, extracted from the XML. The HTML produced is not necessarily XHTML, however; it is closer to HTML 4.0.
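
A minimal sketch of reading the HTML back out, assuming GetHtml() returns the text as a string (the return type is not stated here) and that _document is the instance loaded earlier; the output file name is made up:

```csharp
using System.IO;

// Assumption: GetHtml() returns the document's HTML as a string.
string html = _document.GetHtml();

// Write the recovered HTML to disk for inspection.
File.WriteAllText("google_jp.html", html);
```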

Late-breaking news: it seems that Google has changed its page layout, so the above example may not work for everyone. Expect an update to this tutorial with a more reliable "broken" HTML site.

Last edited Mar 24, 2010 at 1:41 PM by kurtnelle, version 14
