Jan 02, 2017

HtmlAgilityPack to parse HTML in .NET

Introduction

Screen scraping, also known as data scraping or data extraction, is a technique for collecting different kinds of data from a web page, such as meta tag information, titles, images, links, contact information (phone and email), and other useful data like weather forecasts.

To put web scraping into practice in .NET, there is a very useful library called HtmlAgilityPack. It provides the essential methods for navigating, modifying, and searching the DOM (Document Object Model) tree. HtmlAgilityPack is very tolerant: it parses whatever you give it, even malformed HTML with missing closing tags. It supports XPath and XSLT for navigating the web page.
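To illustrate that tolerance, here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. The class name `TolerantParseDemo`, the helper `Parse`, and the sample markup are made up for illustration; `LoadHtml`, `ParseErrors`, and `SelectSingleNode` are part of the library.

```csharp
using System;
using HtmlAgilityPack;

public static class TolerantParseDemo
{
    // Hypothetical helper: parses a string of HTML, however broken it is.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // does not throw on malformed markup
        return doc;
    }

    public static void Main()
    {
        // Deliberately broken sample markup: <p> and <b> are never closed.
        HtmlDocument doc = Parse("<p>Hello <b>world<div>Next block</div>");

        // The parser records any problems it noticed instead of failing.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine($"{error.Code} at line {error.Line}: {error.Reason}");
        }

        // The resulting tree can still be queried with XPath.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");
        Console.WriteLine(div?.InnerText);
    }
}
```

Which parse errors are reported, if any, depends on the markup and on the library version, so the sketch only prints whatever was recorded.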

“BeautifulSoup is to Python as HTML Agility Pack is to .NET”

The big question is: why HtmlAgilityPack? Doesn’t .NET support DOM parsing?

.NET provides built-in facilities for manipulating XML, but they require standards-compliant markup as input, which is uncommon on real websites. So if a web page is not well-formed XML, the built-in XML parser cannot be used to scrape it. For malformed documents that do not comply with XML standards, HtmlAgilityPack is the option.
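The limitation of the built-in XML parser can be seen with a small sketch that uses only `System.Xml`. The class name `XmlStrictnessDemo`, the helper `IsWellFormedXml`, and the sample markup are assumptions chosen for illustration.

```csharp
using System;
using System.Xml;

public static class XmlStrictnessDemo
{
    // Hypothetical helper: returns true only if the markup is well-formed XML.
    public static bool IsWellFormedXml(string markup)
    {
        var xml = new XmlDocument();
        try
        {
            xml.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // The XML parser rejects the whole document on the first error.
            return false;
        }
    }

    public static void Main()
    {
        // Everyday HTML: <br> is never closed, which is not valid XML.
        const string html = "<html><body><p>First line<br>Second line</p></body></html>";

        Console.WriteLine(IsWellFormedXml(html)
            ? "Parsed as XML"
            : "The XML parser rejected the markup");
    }
}
```

An unclosed `<br>` is enough to make `LoadXml` throw, which is why an HTML-aware parser is needed for typical pages.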

Applications of HTML Agility Pack:

  • Repair: You can convert malformed HTML into well-formed HTML, which means you can fix the page the way you want.
  • Modification: You can add, edit, delete, and rename the nodes of a web page.
  • Navigation: You can traverse the entire HTML document (DOM) with the help of XPath and XSLT. Example: You can easily extract all the links in a web page for link analysis.
  • Scrapers and crawlers: HtmlAgilityPack is widely used for writing scraping software and web crawlers.
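The repair and modification uses listed above can be sketched as follows, assuming the HtmlAgilityPack package is referenced. The class name `HtmlCleanup`, the helper `Tidy`, and the specific edits (renaming `<b>`, removing `<script>`, adding `rel="nofollow"`) are arbitrary choices for illustration, not something the library prescribes.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class HtmlCleanup
{
    // Hypothetical helper: parses HTML, applies a few edits, returns the result.
    public static string Tidy(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true; // ask for XML-style output when saving
        doc.LoadHtml(html);

        // Rename: turn every <b> element into <strong>.
        HtmlNodeCollection boldNodes = doc.DocumentNode.SelectNodes("//b");
        if (boldNodes != null)
        {
            foreach (HtmlNode bold in boldNodes)
            {
                bold.Name = "strong";
            }
        }

        // Delete: remove every <script> element from the tree.
        HtmlNodeCollection scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (HtmlNode script in scripts)
            {
                script.Remove();
            }
        }

        // Edit: add an attribute to every link.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                anchor.SetAttributeValue("rel", "nofollow");
            }
        }

        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Sample input with an unclosed <div>.
        Console.WriteLine(Tidy("<div><b>Hi</b><script>track()</script><a href='/x'>link</a>"));
    }
}
```

`SelectNodes` returns null rather than an empty collection when nothing matches, hence the null checks before each loop.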

How to install HTML Agility Pack in Visual Studio?

Now let us explore HtmlAgilityPack, the .NET screen scraping library, in practice.

Below are the steps to install HTML Agility Pack in Visual Studio.

  1. Download HtmlAgilityPack.dll from the official website and save it to your local system.
  2. Create a sample Windows application project.
  3. Project → Add Reference → Browse to ‘HtmlAgilityPack.dll’ → OK.
  4. In the project’s References folder, you should now find “HtmlAgilityPack”.
  5. Done.

If you are using .NET 4.0 or later, you should have access to NuGet packages from within Visual Studio.

The steps below will guide you through installing HtmlAgilityPack as a NuGet package:

  1. Create a sample Windows application project.
  2. Right-click the project in Solution Explorer → click Manage NuGet Packages.
  3. In the NuGet Package window, type HtmlAgilityPack in the search box and click the Install button.
  4. Done.

Great! We have installed the library; now let’s see a practical example.

Understanding HtmlAgilityPack (HAP) with a Simple Example

Example: I want to retrieve all the links (“href”) from a URL ( http://tutorialspoint.com/ ).

Code:
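A minimal sketch of this example, assuming the HtmlAgilityPack package is referenced and the machine has network access. The class name `LinkExtractor` and the helper `ExtractLinks` are made up for illustration; `HtmlWeb.Load`, `DocumentNode.SelectNodes`, and `GetAttributeValue` are library calls.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns the href value of every <a> that has one.
    public static List<string> ExtractLinks(HtmlDocument doc)
    {
        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (HtmlNode anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }

    public static void Main()
    {
        // HtmlWeb downloads the page and parses it into an HtmlDocument.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://tutorialspoint.com/");

        foreach (string href in ExtractLinks(doc))
        {
            Console.WriteLine(href);
        }
    }
}
```

Keeping the XPath query in a separate method means it can also be tried against a string loaded with `LoadHtml`, without touching the network.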

HtmlWeb: A class that provides the Load() method, which fetches a web page from a URL and loads its HTML content into an HtmlDocument object.

HtmlDocument: A class that represents the parsed page. Its DocumentNode property exposes the root node, whose SelectNodes() method accepts an XPath expression.

Drawbacks of the Library:

  • Official documentation is not available, which makes the library difficult for a developer who has no exposure to screen scraping fundamentals.
