Introduction
Screen scraping, also known as data scraping or data extraction, is a technique for collecting different kinds of data from a web page, such as meta tag information, titles, images, links, contact information (phone and email), and other useful data like weather forecasts.
To put web scraping into practice in .NET, there is a very useful library known as HTML Agility Pack (HtmlAgilityPack). It provides the essential methods for navigating, modifying, and searching the DOM (Document Object Model) tree. HTML Agility Pack parses whatever you give it, even malformed HTML with missing closing tags, so it is very tolerant. It supports XPath and XSLT for navigating the web page.
“BeautifulSoup is to Python as HTML Agility Pack is to .NET”
The big question is: why HTML Agility Pack? Doesn't .NET support DOM parsing?
.NET provides built-in facilities for manipulating XML, but they require standards-compliant markup as input, which is uncommon on real websites. If a web page is not well-formed XML, those classes cannot parse it, so you cannot scrape it with them alone. To parse malformed documents that do not comply with XML standards, HTML Agility Pack is the option.
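To make the limitation concrete, here is a minimal sketch using only the built-in System.Xml classes. The class name, the helper method, and the two markup strings are hypothetical and exist only for illustration; the point is that XmlDocument rejects markup that a browser would display without complaint.

```csharp
using System;
using System.Xml;

public static class XmlStrictnessDemo
{
    // Returns true when the markup loads as well-formed XML, false otherwise.
    public static bool IsWellFormedXml(string markup)
    {
        try
        {
            new XmlDocument().LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // The built-in parser throws on the first well-formedness error.
            return false;
        }
    }

    public static void Main()
    {
        // Hypothetical snippets for illustration.
        string wellFormed = "<html><body><p>Hello</p></body></html>";
        string malformed = "<html><body><p>Hello<br></body></html>"; // <br> and <p> are never closed

        Console.WriteLine(IsWellFormedXml(wellFormed)); // prints "True"
        Console.WriteLine(IsWellFormedXml(malformed));  // prints "False"
    }
}
```

The second string is ordinary HTML of the kind found on many pages, yet it cannot be loaded at all, which is the gap a tolerant parser fills.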
Applications of HTML Agility Pack:
- Repair: You can convert malformed HTML into well-formed HTML, which means you can fix a page the way you want.
- Modification: You can add, edit, delete, and rename the nodes of a web page.
- Navigation: You can traverse the entire HTML document (DOM) with the help of XPath and XSLT. Example: you can easily extract all the links in a web page for link analysis.
- Scrapers and crawlers: HTML Agility Pack is widely used for writing scraping software and web crawlers.
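The modification and navigation points above can be sketched roughly as follows. This is an illustration under assumptions, not a recommended recipe: the class name, the method name, the sample markup, and the choice of rel="nofollow" are all hypothetical, and the code assumes the HtmlAgilityPack package is referenced by the project.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HapApplicationsDemo
{
    // Parses a (possibly malformed) fragment, edits it, and returns the serialized markup.
    public static string CleanUp(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant: does not throw on unclosed tags

        // Navigation: XPath query. SelectNodes returns null when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (var link in links)
            {
                // Modification: add or overwrite an attribute.
                link.SetAttributeValue("rel", "nofollow");
            }
        }

        // Modification: delete nodes. ToList() copies the matches before removing them.
        var scripts = doc.DocumentNode.SelectNodes("//script");
        if (scripts != null)
        {
            foreach (var script in scripts.ToList())
            {
                script.Remove();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Hypothetical input: the <b> element is never closed.
        string html = "<div><b>bold text<a href='/x'>link</a><script>track()</script></div>";
        Console.WriteLine(CleanUp(html));
    }
}
```

Because SelectNodes returns null rather than an empty collection, each query is guarded before the loop.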
How to install HTML Agility Pack in Visual Studio?
Now let us explore HtmlAgilityPack, the .NET screen scraping library, in practice.
Below are the steps to install HTML Agility Pack in Visual Studio.
- Download HtmlAgilityPack.dll from the official website and save it to your local system.
- Create a sample Windows application project.
- Project → Add Reference → Browse to 'HtmlAgilityPack.dll' → OK.
- In the project's References folder, you can now find "HtmlAgilityPack".
- Done.
If you are using .NET 4.0 or later, you should have access to NuGet packages in Visual Studio.
The steps below will guide you through installing HtmlAgilityPack as a NuGet package:
- Create a sample Windows application project.
- Right-click in Solution Explorer → click Manage NuGet Packages.
- In the NuGet Package window, type HtmlAgilityPack in the search box and click the Install button.
- Done.
Great! We have installed the library, so now let's see a practical example.
Understanding HTMLAgilityPack (HAP) with a Simple Example
Example: retrieve all the links ("href" values) from the URL http://tutorialspoint.com/.
Code:
    // Requires "using HtmlAgilityPack;" at the top of the form's source file.
    // textBox2 is a multiline TextBox on the form.

    // Download the page and load its HTML content into an HtmlDocument object.
    // The type is fully qualified because System.Windows.Forms also defines an HtmlDocument class.
    HtmlWeb getHtmlWeb = new HtmlWeb();
    HtmlAgilityPack.HtmlDocument document = getHtmlWeb.Load("http://tutorialspoint.com/");

    // Select all the anchor tags that actually carry an href attribute.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    var aTags = document.DocumentNode.SelectNodes("//a[@href]");
    int counter = 1;
    if (aTags != null)
    {
        // Iterate through the anchor tags and append each href value (link) to textBox2.
        foreach (var aTag in aTags)
        {
            textBox2.Text += counter + ". " + aTag.GetAttributeValue("href", string.Empty) + Environment.NewLine;
            counter++;
        }
    }
HtmlWeb: A class that provides the Load() method, which downloads a web page from a URL and loads its HTML content into an HtmlDocument object.
HtmlDocument: A class that represents the parsed page. Its DocumentNode property returns the root node, which provides the SelectNodes method that accepts an XPath expression.
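As a complement to the two classes described above, the following sketch parses HTML that is already in memory (using HtmlDocument.LoadHtml instead of HtmlWeb.Load) and resolves each href against a base address. The class name, method name, sample markup, and example.com address are assumptions made for illustration, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Returns the absolute form of every href found in the given HTML.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // Only anchors that carry an href attribute are selected.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return result; // no matching anchors
        }

        foreach (var anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty).Trim();

            // Resolve relative links against the page address; skip values that cannot be parsed.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                result.Add(absolute.AbsoluteUri);
            }
        }

        return result;
    }

    public static void Main()
    {
        // Hypothetical markup and base address.
        string html = "<p><a href='docs/index.html'>Docs</a> <a name='top'>No link</a></p>";
        foreach (string link in ExtractLinks(html, new Uri("http://example.com/")))
        {
            Console.WriteLine(link);
        }
    }
}
```

Keeping the parsing separate from the download makes the extraction logic easy to exercise on saved pages, without network access.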
Drawbacks of the Library:
- Official documentation is not available, so the library is difficult for a developer who has no exposure to screen scraping fundamentals.