Jan 02, 2014

Easy Web Scraping using PHP Simple HTML DOM Parser Library

Web scraping is often the only way to get data from a website when the website doesn't provide an API to access its data. Web scraping involves the following steps:

  1. Make a request to the web page.
  2. Parse the response and extract the data that you want to scrape from the website.
  3. Store the data in its final output format (Excel, CSV, MySQL database, etc.).
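The three steps above can be sketched in a few lines of PHP. This is a minimal sketch, not a full scraper: it assumes PHP's bundled DOM extension is enabled and uses it instead of Simple HTML DOM so that it runs without extra files. The inline page, the `$rows` array, and the CSV column names are made up for illustration.

```php
<?php
// Step 1 (request): a real scraper would fetch the page over HTTP, for
// example with file_get_contents($url). A hypothetical inline page stands
// in here so the sketch runs without network access.
$page = '<html><body>'
      . '<a href="/news">News</a>'
      . '<a href="/contact">Contact</a>'
      . '</body></html>';

// Step 2 (parse/extract): load the markup and collect each link's URL and text.
libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$rows = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $rows[] = [$link->getAttribute('href'), trim($link->textContent)];
}

// Step 3 (store): write the rows as CSV. php://memory keeps the demo
// self-contained; a real script would open a file path instead.
$out = fopen('php://memory', 'w+');
fputcsv($out, ['url', 'text'], ',', '"', '\\');
foreach ($rows as $row) {
    fputcsv($out, $row, ',', '"', '\\');
}
rewind($out);
$csv = stream_get_contents($out);
fclose($out);

echo $csv;
```

Swapping the inline string for a fetched page and `php://memory` for a file name turns this into a small working scraper; the parsing step is where a library such as Simple HTML DOM would slot in.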

Web scraping can be implemented in any language, such as PHP, Java, .NET, or Python, that can make a web request and read the page content (HTML text) into a variable. In this article I will show you how to use the Simple HTML DOM PHP library to do web scraping with PHP.

PHP Simple HTML DOM Parser

Simple HTML DOM is a PHP library for parsing data from web pages. In short, you can use it to scrape the web with PHP and then store the extracted data, for example in a MySQL database. Simple HTML DOM has the following features:

  • The parser is written in PHP and requires PHP 5 or later to run.
  • It can parse invalid HTML.
  • It lets you select HTML tags with jQuery-style selectors.
  • Extraction is based on CSS-style selectors (XPath queries are not part of its documented find() syntax).
  • It offers both an object-oriented and a procedural way to write code.
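To make the selector and API-style points in the list concrete, here is a small sketch. It assumes `simple_html_dom.php` sits next to the script, and the sample markup and variable names are made up for illustration. `str_get_html()` is the library's procedural helper for parsing a string, while `new simple_html_dom()` followed by `load()` is the object-oriented equivalent.

```php
<?php
require_once "simple_html_dom.php"; // assumption: library file is in the same directory

// Hypothetical inline markup, so no network request is needed
$markup = '<div class="menu">'
        . '<a class="ext" href="http://example.com/">Example</a>'
        . '<a href="/about">About</a>'
        . '</div><p>No links here</p>';

// Procedural way: a helper function returns the parsed document
$doc = str_get_html($markup);

// Object-oriented way: create the object, then load the same string
$dom = new simple_html_dom();
$dom->load($markup);

// jQuery-style selectors
$menuLinks = $doc->find('div.menu a');     // links inside <div class="menu">
$extLinks  = $dom->find('a.ext');          // links with class "ext"
$absLinks  = $dom->find('a[href^=http]');  // links whose href starts with "http"
$first     = $doc->find('a', 0);           // index argument returns one element

echo count($menuLinks) . " menu links\n";
echo $first->href . "\n";
```

Both styles produce the same kind of document object, so the choice is mostly a matter of taste; the procedural helpers simply save a line.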

Scrape All Links

<?php
require_once "simple_html_dom.php";

// Create the parser object
$html = new simple_html_dom();

// Load a specific URL (fetching a URL requires allow_url_fopen to be enabled)
$html->load_file("http://www.google.com");

// Find all links and print the href of each one
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

?>

Scrape All Images

<?php
require_once "simple_html_dom.php";

// Create the parser object
$html = new simple_html_dom();

// Load a specific URL (fetching a URL requires allow_url_fopen to be enabled)
$html->load_file("http://www.google.com");

// Find all images and print the src of each one
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

?>

 

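This is just a brief introduction to web scraping with PHP. Keep in mind that XPath can make the job simpler and faster, although find() itself takes CSS-style selectors, so XPath queries go through PHP's built-in DOMXPath class instead. You can find all available methods on the Simple HTML DOM documentation page: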

http://simplehtmldom.sourceforge.net/manual.htm
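For readers who want to try XPath as mentioned above, the following is a minimal sketch using PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) rather than Simple HTML DOM. The sample markup, the `gallery` class name, and the `$sources` array are made up for illustration.

```php
<?php
// Hypothetical sample markup; a real scraper would fetch this over HTTP.
$page = '<html><body>'
      . '<div class="gallery">'
      . '<img src="/a.png" alt="A"><img src="/b.png" alt="B">'
      . '</div>'
      . '<img src="/logo.png" alt="Logo">'
      . '</body></html>';

libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select the src attribute of every <img> inside <div class="gallery">
$sources = [];
foreach ($xpath->query('//div[@class="gallery"]//img/@src') as $attr) {
    $sources[] = $attr->nodeValue;
}

print_r($sources);
```

The XPath expression narrows the match to images inside the gallery block in a single query, which is the kind of shortcut that keeps extraction code short.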