Dec 09, 2017

Puppeteer – Web Scraping using Headless Chrome Node API

Puppeteer is a Node library, developed by the Google Chrome team, that provides a high-level API for controlling headless Chrome. A headless browser is a web browser without a graphical user interface (GUI), meaning it has no visual components. Headless browsers let you control a web page programmatically, without human intervention. In programmer’s terms, Puppeteer is an API for headless browsing as well as browser automation.
Browser automation helps you automate repetitive tasks and web application testing. Examples include monitoring product pricing over a period of time, submitting forms, or automatically logging in to a web app, performing some task, and logging out.

There are many libraries for browser automation and web scraping, such as PhantomJS and Selenium IDE. However, Puppeteer runs faster and uses less memory than these tools. Puppeteer only works with the Google Chrome browser (or Chromium).

Puppeteer can be used for:

  • Headless WebApp/Website Testing: Develop a functional test and execute it.
  • Screen Capture: Capture screenshots of web content or generate PDFs.
  • Web Scraping: Retrieve and manipulate web pages with the DOM API or libraries like jQuery.
  • Web Automation: Automate form submission, UI testing, keyboard input, etc.
  • Network Monitoring: Monitor page loading and diagnose performance issues.

Puppeteer provides great flexibility and many features for web scraping. It offers the features a professional web scraper wants, such as:

  • Setting many browser options.
  • Slowing down Puppeteer operations by a specified number of milliseconds.
  • Creating an individual Chromium user profile, which it cleans up on every run.
  • Timeout: Maximum time in milliseconds to wait for the Chrome instance to start.
  • and many more
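
The features listed above map onto fields of the options object passed to puppeteer.launch(). As a minimal sketch, the helper below gathers them in one place. The name buildLaunchOptions is hypothetical, and the default values are illustrative assumptions rather than recommendations.

```javascript
// Hypothetical helper: collects common puppeteer.launch() options.
// The default values below are illustrative assumptions.
function buildLaunchOptions(overrides = {}) {
  return {
    headless: true,  // run Chrome without a visible window
    slowMo: 250,     // slow down each Puppeteer operation by 250 ms
    timeout: 30000,  // wait at most 30 s for the Chrome instance to start
    args: [],        // extra command-line flags passed to Chrome
    ...overrides,    // e.g. { userDataDir: './my-profile' } to keep a profile
  };
}

// Usage (requires `npm install puppeteer`), inside an async function:
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch(buildLaunchOptions({ headless: false }));
//   ...
//   await browser.close();
```

If userDataDir is left out, Puppeteer creates its own temporary profile, which matches the "cleans up on every run" behaviour mentioned in the list.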
Our main focus here is exploring Puppeteer from a web scraper’s perspective. Let’s take some practical examples and learn from them.
First of all, we need to install the library with npm: npm install puppeteer
Example #1: Capture screenshot of a web page
It can be accomplished in three simple steps: create/open a browser, browse to a web page, and take a screenshot.
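
A minimal sketch of those three steps might look like the following. The function name captureScreenshot, the example URL, and the output file name are assumptions made for illustration. The Puppeteer module is passed in as an argument so the function can be exercised without a real browser.

```javascript
// Hypothetical helper: open a browser, load `url`, save a screenshot.
async function captureScreenshot(puppeteer, url, outputPath) {
  // Step 1: create/open the browser (headless by default).
  const browser = await puppeteer.launch();
  try {
    // Step 2: open a tab and browse to the page; 'networkidle2' waits
    // until network activity has mostly settled.
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Step 3: capture the whole scrollable page, not just the viewport.
    await page.screenshot({ path: outputPath, fullPage: true });
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
  return outputPath;
}

// Usage (requires `npm install puppeteer`):
//   captureScreenshot(require('puppeteer'), 'https://example.com', 'example.png')
//     .then((file) => console.log('Saved', file));
```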

Example #2: Create a PDF document from a web page
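
As a rough sketch, PDF generation follows the same pattern as a screenshot, with page.pdf() in place of page.screenshot(). The function name savePageAsPdf and the A4 format with printed backgrounds are assumptions chosen for illustration.

```javascript
// Hypothetical helper: load `url` and write it out as a PDF file.
async function savePageAsPdf(puppeteer, url, outputPath) {
  // PDF generation has historically required headless mode,
  // which is what launch() uses by default.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    // `format` sets the paper size; `printBackground` keeps CSS backgrounds.
    await page.pdf({ path: outputPath, format: 'A4', printBackground: true });
  } finally {
    await browser.close();
  }
  return outputPath;
}

// Usage (requires `npm install puppeteer`):
//   savePageAsPdf(require('puppeteer'), 'https://example.com', 'example.pdf')
//     .then((file) => console.log('Saved', file));
```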

 

Example #3: Browse the Bing search engine and search for "web scraping"
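
One way to sketch this is to type the query into the search box, press Enter, and read the result titles from the DOM. The two CSS selectors below are assumptions about Bing's markup, which may change at any time, and searchBing is a hypothetical helper name.

```javascript
// Assumed selectors: Bing's markup may differ or change over time.
const SEARCH_BOX = '[name="q"]';        // assumption: the search field
const RESULT_LINKS = 'li.b_algo h2 a';  // assumption: organic result titles

// Hypothetical helper: run a Bing search and return [{ title, url }, ...].
async function searchBing(puppeteer, query) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://www.bing.com', { waitUntil: 'networkidle2' });
    // Type the query as a user would, then submit with the Enter key.
    await page.type(SEARCH_BOX, query);
    await page.keyboard.press('Enter');
    // Wait until result links are present on the results page.
    await page.waitForSelector(RESULT_LINKS);
    // Runs inside the page: turn each link element into plain data.
    return await page.$$eval(RESULT_LINKS, (links) =>
      links.map((a) => ({ title: a.textContent.trim(), url: a.href }))
    );
  } finally {
    await browser.close();
  }
}

// Usage (requires `npm install puppeteer`):
//   searchBing(require('puppeteer'), 'web scraping')
//     .then((results) => console.log(results));
```

If the selectors stop matching, waitForSelector will eventually time out, so it is worth checking them in the browser's developer tools first.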

 

Some Advanced Web Scraping Tips & Tricks:

#1. Using Proxies with Puppeteer
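
A common approach, sketched here under assumptions, is to pass Chrome's --proxy-server flag through the args launch option. The helper name launchWithProxy, the proxy address, and the credentials are all placeholders. Using page.authenticate() for the proxy's login prompt is a widely used pattern, but check that it suits your proxy.

```javascript
// Hypothetical helper: start Chrome behind a proxy and open a tab.
// `proxyServer` is a placeholder such as '127.0.0.1:8080'.
async function launchWithProxy(puppeteer, proxyServer, credentials) {
  const browser = await puppeteer.launch({
    // Chrome command-line switch that routes traffic through the proxy.
    args: [`--proxy-server=${proxyServer}`],
  });
  const page = await browser.newPage();
  if (credentials) {
    // Supplies { username, password } when an auth challenge is received.
    await page.authenticate(credentials);
  }
  return { browser, page };
}

// Usage (requires `npm install puppeteer`), inside an async function:
//   const { browser, page } = await launchWithProxy(
//     require('puppeteer'), '127.0.0.1:8080', { username: 'user', password: 'pass' });
//   await page.goto('https://example.com');
//   await browser.close();
```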

You can read more about proxy settings here

#2. Prevent/Block image loading
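
Skipping images can make scraping faster and lighter on bandwidth. A minimal sketch, assuming a Puppeteer version in which request.resourceType() is a method, is to turn on request interception and abort every image request. The helper name blockImages is hypothetical.

```javascript
// Hypothetical helper: stop the given page from downloading images.
// Call it before page.goto() so the first navigation is covered too.
async function blockImages(page) {
  // With interception on, every request waits for abort() or continue().
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (request.resourceType() === 'image') {
      request.abort();     // drop image downloads
    } else {
      request.continue();  // let everything else through unchanged
    }
  });
}

// Usage (requires `npm install puppeteer`), inside an async function:
//   const browser = await require('puppeteer').launch();
//   const page = await browser.newPage();
//   await blockImages(page);
//   await page.goto('https://example.com');
//   await browser.close();
```

The same handler can be extended to other resource types, such as 'stylesheet' or 'font', if they are not needed for the data being scraped.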
