Dec 09, 2017

Puppeteer – Web Scraping using Headless Chrome Node API

Puppeteer is a Node library for controlling headless Chrome, developed by the Google Chrome team. A headless browser is a web browser without a graphical user interface (GUI), which means it has no visual components. Headless browsers let you control web pages programmatically, without human intervention. In programmer's terms, Puppeteer is a Node library (API) for headless browsing as well as browser automation.
Browser automation helps you automate repetitive tasks and web application testing: for example, monitoring product pricing over a period of time, submitting forms, or automatically logging in to a web app, performing some task, and logging out.

There are many libraries for browser automation and web scraping, such as PhantomJS and Selenium IDE. However, Puppeteer runs faster and uses less memory. Note that Puppeteer only works with the Google Chrome (Chromium) browser. Puppeteer can be used for:

  • Headless Web App/Website Testing: develop functional tests and execute them.
  • Screen Capture: capture web content screenshots or generate PDFs.
  • Web Scraping: retrieve and manipulate web pages with the DOM API or libraries like jQuery.
  • Web Automation: automate form submission, UI testing, keyboard input, etc.
  • Network Monitoring: monitor page loading and diagnose performance issues.

Puppeteer provides great flexibility for web scraping, with all the features a professional web scraper desires, such as:

  • Setting the many browser options
  • Slowing down Puppeteer operations by a specified number of milliseconds
  • Creating an individual Chromium user profile, which it cleans up on every run
  • Timeout: the maximum time in milliseconds to wait for the Chrome instance to start
  • and many more, as shown in the launch sketch below
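
As a quick illustration, here is a minimal launch sketch combining some of these options. The values shown (and the profile path) are arbitrary examples, not recommendations:

//load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: true,                 //run without a visible browser window
    slowMo: 250,                    //slow each Puppeteer operation down by 250 ms
    timeout: 30000,                 //wait at most 30 seconds for Chrome to start
    userDataDir: '/tmp/my-profile'  //hypothetical path for a persistent user profile
  });
  await browser.close();
})();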
Our main focus here is exploring Puppeteer from a web scraper's perspective. Let's take some practical examples and try to learn it.
First of all we need to install the library:
yarn add puppeteer
# or "npm i puppeteer"

Example #1: Capture a screenshot of a web page
This can be accomplished in three simple steps: create/open the browser, browse to a web page, and take a screenshot. The code below captures a screenshot of a web page and is documented step by step.

//load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
//create browser instance and launch it
  const browser = await puppeteer.launch();
//create page instance in browser
  const page = await browser.newPage();
//load URL into the page
  await page.goto('https://en.wikipedia.org/wiki/Main_Page');
//capture screenshot and save it
  await page.screenshot({path: 'wikipedia.png'});
//finally close the browser
  await browser.close();
})();
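
By default, page.screenshot() captures only the current viewport. If you want the whole scrollable page, it accepts a fullPage option, and page.setViewport() controls the viewport size. A small variation of the screenshot step above (the dimensions are arbitrary examples):

//use a wider viewport and capture the full scrollable page
await page.setViewport({width: 1280, height: 800});
await page.screenshot({path: 'wikipedia-full.png', fullPage: true});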

Example #2: Create a PDF document from a web page

//load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
//create browser instance and launch it
  const browser = await puppeteer.launch();
//create page instance in browser
  const page = await browser.newPage();
//load URL into the page
  await page.goto('https://news.google.com/', {waitUntil: 'domcontentloaded'});
//save web page as A4 size PDF document in current directory (path can be specified)
  await page.pdf({path: 'googlenews.pdf', format: 'A4'});
//finally close the browser
  await browser.close();
})();
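
page.pdf() also accepts options such as printBackground, landscape, and margin; for example (the values here are purely illustrative):

//A4 PDF with CSS backgrounds and custom margins
await page.pdf({
  path: 'googlenews.pdf',
  format: 'A4',
  printBackground: true,                 //include CSS background colors/images
  margin: {top: '1cm', bottom: '1cm'}    //page margins
});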

Example #3: Browse the Bing search engine and search for "web scraping"

//load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
//create browser instance and launch it
  const browser = await puppeteer.launch();
//create page instance in browser
  const page = await browser.newPage();
//visit the Bing search engine
  await page.goto('https://www.bing.com/', {waitUntil: 'networkidle2'});
//wait for the search textbox (the input whose name equals 'q')
  await page.waitForSelector('input[name=q]');
//type the search term into the search box
  await page.type('input[name=q]', 'web scraping');
//click the submit button
  await page.click('input[type="submit"]');

//wait for the results to show up
  await page.waitForSelector('h2 a');

//extract the list of result titles from the page
  const links = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll('h2 a'));
    return anchors.map(anchor => anchor.textContent);
  });
//output the list of titles to the console
  console.log(links.join('\n'));
//finally close the browser
  await browser.close();
})();
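If you also want the target URLs and not just the anchor text, you can return one object per anchor from page.evaluate(). A small variation of the extraction step above:

//extract both the visible title and the link target of each result
const results = await page.evaluate(() => {
  const anchors = Array.from(document.querySelectorAll('h2 a'));
  return anchors.map(anchor => ({title: anchor.textContent, href: anchor.href}));
});
console.log(results);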


Some advanced web scraping tips & tricks:

#1. Using Proxies with Puppeteer

const browser = await puppeteer.launch({
  // Launch puppeteer browser using a proxy server on port 3600.
  args: [ '--proxy-server=127.0.0.1:3600' ]
});

You can read more about proxy settings in Chromium's documentation.
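
If the proxy requires credentials, Puppeteer's page.authenticate() can supply them. A minimal sketch, where the username and password are placeholders:

const page = await browser.newPage();
//answer the proxy's authentication challenge with placeholder credentials
await page.authenticate({username: 'proxyUser', password: 'proxyPass'});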

#2. Prevent/Block image loading

const page = await browser.newPage();
await page.setRequestInterception(true);
page.on('request', request => {
  if (request.resourceType === 'image')
    request.abort();
  else
    request.continue();
});
await page.goto('https://www.yahoo.com/news/');
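
The same interception hook can skip other heavy resources too; for example, a sketch that also blocks stylesheets and fonts:

//abort any request whose resource type is in the blocked set
const blocked = new Set(['image', 'stylesheet', 'font']);
page.on('request', request => {
  if (blocked.has(request.resourceType()))
    request.abort();
  else
    request.continue();
});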
