
Browser automation lets you automate repetitive tasks and web application testing. Examples include monitoring product prices over a period of time, submitting forms, and logging in to a web app, performing some task, and logging out.
There are many libraries for browser automation and web scraping, such as PhantomJS and Selenium IDE. Puppeteer, however, tends to run faster and use less memory. Puppeteer is designed primarily for the Google Chrome (Chromium) browser. Puppeteer can be used for:
- Headless web app/website testing: Develop a functional test and execute it.
- Screen capture: Capture screenshots of web content or generate PDFs.
- Web scraping: Retrieve and manipulate web pages with the DOM API or libraries like jQuery.
- Web automation: Automate form submission, UI testing, keyboard input, etc.
- Network monitoring: Monitor page loading and diagnose performance issues.
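As an illustration of the network monitoring use case, here is a minimal sketch that records every response a page receives and groups the status codes. The function names `summarizeResponses` and `monitorPage` are made up for this example and are not part of Puppeteer; the URL is supplied by the caller.

```javascript
// Pure helper (hypothetical name): group recorded responses by status class,
// e.g. "2xx" or "4xx".
function summarizeResponses(entries) {
  const summary = {};
  for (const { status } of entries) {
    const key = `${Math.floor(status / 100)}xx`;
    summary[key] = (summary[key] || 0) + 1;
  }
  return summary;
}

// Hypothetical wrapper: load a page and collect the URL and status of each response.
async function monitorPage(url) {
  // Required lazily so the helper above can be used without launching a browser.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    const entries = [];
    // The 'response' event fires for each response the page receives.
    page.on('response', response => {
      entries.push({ url: response.url(), status: response.status() });
    });
    await page.goto(url, { waitUntil: 'networkidle2' });
    return { entries, summary: summarizeResponses(entries) };
  } finally {
    // Close the browser even if navigation fails.
    await browser.close();
  }
}
```

Calling `monitorPage('https://en.wikipedia.org/wiki/Main_Page')` should resolve to the list of responses plus a per-class count, which is a reasonable starting point for spotting failed or slow resources.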
Puppeteer provides great flexibility and many features for web scraping. It offers the options a professional web scraper wants, such as:
- Setting many browser launch options.
- Slowing down Puppeteer operations by a specified number of milliseconds.
- Creating an individual Chromium user profile, which is cleaned up on every run.
- Timeout: The maximum time in milliseconds to wait for the Chrome instance to start.
- and many more
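The options listed above map onto the object passed to `puppeteer.launch()`. The sketch below shows one way to assemble them; the helper name `buildLaunchOptions`, the `debug` switch, and the specific values (250 ms, 30000 ms, the window size) are assumptions chosen for illustration.

```javascript
// Hypothetical helper: build a launch options object for puppeteer.launch().
function buildLaunchOptions({ debug = false } = {}) {
  return {
    // Show the browser window while debugging, run headless otherwise.
    headless: !debug,
    // Slow each Puppeteer operation down by this many milliseconds.
    slowMo: debug ? 250 : 0,
    // Maximum time in milliseconds to wait for the browser to start.
    timeout: 30000,
    // Extra command-line flags passed to the browser.
    args: ['--window-size=1280,800'],
    // userDataDir is left unset here, so Puppeteer uses a temporary profile.
  };
}

// Hypothetical wrapper: launch a browser with the options above.
async function launchBrowser(settings) {
  // Required lazily so buildLaunchOptions can be used on its own.
  const puppeteer = require('puppeteer');
  return puppeteer.launch(buildLaunchOptions(settings));
}
```

For example, `launchBrowser({ debug: true })` would open a visible browser window and slow each operation down so that you can watch what the script does.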
yarn add puppeteer # or "npm i puppeteer"
Example #1: Capture screenshot of a web page
This takes three simple steps: launch the browser, open a web page, and take a screenshot. The code below captures a screenshot of a web page; the comments explain each step.
// load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
  // create a browser instance and launch it
  const browser = await puppeteer.launch();
  // create a page instance in the browser
  const page = await browser.newPage();
  // load the URL into the page
  await page.goto('https://en.wikipedia.org/wiki/Main_Page');
  // capture a screenshot and save it
  await page.screenshot({path: 'wikipedia.png'});
  // finally, close the browser
  await browser.close();
})();
Example #2: Create a PDF document from a web page
// load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
  // create a browser instance and launch it
  const browser = await puppeteer.launch();
  // create a page instance in the browser
  const page = await browser.newPage();
  // load the URL into the page
  await page.goto('https://news.google.com/', {waitUntil: 'domcontentloaded'});
  // save the web page as an A4 PDF document in the current directory (a path can be specified)
  await page.pdf({path: 'googlenews.pdf', format: 'A4'});
  // finally, close the browser
  await browser.close();
})();
Example #3: Open the Bing search engine and search for "web scraping"
// load the puppeteer library
const puppeteer = require('puppeteer');

(async () => {
  // create a browser instance and launch it
  const browser = await puppeteer.launch();
  // create a page instance in the browser
  const page = await browser.newPage();
  // visit the Bing search engine
  await page.goto('https://www.bing.com/', {waitUntil: 'networkidle2'});
  // wait for the search box, whose name equals 'q'
  // (waitForSelector replaces the older page.waitFor call)
  await page.waitForSelector('input[name=q]');
  // type the search term into the search box
  await page.type('input[name=q]', 'web scraping');
  // click the submit button (this selector depends on Bing's current markup)
  await page.click('input[type="submit"]');
  // wait for the results to show up
  await page.waitForSelector('h2 a');
  // extract the list of link titles from the result page
  const links = await page.evaluate(() => {
    const anchors = Array.from(document.querySelectorAll('h2 a'));
    return anchors.map(anchor => anchor.textContent);
  });
  // output the list of link titles to the console
  console.log(links.join('\n'));
  // finally, close the browser
  await browser.close();
})();
Some advanced web scraping tips and tricks:
#1. Using Proxies with Puppeteer
const browser = await puppeteer.launch({
  // launch the Puppeteer browser using a proxy server on port 3600
  args: [ '--proxy-server=127.0.0.1:3600' ]
});
You can read more about proxy setting here
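If the proxy requires a username and password, one possible approach is to supply them with `page.authenticate()` before navigating. The sketch below assumes that setup; the helper names `buildProxyArgs` and `openThroughProxy`, and the shape of the settings object, are hypothetical.

```javascript
// Hypothetical helper: build the browser flag that routes traffic through a proxy.
function buildProxyArgs(host, port) {
  return [`--proxy-server=${host}:${port}`];
}

// Hypothetical wrapper: open a URL through a proxy and return the page title.
async function openThroughProxy(url, { host, port, username, password }) {
  // Required lazily so buildProxyArgs can be used on its own.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch({ args: buildProxyArgs(host, port) });
  try {
    const page = await browser.newPage();
    if (username) {
      // Provide credentials for the proxy's authentication challenge.
      await page.authenticate({ username, password });
    }
    await page.goto(url);
    return await page.title();
  } finally {
    // Close the browser even if navigation fails.
    await browser.close();
  }
}
```

With the proxy from the snippet above, `buildProxyArgs('127.0.0.1', 3600)` yields the same `--proxy-server=127.0.0.1:3600` flag, so the helper only saves you from assembling the string by hand.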
#2. Prevent/Block image loading
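const page = await browser.newPage();
// enable request interception so each request can be inspected
await page.setRequestInterception(true);
page.on('request', request => {
  // resourceType() is a method; abort image requests and let everything else through
  if (request.resourceType() === 'image')
    request.abort();
  else
    request.continue();
});
await page.goto('https://www.yahoo.com/news/');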
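The same interception technique can be extended to other heavy resource types. The following is a minimal sketch, assuming you also want to skip media and fonts; the names `BLOCKED_TYPES`, `shouldBlock`, and `openLightweightPage` are invented for this example, and the set of blocked types is a choice you should adapt to the site being scraped.

```javascript
// Resource types to skip; 'image', 'media', and 'font' are values that
// request.resourceType() can return. The selection itself is an assumption.
const BLOCKED_TYPES = new Set(['image', 'media', 'font']);

// Hypothetical helper: decide whether a request of this type should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// Hypothetical wrapper: open a page in an existing browser with blocking enabled.
async function openLightweightPage(browser, url) {
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', request => {
    if (shouldBlock(request.resourceType())) request.abort();
    else request.continue();
  });
  await page.goto(url);
  return page;
}
```

Keeping the decision in a small function like `shouldBlock` makes it easy to change what gets blocked without touching the interception handler.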