Jan 04, 2014

How to do data scraping from PDF files using PHP?


PDF Scraping using PHP

Sometimes you need to scrape data from PDF files or search them for matching text. Suppose you have a website where users upload PDF files, and you want to offer a search feature that looks through the content of every uploaded PDF and shows all the PDFs that contain the search keywords.

Or you might have the details of all London real estate properties in PDF reports and want to quickly scrape data from those reports. In that case you need a PDF scraping library.

Adding this kind of functionality to a web application is different from the usual search we run against a database, because the text is locked inside the PDF files rather than stored in database columns.

Here is a straightforward solution to this problem: convert the PDF content to plain text, then match the search terms against that text. I have written this post for people who want to scrape data from PDF files or make their PDF files searchable.

We are going to use a PHP class file named class.pdf2text.php, which converts the text in a PDF into ASCII text. This PHP class ignores anything in the PDF that is not text.


You can download it from  https://code.google.com/p/lucene-silverstripe-plugin/source/browse/trunk/thirdparty/class.pdf2text.php?r=19

Let’s look at a very basic example (taken from the author’s file):

<?php

// Load the PDF2Text class.
include "class.pdf2text.php";

$a = new PDF2Text();
$a->setFilename('web-scraping-service.pdf'); // The PDF file sits in the same folder as this PHP script.

$a->decodePDF(); // Converts the PDF content to text.
echo $a->output(); // Prints the extracted text.

?>

 

This example simply reads the text content of web-scraping-service.pdf and echoes it on the page. The web-scraping-service.pdf file contains the following text, which the code will echo.

PDF File: http://webdata-scraping.com/wp-content/uploads/2016/07/web-scraping-service.pdf

“Web Scraping is a technique using which programmer can automate the copy paste manual work and save the time. This is PDF w eb scraping using PHP. We at Web Data Scraping offer Web Scraping and Data Scraping Service. Vist our website www.webdata-scraping.com”

For more complex extraction, you can apply regular expressions to the extracted text and parse out the parts you want. Keep in mind that this approach has limitations and does not work for every type of PDF.
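As a rough illustration of the regular expression idea, here is a minimal sketch that pulls website addresses out of text that has already been extracted from a PDF. The function name extractWebsites and the sample string are assumptions made for this example; in practice the text would come from $a->output() as in the example above. The pattern only recognises addresses that start with "www." and is not a general URL parser.

```php
<?php

// Hypothetical helper (not part of class.pdf2text.php): returns the unique
// "www." addresses found in a block of already-extracted PDF text.
function extractWebsites(string $text): array
{
    // Match "www." followed by two or more dot-separated labels
    // made of letters, digits and hyphens (case-insensitive).
    preg_match_all('/\bwww\.[a-z0-9-]+(?:\.[a-z0-9-]+)+\b/i', $text, $found);

    // $found[0] holds the full matches; drop duplicates and reindex.
    return array_values(array_unique($found[0]));
}

// Sample text standing in for the output of the PDF extraction step.
$text = 'We offer Web Scraping and Data Scraping Service. '
      . 'Visit our website www.webdata-scraping.com';

foreach (extractWebsites($text) as $site) {
    echo $site, PHP_EOL;
}
```

The same approach works for other patterns such as prices, dates or postcodes, as long as the extracted text keeps them on one line and in a predictable form.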

A great use of this class is to build a utility that lets users search inside PDFs from the search bar of a website. Last but not least, there is also plenty of PDF scraping software on the market that can handle complex scraping from PDF files.
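The search utility could be sketched roughly as follows. This assumes the text of each uploaded PDF has already been extracted (for example with $a->output() as in the example above) and collected into an array keyed by file name. The function name findMatchingPdfs, the array $textsByFile and the sample file names are all made up for this illustration. A PDF counts as a match only if it contains every keyword, and the comparison is case-insensitive.

```php
<?php

// Hypothetical helper (not part of class.pdf2text.php): returns the names of
// the PDFs whose extracted text contains every keyword in the search query.
function findMatchingPdfs(array $textsByFile, string $query): array
{
    // Split the query into keywords, ignoring extra whitespace.
    $keywords = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    $matches = [];
    if (!$keywords) {
        return $matches; // An empty query matches nothing.
    }

    foreach ($textsByFile as $file => $text) {
        $allFound = true;
        foreach ($keywords as $keyword) {
            // stripos() does a case-insensitive substring search.
            if (stripos($text, $keyword) === false) {
                $allFound = false;
                break;
            }
        }
        if ($allFound) {
            $matches[] = $file;
        }
    }

    return $matches;
}

// Sample data standing in for text extracted from uploaded PDFs.
$textsByFile = [
    'web-scraping-service.pdf' => 'Web Scraping is a technique to automate copy paste work.',
    'london-properties.pdf'    => 'London real estate property details report.',
];

foreach (findMatchingPdfs($textsByFile, 'london estate') as $file) {
    echo $file, PHP_EOL;
}
```

For more than a handful of files, it would make sense to extract the text once at upload time and store it, so that each search does not have to decode every PDF again.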

 

2 Responses to How to do data scraping from PDF files using PHP?
  1. AD

    useful article on web data scraping from PDF file….It would be great if you put article on image scraping from a website…

    • Web Scraper

      Hi,

      I am going to write a tutorial on ‘How to scrape images?’ using PHP and C#.

      Thanks,
      Keval