May 08, 2019

Extracting Data From PDFs Using Tabula

Tabula Technology for PDF Parsing

Are you looking for a way to extract data from PDF documents? You’re in the right place. Retyping PDF data by hand is often the first approach people try, but it is slow, error-prone and does not scale beyond a handful of documents. In this article we look at Tabula, a PDF data extraction tool, and how to use it. Tabula provides a visual interface for selecting the tables you want to pull out of a PDF and then extracts them automatically. It is a free, open-source tool built for scraping data from PDF tables. Tabula runs offline and is available under the MIT open-source license for Windows, Mac and Linux.

PDF Data Parsing

Data extraction from PDFs is useful in many areas:

  • Artificial Intelligence
  • Business Process Automation
  • E-Discovery
  • Machine Learning
  • Decision Logic Applications
  • Data Science And Analytics
  • Account Automation – Reading Invoices and Purchase Orders
  • Searching through thousands of PDFs
  • Extracting data from shipping notes and automating the delivery tracking process

Tabula Features:

  • Visual PDF data extraction tool: Tabula provides a visual editor that lets you select the data to extract.
  • Copy the result to the clipboard or export it to a file format such as CSV or TSV.
  • Tabula Command Line Utility: the command-line utility can be run from a console (for example, the Windows console) to extract data from a large number of PDF files.
  • No copy-paste hassles: you do not need to copy text out of hundreds of PDF files by hand, again and again.

Tabula in Practice:

Let us explore how Tabula can be used.

Download and Installation:

1. Tabula runs on Java, so Windows and Linux users need to install a Java Runtime Environment if they don’t already have one. You can download Java here. Tabula for Mac OS X comes with Java. Download the version of Tabula for your operating system. Click here to download Tabula.

2. Extract the downloaded zip file.

3. Inside the extracted folder you will find a file named “tabula.exe” (on Windows). Double-click it to run Tabula.

Tabula PDF Scraping

4. After running the program, a web browser will open. If it doesn’t, open a web browser of your choice, and go to http://localhost:8080. There’s Tabula!

Tabula Web Page

How to use Tabula:

1. Upload PDF File: On the home screen you will find a file selection option. Browse to the PDF file you want to extract data from, select it, and click the Import button. Once the file is imported, Tabula displays it as shown in the screenshot below:

Sample PDF File

2. Make table selection: Go to the page that contains the table you want to extract, then select the table by clicking and dragging to draw a box around it. You can also click the “Autodetect Tables” option, which selects tabular data automatically.

Autodetect Tabular Data

3. Preview Data: After you select a table, the “Preview & Export Extracted Data” button is enabled. Click it to preview the data. The preview is shown in the screenshot below:

Preview Extracted Data

Tabula has two extraction methods: Stream and Lattice. Stream looks for white space between columns, while Lattice looks for boundary lines between columns. If the data is not mapped to the correct cells, try the other method.

4. Export Data: You can now copy the data to the clipboard and paste it wherever you need it, or export it to a file format such as CSV, TSV or JSON. The following screenshot shows the exported CSV file:

Extracted Data

Tabula Command Line:

Suppose you have thousands of separate PDF files, every page is laid out identically, and you want the same table from the middle of each page. The Tabula web interface is not practical here, because you would have to repeat the “upload and process” steps for every file. For these situations you can use tabula-java, the engine that powers Tabula, as a standalone command-line tool.
Download the tabula-java jar with all dependencies included, which works on Mac, Windows and Linux, from the GitHub releases page.

Syntax

java -jar /path/to/tabula/tabula-1.0.2-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename
  • y1 = top coordinate of the table
  • x1 = left coordinate of the table
  • y2 = bottom coordinate of the table (top + height)
  • x2 = right coordinate of the table (left + width)
  • csvfile: Name of the target CSV file to which data is to be written
  • filename: Name of the source PDF file from which data is to be extracted
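Because -a expects the bottom and right edges rather than a height and a width, a small helper can do the arithmetic. The following is a minimal sketch, not part of Tabula: the function name `tabula_area` is made up for illustration, and it assumes `awk` is available on your system.

```shell
# Hypothetical helper (not part of Tabula): turn top, left, height and width
# into the "top,left,bottom,right" string that the -a option expects.
# Usage: tabula_area TOP LEFT HEIGHT WIDTH
tabula_area() {
  awk -v t="$1" -v l="$2" -v h="$3" -v w="$4" \
    'BEGIN { printf "%.2f,%.2f,%.2f,%.2f\n", t, l, t + h, l + w }'
}
```

For example, `tabula_area 71 71 459 385.56` prints `71.00,71.00,530.00,456.56`, and the result can be passed on the command line as `-a "$(tabula_area 71 71 459 385.56)"`. Values are rounded to two decimal places, which is an arbitrary choice in this sketch.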

Example:

java -jar /path/to/tabula/tabula-1.0.2-jar-with-dependencies.jar -p all -a 71.0,71.0,530.0,456.56 -o output.csv source.pdf

Command Line Execution
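The Stream and Lattice methods and the export formats described earlier are also available on the command line. As far as I know, tabula-java accepts -t for Stream, -l for Lattice and -f for the output format (CSV, TSV or JSON), but check the --help output of your version. The helper below is a hypothetical sketch, not part of Tabula, that maps readable names to those flags and rejects anything else.

```shell
# Hypothetical helper (not part of Tabula): build tabula-java option flags.
# Assumed flags: -l (Lattice), -t (Stream), -f FORMAT (CSV, TSV or JSON).
# Usage: tabula_flags METHOD FORMAT
tabula_flags() {
  method="$1"
  format="$2"
  case "$method" in
    lattice) flag="-l" ;;
    stream)  flag="-t" ;;
    *) echo "unknown method: $method" >&2; return 1 ;;
  esac
  case "$format" in
    CSV|TSV|JSON) ;;
    *) echo "unknown format: $format" >&2; return 1 ;;
  esac
  printf '%s -f %s\n' "$flag" "$format"
}
```

A call such as `java -jar /path/to/tabula/tabula-1.0.2-jar-with-dependencies.jar $(tabula_flags lattice JSON) -p all -o output.json source.pdf` would then request Lattice extraction with JSON output. The command substitution is left unquoted on purpose so that the shell splits it into separate arguments.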

You can write a script like this to iterate over many identical-format PDFs in a directory:

for f in /path/to/dir/*.pdf; do
  java -jar /path/to/tabula/tabula-1.0.2-jar-with-dependencies.jar -p all -a 49.5,52.3285714,599.6571428,743.91428571 -o "$f.csv" "$f"
done
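For larger batches it can help to wrap the loop in a function that skips an empty directory, names the output report.csv rather than report.pdf.csv, and keeps going when one file fails. This is a sketch under assumptions: `TABULA_JAR`, `csv_name_for` and `extract_all` are names invented here, and the jar path is the same placeholder used above.

```shell
# Placeholder location of the tabula-java jar; adjust to your download.
TABULA_JAR="${TABULA_JAR:-/path/to/tabula/tabula-1.0.2-jar-with-dependencies.jar}"

# Map "report.pdf" to "report.csv".
csv_name_for() {
  printf '%s\n' "${1%.pdf}.csv"
}

# Extract the same area from every PDF in a directory.
# Usage: extract_all DIR AREA   (AREA is "top,left,bottom,right")
extract_all() {
  dir="$1"
  area="$2"
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue   # the glob matched nothing
    if ! java -jar "$TABULA_JAR" -p all -a "$area" -o "$(csv_name_for "$f")" "$f"; then
      echo "failed: $f" >&2   # report the failure and move on to the next file
    fi
  done
}
```

It could be called as `extract_all /path/to/dir 49.5,52.3285714,599.6571428,743.91428571`. Quoting every expansion keeps file names with spaces intact.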

Tabula Limitations:

Tabula is an excellent option for PDF data extraction, but it has some limitations:

  • Tabula cannot reliably extract data from rows that span multiple lines or from merged cells.
  • Tabula can only process text-based PDFs. It cannot extract data from scanned PDF documents because it does not include an OCR engine.

 
