Jun 112018 Tagged with , , , 0 Responses

Google OpenRefine : Opensource and Free Tool to Work With Messy Data

open refineIn all the data intensive fields like retail, banking, telecom, insurance etc. managing data without any error is a challenging task. Data cleaning thus becomes vital in modifying or removing data in a database that may be duplicated, incomplete, incorrect or poorly formatted. Every data wrangler wants to cleanup and transform the data into other formats in a quick manner and practicing a lot to refine and analyse the raw data. This practice is widely referred as Data Wrangling, sometimes referred as data munging or data cleansing.

Data quality is an important aspect in the overall success of decision making. Inaccurate data leads to wrong assumptions and analysis. Consequently it leads to failure of the campaign or project. Redundant data can cause various problems like slow load ups, increases inconsistency and decreases efficiency. A good data cleaning tool solves these problems and cleans your database of redundant data, incorrect information and bad entries.

Here is an open source data cleaning tool developed by Google known as OpenRefine that assists in clean up the data and let you analyse data to make informed decision visually and statistically. It was formerly known as Google Refine. It is a good choice for those hunting for open source and free data cleansing software. It is a simple-to-use tool for data cleaning and transformation. It allows interactive transformation of messy data into improved organized data for better decision making.

Google Refine is similar to spreadsheets applications like Microsoft Excel, Apache Calc or LibreOffice but it acts same like database. It can work with spreadsheets too. It is not a web service. It is a desktop application so you need not to upload data to server that’s the big advantage as our data is secure and our super-sensitive data never leaves our machine.

Uses of the Google Refine – The data cleaning tool:

Exploring data:

Data Scientists can explore the large messy dataset in row-column table structure. You can see your data and discover the inconsistency if any.

Cleaning messy data: You can convert semi structured data of text file into structured data using transformations, facets, filters and clustering. For example we are applying a log function on a specific column. We can identify the erroneous values like 0 or blank using facets and filters to correct it or remove it from the table.

Data Transformation: converting values to other formats, normalization and denormalization data. For example you can convert values to uppercase, lowercase or trim the white spaces from the values. You can convert raw text file into flat table. You can also undo/redo operations in any case.

Data Augmentation Or Data Parsing: You can parse data of a website. It has a URL fetch feature and built in jsoup HTML parser and DOM engine. Later you can manipulate the extracted data in your own way and export it to various formats like CSV, Spreadsheets, XML etc. For example you have a spreadsheets containing name of Hospitals and Street addresses but say we want longitude and latitude. There are web services like openstreetmap that provides longitude/latitude of a given address. We can set up Google Refine to call a web service on a specific column (Address in our case) to find the relevant data.

Data reconciliation:

OpenRefine can be used to link and extend your dataset with various web services. Some services also allow OpenRefine to upload your cleaned data to a central database.

Features of the OpenRefine:

General Refine Expression Language(GREL):

In simple terms GREL is the advanced version of Regular Expression with the manipulation operations. We can find the data and manipulate it too. It is a simple language designed for manipulating values easily. It offers a set of built-in functions as well as several variables.

Import and Export data:

You can import data from various formats like CSV, Text files, Web page, or Spreadsheets. Later you can export refined data into different formats such as CSV, TSV, Excel or HTML table, JSON, Wiki Text file etc.

Built in Data Scraper/Web Scraper to Extract data from a website:

Extract data from a web service/website and manipulate it using feature rich GREL expressions.

It also has a feature to set time interval between each web request (time delay). Suppose we want to detect language of a field values, we can call Google Language detection API and pass the substring of the value and detect the language of the cell.

Leave a Reply

Your email address will not be published. Please enter your name, email and a comment.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>