Sep 19, 2016

Custom Scripting in Content Grabber

While Content Grabber is very easy-to-use web scraping software, don't make the mistake of thinking it is not also flexible and powerful. Part of this flexibility comes from a sophisticated scripting capability that lets developers control a web scraping agent and manage the data being extracted.

Content Grabber offers several kinds of scripting, which let you customize its behavior for your specific needs or extend and enhance its standard functionality. Content Grabber scripts are .NET functions written in C# or VB.NET, or regular expressions.

Custom scripting gives you more control over the web scraping workflow, and not all scraping tools offer it. The table below compares a few popular web scraping tools and the extent to which they support scripting:

| Tool | Scripting Level | Language Supported | Text Processing | Command Script | Extension Scripts |
|---|---|---|---|---|---|
| Content Grabber | High | C# / VB.NET | Yes | Yes | Yes |
| Visual Web Ripper | Moderate | C# / VB.NET | Yes | Yes | Yes |
| Fminer | Moderate | Python | Yes | No | No |
| Web Content Extractor | Low | VB.NET | Yes | No | No |
| WebHarvy | Low | | Yes, with Capture Text option | No | No |

The Content Grabber script engine supports C#, VB.NET, and regular expressions. Regular expressions can only be used for content transformation, while C# and VB.NET can be used to write more complex scripts that execute at run time.

Content Grabber scripts can be categorized into three different types:

  • Content transformation scripts
  • Command scripts (Custom Script)
  • Extension scripts

 

1. Content transformation scripts

Content transformation scripts are used to transform content after it has been scraped from a web page. For example, some websites prefix telephone numbers with a label, as in “Tel:000000000”, when we want only the number. A regular expression can extract just the telephone number.
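
As a rough sketch of that kind of transformation, the regular expression below drops a leading “Tel:” label and keeps the number. It is written in Python only to make the pattern easy to try out; the pattern and the function name `strip_tel_prefix` are assumptions for illustration, not part of Content Grabber's script API.

```python
import re

# Hypothetical pattern: a "Tel" label, an optional colon, then the number itself.
TEL_PATTERN = re.compile(r"^\s*Tel\s*:?\s*(?P<number>[\d\s()+-]+?)\s*$", re.IGNORECASE)

def strip_tel_prefix(raw: str) -> str:
    """Return the phone number without its "Tel:" label.

    Text that does not match the pattern is returned unchanged.
    """
    match = TEL_PATTERN.match(raw)
    return match.group("number") if match else raw
```

Returning unmatched text unchanged is a deliberate choice here, so that values without the label pass through the transformation untouched.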

A content transformation script can also download the HTML of the page and save it to a local hard drive. This is useful if, after scraping completes, you find that some data was missed: instead of making another request to the website, you can parse the data from the downloaded HTML files.
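
The idea of keeping a local copy of each page can be sketched as follows. This is a generic Python illustration rather than Content Grabber code: the function `save_page_html`, the hashed file-name scheme, and the default `html_cache` folder are all assumptions.

```python
import hashlib
from pathlib import Path

def save_page_html(url: str, html: str, folder: str = "html_cache") -> Path:
    """Write the page HTML to <folder>/<sha1 of url>.html and return the path."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that any URL maps to a safe, stable file name.
    file_name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = target_dir / file_name
    path.write_text(html, encoding="utf-8")
    return path
```

Because the file name is derived from the URL, saving the same page twice overwrites the earlier copy instead of creating duplicates.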

 

2. Command scripts (Custom Script)

Command scripts change the properties of a command at run time. They are generally used to change the execution flow of an agent.

For example, a command script can change the XPath of an element at run time.

3. Extension scripts

Extension scripts, as the name suggests, are used to extend the functionality of Content Grabber.

Examples:

  • Data Export script

A data export script exports extracted data to a format that Content Grabber does not support.

  • Agent Initialization Scripts

If you want to execute some code before an agent starts, use an agent initialization script. Agent initialization scripts are generally used to set certain properties and to initialize data such as a database connection string.

  • Data Input Scripts

Data input scripts are used to produce data for a data provider command. For example, you can generate a list of URLs that follow a certain pattern, such as page URLs ending with a sequential page number.
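
The URL-pattern case can be sketched in a few lines. The Python below only illustrates the logic (a real data input script would be a C# or VB.NET function); the function name `build_page_urls`, the `{page}` placeholder, and the example URL are assumptions.

```python
from typing import List

def build_page_urls(template: str, first_page: int, last_page: int) -> List[str]:
    """Return one URL per page number, inclusive of both ends."""
    if first_page > last_page:
        raise ValueError("first_page must not exceed last_page")
    # Substitute each page number into the {page} placeholder of the template.
    return [template.format(page=n) for n in range(first_page, last_page + 1)]

# Hypothetical site whose listing pages end with a sequential page number.
urls = build_page_urls("https://example.com/products?page={page}", 1, 3)
```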

  • Data Distribution Scripts

By default, Content Grabber can distribute data by FTP or email. If you want to distribute (store) data in cloud storage, a data distribution script is the best option.

  • Image OCR Scripts

Image OCR scripts are used to convert an image into text. Some websites display email addresses and phone numbers as images as a way of blocking web scraping. In those cases this option can be very useful.

  • Convert Document to HTML Scripts

Sometimes you need to extract data from non-HTML documents such as .DOC or .PDF files. This script first converts such a document into HTML, after which the data can be extracted easily.

 
