Content Grabber is easy-to-use web scraping software, but it is also flexible and powerful. Part of this flexibility comes from a sophisticated scripting capability that lets developers control a web scraping agent and manage the data being extracted.
Content Grabber supports scripting in several places, so you can customize its behavior for your specific needs or extend and enhance its standard functionality. Content Grabber scripts are either .NET functions written in C# or VB.NET, or regular expressions.
Custom scripting gives you more control over the web scraping workflow, and not all web scraping tools offer it. The table below compares a few popular web scraping tools and shows to what extent each supports scripting:
Tool | Scripting Level | Language Supported | Text Processing | Command Script | Extension Scripts |
Content Grabber | High | C# / VB.NET | Yes | Yes | Yes |
Visual Web Ripper | Moderate | C# / VB.NET | Yes | Yes | Yes |
Fminer | Moderate | Python | Yes | No | No |
Web Content Extractor | Low | VB.NET | Yes | No | No |
WebHarvy | Low | – | Yes with Capture Text option | No | No |
The Content Grabber script engine supports C#, VB.NET and regular expressions. Regular expressions can only be used for content transformation, while C# and VB.NET can be used to write more complex scripts that execute at runtime.
Content Grabber scripts can be categorized into three different types:
- Content transformation scripts
- Command scripts (Custom Script)
- Extension scripts
1. Content transformation scripts
Content transformation scripts are used to transform content after it has been scraped from a web page. For example, some websites prefix telephone numbers with a label, as in “Tel:000000000”, when we only want the number itself. A regular expression can strip the label and keep just the telephone number.
The script below is a good example of using a content transformation script to save the HTML of a page to a local hard drive. This can be useful if you find after scraping that some data was missed: instead of making another request to the website, you can parse the data from the saved HTML files.
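To make the telephone example concrete, here is a minimal sketch of the pattern in plain C#. It uses the standard .NET `Regex` class rather than Content Grabber's own regular expression script syntax, and the `PhoneCleaner` class and `StripTelPrefix` method are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the Content Grabber API.
public static class PhoneCleaner
{
    // Matches a leading "Tel" label (any case) followed by a colon,
    // with optional whitespace around either part.
    private static readonly Regex TelPrefix =
        new Regex(@"^\s*Tel\s*:\s*", RegexOptions.IgnoreCase);

    // Returns the value with the leading "Tel:" label removed.
    // Values without the label are returned trimmed but otherwise unchanged.
    public static string StripTelPrefix(string content)
    {
        if (string.IsNullOrEmpty(content)) return string.Empty;
        return TelPrefix.Replace(content, string.Empty).Trim();
    }
}
```

Calling `PhoneCleaner.StripTelPrefix("Tel:000000000")` returns `000000000`. The colon is required in the pattern so that a word such as "Telephone" is not cut in half.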
    using System;
    using System.IO;
    using Sequentum.ContentGrabber.Api;

    public class Script
    {
        // See help for a definition of ContentTransformationArguments.
        public static string TransformContent(ContentTransformationArguments args)
        {
            // Use the row ID as the file name, save the captured HTML,
            // and return the file name as the transformed content.
            String fname = args.DataRow.RowId.ToString();
            DownloadHTML(args.Content, fname);
            return fname;
        }

        // Writes the HTML to E:\html\<file_name>.html as UTF-8.
        // The E:\html folder must already exist.
        public static void DownloadHTML(String html, String file_name)
        {
            System.IO.File.WriteAllText("E:\\html\\" + file_name + ".html", html, System.Text.Encoding.UTF8);
        }
    }
2. Command scripts (Custom Script)
Command scripts are used to change the properties of a command at runtime. They are generally used to change the execution flow of an agent.
The script below is an example of changing the XPath of an element at runtime.
    using System;
    using Sequentum.ContentGrabber.Api;
    using Sequentum.ContentGrabber.Commands;

    public class Script
    {
        // See help for a definition of CustomScriptArguments.
        public static bool TransformCommand(CommandTransfomationScriptArguments args)
        {
            // Modify args.Command here.
            ISelection selection = args.Command as ISelection;
            // Parenthesize the addition so the ID is incremented numerically
            // before it is concatenated into the XPath.
            selection.Selection.SelectionPaths[0].Xpath =
                "//a[contains(@href,'javascript:')][text()='"
                + (int.Parse(args.DataRow.GetDataValue("ID")) + 1) + "']";
            return true;
        }
    }
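When an XPath like this is assembled from a numeric value, the arithmetic has to happen before the string concatenation. In C#, `"x" + 5 + 1` produces `x51`, while `"x" + (5 + 1)` produces `x6`. The sketch below isolates that step in a standalone helper. `XPathBuilder` and `BuildNextLinkXPath` are hypothetical names, and the helper does not touch the Content Grabber API.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the Content Grabber API.
public static class XPathBuilder
{
    // Builds an XPath for the javascript link whose text is currentId + 1.
    // The ID arrives as text, so it is parsed and incremented as a number
    // before being placed into the XPath string.
    public static string BuildNextLinkXPath(string currentId)
    {
        int next = int.Parse(currentId, CultureInfo.InvariantCulture) + 1;
        return "//a[contains(@href,'javascript:')][text()='"
            + next.ToString(CultureInfo.InvariantCulture) + "']";
    }
}
```

For an ID of `5`, the helper returns `//a[contains(@href,'javascript:')][text()='6']`. Keeping the string building in a small function like this also makes it easy to check outside of an agent.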
3. Extension scripts
Extension scripts, as the name suggests, are used to extend the functionality of Content Grabber.
Examples:
- Data Export script
A script can be developed to export extracted data to a format that Content Grabber does not support out of the box.
- Agent Initialization Scripts
If you want to execute some code before an agent starts, an agent initialization script is the way to do it. Agent initialization scripts are generally used to set certain properties and initialize data such as a database connection string.
- Data Input Scripts
Data input scripts are used to produce data for a data provider command. For example, you can generate a list of URLs that follow a certain pattern, such as a page URL ending with a sequential page number.
- Data Distribution Scripts
By default, Content Grabber provides data distribution to FTP or email. If you want to distribute (store) data to a cloud then the Data Distribution script is the best option.
- Image OCR Scripts
Image OCR scripts are used to convert an image into text. Some websites display email addresses and phone numbers as images to block web scraping, and in those cases this option can be very useful.
- Convert Document to HTML Scripts
Sometimes you need to extract data from non-HTML documents such as .DOC or .PDF files. To do so, you can first convert the document into an HTML document with this script, and then extract data from the resulting HTML as usual.
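As an illustration of the Data Input Scripts item above, the following sketch shows only the URL-generation logic such a script might contain. It is plain C# and does not use the Content Grabber script signature, which is not shown here. The `PageUrlGenerator` class, its `Generate` method, and the example URL are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Content Grabber API.
public static class PageUrlGenerator
{
    // Returns baseUrl followed by each page number from firstPage
    // to lastPage, inclusive. Returns an empty list if firstPage > lastPage.
    public static List<string> Generate(string baseUrl, int firstPage, int lastPage)
    {
        if (baseUrl == null) throw new ArgumentNullException("baseUrl");

        var urls = new List<string>();
        for (int page = firstPage; page <= lastPage; page++)
        {
            urls.Add(baseUrl + page);
        }
        return urls;
    }
}
```

For example, `PageUrlGenerator.Generate("https://example.com/list?page=", 1, 3)` yields three URLs ending in `page=1`, `page=2` and `page=3`. In a real agent, the resulting list would be handed to the data provider command in whatever form the script interface expects.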