Dec 24, 2018

Web Scraping using Content Grabber API

API stands for Application Programming Interface: a software intermediary that allows two applications to communicate with each other. The Content Grabber API defines how a developer can manage Content Grabber agents programmatically. It works like a program API that uses Remote Procedure Calls (RPC) to access a component.

The Content Grabber programming interface (API) provides access to the Content Grabber run-time from your own web or desktop applications. For example, if you want to access the results of a Content Grabber agent in your web application and display them on a dashboard, you can do so using the Content Grabber API. The Content Grabber run-time can be distributed with your applications royalty free and does not require the Content Grabber application to be installed on the target computer. The Content Grabber run-time requires .NET version 4.5 or higher.

[Figure: Content Grabber API]

Consider a situation in which you want to control a Content Grabber agent from a web application. You can do this with the Content Grabber Proxy API. This means you can control agents and access extracted data from outside the Content Grabber desktop application, for example from an ASP.NET web application or from a Python web scraping module on a Linux server.

Content Grabber lets you build standalone agents that can be distributed royalty free and run without the full Content Grabber application on the target computer. However, standalone agents have some limitations: they come with a standard user interface, and data can only be exported to file formats. If you want to design your own interface or export data to formats other than those a standalone agent provides, the Content Grabber API is the option.

  • Accessing Content Grabber Agent in Windows Application:

1.1. Configure Visual Studio: To use the Content Grabber API, add the following references to your project: 1) AgentApi.dll and 2) AgentProxy.dll

1.2. Generate the Content Grabber run-time files in the Content Grabber application by choosing Run-time Package in the Application menu. This generates a zip file with all required files and folders. Copy these files and folders into your BIN folder.

  • Accessing Content Grabber Agent in Web Application:

You can use the Content Grabber API in web applications through the Content Grabber agent service, a Windows service that runs agents. Your web app interacts with the Windows service through a small proxy assembly that requires no special security privileges and depends on no other files. Add the proxy assembly to your web application’s assembly references and use the proxy to call the API functions. The proxy can only execute agents, not load agents.

Limitations of Content Grabber API:

  • You cannot alter the agent command structure using the API
  • You cannot change the agent in a way that could affect its output data structure
  • You cannot add, delete, move or copy commands in an agent
  • You cannot change the Disabled and Export command properties

Some useful functionalities of Content Grabber Web Scraping API

#1. Enable/Disable Content Grabber API:

You can enable or disable Content Grabber API features using the Configure Service option. It has the following configuration options:

  1. Enable/disable Remote Procedure Calls (SOAP) and set the SOAP port
  2. Enable/disable the REST API and set the REST API port
  3. Enable/disable the Resource Monitor and set its interval
  4. Enable/disable the Service Scheduler
  5. Enable/disable service auto start

[Figure: API Service]

#2. Use case: Product Scraping API

Let us assume a case where we want to look up and scrape product information based on a provided barcode/UPC number. In this case we can develop an API using Content Grabber that takes a barcode/UPC as input and returns JSON data as output.

[Figure: Product Scraping]

The Content Grabber Windows service supports simple web requests, so you can run agents and retrieve scraped data from non-Windows environments, such as from a Python script or PHP page on a Linux server.

RunAgentReturnJson is the method name. It has one mandatory argument, “agent”, which is the name of the agent to call. You can pass additional parameters along with the request. In this example, UPC_SEARCH_API is the name of the agent, and it expects a UPC as input.

http://localhost:8004/ContentGrabber/RunAgentReturnJson?agent=UPC_SEARCH_API&pars={"upc":"020477122005"}
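
Since the service accepts plain web requests, the same call can be made from a Python script. The following is a minimal sketch rather than an official client: the function names `build_run_agent_url` and `run_agent` are hypothetical, the host and port are simply copied from the example URL, and it assumes the service accepts the `pars` JSON in percent-encoded form.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host, port and path are taken from the example URL; adjust for your service.
BASE_URL = "http://localhost:8004/ContentGrabber/RunAgentReturnJson"


def build_run_agent_url(agent, params=None, base_url=BASE_URL):
    """Build a RunAgentReturnJson URL (hypothetical helper, not part of the API)."""
    query = {"agent": agent}  # "agent" is the one mandatory argument
    if params:
        # Input parameters travel as a compact JSON object in "pars".
        query["pars"] = json.dumps(params, separators=(",", ":"))
    # urlencode percent-encodes the JSON so the URL is valid
    # (assumption: the service decodes it like any query string).
    return base_url + "?" + urlencode(query)


def run_agent(agent, params=None, timeout=60):
    """Run the agent synchronously and return the decoded JSON response."""
    with urlopen(build_run_agent_url(agent, params), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call run_agent(...) when the service is running.
    print(build_run_agent_url("UPC_SEARCH_API", {"upc": "020477122005"}))
```

The timeout of 60 seconds is an arbitrary choice; because the agent runs synchronously, pick a value that covers the agent's expected run time.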

The above API endpoint executes the Content Grabber agent synchronously and returns the scraped product information as a JSON string, as shown below:

{
    "status": {
        "agentName": "UPC_SEARCH_API",
        "agentPath": "C:\\Users\\Public\\Documents\\Content Grabber 2\\Agents\\UPC_SEARCH_API\\UPC_SEARCH_API.scg",
        "version": "",
        "session": "6348191d-59d6-4461-b79e-61ee3327b020",
        "lastRunTime": "2018-09-05T07:21:34Z",
        "status": "Completed",
        "pageErrors": 0,
        "pageLoads": 1,
        "dataCount": 0,
        "lastDataCount": 0,
        "exportRowCount": 1,
        "lastExportRowCount": 0,
        "isScheduled": false,
        "nextRunTime": ""
    },
    "data": {
        "UPC_SEARCH_API": [
            {
                "UPC": "020477122005",
                "Product Name": "Chiavetta's Barbeque Marinade 64 Oz (Pack of 4)",
                "URL": "https://www.buycott.com/upc/020477122005",
                "Brand": "Chiavetta's",
                "Manufacturer": "Chiavetta's Catering Service",
                "EAN": "0020477122005"
            }
        ]
    }
}
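
A caller will usually want just the scraped rows from this response. The sketch below shows one way to pull them out in Python. The helper name `extract_rows` is hypothetical, and treating "Completed" as the only successful status is an assumption based on the sample response; other status values are not documented here.

```python
import json


def extract_rows(response_text, agent):
    """Return the scraped rows for `agent` from a RunAgentReturnJson reply."""
    payload = json.loads(response_text)
    run_status = payload.get("status", {}).get("status")
    # Assumption: "Completed" is the success value, as in the sample response.
    if run_status != "Completed":
        raise RuntimeError(f"Agent run did not complete: status={run_status!r}")
    # Rows are stored under "data", keyed by the agent name.
    return payload.get("data", {}).get(agent, [])


# Trimmed copy of the sample response above, used as offline input.
SAMPLE_RESPONSE = json.dumps({
    "status": {"agentName": "UPC_SEARCH_API", "status": "Completed"},
    "data": {
        "UPC_SEARCH_API": [
            {"UPC": "020477122005", "Brand": "Chiavetta's", "EAN": "0020477122005"}
        ]
    },
})

if __name__ == "__main__":
    for row in extract_rows(SAMPLE_RESPONSE, "UPC_SEARCH_API"):
        print(row["UPC"], row["Brand"])
```

Raising an error on an unexpected status keeps a failed run from being mistaken for an empty result.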
