Apr 25, 2015

Things to Take Care of While Doing Web Scraping

Web scraping has become one of the most popular terms in data science. Put simply, web scraping means extracting information from websites using programs and scripts written for that purpose. Many organizations have used web scraping to build relevant, useful databases that they rely on daily to further their business interests. In the age of Big Data, web scraping is one of the trending techniques in data science.

While learning web scraping and completing many scraping projects, I have picked up some lessons worth sharing. In this post, I’m going to discuss some approaches to take and some to avoid when scraping the web.

Use Proxies: Anonymously scraping data from websites

Do not scrape a website from a single IP address. When you request pages repeatedly, the remote web server may block your IP address and prevent further requests. To reduce this risk, scrape through proxy servers (anonymous scraping). Proxies hide your identity (network details) from the remote web server and lower the chance of being detected and blacklisted. You may also use a VPN instead of proxies to scrape anonymously.
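As a rough illustration of the idea, the sketch below rotates requests across a small pool of proxies using only the Python standard library. The proxy addresses and the names next_proxy, build_opener_for and fetch are assumptions made up for this example; substitute proxies you are actually permitted to use.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses (documentation-only IP range); replace with real ones.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
]

# Endless round-robin iterator over the proxy pool.
_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy URL in round-robin order."""
    return next(_proxy_cycle)


def build_opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Download url through the next proxy in the pool and return the raw bytes."""
    opener = build_opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to fetch goes out through a different proxy, so no single address carries all the traffic. A real project would probably also drop proxies that fail and pause between requests.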

Take maximum data and store it

Do not process each web page as it arrives from the remote server. Instead, save the full response to disk first and process it afterwards. If your scraping algorithm breaks midway, you can resume from the stored copies instead of scraping again. Never download the same content more than once; doing so only wastes bandwidth. Try to download all the content to disk in one go and then do the processing.
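One possible way to do this is a small download-once cache, sketched below with the standard library. The directory name raw_pages, the SHA-256 file naming and the function names are assumptions for this example, not a prescribed layout.

```python
import hashlib
import urllib.request
from pathlib import Path

CACHE_DIR = Path("raw_pages")  # hypothetical directory for the raw downloads


def cache_path(url, cache_dir=CACHE_DIR):
    """Map a URL to a stable file name by hashing it."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")


def _download(url):
    """Default downloader: plain HTTP GET returning the raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def fetch_cached(url, cache_dir=CACHE_DIR, downloader=_download):
    """Return the page body, downloading it only if it is not on disk yet."""
    path = cache_path(url, cache_dir)
    if path.exists():
        # Already stored: no second request is sent to the server.
        return path.read_bytes()
    body = downloader(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(body)
    return body
```

With the raw pages on disk, the parsing step can be rerun as often as needed after a crash or a bug fix without touching the remote server again.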

Follow strict rules in parsing

Validate the information as you parse it from the website. For example, if you expect a value to be a date, check that it really is a date. Such checks can greatly improve the quality of the extracted information. When you encounter unexpected data, adjust the algorithm accordingly.
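The date check mentioned above could look like the following minimal sketch. The function name parse_date_strict and the default format %Y-%m-%d are assumptions; use whatever format the target site really publishes.

```python
from datetime import datetime


def parse_date_strict(text, fmt="%Y-%m-%d"):
    """Return a date object, or raise ValueError if text is not a date in fmt."""
    try:
        return datetime.strptime(text.strip(), fmt).date()
    except ValueError as exc:
        # Fail loudly so unexpected data is noticed instead of silently stored.
        raise ValueError(f"expected a date like {fmt!r}, got {text!r}") from exc
```

Raising an error rather than storing the raw string makes layout changes and bad records visible as soon as they appear.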

Respect Robots.txt

Robots.txt specifies the rules that web crawlers and robots should follow. I strongly advise you to adjust your crawler to fully respect robots.txt. The file lists which pages each user-agent is allowed to crawl and, on some sites, the required interval between page requests. Following these instructions minimizes the chance of being blacklisted and banned by the website owner.
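Python ships a robots.txt parser in its standard library, so a crawler can check the rules before each request. The sketch below parses an inline sample file to stay self-contained; the sample rules, the example.com URLs and the agent name my-scraper are made up for illustration. Against a real site you would call set_url and read on the parser instead.

```python
import urllib.robotparser

# Made-up robots.txt content used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="my-scraper"):
    """Return True if the parsed robots.txt permits agent to fetch url."""
    return parser.can_fetch(agent, url)


# Seconds to wait between requests, or None if the file does not say.
delay = parser.crawl_delay("my-scraper")
```

A crawler built on this would skip any URL for which allowed returns False and sleep for delay seconds between requests when a delay is given.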

Use XPath Smartly

XPath lets you select elements of an HTML document more flexibly than CSS selectors. Be careful, though: HTML structure can change from page to page, so an XPath written for one page may fail to extract data from another.
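One way to cope with layouts that differ between pages is to try several XPath expressions in order and fail loudly when none matches. The sketch below uses the limited XPath subset of the standard library's xml.etree.ElementTree on two made-up, well-formed snippets; real-world HTML is rarely well formed, so in practice you would likely need a tolerant HTML parser such as lxml. The page samples, the expressions and the name extract_price are all assumptions.

```python
import xml.etree.ElementTree as ET

# Two made-up page layouts that present the same field differently.
PAGE_A = "<html><body><div class='price'><span>19.99</span></div></body></html>"
PAGE_B = "<html><body><p class='price'>24.50</p></body></html>"

# Candidate expressions, tried in order until one matches.
PRICE_XPATHS = [
    ".//div[@class='price']/span",
    ".//p[@class='price']",
]


def extract_price(markup):
    """Return the price text using the first XPath that matches."""
    root = ET.fromstring(markup)
    for xpath in PRICE_XPATHS:
        node = root.find(xpath)
        if node is not None and node.text:
            return node.text.strip()
    # No expression matched: report it rather than returning empty data.
    raise LookupError("no known XPath matched; page layout may have changed")
```

Here extract_price(PAGE_A) and extract_price(PAGE_B) both succeed even though the markup differs, while an unknown layout raises an error that tells you the expressions need updating.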

Obey Website T&C

Some websites state explicitly in their terms and conditions that they prohibit web scraping of their content. Ignoring these terms can expose you to ethical and legal consequences.

Test a sample scrape and verify the data with the actual scrape

Once your web scraping project is set up, test it for some time. Check the extracted data. If something is wrong, find the cause and make changes accordingly until the project works reliably.
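Checking the extracted data can be partly automated by comparing scraper output with a handful of values you have verified by hand on the live site. The sketch below is one simple way to do that; the function name compare_sample and the field names in the usage example are hypothetical.

```python
def compare_sample(scraped, expected):
    """Return (field, expected, actual) tuples for every field that differs."""
    mismatches = []
    for field, want in expected.items():
        got = scraped.get(field)
        if got != want:
            mismatches.append((field, want, got))
    return mismatches


# Usage with hypothetical hand-checked values for one page.
expected = {"title": "Blue Widget", "price": "19.99"}
scraped = {"title": "Blue Widget", "price": "19.90"}
problems = compare_sample(scraped, expected)
```

An empty list means the sample matches; anything else points at the field whose extraction rule needs another look.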

 
