Bots became more powerful in 2014. As the war continues, let’s take a closer look at why common strategies to prevent scraping didn’t pay off.
With the market for online businesses expanding rapidly, the development teams behind these online portals are under great amounts of pressure to keep up in the race. Scalability, availability and responsiveness are some of the commonly faced problems for a growing online business portal. As the value of content is increasing, content theft has become an increasing problem in the form of web scraping.
Competitors have learned to stay ahead of the race by using bots to scrape. While how these bots could be harmful is something worth talking about, it is not the main scope of this article. This article discusses some of the commonly used weapons to fight bots and brings to light their effectiveness in reality.
We come across many developers who claim to have taken measures to prevent their sites from being scraped. A common belief is that these below listed techniques reduce scraping activities significantly on a website. While some of these methods could actually work in concept, we were interested to explore how effective they were in practice.
Most Commonly used techniques to Prevent Scraping:
- Setting up robots.txt – Surprisingly, this technique is used against malicious bots! Why this wouldn’t work is pretty straight forward – robots.txt is an agreement between websites and search engine bots to prevent search engine bots from accessing sensitive information. No malicious bot (or the scraper behind it) in it’s right mind would obey robots.txt. This is the most ineffective method to prevent scraping.
- Filtering requests by User agent – The user agent string of a client is set by the client itself. One method is to obtain this from the HTTP header of a request. This way, a request can be filtered even before the content is served to the request. We observed that very few bots (approximately less than 10%), used the default user agent string which belonged to a scraping tool or was an empty string. Once their requests to the website were filtered based on the user agent, it didn’t take too long for scrapers to realize this and change their user agent to that of any well known browser. This method merely stops new bots written by inexperienced scrapers for a few hours.
- Blacklisting the IP address – Seeking out to an IP blacklisting service is much easier than having to perform the hectic process of capturing more metrics from page requests and analyzing server logs. There are plenty of third party services which maintain a database of blacklisted IPs. In our hunt for a suitable blacklisting service, we found that using a third party DNSBL/RBL service was not effective as these services blacklisted only email spambot servers and were not effective in preventing scraping bots. Less than 2% of scraping bots were detected for one of our customer’s when we did a trial run.
- Throwing CAPTCHA – A very well know practice to stop bots is to throw CAPTCHA on pages with sensitive content. Although effective against bots, CAPTCHA is thrown to all clients requesting the web page irrespective of whether it is a human or a bot. This method often antagonizes users and hence reduces traffic to the website. Some more insights to the new NO CAPTCHA Re-CAPTCHA by Google can be found in our previous blog post.
- Honey pot or Honey trap – Honey pots are a brilliant trap mechanism to capture new bots (scrapers who are not well versed with structure of every page) on the website. But, this approach poses a lesser known threat of reducing the page rank on search engines. Here’s why – Search engine bots visit these links and might get trapped accidentally. Even if exceptions to the page were made by disallowing a set of known user agents, the links to the traps might be indexed by a search engine bot. These links are interpreted as dead, irrelevant or fake links by search engines. With more such traps, the ranking of the website decreases considerably. Furthermore, filtering requests based on user agent can exploited as discussed above. In short, honey pots are risky business which must be handled very carefully.
To summarize, these prevention strategies listed are either weak or require constant monitoring and regular maintenance to keep them effective. In practice bots are far more challenging than they actually seem to be.
What to expect in 2015?
With increasing need for scraping, the number of scraping tools and expert scrapers are also increasing which simply means bots are going to be an increasing problem. In fact, the usage of headless browsers i.e, browser like bots which are used to scrape are increasing and scrapers are no longer relying on wget, curl and html parsers. Preventing malicious bots from stealing content without actually disturbing the genuine traffic from humans and search engine bots is just going get harder. By the end of the year, we could infer from our database that almost half of an average website’s traffic is caused by bots. And a whopping 30-40% is caused by malicious bots. We believe this is only going to increase if we do not step up to take action!
p.s. If you think you are facing similar problems, why not request for more information? Also, if you do not have the time or bandwidth for taking such actions, scraping prevention and stopping malicious bots is something we provide as a service. How about a free trial?