Within the past decade, digital transformation has revolutionized the way we live and do business, thanks to smartphones and global internet penetration. According to the ITU, around 3.5 billion people were using the internet in 2016; for comparison, there were 1.5 billion individual internet users in 2008. This growth spurred all kinds of online businesses, from online shopping to booking flight tickets. But this is also where the good things stop.
Automated internet programs, or bots, are created for various purposes: some good, many malicious. The good include search engine bots, social network bots, aggregator bots, and so on. Malicious bots, on the other hand, are created by hackers to perform automated tasks such as scraping content, prices, and product catalogs; creating fake registrations; collecting flight seat information; and mass-booking tickets to resell elsewhere (scalping). These nefarious activities are endless and on the rise: almost half of all web traffic comes from bots, and most of these bots are created with malicious intent.
Suppose you're an online business owner who wants to block malicious bots on your website. What will you do? Certainly not these five things, if you truly want to stop bad bots sent by hackers and competitors.
1. Analyzing Server Logs – It’s laborious
Apache, NGINX, and IIS server logs can be analyzed manually to find anomalous bot activity. Each time, the logs must be exported to a spreadsheet and sorted by columns to show IP addresses and User Agents. Once you identify bot activity, such as an unusual number of hits from certain User Agents or IPs, you isolate those IPs and block them with your firewall. Unfortunately, this process is laborious, consuming many man-hours that could be allocated to higher-priority activities. The biggest downside is that hackers these days may route malicious bots through multiple genuine IPs, so you may end up blocking real users who access your site from those addresses.
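The manual spreadsheet workflow above can be partly scripted. Here is a minimal sketch in Python, assuming your server writes the common Apache/NGINX "combined" log format; the regex and the hit threshold are illustrative assumptions you would adjust for your own logs:

```python
import re
from collections import Counter

# Regex for the common "combined" log format used by Apache and NGINX.
# (An assumption; adjust to match your server's actual log format.)
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def count_hits(log_lines):
    """Count requests per (IP, User-Agent) pair."""
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match:
            hits[(match.group("ip"), match.group("user_agent"))] += 1
    return hits

def suspicious_clients(log_lines, threshold=1000):
    """Flag clients whose request count exceeds an assumed threshold."""
    return [client for client, n in count_hits(log_lines).items() if n > threshold]
```

This automates the counting step but not the judgment call: a high hit count from one IP may still be a NAT gateway or a corporate proxy full of real users, which is exactly the false-positive risk described above.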
2. Showing CAPTCHA – Even to real users?
A common practice for blocking bad bots on important pages is to show a CAPTCHA. Though effective against bad bots, a CAPTCHA should not be shown to everyone requesting the page without first ascertaining whether the visitor is a human or a bot. Genuine users get frustrated deciphering the distorted letters in the CAPTCHA box and can quickly bounce off your web page. Mindlessly showing CAPTCHAs to humans and bots alike hurts user experience and, in the long run, the brand perception of the website.
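One way to avoid challenging every visitor is to gate the CAPTCHA behind a simple behavioral signal. The sketch below is a hypothetical example, not a complete bot detector: it shows a CAPTCHA only to clients whose request rate over a sliding window looks automated, with the window and threshold values being assumptions to tune per site:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60           # assumed sliding-window length
MAX_REQUESTS_PER_WINDOW = 30  # assumed threshold; tune per site

_request_times = defaultdict(deque)

def should_show_captcha(client_ip, now=None):
    """Return True only if this client's recent request rate exceeds the threshold."""
    now = time.time() if now is None else now
    times = _request_times[client_ip]
    times.append(now)
    # Drop requests that have aged out of the sliding window.
    while times and now - times[0] > WINDOW_SECONDS:
        times.popleft()
    return len(times) > MAX_REQUESTS_PER_WINDOW
```

A casual visitor browsing a few pages a minute never sees the challenge, while a client hammering the page at machine speed does, which addresses the user-experience problem described above without abandoning the CAPTCHA entirely.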
3. Robots.txt – Scraper bots just don’t care!
This is a basic misunderstanding many website owners have: setting robots.txt to Disallow URLs in the belief that crawlers and bots, good or bad, will not traverse their website. Unfortunately, this method does not shield a website from bots at all, because the people who run malicious bots are simply not bothered by the rules in the robots.txt file; compliance is entirely voluntary. In short, tweaking robots.txt doesn't stop scrapers from stealing your content, though interestingly, some still consider it a good tool for blocking scraper bots.
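For illustration, a typical robots.txt of the kind described above might look like this (the paths are hypothetical):

```
User-agent: *
Disallow: /private/
Disallow: /product-catalog/
```

A well-behaved crawler like Googlebot reads these rules and skips the disallowed paths, but nothing technically prevents a scraper from fetching /product-catalog/ anyway; the file is a request, not an access control.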
4. Honeypot – Want to lose your search engine rankings?
Honeypots are a trap mechanism for catching new bots (sent by scrapers who are not familiar with the structure of every page) on a website. But this approach poses a lesser-known threat: it can reduce the site's ranking on search engines. Search engine bots fall into the trap too, and interpret the trap links as dead, irrelevant, or fake. With more such traps, the ranking of the website can drop considerably. Setting up honeypots is risky and needs to be managed very carefully.
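A common form of this trap is a link hidden from humans but present in the page source. The snippet below is a hypothetical example (the trap path is made up); the rel="nofollow" attribute and a matching robots.txt Disallow for the trap path can reduce, though not eliminate, the search-ranking risk described above:

```html
<!-- Hypothetical honeypot link: hidden from humans via CSS, but visible
     to naive scrapers that follow every href in the page source. -->
<a href="/trap/do-not-follow" style="display:none" rel="nofollow">catalog</a>
```

Any client that requests /trap/do-not-follow has almost certainly parsed the raw HTML rather than rendered the page, which is the signal the honeypot relies on.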
5. In-house bot prevention – Bots evolve faster!
All or some of the aforementioned techniques can be run by an in-house bot prevention team. Sure, they will be able to detect and block bots. However, accuracy and consistency vary drastically, as it is still a manual, error-prone process. The key consideration is that once bots are blocked, scrapers always try to find a way back in by tweaking bot behavior and rotating IPs, and in many instances they can emulate human behavior. This presents a huge challenge to the internal team, who may not even realize they are being attacked with ever greater sophistication.
How are you dealing with bots right now? And in the first place, do you know how much of your traffic comes from real users (humans)?