As most web users know by now, there are good bots and bad bots crawling the web. The good bots perform useful functions — such as Googlebot, which indexes websites, and Pingdom's bots, which monitor website uptime and availability. The bad bots, on the other hand, carry out a range of illegal activities such as ad fraud, card cracking, content scraping, and much more.
Until a few years ago, bot herders and nefarious players would simply use desktop-class computers as command and control centers to deploy their bots. Then came the IaaS (Infrastructure-as-a-Service) revolution, with internet giants such as Amazon Web Services (AWS) and Microsoft Azure providing affordable, easily deployable and scalable cloud services to anyone who needed them.
In short order, AWS data centers became the largest source of bad bots — to the extent that many webmasters were often (wrongly) advised to block all AWS IP address ranges. However, the increased prevalence of Secure Web Gateways (SWGs), corporate data centers, web proxies, and VPNs means that substantial numbers of web users rely on data centers as launchpads to the internet. Consequently, traffic from such users is routed through IP addresses allocated to data centers. This corroborates ShieldSquare's research finding that 30–40% of traffic from data centers is indeed genuine (based on traffic visiting our customer base across over 70 countries). A recent study that we conducted on spoofed search engine crawlers attacking our customer base showed that Hetzner, a German ISP, hosted over twice the volume of spoofed crawlers as AWS.
Clearly, blocking all data center traffic is a counterproductive approach. This poses a conundrum for webmasters who want to allow access to genuine users while blocking cloud-deployed bots from IP ranges known to be used by the biggest data center networks such as AWS and Azure. While nefarious parties do leverage data centers and anonymous proxies to carry out large-scale attacks, blocking all data center traffic would also block large numbers of genuine visitors who use data centers as internet gateways.
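To make the problem concrete, the naive approach being cautioned against amounts to a simple CIDR-membership check against a provider's published address ranges. The sketch below uses Python's standard `ipaddress` module; the two CIDR blocks are illustrative placeholders, not the providers' actual published lists (AWS, for instance, publishes its full list separately as a JSON feed):

```python
import ipaddress

# Illustrative sample ranges only -- real provider lists contain
# hundreds of constantly changing CIDR blocks.
DATA_CENTER_RANGES = [
    ipaddress.ip_network("3.0.0.0/15"),    # placeholder AWS-style range
    ipaddress.ip_network("20.33.0.0/16"),  # placeholder Azure-style range
]

def is_data_center_ip(ip: str) -> bool:
    """Return True if the address falls inside any listed data center range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DATA_CENTER_RANGES)

# Naive blocking rejects every visitor in these ranges -- including genuine
# users whose traffic exits through VPNs, SWGs, or corporate proxies.
print(is_data_center_ip("3.1.2.3"))  # True: inside the sample range
print(is_data_center_ip("8.8.8.8"))  # False: outside all listed ranges
```

As the article argues, a rule this blunt cannot distinguish a scraper on a cloud VM from an employee browsing through a corporate web gateway hosted in the same address block — both simply "come from a data center."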
It’s certainly a challenge to detect and block the most sophisticated bots, which can mimic human behavior and commit ‘low and slow’ attacks. While many bad bots do leverage data centers such as those from AWS, a significant fraction of them also use smartphones and PCs that have been infected with malware, or compromised browsers, browser add-ons/plugins, and apps. Moreover, the dynamic nature of IP address allocation means that simply blocking data center IPs is unwise: an IP address that was allocated to a bot could be allocated to a genuine user a few minutes later.
Blocking data center traffic from AWS, Azure, and similar providers only ends up blocking many real visitors. A far better approach to preventing bot traffic is to deploy a real-time bot mitigation solution backed by Machine Learning technologies that detects the intent of visitors and takes action accordingly. Web users are frustrated enough with having to solve CAPTCHAs and similar Turing tests just to prove that they’re not bots. As bot detection technology advances, the need to prove one’s humanity to access a website will hopefully become a relic of the past.