Web scraping, also known as web data extraction, is a (usually) automated technique for extracting information from a website. In this context, ‘information’ means content such as images, text, descriptions, reviews, prices, and any other desired content that can be used by a competitor or unscrupulous party to gain a business advantage. The Open Web Application Security Project (OWASP) lists scraping under reference number OAT-011 and defines it as “Collect application content and/or other data for use elsewhere.”
Web scraping and web crawling are very closely related. Web crawling, also known as web indexing, scans and collates information on the web using a bot or a web crawler ─ key examples being search engines such as Google and Bing. In contrast, web scraping concentrates on converting web pages into structured data, which can be analyzed and even reproduced elsewhere.
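To illustrate what ‘converting web pages into structured data’ looks like in practice, here is a minimal Python sketch using only the standard library. The `PriceParser` class and the sample markup are hypothetical; real scrapers fetch live pages and typically use more robust parsing libraries.

```python
# A minimal sketch of what a scraper does: turn HTML markup into
# structured records. The markup and class names here are hypothetical.
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects the text inside elements marked class="price"."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag that the next text node belongs to a price element.
        if ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

sample_html = (
    '<ul><li><span class="price">$19.99</span></li>'
    '<li><span class="price">$4.50</span></li></ul>'
)
parser = PriceParser()
parser.feed(sample_html)
print(parser.prices)  # unstructured markup reduced to a list of prices
```

A scraper run at scale does exactly this across thousands of pages, which is why time-sensitive data such as pricing is a frequent target.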
Content scraping has grown into a big industry and has created a range of problems for businesses in sectors as diverse as Education, Entertainment, Finance, Healthcare, E-commerce, Media and Publishing, Online Marketplaces, Technology, and Social Networking. Even governments are targeted by scrapers. Companies now offer ‘Scraping as a Service’, through which your competitors can scrape your entire site content in a single go, as well as scrape your new and time-sensitive content such as prices on an hourly basis.
The Impact of Web Scraping on Your Business
- Scraping of unique content: You’ve invested considerably in professional content design and development to showcase your brand and attract an audience. If a competitor scrapes your valuable content soon after you publish it and reproduces it on their own website, it negates the uniqueness of your content, puts your competitive edge at stake, and diminishes your brand value.
- Revenue loss: When your competitive advantage is impacted by third-party scrapers and competitor bots, it’s quite likely that your customer base will shrink over time. Further, if you monetize your website with advertising, the drop in traffic will significantly impact ad revenue. Eventually, advertisers that partner with you may lower their bids or consider other publishers to place ads with.
- Poor user experience: Bot traffic performing content and price scraping places a heavy load on your server infrastructure, slowing page loads and user access to the APIs that handle inventory availability checks, user authentication, location mapping, shopping carts, and payment processing. Moreover, scraper bots fill shopping carts and abandon them, rendering products unavailable to genuine users.
- Form spam and fake leads: Bots are capable of filling web forms with fake data, and it can be difficult to differentiate between actual leads and spam leads.
- Drop in SEO rankings: Your content is your business’s intellectual property, and when it is scraped or misused, it harms your SEO efforts and search engine visibility. Because Google prioritizes original content, scraped content can downgrade your search engine rankings, and the scraper using your content can end up ranking higher than your business in search results.
- Distorted analytics: Your marketing and web teams rely on accurate analytics data such as page views, bounce rates, user demographics, and much more. Scraper bot traffic skews your analytics data and prevents you from being able to properly measure and forecast trends, ultimately hindering decision making.
How You Can Protect Your Web Content
It’s not surprising that producers of original content are systematically targeted by scrapers. The exponential increase in online content and the growing demand for quality content have highlighted the scale of today’s scraping problem. Owners of proprietary content are stepping up their efforts to safeguard their content and protect their competitive edge. While your business is hard at work developing timely and valuable content for your users, scrapers are busy copying and reproducing that content using bad bots. With an investment of just a few hundred dollars, scrapers can set up proxies and software to start scraping your website, with serious negative impacts on your business.
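To make the defensive side concrete, here is a minimal sketch of one common in-house countermeasure: per-IP sliding-window rate limiting. The `RateLimiter` class, its thresholds, and the IP address are illustrative assumptions rather than a recommended configuration, and scrapers that rotate through proxy pools can evade this kind of check.

```python
# A sketch of a basic in-house defense: reject clients that exceed a
# request budget within a sliding time window. Thresholds are arbitrary
# examples; rotating proxies defeat naive per-IP limits.
import time
from collections import defaultdict, deque

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> recent request timestamps

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Discard timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.max_requests:
            return False  # over budget: throttle, block, or challenge
        q.append(now)
        return True

limiter = RateLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3)]
print(results)  # the fourth request inside the window is rejected
```

Simple checks like this catch only unsophisticated bots, which is part of why the dedicated bot management solutions discussed later in this article exist.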
While laws exist to penalize scraping and the unauthorized use and theft of content, they are difficult to enforce given the distributed nature of the Web and the difficulty of identifying the parties responsible for scraping in the first place. After the Stop Online Piracy Act (SOPA) failed to pass in the United States, the earlier Digital Millennium Copyright Act (DMCA) remained the main legal recourse, but violations are hard to pursue. The DMCA was designed to check theft of copyrighted content and allows content owners to file a complaint if their content is stolen or misused. However, different search engines have different DMCA forms, and it is a laborious, time-consuming task to track each link and complaint. In addition, jurisdictional limits ─ and the anonymity that the Web provides ─ make it hard to pursue scrapers and bring them to book for their offences. In a recent high-profile scraping lawsuit between the professional networking site LinkedIn and hiQ, an employer analytics company, an appeals court ruled that scraping a public website without its owner’s permission does not violate the Computer Fraud and Abuse Act, as LinkedIn users had intentionally made their profile information public. The ruling does not apply to LinkedIn profiles that are not public.
Though there are services that can alert you when your content has been plagiarized, they can only tell you what has already been scraped from your site. Finding the responsible parties and taking legal measures against them is mostly futile, as scrapers can simply create another website and continue their activities. In-house bot management solutions have proven ineffective against sophisticated bots that are becoming ever more adept at mimicking human users. As the threat landscape for malicious bots expands, so does the need for a robust anti-bot solution that provides continuous protection against malicious bots without affecting the user experience. This is why Forrester Research, recognizing the growing need for dedicated bot management, stated in its ‘New Wave™: Bot Management, Q3 2018’ report that organizations should evaluate and adopt specialized solutions that can “…determine the intent of automated traffic in real time to distinguish between good bots and bad bots.”