Web scraping, also known as web data extraction, is a (usually) automated technique for extracting information from a website. In this context, ‘information’ means content such as images, text, descriptions, reviews, prices, and any other content that competitors or unscrupulous parties can use to gain a business advantage. The Open Web Application Security Project (OWASP) lists scraping under reference number OAT-011 and defines it as “Collect application content and/or other data for use elsewhere.”
Web scraping and web crawling are very closely related. Web crawling, also known as web indexing, scans and collates information on the web using a bot or a web crawler ─ key examples being search engines such as Google and Bing. In contrast, web scraping concentrates on converting web pages into structured data, which can be analyzed and even reproduced elsewhere.
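The conversion of a web page into structured data can be sketched with nothing more than the standard library. In practice a scraper would fetch the page over HTTP; here a hard-coded HTML string stands in for the response, and the CSS class names (“name”, “price”) are illustrative assumptions rather than any real site’s markup.

```python
# Minimal scraping sketch: turn an HTML listing into structured data.
# The page string below is a stand-in for a fetched HTTP response.
from html.parser import HTMLParser


class PriceScraper(HTMLParser):
    """Collect (product name, price) pairs from a simple listing page."""

    def __init__(self):
        super().__init__()
        self._field = None      # field we are currently inside, if any
        self._current = {}      # partially built record
        self.items = []         # extracted (name, price) tuples

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "name" in classes:
            self._field = "name"
        elif "price" in classes:
            self._field = "price"

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "name" in self._current and "price" in self._current:
                self.items.append(
                    (self._current["name"], self._current["price"])
                )
                self._current = {}


page = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""

scraper = PriceScraper()
scraper.feed(page)
print(scraper.items)  # [('Widget', '$9.99'), ('Gadget', '$24.50')]
```

Once the data is in this tabular form, it can be stored, analyzed, or republished elsewhere, which is exactly what makes scraping attractive to bad actors.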
Content scraping has grown into a big industry and created a range of problems for businesses in sectors as diverse as education, entertainment, finance, healthcare, e-commerce, media and publishing, online marketplaces, technology, and social networking. Even governments are targeted by scrapers. Companies now offer ‘Scraping as a Service’, through which your competitors can scrape your entire site content in one go, as well as harvest new and time-sensitive content such as prices on an hourly basis.
The Impact of Web Scraping on Your Business
It’s not surprising that producers of original content are systematically targeted by scrapers. The exponential increase in online content and the growing demand for quality content have brought the scale of today’s scraping problem into sharp focus, and owners of proprietary content are stepping up their efforts to safeguard it and hone their competitive edge. While your business is hard at work developing timely and valuable content for your users, scrapers are busy copying and reproducing that content using bad bots. With an investment of just a few hundred dollars, scrapers can set up proxies and software and start scraping your website, with a range of negative impacts on your business.

How You Can Protect Your Web Content
While laws exist to penalize scraping and the unauthorized use and theft of content, they are difficult to enforce given the distributed nature of the Web and the difficulty of identifying the parties responsible for scraping in the first place. In the United States, the Stop Online Piracy Act (SOPA) failed to pass, leaving the Digital Millennium Copyright Act (DMCA) of 1998 as the main legal recourse, but its provisions are hard to enforce. The DMCA was designed to check theft of copyrighted content and allows content owners to file a complaint if their content is stolen or misused. The problem, however, is that different search engines have different DMCA forms, and tracking each link and complaint is a laborious, time-consuming task. In addition, jurisdictional limits ─ and the anonymity that the Web provides ─ make it hard to pursue scrapers and bring them to book for their offences. In a recent high-profile scraping lawsuit filed by the professional networking site LinkedIn against hiQ Labs, an employer analytics company, an appeals court ruled that scraping a public website without its owner’s permission does not violate the Computer Fraud and Abuse Act, as LinkedIn users had intentionally made their profile information public. The ruling does not apply to LinkedIn profiles that are not public.
Though there are services that can alert you when your content has been plagiarized, they only help you assess what has already been scraped from your site. Finding the responsible parties and taking legal action against them is mostly futile, as scrapers can simply create another website and continue their nefarious activities. In-house bot management solutions have proven ineffective against sophisticated bots that are becoming ever more adept at mimicking human users. As the threat landscape for malicious bots expands, so does the need for a robust anti-bot solution that provides continuous protection without affecting the user experience. This is why Forrester Research, recognizing the growing necessity for dedicated bot management solutions, stated in its ‘New Wave™: Bot Management, Q3 2018’ report that organizations should evaluate and adopt specialized solutions that can “…determine the intent of automated traffic in real time to distinguish between good bots and bad bots.”
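To see why in-house defenses fall short, consider the kind of naive check they typically rely on: a fixed-window rate limit keyed on client IP. The sketch below is illustrative only ─ the threshold, window, and function names are assumptions, not any product’s API ─ and a sophisticated bot evades it simply by rotating proxy IPs and pacing its requests like a human.

```python
# Naive in-house bot heuristic: flag any IP that exceeds a fixed
# per-window request budget. Thresholds are illustrative assumptions.
import time
from collections import defaultdict

WINDOW_SECONDS = 60
MAX_REQUESTS = 100          # assumed per-IP budget per window

_hits = defaultdict(list)   # ip -> timestamps of recent requests


def looks_like_bot(ip, now=None):
    """Return True if this IP exceeded the per-window request budget."""
    now = time.time() if now is None else now
    recent = [t for t in _hits[ip] if now - t < WINDOW_SECONDS]
    recent.append(now)
    _hits[ip] = recent
    return len(recent) > MAX_REQUESTS


# A scraper hammering from one IP trips the limit within seconds...
for i in range(150):
    flagged = looks_like_bot("203.0.113.7", now=float(i) * 0.1)
print(flagged)  # True

# ...but the same scraper spread across a proxy pool never does,
# which is why dedicated solutions analyze intent, not just volume.
```

Dedicated bot management products instead fingerprint behavior across many signals (mouse movement, request patterns, device characteristics) to infer intent, which is the capability the Forrester report highlights.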