An Overview of Web Scraping Techniques

Scraping is the act of extracting data or information from websites, with or without the consent of the website owner. Scraping can be done manually, but in most cases it is automated because that is far more efficient. Content and price scraping is most often carried out with malicious intent, and scrapers rely on a number of techniques, outlined below.

Manual Scraping

Copy-pasting

Manual scraping involves copying and pasting web content by hand, which is slow and highly repetitive. It can still be effective when a website's defenses are tuned to detect only automated scraping bots. In practice, however, manual scraping is rare, because automated scraping is far quicker and cheaper to carry out.

Automated Scraping

HTML Parsing: HTML parsing can be done with JavaScript or with libraries in most other programming languages, and targets linear or nested HTML pages. This fast and robust method is used for text extraction, link extraction (such as nested links or email addresses), screen scraping, resource extraction, and so on.
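
As a rough illustration of HTML parsing and link extraction, the sketch below uses only the Python standard library (urllib and html.parser); the URL is a placeholder, not a real target.

# HTML-parsing sketch: download a page and extract every link from it,
# using only the Python standard library.
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = urlopen("https://example.com").read().decode("utf-8", errors="replace")
parser = LinkExtractor()
parser.feed(html)
print(parser.links)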

DOM Parsing: The Document Object Model, or DOM, defines the structure and content of XML and HTML documents as a tree of nodes. DOM parsers are generally used by scrapers that want an in-depth view of the structure of a web page. Scrapers can use a DOM parser to get the nodes containing the information they want, and then use a tool such as XPath to scrape web pages. Full-fledged web browsers like Internet Explorer or Firefox can be embedded to extract an entire web page or just parts of it, even when the content is generated dynamically.
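
As a minimal sketch of DOM parsing, the example below uses Python's standard xml.dom.minidom module on a made-up XML fragment rather than a real page.

# DOM-parsing sketch: load markup into a DOM tree and read specific nodes.
from xml.dom.minidom import parseString

markup = """<catalog>
  <product><name>Widget</name><price>9.99</price></product>
  <product><name>Gadget</name><price>19.99</price></product>
</catalog>"""

dom = parseString(markup)
for product in dom.getElementsByTagName("product"):
    name = product.getElementsByTagName("name")[0].firstChild.data
    price = product.getElementsByTagName("price")[0].firstChild.data
    print(name, price)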

Vertical Aggregation: Vertical aggregation platforms are built by companies with access to large-scale computing power in order to target specific verticals, and some run these data-harvesting platforms in the cloud. These platforms create and monitor bots for specific verticals with virtually no human intervention. Because the bots are generated automatically from a knowledge base for each vertical, their effectiveness is measured by the quality of the data they extract.

XPath: XML Path Language, or XPath, is a query language that works on XML documents. Since XML documents have a tree-like structure, XPath can be used to navigate the tree by selecting nodes based on a variety of parameters. XPath can be used in conjunction with DOM parsing to extract an entire web page and republish it on a destination website.
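
As a sketch of XPath used together with DOM parsing (this assumes the third-party lxml library is installed and uses a placeholder URL):

# XPath sketch: parse a page into a tree and select nodes with XPath queries.
# Assumes the third-party lxml library (pip install lxml).
import lxml.html
from urllib.request import urlopen

html = urlopen("https://example.com").read()
tree = lxml.html.fromstring(html)

headings = tree.xpath("//h2/text()")   # text of every second-level heading
links = tree.xpath("//a/@href")        # href of every link
print(headings)
print(links)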

Google Sheets: Google Sheets can be used as a scraping tool, and it is quite popular among scrapers. From within Sheets, a scraper can use the IMPORTXML(url, xpath_query) function to pull data from websites. This is useful when the scraper wants specific data or patterns extracted from a page. Website owners can also use this function to check whether their own sites are scrape-proof.
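
For example, entering =IMPORTXML("https://example.com", "//h2") in a cell would pull every second-level heading from that page; the URL and XPath expression here are only placeholders.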

Text Pattern Matching: This is a regular-expression matching technique, typically based on the UNIX grep command or on the regular-expression facilities of popular programming languages such as Perl or Python.
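
A minimal sketch of text pattern matching with Python's re module, using a deliberately simplified email pattern on sample text:

# Pattern-matching sketch: extract email-like strings from raw page text.
import re

page_text = "Contact sales@example.com or support@example.org for details."
email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
print(email_pattern.findall(page_text))
# ['sales@example.com', 'support@example.org']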

There are several web scraping tools and services available online, and scrapers need not know all of the above techniques unless they want to do the scraping themselves. Highly automated tools such as cURL, Wget, HTTrack, Import.io, and Node.js-based frameworks are widely used, and scrapers also rely on headless browsers such as PhantomJS, SlimerJS, and CasperJS.
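
As an illustration of the headless-browser approach, here is a minimal sketch that assumes the Selenium package and a Chrome/ChromeDriver installation are available; the URL is a placeholder:

# Headless-browser sketch: render a page (including JavaScript-generated
# content) and capture the resulting HTML. Assumes Selenium + ChromeDriver.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run the browser without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    rendered_html = driver.page_source  # HTML after scripts have executed
    print(rendered_html[:500])
finally:
    driver.quit()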

How to Prevent Web Scraping

Scraping with the intent of stealing data is illegal and unethical. Owners of online businesses should take every possible step to continuously defend their websites and apps against scraper bots, in order to protect their brand and retain their competitive edge. The sophisticated, human-like bots prevalent today give scrapers many options to scrape web content without detection. This is why a dedicated bot management solution is essential to deter scraping and even more serious threats such as account takeover and application DDoS on websites and apps. For a more in-depth look at scraping attacks, read our e-book ‘Deconstructing Large Scale Distributed Scraping Attacks’.


