
What Do Scraping, Crawling and Indexing Mean?

It’s now known that bots account for over 50% of your website traffic. Bots are automated programs that perform repetitive tasks on your website. Some of them have good intent, while most are malicious in nature. In the search engine world, the terms crawling and indexing are often used interchangeably, and both are generally viewed in a positive light. However, it’s important to understand what each term means so that you, as a business owner, can be wary of suspicious activity on your website.

Crawling

When a search engine bot traverses your website by following its links, this is termed crawling. Think of crawling as the trail followed by the search engine bot, or spider. If you have a sitemap, you make it easier for search engine crawlers to find all the links on your website. You can restrict this activity by editing the robots.txt file: allow crawlers to visit only certain sections of your website, and add disallow rules to block them from visiting specific URLs. Search engine crawlers will abide by the rules defined in robots.txt.
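As a rough illustration of the allow and disallow rules described above, here is a minimal Python sketch that uses the standard library's `urllib.robotparser` to check URLs the way a well-behaved crawler might. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the blog is open to crawlers, /private/ is not.
ROBOTS_TXT = """\
User-agent: *
Allow: /blog/
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed crawler name, not a real search engine bot.
blog_allowed = parser.can_fetch("ExampleBot", "https://example.com/blog/post-1")
private_allowed = parser.can_fetch("ExampleBot", "https://example.com/private/report")

print("blog page allowed:", blog_allowed)
print("private page allowed:", private_allowed)
```

A compliant crawler would run a check like this before requesting each URL and skip anything the rules disallow.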

Indexing

Once crawling is done, the links and the content they contain are indexed by search engines like Google, so that they can be retrieved when someone runs a web search. Here too, search engines like Google or Bing will adhere to the rules defined by the webmaster. For example, you can set nofollow or noindex directives to instruct search engines how to treat your web pages when they crawl and index them. Setting noindex instructs the search engine not to index the page for web search, and nofollow instructs it not to let a link influence the ranking of the target website that is linked to in the post. By and large, almost all content management systems allow indexing by default, unless the webmaster explicitly specifies otherwise.
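To make the two directives described above more concrete: in HTML they are commonly written as `noindex` inside a robots meta tag and as `rel="nofollow"` on an individual link. The sketch below, which uses Python's built-in `html.parser`, shows how an indexer might read them. The class name `RobotsDirectiveParser` and the sample page are assumptions made for this example, not part of any search engine's actual code.

```python
from html.parser import HTMLParser


class RobotsDirectiveParser(HTMLParser):
    """Collects a page-level noindex flag and any links marked nofollow."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow_links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            content = attr.get("content") or ""
            directives = {d.strip().lower() for d in content.split(",")}
            if "noindex" in directives:
                self.noindex = True
        elif tag == "a":
            # rel is a space-separated list of link relationship tokens.
            rel_tokens = (attr.get("rel") or "").lower().split()
            if "nofollow" in rel_tokens and attr.get("href"):
                self.nofollow_links.append(attr["href"])


# Hypothetical page used only for illustration.
PAGE = """
<html><head>
<meta name="robots" content="noindex">
</head><body>
<a href="https://example.com/partner" rel="nofollow">Partner</a>
<a href="/about">About</a>
</body></html>
"""

directive_parser = RobotsDirectiveParser()
directive_parser.feed(PAGE)

print("skip indexing this page:", directive_parser.noindex)
print("links that should not pass ranking:", directive_parser.nofollow_links)
```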

Scraping

When automated programs are used to extract data from web pages, it is termed scraping. As part of crawling and indexing, search engines like Google or Bing will scrape your pages, but with the good intention of creating visibility for your website or content, and the source is preserved. For example, when you search for ‘best running shoes’, Google lists the search results. You may also see a snippet that summarizes the text Google believes to be the best answer to the search query. This is done by programmatically extracting the content from the web page and presenting it with the source URL and title. Obviously, this is good for the website that created the content.
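The snippet idea above can be sketched in a few lines of Python. This is only a simplified illustration of programmatic extraction, not how Google actually builds snippets: it takes the page title and the first paragraph and keeps the source URL alongside them. The names `SnippetParser` and `build_snippet`, the sample page, and the example.com URL are all hypothetical.

```python
from html.parser import HTMLParser


class SnippetParser(HTMLParser):
    """Captures the <title> text and the text of the first <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.first_paragraph = ""
        self._current = None          # which element we are collecting text for
        self._paragraph_done = False  # True once the first <p> has closed

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p" and not self._paragraph_done:
            self._current = "p"

    def handle_endtag(self, tag):
        if tag == "p" and self._current == "p":
            self._paragraph_done = True
        if tag in ("title", "p"):
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            self.first_paragraph += data


def build_snippet(html: str, source_url: str, max_chars: int = 160) -> dict:
    """Return a title, a short summary, and the URL the text came from."""
    parser = SnippetParser()
    parser.feed(html)
    summary = " ".join(parser.first_paragraph.split())  # collapse whitespace
    return {
        "title": parser.title.strip(),
        "url": source_url,
        "summary": summary[:max_chars],
    }


# Hypothetical page used only for illustration.
PAGE = """
<html><head><title>Best Running Shoes</title></head>
<body>
<p>Look for a shoe that matches
your stride and weekly mileage.</p>
<p>Second paragraph that is not part of the snippet.</p>
</body></html>
"""

snippet = build_snippet(PAGE, "https://example.com/best-running-shoes")
print(snippet)
```

Keeping the URL and title next to the extracted text is what preserves the source, which is the difference between this kind of extraction and the content theft discussed next.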


On the other hand, a malicious scraper will send bots to steal original content from the website; content like news, product reviews, blog posts, opinion pieces, product prices, classifieds listings, and so on. Unlike search engine bots, scraper bots do not comply with the rules set in the robots.txt file. The goal is to steal data and publish it elsewhere, or sell it to the competition. This type of unscrupulous scraping impacts online businesses in several ways, like:

Destroying the competitive edge

Decreasing SEO rankings

Increasing server loads

Increasing bandwidth costs

Impacting user experience

Loss of revenue

Search engines weigh websites based on page rank, so if scraping is hurting your website’s ranking, you should take the right action to block malicious bots.


