How to Identify and Prevent Web Scraping Bots?

What are Web Scraping Bots?

Web scraping bots are computer programs that automatically visit a website, copy information such as content and prices from some or all of its web pages, and pass that data on to their operators. The data they collect may be used to obtain difficult-to-create content without much effort, or to give the scrapers a competitive advantage. Besides, these ‘bad’ bots drain the server and connection resources of the websites they visit.

What are some of the techniques used to identify and prevent Web Scraping Bots? 

  • It is possible to analyze the IP addresses requesting access to a web page to find out whether they belong to genuine readers or bots. Factors like location, ISP, connection type, whether the address is a proxy or Tor exit node, whether it belongs to a known violator network, etc. can help identify if the requester is a user or a bot. Access from such IP addresses can be blocked or rate-limited (a minimal rate-limiting sketch follows this list).
  • The user behavior on a web page (mouse movements, scrolling, time between requests, navigation paths, etc.) can be analyzed and compared with known human behavior patterns to identify bots.
  • Some bot detection services have built collective intelligence from bot behavior observed across many websites, which they use to distinguish bots from humans. Some also use machine learning technologies.
  • It is possible to analyze (fingerprint) the device, software, browser and connection profiles of the entity accessing the web page to determine if the request is from a human or a bot (a crude header-based version is sketched after this list).
  • Some web pages (generally hidden from real users) can poison the data gathered by bots by deliberately serving them false information.
  • It is possible to employ technologies like CAPTCHA to verify that the user is human, especially in critical areas of the site like form filling, login, etc., but this might hinder user experience (a server-side verification sketch follows this list).
  • It is difficult for scraping software imitating browsers to store cookies or execute complex JavaScript code. This can be used to identify bots (see the cookie-challenge sketch after this list).
  • Text or data can be rendered as images or generated by complex scripts, which are generally ignored by scrapers.
  • Making minor alterations to HTML tags might break bots that repeatedly scrape data from a particular website.
  • Website analytics programs like Google Analytics can be used to monitor and identify suspicious bot activities.
  • Luring scraping bots into honeypots, i.e. dummy articles & links that are generally hidden from regular users, and then analyzing the IP addresses that access them might help identify bots (a honeypot sketch follows this list).
  • Critical web pages can be hidden behind a login or paywall, but this might hinder user experience.
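
To make a few of these techniques concrete, the sketches below illustrate them in Python. First, per-IP rate limiting: this is a minimal sketch assuming a Flask application and a single server process; a production setup would keep the counters in a shared store such as Redis and would resolve the real client IP behind any proxies.

```python
# Minimal per-IP rate limiter for a Flask app (illustrative sketch).
# State lives in process memory, so this assumes one server process.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

WINDOW_SECONDS = 60          # look at the last 60 seconds of traffic
MAX_REQUESTS = 120           # allow at most 120 requests per IP per window
_hits = defaultdict(deque)   # IP address -> timestamps of recent requests

@app.before_request
def rate_limit():
    # Note: remote_addr is the proxy's address if you sit behind one.
    ip = request.remote_addr
    now = time.time()
    window = _hits[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()     # discard timestamps outside the window
    window.append(now)
    if len(window) > MAX_REQUESTS:
        abort(429)           # 429 Too Many Requests

@app.route("/")
def index():
    return "Hello, reader!"
```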
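
Next, a deliberately crude form of fingerprinting based only on HTTP headers. Commercial fingerprinting inspects far more (device, TLS and connection characteristics); the header heuristics and tool names below are assumptions chosen for illustration.

```python
# Header-based bot heuristic (illustrative only). Real browsers send a
# fairly consistent set of headers; naive scrapers often omit them or
# announce themselves in the User-Agent string.
SUSPICIOUS_AGENTS = ("curl", "wget", "python-requests", "scrapy")

def looks_like_bot(headers) -> bool:
    """headers: a mapping such as Flask's request.headers."""
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(token in ua for token in SUSPICIOUS_AGENTS):
        return True
    # A browser User-Agent without typical companion headers is suspect.
    if "mozilla" in ua and ("Accept-Language" not in headers
                            or "Accept-Encoding" not in headers):
        return True
    return False
```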
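
For the CAPTCHA approach, the usual pattern is to verify the token submitted by the user's browser on the server side. The sketch below checks a Google reCAPTCHA v2 token against Google's documented siteverify endpoint; the secret key and function name are placeholders.

```python
# Server-side reCAPTCHA v2 verification (sketch).
import requests

RECAPTCHA_SECRET = "your-secret-key"  # placeholder: your own secret key

def captcha_passed(token: str, client_ip: str) -> bool:
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET,
              "response": token,      # token posted by the browser widget
              "remoteip": client_ip},
        timeout=5,
    )
    return resp.json().get("success", False)
```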
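
The cookie/JavaScript point can be shown with a simple challenge page: the first response sets a cookie via script and reloads, so clients without a JavaScript engine or a cookie jar never reach the content. The cookie name and route are invented for this example; a real challenge would use an unguessable, signed token rather than a fixed value.

```python
# JavaScript cookie challenge (sketch).
from flask import Flask, request

app = Flask(__name__)

CHALLENGE_PAGE = """
<html><body><script>
  document.cookie = "js_ok=1; path=/";  /* requires a JS engine */
  location.reload();                    /* retry with the cookie set */
</script></body></html>
"""

@app.route("/article")
def article():
    if request.cookies.get("js_ok") != "1":
        # First-time browsers pass after running the script;
        # scrapers that cannot execute JavaScript stay stuck here.
        return CHALLENGE_PAGE
    return "The real article content."
```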
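
Finally, a honeypot: a link hidden from human readers but present in the HTML, so scrapers that follow every link reveal themselves. The route names are illustrative; you would also disallow the trap URL in robots.txt so that well-behaved crawlers never fetch it.

```python
# Honeypot link trap (sketch).
from flask import Flask, request

app = Flask(__name__)
trapped_ips = set()   # candidates for blocking or rate-limiting

# Invisible to humans (display:none) but present in the page source.
HIDDEN_LINK = '<a href="/trap-article" style="display:none">offer</a>'

@app.route("/")
def home():
    return "<html><body>Welcome!" + HIDDEN_LINK + "</body></html>"

@app.route("/trap-article")
def trap():
    trapped_ips.add(request.remote_addr)  # flag this visitor as a bot
    return "Nothing to see here."
```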

Limitations of Bot Prevention Technologies/Services

  • It is essential to allow some bots (for example, search engine bots) to crawl websites. So false positives, i.e. accidentally thwarting such ‘good’ bots, might create issues such as loss of search engine visibility (a sketch for verifying genuine Googlebot requests follows this list).
  • These technologies might consume more resources or increase page access time for readers.
  • There are sophisticated bots that mimic human behavior and can even solve CAPTCHAs or create new usernames/passwords. It is difficult to track and prevent them.
  • Web Application Firewalls are generally deployed for other purposes, but some of them might have capabilities to identify and prevent bots as well. If not, bot prevention requires a stand-alone (on-premise or cloud-based) solution that involves additional cost, time & resources.
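
On the first point, major search engines document ways to verify their crawlers so that ‘good’ bots are not blocked by mistake. Google, for instance, recommends a reverse DNS lookup (the hostname should end in googlebot.com or google.com) followed by a forward lookup that must return the original IP. A small sketch:

```python
# Verify that an IP claiming to be Googlebot really is one (sketch).
import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]          # reverse DNS
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```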

References:

https://www.shieldsquare.com/bot-prevention-technology/

https://www.shieldsquare.com/web/wp-content/uploads/Whitepaper-Web-Scraping.pdf

https://www.infisecure.com/infisecure-technology.html

https://www.imperva.com/docs/gated/WP_Detecting_and_Blocking_Site_Scraping_Attacks.pdf

https://blog.hartleybrody.com/prevent-scrapers/
