
Anti-Scraping Part 2: Implementing Protections

Continuing our series on Anti-Scraping techniques, this blog covers the implementation of Anti-Scraping protections in a fake message board and examines how scrapers can adapt to those changes.

While completely preventing scraping is likely impossible, implementing a defense in depth strategy can significantly increase the time and effort required to scrape an application. Give the first blog in this series a read, Anti-Scraping Part 1: Core Principles to Deter Scraping, and then continue on with how implementing these core principles affects scrapers’ ability to exfiltrate data.

Hardening the Application

Multiple changes have been made to the Fake Message Board site to try to prevent scraping. The first major change is that rate limiting is now enforced throughout the application. The second is that data limits are enforced: previously, the search bar’s “limit” parameter had no maximum enforced value, which allowed every user of the application to be scraped in only a few requests; now the “limit” parameter has a maximum enforced value of 100. Finally, all error messages have been converted to generic messages that do not leak information.

Not all the recommended changes have been applied to the application. Account creation has still not been locked down. Additionally, the /search endpoint still does not require authentication. As illustrated in “The Scraper’s Response” section, these gaps undermine most of the “fixes” made to the application. 

Implementing Rate Limiting 

There are many design decisions that need to be made when implementing rate limits. Here are some of the considerations: 

  1. Should rate limits be specific to endpoints or constant across the application? 
  2. How many requests should be allowed per unit of time? (ex: 100 requests/second) 
  3. What are the consequences of reaching a rate limit?

Additionally, there is a distinction between logged-in and logged-out rate limits. Logged-in rate limits are applied to user sessions. When scrapers create fake accounts and use them for scraping, logged-in rate limits can suspend or block those accounts; if creating fake accounts is difficult, this can significantly hinder a scraper’s progress. Logged-out rate limits are applied to endpoints where having an account isn’t required. From a defensive perspective, there are limited signals to key on when a client is logged out. The most common signal is the IP address, but there are weaker signals as well, such as the user agent, cookies, other HTTP headers, and even TCP fields. In the next sections, we’ll review how the Fake Message Board application implemented both logged-in and logged-out rate limits.

Logged-in Rate Limits

In this case the logged-in rate limits are applied uniformly across the application. This probably isn’t the best decision for most organizations, but it is the easiest to implement. Each user is allowed to send 1,000 requests/minute, 5,000 requests/hour, and 10,000 requests/day. If a user violates any of those rate limits, they receive a strike; after 3 strikes, they are blocked from the platform. An unforgiving strategy like this is probably too strict for most applications.
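
To make that strike mechanism concrete, below is a rough sketch of how logged-in limits like these could be tracked. It assumes an in-memory store and sliding-window counts per user; the names (record_request, request_log, etc.) are illustrative and not taken from the application itself.

Pseudo Rate Limiter Code:

import time
from collections import defaultdict

# Sliding-window limits per user: (window in seconds, max requests).
LIMITS = [(60, 1_000), (3_600, 5_000), (86_400, 10_000)]
MAX_STRIKES = 3

request_log = defaultdict(list)  # user_id -> timestamps of recent requests
strikes = defaultdict(int)       # user_id -> strike count
blocked = set()                  # user_ids blocked from the platform

def record_request(user_id):
    """Return True if the request is allowed, False if it should be rejected."""
    if user_id in blocked:
        return False
    now = time.time()
    request_log[user_id].append(now)
    # Drop timestamps older than the largest window to bound memory.
    request_log[user_id] = [t for t in request_log[user_id] if now - t < 86_400]
    for window, max_requests in LIMITS:
        if sum(1 for t in request_log[user_id] if now - t < window) > max_requests:
            strikes[user_id] += 1
            if strikes[user_id] >= MAX_STRIKES:
                blocked.add(user_id)  # three strikes and the account is blocked
            return False
    return True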

Determining the number of requests allowed per unit of time is a very difficult problem. The last thing we want is to impact legitimate users of the application. Analyzing traffic logs to see how many requests typical users send will be critical to this process. When reviewing the users sending the most traffic, it may be difficult to distinguish authentic traffic from users misusing the platform via automation. This will more than likely require manual investigation and extensive testing in pre-production environments.
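
One way to ground that decision in data is to look at the full distribution of per-user request rates rather than the average alone. Below is a rough sketch, assuming the access logs have already been parsed into (user_id, unix_timestamp) pairs; parsed_logs is a placeholder for that data.

from collections import Counter
import statistics

def per_user_requests_per_minute(log_entries):
    # Count requests per (user, minute) bucket across the parsed logs.
    counts = Counter((user_id, int(ts) // 60) for user_id, ts in log_entries)
    return sorted(counts.values())

rates = per_user_requests_per_minute(parsed_logs)  # parsed_logs: assumed to exist already
p50 = statistics.median(rates)
p99 = rates[int(len(rates) * 0.99)]
print(f"median: {p50} req/min, 99th percentile: {p99} req/min")

Setting limits comfortably above the 99th percentile of legitimate traffic keeps false positives rare while still catching automated volume.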

It is also worth noting that bugs in applications can result in requests being repeatedly sent on the user’s behalf without their knowledge. Punishing these users with rate limits would be a huge mistake. If a user is sending a large number of requests to the same endpoint and the data returned isn’t changing, then it seems more likely that this is a bug in the application as opposed to scraping behavior.
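
One hedged heuristic for telling these cases apart is to check whether the returned data is actually changing across the user’s repeated requests before handing out a strike. Below is a sketch; the response-hashing approach and the window size of 50 are assumptions for illustration, not how any particular application does it.

import hashlib
from collections import defaultdict

recent_hashes = defaultdict(list)  # (user_id, endpoint) -> hashes of recent responses

def looks_like_client_bug(user_id, endpoint, response_body, window=50):
    # Hash the response so it can be cheaply compared to previous responses.
    digest = hashlib.sha256(response_body.encode()).hexdigest()
    history = recent_hashes[(user_id, endpoint)]
    history.append(digest)
    del history[:-window]  # keep only the last `window` hashes
    # Many requests to one endpoint that keep returning identical data look more
    # like a retry loop or UI bug than a scraper walking through different records.
    return len(history) == window and len(set(history)) == 1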

Logged-out Rate Limits

In this case the logged-out rate limits are also applied across the application. Each IP address is allowed to send 1,000 requests/minute, 5,000 requests/hour, and 10,000 requests/day. If an IP address violates any of those rate limits, it receives a strike; after 3 strikes, the IP is blocked from the platform. This isn’t a realistic solution for multiple reasons, the primary one being that a single IP address can be shared by many users, such as on public WiFi or university networks. In those cases, the IPs would be blocked unfairly.

One of the main weaknesses of logged-out rate limits is that almost every signal can be controlled by the scraper. For example, the scraper can change their user agent with every request. If that signal is used in isolation, you’ll just see a small number of requests from a bunch of different user agents. Similarly, a scraper can rotate their IP address, which is devastating from a defensive perspective because the IP address is our most reliable signal, and there are multiple open source libraries that make rotating IP addresses trivial. Implementing effective logged-out rate limits will probably require combining several of these signals in more sophisticated ways.
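
One plausible way to combine those signals is to score them together instead of keying a limit on any single value, so rotating one signal alone isn’t enough to stay invisible. The signals and thresholds below are invented for illustration and would need tuning against real traffic.

from collections import defaultdict

user_agents_seen = defaultdict(set)  # ip -> distinct user agents observed
request_counts = defaultdict(int)    # ip -> requests in the current window

def suspicious_logged_out_request(ip, user_agent, has_cookies):
    user_agents_seen[ip].add(user_agent)
    request_counts[ip] += 1
    score = 0
    if len(user_agents_seen[ip]) > 5:  # one IP cycling through many user agents
        score += 1
    if not has_cookies:                # never presents cookies a real browser would keep
        score += 1
    if request_counts[ip] > 500:       # high raw volume from a single IP
        score += 1
    return score >= 2                  # two or more weak signals -> treat as suspect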

Enforcing Data Limits 

Ensuring that there’s a maximum limit to the amount of data that can be returned per response is a crucial piece in the fight against scraping. Previously, the search bar had a “limit” parameter which could return hundreds of thousands of users in one response. Now only up to 100 users can be returned per response. If more than 100 users are requested, then an error message is returned as shown below. 

HTTP Request:

POST /search?limit=101 HTTP/1.1
Host: 127.0.0.1:12345
[TRUNCATED] 

{"input":"t"}

HTTP Response:

HTTP/1.0 400 Bad Request 
[TRUNCATED] 

{"error":"error"}

Return Generic Error Messages 

Previously the forgot password and create account workflows leaked information useful to scrapers via error messages. The forgot password workflow used to reveal whether the provided username matched an existing account and would return the email address if it did. Now it always returns the same message, “If the account exists an email has been sent.” 

HTTP Request:

POST /forgotPassword HTTP/1.1
Host: 127.0.0.1:12345 
[TRUNCATED]

username=doesnotexist

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"message":"if the account exists an email has been sent"}

The create account workflow used to reveal whether a specific username, email address, or phone number was already taken. Now it always returns the message, “If the account was created an email has been sent.” 

HTTP Request:

POST /createAccount HTTP/1.1 
Host: 127.0.0.1:12345 
[TRUNCATED] 

username=vfoster166&email=test@test.com&phone=&password=[REDACTED]

HTTP Response:

HTTP/1.0 200 OK 
[TRUNCATED] 

{"response":"if the account was created an email has been sent"}

The Scraper’s Response 

Now that the application has been updated, let’s review each piece of functionality and figure out how we can bypass the protections:

  1. Recommended Posts Functionality: Requires authentication, and rate limits your account after 1,000 requests/minute. 
  2. Search Bar: Does not require authentication, returns at most 100 users/response, and rate limits your IP address after 1,000 requests/minute. 
  3. User Profile Pages: Requires authentication and rate limits your account after 1,000 requests/minute. 
  4. Account Creation: No longer useful for leaking data, but fake accounts can be made easily. 
  5. Forgot Password: No longer useful for leaking data.

Overall, the application has seen multiple improvements since the previous scrape in Part 1. The introduction of rate limits forces us to adapt our techniques. For logged-in scraping, the tactic will be to rotate sessions: since we can easily make many fake accounts, we will send a small number of requests from each account and, as a result, never get rate limited. For logged-out scraping, the tactic will be to rotate IP addresses: by only sending a few requests from each IP address, we will hopefully never get rate limited.

Logged-In Scraping 

The goal is to extract all 500,000 users using either the recommended posts functionality or the user profile pages. Since user profile pages return more data about the user (ID, name, username, email, birthday, phone number, posts, etc.), let’s focus on that endpoint. Assuming we only get one user per response, each of our fake accounts can send 999 requests/minute without being rate limited. If we make 500 fake accounts, then each account only has to send ~1,000 requests to scrape every user on the platform. This means every user can be scraped in only a couple of minutes without ever hitting a rate limit.

Pseudo Scraper Code:

user_id_counter = 1  # user IDs start at 1 and are sequential
fake_accounts = [session1, session2, ..., session500]  # requests.Session objects holding each fake account's cookies

while user_id_counter <= 500_000:  # go through all 500,000 users
    for session in fake_accounts:  # rotate sessions so no single account hits the rate limit
        user_data = session.get("http://127.0.0.1:12345/profile/" + str(user_id_counter))
        user_id_counter += 1
        if user_id_counter > 500_000:
            break

Logged-Out Scraping 

Extracting all 500,000 users using the search functionality is not as easy as it was last time. Now only 100 users are returned per response. Since each IP address can send 999 requests/minute without being rate limited, each IP can scrape 99,900 users without being blocked. This means only a few IP addresses are required. Most IP rotation tools leverage tens of thousands of IPs so getting a few is not an issue. 

Pseudo Scraper Code:

users_scraped = 0
while users_scraped < 500_000:  # keep going until every user has been collected
    resp = requests.post("http://127.0.0.1:12345/search?limit=100", json={"input": random_search_string}, proxies=rotating_proxies)  # rotating_proxies: some IP rotation service
    users_scraped += len(resp.json())  # each response returns up to 100 users

Summary

With multiple protections in place, scraping all 500,000 fake users of the application in a few minutes is still pretty easy. As a quick recap, our five core principles of Anti-Scraping are:

  1. Require Authentication 
  2. Enforce Rate Limits 
  3. Lock Down Account Creation 
  4. Enforce Data Limits 
  5. Return Generic Error Messages

As demonstrated in this blog, failing to implement one of these principles can completely undermine the protections implemented in the others. Although this application enforced rate limits and data limits and returned generic error messages, not requiring authentication on every endpoint and allowing fake accounts to be created easily meant the other protections barely hindered the scraper.

Additional Protections

There are several additional Anti-Scraping techniques that are worth bringing to your attention before wrapping up.  

  • When locking down account creation, implementing CAPTCHAs is a critical step in the right direction.  
  • Requiring the user to provide a phone number can be a gigantic annoyance to scrapers. If there’s a limit on the number of accounts per phone number, and virtual phone numbers are blocked, then it’s much more difficult to create fake accounts at scale. 
  • If user data is being returned in HTML, then adding randomness to the responses is another great deterrent. By varying the number and type of HTML tags around the user data without affecting the UI, the data becomes significantly harder to parse (see the sketch after this list). 
  • On the rate limiting front, make sure to apply rate limits to the user as opposed to the user’s session. If rate limits are applied to the session, then a scraper can just log out and log back in which grants them a new session and thus bypasses the rate limit.  
  • For logged-out rate limits, blocking certain IP ranges, such as Tor exit nodes or proxy/VPN providers, may make a lot of sense, especially if the majority of that traffic is inauthentic. Each organization that owns ranges of IP addresses is assigned one or more autonomous system numbers (ASNs), and there are several publicly available ASN-to-IP-range tools/APIs that can be very useful in these cases.
  • Lastly, user IDs should not be generated sequentially. A lot of scrapes first require a list of user IDs; if the IDs are hard to guess and there isn’t an easy way to collect a large list of them, this can also significantly slow down a scraper. 
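
As a quick sketch of the HTML-randomness idea above: wrap each user-facing value in a random number of neutral, randomly named tags so that fixed CSS selectors and XPath expressions break between responses while the rendered page stays the same. Nothing below comes from the actual application.

import random
import string

def obfuscate_value(value):
    # Wrap the value in 1-3 layers of randomly named tags that don't affect layout.
    html = value
    for _ in range(random.randint(1, 3)):
        tag = random.choice(["span", "div"])
        css_class = "".join(random.choices(string.ascii_lowercase, k=8))
        html = f'<{tag} class="{css_class}" style="display:contents">{html}</{tag}>'
    return html

print(obfuscate_value("vfoster166"))
# e.g. <div class="tqkzmwrx" style="display:contents"><span class="bfwnqyto" style="display:contents">vfoster166</span></div>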

Conclusion 

Hopefully some of the Anti-Scraping tactics and techniques shared in this series were useful to you. Although completely preventing scraping probably isn’t possible, implementing a defense in depth approach that follows our five core principles of Anti-Scraping will go a long way toward deterring scrapers and wasting their time.

Discover how the NetSPI BAS solution helps organizations validate the efficacy of existing security controls and understand their Security Posture and Readiness.
