Back

CAPTCHAs Done Right?

March 9, 2018 | Hans Petrich

Web App assessments are probably one of the most popular penetration tests performed today. These are so popular that public bug bounty sites such as Hacker One and Bug Crowd offer hundreds of programs for companies wanting to fix vulnerabilities such as XSS, SQL Injection, CSRF, etc. Many companies also host their own bounty programs for reporting web vulnerabilities to a security team. Follow us in our 4-part mini series of blog posts about web security:

CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are an anti-automation control that are becoming more and more important in protecting forms from automated submissions. However, just because you have a CAPTCHA on your form does not mean that you “did it right”. Let’s review some of the important parts about implementing a CAPTCHA:

Check #1: Can it be re-used?

Here’s an example we recently ran into:

Everything looks good right? The application would not allow a user to login without solving the CAPTCHA and the CAPTCHA was difficult for a computer to guess but relatively easy for a human to solve. Here’s a sample login POST body as captured in Burp’s Proxy:

You can see that the user’s username and password are being sent along with the CAPTCHA’s “hash” and the CAPTCHA’s solution. If the CAPTCHA’s “hash” is stored in the back-end database as a one-time-use entry that gets deleted when the user submits this login request, everything should be fine. In order to test this, all we need to do is repeat this request back to the server, if the server accepts the replayed CAPTCHA submission, then we know that we can reuse this CAPTCHA submission in order to brute-force guess a user’s login credentials.

(Burp Intruder was used to replay the login POST request while changing the submitted password. The user’s password was successfully guessed on the 86th try even though we were replaying the same CAPTCHA on each request.)

In the end, this particular login portal wasn’t protecting their users from anything other than a more simplified login process. Each user was inconvenienced by the need to solve the CAPTCHA, but an attacker could bypass the CAPTCHA by solving it once and then replaying it multiple times. Along with having a one-time-use database entry for the CAPTCHA, the application could tie the CAPTCHA’s id to the user’s current session cookie value and ensure that each request requiring a CAPTCHA to be solved performs a lookup on the user’s session to determine which CAPTCHA the user is supposed to solve. If the request is submitted without a session cookie, or the same CAPTCHA id is replayed, the request should automatically fail.

Check #2: Do I need to solve it?

Here’s another example of a failed CAPTCHA implementation on an account registration page:

What can we notice from this example? The new account registration page has implemented a CAPTCHA, but the form performs a check on the new account’s email address as soon as the attacker enters it into the form…no CAPTCHA solving required. An attacker can use this unprotected functionality to brute-force the discovery of email addresses that have already registered on the site. This particular application was using the email address as the username in their login requests (and was also using a weak password policy) so we were able to login to over a dozen separate accounts by trying a common password on our discovered email addresses.

Check #3: Is it difficult for a computer to solve?

Here are some sample CAPTCHAs taken from a recent assessment:

What sticks out about these CAPTCHAs? They seems to be using some sort of pre-defined template for letter placement (letters always in the same spot), always contain 8 characters, and only use letters of the english alphabet (26 possibilities). OCR programs do not do very well on the character analysis for these CAPTCHAs, but with a few simple transformations, here’s some results we were able to obtain:

OCR still didn’t perform very well on these transformed CAPTCHAs (less than 2% match rate using google’s TESSARACT library) but there is another possibility open to a determined attacker: Solve the letters in each position and store them in a database to use in solving future CAPTCHAs. A simple program can show the attacker the letter in a particular position and then store the individual letter/position combinations in a database to be matched against for all subsequent CAPTCHAs. This would require an attacker to manually input 26 x 8 = 208 individual letter/position combinations, but then the computer could use the attacker’s stored information to solve a majority of the CAPTCHAs possible on this particular application.

Solving 208 separate character positions didn’t take more than a few minutes (although writing the code to do it took a few hours), so the payoff to an attacker would only be worth it if the CAPTCHA’d request were of high value.

Overall, the script was able to solve these CAPTCHAs with 70% accuracy. Here are some example results:

The majority of attackers wouldn’t spend the time required to solve this type of CAPTCHA unless they were protecting something pretty sensitive, so of the three bad examples we’ve shown, this was the best because the application DID require the CAPTCHA to be solved on each submission and did NOT accept replayed CAPTCHA requests. This type of CAPTCHA provides a “good enough” protection by adding in an extra layer of complexity that would prevent automated submissions by bots.

Another Transformation Example

Even though the CAPTCHAs from our first example had a flaw which made it unnecessary to solve them by OCR, here’s an example of how the CAPTCHAs could be transformed to make them easier to identify in an automated attack:

Original Image

Transformed Image

Here’s the relevant python code used to transform the second set of CAPTCHAs:

def change_to_white(data, r, g, b):
    target_color = (red == r) & (blue == b) & (green == g)
    data[..., :-1][target_color.T] = (255, 255, 255)
    return data
im = Image.open(startimagepath)
im = im.convert('RGBA')

data = np.array(im)   # "data" is a height x width x 4 numpy array
red, green, blue, alpha = data.T # Temporarily unpack the bands for readability
colors = []
for x in range(0, len(red[0])): # Cycle through the numpy array and store individual color combinations
    for i in range(0, len(red)):
        color = (red[i][x], green[i][x], blue[i][x])
        colors.append(color)
unique_colors = set(colors) # Grab out the unique colors from the image

# Cycle through the unique colors in the image and replace all "light" colors with white
for u in unique_colors: 
    if sum(u) > 220: # Consider a color "light" if the sum of its red, green, and blue values are greater than 220
        data = change_to_white(data, u[0], u[1], u[2])

# save the new image
im2 = Image.fromarray(data)
im2.save(imagepath)

Google Solves their own reCAPTCHAs

While Google has since made improvements to their reCAPTCHAs, there was a proof of concept github project that would use Google’s Speech Recognition service to solve the audio portion of their reCAPTCHAs. Project info can be found here: https://github.com/ecthros/uncaptcha

Wrap-Up

Web application vulnerabilities continue to be a significant risk to organizations. Silent Break Security understands the offensive and defensive mindset of security, and can help your organization better prevent, detect, and defend against attackers through more sophisticated testing and collaborative remediation.