Google recently launched a new version of reCAPTCHA, which it claims is more robust against bots and easier on humans.
While this video on YouTube by Google is pretty convincing, things got a little interesting when we dug deeper. The new approach, which looks like a sophisticated bot identification algorithm, turns out to be little more than a use of browser cookies.
So here’s what happens when you are thrown a reCAPTCHA challenge:
- You are asked to solve a reCAPTCHA image the first time.
- The result of evaluating the text you entered is cached in your browser’s cookies.
- The next time you visit the page, or any page which requires you to pass reCAPTCHA, the information from these cookies is used to identify whether you have passed the test before.
A simple test can be done here: https://wordpress.org/support/register.php.
After you solve the reCAPTCHA image the first time, you are not asked to solve one when you visit again. But once you delete your cookies and try again … there! Back to square one: you have to solve the image again before the form submission succeeds. Google has simply used cookies to retain information about your authenticity.
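The behaviour observed in this test can be pictured with a small toy model. This is only a sketch of what we saw from the outside, not Google's implementation: the cookie name `captcha_passed` and the helpers `needs_challenge` and `record_pass` are made up for illustration.

```python
# Toy model of the cookie behaviour observed in the test above.
# COOKIE_NAME, needs_challenge and record_pass are hypothetical names;
# this is not how the real service is implemented.

COOKIE_NAME = "captcha_passed"  # assumed cookie name, for illustration only


def needs_challenge(cookies: dict) -> bool:
    """Return True when the visitor has to solve an image."""
    return cookies.get(COOKIE_NAME) != "1"


def record_pass(cookies: dict) -> None:
    """Remember a solved challenge in the visitor's cookie jar."""
    cookies[COOKIE_NAME] = "1"


if __name__ == "__main__":
    jar = {}                      # fresh browser, no cookies yet
    print(needs_challenge(jar))   # first visit
    record_pass(jar)              # the image was solved once
    print(needs_challenge(jar))   # repeat visit with the cookie present
    jar.clear()                   # the user deletes their cookies
    print(needs_challenge(jar))   # the challenge comes back
```

In this model the only thing separating a returning human from anything else is the content of the cookie jar, which is exactly what the test suggests.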
What does this mean for bots? A bot can use an OCR tool to solve the image, or have a person solve it once, after which the bot can retain the cookies and continue scraping!
P.S.: Well, we haven’t got to the main course yet!
The new version of reCAPTCHA can also be bypassed by another technique, which uses the website’s public key (called data-sitekey). Wait, what? Yes! Let’s say a bot wanted to bypass website X’s reCAPTCHA without letting a user (on website Y) know that they are helping a bot do so. More technically, this is known as clickjacking, or a UI redress attack. The bot could use the data-sitekey of website X and disable the Referer header on a web page on Y where the user is asked to solve the reCAPTCHA.
Once the user solves the CAPTCHA, the response (called “g-recaptcha-response”) can be used by a bot running in the background to submit a form on website X. This way, the bot could trick Google into thinking that the solved reCAPTCHA response originated from website X (while it is actually coming from Y). Hence, the bot is able to proceed with scraping website X. This magically works because Google doesn’t validate the Referer header if it has been disabled by the client or is empty. A genuine user has just contributed to a bot scraping website X without realizing that they were being used as an access card.
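The gap comes down to one skipped comparison, which the toy model below tries to make concrete. It is a sketch under our own assumptions, not Google's code: the registry `SITEKEY_OWNER`, the function `challenge_allowed`, its `strict` flag, and the domains `x.example` and `y.example` are all hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical registry mapping a site key to the domain that owns it.
SITEKEY_OWNER = {"sitekey-of-X": "x.example"}


def challenge_allowed(sitekey: str, referer: Optional[str],
                      strict: bool = False) -> bool:
    """Decide whether a challenge for `sitekey` may be served.

    With strict=False this models the behaviour described in the post:
    the origin is only compared with the key's owner when a Referer is
    present. With strict=True a missing Referer is rejected instead.
    """
    owner = SITEKEY_OWNER.get(sitekey)
    if owner is None:
        return False              # unknown site key
    if not referer:
        return not strict         # missing or empty Referer
    return urlparse(referer).hostname == owner


if __name__ == "__main__":
    key = "sitekey-of-X"
    print(challenge_allowed(key, "https://x.example/form"))  # the real site
    print(challenge_allowed(key, "https://y.example/page"))  # Y, Referer sent
    print(challenge_allowed(key, None))                      # Y, Referer disabled
    print(challenge_allowed(key, None, strict=True))         # stricter variant
```

In this model a page on Y that sends its Referer is turned away, while the same page with the Referer disabled gets through, which is the loophole described above.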
This post was inspired by the original blog article by @homakov. A sample of this has already been implemented and hosted on GitHub.