User:SelimLakhdar/sandbox

A human interaction proof (HIP), also known as a CAPTCHA or reverse Turing test, is a system used to tell humans and computer bots apart semi-automatically.

HIPs mainly rely on unresolved AI problems to generate challenges that are easily solvable by humans while remaining too hard for computer bots. The approach depends on the fact that some problems are still too hard for a bot to solve, so in theory a bot that passes the test could also be used to solve the underlying AI problem.

Designing such systems requires a trade-off between security and accessibility.

Many implementations exist, especially where accessibility matters more than security. Designing an accessible HIP has become a major concern.

Many HIPs have been released over the years; this intense activity is due to the various attacks that have broken most of them. Machine learning is the most advanced and most widely used attack technique.

History and evolution


In the 1950s, Alan Turing was the first to ask how human and machine behavior could be told apart, as part of his attempt to determine whether a computer can think. He devised the Turing test, which consists in trying to distinguish a human from a machine through a challenge–response test. This first formulation was not meant to be automated: the test was administered and evaluated by a human.

With the growth in the number of Internet users and the appearance of the first web services, it became necessary to distinguish automatically between normal human actions and automated actions performed by bots.

The first attempts prompted various attacks that broke the systems, which in turn suggested new ways to build a strong HIP. This revealed that building a HIP is not an easy task: the generated challenge has to be as easy as possible for humans to pass and as hard as possible for computers to solve.

Over the years, the most common form of HIP has been a visually distorted image of a string of letters and numbers that can be read by humans but not by bots.

This approach was criticized for neglecting people with disabilities. Alternatives were introduced, such as audio-based and puzzle-based CAPTCHAs.

One of the main ideas behind building new HIP systems is to use unsolved AI problems to generate and verify the challenge. Indeed, if attackers break a HIP based on a problem that no feasible machine can solve efficiently, the break itself represents progress in that field. This idea was also encouraged by the popularization of the 1024-bit integer factoring problem for cryptographic protocols.

Usage and utility
HIP systems are used to protect services from automated attacks. Generally, they are deployed in the front-end layer, where the user's interaction is checked before a request is sent to the server. HIPs thus act as a preventive measure that controls access to the back-end layer.

Some notable uses are:


 * Preventing comment spam.
 * Protecting website registration.
 * Protecting email addresses from crawlers.
 * Preventing dictionary attacks (brute-force attacks).
 * Tracking online bots in social network interactions.
 * Search Engine Bots.

CAPTCHAs are also used to mitigate the risk of password eavesdropping and to discourage password phishing by some malware, specifically in the TLS protocol to counter man-in-the-middle (MITM) attacks.

Usability/accessibility
Designing a HIP is a complex problem because it is a trade-off between security and accessibility, and finding the right balance between accessibility and strength against attacks is difficult. The HIP has to be difficult for a computer to solve while remaining easy for a human. An automated script should not succeed more than 1 time in 10,000 tries (a success rate of 0.01%), and a human should succeed at least 90% of the time.

Samaras et al. led a study in the field of human recognition to understand how the human brain analyses and understands an image. To build more resistant HIPs, various researchers have attempted to explain the functioning of the human mind in terms of more basic processes, such as speed of processing, controlled attention and working memory capacity.

The CAPTCHA is the most widely used system even though it does not provide an acceptable trade-off between security and accessibility. According to a survey conducted in the US, 37 million users are blind, which is a major concern for text-based CAPTCHAs. Indeed, CAPTCHAs are the greatest security-related problem for users with disabilities, especially blind users. Even the newer audio-based CAPTCHAs remain inaccessible to certain users.

Security
Security is a very important aspect of a HIP system. It is the key to preventing computer bots from bypassing the system, while the challenge must remain solvable by humans in a reasonable time. The properties that make a problem hard to solve and resistant to bot attacks are discussed by Bergadano et al. A CAPTCHA is considered robust if the success rate of attacks is less than 0.01%. However, the CAPTCHA should also be usable, i.e. the human success rate should be at least 90%. Other studies revised the robustness threshold for bot attacks from 0.01% to 1%, arguing that this value is more meaningful.
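The thresholds above can be turned into a simple check. The following Python sketch is only an illustration: the function name `evaluate_scheme` and its parameters are hypothetical, and the measured counts would have to come from an actual evaluation of a scheme.

```python
def evaluate_scheme(bot_successes, bot_attempts,
                    human_successes, human_attempts,
                    bot_threshold=0.0001, human_threshold=0.90):
    """Compare measured success rates with the robustness and usability thresholds.

    bot_threshold defaults to 0.01% (0.0001); pass 0.01 for the revised 1% value.
    All names here are illustrative, not taken from a specific study.
    """
    bot_rate = bot_successes / bot_attempts
    human_rate = human_successes / human_attempts
    return {
        "bot_rate": bot_rate,
        "human_rate": human_rate,
        "robust": bot_rate < bot_threshold,       # attacks succeed rarely enough
        "usable": human_rate >= human_threshold,  # humans pass often enough
    }


# A scheme broken 50 times in 10,000 attempts (0.5%) fails the strict
# 0.01% threshold but passes the revised 1% threshold.
strict = evaluate_scheme(50, 10_000, 93, 100)
relaxed = evaluate_scheme(50, 10_000, 93, 100, bot_threshold=0.01)
print(strict["robust"], relaxed["robust"])
```

The example shows how the choice between the 0.01% and 1% thresholds can change the verdict for the same measurements.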

AI usage
Over time, the use of AI to build such systems has been explored. Using hard unsolved AI problems to generate challenges was also seen as a way to advance the field. However, finding a suitable AI problem that allows challenges to be generated automatically is not an easy task. Text recognition is a field of interest in AI. Well-aligned text is already recognized by computer programs, so researchers worked on distorted text or, more commonly, handwritten text recognition. The difficulty of recognizing distorted text came from the segmentation problem, where the challenge for AI is to separate interlaced characters, but this difficulty no longer seems to be a significant obstacle.

Segmentation resistance
The most widely used technique for bypassing a text-based CAPTCHA is segmentation. The more anti-segmentation effects are combined in the design, the more secure the HIP is. Adding noise, lines and random arcs, as well as rotation, scaling and distortion, are commonly used techniques.
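As a rough illustration of why added lines and noise hinder segmentation, the Python sketch below counts the ink-free columns that a naive attacker could use to split a challenge into characters. The toy bitmap `CLEAN` and the helper names `blank_columns`, `add_strike_line` and `add_noise` are hypothetical, and real schemes work on rendered images rather than small 0/1 grids.

```python
import random

# Toy bitmap: 1 = ink, 0 = background. Three "glyphs" separated by blank columns.
CLEAN = [
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
]


def blank_columns(bitmap):
    """Return the indices of columns that contain no ink."""
    width = len(bitmap[0])
    return [x for x in range(width) if all(row[x] == 0 for row in bitmap)]


def add_strike_line(bitmap, row_index):
    """Return a copy of the bitmap with a horizontal line drawn across one row."""
    out = [row[:] for row in bitmap]
    out[row_index] = [1] * len(out[row_index])
    return out


def add_noise(bitmap, density, rng):
    """Return a copy where each pixel is turned into ink with probability `density`."""
    out = [row[:] for row in bitmap]
    for row in out:
        for x in range(len(row)):
            if rng.random() < density:
                row[x] = 1
    return out


print(blank_columns(CLEAN))                      # gaps between glyphs
print(blank_columns(add_strike_line(CLEAN, 1)))  # the line removes every gap
noisy = add_noise(CLEAN, 0.2, random.Random(0))
print(blank_columns(noisy))                      # noise can only remove gaps
```

A single line across the text already removes every blank column, which is why such effects are usually combined with distortion rather than used alone: a line that is easy to filter out restores the gaps.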

Challenge generation
The ability to generate many instances of the problem is also important for scalability. Another concern is resisting parallel (brute-force) attacks.
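A minimal Python sketch of these two concerns, under the assumption that the challenge is a random string over a fixed alphabet: it generates challenge instances and estimates how likely blind guessing is to succeed. The names `generate_challenge`, `challenge_space` and `blind_guess_success`, the 36-symbol alphabet and the default length of 6 are illustrative choices, not part of any particular scheme.

```python
import secrets
import string

# Assumed alphabet: 26 uppercase letters + 10 digits = 36 symbols.
ALPHABET = string.ascii_uppercase + string.digits


def generate_challenge(length=6, alphabet=ALPHABET):
    """Draw a fresh challenge string using a cryptographically strong RNG."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


def challenge_space(length=6, alphabet=ALPHABET):
    """Number of distinct challenges that can be generated."""
    return len(alphabet) ** length


def blind_guess_success(attempts, length=6, alphabet=ALPHABET):
    """Probability that at least one of `attempts` independent random guesses,
    each made against a fresh challenge, is correct."""
    p = 1 / challenge_space(length, alphabet)
    return 1 - (1 - p) ** attempts


print(generate_challenge())
print(challenge_space())  # 36 ** 6 = 2176782336
print(blind_guess_success(1_000_000))
```

Under these assumptions, a larger challenge space makes blind brute force less effective, but it does not protect against attacks that actually recognize the characters.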

Public Sources
Beyond the preceding rules for designing a strong HIP system, publishing the source code seems to be the most effective way to harden the system against attacks. CAPTCHA systems that rely on private databases or algorithms to generate their challenges are ruled out under this principle. This raises the risk of an adversary generating all possible tests and using a hash function to look up the answer in a pre-computed database, or attempting reverse engineering.

CAPTCHAs
The first implementations of CAPTCHA (/kæp.tʃə/, an acronym for "completely automated public Turing test to tell computers and humans apart") aimed at goals such as easy generation of multiple instances of the challenge and ease of use. The CAPTCHA, introduced in 2000, is the most widely used HIP scheme. It relies on the gap between humans and bots in analyzing visual information, and uses text or image deformation and distortion to build the challenge.

Since then, HIPs have evolved and many other implementations have appeared. Many tasks have been studied, such as gender recognition, facial expression understanding, body part detection, nudity detection, naive drawing understanding, handwriting understanding, speech recognition and filling in missing words.

We can classify CAPTCHAs into different categories:

 * Visual CAPTCHA: relies on the generation of a visual challenge. These CAPTCHAs are not suited to users with disabilities.
 * Text-based CAPTCHA: relies on text deformation, distortion and added noise such as arcs to generate the challenge. This scheme is the most widely used because it uses alphanumeric symbols, which are directly accessible from the keyboard. Some notable implementations are Pessimal Print, BaffleText, ScatterType, GIMPY and EZ-GIMPY.
 * Image-based CAPTCHA: relies on image recognition, based on the difficulty bots have in understanding images. The user is usually asked to recognize some aspect of an image or to group similar images. Implementations of such CAPTCHAs include Bongo, ESP-Pix, Asirra, Imagination and ARTiFACIAL.
 * Moving objects CAPTCHA: a recent kind of CAPTCHA that uses animation to display the challenge; users are asked to type what they have seen or perceived. Generating the challenge remains hard to implement.
 * Non-visual CAPTCHA: introduced for users with disabilities. The challenge relies on sound recognition or semantic understanding.
 * Semantic CAPTCHA: relies on the gap in sentence understanding between humans and bots. The generated challenge can be a simple question such as "What's the color of the sky?". These CAPTCHAs are vulnerable to attacks using a computational knowledge engine such as Wolfram Alpha, or even a search engine.
 * Audio CAPTCHA: uses sound deformation of a spoken sentence, with added noise.
 * Other approaches: the techniques above can be combined into a hybrid scheme. One example is the HIPUU CAPTCHA, which combines image-based and audio-based CAPTCHAs.

Attacks on HIPs
Security is a continuous game between attackers or researchers and security engineers. The PWNtcha project describes itself as follows: "PWNtcha stands for "Pretend We’re Not a Turing Computer but a Human Antagonist", as well as PWN capTCHAs. This project’s goal is to demonstrate the inefficiency of many captcha implementations."

Work has also been done on automatically recognizing the HIP scheme in use, in order to build a generic way to break CAPTCHAs.

OCR
Optical character recognition (OCR) is used to recognize the content of a document. It relies on multiple techniques, such as binarization to remove noise pixels; after this pixel cleaning, edge detection is more effective. Another important technique is segmentation, which separates and detects individual letters. If these techniques alone are not sufficient to break the HIP, feeding the segmentation result to a support vector machine (SVM) for character recognition can work.
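The preprocessing steps can be sketched in a few lines of Python. This is a simplified illustration on a tiny grayscale grid, not a real OCR pipeline: the function names `binarize`, `remove_isolated_pixels` and `segment_columns`, the threshold of 128 and the sample image `GRAY` are all assumptions, and the final character classification (for example with an SVM) is left out.

```python
def binarize(gray, threshold=128):
    """Map grayscale pixels (0 = black, 255 = white) to 1 (ink) or 0 (background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_isolated_pixels(bitmap):
    """Clear ink pixels that have no ink among their 8 neighbours (simple noise removal)."""
    h, w = len(bitmap), len(bitmap[0])
    out = [row[:] for row in bitmap]
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] == 0:
                continue
            neighbours = sum(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0
    return out


def segment_columns(bitmap):
    """Split the image into (start, end) column ranges separated by ink-free columns.

    `end` is exclusive. Each range is a candidate character.
    """
    width = len(bitmap[0])
    segments, start = [], None
    for x in range(width):
        has_ink = any(row[x] for row in bitmap)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Two dark glyphs plus one dark noise pixel in the top row (column 3).
GRAY = [
    [20, 20, 240, 90, 240, 30, 30, 240, 240],
    [20, 20, 240, 240, 240, 30, 30, 240, 240],
    [20, 20, 240, 240, 240, 30, 30, 240, 240],
]

bits = binarize(GRAY)
print(segment_columns(bits))                          # noise yields a spurious segment
print(segment_columns(remove_isolated_pixels(bits)))  # cleaned image: two segments
```

Each resulting column range would then be cropped and passed to a character classifier. Column-based splitting only works when characters do not touch or overlap, which is exactly what anti-segmentation effects try to prevent.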

Machine learning
Machine learning is a widely used technique for breaking CAPTCHAs. It consists in designing an automated solver. Most HIPs are pure recognition tasks that can easily be broken using machine learning.

The use of machine-learning-based attacks is a concern when building HIPs. In August 2014, Bursztein et al. presented the first generic CAPTCHA-solving algorithm based on reinforcement learning and demonstrated its efficiency against many popular CAPTCHA schemes. They concluded that text-distortion-based CAPTCHA schemes should be considered insecure moving forward.

Stealing cycles (redirection)
One possible attack on a CAPTCHA system is to redirect the challenge to another user, who solves it on the attacker's behalf. This technique was first used on pornographic websites.