User:Sabinaazim/Hash collision

In computer science, a hash collision or clash occurs when two distinct pieces of data share the same hash value. The hash value is derived from a hash function, which takes a data input and returns a fixed-length string of bits.

Because hash functions can map different data to the same hash (by virtue of the pigeonhole principle), malicious users can take advantage of collisions to mimic data.

Due to the possible negative applications of hash collisions in data management and computer security (in particular, cryptographic hash functions), collision avoidance has become a fundamental topic in computer science.

Overview
Hash collisions can be unavoidable, depending on the number of objects in a set and on whether the bit string they are mapped to is long enough. When there is a set of n objects and n is greater than |R|, where R is the range of the hash function, the probability that there will be a hash collision is 1, meaning it is guaranteed to occur.
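The pigeonhole argument can be illustrated with a short Python sketch. The helper names `tiny_hash` and `find_collision` are hypothetical, and truncating SHA-256 to 8 bits is only an assumption made to keep the range small (|R| = 256); hashing 257 distinct inputs must then produce at least one collision.

```python
import hashlib

def tiny_hash(data: bytes, bits: int = 8) -> int:
    """Toy hash (illustrative only): keep the low `bits` bits of SHA-256."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

def find_collision(bits: int = 8):
    """Hash distinct inputs until two share a hash value; return that pair."""
    seen = {}
    # The range has 2**bits values, so 2**bits + 1 distinct inputs
    # are guaranteed to contain a collision (pigeonhole principle).
    for i in range((1 << bits) + 1):
        data = str(i).encode()
        h = tiny_hash(data, bits)
        if h in seen:
            return seen[h], data, h
        seen[h] = data
    raise AssertionError("unreachable: pigeonhole principle")

a, b, h = find_collision()
print(a, b, "both hash to", h)
```

In practice a collision usually appears long before all 257 inputs have been tried, which is the birthday effect rather than the pigeonhole guarantee.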

The impact of collisions depends on the application. When hash functions and fingerprints are used to identify similar data, such as homologous DNA sequences or similar audio files, the functions are designed so as to maximize the probability of collision between distinct but similar data, using techniques like locality-sensitive hashing. Checksums, on the other hand, are designed to minimize the probability of collisions between similar inputs, without regard for collisions between very different inputs. Instances where bad actors attempt to create or find hash collisions are known as collision attacks.

In practice, security-related applications use cryptographic hash algorithms, which are designed to be long enough for random matches to be unlikely, fast enough that they can be used anywhere, and safe enough that it would be extremely hard to find collisions.

Probability of Hash Collisions
The probability of a hash collision varies depending on the hash function selected to generate the hash value. Consider the following hash functions: CRC-32, MD5, and SHA-1. These are common hash functions with varying levels of collision risk.

CRC-32
Of the three, CRC-32 poses the highest risk of hash collisions, and it is generally not recommended where collisions must be avoided. If a hub were to contain 77,163 hash values, the chance of a hash collision occurring would be about 50%, which is extremely high compared to the other methods.

MD5
MD5 is widely used and, of the three hash functions mentioned above, it is the middle ground for hash collision risk. For a 50% chance of a hash collision occurring, there would have to be about 2.2 × 10¹⁹ records in the hub, which is the birthday bound for a 128-bit hash value.

SHA-1
SHA-1 poses the lowest risk of hash collisions of the three; however, it is often not available in certain tools, which is why many people resort to MD5. For SHA-1 to have a 50% chance of a hash collision occurring, there would have to be about 1.42 × 10²⁴ records in the hub. Note that the numbers of records mentioned in these examples would all have to be in the same hub.

A hub with a smaller number of records has a lower probability of a hash collision with any of these hash functions, although a small risk is always present.
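Record counts of this kind follow from the birthday bound. The Python sketch below assumes hash values are uniformly distributed over 2^b possibilities, where b is the output width in bits (32 for CRC-32, 128 for MD5, 160 for SHA-1), and uses the standard approximation p ≈ 1 − exp(−n(n − 1) / (2 · 2^b)). The function names are hypothetical.

```python
import math

def records_for_collision_probability(bits: int, p: float = 0.5) -> float:
    """Approximate number of uniformly random `bits`-bit hash values
    needed for at least one collision with probability p."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - p)))

def collision_probability(n: float, bits: int) -> float:
    """Approximate probability of at least one collision among n values."""
    return -math.expm1(-n * (n - 1) / (2 * 2 ** bits))

# Output widths in bits for the three functions discussed above.
for name, bits in [("CRC-32", 32), ("MD5", 128), ("SHA-1", 160)]:
    n = records_for_collision_probability(bits)
    print(f"{name}: about {n:.3g} records for a 50% collision chance")
```

These estimates describe accidental collisions among random inputs only; they say nothing about how hard it is for an attacker to construct a collision deliberately.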

Resolving Collisions
Since hash collisions are inevitable, hash tables have mechanisms for dealing with them, known as collision resolution. Two common strategies are open addressing and separate chaining.

Open Addressing
In this method, each cell in the hash table is assigned one of three states: occupied, empty, or deleted. If a hash collision occurs, the table is probed to find an alternate cell that is marked as empty, and the record is placed there. Several types of probing can be used, including linear probing, double hashing, and quadratic probing.
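A minimal Python sketch of open addressing with linear probing is shown here. The class name `LinearProbingTable` and its methods are hypothetical. The table has a fixed capacity and never resizes, which a real implementation would need to handle, and the sketch assumes a deleted cell may be reused on insertion.

```python
EMPTY, DELETED = object(), object()  # sentinels; any other slot value is occupied

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        """Yield cell indices starting at the home cell, stepping by one."""
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def put(self, key, value):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break  # key cannot appear beyond an empty cell
            if slot is DELETED:
                if first_free is None:
                    first_free = i  # remember reusable cell, keep searching
            elif slot[0] == key:
                self.slots[i] = (key, value)  # update existing record
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = (key, value)

    def _find(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                break
            if slot is not DELETED and slot[0] == key:
                return i
        raise KeyError(key)

    def get(self, key):
        return self.slots[self._find(key)][1]

    def remove(self, key):
        # Mark as deleted rather than empty so later probes do not stop early.
        self.slots[self._find(key)] = DELETED

table = LinearProbingTable(capacity=8)
table.put(1, "first")
table.put(9, "second")  # 9 % 8 == 1, so this collides with key 1 and is probed onward
```

The deleted state matters because simply emptying a cell would cut the probe sequence and hide records that were placed after it.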

Separate Chaining
This strategy allows more than one record to be 'chained' to the same cell in a hash table. If two records are directed to the same cell, both go into that cell as a linked list. This handles a hash collision efficiently, since records with the same hash value can share a cell, but it has its disadvantages. Keeping track of many lists adds overhead, and long lists can make lookups slow.
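Separate chaining can be sketched in Python as follows. The class name `ChainedTable` is hypothetical, and for brevity each cell holds a Python list in place of a linked list; the behaviour on collision is the same, since colliding records are appended to the chain in their shared cell.

```python
class ChainedTable:
    def __init__(self, capacity=8):
        # One chain (here a Python list of (key, value) pairs) per cell.
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # update existing record
                return
        bucket.append((key, value))  # collision: extend the chain

    def get(self, key):
        # Lookup cost grows with the length of the chain in this cell.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

table = ChainedTable(capacity=8)
table.put(1, "first")
table.put(9, "second")  # 9 % 8 == 1, so both records share cell 1
```

Because a lookup scans the whole chain in a cell, performance degrades when many records land in the same cell, which is the slowdown described above.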