This is a small collection of string hash function comparison.

## What is a Hash Function

Before checking string hash comparision table, we need to know what a hash function is. A hash function is any function that can map data of any size to values of a fixed size. Hash values, hash codes, digests, or just “hashes” are the names for the values that a hash function gives back. Most of the time, the values are used to index a table with a fixed size called a “hash table.” Hashing or scatter storage addressing is the use of a hash function to index a hash table.

Hash functions and the hash tables that go with them are used in applications that store and retrieve data to get to the data quickly and in almost the same amount of time each time. They only need a little bit more storage space than the amount needed for the data or records themselves. Hashing is a way to access data that is easy to compute and takes up little storage space. It avoids the non-constant access times of ordered and unordered lists and structured trees, as well as the often exponential storage needs of direct access to state spaces with large or variable-length keys.

Commonly used string Hash functions and ELFHash, APHash, etc., are very simple and effective methods. These functions use bitwise operations so that each character contributes to the final function value. There are also hash functions represented by MD5 and SHA1, which are almost impossible to find collisions.

Commonly used string hash functions are BKDRHash, APHash, DJBHash, JSHash, RSHash, SDBMHash, PJWHash, ELFHash and so on. For the above hash functions, I made a small evaluation of them.

### String hash function comparison

Among them,

**Data 1**is the number of hash collisions of random strings composed of 100,000 letters and numbers.**Data 2**is the number of hash collisions of 100,000 meaningful English sentences.**Data 3**is the number of conflicts stored in the linear table after the hash value of data 1 is modulo 1000003 (a large prime number).**Data 4**is the number of conflicts stored in the linear table after the hash value of data 1 is modulo 10000019 (a larger prime number).

Hash Function | Data1 | Data2 | Data3 | Data4 | D1 Score | D2 Score | D3 Score | D4 Score | Average |
---|---|---|---|---|---|---|---|---|---|

BKDRHash | 2 | 0 | 4774 | 481 | 96.55 | 100 | 90.95 | 82.05 | 92.64 |

APHash | 2 | 3 | 4754 | 493 | 96.55 | 88.46 | 100 | 51.28 | 86.28 |

DJBHash | 2 | 2 | 4975 | 474 | 96.55 | 92.31 | 0 | 100 | 83.43 |

JSHash | 1 | 4 | 4761 | 506 | 100 | 84.62 | 96.83 | 17.95 | 81.94 |

RSHash | 1 | 0 | 4861 | 505 | 100 | 100 | 51.58 | 20.51 | 75.96 |

SDBMHash | 3 | 2 | 4849 | 504 | 93.1 | 92.31 | 57.01 | 23.08 | 72.41 |

PJWHash | 30 | 26 | 4878 | 513 | 0 | 0 | 43.89 | 0 | 21.95 |

ELFHash | 30 | 26 | 4878 | 513 | 0 | 0 | 43.89 | 0 | 21.95 |

After comparison, the above average scores are obtained. The mean is the mean of squares. It can be found that the effect of BKDRHash is the most prominent in both the actual effect and the encoding implementation. APHash is also a relatively good algorithm. DJBHash, JSHash, RSHash and SDBMHash each have their own merits. PJWHash and ELFHash have the worst effect, but have similar scores, and their algorithmic essences are similar.

In the information repair competition, based on the principle of easy coding and debugging, I personally think that BKDRHash is the most suitable for memory and use.