Hashing is a practical way of maintaining a dictionary, taking advantage of the fact that looking an item up in an array takes constant time once you have its index. With a hash function, we can create a hash table for fast look-ups.
Hash functions also provide useful tools for many things beyond powering hash tables, since the fundamental idea is just of many-to-one mappings, where “many” is controlled so it is very unlikely to be “too many”. Some examples of this are duplicate detection, canonicalization, and compaction.
Hash Function
Let’s say we have a key and a value.
The hash function maps the key to an integer. Then, the output of the hash function (hash value) is used as index, such that the value is stored in the array (hash table) at this position.
The first step of the hash function is usually to map each key to a big integer. For example, let’s say we are writing a hash function for a string . Let be the size of the alphabet on which is written. Let char(c)
be a function that maps each symbol of the alphabet to a unique integer from to . The function
maps each string to a unique large integer by treating the characters of the string as “digits” in the base- number system. We have to make sure the numbers produced by this function are small enough to fit in our hash table, so we do .
- Hash functions must always return the same hash value for two items representing the same values.
- Two different items hashing to the same value is a hash collision.
Some properties of good hash functions:
- Easy to calculate the hash value (function has low time complexity)
- Low chance of hash collision
- All possible values are utilized approximately equally
Hash Function Example
Hash functions are an abstract concept that can be applied to any arbitrary data type. For example, here is a simple hash function for an integer array:
- Sum up each entry in the array and modulo the sum by 100
[5, 23, 84] -> (5 + 23 + 84) % 100 = 12
Efficiency
Average time complexity is for each insertion/access operation on the table, where is the number of entries in the hash tables and is the array size of the table. This is due to the pigeonhole principle.
For the worst case (everything is hashed to the same key), the time complexity becomes , which is why it’s important to have a good hash function where collision is unlikely.
Dynamic hash table – size dynamically changes as more items are added. Each access to the table has an average time complexity of , which is close to accessing an array. When you need an non-integer index, you can use a hash table instead
- Python has
dict
class - Java has
HashMap
class