University of Michigan at Ann Arbor
Last Edit Date: 05/30/2023
Disclaimer and Term of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file
This personal note is adapted from Professor Marcus Darden and David Paoletti. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
i. Introduction
A container of items (key / value pairs) that supports two basic operations:
Insert a new item
Search (retrieve) and item with a given key
Two primary uses:
A set of things: check if something is in the set
Key-Value storage: look up values by keys
ii. Dictionary ADT Operations
Desiable Operations:
Insert a new item
Search for item(s) having a given key
Remove a specified item
Sort the dictionary
Select the k-th largest item in a dictionary
Join two dictionaries
Other basic container operations: construct, test if empty, destory, copy, ...
iii. Containers with key lookup
Identifying a container with fast search and fast insert for arbitrary <key, value>
pairs
Sorted Vectors: Insert will be $O(n)$
Unsorted vectors: Search will be $O(n)$
Sorted / Unsorted Linked Lists: Search is $O(n)$
Binary Search Tree (STL map<>
)
Search: average $O(n \log n)$, worst case $O(n)$
Insert: average $O(n \log n)$, worst case $O(n)$
Hash Table (STL unordered_map<>
)
Search: average $O(1)$, worst case $O(n)$
Insert: average $O(1)$, worst case $O(n)$
iv. What is the set of keys is small?
Example: calendar for 1 ... n days
n could be 365 or 366
Can look up a particular day in $O(1)$ time
Every day is represented by a bucket (i.e. some containers)
If we have a range of integers that fits into memory, everything is easy.
i. Introduction
Locate items in a table by key
Need
Translation: converts a search key into an integer
Compression: limits an interger to a valid index
Collision ressolution: resoves search keys that hash to same table index
ii. Direct Addressing
Each key maps to an element of the array
iii. Hashed Addressing
Only actual keys map to an element of the array
iv. Hash Function
Translation: $t(\text{key}) \Rightarrow \text{hashint}$
Compression: $c(\text{hashint}) \Rightarrow \text{table index}$
A hash function combines translation and compression:
$$h(\text{key}) \Rightarrow c(\text{key}) \Rightarrow \text{table index}$$Note: std::hash<>
provides only translation, not compression.
Example:
v. Translating Floating Points
Example: $\lfloor 0.6 \times 10 \rfloor = 6$. $M$ is the size of the hash table.
Note: $(\text{key} - \text{s}) / (\text{t} - \text{s})$ transforms to $[0, 1)$
Example: range = $[1.38, 6.75)$, $M = 13$, compute $h(3.65)$
$h(\text{3.65}) = \lfloor (3.65 - 1.38) / (6.75 - 1.38) \times 13 \rfloor = \lfloor (2.27 / 5.37) \times 13 \rfloor = \lfloor 5.495 \rfloor \rightarrow \text{index } 5$
vi. Translating Integers
Modular hash function:
$t(\text{key}) = key$
$h(\text{key}) = c(t(\text{key})) = \text{key} \mod M$
Great if keys randomly distributed
Often, keys are not randomly distributed
Example: midterm bubbles scores all multiple of 2.5
Example: pick a number from 1 to 10
Do not want to pick a bad M
vii. Translating Strings
Simple has sums ASCII charactrt codes
Problem case: stops, tops, pots, spot
Sum of each is equivalent
All with map to same hash table address
Multiple collisions
Solution: Character posisiton is important
Consider decimal numbers
$123 = 1 \times 10^2 + 2 \times 10^1 + 3 \times 10^0$
$321 = 3 \times 10^2 + 2 \times 10^1 + 1 \times 10^0$
Better Translation: Rabin Fingerprint
Instead of adding up character codes, view strings as decimal numbers
Characters: 'T', '0', 'M', ' ', 'M', ...
ASCII codes: 84, 79, 77, 32, 77, ...
Running fingerprints
T $\rightarrow$ 84
TO $\rightarrow$ 10 * 84 + 79
TOM $\rightarrow$ 10 * (10 * 84 + 79) + 77
TOM' ' $\rightarrow$ 10 * (10 * (10 * 84 + 79)) + 32
Base 10 is used for illustration only (use larger numbers)
Shuffling the chars usually changes result
vii. Compression
$c(\text{hashint}) \Rightarrow \text{index in range }[0, M)$
Division Method
$|\text{hashint}| \mod M$, where M is prime
MAD (multiply and divide) Method
$|a \times \text{hashint} + b| \mod M$, where a and b are prime. Use this method when we cannot control M.
Note: $a \mod M$ must not equal 0.
Example:
viii. Hash Table Size
Table of size M
About the number if element expected
If unsure, guess high
Hash function must return keys as integers in range $[0, M)$
In visual studio unordered_map
grows like this:
ix. Collision
x. Good Hash Functions
Benefits of hash tables depend on having "good" hashing functions
Must be easy to compute
Will compute a hash for every key
Will compute same hash for same key (Deterministic)
Should distribute keys evenly in table
Will minimize collisions
Trivial, poor hash function: h(key) {return 0;}
xi. Complexity of Hashing
For simplicity, assume perfect hasing (no collisions)
insert | search | remove |
---|---|---|
$O(1)$ | $O(1)$ | $O(1)$ |
Recall the Pigeonhole Principle to do the perfect hashing.
xii. Real-World Hash Tables
C++ hash table containers (STL C++11+)
unordered_set<>
, unordered_multiset<>
unordered_map<>
, unordered_multimap<>
Instead of doing multi
, we can declare an unordered_map<string, vector<int>>
.
Database indices are built on hash tables
Compilers use a hash table for identifiers
Try it out:
xiii. Major Uses of Hahs Tables
Sets of ints, strings, images, class objects
Check set membership
Detect / filter duplicates
Find unique elements
Find matching elements (ex: a + b = 0)
Key-value storage: look up values by keys
Count things: value_type = int
Maintain sparse vectors: key_type = int
Maintain linked structures: key_type = value_type
Hashing is not human readable, but machine readable. That's why it is called unordered
.
Hashing is an efficient implementation of:
Insert a new item
Search for an item (or items) having a given key
Remove a specified item
Hashing is an inefficient implementation of:
Select the k-th largest item in a dictionary
Sort the items in the dictionary
Data structure for a Book Index
Compisite Hash Functions
Example: hang some information over geo-locations - hash {long, lat} pairs
$H({x, y}) = (h(x) + p \times h(y)) \mod M$
xiv. When Not to Use Hash Tables
In many applications, it is tempting to use unordered_set<>
or unordered_map<>
, but
Significant space overhead of nested containers
Every access computes hash function
$O(n)$ worst case time (STL implementation)
When keys are are small ints, use bucket arrays
Some keys can be coerced to small integers: enums, simple fractions, months
For static data, set membership, or loopkup
For key-value storage where traverals are needed but not lookup (ex: sparse vector)