EECS 281: Data Structures and Algorithms

Section 12 Dictionaries and Hash Tables

University of Michigan at Ann Arbor

Last Edit Date: 05/30/2023

Disclaimer and Term of Use:

We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file

This personal note is adapted from Professor Marcus Darden and David Paoletti. Please contact us to delete this file if you think your rights have been violated.

This work is licensed under a Creative Commons Attribution 4.0 International License.

1. Dictionary ADT¶

i. Introduction

A container of items (key / value pairs) that supports two basic operations:

Insert a new item
Search (retrieve) and item with a given key

Two primary uses:

A set of things: check if something is in the set
Key-Value storage: look up values by keys

ii. Dictionary ADT Operations

Desiable Operations:

Insert a new item
Search for item(s) having a given key
Remove a specified item
Sort the dictionary
Select the k-th largest item in a dictionary
Join two dictionaries

Other basic container operations: construct, test if empty, destory, copy, ...

iii. Containers with key lookup

Identifying a container with fast search and fast insert for arbitrary <key, value> pairs

Sorted Vectors: Insert will be $O(n)$
Unsorted vectors: Search will be $O(n)$
Sorted / Unsorted Linked Lists: Search is $O(n)$
Binary Search Tree (STL map<>)
- Search: average $O(n \log n)$, worst case $O(n)$
- Insert: average $O(n \log n)$, worst case $O(n)$
Hash Table (STL unordered_map<>)
- Search: average $O(1)$, worst case $O(n)$
- Insert: average $O(1)$, worst case $O(n)$

iv. What is the set of keys is small?

Example: calendar for 1 ... n days

n could be 365 or 366
Can look up a particular day in $O(1)$ time
Every day is represented by a bucket (i.e. some containers)

If we have a range of integers that fits into memory, everything is easy.

2. Hashing Defined¶

i. Introduction

Locate items in a table by key

Use arithmetic operations to calculate table index (bucket) from a given key

Need

Translation: converts a search key into an integer
Compression: limits an interger to a valid index
Collision ressolution: resoves search keys that hash to same table index

ii. Direct Addressing

Each key maps to an element of the array

iii. Hashed Addressing

Only actual keys map to an element of the array

iv. Hash Function

Translation: $t(\text{key}) \Rightarrow \text{hashint}$

Converts the key into an integer

Compression: $c(\text{hashint}) \Rightarrow \text{table index}$

Maps the hashed integer into the range $[0, M)$

A hash function combines translation and compression:

$$h(\text{key}) \Rightarrow c(\text{key}) \Rightarrow \text{table index}$$

Note: std::hash<> provides only translation, not compression.

Example:

v. Translating Floating Points

Key between 0 and 1: $[0, 1)$

$$h(\text{key}) = \lfloor key \times M \rfloor$$

Example: $\lfloor 0.6 \times 10 \rfloor = 6$. $M$ is the size of the hash table.

Key between s and t: $[s, t)$

$$h(\text{key}) = \lfloor (\text{key} - \text{s}) / (\text{t} - \text{s}) \times M \rfloor$$

Note: $(\text{key} - \text{s}) / (\text{t} - \text{s})$ transforms to $[0, 1)$

Example: range = $[1.38, 6.75)$, $M = 13$, compute $h(3.65)$

$h(\text{3.65}) = \lfloor (3.65 - 1.38) / (6.75 - 1.38) \times 13 \rfloor = \lfloor (2.27 / 5.37) \times 13 \rfloor = \lfloor 5.495 \rfloor \rightarrow \text{index } 5$

vi. Translating Integers

Modular hash function:

$t(\text{key}) = key$

$h(\text{key}) = c(t(\text{key})) = \text{key} \mod M$

Great if keys randomly distributed
- Often, keys are not randomly distributed
- Example: midterm bubbles scores all multiple of 2.5
- Example: pick a number from 1 to 10
Do not want to pick a bad M
- BAD: M and key have common factors

vii. Translating Strings

Simple has sums ASCII charactrt codes
Problem case: stops, tops, pots, spot
- Sum of each is equivalent
- All with map to same hash table address
- Multiple collisions
Solution: Character posisiton is important
- Consider decimal numbers
  
  $123 = 1 \times 10^2 + 2 \times 10^1 + 3 \times 10^0$
  
  $321 = 3 \times 10^2 + 2 \times 10^1 + 1 \times 10^0$

Better Translation: Rabin Fingerprint

Instead of adding up character codes, view strings as decimal numbers

Characters: 'T', '0', 'M', ' ', 'M', ...
ASCII codes: 84, 79, 77, 32, 77, ...
Running fingerprints

T $\rightarrow$ 84

TO $\rightarrow$ 10 * 84 + 79

TOM $\rightarrow$ 10 * (10 * 84 + 79) + 77

TOM' ' $\rightarrow$ 10 * (10 * (10 * 84 + 79)) + 32
Base 10 is used for illustration only (use larger numbers)
Shuffling the chars usually changes result

vii. Compression

$c(\text{hashint}) \Rightarrow \text{index in range }[0, M)$

hashint my be less than 0 or greater than to equal to M

Division Method

$|\text{hashint}| \mod M$, where M is prime

MAD (multiply and divide) Method

$|a \times \text{hashint} + b| \mod M$, where a and b are prime. Use this method when we cannot control M.

Note: $a \mod M$ must not equal 0.

Example:

viii. Hash Table Size

Table of size M
- About the number if element expected
- If unsure, guess high
Hash function must return keys as integers in range $[0, M)$

In visual studio unordered_map grows like this:

$$8 \rightarrow 64 \rightarrow 512 \rightarrow ... \text{(DOUBLE)}$$

x. Good Hash Functions

Benefits of hash tables depend on having "good" hashing functions
Must be easy to compute
- Will compute a hash for every key
- Will compute same hash for same key (Deterministic)
Should distribute keys evenly in table
- Will minimize collisions
  - Collision: two keys map to same index
Trivial, poor hash function: h(key) {return 0;}
- Easy to compute, but poor distribution maximizes collisions

xi. Complexity of Hashing

For simplicity, assume perfect hasing (no collisions)

insert	search	remove
$O(1)$	$O(1)$	$O(1)$

Recall the Pigeonhole Principle to do the perfect hashing.

xii. Real-World Hash Tables

C++ hash table containers (STL C++11+)
- unordered_set<>, unordered_multiset<>
- unordered_map<>, unordered_multimap<>
Instead of doing multi, we can declare an unordered_map<string, vector<int>>.
Database indices are built on hash tables
Compilers use a hash table for identifiers

Try it out:

Open in new window

xiii. Major Uses of Hahs Tables

Sets of ints, strings, images, class objects
- Check set membership
- Detect / filter duplicates
- Find unique elements
- Find matching elements (ex: a + b = 0)
Key-value storage: look up values by keys
- Count things: value_type = int
- Maintain sparse vectors: key_type = int
- Maintain linked structures: key_type = value_type

Hashing is not human readable, but machine readable. That's why it is called unordered.

Hashing is an efficient implementation of:
- Insert a new item
- Search for an item (or items) having a given key
- Remove a specified item
Hashing is an inefficient implementation of:
- Select the k-th largest item in a dictionary
- Sort the items in the dictionary
Data structure for a Book Index
- For each term / phrase store page number(s) where it is introduced
Compisite Hash Functions
- Example: hang some information over geo-locations - hash {long, lat} pairs
- $H({x, y}) = (h(x) + p \times h(y)) \mod M$

xiv. When Not to Use Hash Tables

In many applications, it is tempting to use unordered_set<> or unordered_map<>, but

Significant space overhead of nested containers
Every access computes hash function
$O(n)$ worst case time (STL implementation)

When keys are are small ints, use bucket arrays

Some keys can be coerced to small integers: enums, simple fractions, months
For static data, set membership, or loopkup
- Consider sorting + binary search
For key-value storage where traverals are needed but not lookup (ex: sparse vector)
- Store key-value pairs in a list (or vector)