EECS 281: Data Structures and Algorithms

Section 13 Hash Collision Resolution

University of Michigan at Ann Arbor

Last Edit Date: 05/30/2023

Disclaimer and Term of Use:

We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file

This personal note is adapted from Professor Marcus Darden and David Paoletti. Please contact us to delete this file if you think your rights have been violated.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Collision Resolution¶

Needed to handle insert, search, and removal, when multiple keys has to the same table index.

Methods of Collision Resolution

Separate Chaining
Linear Probing (open addressing method)
Quandratic Probing (open addressing method)
Double Hashing (open addressing method)

Separate Chaining¶

Resolve collisions by maintaining M linked lists, one for each table index

Separate chaining reduces the number of comparisons for sequential search by a factor of M (on average), using extra space for M links.
In a table with M lists (table indices) and N keys, the probability that the number of keys in each list is within a small constant factor of N/M is extremely close to 1, if the hash function is good

i. Load Factor ($\alpha$)

$\alpha = N / M$, where $N$ keys are placed in an M-sized table
Separate Chaining ($\alpha \le 0.5$ by default)
- $\alpha$ is average number of items per list
- $\alpha$ can be $> 1$
Open Addressing
- $\alpha$ is percentage of table indices occupied
- $\alpha$ must be $\le 1$

ii. Separate Chaining Complexity

Insert: constant time
- $O(1)$ if duplicate keys are allowed, $O(\alpha)$ is not
Search: time proportional to $\alpha$
- $O(\alpha)$
Remove: dependent upon search
- $O(\alpha)$

This is why we choose $M \approx N$: $O(\alpha) = O(1)$

iii. Speeding up the Worse Case

Use something other than linked lists

Use M vectors
- Could keep constants sorted, which is $O(\alpha)$ insert but $O(\log \alpha)$ search
Use M binary search trees, which is $O(\log \alpha)$ insert and search (as long as tree is not a stick)
- Large memory overhead

Open Addressing¶

Resolve collisions by using empty locations in the table

i. Probing

Check a location, by index, in the hash table
Used to search for data or find available locations
Possible probe outcomes (basic)
- Empty: no data found at probed location
- Hit: probe find occupied location that contains an item whose key matches the search key
- Full: probe finds occupied location but item key does not match key
If probe result is full, then probe table at "next higher" cell unit hit (search ends successfully) or empty (search ends unsuccessfully)

ii. Linear Probing

Hash search key to get initial table index
Probe buckets at sequential indices, starting at table index, until search key or an open address is found

Collision resolution

$t(\text{key}) \mod M \Rightarrow$ table index
If bucket at table index is full, then try
- $(t(\text{key}) + j) \mod M$ for $j = 1, 2, ...$

iii. Clusters

A contiguous group of occupied table cells
Consider a table taht is half-full (N / M / 2)
- What is the best-case distribution of half-full table?
  
  Best case situation: $0.5 \Rightarrow \Theta(1)$
- What is worst-case distribution of half-full table?
  
  Worst case situation: $0.5 \times 1 + 0.5 \times (N / 2) = 0.5 + N / 4 \Rightarrow \Theta(n)$

iv. Open Addressing Deletions

Option 1: remove it, re-hash rest of cluster
Option 2: use a deleted "item"
- Not a date item, not empty either
- We will call this "deleted"
"Deleted" Items

New probing outcome
- Deleted: probe finds a location that once held an item, but is not currently holding one
When a probe finds an item marked deleted
- Insert considers it an empty spot (miss) and thus usable (if the key does not exist elsewhere)
- Search considers it occupied (full) and keeps searching
Maintaining Bucket Status
- To determine whether a bucket is used or ununsed required a bool for each bucket
- When a third state is added: used, unused, or deleted
  - Option 1: Use two bools
  - Option 2: Declare an enum
  enum: A variable type that can only take on certain "enumerated" values

v. Counting Probes

When collisions are resolved with linear probing, the average number of probes required to search in a hash table of size M that contains $N = \alpha M$ keys is about:

$$\frac{1}{2} (1 + \frac{1}{1 - \alpha})~~\text{for hits}$$$$\frac{1}{2} (1 + \frac{1}{(1 - \alpha)^2})~~\text{for misses}$$

Effect of Load on Number of Probes Linear Probing

$\alpha$	Average Probes: Successful Search	Average Probes: Unsuccessful Search
0.1	1.1	1.1
0.2	1.1	1.3
0.3	1.2	1.5
0.4	1.3	1.9
0.5	1.5	2.5
0.6	1.8	3.6
0.7	2.2	6.1
0.8	3.0	13.0
0.9	5.5	50.5

vi. Quadratic Probing

Hash search key to get initial table index
Probe buckets at increasing "distance" from the table index, until search key or an open address is found

Collision resolution

$t(\text{key}) \mod \Rightarrow$ tabel index
If bucket at table index is full, then try
- $(t(\text{key}) + j^2) \mod M$ for $j = 1, 2, ...$

Counting Probes

When collisions are resolved with linear probing, the average number of probes required to search in a hash table of size M that contains $N = \alpha M$ keys is about:

$$1 - \frac{\alpha}{2} + \ln (\frac{1}{1 - \alpha})~~\text{for hits}$$$$\frac{1}{1 - \alpha} - \alpha + \ln (\frac{1}{1 - \alpha})~~\text{for misses}$$

Effect of Load on Number of Probes Linear Probing

$\alpha$	Average Probes: Successful Search	Average Probes: Unsuccessful Search
0.1	1.06	1.12
0.2	1.12	1.27
0.3	1.21	1.49
0.4	1.44	2.19
0.5	1.62	2.82
0.6	1.62	2.82
0.7	1.85	3.84
0.8	2.21	5.81
0.9	2.85	11.40

vii. Double Hashing

Hash search key to get initial table index
Use a second has function to calculate an interval used to probe buckets after the table index, until search key or an open address is found

Collision resolution

$t(\text{key}) \mod M \Rightarrow$ table index
If bucket at table index is full, then try
- $(t(\text{key}) + j \times t'(key)) \mod M$ for $j = 1, 2, 3, ...$
- $t'(\text{key}) = q - (t(\text{key}) \mod q)$ for some prime number $q < M$

Dynamic Hashing¶

i. Introduction

As the number of keys in a hash table increases, search performance degrades
Separate chaining
- Search time increases gradually
- Doubling number of keys means building (on average) list length at each of M table indices
Linear Probing
- Search time increases dramatically as table fills
- May reach point when no more keys can be inserted

ii. Simple Dynamic Hashing

Double the size of table when it "fill up"
Choose a load factor of 0.5
Each item must be rehashed into the table
Expensive, but infrequent

Amortized Analysis¶

Cannot guarantee that each and every operation will be fast, but can guarantee that average cost per operation will be low
Total cost is low, but performance profile is erratic
Most operations are extremely fast, but some operations require much more time to rebuild the table

Concept:

Each insert
- Pays (small constant) cost to actually insert
- Deposits other small constant ("balance") in a bank
First M / 2 - 1: build up "balance"
(M / 2)-th insertion
- Faced with a bug

Application:

Assume linear probing
Start with table of size M
Each insertion in a table $\le \frac{1}{2}$ full
- Average cost is $\le 2.5$ probes
Insert $M / 2 - 1$ keys
- $2.5 * (M / 2 - 1) = O(M)$
Insert $(M / 2)^{th}$ key
Build new table, size 2M
- Remove keys from old table, insert new
- Each insert $\le \frac{1}{4}$ full, with average cost less than 1.5 probes
- $1.5 * M / 2 = O(M)$
$O(M) + O(M) = O(M)$
- Linear time to insert $M / 2$ keys, but last one is at a higher cost than the previous inserts.