EECS 281: Data Structures and Algorithms

Section 21 Strings and Sequences

University of Michigan at Ann Arbor

Last Edit Date: 06/15/2023

Disclaimer and Term of Use:

We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file

This personal note is adapted from Professor Marcus Darden and David Paoletti. Please contact us to delete this file if you think your rights have been violated.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Why Study String Algorithms?¶

Strings are character sequences
- Chracters taken from an "alphabet"
- Algorithms on string are array / sequence algorithms
What makes those arrays / sequences special?
- Typical tasks to solve, defined by typical applications
Applications of interest
- Human-readable text (ASCII, Unicode)
- Names and labels (people, files, license plates, etc)
- DNA analysis

Working with Strings / Sequences¶

Given two strings / sequences
- Are they the equal? (==)
- Which would go first in a dictionary? (<=)
- What do they have in common? (substrings)
- Find where the strings differ
Given many strings, order them in the lexicographic (dictionary) order
- Ordering helps with search (fast look-up)

Working with Strings and Texts¶

Given strings p and s
- "Needle" and "haystack"
Find instances of p in s: first, last next, all
- Variant: preprocess s before p is known, optimiza search time for many independent queries (p)
Text: a string with a separator character, which delimits words
- Given a word (or a phrase), find instances in text
A texr corpus: a collection of documents, each containing a text (+ title and other attributes, such as URL)
- Given a search string consisting of 1+ words, find all documents containing those words
- Exact matches (name, ID numbers)
- Approximate matches (misspelled)
- A huge industry is built on this
Given project submissions
- Find pairs of similar programs
Text compression
- Find repeated words
- Replace them by numbers

DNA and Genomic Sequences¶

DNA structure
- Defined by sequences of nucleotide base-pairs
- Alphabet: A, C, G, T
- In people: 23 $\times$ 2 chromosomes, 3B base-pairs (characters)
- Genes == subsequences
String algorithms for
- Comparing chromosomes
- Looking up genes
- Assembling long DNA strands from short sequences

Individual strings: C vs C++

	Null-terminated	Object-oriented
Overhead	1 ptr + 1 char	2+ ptrs: begin / end
Complexity of `.size()`	$O(n)$	$O(1)$
Alphabet	ASCII	Configurable
Operations, algorithms	pointer arithmetics, stdlibc functions	methods, operators, stdlibc++ functions, STL

Sequences (iterator ranges)
- C++ strings support begin() and end()
Specialized containers for multiple strings
- Distionaries: add, remove, look up words
- Strings with shared fragments (many words starting with "anti-", "pre-", "over-", "semi-", etc.)

Sequence Equality in STL¶

std::equal

Returns true if the range [first1, last1) is equal to the range [first2, first2 + (last1 - first1)), and false otherwise
Two ranges are considered equal if for every iterator i in the range [first1, last1), *i == *(first2 + (i - first1))

A std::equal() Implementation

 1     template<class InputIt1, class InputIt2>
 2     bool equal(InputIt1 first1, InputIt1 last1, InputIt2 first2) {
 3         for (; first != last1; ++first1, ++first2) {
 4             if (!(*first1 == *first2)) {
 5                 return false;
 6             }
 7             return true;
 8         }
 9     }

Example: Usage of std::equal on sequence

 1     #include <algorithm>
 2     #include <iostream>
 3     #include <string>
 4
 5     bool is_palidrome(const std::string& s) {
 6         return std::equal(begin(s), begin(s) + s.size() / 2, rbegin(s)); 
 7     }
 8
 9     void test(const std::string &s) {
10         std::cout << '"' << s << '"'
11           << (is_palidrome(s) ? " is" : " is not") << " a palindrome\n"
12     }
13
14     int main() {
15         test("radar");
16         test("hello");
17     }

Lexicographic Comparison¶

i. Rules of Lex-compare

Two ranges are compared element by element
The first mismatching element defines which range is lexicographically less or greater than the other
If one range is a prefix of another, the shorter range is lexicographically less than the other
If two ranges have equivalent elements and are of the same length, then the ranges are lexicographically equal
An empty range is lexicographically less than any non-empty range
Two empty ranges are lexicographically equal

ii. Dictionary Order

"" < "a" < "ab" < "b" < "ba" < "bc" < "bc0"
Overloading comparison operators in C++
- Do not implement all 6 as independent operators
- Suffices to implement < and == (STL uses these)
- (a != b) is the same as !(a == b)
- (a > b) is the same as (b < a)
- (a <= b) is the same as !(b < a)
- (a >= b) is the same as !(a < b)
All other comparisons use 1 operator (plus !)
- Bad idea: implementing <= using < and ==

iii. Applied Strings

Common implementation trick for <class T>
- int compareHelper(const T& x, const T& y)
- Returns 0 for x == y, -1 for x < y, 1 for x > y
- Comparison operators are implemented by post-processing
Optimizations for strings
- Short strings (< 16B) do not need dynamic memory
- In C, null-termination simplifies operator<: "abc" < "abcd"
- In c++, strings of different size are !=
- Fertile ground for vectorized CPU instructions

iv. Three-way String Compare

 1     int compareHelper(const string& s0, const string& s1) {
 2         const size_t len0 = s0.length(), len1 = s1.length();
 3         for (size_t i = 0; i != std::min(len0, len1); i++) {
 4             if (s0[i] < s1[i])
 5                 return -1;
 6             if (s0[i] > s1[i])
 7                 return 1;
 8         }
 9         if (len0 < len1)
10             return -1;
11         if (len0 > len1)
12             return 1;
13         return 0;
14     }

Runtime analysis

$O(n)$ worst case time, $O(1)$ extra space (cannot do better)
Average case complexity for random string is $O(1)$ time: only need th efirst few characters
Strings are often not random

v. Lex-comparisons in STL

Implementation

 1     template<class ForIt1, class ForIt2>
 2     bool lexicographic_compare(ForIt1 first1, ForIt1 last1,
 3                                ForIt2 first2, ForIt2 last2) {
 4          while ((first1 != last1) && (first2 != last2)) {
 5              if (*first1 < *first2)
 6                  return true;
 7              if (*first2 < *first1)
 8                  return false;
 9              ++first1, ++first2;
10          }
11          return (first1 == last1) && (first2 != last2);
12     }

vi. Application: Rmoving Duplicates

Given: a container of string objects
Need: leave only one copy of each string
- Return all strings, if no duplicates present
Sort given strings
- Use STL std::sort() from STL with operator()
- Duplicates will be next to each other
Compare neighboring strings using operator==()
Copy only the first of duplicate strings
- Use std::unique() from STL with operator==()

Searching and String Fingerprints¶

Use string::find to approach.

i. Brute-force String Searching

Two nested loope - sliding window

Worst case time: $O(\text{len_needle * len_haystack})$
- Finding "aaaaab" in "aaaaaaaaaaaaaaaaaaaaaaaba"
Average case: $O(\text{len_haystack})$
- The inner loop performs $O(1)$ iterations
STL implementations often choose this algorithm
- Simple, learn implementation
- Usually fast in practice: $O(\text{len_haystack})$ time
However
- $O(\text{len_needle * len_haystack})$ worst case time
  - Worst cases do appear in realistic strings
- $O(\text{len_needle * len_haystack})$ worst case time possible with pre-processing (not as fast on most strings) ex: the Knuth-Morris-Pratt (KMP) algorithm

Better Than Brute-Force

Idea: Speed up the inner loop for known worst cases
- Perform most eq-comparisons for strings in $O(1)$ time
- $O(\text{len_needle * len_haystack})$ becomes $O(\text{len_haystack})$
Worst case complexity remains similar, but worst case inputs are not as obvious and rare in practice
How do we speed up eq-comparisons?

ii. String Fingerprints

For each string, compute a number (int)
- When fingerprints differ, strings must differ
- When fringerprints match, strings rarely differ
- The actual numbers do not mean anything
- Many different fingerprint functions exist
Ex: simple, but poor fingerprint function F()
- Replace each character by its ASCII code
- Add up all codes

F("tom marvolo riddle ") == F("i am lord voldemort")

Rabin Fingerprint

Instead of adding up character codes, view strings as decimal numbers
Shuffling the chars usually changes results

Sliding Rabin Fingerprint

Calculate fingerprint fp for first m characters
Removing a character on the left takes $O(1)$ time
- Precompute / store value of $10^{m - 1}$ once as p
- Subtract left-most character multiplied by p from fp
Adding a character on the right takes $O(1)$ time
- Multiply fp by 10 (for illustration only)
- Add ASCII code of new right-most character
Initial caluclation of fp takes $O(m)$ time
Each of $n - m$ slides takes $O(1)$ time

Rabin-Karp String Search

Compute the FP of the needle in linear time
Check if the haystack starts with the needle

window = prefix_of_the_haystack
If (FP(window) != FP(needle)) go Step 3
If (window == needle), then done

Incrementally update the FP of window

Remove one character on the left
Add one new character on the right

If (FP(window) == FP(needle)), if (window == needle), thne done
If more chars remain in haystack, go to Step 3

Avoiding Overflows

For long strings, performing +, * operations wil result in overflows
A common trick when constructing a fingerprint
- Replace conventional arithmetic (+, *)
- Pick some "largish prime" (ex: 3355439)
- (x * y) becomes (x * y) % prime, (x + y) becomes (x + y) % prime
- Important: conventional arithmetic rules still carry over

Rabin-Karp Time Complexity

If different substrings never produce equal FPs, runtime is $O(\text{len_haystack})$
Every FP collision incurs $O(\text{len_needle})$ time to perform equality check of substrings
In practice, collisions are very rare and do not correspond to meaningful pairs of strings
- It helps to choose larger primes
Choose base = $2^k$
- Coprime with the original prime number
- Multiplication by base simplifies to a binary shift (faster)

Code

 1     int rkSearch(const string &needle, const string &haystack, size_t prime) {
 2          constexpr int base = 128;
 3          const size_t N = needle.length(), H = haystack.length();
 4          const size_t z = static_cast<size_t>(pow(base, N - 1)) % prime;
 5          int n = 0, h = 0;
 6          for (size_t i = 0; i < N; ++i) {
 7               n = (base * n + needle[i]) % prime; // calculate needle fingerprint
 8               h = (base * h + haystack[i]) % prime; // calc. window fingerprint
 9          } // for
10          for (size_t i = 0; i <= H - N; ++i) {
11               if (n == h) { // check needle fp vs. current window
12                    for (size_t j = 0; j < N; ++j) if (haystack[i + j] != needle[j]) break;
13                    if (j == N) return i;
14               } // if
15               if (i < H - N) { // slide window
16                    h = ((h - haystack[i] * z) % prime * base + haystack[i + N]) % prime;
17                    if (h < 0) h = (h + prime);
18               } // if
19          } // for
20          return -1; // needle not found
21      } // rkSearch()