Content
Why Study String Algorithms? Working with Strings / Sequences Working with Strings and Texts DNA and Genomic Sequences String-related Data Structures Sequence Equality in STL Lexicographic Comparison Searching and String Fingerprints <- Go BackUniversity of Michigan at Ann Arbor
Last Edit Date: 06/15/2023
Disclaimer and Term of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file
This personal note is adapted from Professor Marcus Darden and David Paoletti. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Strings are character sequences
Chracters taken from an "alphabet"
Algorithms on string are array / sequence algorithms
What makes those arrays / sequences special?
Applications of interest
Human-readable text (ASCII, Unicode)
Names and labels (people, files, license plates, etc)
DNA analysis
Given two strings / sequences
Are they the equal? (==
)
Which would go first in a dictionary? (<=
)
What do they have in common? (substrings)
Find where the strings differ
Given many strings, order them in the lexicographic (dictionary) order
Given strings p and s
Find instances of p in s: first, last next, all
Text: a string with a separator character, which delimits words
A texr corpus: a collection of documents, each containing a text (+ title and other attributes, such as URL)
Given a search string consisting of 1+ words, find all documents containing those words
Exact matches (name, ID numbers)
Approximate matches (misspelled)
A huge industry is built on this
Given project submissions
Text compression
Find repeated words
Replace them by numbers
DNA structure
Defined by sequences of nucleotide base-pairs
Alphabet: A, C, G, T
In people: 23 $\times$ 2 chromosomes, 3B base-pairs (characters)
Genes == subsequences
String algorithms for
Comparing chromosomes
Looking up genes
Assembling long DNA strands from short sequences
Null-terminated | Object-oriented | |
---|---|---|
Overhead | 1 ptr + 1 char | 2+ ptrs: begin / end |
Complexity of .size() |
$O(n)$ | $O(1)$ |
Alphabet | ASCII | Configurable |
Operations, algorithms | pointer arithmetics, stdlibc functions | methods, operators, stdlibc++ functions, STL |
Sequences (iterator ranges)
begin()
and end()
Specialized containers for multiple strings
Distionaries: add, remove, look up words
Strings with shared fragments (many words starting with "anti-", "pre-", "over-", "semi-", etc.)
std::equal
Returns true if the range [first1, last1) is equal to the range [first2, first2 + (last1 - first1)), and false
otherwise
Two ranges are considered equal if for every iterator i in the range [first1, last1), *i == *(first2 + (i - first1))
A std::equal()
Implementation
1 template<class InputIt1, class InputIt2>
2 bool equal(InputIt1 first1, InputIt1 last1, InputIt2 first2) {
3 for (; first != last1; ++first1, ++first2) {
4 if (!(*first1 == *first2)) {
5 return false;
6 }
7 return true;
8 }
9 }
Example: Usage of std::equal
on sequence
1 #include <algorithm>
2 #include <iostream>
3 #include <string>
4
5 bool is_palidrome(const std::string& s) {
6 return std::equal(begin(s), begin(s) + s.size() / 2, rbegin(s));
7 }
8
9 void test(const std::string &s) {
10 std::cout << '"' << s << '"'
11 << (is_palidrome(s) ? " is" : " is not") << " a palindrome\n"
12 }
13
14 int main() {
15 test("radar");
16 test("hello");
17 }
i. Rules of Lex-compare
Two ranges are compared element by element
The first mismatching element defines which range is lexicographically less or greater than the other
If one range is a prefix of another, the shorter range is lexicographically less than the other
If two ranges have equivalent elements and are of the same length, then the ranges are lexicographically equal
An empty range is lexicographically less than any non-empty range
Two empty ranges are lexicographically equal
ii. Dictionary Order
"" < "a" < "ab" < "b" < "ba" < "bc" < "bc0"
Overloading comparison operators in C++
Do not implement all 6 as independent operators
Suffices to implement <
and ==
(STL uses these)
(a != b)
is the same as !(a == b)
(a > b)
is the same as (b < a)
(a <= b)
is the same as !(b < a)
(a >= b)
is the same as !(a < b)
All other comparisons use 1 operator (plus !
)
<=
using <
and ==
iii. Applied Strings
Common implementation trick for <class T>
int compareHelper(const T& x, const T& y)
Returns 0 for x == y
, -1 for x < y
, 1 for x > y
Comparison operators are implemented by post-processing
Optimizations for strings
Short strings (< 16B) do not need dynamic memory
In C, null-termination simplifies operator<
: "abc" < "abcd"
In c++, strings of different size are !=
Fertile ground for vectorized CPU instructions
iv. Three-way String Compare
1 int compareHelper(const string& s0, const string& s1) {
2 const size_t len0 = s0.length(), len1 = s1.length();
3 for (size_t i = 0; i != std::min(len0, len1); i++) {
4 if (s0[i] < s1[i])
5 return -1;
6 if (s0[i] > s1[i])
7 return 1;
8 }
9 if (len0 < len1)
10 return -1;
11 if (len0 > len1)
12 return 1;
13 return 0;
14 }
Runtime analysis
$O(n)$ worst case time, $O(1)$ extra space (cannot do better)
Average case complexity for random string is $O(1)$ time: only need th efirst few characters
Strings are often not random
v. Lex-comparisons in STL
Implementation
1 template<class ForIt1, class ForIt2>
2 bool lexicographic_compare(ForIt1 first1, ForIt1 last1,
3 ForIt2 first2, ForIt2 last2) {
4 while ((first1 != last1) && (first2 != last2)) {
5 if (*first1 < *first2)
6 return true;
7 if (*first2 < *first1)
8 return false;
9 ++first1, ++first2;
10 }
11 return (first1 == last1) && (first2 != last2);
12 }
vi. Application: Rmoving Duplicates
Given: a container of string objects
Need: leave only one copy of each string
Sort given strings
Use STL std::sort()
from STL with operator()
Duplicates will be next to each other
Compare neighboring strings using operator==()
Copy only the first of duplicate strings
std::unique()
from STL with operator==()
Use string::find
to approach.
i. Brute-force String Searching
Worst case time: $O(\text{len_needle * len_haystack})$
Average case: $O(\text{len_haystack})$
STL implementations often choose this algorithm
Simple, learn implementation
Usually fast in practice: $O(\text{len_haystack})$ time
However
$O(\text{len_needle * len_haystack})$ worst case time
$O(\text{len_needle * len_haystack})$ worst case time possible with pre-processing (not as fast on most strings) ex: the Knuth-Morris-Pratt (KMP) algorithm
Better Than Brute-Force
Idea: Speed up the inner loop for known worst cases
Perform most eq-comparisons for strings in $O(1)$ time
$O(\text{len_needle * len_haystack})$ becomes $O(\text{len_haystack})$
Worst case complexity remains similar, but worst case inputs are not as obvious and rare in practice
How do we speed up eq-comparisons?
ii. String Fingerprints
For each string, compute a number (int
)
When fingerprints differ, strings must differ
When fringerprints match, strings rarely differ
The actual numbers do not mean anything
Many different fingerprint functions exist
Ex: simple, but poor fingerprint function F()
Replace each character by its ASCII code
Add up all codes
F("tom marvolo riddle ") == F("i am lord voldemort")
Rabin Fingerprint
Instead of adding up character codes, view strings as decimal numbers
Shuffling the chars usually changes results
Sliding Rabin Fingerprint
Calculate fingerprint fp for first m characters
Removing a character on the left takes $O(1)$ time
Precompute / store value of $10^{m - 1}$ once as p
Subtract left-most character multiplied by p from fp
Adding a character on the right takes $O(1)$ time
Multiply fp by 10 (for illustration only)
Add ASCII code of new right-most character
Initial caluclation of fp takes $O(m)$ time
Each of $n - m$ slides takes $O(1)$ time
Rabin-Karp String Search
Compute the FP of the needle in linear time
Check if the haystack starts with the needle
window = prefix_of_the_haystack
If (FP(window) != FP(needle))
go Step 3
If (window == needle)
, then done
Remove one character on the left
Add one new character on the right
If (FP(window) == FP(needle))
, if (window == needle)
, thne done
If more chars remain in haystack, go to Step 3
Avoiding Overflows
For long strings, performing +
, *
operations wil result in overflows
A common trick when constructing a fingerprint
Replace conventional arithmetic (+
, *
)
Pick some "largish prime" (ex: 3355439)
(x * y)
becomes (x * y) % prime
, (x + y)
becomes (x + y) % prime
Important: conventional arithmetic rules still carry over
Rabin-Karp Time Complexity
If different substrings never produce equal FPs, runtime is $O(\text{len_haystack})$
Every FP collision incurs $O(\text{len_needle})$ time to perform equality check of substrings
In practice, collisions are very rare and do not correspond to meaningful pairs of strings
Choose base = $2^k$
Coprime with the original prime number
Multiplication by base simplifies to a binary shift (faster)
Code
1 int rkSearch(const string &needle, const string &haystack, size_t prime) {
2 constexpr int base = 128;
3 const size_t N = needle.length(), H = haystack.length();
4 const size_t z = static_cast<size_t>(pow(base, N - 1)) % prime;
5 int n = 0, h = 0;
6 for (size_t i = 0; i < N; ++i) {
7 n = (base * n + needle[i]) % prime; // calculate needle fingerprint
8 h = (base * h + haystack[i]) % prime; // calc. window fingerprint
9 } // for
10 for (size_t i = 0; i <= H - N; ++i) {
11 if (n == h) { // check needle fp vs. current window
12 for (size_t j = 0; j < N; ++j) if (haystack[i + j] != needle[j]) break;
13 if (j == N) return i;
14 } // if
15 if (i < H - N) { // slide window
16 h = ((h - haystack[i] * z) % prime * base + haystack[i + N]) % prime;
17 if (h < 0) h = (h + prime);
18 } // if
19 } // for
20 return -1; // needle not found
21 } // rkSearch()