Recently, I was interviewing for a software engineering position, and during one of the interviews, one of the interviewers asked me the following question (paraphrasing):
Given a substring “s” and a string “str”, write an algorithm to find the first occurrence of “s” in “str”.
At first glance, this looks like an easy question. In fact it’s so easy that this can be done in one line using Python’s string find function. However, once you clarify the implicit expectations with the interviewer, you find that, 1) You are not allowed to use any library functions such as Python’s string find, 2) the algorithm must be efficient, and 3) the function returns the index of the first occurrence of
str or -1 otherwise. These are all simple and fair requirements.
What is a substring anyways?
A substring of a string is another string that occurs in that string. For example, the string “together”, “to” is a substring of “together”. “get” is another substring of “together”. Another example is “foot” and “football”. “foot” is a substring of “football”.
Feeling pressured by the interviewer watching over me and the pressure of limited time, having not seen this problem before, I quickly devised and implemented a brute-force algorithm that uses two nested loops. The first loop would iterate through
str and the second loop checks if
s is in
str starting from that position of
str. I tested the function and it worked as expected. The time complexity for this algorithm is O(n^2). The
interviewer asked for a more efficient algorithm. I could not think of any other algorithms at the top of my head, and needless to say, I did not move to the next round.
After the interview, Googling this question, I found that there a few algorithms that solve exact problem. I assume the interviewer was expecting an answer using one of those algorithms, and I did not provide one, so I did not move to the next round. Anyways, one of those string searching algorithms is called the Rabin-Karp algorithm. This is an elegant algorithm that is very easy to understand.
What is Rabin-Karp Algorithm
Simply stated, the Rabin-Karp algorithm generates a hash of the substring(pattern) and checks the rolling hash of the string for a match.
- It is a string matching algorithm based on hashing
- It calculates hash value for the substring, and for each M-characters of the string.
- If the hash values are equal, then the algorithm compares the substring and M-characters of the string.
- If the hash values are not equal, then it calculates the hash value (i.e. the rolling hash) of the next M-characters.
At the end, there’s only one comparison per text subsequence, and character matching is only needed when the hash values match.
This is a simple additive hash function that adds the positional values of each character. Additive hash function makes it easy to remove the hash of the first character.
The multiplyer keeps track of the size of the substring. It’s equivalent to
pow(base, (sub.length() -1)).
str_hash -= str[i - sub.length()] * multiplyer; removes the first character from the string hash.
str_hash = str_hash * base + str[i]; Adds the next character to the hash
What’s the Time Complexity?
With a good hash function, this algorithm runs in linear time. Since each character of the string is checked at most once, the “big-O” complexity is O(m + n) where m is the size of the substring and n is the size of the string.