Suffix Array Implementation in C++
A Suffix Array is a data structure that provides a sorted array of all suffixes of a given string. It is widely used for efficient substring searches, as it allows quick lookup to find all occurrences of a substring in the string. The Suffix Array is often combined with other data structures, such as the Longest Common Prefix (LCP) array, to further optimize substring search operations.
Program Structure
The C++ implementation of the Suffix Array will include:
- Suffix Array Construction: Building the suffix array by generating all suffixes of a given string, sorting them, and storing their starting indices.
- Substring Search: Implementing binary search on the suffix array to efficiently find the occurrences of a substring.
Code Implementation
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
class SuffixArray {
private:
std::string text;
std::vector<int> suffixArray;
public:
// Constructor to build the suffix array
SuffixArray(const std::string &text) : text(text) {
buildSuffixArray();
}
// Function to build the suffix array
void buildSuffixArray() {
int n = text.length();
suffixArray.resize(n);
// Generate suffixes and their starting indices
std::vector<std::pair<std::string, int>> suffixes(n);
for (int i = 0; i < n; ++i) {
suffixes[i] = {text.substr(i), i};
}
// Sort suffixes lexicographically
std::sort(suffixes.begin(), suffixes.end());
// Store the sorted indices in the suffix array
for (int i = 0; i < n; ++i) {
suffixArray[i] = suffixes[i].second;
}
}
// Function to perform binary search on the suffix array
std::vector<int> search(const std::string &pattern) const {
std::vector<int> result;
int left = 0, right = suffixArray.size() - 1;
int n = pattern.length();
// Binary search for the pattern in the suffix array
while (left <= right) {
int mid = left + (right - left) / 2;
std::string suffix = text.substr(suffixArray[mid], n);
if (suffix == pattern) {
// Find all occurrences of the pattern
int index = mid;
while (index >= 0 && text.substr(suffixArray[index], n) == pattern) {
result.push_back(suffixArray[index]);
--index;
}
index = mid + 1;
while (index < suffixArray.size() && text.substr(suffixArray[index], n) == pattern) {
result.push_back(suffixArray[index]);
++index;
}
break;
} else if (suffix < pattern) {
left = mid + 1;
} else {
right = mid - 1;
}
}
std::sort(result.begin(), result.end());
return result;
}
// Function to print the suffix array
void printSuffixArray() const {
for (int i : suffixArray) {
std::cout << i << ": " << text.substr(i) << std::endl;
}
}
};
// Main function to demonstrate the Suffix Array operations
int main() {
std::string text = "banana";
SuffixArray sa(text);
std::cout << "Suffix Array:" << std::endl;
sa.printSuffixArray();
std::string pattern = "ana";
std::vector<int> result = sa.search(pattern);
std::cout << "Occurrences of '" << pattern << "':" << std::endl;
for (int i : result) {
std::cout << i << std::endl;
}
return 0;
}
Explanation
Suffix Array Construction
The constructor of the SuffixArray
class takes a string and builds the suffix array. This involves generating all suffixes of the string, sorting them lexicographically, and storing their starting indices in the suffixArray
vector.
For example, for the string "banana"
, the suffixes are:
"banana"
"anana"
"nana"
"ana"
"na"
"a"
After sorting, the suffix array contains the starting indices of these sorted suffixes.
Substring Search
The search()
function performs a binary search on the suffix array to find all occurrences of a given substring (pattern) in the text. It compares the substring with the prefixes of the suffixes stored in the suffix array. If a match is found, the function gathers all occurrences by checking adjacent entries in the suffix array.
Printing the Suffix Array
The printSuffixArray()
function prints the indices of the suffix array along with the corresponding suffixes of the string. This is useful for visualizing the structure and understanding the sorted order of the suffixes.
Conclusion
The Suffix Array is a powerful data structure for substring searching, offering efficient query times while maintaining a relatively simple structure. This C++ implementation provides a basic but effective way to build and utilize a Suffix Array for substring search in a given string.