Suffix Array Implementation in C++

A Suffix Array is a data structure that provides a sorted array of all suffixes of a given string. It is widely used for efficient substring searches, as it allows quick lookup to find all occurrences of a substring in the string. The Suffix Array is often combined with other data structures, such as the Longest Common Prefix (LCP) array, to further optimize substring search operations.

Program Structure

The C++ implementation of the Suffix Array will include:

  1. Suffix Array Construction: Building the suffix array by generating all suffixes of a given string, sorting them, and storing their starting indices.
  2. Substring Search: Implementing binary search on the suffix array to efficiently find the occurrences of a substring.

Code Implementation


#include <iostream>
#include <vector>
#include <string>
#include <algorithm>

class SuffixArray {
private:
    std::string text;
    std::vector<int> suffixArray;

public:
    // Constructor to build the suffix array
    SuffixArray(const std::string &text) : text(text) {
        buildSuffixArray();
    }

    // Function to build the suffix array
    void buildSuffixArray() {
        int n = text.length();
        suffixArray.resize(n);

        // Generate suffixes and their starting indices
        std::vector<std::pair<std::string, int>> suffixes(n);
        for (int i = 0; i < n; ++i) {
            suffixes[i] = {text.substr(i), i};
        }

        // Sort suffixes lexicographically
        std::sort(suffixes.begin(), suffixes.end());

        // Store the sorted indices in the suffix array
        for (int i = 0; i < n; ++i) {
            suffixArray[i] = suffixes[i].second;
        }
    }

    // Function to perform binary search on the suffix array
    std::vector<int> search(const std::string &pattern) const {
        std::vector<int> result;
        int left = 0, right = suffixArray.size() - 1;
        int n = pattern.length();

        // Binary search for the pattern in the suffix array
        while (left <= right) {
            int mid = left + (right - left) / 2;
            std::string suffix = text.substr(suffixArray[mid], n);

            if (suffix == pattern) {
                // Find all occurrences of the pattern
                int index = mid;
                while (index >= 0 && text.substr(suffixArray[index], n) == pattern) {
                    result.push_back(suffixArray[index]);
                    --index;
                }
                index = mid + 1;
                while (index < suffixArray.size() && text.substr(suffixArray[index], n) == pattern) {
                    result.push_back(suffixArray[index]);
                    ++index;
                }
                break;
            } else if (suffix < pattern) {
                left = mid + 1;
            } else {
                right = mid - 1;
            }
        }

        std::sort(result.begin(), result.end());
        return result;
    }

    // Function to print the suffix array
    void printSuffixArray() const {
        for (int i : suffixArray) {
            std::cout << i << ": " << text.substr(i) << std::endl;
        }
    }
};

// Main function to demonstrate the Suffix Array operations
int main() {
    std::string text = "banana";
    SuffixArray sa(text);

    std::cout << "Suffix Array:" << std::endl;
    sa.printSuffixArray();

    std::string pattern = "ana";
    std::vector<int> result = sa.search(pattern);

    std::cout << "Occurrences of '" << pattern << "':" << std::endl;
    for (int i : result) {
        std::cout << i << std::endl;
    }

    return 0;
}

Explanation

Suffix Array Construction

The constructor of the SuffixArray class takes a string and builds the suffix array. This involves generating all suffixes of the string, sorting them lexicographically, and storing their starting indices in the suffixArray vector.

For example, for the string "banana", the suffixes are:

  • "banana"
  • "anana"
  • "nana"
  • "ana"
  • "na"
  • "a"

After sorting, the suffix array contains the starting indices of these sorted suffixes.

Substring Search

The search() function performs a binary search on the suffix array to find all occurrences of a given substring (pattern) in the text. It compares the substring with the prefixes of the suffixes stored in the suffix array. If a match is found, the function gathers all occurrences by checking adjacent entries in the suffix array.

Printing the Suffix Array

The printSuffixArray() function prints the indices of the suffix array along with the corresponding suffixes of the string. This is useful for visualizing the structure and understanding the sorted order of the suffixes.

Conclusion

The Suffix Array is a powerful data structure for substring searching, offering efficient query times while maintaining a relatively simple structure. This C++ implementation provides a basic but effective way to build and utilize a Suffix Array for substring search in a given string.

 

By Aditya Bhuyan

I work as a cloud specialist. In addition to being an architect and SRE specialist, I work as a cloud engineer and developer. I have assisted my clients in converting their antiquated programmes into contemporary microservices that operate on various cloud computing platforms such as AWS, GCP, Azure, or VMware Tanzu, as well as orchestration systems such as Docker Swarm or Kubernetes. For over twenty years, I have been employed in the IT sector as a Java developer, J2EE architect, scrum master, and instructor. I write about Cloud Native and Cloud often. Bangalore, India is where my family and I call home. I maintain my physical and mental fitness by doing a lot of yoga and meditation.

Leave a Reply

Your email address will not be published. Required fields are marked *

error

Enjoy this blog? Please spread the word :)