Suffix Array Implementation in C

A suffix array is a powerful data structure that is used for efficient substring searching in a given string. It is an array of all the suffixes of a string sorted in lexicographical order. Once built, a suffix array allows for quick searches of substrings within the original string.

Program Structure

This implementation of a suffix array in C will include the following components:

  1. Struct definition for holding suffixes and their indices.
  2. A comparison function for sorting suffixes.
  3. Functions for constructing the suffix array.
  4. A binary search function to search for a substring using the suffix array.
  5. A main function to demonstrate the usage of the suffix array for substring search.

Code Implementation

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Structure to store a suffix and its index
typedef struct {
    int index;     // Original index of the suffix in the string
    char *suffix;  // The suffix itself
} Suffix;

// Comparison function to sort suffixes lexicographically
int compareSuffixes(const void *a, const void *b) {
    Suffix *suffixA = (Suffix*)a;
    Suffix *suffixB = (Suffix*)b;
    return strcmp(suffixA->suffix, suffixB->suffix);
}

// Function to build the suffix array
int* buildSuffixArray(char *text, int n) {
    Suffix *suffixes = (Suffix*)malloc(n * sizeof(Suffix));
    int *suffixArray = (int*)malloc(n * sizeof(int));

    // Create suffixes and their indices
    for (int i = 0; i < n; i++) {
        suffixes[i].index = i;
        suffixes[i].suffix = text + i;
    }

    // Sort the suffixes using the comparison function
    qsort(suffixes, n, sizeof(Suffix), compareSuffixes);

    // Store the sorted suffix indices in the suffix array
    for (int i = 0; i < n; i++) {
        suffixArray[i] = suffixes[i].index;
    }

    // Free the suffixes array
    free(suffixes);

    return suffixArray;
}

// Function to perform binary search for a pattern in the suffix array
int searchSuffixArray(char *text, int *suffixArray, int n, char *pattern) {
    int m = strlen(pattern);
    int low = 0, high = n - 1;

    while (low <= high) {
        int mid = (low + high) / 2;
        int res = strncmp(text + suffixArray[mid], pattern, m);

        if (res == 0) {
            return suffixArray[mid];  // Pattern found
        } else if (res < 0) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }

    return -1;  // Pattern not found
}

// Main function to demonstrate the usage of the suffix array
int main() {
    char text[] = "banana";
    int n = strlen(text);

    // Build the suffix array
    int *suffixArray = buildSuffixArray(text, n);

    // Print the suffix array
    printf("Suffix Array:\n");
    for (int i = 0; i < n; i++) {
        printf("%d: %s\n", suffixArray[i], text + suffixArray[i]);
    }

    // Search for a pattern in the text
    char pattern[] = "nan";
    int result = searchSuffixArray(text, suffixArray, n, pattern);

    if (result != -1) {
        printf("Pattern '%s' found at index %d\n", pattern, result);
    } else {
        printf("Pattern '%s' not found\n", pattern);
    }

    // Free allocated memory
    free(suffixArray);

    return 0;
}

Explanation

The program begins by defining a struct Suffix that contains an integer index to store the original position of the suffix in the text, and a pointer suffix that points to the start of the suffix in the string.

The compareSuffixes() function is a comparison function used by the qsort() function to sort the suffixes lexicographically. This is necessary because a suffix array is an array of suffixes sorted in lexicographical order.

The buildSuffixArray() function generates the suffix array for a given text. It creates all possible suffixes of the text, stores them along with their indices, sorts them using qsort(), and finally stores the indices of these sorted suffixes in the suffix array.

Once the suffix array is built, the searchSuffixArray() function allows for efficient searching of a substring within the text. It performs a binary search on the suffix array to locate the pattern. If the pattern is found, the function returns the starting index of the pattern in the text; otherwise, it returns -1.

In the main() function, we demonstrate the creation of a suffix array for the string “banana” and search for the substring “nan”. The suffix array is printed out, and the result of the substring search is displayed.

 

By Aditya Bhuyan

I work as a cloud specialist. In addition to being an architect and SRE specialist, I work as a cloud engineer and developer. I have assisted my clients in converting their antiquated programmes into contemporary microservices that operate on various cloud computing platforms such as AWS, GCP, Azure, or VMware Tanzu, as well as orchestration systems such as Docker Swarm or Kubernetes. For over twenty years, I have been employed in the IT sector as a Java developer, J2EE architect, scrum master, and instructor. I write about Cloud Native and Cloud often. Bangalore, India is where my family and I call home. I maintain my physical and mental fitness by doing a lot of yoga and meditation.

Leave a Reply

Your email address will not be published. Required fields are marked *

error

Enjoy this blog? Please spread the word :)