Suffix Array Implementation in C
A suffix array is a powerful data structure that is used for efficient substring searching in a given string. It is an array of all the suffixes of a string sorted in lexicographical order. Once built, a suffix array allows for quick searches of substrings within the original string.
Program Structure
This implementation of a suffix array in C will include the following components:
- Struct definition for holding suffixes and their indices.
- A comparison function for sorting suffixes.
- Functions for constructing the suffix array.
- A binary search function to search for a substring using the suffix array.
- A main function to demonstrate the usage of the suffix array for substring search.
Code Implementation
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// Structure to store a suffix and its index
typedef struct {
int index; // Original index of the suffix in the string
char *suffix; // The suffix itself
} Suffix;
// Comparison function to sort suffixes lexicographically
int compareSuffixes(const void *a, const void *b) {
Suffix *suffixA = (Suffix*)a;
Suffix *suffixB = (Suffix*)b;
return strcmp(suffixA->suffix, suffixB->suffix);
}
// Function to build the suffix array
int* buildSuffixArray(char *text, int n) {
Suffix *suffixes = (Suffix*)malloc(n * sizeof(Suffix));
int *suffixArray = (int*)malloc(n * sizeof(int));
// Create suffixes and their indices
for (int i = 0; i < n; i++) {
suffixes[i].index = i;
suffixes[i].suffix = text + i;
}
// Sort the suffixes using the comparison function
qsort(suffixes, n, sizeof(Suffix), compareSuffixes);
// Store the sorted suffix indices in the suffix array
for (int i = 0; i < n; i++) {
suffixArray[i] = suffixes[i].index;
}
// Free the suffixes array
free(suffixes);
return suffixArray;
}
// Function to perform binary search for a pattern in the suffix array
int searchSuffixArray(char *text, int *suffixArray, int n, char *pattern) {
int m = strlen(pattern);
int low = 0, high = n - 1;
while (low <= high) {
int mid = (low + high) / 2;
int res = strncmp(text + suffixArray[mid], pattern, m);
if (res == 0) {
return suffixArray[mid]; // Pattern found
} else if (res < 0) {
low = mid + 1;
} else {
high = mid - 1;
}
}
return -1; // Pattern not found
}
// Main function to demonstrate the usage of the suffix array
int main() {
char text[] = "banana";
int n = strlen(text);
// Build the suffix array
int *suffixArray = buildSuffixArray(text, n);
// Print the suffix array
printf("Suffix Array:\n");
for (int i = 0; i < n; i++) {
printf("%d: %s\n", suffixArray[i], text + suffixArray[i]);
}
// Search for a pattern in the text
char pattern[] = "nan";
int result = searchSuffixArray(text, suffixArray, n, pattern);
if (result != -1) {
printf("Pattern '%s' found at index %d\n", pattern, result);
} else {
printf("Pattern '%s' not found\n", pattern);
}
// Free allocated memory
free(suffixArray);
return 0;
}
Explanation
The program begins by defining a struct Suffix
that contains an integer index
to store the original position of the suffix in the text, and a pointer suffix
that points to the start of the suffix in the string.
The compareSuffixes()
function is a comparison function used by the qsort()
function to sort the suffixes lexicographically. This is necessary because a suffix array is an array of suffixes sorted in lexicographical order.
The buildSuffixArray()
function generates the suffix array for a given text. It creates all possible suffixes of the text, stores them along with their indices, sorts them using qsort()
, and finally stores the indices of these sorted suffixes in the suffix array.
Once the suffix array is built, the searchSuffixArray()
function allows for efficient searching of a substring within the text. It performs a binary search on the suffix array to locate the pattern. If the pattern is found, the function returns the starting index of the pattern in the text; otherwise, it returns -1.
In the main()
function, we demonstrate the creation of a suffix array for the string “banana” and search for the substring “nan”. The suffix array is printed out, and the result of the substring search is displayed.