Bloom Filter Implementation in C
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. It can give false positives (indicating that an element might be in the set when it is not), but it never gives false negatives (indicating that an element is not in the set when it is).
Program Structure
This implementation of a Bloom filter in C will include the following components:
- Struct definition to hold the Bloom filter bit array and the size of the array.
- Hash functions for generating indices to set/check in the bit array.
- Initialization function to create the Bloom filter.
- Functions to add elements to the Bloom filter and to check for membership.
- A main function to demonstrate the usage of the Bloom filter.
Code Implementation
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#define BLOOM_SIZE 1000 // Size of the Bloom filter bit array
// Define the structure for the Bloom filter
typedef struct {
unsigned char *bit_array; // Array of bits
int size; // Size of the bit array
} BloomFilter;
// Hash function prototypes
unsigned int hash1(const char *str);
unsigned int hash2(const char *str);
// Function to initialize the Bloom filter
BloomFilter* createBloomFilter(int size) {
BloomFilter *filter = (BloomFilter*)malloc(sizeof(BloomFilter));
filter->bit_array = (unsigned char*)calloc(size, sizeof(unsigned char));
filter->size = size;
return filter;
}
// Function to add an element to the Bloom filter
void addToBloomFilter(BloomFilter *filter, const char *str) {
unsigned int index1 = hash1(str) % filter->size;
unsigned int index2 = hash2(str) % filter->size;
filter->bit_array[index1] = 1;
filter->bit_array[index2] = 1;
}
// Function to check if an element is possibly in the Bloom filter
bool checkBloomFilter(BloomFilter *filter, const char *str) {
unsigned int index1 = hash1(str) % filter->size;
unsigned int index2 = hash2(str) % filter->size;
return filter->bit_array[index1] && filter->bit_array[index2];
}
// First hash function (djb2)
unsigned int hash1(const char *str) {
unsigned int hash = 5381;
int c;
while ((c = *str++)) {
hash = ((hash << 5) + hash) + c; // hash * 33 + c
}
return hash;
}
// Second hash function (sdbm)
unsigned int hash2(const char *str) {
unsigned int hash = 0;
int c;
while ((c = *str++)) {
hash = c + (hash << 6) + (hash << 16) - hash; } return hash; } // Main function to demonstrate the usage of the Bloom filter int main() { BloomFilter *filter = createBloomFilter(BLOOM_SIZE); // Add some elements to the Bloom filter addToBloomFilter(filter, "hello"); addToBloomFilter(filter, "world"); addToBloomFilter(filter, "foo"); addToBloomFilter(filter, "bar"); // Check for membership printf("Checking 'hello': %s\n", checkBloomFilter(filter, "hello") ? "Possibly in set" : "Definitely not in set"); printf("Checking 'world': %s\n", checkBloomFilter(filter, "world") ? "Possibly in set" : "Definitely not in set"); printf("Checking 'baz': %s\n", checkBloomFilter(filter, "baz") ? "Possibly in set" : "Definitely not in set"); // Free allocated memory free(filter->bit_array);
free(filter);
return 0;
}
Explanation
The program begins by defining a struct BloomFilter
that contains a bit array (bit_array
) and the size of this array (size
). The size is defined as BLOOM_SIZE
, which is set to 1000 in this implementation. This bit array is used to store the results of the hash functions applied to the elements being added to the filter.
Two hash functions, hash1()
and hash2()
, are used to calculate the positions in the bit array where bits are set or checked. These hash functions generate different hash values for the same input, minimizing the likelihood of collisions and false positives.
The createBloomFilter()
function initializes a Bloom filter by allocating memory for the bit array and setting all bits to 0. The addToBloomFilter()
function takes a string, hashes it using both hash functions, and sets the corresponding bits in the bit array.
The checkBloomFilter()
function checks whether a given string is possibly in the set by hashing it and checking if the corresponding bits are set. If both bits are set, the string is considered “possibly in the set” (though it could be a false positive); otherwise, it is “definitely not in the set.”
In the main()
function, we demonstrate adding strings like “hello”, “world”, “foo”, and “bar” to the Bloom filter and then check if they or other strings like “baz” are in the set. The output will indicate whether each string is “possibly in the set” or “definitely not in the set.”