Bloom Filter Implementation in Bash
This document explains how to implement a Bloom Filter using Bash. A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. The filter may yield false positives (indicating an element is in the set when it isn’t) but never false negatives (indicating an element isn’t in the set when it is).
Program Structure
The program is structured into several key functions:
- initialize_filter: Initializes the Bloom Filter with a given size.
- hash1, hash2, hash3: Hash functions used to set and check bits in the filter.
- add: Adds an element to the Bloom Filter.
- check: Checks whether an element is possibly in the Bloom Filter.
Implementation
#!/bin/bash
# Initialize the Bloom filter with a given size
initialize_filter() {
local size=$1
filter=()
for (( i=0; i<size; i++ )); do
filter[i]=0
done
filter_size=$size
}
# First hash function
hash1() {
echo $(( $(echo -n "$1" | od -An -t d4 | awk '{s+=$1} END {print s}') % filter_size ))
}
# Second hash function
hash2() {
echo $(( $(echo -n "$1" | od -An -t d4 | awk '{s*=$1} END {print s}') % filter_size ))
}
# Third hash function
hash3() {
echo $(( $(echo -n "$1" | od -An -t d4 | awk '{s=($1+3)*7} END {print s}') % filter_size ))
}
# Add an element to the Bloom filter
add() {
local element=$1
local index1=$(hash1 "$element")
local index2=$(hash2 "$element")
local index3=$(hash3 "$element")
filter[index1]=1
filter[index2]=1
filter[index3]=1
}
# Check if an element is in the Bloom filter
check() {
local element=$1
local index1=$(hash1 "$element")
local index2=$(hash2 "$element")
local index3=$(hash3 "$element")
if [[ ${filter[index1]} -eq 1 && ${filter[index2]} -eq 1 && ${filter[index3]} -eq 1 ]]; then
echo "$element might be in the set."
else
echo "$element is definitely not in the set."
fi
}
# Example usage
initialize_filter 20
add "apple"
add "banana"
add "grape"
check "apple" # Likely to say it's in the set
check "banana" # Likely to say it's in the set
check "grape" # Likely to say it's in the set
check "orange" # Likely to say it's not in the set
Explanation
The program implements a simple Bloom Filter with the following components:
Initialization
The initialize_filter
function initializes the Bloom Filter with a specified size. The filter is represented by an array where each position starts with a value of 0, indicating that no elements have been hashed to that position yet.
Hash Functions
The program includes three basic hash functions: hash1
, hash2
, and hash3
. These functions are simplistic and use basic arithmetic on the ASCII values of the input string. Each hash function ensures that the result is within the bounds of the Bloom Filter array size.
Adding an Element
The add
function hashes the input element using the three hash functions. It then sets the corresponding indices in the Bloom Filter array to 1, indicating that the element may be part of the set.
Checking for an Element
The check
function hashes the input element with the same three hash functions. It checks whether all three corresponding indices in the Bloom Filter array are set to 1. If they are, the element is possibly in the set (though it could be a false positive). If any of the indices are 0, the element is definitely not in the set.
Example Usage
The example usage section shows how to initialize the filter, add elements to it, and check for their existence. For instance, “apple”, “banana”, and “grape” are added to the filter, and then the program checks for these elements as well as an element not in the filter (“orange”).
Conclusion
This Bash implementation of a Bloom Filter is a simple and efficient way to perform set membership tests with probabilistic guarantees. It’s particularly useful in applications where memory usage is a concern and false positives are acceptable.