Indexing Encrypted Data Using Bloom Filters: February 2020
All content following this page was uploaded by Claude N Warren, Jr. on 12 January 2020.
Claude N. Warren, Jr
Github: https://github.com/Claudenw
LinkedIn: https://www.linkedin.com/in/claudewarren
Research Gate: https://www.researchgate.net/profile/Claude_Warren_Jr
Overview
1 Goals
2 The Process
3 Introduction to the Bloom Filter
4 Example
5 Goals Review
6 Additional Info
7 References
Goals: What are the goals
Indexing encrypted data using Bloom filters is not a new idea. There have
been several published papers that explore this problem [7, 4, 2, 10]. What
has changed is the introduction of multidimensional Bloom filters [5, 13]
that allow fast searching of a large number of Bloom filters.
The Process: Write
The write process
Introduction to the Bloom Filter: What is it
How is it defined
Bloom filters are constrained by: the number of elements in the set, the
number of hash functions, the number of bits in the vector, and the
probability of false positives.
p is the probability of false positives,
n is the number of elements in the set represented by the Bloom filter,
m is the number of bits, and
k is the number of hash functions.
Mitzenmacher and Upfal [9] have shown that the relationship between
these properties is:
\[ p = \left(1 - e^{-kn/m}\right)^{k} \]
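The relationship above can be explored numerically. The following is a minimal sketch, assuming the usual optimal-sizing formulas m = ⌈−n ln p / (ln 2)²⌉ and k = round((m/n) ln 2), which are not stated on the slide; the class and method names are hypothetical, and the rounding used by any particular library may differ.

```java
/** Hypothetical helper for sizing a Bloom filter; not part of any library. */
class BloomShapeSketch {

    /** Number of bits m for n items and target false positive probability p (assumed optimal sizing). */
    static int numberOfBits(int n, double p) {
        double ln2 = Math.log(2);
        return (int) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Number of hash functions k for n items and m bits (assumed optimal sizing). */
    static int numberOfHashFunctions(int n, int m) {
        return (int) Math.round((double) m / n * Math.log(2));
    }

    /** False positive probability from the slide's formula: p = (1 - e^(-kn/m))^k. */
    static double falsePositiveRate(int n, int m, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        int n = 10;
        int m = numberOfBits(n, 1.0 / 2000000);
        int k = numberOfHashFunctions(n, m);
        double p = falsePositiveRate(n, m, k);
        System.out.println("m=" + m + " k=" + k + " p=" + p + " (1 in " + Math.round(1 / p) + ")");
    }
}
```

With n = 10 and a requested p of 1/2000000, this sketch gives m = 302 and k = 21, and the resulting p is slightly better than requested because m and k are rounded to integers.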
How is it constructed
Algorithm 1: How to construct a Bloom filter
Result: A populated bit vector
byte[][] buffers;  // the list of buffers to hash
bit[m] bitBuffer;
for buffer in buffers do
    for i = 1 to k do
        long h = hash(buffer, i);
        int bitIdx = h mod m;
        bitBuffer[bitIdx] = 1;
    end
end
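Algorithm 1 can be sketched in plain Java as follows. This is an illustration only: the class name is hypothetical, and the `hash` method is a seeded FNV-1a stand-in chosen to keep the example self-contained, not the Murmur3 hash used elsewhere in this presentation.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch following Algorithm 1 (hypothetical class, not a library API). */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    /** Stand-in for hash(buffer, i): FNV-1a seeded with i. An assumption, not Murmur3. */
    static long hash(byte[] buffer, int i) {
        long h = 0xcbf29ce484222325L ^ i;
        for (byte b : buffer) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    /** Sets k bits for the buffer, as in the inner loop of Algorithm 1. */
    void add(byte[] buffer) {
        for (int i = 1; i <= k; i++) {
            // floorMod keeps the index non-negative even when the hash is negative
            int bitIdx = (int) Math.floorMod(hash(buffer, i), (long) m);
            bits.set(bitIdx);
        }
    }

    /** Returns false if the buffer was definitely never added; true may be a false positive. */
    boolean mightContain(byte[] buffer) {
        for (int i = 1; i <= k; i++) {
            int bitIdx = (int) Math.floorMod(hash(buffer, i), (long) m);
            if (!bits.get(bitIdx)) {
                return false;
            }
        }
        return true;
    }

    /** Number of bits currently set. */
    int cardinality() {
        return bits.cardinality();
    }
}
```

Note that a signed `long` hash can be negative, so the sketch uses `Math.floorMod` rather than the `%` operator to compute `h mod m`.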
Construct a Bloom filter using Apache Commons Collections [1]
HashFunction hFunc = new Murmur128x86Cyclic();
// 1.0 / 2000000 forces floating-point division; the integer expression 1/2000000 evaluates to 0
Shape shape = new Shape(hFunc, 10, 1.0 / 2000000);
// single buffer example
DynamicHasher hasher = new DynamicHasher.Builder(hFunc).with(buffer).build();
BloomFilter filter = new BitSetBloomFilter(hasher, shape);
Data encoding issues
Interval (decimal numeric) data does not lend itself to Bloom filter
retrieval. Solution: Use ordinal values (e.g. small, medium, large) or
mathematically transform the value to an integer (e.g. round decimal
latitude or longitude values).
Some properties may have similar values, leading to a larger number of
hash collisions. Solution: Prefix the value with the property name or
abbreviation. For example, if tracking automobile interior and exterior
colors, the values for a white car with a red interior could be encoded
as "exterior:white" and "interior:red".
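Both solutions can be sketched as small encoding helpers applied before a value is hashed. The class and method names below are hypothetical, and rounding to the nearest whole degree is just one possible transformation.

```java
/** Hypothetical encoding helpers applied to values before they are hashed into a Bloom filter. */
class ValueEncoder {

    /** Prefixes the value with its property name so equal values of different properties hash differently. */
    static String encodeProperty(String property, String value) {
        return property + ":" + value;
    }

    /** Rounds a decimal value to the nearest integer so nearby values map to the same token. */
    static String encodeRounded(String property, double value) {
        return property + ":" + Math.round(value);
    }
}
```

For example, `encodeProperty("exterior", "white")` produces `exterior:white`, and `encodeRounded("latitude", 36.17497)` produces `latitude:36`. Coarser or finer rounding trades match precision against the number of distinct tokens.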
Example: GeoNames is our data
Bloom encrypted index demo
The demo code [11] uses a multidimensional Bloom filter [5] to index
2 million (2e6) GeoName objects using a 128-bit Murmur3 x86
hash implementation. The Bloom filter shape specifies: n = 10 and
p = 1.0/2000000 which yields m = 302, k = 21 and p ≈ 0.0000005
(1 in 2001957).
The multidimensional Bloom filter library [12] uses a Hasher that
does not retain the buffer bytes, just the hashed values.
After the demo loads the data it reports that it has loaded 2 million
items resulting in 704899 unique filters.
Searching for "Las Vegas" and "PPL" (GeoNames feature code for
Populated Place) yields 8 results. All are named "Las Vegas" and
have "PPL" designations.
Searching for "want" yields 3 results: one is named "Want" and the
other 2 are false positives.
Demo
DEMO
Goals Review: Did we meet the goals
Additional Info: Standard uses for Bloom filters
Typically used where hash table solutions are too memory intensive
and false positives can be tolerated, for example as a gating function
in front of a longer operation (e.g. to determine whether a database
lookup should be made).
In bioinformatics they are used to test the existence of a k-mer in a
sequence or set of sequences. The k-mers of the sequence are indexed
in a Bloom filter, and any k-mer of the same size can be queried
against the Bloom filter.
In database engines they are used to perform joins. A Bloom filter is
constructed for one side of the join. During the join, the values from
the other side are checked against the Bloom filter first.
In the context of service discovery in a network, Bloom filters have
been used to determine how many hops it is from a specific node to a
node providing a desired service.
Bloom filters are often used to search large chemical structure
databases. The properties of the atom are encoded into Bloom filters
that are then stored in a multidimensional Bloom filter.
What is a multidimensional Bloom filter
Indexing Bloom filters is not trivial, as the sorting algorithms that underlie
most indexes require an ordinal comparison. Bloom filter comparisons,
however, do not produce ordinal results because matching is a bit-level
intersection calculation. Bloom filter indexes also struggle to outperform
a plain scan because the bit-level intersection calculation is extremely
fast: Warren et al. [13] have shown that for fewer than 1000 entries no
multidimensional Bloom filter is faster than a linear search.
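The linear search baseline and the bit-level intersection test can be sketched as follows. This is a minimal illustration using `java.util.BitSet` as the bit vector; the class and method names are hypothetical and this is not the multidimensional Bloom filter library's API.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical linear search over a list of Bloom filters stored as BitSets. */
class LinearBloomSearch {

    /** True when every bit set in the query is also set in the candidate: (query AND candidate) == query. */
    static boolean matches(BitSet query, BitSet candidate) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(candidate); // bits the query requires but the candidate lacks
        return missing.isEmpty();
    }

    /** Returns the positions of all stored filters that match the query. */
    static List<Integer> search(BitSet query, List<BitSet> index) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < index.size(); i++) {
            if (matches(query, index.get(i))) {
                hits.add(i);
            }
        }
        return hits;
    }

    /** Convenience factory: builds a bit vector with the given bit indexes set. */
    static BitSet of(int... bitIdxs) {
        BitSet bits = new BitSet();
        for (int idx : bitIdxs) {
            bits.set(idx);
        }
        return bits;
    }
}
```

The `matches` test answers only "contains" or "does not contain". Two filters such as {1, 2} and {1, 5} contain neither each other nor a common ordering, which is why sort-based index structures cannot be applied directly.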
References I
References II
References III