User:Shohihara/sandbox

Counting Bloom filters
A counting Bloom filter is an extension of a Bloom filter, a probabilistic, space-efficient data structure that lets users test whether an item is a member of a set. To understand counting Bloom filters, it is essential to first understand how Bloom filters work, so the following section briefly explains what a Bloom filter is.

Bloom filters
Conceived by Burton Howard Bloom in 1970, a Bloom filter is a probabilistic data structure whose main advantage is its space usage: it does not store the items themselves. Instead, it uses a bit array, storing a single bit (either 0 or 1) at each index. Bloom filters work by mapping each element onto a combination of indices using multiple hash functions; a hash function is a function that, given an input, returns a specific index. Simply put, an element of the set is represented by a combination of indices. For example, say we have a bit array of 10 indices and 3 hash functions f(x), g(x), and h(x). Initially, the bit array looks as follows:

[0] [0] [0] [0] [0] [0] [0] [0] [0] [0]

Now, imagine we want to store the word "cabbage" in the set. The three hash functions each output an index for the input "cabbage". Say the outputs are:

$$f(\text{cabbage}) = 7 \quad g(\text{cabbage}) = 0 \quad h(\text{cabbage}) = 2$$

Indices 7, 0, and 2 in the array are set to 1, and the element is now "stored". The bit array now looks as follows:

[1] [0] [1] [0] [0] [0] [0] [1] [0] [0]

Notice how the word "cabbage" itself did not have to be stored, which saves space. Later, if a user wants to check whether "cabbage" is in the set, the same three hash functions will return the same three values (7, 0, 2). If any of those three indices is 0, we can say that "cabbage" is not in the set, because if "cabbage" had ever been stored, all three indices would be 1. It is important to note that even when all the indices are 1, this does not guarantee that the element is in the set; the filter can only say that it might be. Take an example: imagine a set containing the two words "apple" and "tomato", represented by indices (3,5,7) and (2,0,9), respectively. The array would look as follows:

[1] [0] [1] [1] [0] [1] [0] [1] [0] [1]

If we checked whether "cabbage" is in this set, the filter would suggest that it is, since indices (7,0,2) are all 1. But in reality, the set contains only "apple" and "tomato". This is why, even when all of an element's indices are 1, we cannot say for certain that the element is in the set. For the same reason, we cannot delete elements from a Bloom filter: clearing an index might clear a bit that belongs to a different element.
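To make this concrete, here is a minimal Python sketch of a plain Bloom filter (my own illustration, separate from the implementation later on this page). The three toy hash functions are stand-ins for f, g, and h above: each salts the word, hashes it with MD5, and maps the result into the 10-slot array.

import hashlib

SIZE = 10  # the 10-slot bit array from the example above

def toy_hashes(word):
    # Three stand-ins for f, g, h: salt the word, hash it, map the result into the array.
    return [int(hashlib.md5((salt + word).encode("utf-8")).hexdigest(), 16) % SIZE
            for salt in ("f:", "g:", "h:")]

bits = [0] * SIZE

def insert(word):
    for idx in toy_hashes(word):
        bits[idx] = 1  # set every index the word maps to

def might_contain(word):
    # If any index is 0, the word was definitely never inserted;
    # if all are 1, it is only *probably* in the set.
    return all(bits[idx] == 1 for idx in toy_hashes(word))

insert("apple")
insert("tomato")
print(might_contain("apple"))    # True
print(might_contain("cabbage"))  # usually False, unless its indices happen to collide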

Why "counting" Bloom filter?
Counting Bloom filters address the aforementioned weakness of Bloom filters: not being able to delete an element from the set. Instead of a bit array with a single bit at each index, a counting Bloom filter uses an array of n-bit counters. This means that instead of a binary value of 0 or 1, each index can hold a count from 0 to 2^n - 1. This makes counting Bloom filters less space-efficient, but it now allows deletion of elements, because each index can count beyond a single bit. A standard Bloom filter can be thought of as a counting Bloom filter with n = 1. Inserting an element increments the corresponding counters by 1, and deleting an element decrements them by 1. n should be large enough to keep the counters from overflowing, since an overflowed counter would break deletion, which is the whole point of using counting Bloom filters instead of standard Bloom filters.
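Continuing the toy sketch above (again my own illustration), the only change needed for a counting Bloom filter is to replace the bit array with an array of counters and to increment or decrement them; SIZE and toy_hashes are as defined in the previous sketch.

counters = [0] * SIZE  # counters instead of single bits

def cbf_insert(word):
    for idx in toy_hashes(word):
        counters[idx] += 1  # increment every counter the word maps to

def cbf_delete(word):
    for idx in toy_hashes(word):
        counters[idx] -= 1  # decrement; only safe if the word was actually inserted

def cbf_might_contain(word):
    return all(counters[idx] > 0 for idx in toy_hashes(word))

cbf_insert("cabbage")
print(cbf_might_contain("cabbage"))  # True
cbf_delete("cabbage")
print(cbf_might_contain("cabbage"))  # False again, since its counters are back to 0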

Operations supported by counting Bloom filters
As explained in the sections above, Bloom filters support:


 * insertion
 * search

In addition to the two operations, counting Bloom filters support:


 * deletion

Since counting Bloom filters do not store the data elements themselves, it is impossible to list all elements in the set. They can only tell the user whether an element may be in the set or not.

Possible applications
The strength of Bloom filters lies in their space efficiency: they do not store the data itself. They are therefore most useful for storing large amounts of data; storing a small amount of data in a counting Bloom filter does not play to its strength. Examples include spell checkers and security software.

Grammarly, a spell-checking tool, does not need to ship the entire English dictionary to its users. It only needs to tell whether the user has misspelled a word: if Grammarly thinks a word is not in its dictionary, it highlights the word for the user. Using a Bloom filter lets it do this very quickly, since searching for each word takes only O(k) time, where k is the number of hash functions used in the filter. Furthermore, Grammarly allows users to add words to their dictionary as well as delete them, which suggests that a counting Bloom filter, rather than a simple Bloom filter, would be the right structure for this feature.

Security software likewise does not need to show users the entire list of known viruses; it just needs to know whether a downloaded file may be malicious. By using a Bloom filter, it can let the user open a file without significant delay if the file is definitely not on the list of known threats. If the file is potentially malicious, the software can run a more thorough analysis. This way, it does not have to perform an in-depth analysis of every downloaded file, most of which are likely safe.

Sho's implementation
import numpy as np
import matplotlib.pyplot as plt

def hash1(word):
    inp = int(''.join(format(ord(c)) for c in word))  # concatenate the Unicode code points into one integer
    np.random.seed(3141592)
    temp = inp * np.random.uniform()
    return int(temp % 1000000) + np.random.randint(1, 100)

def hash2(word):
    inp = int(''.join(format(ord(c)) for c in word))
    np.random.seed(161803)
    temp = inp * np.random.uniform()
    return int(temp % 1000000) + np.random.randint(1, 100)

def hash3(word):
    inp = int(''.join(format(ord(c)) for c in word))
    np.random.seed(667408)
    temp = inp * np.random.uniform()
    return int(temp % 1000000) + np.random.randint(1, 100)

def makesure(inputlist):
    # just making sure that the hash functions return a more or less uniform distribution
    hashes = []
    for i in inputlist:
        hashes.append(hash1(i))
        hashes.append(hash2(i))
        hashes.append(hash3(i))
    plt.hist(hashes)
    plt.show()

def ins_cbf(inputlist, bitarr):
    for i in inputlist:
        # increment all indices returned by the three hash functions
        bitarr[hash1(i)] += 1
        bitarr[hash2(i)] += 1
        bitarr[hash3(i)] += 1

def del_cbf(item, bitarr):
    # decrement all indices returned by the three hash functions
    bitarr[hash1(item)] -= 1
    bitarr[hash2(item)] -= 1
    bitarr[hash3(item)] -= 1

def check_cbf(item, bitarr):
    if bitarr[hash1(item)] == 0 or bitarr[hash2(item)] == 0 or bitarr[hash3(item)] == 0:
        # if any index is 0, the item is definitely not in the set
        print(item, "is definitely not in the set.")
        return 0
    else:
        print(item, "is probably in the set.")
        return 1

words = open("/usr/share/dict/words").read().splitlines()  # generate list of words
bitarr = [0 for i in range(1000100)]  # initialize the counter array; size 1000100 exceeds the maximum possible hash value (999999 + 99)
makesure(words)                       # just making sure the hashes are more or less evenly distributed
ins_cbf(words, bitarr)                # build the counting Bloom filter from the "words" list
check_cbf("cabbage", bitarr)          # check if cabbage might be in the list
del_cbf("cabbage", bitarr)            # delete cabbage from the list
check_cbf("cabbage", bitarr)          # check if cabbage is in the list again
The implementation uses:

 * 1) three hash functions, each of which takes the Unicode code points of the characters in the word and concatenates them into one integer,
 * 2) multiplies that integer by a seeded uniform random variable, takes the result mod 1,000,000, and adds a random value from 1 to 99 (numpy's randint(1, 100) excludes the upper bound)

Justification
I chose three hash functions to keep the false positive rate (a non-existent element returning "might exist in the set") as low as possible. For a bit array of size 1,000,100 and an input size of 235,886 words, the probability of any given index being 0 is:

$$\left(1-\frac{1}{1000100}\right)^{235886k}$$

where k is the number of hash functions. Therefore, the probability of any given index being a 1 is:
$$1-\left(1-\frac{1}{1000100}\right)^{235886k}$$

The probability of all k indices returned by searching for a non-existent element being 1 is:

$${\left({\displaystyle 1-\left(1-{\frac {1}{1000100}}\right)^{235886k}}\right)}^k \thickapprox \left(1-e^{-235886k/1000100}\right)^k $$

The value of k that minimizes this probability, found by setting the derivative with respect to k to zero (which happens when roughly half of the array positions are 1), is:

$$\frac{1000100}{235886}\ln{2} \thickapprox 3$$

This is why I used three hash functions in my implementation.
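As a quick numerical sanity check (my own addition, using the same m and n as above), the optimal k and the resulting false positive rate can be computed directly:

import math

m = 1000100  # array size
n = 235886   # number of dictionary words inserted

k_opt = (m / n) * math.log(2)
print(k_opt)  # about 2.94, so three hash functions

k = 3
p = (1 - math.exp(-k * n / m)) ** k
print(p)  # roughly 0.13, the theoretical false positive rate used later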

The memory size as a function of false positive rate
The memory size and input size can be denoted as m and n, respectively. With the optimal number of hash functions, the false positive rate p can be written as:

$$p \thickapprox \left(\frac{1}{2}\right)^k \thickapprox 0.62^{m/n}$$

where m is the memory size and n is the input size; $$\left(\frac{1}{2}\right)^k$$ approximates $$0.62^{m/n}$$ by substituting $$\frac{m}{n}\ln{2}$$ for k. Taking logarithms of $$p \thickapprox 0.62^{m/n}$$ and solving for m gives:

$$m \thickapprox \frac{n\ln{(1/p)}}{(\ln{2})^2}$$

Therefore, for a fixed input size, the memory size scales with $$\ln{(1/p)}$$: it grows only logarithmically as the desired false positive rate shrinks, and each halving of p costs roughly $$n/\ln{2} \thickapprox 1.44n$$ additional counters.
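As a check (my own addition), plugging the figures from this page into the formula recovers the array size used in the implementation:

import math

n = 235886   # number of words
p = 0.13     # target false positive rate

m = n * math.log(1 / p) / (math.log(2) ** 2)
print(m)  # roughly 1.0 million, consistent with the array size of 1,000,100 above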

The memory size as a function of number of items stored
Since

$$k \thickapprox \frac{m}{n}\ln{2}$$,

we can say that

$$m \thickapprox \frac{n}{\ln{2}}k$$

Therefore, the memory size should scale linearly with the number of items stored, assuming the number of hash functions is kept optimal for the combination of n and m.
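For example (my own illustration), holding k = 3 fixed at its optimum, the required memory is simply proportional to the number of items:

import math

k = 3
for n in (100_000, 200_000, 400_000):
    m = n * k / math.log(2)
    print(n, round(m))  # m grows linearly with n: about 4.33 counters per item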

Access time as a function of the false positive rate
Since the access time is directly proportional to the number of hash functions used, we know that the access time scales linearly with k.

Now, the false positive rate is $$p \thickapprox \left(\frac{1}{2}\right)^k$$. Taking the base-2 logarithm of both sides gives

$$k \thickapprox \log_2{\frac{1}{p}}$$

Thus, the access time should scale logarithmically with the inverse of the false positive rate: each additional hash function evaluated per access halves the false positive rate.
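As a concrete illustration (my own), each halving of the target false positive rate adds exactly one hash evaluation per access:

import math

for p in (0.5, 0.25, 0.125, 0.0625):
    k = math.log2(1 / p)
    print(p, k)  # k = 1, 2, 3, 4: each lookup touches k indices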

Access time as a function of the number of items stored
Since

$$k \thickapprox \frac{m}{n}\ln{2}$$,

we can say that the access time scales inversely with the number of items stored, for a fixed memory size. This is not because the hash functions become faster, but because with more input the optimal number of hash functions (= k) becomes smaller in order to avoid setting too many indices. Given a fixed amount of space, as the input size grows, the number of hash functions must decrease so that fewer indices are filled, which is what keeps false positives in check.
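For instance (my own numbers, using the fixed array size m = 1,000,100 from above):

$$k \thickapprox \frac{1000100}{235886}\ln{2} \thickapprox 2.9 \qquad\qquad k \thickapprox \frac{1000100}{471772}\ln{2} \thickapprox 1.5$$

so doubling the number of stored items roughly halves the optimal number of hash functions, and with it the work done per access.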

False positive rate empirical observation
We know that the false positive rate is simply

$$\left(\frac{1}{2}\right)^k \thickapprox 0.62^{m/n}$$.

Thus, with three hash functions, a memory size of 1,000,100, and 235,886 inputs, the false positive rate should theoretically be approximately:

$$\left(\frac{1}{2}\right)^k \thickapprox 0.62^{m/n}\thickapprox 0.13$$

Below is code that I wrote to empirically test the validity of this figure; it relies on words, ins_cbf, and check_cbf from the implementation above.

import random

bitarr = [0 for i in range(1000100)]
ins_cbf(words, bitarr)

master = []
for j in range(30):  # 30 samples
    accuracy = []
    for i in range(500):
        # random lowercase string of length 1 to 8
        randstring = ''.join(random.choice('abcdefghijklmnopqrstuvwxyz')
                             for x in range(1, random.randrange(2, 10)))
        if randstring not in words:  # make sure the string does not actually exist
            accuracy.append(check_cbf(randstring, bitarr))
    master.append(accuracy.count(1) / len(accuracy))  # false positive rate for this sample

plt.scatter([i for i in range(1, 31)], master)
plt.show()

I could unfortunately not include the plot image due to the wiki-sandbox constraints, but please run the code above and see the plot.

Surprisingly, I consistently observed a value of around 0.3. I am not certain why, but it appears that the hash functions are clustering around similar values, so the distribution is not quite uniform.

This could be because I use a mod operation to determine the index, which may not be optimal. It should also be recognized that the conversion of strings into integers relies on Unicode code points, and nearby characters tend to produce nearby code point values, so there is a flaw in my hash functions. To overcome this, I suggest using a more sophisticated hash construction in future work, such as one driven by the Mersenne Twister or an established string-hashing algorithm.
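As one possible direction (a sketch of my own, not part of the implementation above), the string-to-index step could be based on an established hash such as MD5 from Python's hashlib, salted differently for each of the three hash functions; this spreads nearby inputs across the whole array much more uniformly:

import hashlib

ARRAY_SIZE = 1000100  # same counter-array size as above

def better_hash(word, salt):
    # Hash the salted word with MD5 and map the digest into the counter array.
    digest = hashlib.md5((salt + word).encode("utf-8")).hexdigest()
    return int(digest, 16) % ARRAY_SIZE

def better_hashes(word):
    # Three distinct indices derived from one underlying hash by varying the salt.
    return [better_hash(word, salt) for salt in ("h1:", "h2:", "h3:")]

print(better_hashes("cabbage"))
print(better_hashes("cabbagf"))  # a near-identical string now maps to very different indices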