FuzzySet

Represents a set of many, potentially large, values (e.g. long strings such as URLs) using significantly less memory than storing the values themselves.

The set is "lossy": it cannot definitively state that it does contain a value, but it can definitively say that a value is not in the set. It can therefore be used as a Bloom filter. It can also be used for fuzzy counting, because it can estimate reasonably accurately how many unique values it contains.
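
The fuzzy-counting idea can be illustrated with the standard occupancy estimate for a bitset. This is a minimal sketch, not the class's actual method: the function name `estimateUniqueValues` is hypothetical, and it assumes exactly one hash (one bit) per recorded value.

```kotlin
import kotlin.math.ln

// Hypothetical helper, not part of the documented API.
// Assumption: each value sets exactly one bit. With totalBits bits of which
// setBits are set, the expected number of distinct values recorded is
// approximately -totalBits * ln(1 - setBits / totalBits).
fun estimateUniqueValues(setBits: Int, totalBits: Int): Double {
    require(totalBits > 0 && setBits in 0 until totalBits) {
        "need 0 <= setBits < totalBits"
    }
    val saturation = setBits.toDouble() / totalBits
    return -totalBits * ln(1.0 - saturation)
}
```

The estimate is always at least the number of set bits, because collisions mean several distinct values can share one bit.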

This class is NOT threadsafe.

Internally, a bitset records the values. Once a client has finished recording a stream of values, the .downsize method can be used to create a smaller set that is sized appropriately for the number of values recorded and the desired saturation level.
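
One way such a downsizing step could work is by folding the large bitset onto a smaller one. The sketch below is an assumption about the technique, not the documented implementation: `foldBitSet` is a hypothetical name, it uses the JVM's `java.util.BitSet`, and it requires the target size to divide the source size. Choosing the target size from a saturation level is not shown.

```kotlin
import java.util.BitSet

// Hypothetical sketch, not the documented implementation.
// If targetSize divides sourceSize, then for any non-negative hash h,
// (h % sourceSize) % targetSize == h % targetSize, so the folded set
// answers lookups as if values had been recorded at the smaller size.
fun foldBitSet(source: BitSet, sourceSize: Int, targetSize: Int): BitSet {
    require(targetSize > 0 && sourceSize % targetSize == 0) {
        "targetSize must be a positive divisor of sourceSize"
    }
    val folded = BitSet(targetSize)
    var i = source.nextSetBit(0)
    while (i >= 0) {
        folded.set(i % targetSize)   // map each set bit onto the smaller range
        i = source.nextSetBit(i + 1)
    }
    return folded
}
```

Folding can only merge bits, never lose them, so a value that was recorded still maps to a set bit; the cost is a higher saturation and therefore more false MAYBE answers.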

Types

object Companion

Result from FuzzySet.contains: never a definite YES (always MAYBE), but sometimes a definite NO.

Properties


Returns nested resources of this class. The result should be a point-in-time snapshot (to avoid race conditions).

Functions

fun addValue(value: BytesRef)

Records a value in the set. The referenced bytes are hashed to a 64-bit value, from which two 32-bit hashes are taken (the most and least significant 32 bits). These two hashes can be combined to derive further hashes (see https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf). Finally, each generated hash is reduced modulo n, where n is the chosen size of the internal bitset.
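
The hash derivation described above can be sketched as follows. This is an illustration under assumptions rather than the class's code: `bitPositions` and the parameter `k` (number of derived hashes) are hypothetical, and treating the upper half as the base hash and the lower half as the step is an arbitrary choice made here. The combination `h1 + i * h2` is the double-hashing scheme from the linked paper.

```kotlin
// Hypothetical helper, not part of the documented API.
// Derives k bit positions in [0, n) from a single 64-bit hash.
fun bitPositions(hash64: Long, k: Int, n: Int): IntArray {
    require(k > 0 && n > 0) { "k and n must be positive" }
    val msb = (hash64 ushr 32).toInt()  // most significant 32 bits
    val lsb = hash64.toInt()            // least significant 32 bits
    return IntArray(k) { i ->
        val combined = msb + i * lsb    // Int arithmetic wraps on overflow
        combined.mod(n)                 // floored modulus: result is never negative
    }
}
```

`Int.mod` (Kotlin 1.5 and later) is used instead of `%` because the combined hash may be negative, and `%` would then yield a negative index.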


The main method required for a Bloom filter: given a value, it determines set membership. Unlike a conventional set, the fuzzy set returns NO or MAYBE rather than true or false. Hash generation follows the same principles as .addValue.
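
The NO or MAYBE semantics can be seen in a toy version of the structure. Everything named here is hypothetical (`Membership`, `TinyFuzzySet`, and the choice to take a pre-computed `Int` hash and set a single bit per value); only the two possible answers come from the description above. It assumes the JVM's `java.util.BitSet`.

```kotlin
import java.util.BitSet

// Hypothetical names throughout; a toy illustration, not the documented class.
enum class Membership { MAYBE, NO }

class TinyFuzzySet(private val size: Int) {
    private val bits = BitSet(size)

    // Records a pre-computed hash by setting one bit (one hash per value, for brevity).
    fun add(hash: Int) {
        bits.set(hash.mod(size))
    }

    // An unset bit proves the value was never added; a set bit may be a collision.
    fun contains(hash: Int): Membership =
        if (bits.get(hash.mod(size))) Membership.MAYBE else Membership.NO
}
```

With a set of size 8, adding hash 3 makes a lookup of hash 11 return MAYBE as well, since both map to bit 3, while a lookup of hash 4 returns NO.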

fun downsize(targetMaxSaturation: Float): FuzzySet?

open override fun ramBytesUsed(): Long

Return the memory usage of this object in bytes. Negative values are illegal.


Serializes the data set to file using the following format:

open override fun toString(): String