IndexedDISI

Disk-based implementation of a DocIdSetIterator which can return the index of the current document, i.e. the ordinal of the current document among the list of documents that this iterator can return. This is useful to implement sparse doc values by only having to encode values for documents that actually have a value.

Implementation-wise, this DocIdSetIterator is inspired of RoaringDocIdSet and encodes ranges of 65536 documents independently and picks between 3 encodings depending on the density of the range:

  • ALL if the range contains 65536 documents exactly,

  • DENSE if the range contains 4096 documents or more; in that case documents are stored in a bit set,

  • SPARSE otherwise, and the lower 16 bits of the doc IDs are stored in a DataInput.readShort.

Only ranges that contain at least one value are encoded.

This implementation uses 6 bytes per document in the worst-case, which happens in the case that all ranges contain exactly one document.

To avoid O(n) lookup time complexity, with n being the number of documents, two lookup tables are used: A lookup table for block offset and index, and a rank structure for DENSE block index lookups.

The lookup table is an array of int-pairs, with a pair for each block. It allows for direct jumping to the block, as opposed to iteration from the current position and forward one block at a time.

Each int-pair entry consists of 2 logical parts:

The first 32 bit int holds the index (number of set bits in the blocks) up to just before the wanted block. The maximum number of set bits is the maximum number of documents, which is less than 2^31.

The next int holds the offset in bytes into the underlying slice. As there is a maximum of 2^16 blocks, it follows that the maximum size of any block must not exceed 2^15 bytes to avoid overflow (2^16 bytes if the int is treated as unsigned). This is currently the case, with the largest block being DENSE and using 2^13 + 36 bytes.

The cache overhead is numDocs/1024 bytes.

Note: There are 4 types of blocks: ALL, DENSE, SPARSE and non-existing (0 set bits). In the case of non-existing blocks, the entry in the lookup table has index equal to the previous entry and offset equal to the next non-empty block.

The block lookup table is stored at the end of the total block structure.

The rank structure for DENSE blocks is an array of byte-pairs with an entry for each sub-block (default 512 bits) out of the 65536 bits in the outer DENSE block.

Each rank-entry states the number of set bits within the block up to the bit before the bit positioned at the start of the sub-block. Note that that the rank entry of the first sub-block is always 0 and that the last entry can at most be 65536-2 = 65634 and thus will always fit into an byte-pair of 16 bits.

The rank structure for a given DENSE block is stored at the beginning of the DENSE block. This ensures locality and keeps logistics simple.

Constructors

Link copied to clipboard
constructor(in: IndexInput, offset: Long, length: Long, jumpTableEntryCount: Int, denseRankPower: Byte, cost: Long)

This constructor always creates a new blockSlice and a new jumpTable from in, to ensure that operations are independent from the caller. See .IndexedDISI for re-use of blockSlice and jumpTable.

Types

Link copied to clipboard
object Companion
Link copied to clipboard

Properties

Link copied to clipboard
var block: Int
Link copied to clipboard
Link copied to clipboard
val cost: Long
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
var doc: Int
Link copied to clipboard
Link copied to clipboard
var gap: Int
Link copied to clipboard
var index: Int
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard

The slice that stores the DocIdSetIterator.

Link copied to clipboard
var word: Long
Link copied to clipboard

Functions

Link copied to clipboard
open override fun advance(target: Int): Int

Advances to the first beyond the current whose document number is greater than or equal to target, and returns the document number itself. Exhausts the iterator and returns .NO_MORE_DOCS if target is greater than the highest document number in the set.

Link copied to clipboard
fun advanceExact(target: Int): Boolean
Link copied to clipboard
open override fun cost(): Long

Returns the estimated cost of this DocIdSetIterator.

Link copied to clipboard
open override fun docID(): Int

Returns the following:

Link copied to clipboard
fun index(): Int
Link copied to clipboard
open fun intoBitSet(upTo: Int, bitSet: FixedBitSet, offset: Int)

Load doc IDs into a FixedBitSet. This should behave exactly as if implemented as below, which is the default implementation:

Link copied to clipboard
open override fun nextDoc(): Int

Advances to the next document in the set and returns the doc it is currently on, or .NO_MORE_DOCS if there are no more docs in the set.

NOTE: after the iterator has exhausted you should not call this method, as it may result in unpredicted behavior.