DFRSimilarity

Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.

The DFR scoring formula is composed of three separate components: the basic model, the aftereffect and an additional normalization component, represented by the classes `BasicModel`, `AfterEffect` and `Normalization`, respectively. The names of these classes were chosen to match the names of their counterparts in the Terrier IR engine.

To construct a DFRSimilarity, you must specify the implementations for all three components of DFR:

  1. [BasicModel]: Basic model of information content:
    • [BasicModelG]: Geometric approximation of Bose-Einstein
    • [BasicModelIn]: Inverse document frequency
    • [BasicModelIne]: Inverse expected document frequency [mixture of Poisson and IDF]
    • [BasicModelIF]: Inverse term frequency [approximation of I(ne)]
  2. [AfterEffect]: First normalization of information gain:
    • [AfterEffectL]: Laplace's law of succession
    • [AfterEffectB]: Ratio of two Bernoulli processes
  3. [Normalization]: Second (length) normalization:
    • [NormalizationH1]: Uniform distribution of term frequency
    • [NormalizationH2]: Term frequency density inversely related to length
    • [NormalizationH3]: Term frequency normalization provided by a Dirichlet prior
    • [NormalizationZ]: Term frequency normalization provided by a Zipfian relation
    • [Normalization.NoNormalization]: No second normalization
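
To illustrate how the three components combine, here is a minimal, self-contained sketch of one common combination (In, L, H2), written from the formulas in the Amati and Van Rijsbergen paper rather than from this library's source. The `SketchStats` class, the function names, and the default `c = 1.0` are assumptions made for illustration; they are not part of the DFRSimilarity API.

```kotlin
import kotlin.math.log2

// Hypothetical container for the statistics the sketch needs (not a library class).
data class SketchStats(
    val numberOfDocuments: Long,  // N: documents in the collection
    val docFreq: Long,            // n: documents containing the term
    val avgFieldLength: Double,   // average field length in tokens
)

// Normalization H2: tfn = tf * log2(1 + c * avgdl / dl)
fun normalizationH2(stats: SketchStats, tf: Double, docLen: Double, c: Double = 1.0): Double =
    tf * log2(1.0 + c * stats.avgFieldLength / docLen)

// Basic model In: tfn * log2((N + 1) / (n + 0.5))
fun basicModelIn(stats: SketchStats, tfn: Double): Double =
    tfn * log2((stats.numberOfDocuments + 1.0) / (stats.docFreq + 0.5))

// After effect L (Laplace's law of succession): 1 / (tfn + 1)
fun afterEffectL(tfn: Double): Double = 1.0 / (tfn + 1.0)

// The score is the product of the basic model and the after effect,
// both evaluated on the length-normalized term frequency.
fun inL2Score(stats: SketchStats, tf: Double, docLen: Double, boost: Double = 1.0): Double {
    val tfn = normalizationH2(stats, tf, docLen)
    return boost * basicModelIn(stats, tfn) * afterEffectL(tfn)
}

fun main() {
    val stats = SketchStats(numberOfDocuments = 1000, docFreq = 10, avgFieldLength = 100.0)
    println(inL2Score(stats, tf = 2.0, docLen = 100.0))
}
```

In this sketch the score grows with the term frequency and shrinks as the document gets longer than average, which is the behavior the second normalization is meant to provide.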

Note that qtf, the multiplicity of term-occurrence in the query, is not handled by this implementation.

Note that the basic models BE (limiting form of Bose-Einstein), P (Poisson approximation of the Binomial) and D (divergence approximation of the Binomial) are not implemented because their formulas could not be written in a way that keeps scores non-decreasing with the normalized term frequency.
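
The monotonicity requirement mentioned above can be checked numerically. The following is a small sketch, not library code: `isNonDecreasing` and `inTimesL` are hypothetical helpers, and the weight `6.5` is an arbitrary positive stand-in for the IDF-like factor of basic model In combined with after effect L.

```kotlin
// Hypothetical helper: true if f never decreases along an increasing grid of inputs.
fun isNonDecreasing(grid: List<Double>, f: (Double) -> Double): Boolean =
    grid.map(f).zipWithNext().all { (a, b) -> b >= a }

// Basic model In multiplied by after effect L for a fixed positive weight:
// weight * tfn / (1 + tfn)
fun inTimesL(weight: Double, tfn: Double): Double = weight * tfn / (1.0 + tfn)

fun main() {
    // Normalized term frequencies 0.0, 0.25, ..., 49.75
    val grid = List(200) { it * 0.25 }
    println(isNonDecreasing(grid) { tfn -> inTimesL(6.5, tfn) })
}
```

A combination that fails such a check for some statistics could rank a document lower as its term frequency rises, which is the situation the note describes.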

Constructors

constructor(basicModel: BasicModel, afterEffect: AfterEffect, normalization: Normalization)

Creates a DFRSimilarity from the three components, using the default discountOverlaps value.

constructor(basicModel: BasicModel, afterEffect: AfterEffect, normalization: Normalization, discountOverlaps: Boolean)

Creates a DFRSimilarity from the three components, using the specified discountOverlaps value.

Properties

The first normalization of the information content.

The basic model for information content.

True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.

The term frequency normalization.
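
As a rough illustration of what discounting overlap tokens means for the document length, the sketch below counts tokens with and without those whose position increment is zero (for example, synonyms injected at the same position as the original word). The `Token` class and `fieldLength` function are hypothetical and only model the idea; they are not the library's analysis or norm-computation API.

```kotlin
// Hypothetical token model: text plus its position increment relative to the previous token.
data class Token(val text: String, val positionIncrement: Int)

// Field length used for length normalization. When discountOverlaps is true,
// tokens with a position increment of zero are not counted.
fun fieldLength(tokens: List<Token>, discountOverlaps: Boolean): Int {
    val overlaps = tokens.count { it.positionIncrement == 0 }
    return if (discountOverlaps) tokens.size - overlaps else tokens.size
}

fun main() {
    val tokens = listOf(
        Token("fast", 1),
        Token("quick", 0),  // synonym stacked on the same position as "fast"
        Token("car", 1),
    )
    println(fieldLength(tokens, discountOverlaps = true))
    println(fieldLength(tokens, discountOverlaps = false))
}
```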

Functions

Computes the normalization value for a field at index-time.

open override fun scorer(boost: Float, collectionStats: CollectionStatistics, vararg termStats: TermStatistics): Similarity.SimScorer

Computes any collection-level weight (e.g. IDF, average document length) needed for scoring a query.

open override fun toString(): String

Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.