DFRSimilarity
Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.
The DFR scoring formula is composed of three separate components: the basic model, the aftereffect and an additional normalization component, represented by the classes `BasicModel`, `AfterEffect` and `Normalization`, respectively. The names of these classes were chosen to match the names of their counterparts in the Terrier IR engine.
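As a rough illustration of how the three components combine, the sketch below scores one term in one document with an I(n)-style basic model, a Laplace-style after effect and an H2-style length normalization, using the formulas from the cited paper. It is a minimal, self-contained sketch: the class `DfrSketch`, its method names and the fixed parameter `c = 1.0` are assumptions made for illustration and are not part of this library's API.

```java
// Hypothetical sketch; none of these names come from the library itself.
final class DfrSketch {
    private DfrSketch() {}

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Second (length) normalization, H2-style: rescales tf by document length. */
    static double normalizeH2(double tf, double docLen, double avgDocLen, double c) {
        return tf * log2(1.0 + c * avgDocLen / docLen);
    }

    /** Basic model, I(n)-style: information content from inverse document frequency. */
    static double basicModelIn(double tfn, long numDocs, long docFreq) {
        return tfn * log2((numDocs + 1.0) / (docFreq + 0.5));
    }

    /** First normalization of information gain, Laplace-style. */
    static double afterEffectL(double tfn) {
        return 1.0 / (1.0 + tfn);
    }

    /** Combines the three components: normalize tf, then multiply model by after effect. */
    static double score(double tf, double docLen, double avgDocLen, long numDocs, long docFreq) {
        double tfn = normalizeH2(tf, docLen, avgDocLen, 1.0); // c = 1.0 is an assumed default
        return basicModelIn(tfn, numDocs, docFreq) * afterEffectL(tfn);
    }

    public static void main(String[] args) {
        // Term occurs 3 times in a 120-token document; collection of 1000 docs, 25 contain the term.
        System.out.println(score(3, 120, 100, 1000, 25));
    }
}
```

With these formulas the score grows with the term frequency and shrinks as the document gets longer than average, which is the behavior the three components are meant to produce together.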
To construct a DFRSimilarity, you must specify the implementations for all three components of DFR:
- [BasicModel]: Basic model of information content:
  - [BasicModelG]: Geometric approximation of Bose-Einstein
  - [BasicModelIn]: Inverse document frequency
  - [BasicModelIne]: Inverse expected document frequency [mixture of Poisson and IDF]
  - [BasicModelIF]: Inverse term frequency [approximation of I(ne)]
- [AfterEffect]: First normalization of information gain:
  - [AfterEffectL]: Laplace's law of succession
  - [AfterEffectB]: Ratio of two Bernoulli processes
- [Normalization]: Second (length) normalization:
  - [NormalizationH1]: Uniform distribution of term frequency
  - [NormalizationH2]: Term frequency density inversely related to length
  - [NormalizationH3]: Term frequency normalization provided by Dirichlet prior
  - [NormalizationZ]: Term frequency normalization provided by a Zipfian relation
  - [Normalization.NoNormalization]: No second normalization
Note that qtf, the number of times a term occurs in the query, is not handled by this implementation.
Note that the basic models BE (limiting form of Bose-Einstein), P (Poisson approximation of the Binomial) and D (divergence approximation of the Binomial) are not implemented, because their formulas could not be written in a way that makes scores non-decreasing with the normalized term frequency.
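To make the monotonicity requirement concrete, here is a worked check for one combination that is implemented, assuming the I(n) basic model and the Laplace after effect take the forms given in the cited paper. The symbols are notation introduced here, not names from the API: N is the number of documents, n_t the number of documents containing the term, and tfn the normalized term frequency.

```latex
\begin{aligned}
\mathrm{score}(\mathit{tfn})
  &= \log_2\!\frac{N + 1}{n_t + 0.5} \cdot \frac{\mathit{tfn}}{1 + \mathit{tfn}}, \\
\frac{d\,\mathrm{score}}{d\,\mathit{tfn}}
  &= \log_2\!\frac{N + 1}{n_t + 0.5} \cdot \frac{1}{(1 + \mathit{tfn})^{2}} \;>\; 0
  \quad \text{since } n_t \le N .
\end{aligned}
```

Because the derivative is positive for every tfn, a higher normalized term frequency can never lower the score under these assumptions, which is the property the note above says the BE, P and D models could not be made to satisfy.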
Constructors
Creates a DFRSimilarity from the three components, using the default discountOverlaps value.
Creates a DFRSimilarity from the three components, using the specified discountOverlaps value.
Properties
The first normalization of the information content.
The basic model for information content.
True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
The term frequency normalization.
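The effect of discounting overlap tokens on the document length can be sketched as follows. This is a hedged illustration only: `FieldLengthSketch` and `fieldLength` are hypothetical names, and the sketch assumes a field is represented simply as the array of its tokens' position increments, which is not how the library models it internally.

```java
// Hypothetical sketch; the class and method names are not part of the library.
final class FieldLengthSketch {
    private FieldLengthSketch() {}

    /**
     * Counts the tokens that contribute to the field length.
     * positionIncrements[i] is the position increment of the i-th token;
     * an increment of zero marks an overlap token (e.g. an injected synonym).
     */
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            boolean isOverlap = increment == 0;
            if (discountOverlaps && isOverlap) {
                continue; // overlap tokens do not add to the length
            }
            length++;
        }
        return length;
    }

    public static void main(String[] args) {
        // "quick" (+1), synonym "fast" stacked at the same position (+0), "fox" (+1)
        int[] increments = {1, 0, 1};
        System.out.println(fieldLength(increments, true));
        System.out.println(fieldLength(increments, false));
    }
}
```

Discounting overlaps keeps stacked synonyms from making a document look longer than its original text, which would otherwise lower its length-normalized term frequencies.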
Functions
Computes the normalization value for a field at index-time.
Computes any collection-level weight (e.g. IDF, average document length) needed for scoring a query.
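For intuition, the kind of collection-level values mentioned above can be derived once per query from aggregate counts, before any document is scored. The sketch below is an assumption-laden illustration: `CollectionStatsSketch`, its method names and its inputs are hypothetical, and the IDF variant shown is the one from the cited paper's I(n) model rather than a statement of what this function computes.

```java
// Hypothetical sketch; not the library's implementation of collection-level weights.
final class CollectionStatsSketch {
    private CollectionStatsSketch() {}

    /** Average field length: total number of tokens in the field divided by documents having it. */
    static double averageFieldLength(long totalTokens, long docCount) {
        if (docCount <= 0) {
            throw new IllegalArgumentException("docCount must be positive");
        }
        return (double) totalTokens / docCount;
    }

    /** Inverse document frequency in the I(n) form: log2((N + 1) / (n + 0.5)). */
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 0.5)) / Math.log(2.0);
    }

    public static void main(String[] args) {
        System.out.println(averageFieldLength(1000, 10));
        System.out.println(idf(1000, 25));
    }
}
```

Both values depend only on the collection and the query term, so computing them up front avoids repeating the work for every matching document.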