IBSimilarity

class IBSimilarity(val distribution: Distribution, val lambda: Lambda, val normalization: Normalization, discountOverlaps: Boolean = true) : SimilarityBase

Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.

The retrieval function is of the form RSV(q, d) = Σ -x^q_w log Prob(X_w ≥ t^d_w | λ_w), where

  • x^q_w is the query boost;

  • X_w is a random variable that counts the occurrences of word w;

  • t^d_w is the normalized term frequency;

  • λ_w is a parameter.
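To make the formula concrete, here is a minimal sketch that evaluates it for the log-logistic case, where the tail probability has the closed form Prob(X ≥ t | λ) = λ / (t + λ). The names `QueryTerm`, `logLogisticTail` and `rsv` are illustrative assumptions and are not part of this API; the per-term inputs are supplied by hand rather than read from an index.

```kotlin
import kotlin.math.ln

// Illustrative per-term inputs; not a type from this library.
data class QueryTerm(
    val boost: Double,   // query boost
    val tfn: Double,     // normalized term frequency in the document
    val lambda: Double,  // distribution parameter for the term
)

// Log-logistic tail probability: Prob(X >= t | lambda) = lambda / (t + lambda).
fun logLogisticTail(tfn: Double, lambda: Double): Double = lambda / (tfn + lambda)

// RSV(q, d): sum over query terms of -boost * log(tail probability).
fun rsv(terms: List<QueryTerm>): Double =
    terms.sumOf { -it.boost * ln(logLogisticTail(it.tfn, it.lambda)) }
```

For a single term with boost 1, normalized frequency 3 and λ = 1, the tail probability is 1/4, so the term contributes -log(1/4) = log 4 to the score.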

The framework described in the paper has many similarities to the DFR framework (see DFRSimilarity). The two Similarities may be merged at some point.

To construct an IBSimilarity, you must specify the implementations for all three components of the Information-Based model.

  1. Distribution: Probabilistic distribution used to model term occurrence

    • Log-logistic

    • Smoothed power-law

  2. Lambda: λ_w parameter of the probability distribution

    • LambdaDF: N_w/N or average number of documents where w occurs

    • LambdaTTF: F_w/N or average number of occurrences of w in the collection

  3. Normalization: Term frequency normalization
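How the three components fit together can be sketched as follows. This is a self-contained illustration under stated assumptions, not the library's implementation: `TermStats`, `LambdaSketch`, `NormalizationSketch`, `DistributionSketch` and `IbSketch` are hypothetical stand-ins, the length-ratio normalization is chosen only for simplicity, and the two lambda estimates follow the ratios given in the list above without any smoothing the real classes may apply.

```kotlin
import kotlin.math.ln

// Hypothetical stand-ins for the three components; the library's own
// Distribution, Lambda and Normalization types may look different.
data class TermStats(val docFreq: Long, val totalTermFreq: Long, val numDocs: Long)

fun interface LambdaSketch { fun lambda(stats: TermStats): Double }
fun interface NormalizationSketch { fun tfn(tf: Double, docLen: Double, avgDocLen: Double): Double }
fun interface DistributionSketch { fun score(tfn: Double, lambda: Double): Double }

// The two lambda estimates as described in the list above.
val lambdaDF = LambdaSketch { s -> s.docFreq.toDouble() / s.numDocs }
val lambdaTTF = LambdaSketch { s -> s.totalTermFreq.toDouble() / s.numDocs }

// Simple length normalization, chosen for illustration only.
val lengthRatio = NormalizationSketch { tf, docLen, avgDocLen -> tf * avgDocLen / docLen }

// Log-logistic: -log Prob(X >= tfn | lambda), with Prob = lambda / (tfn + lambda).
val logLogistic = DistributionSketch { tfn, lambda -> -ln(lambda / (tfn + lambda)) }

// Wires the three pluggable parts into a per-term score.
class IbSketch(
    private val distribution: DistributionSketch,
    private val lambda: LambdaSketch,
    private val normalization: NormalizationSketch,
) {
    fun score(boost: Double, stats: TermStats, tf: Double, docLen: Double, avgDocLen: Double): Double =
        boost * distribution.score(normalization.tfn(tf, docLen, avgDocLen), lambda.lambda(stats))
}
```

Keeping each component behind its own small interface means any distribution can be combined with any lambda estimate and any normalization, which is the point of the three-part constructor.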

Constructors

constructor(distribution: Distribution, lambda: Lambda, normalization: Normalization, discountOverlaps: Boolean = true)

Properties

val discountOverlaps: Boolean

True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
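The effect of this flag on the document length can be sketched as below. `fieldLength` is a hypothetical helper, not a function of this library, and it assumes the position increments of a field's tokens are available as a plain list.

```kotlin
// Field length from per-token position increments, optionally ignoring
// overlap tokens (position increment 0, e.g. synonyms stacked on one position).
fun fieldLength(positionIncrements: List<Int>, discountOverlaps: Boolean): Int {
    val overlaps = positionIncrements.count { it == 0 }
    return if (discountOverlaps) positionIncrements.size - overlaps else positionIncrements.size
}
```

For increments `[1, 1, 0, 1]`, the length is 3 when overlaps are discounted and 4 otherwise.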

val distribution: Distribution

The probabilistic distribution used to model term occurrence.

val lambda: Lambda

The lambda (λ_w) parameter.

val normalization: Normalization

The term frequency normalization.

Functions

computeNorm

Computes the normalization value for a field at index-time.

open override fun scorer(boost: Float, collectionStats: CollectionStatistics, vararg termStats: TermStatistics): Similarity.SimScorer

Computes any collection-level weight (e.g. IDF or average document length) needed for scoring a query.
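As a rough illustration of what "collection-level weight" means, the sketch below derives two such values from aggregate counts. `FieldStats`, `averageFieldLength` and `bm25Idf` are hypothetical names, not this library's CollectionStatistics or TermStatistics API, and the IDF shown is the BM25 variant, used here only as a familiar example; IBSimilarity itself relies on its Lambda component instead.

```kotlin
import kotlin.math.ln

// Illustrative container; not the library's CollectionStatistics or TermStatistics.
data class FieldStats(val docCount: Long, val sumTotalTermFreq: Long)

// Average field length: total number of tokens divided by documents with the field.
fun averageFieldLength(stats: FieldStats): Double =
    stats.sumTotalTermFreq.toDouble() / stats.docCount

// One common IDF variant (the BM25 form); other similarities use other definitions.
fun bm25Idf(docFreq: Long, docCount: Long): Double =
    ln(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
```

Both values depend only on the collection and the query terms, so they can be computed once per query rather than once per document.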

open override fun toString(): String

The names of IB methods follow the pattern IB <distribution> <lambda><normalization>. The name of the distribution is the same as in the original paper; for the names of the lambda parameters, refer to the documentation of the Lambda classes.
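The pattern can be illustrated with a small helper. `ibName` is hypothetical, and the component labels passed to it in the usage note are placeholders; the actual strings come from each component's own toString(), and the exact separators used by the implementation may differ from this sketch.

```kotlin
// Builds a name following the documented pattern "IB <distribution> <lambda><normalization>".
fun ibName(distribution: String, lambda: String, normalization: String): String =
    "IB $distribution $lambda$normalization"
```

With placeholder labels `"LL"`, `"D"` and `"2"`, the helper returns `"IB LL D2"`.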