DFISimilarity

class DFISimilarity(val independence: Independence, discountOverlaps: Boolean = true) : SimilarityBase

Implements the Divergence from Independence (DFI) model based on chi-square statistics (i.e., the standardized chi-squared distance from independence in term frequency tf).

DFI is both parameter-free and non-parametric:

  • parameter-free: it does not require any parameter tuning or training.

  • non-parametric: it does not make any assumptions about word frequency distributions on document collections.
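The scoring idea can be illustrated with a small, self-contained sketch. It is not this class's source code. It assumes the formulation used by the upstream Lucene implementation (expected frequency computed from collection totals, a zero score when the observed frequency does not exceed it, and a base-2 logarithm of the divergence), which this page does not itself spell out. `IndependenceMeasure`, `standardized`, and `dfiScore` are hypothetical names introduced only for illustration.

```kotlin
import kotlin.math.log2
import kotlin.math.sqrt

// Hypothetical stand-in for the library's Independence parameter: maps
// (observed tf, expected tf) to a divergence value.
fun interface IndependenceMeasure {
    fun score(freq: Double, expected: Double): Double
}

// Standardized measure (assumed form): (observed - expected) / sqrt(expected).
val standardized = IndependenceMeasure { freq, expected ->
    (freq - expected) / sqrt(expected)
}

// Sketch of a DFI-style score for one term in one document.
// Note that no tunable parameters (such as k1 or b in BM25) appear.
fun dfiScore(
    freq: Double,              // term frequency in the document
    docLen: Double,            // document length in tokens
    totalTermFreq: Long,       // occurrences of the term in the whole field
    numberOfFieldTokens: Long, // total number of tokens in the field
    measure: IndependenceMeasure = standardized,
    boost: Double = 1.0,
): Double {
    // Frequency expected if the term were distributed independently of documents.
    val expected = (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1)
    // A term that occurs no more often than expected contributes nothing.
    if (freq <= expected) return 0.0
    return boost * log2(measure.score(freq, expected) + 1)
}
```

Under these assumptions, a very common term tends to score zero on its own: `dfiScore(5.0, 100.0, 99_999, 999_999)` has an expected frequency of 10, so the observed 5 yields `0.0`, while the same observation for a rare term (`totalTermFreq = 999`) yields a positive score. This is one way to read the advice about keeping stopwords.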

It is highly recommended not to remove stopwords (very common terms such as the, of, and, to, a, in, for, is, on, that, etc.) when using this similarity.

For more information, see: "A nonparametric term weighting method for information retrieval based on measuring the divergence from independence".

Constructors

constructor(independence: Independence, discountOverlaps: Boolean = true)

Properties

True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
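As a rough illustration of what discounting overlaps means for the length used in normalization, the following sketch counts tokens with and without the discount. `Token` and `fieldLength` are hypothetical names for this example only and are not part of the library.

```kotlin
// Hypothetical token model: a term plus its position increment.
// An increment of 0 means the token shares a position with the previous one
// (for example, an injected synonym).
data class Token(val term: String, val positionIncrement: Int)

// Length used for normalization: every token, or only the tokens that
// advance the position when overlaps are discounted.
fun fieldLength(tokens: List<Token>, discountOverlaps: Boolean): Int {
    val overlaps = tokens.count { it.positionIncrement == 0 }
    return if (discountOverlaps) tokens.size - overlaps else tokens.size
}
```

For the tokens `quick` (increment 1), `fast` (increment 0), and `fox` (increment 1), the sketch gives a length of 2 with discounting and 3 without it.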

Functions

Computes the normalization value for a field at index-time.

open override fun scorer(boost: Float, collectionStats: CollectionStatistics, vararg termStats: TermStatistics): Similarity.SimScorer

Computes any collection-level weight (e.g., IDF, average document length, etc.) needed for scoring a query.

open override fun toString(): String

Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.