IndriDirichletSimilarity

class IndriDirichletSimilarity : LMSimilarity

Bayesian smoothing using Dirichlet priors, as implemented in the Indri search engine (http://www.lemurproject.org/indri.php). Indri Dirichlet smoothing is defined as:

P(t|E) = (tf_E + mu*P(t|D)) / (documentLength + documentMu)

where P(t|D) = (mu*P(t|C) + tf_D) / (doclen + mu)

A larger value of mu produces more smoothing. Smoothing matters most for short documents, where the estimated term probabilities are more granular.
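As a minimal sketch of the inner formula above, the following standalone Kotlin function computes the smoothed probability P(t|D) for a short and a long document that share the same raw term ratio. The helper name `dirichletSmoothedProbability`, its parameter names, and the sample statistics are illustrative assumptions and are not part of this class's API.

```kotlin
import kotlin.math.ln

// Hypothetical helper (not part of IndriDirichletSimilarity):
// P(t|D) = (mu * P(t|C) + tf_D) / (doclen + mu)
fun dirichletSmoothedProbability(
    termFreq: Float,
    docLen: Float,
    collectionProbability: Float,
    mu: Float = 2000.0f,
): Float {
    require(mu >= 0f) { "mu must be non-negative" }
    require(docLen + mu > 0f) { "docLen + mu must be positive" }
    return (mu * collectionProbability + termFreq) / (docLen + mu)
}

fun main() {
    val pC = 0.001f // assumed collection probability P(t|C)

    // Both documents have a raw term ratio of 10%.
    val shortDoc = dirichletSmoothedProbability(termFreq = 1f, docLen = 10f, collectionProbability = pC)
    val longDoc = dirichletSmoothedProbability(termFreq = 1000f, docLen = 10000f, collectionProbability = pC)

    // The short document is pulled much closer to P(t|C) than the long one.
    println("short: p = $shortDoc, log p = ${ln(shortDoc)}")
    println("long:  p = $longDoc, log p = ${ln(longDoc)}")
}
```

With the default mu of 2000, the short document's estimate is dominated by the collection probability, while the long document's estimate stays close to its observed ratio.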

Constructors

IndriDirichletSimilarity

constructor(collectionModel: LMSimilarity.CollectionModel, discountOverlaps: Boolean, mu: Float)

Instantiates the similarity with the provided parameters.

@JvmOverloads

constructor(collectionModel: LMSimilarity.CollectionModel = IndriCollectionModel(), mu: Float = 2000.0f)

Instantiates the similarity with the given collection model and mu. The collection model defaults to IndriCollectionModel and mu defaults to 2000.

constructor(mu: Float)

Instantiates the similarity with the provided mu parameter.

Types

IndriCollectionModel

class IndriCollectionModel : LMSimilarity.CollectionModel

Models p(w|C) as the number of occurrences of the term in the collection, divided by the total number of tokens + 1.
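The description above can be sketched as a single division. This is a standalone illustration under the assumption that the two inputs are the term's total collection frequency and the field's total token count; the function name `indriCollectionProbability` and its parameters are hypothetical and do not reflect the actual CollectionModel interface.

```kotlin
// Hypothetical stand-in for the collection model described above:
// p(w|C) = (occurrences of the term in the collection) / (total tokens + 1)
fun indriCollectionProbability(totalTermFreq: Long, numberOfFieldTokens: Long): Double {
    require(totalTermFreq >= 0 && numberOfFieldTokens >= 0) { "counts must be non-negative" }
    // The + 1 keeps the denominator non-zero even for an empty field.
    return totalTermFreq.toDouble() / (numberOfFieldTokens + 1L).toDouble()
}

fun main() {
    // Assumed sample statistics: a term seen 5 times in a field of 999 tokens.
    println(indriCollectionProbability(totalTermFreq = 5, numberOfFieldTokens = 999))
}
```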

Properties

discountOverlaps

val discountOverlaps: Boolean

True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
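To illustrate what discounting means for the document length, here is a minimal standalone sketch. It follows the description above rather than the library's actual normalization code; the function `effectiveLength` and its parameters are hypothetical.

```kotlin
// Hypothetical helper: the document length that would feed into scoring.
// Overlap tokens are tokens with a position increment of zero,
// for example synonyms injected at the same position as the original token.
fun effectiveLength(totalTokens: Int, overlapTokens: Int, discountOverlaps: Boolean): Int {
    require(overlapTokens in 0..totalTokens) { "overlapTokens must be between 0 and totalTokens" }
    return if (discountOverlaps) totalTokens - overlapTokens else totalTokens
}

fun main() {
    // Assumed example: 12 indexed tokens, 2 of which are same-position synonyms.
    println(effectiveLength(totalTokens = 12, overlapTokens = 2, discountOverlaps = true))
    println(effectiveLength(totalTokens = 12, overlapTokens = 2, discountOverlaps = false))
}
```

Because the document length appears in the denominator of the smoothing formula, discounting overlaps keeps injected synonyms from diluting a document's term probabilities.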

mu

val mu: Float

The mu smoothing parameter.

name

open override val name: String

Returns the name of the LM method. The values of the parameters should be included as well.

Functions

computeNorm

open fun computeNorm(state: FieldInvertState): Long

Computes the normalization value for a field at index-time.

scorer

open override fun scorer(boost: Float, collectionStats: CollectionStatistics, vararg termStats: TermStatistics): Similarity.SimScorer

Computes any collection-level weight (e.g. IDF, average document length, etc.) needed for scoring a query.

toString

open override fun toString(): String

Returns the name of the LM method. If a custom collection model strategy is used, its name is included as well.