core/org.gnit.lucenekmp.search.similarities

Package-level declarations

Types

AfterEffect

abstract class AfterEffect

This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. This component is also called the after effect and is defined by the formula Inf₂ = 1 - Prob₂, where Prob₂ measures the information gain.

AfterEffectB

class AfterEffectB : AfterEffect

Model of the information gain based on the ratio of two Bernoulli processes.

AfterEffectL

class AfterEffectL : AfterEffect

Model of the information gain based on Laplace's law of succession.
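
As a rough illustration of the after effect Inf₂ = 1 - Prob₂ under Laplace's law of succession, the sketch below uses the form found in the DFR literature, where Prob₂ = tfn / (tfn + 1) and tfn is the normalized term frequency. The function name is hypothetical and not part of this package; the actual AfterEffectL code may arrange the computation differently.

```kotlin
// Hypothetical helper, not part of the package API.
// Laplace's law of succession as used in the DFR literature:
// Prob2 = tfn / (tfn + 1), so Inf2 = 1 - Prob2 = 1 / (tfn + 1).
fun laplaceAfterEffect(tfn: Double): Double {
    require(tfn >= 0.0) { "tfn must be non-negative" }
    return 1.0 / (tfn + 1.0)
}
```

The value shrinks as tfn grows, so each additional occurrence of a term contributes less than the previous one.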

Axiomatic

abstract class Axiomatic(discountOverlaps: Boolean, s: Float, queryLen: Int, k: Float) : SimilarityBase

Axiomatic approaches for IR. From Hui Fang and Chengxiang Zhai 2005. An Exploration of Axiomatic Approaches to Information Retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '05). ACM, New York, NY, USA, 480-487.

AxiomaticF1EXP

class AxiomaticF1EXP : Axiomatic

F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of documents, and df = document frequency.

AxiomaticF1LOG

class AxiomaticF1LOG : Axiomatic

F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of documents, and df = document frequency.
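
The EXP and LOG variants of the axiomatic models differ only in their IDF term. A minimal sketch of the two definitions quoted in these entries, using hypothetical helper names that are not part of the package:

```kotlin
import kotlin.math.ln
import kotlin.math.pow

// Hypothetical helpers mirroring the two IDF definitions in the text.
// n = total number of documents, df = document frequency of the term.

// EXP variants: IDF(t) = pow((N+1)/df(t), k)
fun idfExp(n: Long, df: Long, k: Double): Double {
    require(df > 0) { "df must be positive" }
    return ((n + 1.0) / df).pow(k)
}

// LOG variants: IDF(t) = ln((N+1)/df(t))
fun idfLog(n: Long, df: Long): Double {
    require(df > 0) { "df must be positive" }
    return ln((n + 1.0) / df)
}
```

For example, with 99 documents, a term that occurs in a single document, and k = 0.5, idfExp evaluates pow(100, 0.5).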

AxiomaticF2EXP

class AxiomaticF2EXP : Axiomatic

F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of documents, and df = document frequency.

AxiomaticF2LOG

class AxiomaticF2LOG : Axiomatic

F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of documents, and df = document frequency.

AxiomaticF3EXP

class AxiomaticF3EXP(s: Float, queryLen: Int, k: Float = 0.35f) : Axiomatic

F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total number of documents, df = document frequency, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.

AxiomaticF3LOG

class AxiomaticF3LOG(s: Float, queryLen: Int) : Axiomatic

F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total number of documents, df = document frequency, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
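
To see why the F3 variants can yield negative scores, it may help to evaluate gamma directly. The sketch below reads the formula as (docLen - queryLen) * queryLen * s / avdl, with avdl taken to be the average document length; the function is a hypothetical helper and not the library's implementation.

```kotlin
// Hypothetical helper; parameter names follow the formula in the text.
// gamma(docLen, queryLen) = (docLen - queryLen) * queryLen * s / avdl
fun gamma(docLen: Double, queryLen: Int, s: Double, avdl: Double): Double {
    require(avdl > 0.0) { "avdl must be positive" }
    return (docLen - queryLen) * queryLen * s / avdl
}
```

Because gamma is subtracted from tf*IDF and grows with document length, a long document with a small tf*IDF term can end up with a score below zero.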

BasicModel

abstract class BasicModel

This class acts as the base class for the specific basic model implementations in the DFR framework. Basic models compute the informative content Inf₁ = -log₂ Prob₁.

BasicModelG

class BasicModelG : BasicModel

Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs slightly from the one in the original paper: F is increased by 1 and N is increased by F.

BasicModelIF

class BasicModelIF : BasicModel

An approximation of the I(n_e) model.

BasicModelIn

class BasicModelIn : BasicModel

The basic tf-idf model of randomness.

BasicModelIne

class BasicModelIne : BasicModel

Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.

BasicStats

open class BasicStats(val field: String?, val boost: Double)

Stores all statistics commonly used by ranking methods.

BM25Similarity

class BM25Similarity(k1: Float = 1.2f, b: Float = 0.75f, discountOverlaps: Boolean = true) : Similarity

BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
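
For orientation, here is a sketch of the textbook BM25 term weight with the same default k1 and b as the constructor above. It is an illustration only: the function names are hypothetical, the IDF shown is one commonly used non-negative variant, and this class may differ in details such as constant factors or how document length is encoded in norms.

```kotlin
import kotlin.math.ln

// Textbook BM25 term weight; an illustration, not this class's code.
// numDocs: documents in the collection, docFreq: documents containing the term.
fun bm25Idf(numDocs: Long, docFreq: Long): Double =
    ln(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5))

fun bm25TermScore(
    tf: Double,
    docLen: Double,
    avgDocLen: Double,
    numDocs: Long,
    docFreq: Long,
    k1: Double = 1.2,
    b: Double = 0.75
): Double {
    // b controls how strongly the document length is normalized.
    val lengthNorm = 1.0 - b + b * docLen / avgDocLen
    // k1 controls how quickly the term-frequency contribution saturates.
    return bm25Idf(numDocs, docFreq) * tf * (k1 + 1.0) / (tf + k1 * lengthNorm)
}
```

With tf = 1 and a document of exactly average length, the term-frequency factor is 1 and the score reduces to the IDF.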

BooleanSimilarity

class BooleanSimilarity : Similarity

Simple similarity that gives terms a score that is equal to their query boost. This similarity is typically used with disabled norms since neither document statistics nor index statistics are used for scoring. That said, if norms are enabled, they will be computed the same way as SimilarityBase and BM25Similarity with SimilarityBase.getDiscountOverlaps so that the Similarity can be changed after the index has been created.

ClassicSimilarity

open class ClassicSimilarity : TFIDFSimilarity

Expert: Historical scoring implementation. You might want to consider using BM25Similarity instead, which is generally considered superior to TF-IDF.

DFISimilarity

class DFISimilarity(val independence: Independence, discountOverlaps: Boolean = true) : SimilarityBase

Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).

DFRSimilarity

class DFRSimilarity : SimilarityBase

Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.
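
In the notation used elsewhere on this page (Inf₁ for the basic model, Inf₂ for the after effect), the general DFR term weight from the cited paper can be written as the product below, where tfn denotes the term frequency after normalization. This restates the general form of the framework, not the code of any specific class.

```latex
w(t, d) = \mathrm{Inf}_1(tfn) \cdot \mathrm{Inf}_2(tfn)
        = \bigl(-\log_2 \mathrm{Prob}_1(tfn)\bigr) \cdot \bigl(1 - \mathrm{Prob}_2(tfn)\bigr)
```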

Distribution

abstract class Distribution

The probabilistic distribution used to model term occurrence in information-based models.

DistributionLL

class DistributionLL : Distribution

Log-logistic distribution.
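
A minimal sketch of the information content under a log-logistic model, assuming the form -ln(λ / (tfn + λ)) used in the information-based models literature. The helper name is hypothetical; tfn is the normalized term frequency and lambda is the λ parameter described under Lambda.

```kotlin
import kotlin.math.ln

// Hypothetical helper: information content under a log-logistic model,
// -ln(lambda / (tfn + lambda)).
fun logLogisticInfo(tfn: Double, lambda: Double): Double {
    require(lambda > 0.0) { "lambda must be positive" }
    require(tfn >= 0.0) { "tfn must be non-negative" }
    return -ln(lambda / (tfn + lambda))
}
```

The result is zero when the term does not occur and increases with tfn.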

DistributionSPL

class DistributionSPL : Distribution

The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper (Clinchant and Gaussier 2010; see IBSimilarity).

IBSimilarity

class IBSimilarity(val distribution: Distribution, val lambda: Lambda, val normalization: Normalization, discountOverlaps: Boolean = true) : SimilarityBase

Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.

Independence

abstract class Independence

Computes the measure of divergence from independence for DFI scoring functions.

IndependenceChiSquared

class IndependenceChiSquared : Independence

Normalized chi-squared measure of distance from independence

IndependenceSaturated

class IndependenceSaturated : Independence

Saturated measure of distance from independence

IndependenceStandardized

class IndependenceStandardized : Independence

Standardized measure of distance from independence

IndriDirichletSimilarity

class IndriDirichletSimilarity : LMSimilarity

Bayesian smoothing using Dirichlet priors as implemented in the Indri search engine (http://www.lemurproject.org/indri.php). Indri Dirichlet smoothing.

Lambda

abstract class Lambda

The lambda (λw) parameter in information-based models.

LambdaDF

class LambdaDF : Lambda

Computes lambda as (docFreq+1) / (numberOfDocuments+1).

LambdaTTF

class LambdaTTF : Lambda

Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1).
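
The two lambda definitions differ only in the numerator. A short sketch with hypothetical helper names, making explicit that both the numerator and the denominator are incremented by one before dividing:

```kotlin
// Hypothetical helpers, not part of the package API.

// LambdaDF: (docFreq + 1) / (numberOfDocuments + 1)
fun lambdaDf(docFreq: Long, numberOfDocuments: Long): Double =
    (docFreq + 1.0) / (numberOfDocuments + 1.0)

// LambdaTTF: (totalTermFreq + 1) / (numberOfDocuments + 1)
fun lambdaTtf(totalTermFreq: Long, numberOfDocuments: Long): Double =
    (totalTermFreq + 1.0) / (numberOfDocuments + 1.0)
```

Since a term can occur several times in one document, lambdaTtf is never smaller than lambdaDf for the same term.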

LMDirichletSimilarity

class LMDirichletSimilarity : LMSimilarity

Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.

LMJelinekMercerSimilarity

class LMJelinekMercerSimilarity : LMSimilarity

Language model based on the Jelinek-Mercer smoothing method. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
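
The two smoothing methods from the cited paper estimate the probability of a term in a document by mixing the document's own counts with a collection-wide probability. The sketch below shows the textbook estimators only; the function and parameter names are hypothetical, and the similarity classes above may turn these probabilities into scores differently (for example, through logarithms of ratios).

```kotlin
// Textbook smoothing formulas from the language-modeling literature.
// collectionProb is p(t|C), the term's probability in the whole collection.

// Jelinek-Mercer: linear interpolation controlled by lambda in [0, 1].
fun jelinekMercer(tf: Double, docLen: Double, collectionProb: Double, lambda: Double): Double =
    (1.0 - lambda) * (tf / docLen) + lambda * collectionProb

// Dirichlet prior: mu acts like a pseudo-count of collection text
// added to every document.
fun dirichlet(tf: Double, docLen: Double, collectionProb: Double, mu: Double): Double =
    (tf + mu * collectionProb) / (docLen + mu)
```

With Dirichlet smoothing the amount of smoothing depends on document length: short documents lean more on the collection model than long ones.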

LMSimilarity

abstract class LMSimilarity @JvmOverloads constructor(collectionModel: LMSimilarity.CollectionModel = DefaultCollectionModel(), discountOverlaps: Boolean = true) : SimilarityBase

Abstract superclass for language modeling Similarities. The following inner types are introduced:

MultiSimilarity

class MultiSimilarity(sims: Array<Similarity>) : Similarity

Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. Shaw, Edward A. Fox. In Text REtrieval Conference (1993), pp. 243-252
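
CombSUM combines evidence by adding up the individual scores. A minimal sketch of that idea, where Scorer is a hypothetical stand-in for a similarity's per-document scoring function and not a type from this package:

```kotlin
// Scorer is a hypothetical stand-in for a similarity's per-document score.
typealias Scorer = (freq: Float, norm: Long) -> Float

// CombSUM: the combined score is the sum of the individual scores.
fun combSum(scorers: List<Scorer>, freq: Float, norm: Long): Float {
    var sum = 0.0
    for (scorer in scorers) {
        sum += scorer(freq, norm)
    }
    return sum.toFloat()
}
```

Accumulating in a Double and converting once at the end is a design choice of this sketch to limit rounding error when many scorers are combined.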

Normalization

abstract class Normalization

This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.

NormalizationH1

class NormalizationH1 : Normalization

Normalization model that assumes a uniform distribution of the term frequency.

NormalizationH2

class NormalizationH2 : Normalization

Normalization model in which the term frequency is inversely related to the length.
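
The H1 and H2 normalizations can be sketched as follows, using the forms from the DFR literature. The helper names are hypothetical, and the tuning parameter c (defaulting to 1) is an assumption of this sketch rather than a documented parameter of the classes above.

```kotlin
import kotlin.math.log2

// Hypothetical helpers following the DFR literature; c is an assumed tuning parameter.

// H1: tfn = tf * c * avgLen / docLen
fun normalizeH1(tf: Double, docLen: Double, avgLen: Double, c: Double = 1.0): Double =
    tf * c * avgLen / docLen

// H2: tfn = tf * log2(1 + c * avgLen / docLen)
fun normalizeH2(tf: Double, docLen: Double, avgLen: Double, c: Double = 1.0): Double =
    tf * log2(1.0 + c * avgLen / docLen)
```

Both functions reduce the effective term frequency for documents longer than average; H2 does so more gently because of the logarithm.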

NormalizationH3

class NormalizationH3 : Normalization

Dirichlet Priors normalization

NormalizationZ

class NormalizationZ : Normalization

Pareto-Zipf Normalization

PerFieldSimilarityWrapper

abstract class PerFieldSimilarityWrapper : Similarity

Provides the ability to use a different Similarity for different fields.

RawTFSimilarity

class RawTFSimilarity : Similarity

Similarity that returns the raw TF as score.

Similarity

abstract class Similarity

Similarity defines the components of Lucene scoring.

SimilarityBase

abstract class SimilarityBase : Similarity

A subclass of Similarity that provides a simplified API for its descendants. Subclasses are only required to implement the .score and .toString methods. Implementing .explain is optional, inasmuch as SimilarityBase already provides a basic explanation of the score and the term frequency. However, implementers of a subclass are encouraged to include as much detail about the scoring method as possible.

TFIDFSimilarity

abstract class TFIDFSimilarity : Similarity

Implementation of Similarity with the Vector Space Model.