Package-level declarations

Types

abstract class AfterEffect

This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. This component is also called the after effect and is defined by the formula Inf2 = 1 - Prob2, where Prob2 measures the information gain.

Model of the information gain based on the ratio of two Bernoulli processes.

Model of the information gain based on Laplace's law of succession.
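As a rough illustration of the after effect Inf2 = 1 - Prob2, the sketch below assumes the usual Laplace form from the DFR literature, Prob2 = tfn / (tfn + 1), which gives Inf2 = 1 / (tfn + 1). The function name `laplaceAfterEffect` and the parameter `tfn` (normalized term frequency) are hypothetical and are not taken from this package.

```kotlin
// Hypothetical helper, not part of this package's API.
// Assumes Prob2 = tfn / (tfn + 1), so Inf2 = 1 - Prob2 = 1 / (tfn + 1).
fun laplaceAfterEffect(tfn: Double): Double = 1.0 / (tfn + 1.0)

fun main() {
    // The information gain shrinks as the normalized term frequency grows.
    for (tfn in listOf(0.0, 1.0, 3.0)) {
        println("tfn=$tfn inf2=${laplaceAfterEffect(tfn)}")
    }
}
```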

abstract class Axiomatic(discountOverlaps: Boolean, s: Float, queryLen: Int, k: Float) : SimilarityBase

Axiomatic approaches for IR. From Hui Fang and Chengxiang Zhai 2005. An Exploration of Axiomatic Approaches to Information Retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '05). ACM, New York, NY, USA, 480-487.

F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of documents, and df = document frequency.

F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of documents, and df = document frequency.

F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of documents, and df = document frequency.

F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of documents, and df = document frequency.
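The two IDF variants that appear in the axiomatic formulas above can be sketched as follows. This is a minimal illustration of the stated formulas only. The function names are hypothetical, and the default k = 0.35 is an assumption borrowed from the AxiomaticF3EXP signature on this page.

```kotlin
import kotlin.math.ln
import kotlin.math.pow

// Hypothetical helpers that transcribe the IDF formulas quoted above.

// EXP variants: IDF(t) = pow((N+1)/df(t), k). The default k is an assumption.
fun idfExp(numDocs: Long, docFreq: Long, k: Double = 0.35): Double =
    ((numDocs + 1).toDouble() / docFreq).pow(k)

// LOG variants: IDF(t) = ln((N+1)/df(t)).
fun idfLog(numDocs: Long, docFreq: Long): Double =
    ln((numDocs + 1).toDouble() / docFreq)

fun main() {
    val numDocs = 999L
    // Rarer terms (smaller df) receive a larger IDF under both variants.
    for (df in listOf(1L, 10L, 100L)) {
        println("df=$df exp=${idfExp(numDocs, df)} log=${idfLog(numDocs, df)}")
    }
}
```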

class AxiomaticF3EXP(s: Float, queryLen: Int, k: Float = 0.35f) : Axiomatic

F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total number of documents, df = document frequency, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.

class AxiomaticF3LOG(s: Float, queryLen: Int) : Axiomatic

F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total number of documents, df = document frequency, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
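To see why the gamma term can produce negative scores, the sketch below reads the formula as gamma = (docLen - queryLen) * queryLen * s / avdl, with `s` being the constructor parameter of the same name. The function name and the sample values are hypothetical. Because gamma is subtracted from tf * IDF, any document longer than the query contributes a positive penalty that can exceed the term weight.

```kotlin
// Hypothetical helper transcribing the gamma term of the F3 formulas.
fun gamma(docLen: Double, queryLen: Int, s: Double, avgDocLen: Double): Double =
    (docLen - queryLen) * queryLen * s / avgDocLen

fun main() {
    // Illustrative values only.
    val penalty = gamma(docLen = 200.0, queryLen = 3, s = 0.25, avgDocLen = 100.0)
    val termWeight = 1.0 // stands in for tf * IDF of a weakly matching term
    // A penalty larger than the term weight yields a negative contribution.
    println("penalty=$penalty contribution=${termWeight - penalty}")
}
```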

abstract class BasicModel

This class acts as the base class for the specific basic model implementations in the DFR framework. Basic models compute the informative content Inf1 = -log2(Prob1).
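The informative content can be sketched directly from that definition. The function name `informativeContent` is hypothetical, and how a concrete basic model estimates Prob1 is left out here.

```kotlin
import kotlin.math.log2

// Hypothetical helper: Inf1 = -log2(Prob1) for a probability in (0, 1].
fun informativeContent(prob1: Double): Double {
    require(prob1 > 0.0 && prob1 <= 1.0) { "prob1 must be in (0, 1]" }
    return -log2(prob1)
}

fun main() {
    // Less probable term occurrences carry more information (more bits).
    for (p in listOf(0.5, 0.25, 0.01)) {
        println("prob1=$p inf1=${informativeContent(p)}")
    }
}
```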

Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs slightly from the one in the original paper: F is increased by 1 and N is increased by F.

An approximation of the I(ne) model.

The basic tf-idf model of randomness.

Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.

open class BasicStats(val field: String?, val boost: Double)

Stores all statistics commonly used by ranking methods.

class BM25Similarity(k1: Float = 1.2f, b: Float = 0.75f, discountOverlaps: Boolean = true) : Similarity

BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
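For orientation, the sketch below shows the textbook Okapi BM25 term weight with the `k1` and `b` defaults from the signature above. It is an assumption-laden illustration, not this class's implementation: the function names are hypothetical, the IDF uses the common smoothed form ln(1 + (N - n + 0.5) / (n + 0.5)), and the actual class may differ in constant factors and in how document length is encoded in norms.

```kotlin
import kotlin.math.ln

// Hypothetical helpers illustrating the textbook BM25 formula.

// Smoothed IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).
fun bm25Idf(docFreq: Long, docCount: Long): Double =
    ln(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5))

// Term weight: idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)).
fun bm25TermScore(
    freq: Double,
    docLen: Double,
    avgDocLen: Double,
    idf: Double,
    k1: Double = 1.2,
    b: Double = 0.75,
): Double {
    val lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen)
    return idf * freq * (k1 + 1.0) / (freq + lengthNorm)
}

fun main() {
    val idf = bm25Idf(docFreq = 10L, docCount = 1000L)
    // Term frequency saturates: each extra occurrence adds less to the score.
    for (freq in listOf(1.0, 2.0, 4.0, 8.0)) {
        println("freq=$freq score=${bm25TermScore(freq, 100.0, 100.0, idf)}")
    }
}
```

In this form `k1` controls how quickly term frequency saturates and `b` controls how strongly longer-than-average documents are penalized.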

Simple similarity that gives terms a score that is equal to their query boost. This similarity is typically used with disabled norms since neither document statistics nor index statistics are used for scoring. That said, if norms are enabled, they will be computed the same way as [ ] and BM25Similarity with SimilarityBase.getDiscountOverlaps so that the Similarity can be changed after the index has been created.

Expert: Historical scoring implementation. You might want to consider using [ ] instead, which is generally considered superior to TF-IDF.

class DFISimilarity(val independence: Independence, discountOverlaps: Boolean = true) : SimilarityBase

Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).

Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.

abstract class Distribution

The probabilistic distribution used to model term occurrence in information-based models.

Log-logistic distribution.

The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.

class IBSimilarity(val distribution: Distribution, val lambda: Lambda, val normalization: Normalization, discountOverlaps: Boolean = true) : SimilarityBase

Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.

abstract class Independence

Computes the measure of divergence from independence for DFI scoring functions.

Normalized chi-squared measure of distance from independence

Saturated measure of distance from independence

Standardized measure of distance from independence
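The three measures above compare an observed term frequency with the frequency expected under independence. The sketch below uses the standard definitions of these statistics and assumes the caller supplies the expected frequency. The function names are hypothetical and do not reflect this package's method signatures.

```kotlin
import kotlin.math.sqrt

// Hypothetical helpers; `expected` is the term frequency expected under independence.

// Normalized chi-squared: (freq - expected)^2 / expected.
fun chiSquared(freq: Double, expected: Double): Double =
    (freq - expected) * (freq - expected) / expected

// Saturated: (freq - expected) / expected.
fun saturated(freq: Double, expected: Double): Double =
    (freq - expected) / expected

// Standardized: (freq - expected) / sqrt(expected).
fun standardized(freq: Double, expected: Double): Double =
    (freq - expected) / sqrt(expected)

fun main() {
    val freq = 9.0
    val expected = 4.0
    println("chiSquared=${chiSquared(freq, expected)}")
    println("saturated=${saturated(freq, expected)}")
    println("standardized=${standardized(freq, expected)}")
}
```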

Bayesian smoothing using Dirichlet priors as implemented in the Indri Search engine (http://www.lemurproject.org/indri.php). Indri Dirichlet Smoothing!

abstract class Lambda

The lambda (λw) parameter in information-based models.

Computes lambda as (docFreq+1) / (numberOfDocuments+1).

Computes lambda as (totalTermFreq+1) / (numberOfDocuments+1).
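Both lambda estimates are add-one smoothed ratios, with the numerator and the denominator each incremented before dividing. A minimal sketch follows. The function names are hypothetical, and the arguments stand for the collection statistics named in the descriptions.

```kotlin
// Hypothetical helpers transcribing the two lambda formulas.

// Document-frequency based: (docFreq + 1) / (numberOfDocuments + 1).
fun lambdaDF(docFreq: Long, numberOfDocuments: Long): Double =
    (docFreq + 1.0) / (numberOfDocuments + 1.0)

// Total-term-frequency based: (totalTermFreq + 1) / (numberOfDocuments + 1).
fun lambdaTTF(totalTermFreq: Long, numberOfDocuments: Long): Double =
    (totalTermFreq + 1.0) / (numberOfDocuments + 1.0)

fun main() {
    // A term that appears in 9 of 99 documents, 49 times in total.
    println("lambdaDF=${lambdaDF(9L, 99L)} lambdaTTF=${lambdaTTF(49L, 99L)}")
}
```

The added 1 keeps lambda strictly positive even for a term that never occurs.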

Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.

Language model based on the Jelinek-Mercer smoothing method. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
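The two smoothing methods can be compared with a small sketch. It assumes the rank-equivalent log forms commonly used for these models, a collection probability supplied by the caller, and a Dirichlet default of mu = 2000. The function names, the default, and the clamping of negative Dirichlet scores to zero are assumptions of this sketch, not confirmed details of these classes.

```kotlin
import kotlin.math.ln

// Hypothetical helpers; `collectionProb` is the term's probability in the whole collection.

// Dirichlet prior smoothing; mu acts as a pseudo-count of background text.
fun dirichletScore(freq: Double, docLen: Double, collectionProb: Double, mu: Double = 2000.0): Double {
    val score = ln(1.0 + freq / (mu * collectionProb)) + ln(mu / (docLen + mu))
    return if (score > 0.0) score else 0.0 // assumption: negative scores are clamped
}

// Jelinek-Mercer smoothing; lambda is the weight given to the collection model.
fun jelinekMercerScore(freq: Double, docLen: Double, collectionProb: Double, lambda: Double): Double =
    ln(1.0 + ((1.0 - lambda) * freq / docLen) / (lambda * collectionProb))

fun main() {
    val collectionProb = 0.001
    for (freq in listOf(1.0, 5.0)) {
        val d = dirichletScore(freq, docLen = 100.0, collectionProb = collectionProb)
        val jm = jelinekMercerScore(freq, docLen = 100.0, collectionProb = collectionProb, lambda = 0.1)
        println("freq=$freq dirichlet=$d jelinekMercer=$jm")
    }
}
```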

abstract class LMSimilarity @JvmOverloads constructor(collectionModel: LMSimilarity.CollectionModel = DefaultCollectionModel(), discountOverlaps: Boolean = true) : SimilarityBase

Abstract superclass for language modeling Similarities. The following inner types are introduced:

Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. Shaw, Edward A. Fox. In Text REtrieval Conference (1993), pp. 243-252
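CombSUM simply adds the scores produced by each underlying similarity. The sketch below models a similarity as a plain function. The `TermScorer` alias and `combSum` are hypothetical names used for illustration, not types from this package.

```kotlin
// Hypothetical stand-in for a similarity: maps (freq, docLen) to a score.
typealias TermScorer = (freq: Double, docLen: Double) -> Double

// CombSUM: the combined score is the sum of the individual scores.
fun combSum(scorers: List<TermScorer>): TermScorer =
    { freq, docLen -> scorers.fold(0.0) { acc, scorer -> acc + scorer(freq, docLen) } }

fun main() {
    val rawTf: TermScorer = { freq, _ -> freq }
    val lengthNormalizedTf: TermScorer = { freq, docLen -> freq / docLen }
    val combined = combSum(listOf(rawTf, lengthNormalizedTf))
    println(combined(4.0, 8.0))
}
```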

abstract class Normalization

This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.

Normalization model that assumes a uniform distribution of the term frequency.

Normalization model in which the term frequency is inversely related to the length.
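The two length normalizations described above are usually called H1 and H2 in the DFR literature. The sketch below assumes their common forms, tfn = tf * c * avgdl / dl for the uniform model and tfn = tf * log2(1 + c * avgdl / dl) for the inverse-length model. The function names and the free parameter `c` with default 1 are assumptions of this sketch.

```kotlin
import kotlin.math.log2

// Hypothetical helpers for the two DFR term-frequency normalizations.

// Uniform distribution of term frequency (H1): tfn = tf * c * avgdl / dl.
fun normalizeH1(tf: Double, docLen: Double, avgDocLen: Double, c: Double = 1.0): Double =
    tf * c * (avgDocLen / docLen)

// Term frequency inversely related to length (H2): tfn = tf * log2(1 + c * avgdl / dl).
fun normalizeH2(tf: Double, docLen: Double, avgDocLen: Double, c: Double = 1.0): Double =
    tf * log2(1.0 + c * avgDocLen / docLen)

fun main() {
    // Longer documents have their term frequency scaled down by both models.
    for (docLen in listOf(50.0, 100.0, 200.0)) {
        val h1 = normalizeH1(2.0, docLen, avgDocLen = 100.0)
        val h2 = normalizeH2(2.0, docLen, avgDocLen = 100.0)
        println("docLen=$docLen h1=$h1 h2=$h2")
    }
}
```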

Dirichlet Priors normalization

Pareto-Zipf Normalization

Provides the ability to use a different Similarity for different fields.
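The idea of per-field dispatch can be sketched with a lookup table and a fallback. Everything here is hypothetical: `FieldScorer` and `PerFieldScorers` are illustrative stand-ins and do not mirror the signatures of the actual wrapper class.

```kotlin
// Hypothetical stand-in for a similarity: maps a term frequency to a score.
typealias FieldScorer = (freq: Double) -> Double

// Chooses a scorer by field name and falls back to a default for unknown fields.
class PerFieldScorers(
    private val byField: Map<String, FieldScorer>,
    private val default: FieldScorer,
) {
    fun get(field: String): FieldScorer = byField[field] ?: default
}

fun main() {
    val scorers = PerFieldScorers(
        byField = mapOf("title" to { freq: Double -> freq * 2.0 }),
        default = { freq -> freq },
    )
    // "title" uses its own scorer; "body" falls back to the default.
    println("title=${scorers.get("title")(3.0)} body=${scorers.get("body")(3.0)}")
}
```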

Similarity that returns the raw TF as score.

abstract class Similarity

Similarity defines the components of Lucene scoring.

abstract class SimilarityBase : Similarity

A subclass of Similarity that provides a simplified API for its descendants. Subclasses are only required to implement the .score and .toString methods. Implementing .explain is optional, inasmuch as SimilarityBase already provides a basic explanation of the score and the term frequency. However, implementers of a subclass are encouraged to include as much detail about the scoring method as possible.

abstract class TFIDFSimilarity : Similarity

Implementation of Similarity with the Vector Space Model.