Package-level declarations
Types
This class acts as the base class for the implementations of the first normalization of the informative content in the DFR framework. This component is also called the after effect and is defined by the formula Inf2 = 1 - Prob2, where Prob2 measures the information gain.
Model of the information gain based on the ratio of two Bernoulli processes.
Model of the information gain based on Laplace's law of succession.
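The two after-effect models above can be sketched as plain functions. This is a minimal illustration that follows the textbook DFR definitions, not necessarily the exact code in this package; the function and parameter names are hypothetical, and `tfn` is assumed to be the normalized term frequency.

```python
def after_effect_laplace(tfn: float) -> float:
    """Inf2 under Laplace's law of succession.

    Prob2 = tfn / (tfn + 1), so Inf2 = 1 - Prob2 = 1 / (tfn + 1).
    """
    prob2 = tfn / (tfn + 1.0)
    return 1.0 - prob2


def after_effect_bernoulli(tfn: float, total_term_freq: float, doc_freq: float) -> float:
    """Inf2 as the ratio of two Bernoulli processes.

    Textbook form (F + 1) / (n * (tfn + 1)), where F is the total term
    frequency in the collection and n is the document frequency.
    """
    return (total_term_freq + 1.0) / (doc_freq * (tfn + 1.0))
```

In both cases the after effect shrinks as `tfn` grows, so each additional occurrence of a term contributes less to the score.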
Axiomatic approaches for IR. From Hui Fang and Chengxiang Zhai 2005. An Exploration of Axiomatic Approaches to Information Retrieval. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '05). ACM, New York, NY, USA, 480-487.
F1EXP is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq.
F1LOG is defined as Sum(tf(term_doc_freq)*ln(docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq.
F2EXP is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq.
F2LOG is defined as Sum(tfln(term_doc_freq, docLen)*IDF(term)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq.
F3EXP is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = pow((N+1)/df(t), k), N = total number of docs, df = doc freq, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
F3LOG is defined as Sum(tf(term_doc_freq)*IDF(term)-gamma(docLen, queryLen)), where IDF(t) = ln((N+1)/df(t)), N = total number of docs, df = doc freq, and gamma(docLen, queryLen) = (docLen-queryLen)*queryLen*s/avdl. NOTE: the gamma function of this similarity creates negative scores.
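The IDF variants and the gamma term shared by the axiomatic models can be written out directly. The sketch below only mirrors the formulas quoted in these descriptions, taking the gamma term to be (docLen - queryLen) * queryLen * s / avdl with `s` a tunable parameter (an assumption about the intended formula). All function names are hypothetical.

```python
import math


def idf_log(num_docs: int, doc_freq: int) -> float:
    """IDF used by the *LOG variants: ln((N + 1) / df)."""
    return math.log((num_docs + 1) / doc_freq)


def idf_exp(num_docs: int, doc_freq: int, k: float) -> float:
    """IDF used by the *EXP variants: ((N + 1) / df) ** k."""
    return math.pow((num_docs + 1) / doc_freq, k)


def gamma(doc_len: float, query_len: float, s: float, avdl: float) -> float:
    """Length penalty of the F3 variants (assumed form, see lead-in)."""
    return (doc_len - query_len) * query_len * s / avdl
```

Because gamma is subtracted from the term score, a long document with a weak match can end up with a score below zero, which is what the NOTE on the F3 variants warns about.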
This class acts as the base class for the specific basic model implementations in the DFR framework. Basic models compute the informative content Inf1 = -log2(Prob1).
Geometric as limiting form of the Bose-Einstein model. The formula used in Lucene differs slightly from the one in the original paper: F is increased by 1 and N is increased by F.
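As a rough illustration, the geometric model with the two adjustments mentioned above (F increased by 1, N increased by F) might look like the following. This is a sketch of the published formula Inf1 = -log2(1 / (1 + lambda)) - tfn * log2(lambda / (1 + lambda)) with lambda = F / N, not the package's actual code, which may combine it with the after effect; the names are hypothetical.

```python
import math


def basic_model_g(tfn: float, total_term_freq: float, num_docs: float) -> float:
    """Informative content Inf1 of the geometric (G) basic model."""
    f = total_term_freq + 1.0   # F is increased by 1
    n = num_docs + f            # N is increased by F
    lam = f / n                 # lambda, always in (0, 1) after the adjustments
    return math.log2(lam + 1.0) + tfn * math.log2((1.0 + lam) / lam)
```

With these adjustments lambda can never be 0 or exceed 1, which keeps both logarithms finite.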
An approximation of the I(ne) model.
The basic tf-idf model of randomness.
Tf-idf model of randomness, based on a mixture of Poisson and inverse document frequency.
Stores all statistics commonly used by ranking methods.
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
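For orientation, the textbook per-term BM25 score can be sketched as below. This is the commonly cited form with a smoothed IDF, not a transcription of this class; the defaults `k1=1.2` and `b=0.75` are widely used values assumed here, and the function names are hypothetical.

```python
import math


def bm25_idf(num_docs: int, doc_freq: int) -> float:
    """Smoothed IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    num_docs: int, doc_freq: int,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """Score contribution of one query term for one document."""
    # Length normalization: b = 0 disables it, b = 1 applies it fully.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return bm25_idf(num_docs, doc_freq) * tf * (k1 + 1.0) / (tf + norm)
```

The term frequency saturates: as `tf` grows, the fraction approaches `k1 + 1`, so repeated occurrences yield diminishing gains.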
Simple similarity that gives terms a score that is equal to their query boost. This similarity is typically used with disabled norms since neither document statistics nor index statistics are used for scoring. That said, if norms are enabled, they will be computed the same way as in SimilarityBase and BM25Similarity, honoring SimilarityBase.getDiscountOverlaps, so that the Similarity can be changed after the index has been created.
Expert: Historical scoring implementation. You might want to consider using BM25Similarity instead, which is generally considered superior to TF-IDF.
Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).
Implements the divergence from randomness (DFR) framework introduced in Gianni Amati and Cornelis Joost Van Rijsbergen. 2002. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20, 4 (October 2002), 357-389.
The probabilistic distribution used to model term occurrence in information-based models.
Log-logistic distribution.
The smoothed power-law (SPL) distribution for the information-based framework that is described in the original paper.
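The two distributions can be sketched as the negative log of a survival probability, following the formulas in the Clinchant and Gaussier paper. This is an illustration only: the function names are hypothetical, `lam` stands for the collection-level parameter lambda, and it is assumed to lie strictly between 0 and 1.

```python
import math


def log_logistic(tfn: float, lam: float) -> float:
    """Information under the log-logistic model: -ln(lam / (tfn + lam))."""
    return -math.log(lam / (tfn + lam))


def smoothed_power_law(tfn: float, lam: float) -> float:
    """Information under the SPL model.

    -ln((lam ** (tfn / (tfn + 1)) - lam) / (1 - lam)), for 0 < lam < 1.
    """
    return -math.log((math.pow(lam, tfn / (tfn + 1.0)) - lam) / (1.0 - lam))
```

Both functions return 0 at `tfn = 0` and grow with `tfn`, so a term that occurs more often than the collection would predict carries more information.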
Provides a framework for the family of information-based models, as described in Stéphane Clinchant and Eric Gaussier. 2010. Information-based models for ad hoc IR. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval (SIGIR '10). ACM, New York, NY, USA, 234-241.
Computes the measure of divergence from independence for DFI scoring functions.
Normalized chi-squared measure of distance from independence
Saturated measure of distance from independence
Standardized measure of distance from independence
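The three measures differ only in how they scale the gap between the observed and the expected term frequency. The sketch below assumes the usual DFI definition of the expected frequency (total term frequency times document length, divided by the number of tokens in the field); all names are hypothetical.

```python
import math


def expected_freq(total_term_freq: float, doc_len: float, num_field_tokens: float) -> float:
    """Term frequency expected if the term were spread independently of documents."""
    return total_term_freq * doc_len / num_field_tokens


def chi_squared(freq: float, expected: float) -> float:
    """Normalized chi-squared distance: (freq - expected)^2 / expected."""
    return (freq - expected) ** 2 / expected


def saturated(freq: float, expected: float) -> float:
    """Saturated distance: (freq - expected) / expected."""
    return (freq - expected) / expected


def standardized(freq: float, expected: float) -> float:
    """Standardized distance: (freq - expected) / sqrt(expected)."""
    return (freq - expected) / math.sqrt(expected)
```

For example, a term expected 4 times but seen 6 times gives a chi-squared distance of 1.0, a saturated distance of 0.5, and a standardized distance of 1.0.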
Bayesian smoothing using Dirichlet priors as implemented in the Indri Search engine (http://www.lemurproject.org/indri.php). Indri Dirichlet Smoothing!
Bayesian smoothing using Dirichlet priors. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
Language model based on the Jelinek-Mercer smoothing method. From Chengxiang Zhai and John Lafferty. 2001. A study of smoothing methods for language models applied to Ad Hoc information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '01). ACM, New York, NY, USA, 334-342.
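The two smoothing methods from Zhai and Lafferty can be sketched in their rank-equivalent log form. This is an assumption-laden illustration rather than this package's code: `coll_prob` is the term's probability in the whole collection, `mu=2000` is a commonly used default assumed here, any clamping of negative scores is left out, and the function names are hypothetical.

```python
import math


def lm_dirichlet(freq: float, doc_len: float, coll_prob: float, mu: float = 2000.0) -> float:
    """Dirichlet-prior smoothing: ln(1 + freq / (mu * p)) + ln(mu / (doc_len + mu))."""
    return math.log(1.0 + freq / (mu * coll_prob)) + math.log(mu / (doc_len + mu))


def lm_jelinek_mercer(freq: float, doc_len: float, coll_prob: float, lam: float) -> float:
    """Jelinek-Mercer smoothing: ln(1 + ((1 - lam) * freq / doc_len) / (lam * p))."""
    return math.log(1.0 + ((1.0 - lam) * freq / doc_len) / (lam * coll_prob))
```

Dirichlet smoothing adapts to document length through `mu`, while Jelinek-Mercer mixes the document and collection models with a fixed weight `lam`.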
Abstract superclass for language modeling Similarities. The following inner types are introduced:
Implements the CombSUM method for combining evidence from multiple similarity values described in: Joseph A. Shaw, Edward A. Fox. In Text REtrieval Conference (1993), pp. 243-252
This class acts as the base class for the implementations of the term frequency normalization methods in the DFR framework.
Normalization model that assumes a uniform distribution of the term frequency.
Normalization model in which the term frequency is inversely related to the length.
Dirichlet Priors normalization
Pareto-Zipf Normalization
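Three of the normalization methods above reduce to one-line formulas over the raw term frequency, the document length, and the average document length. The sketch follows the published DFR definitions; the parameter defaults (`c=1.0`, `z=0.3`) and the function names are assumptions made for illustration.

```python
import math


def norm_h1(tf: float, doc_len: float, avg_len: float, c: float = 1.0) -> float:
    """Uniform distribution of term frequency: tf * c * avg_len / doc_len."""
    return tf * c * (avg_len / doc_len)


def norm_h2(tf: float, doc_len: float, avg_len: float, c: float = 1.0) -> float:
    """Term frequency inversely related to length: tf * log2(1 + c * avg_len / doc_len)."""
    return tf * math.log2(1.0 + c * avg_len / doc_len)


def norm_z(tf: float, doc_len: float, avg_len: float, z: float = 0.3) -> float:
    """Pareto-Zipf normalization: tf * (avg_len / doc_len) ** z."""
    return tf * math.pow(avg_len / doc_len, z)
```

With the default parameters, a document of exactly average length keeps its raw term frequency under all three methods, while longer documents are discounted.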
Provides the ability to use a different Similarity for different fields.
Similarity that returns the raw TF as score.
Similarity defines the components of Lucene scoring.
A subclass of Similarity that provides a simplified API for its descendants. Subclasses are only required to implement the score and toString methods. Implementing explain is optional, inasmuch as SimilarityBase already provides a basic explanation of the score and the term frequency. However, implementers of a subclass are encouraged to include as much detail about the scoring method as possible.
Implementation of Similarity with the Vector Space Model.