core/org.gnit.lucenekmp.document/FeatureField

FeatureField

class FeatureField(fieldName: String, featureName: String, featureValue: Float, storeTermVectors: Boolean = false) : Field

Field that can be used to store static scoring factors into documents. This is mostly inspired by the work of Nick Craswell, Stephen Robertson, Hugo Zaragoza and Michael Taylor: Relevance weighting for query independent evidence. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil.

Feature values are internally encoded as term frequencies. Adding feature queries as clauses of a BooleanQuery makes it possible to combine query-dependent scores (e.g. BM25) with query-independent scores as a linear combination. Because feature values are stored as frequencies, search logic can also efficiently skip documents that cannot be competitive when total hit counts are not requested. This makes the field a compelling alternative to storing such factors in, e.g., a doc-value field.

This field may only store factors that are positively correlated with the final score, such as pagerank. For factors that are inversely correlated with the score, such as URL length, store the inverse of the scoring factor instead, i.e. 1/urlLength.
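As a small illustration of the inversion rule above, the following Java sketch turns a URL length into a value that grows as the URL gets shorter. The class and method names are hypothetical and not part of this library, and the guard against non-positive lengths is an assumption about what a caller would want.

```java
// Hypothetical helper, not part of the library: derives a feature value
// that is positively correlated with the score from a URL length.
final class InverseFeatureSketch {

    // Returns 1/urlLength so that shorter URLs get larger feature values.
    static float inverseUrlLength(int urlLength) {
        if (urlLength <= 0) {
            throw new IllegalArgumentException("urlLength must be positive, got: " + urlLength);
        }
        return 1f / urlLength;
    }

    public static void main(String[] args) {
        System.out.println(inverseUrlLength(20));  // shorter URL, larger value
        System.out.println(inverseUrlLength(200)); // longer URL, smaller value
    }
}
```

The resulting float is what would be passed as featureValue when constructing the field.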

For storage efficiency, this field only keeps the top 9 significant bits of each value, which allows values to be stored on 16 bits internally. In practice, this limitation means that values are stored with a relative precision of 2^-8 = 0.00390625.
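To make the precision limit concrete, here is a minimal Java sketch of one way such a truncation can work. It assumes the stored value is simply the top 16 bits of the IEEE 754 single-precision representation of a positive float (8 exponent bits plus the 8 highest mantissa bits, which together with the implicit leading bit give 9 significant bits). The class and method names are hypothetical, and the library's actual encoding may differ.

```java
// Sketch only: assumes the stored value is the top 16 bits of the
// IEEE 754 single-precision representation of a positive float.
final class FeaturePrecisionSketch {

    // Keeps the exponent and the 8 highest mantissa bits; drops the low 15 bits.
    static int encode(float value) {
        return Float.floatToIntBits(value) >>> 15;
    }

    // Restores a float whose 15 low mantissa bits are zero.
    static float decode(int encoded) {
        return Float.intBitsToFloat(encoded << 15);
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        float roundTripped = decode(encode(original));
        float relativeError = (original - roundTripped) / original;
        System.out.println(original + " -> " + roundTripped
                + " (relative error " + relativeError + ")");
    }
}
```

Under this assumption the decoded value is never larger than the original, and the relative error stays below 2^-8, so two feature values that differ by less than about 0.4% may be indistinguishable at search time.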

Given a scoring factor S > 0 and its weight w > 0, there are four ways that S can be turned into a score:

newLogQuery: w * log(a + S), with a ≥ 1. This function usually makes sense because the distribution of scoring factors often follows a power law, as is typically the case for pagerank. However, the paper suggests that the saturation (satu) and sigmoid (sigm) functions give even better results.
newSaturationQuery: w * S / (S + k), with k > 0. This function is similar to the one BM25Similarity uses to incorporate term frequency into the final score. Before the weight is applied, it produces values between 0 and 1, and 0.5 is obtained when S and k are equal.
newSigmoidQuery: w * S^a / (S^a + k^a), with k > 0, a > 0. This function provided even better results than the two above, but it is harder to tune because it has two parameters. As with satu, values before weighting are in the 0..1 range and 0.5 is obtained when S and k are equal.
newLinearQuery: w * S. Expert: this function applies no transformation to the indexed feature value; the indexed value itself, multiplied by the weight, determines the score. The feature value is therefore expected to be encoded in the index in a way that makes sense for scoring.
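The four functions can be compared side by side with a plain-math Java sketch. It assumes the forms w * log(a + S), w * S / (S + k), w * S^a / (S^a + k^a) and w * S; the class and method names are hypothetical, and this is not the library's scoring code, which operates on encoded term frequencies.

```java
// Sketch only: plain-math versions of the four scoring functions,
// under the formula assumptions stated above. Not the library's code.
final class FeatureScoreSketch {

    // w * log(a + S), with a >= 1
    static double log(double w, double a, double s) {
        return w * Math.log(a + s);
    }

    // w * S / (S + k), with k > 0
    static double saturation(double w, double k, double s) {
        return w * s / (s + k);
    }

    // w * S^a / (S^a + k^a), with k > 0 and a > 0
    static double sigmoid(double w, double k, double a, double s) {
        double sa = Math.pow(s, a);
        return w * sa / (sa + Math.pow(k, a));
    }

    // w * S
    static double linear(double w, double s) {
        return w * s;
    }

    public static void main(String[] args) {
        double s = 5.0; // example feature value, chosen arbitrarily
        System.out.println("log:        " + log(1.0, 1.0, s));
        System.out.println("saturation: " + saturation(1.0, 5.0, s));
        System.out.println("sigmoid:    " + sigmoid(1.0, 5.0, 2.0, s));
        System.out.println("linear:     " + linear(1.0, s));
    }
}
```

With S equal to the pivot k, both the saturation and sigmoid sketches return half of the weight, which matches the 0.5 midpoint described in the list.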

The constants in the above formulas typically need training in order to find optimal values. If you don't know where to start, the newSaturationQuery method that takes only a field name and a feature name uses 1f as the weight and tries to guess a sensible value for the pivot parameter of the saturation function from index statistics, which should perform reasonably well. Here is an example, assuming that documents have a FeatureField called 'features' with values for the 'pagerank' feature.

Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
    .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
    .build();
Query boost = FeatureField.newSaturationQuery("features", "pagerank");
Query boostedQuery = new BooleanQuery.Builder()
    .add(query, Occur.MUST)
    .add(boost, Occur.SHOULD)
    .build();
TopDocs topDocs = searcher.search(boostedQuery, 10);

Constructors

FeatureField

constructor(fieldName: String, featureName: String, featureValue: Float, storeTermVectors: Boolean = false)

Types

Companion

object Companion

Properties

charSequenceValue

open override val charSequenceValue: CharSequence?

Functions

binaryValue

open override fun binaryValue(): BytesRef?

Non-null if this field has a binary value

fieldType

open override fun fieldType(): IndexableFieldType

Returns the FieldType for this field.

getFeatureValue

fun getFeatureValue(): Float

This is useful if you have multiple features sharing a name and you want to take action to deduplicate them.

invertableType

open override fun invertableType(): InvertableType

Describes how this field should be inverted. This must return a non-null value if the field indexes terms and postings.

name

open override fun name(): String

Field name

numericValue

open override fun numericValue(): Number?

Non-null if this field has a numeric value

readerValue

open override fun readerValue(): Reader?

The value of the field as a Reader, or null. If null, the String value or binary value is used. Exactly one of stringValue(), readerValue(), and binaryValue() must be set.

setBytesValue

fun setBytesValue(value: ByteArray)

open fun setBytesValue(value: BytesRef)