FeatureField
Field that can be used to store static scoring factors in documents. This is mostly inspired by the work of Nick Craswell, Stephen Robertson, Hugo Zaragoza and Michael Taylor: "Relevance weighting for query independent evidence", Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil.
Feature values are internally encoded as term frequencies. Putting feature queries as clauses of a BooleanQuery makes it possible to combine query-dependent scores (e.g. BM25) with query-independent scores using a linear combination. Because feature values are stored as frequencies, search logic can also efficiently skip documents that cannot be competitive when total hit counts are not requested. This makes it a compelling option compared to storing such factors in, for example, a doc-value field.
This field may only store factors that are positively correlated with the final score, like pagerank. For factors that are inversely correlated with the score, like URL length, the inverse of the scoring factor should be stored, i.e. 1/urlLength.
This field only considers the top 9 significant bits for storage efficiency, which allows values to be stored on 16 bits internally. In practice this limitation means that values are stored with a relative precision of 2^-8 = 0.00390625.
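The precision claim above can be checked with a small standalone sketch. It is not the library's implementation: the class and method names are hypothetical, and it assumes the truncation simply drops the low 15 bits of the IEEE 754 float representation, which keeps the 8 exponent bits and the top 8 explicit mantissa bits (9 significant bits counting the implicit leading one). It also assumes positive, normal float values.

```java
public class FeaturePrecisionSketch {

    // Hypothetical encoder: drop the 15 least significant mantissa bits.
    // For a positive float the sign bit is 0, so the result fits in 16 bits.
    static int encode(float value) {
        if (!(value >= Float.MIN_NORMAL) || Float.isInfinite(value)) {
            throw new IllegalArgumentException("value must be a positive, finite, normal float");
        }
        return Float.floatToIntBits(value) >>> 15;
    }

    // Hypothetical decoder: restore the dropped bits as zeros.
    static float decode(int encoded) {
        return Float.intBitsToFloat(encoded << 15);
    }

    public static void main(String[] args) {
        float original = 3.14159f;
        float roundTripped = decode(encode(original));
        float relativeError = (original - roundTripped) / original;
        // The relative error stays below 2^-8 because only bits beyond
        // the 8th explicit mantissa bit are discarded.
        System.out.println(original + " -> " + roundTripped
            + " (relative error " + relativeError + ")");
    }
}
```

Under these assumptions, values that differ only beyond the 9th significant bit become indistinguishable after the round trip, so features that need finer resolution would have to be rescaled before indexing.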
Given a scoring factor S > 0 and its weight w > 0, there are four ways that S can be turned into a score:
.newLogQuery: w * log(a + S), with a ≥ 1. This function usually makes sense because the distribution of scoring factors often follows a power law. This is typically the case for pagerank, for instance. However, the paper suggested that the satu and sigm functions give even better results.
.newSaturationQuery: w * S / (S + k), with k > 0. This function is similar to the one used by BM25Similarity to incorporate term frequency into the final score, and it produces values between 0 and 1. A value of 0.5 is obtained when S and k are equal.
.newSigmoidQuery: w * S^a / (S^a + k^a), with k > 0, a > 0. This function provided even better results than the two above but is also harder to tune because it has 2 parameters. As with satu, values are in the 0..1 range and 0.5 is obtained when S and k are equal.
.newLinearQuery: w * S. Expert: This function does not apply any transformation to an indexed feature value; the indexed value itself, multiplied by the weight, determines the score. The feature value is therefore expected to be encoded in the index in a way that makes sense for scoring.
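To get a feel for how the four functions behave, the following standalone sketch evaluates them in plain Java. It assumes the formulas are w * log(a + S), w * S / (S + k), w * S^a / (S^a + k^a) and w * S. The class and method names are hypothetical and do not call into the library.

```java
public class FeatureScoreSketch {

    // Log function: w * log(a + S), with a >= 1.
    static double logScore(double w, double a, double s) {
        return w * Math.log(a + s);
    }

    // Saturation function: w * S / (S + k), with k > 0.
    static double saturationScore(double w, double k, double s) {
        return w * s / (s + k);
    }

    // Sigmoid function: w * S^a / (S^a + k^a), with k > 0 and a > 0.
    static double sigmoidScore(double w, double k, double a, double s) {
        double sPow = Math.pow(s, a);
        double kPow = Math.pow(k, a);
        return w * sPow / (sPow + kPow);
    }

    // Linear function: w * S, no transformation of the stored value.
    static double linearScore(double w, double s) {
        return w * s;
    }

    public static void main(String[] args) {
        double w = 1.0;
        double k = 5.0;
        // Print each function for a few feature values around the pivot k.
        for (double s : new double[] {1.0, 5.0, 25.0}) {
            System.out.println("S=" + s
                + " log=" + logScore(w, 1.0, s)
                + " satu=" + saturationScore(w, k, s)
                + " sigm=" + sigmoidScore(w, k, 2.0, s)
                + " linear=" + linearScore(w, s));
        }
    }
}
```

With w = 1, both the saturation and sigmoid sketches return exactly 0.5 when S equals k, which matches the description of the pivot above. The exponent a in the sigmoid controls how sharply the score moves from 0 to 1 around that pivot.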
The constants in the above formulas typically need training in order to compute optimal values. If you don't know where to start, the .newSaturationQuery method uses 1f as a weight and tries to guess a sensible value for the pivot parameter of the saturation function based on index statistics, which should not perform too badly. Here is an example, assuming that documents have a FeatureField called 'features' with values for the 'pagerank' feature.
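// Query-dependent part: match documents whose body mentions either term.
Query query = new BooleanQuery.Builder()
    .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
    .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
    .build();
// Query-independent part: saturation function over the 'pagerank' feature.
Query boost = FeatureField.newSaturationQuery("features", "pagerank");
// The boost is optional (SHOULD), so it only affects scoring, not matching.
Query boostedQuery = new BooleanQuery.Builder()
    .add(query, Occur.MUST)
    .add(boost, Occur.SHOULD)
    .build();
// 'searcher' is assumed to be an existing IndexSearcher.
TopDocs topDocs = searcher.search(boostedQuery, 10);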
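The effect of the boosted query on ranking can be illustrated without an index. The sketch below assumes that the final score is the sum of the query-dependent score and a saturation boost w * S / (S + k), which is how a BooleanQuery combines the scores of its matching clauses. The class name, the text scores, the pagerank values and the pivot are all made up for illustration.

```java
public class CombinedScoreSketch {

    // Hypothetical combination: text score plus a saturation boost on the feature.
    static double combined(double textScore, double weight, double pivot, double feature) {
        return textScore + weight * feature / (feature + pivot);
    }

    public static void main(String[] args) {
        double weight = 1.0;
        double pivot = 5.0;

        // Document A: slightly better text match, low pagerank (illustrative values).
        double scoreA = combined(2.0, weight, pivot, 1.0);
        // Document B: slightly worse text match, high pagerank (illustrative values).
        double scoreB = combined(1.8, weight, pivot, 20.0);

        System.out.println("A=" + scoreA + " B=" + scoreB);
        System.out.println(scoreB > scoreA ? "B ranks first" : "A ranks first");
    }
}
```

Because the saturation boost is bounded by the weight, it can reorder documents with similar text scores but cannot overwhelm a much stronger text match unless the weight is raised.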
Functions
Non-null if this field has a binary value
Returns the FieldType for this field.
This is useful if you have multiple features sharing a name and you want to take action to deduplicate them.
Describes how this field should be inverted. This must return a non-null value if the field indexes terms and postings.
Non-null if this field has a numeric value
The value of the field as a Reader, or null. If null, the String value or binary value is used. Exactly one of stringValue(), readerValue(), and binaryValue() must be set.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. See .setStringValue.
Update the feature value of this field.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. See .setStringValue.
Expert: change the value of this field. This can be used during indexing to re-use a single Field instance to improve indexing speed by avoiding GC cost of new'ing and reclaiming Field instances. Typically a single Document instance is re-used as well. This helps most on small documents.
Expert: sets the token stream to be used for indexing.
Stored value. This method is called to populate stored fields and must return a non-null value if the field is stored.
The value of the field as a String, or null. If null, the Reader value or binary value is used. Exactly one of stringValue(), readerValue(), and binaryValue() must be set.
Creates the TokenStream used for indexing this field. If appropriate, implementations should use the given Analyzer to create the TokenStreams.
The TokenStream for this field to be used when indexing, or null. If null, the Reader value or String value is analyzed to produce the indexed tokens.