Package-level declarations

Types

class ASCIIFoldingFilter(input: TokenStream, val isPreserveOriginal: Boolean = false) : TokenFilter

This class converts alphabetic, numeric, and symbolic Unicode characters that are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, where such equivalents exist.
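
As a rough illustration of the folding idea, the sketch below maps a handful of non-ASCII characters to ASCII replacements and leaves everything else untouched. It is a standalone example, not the library's implementation: `sampleFoldingTable` and `foldToAscii` are hypothetical names, and the table covers only four characters chosen for the example.

```kotlin
// Hypothetical, deliberately tiny folding table; a real folding table is far larger.
val sampleFoldingTable: Map<Char, String> = mapOf(
    'é' to "e",
    'ü' to "u",
    'ß' to "ss",
    'Æ' to "AE"
)

// Replaces each non-ASCII character that has an entry in the table;
// characters without an ASCII equivalent in the table are kept as-is.
fun foldToAscii(text: String): String = buildString {
    for (ch in text) {
        if (ch.code < 128) {
            append(ch)
        } else {
            append(sampleFoldingTable[ch] ?: ch.toString())
        }
    }
}
```

With this table, `foldToAscii("café")` yields `cafe`, and a character with no entry passes through unchanged.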

class CapitalizationFilter(in: TokenStream, onlyFirstWord: Boolean = true, keep: CharArraySet? = null, forceFirstLetter: Boolean = true, okPrefix: Collection<CharArray>? = null, minWordLength: Int = 0, maxWordCount: Int = DEFAULT_MAX_WORD_COUNT, maxTokenLength: Int = DEFAULT_MAX_TOKEN_LENGTH) : TokenFilter

A filter to apply normal capitalization rules to Tokens. It will make the first letter capital and the rest lower case.
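
The basic rule (first letter upper case, the rest lower case) can be sketched on a plain string. This is a minimal standalone example; `capitalizeToken` is a hypothetical helper and ignores the filter's other options such as `keep`, `okPrefix`, and the length limits.

```kotlin
// Hypothetical helper: upper-cases the first character and lower-cases the rest.
fun capitalizeToken(text: String): String {
    if (text.isEmpty()) return text
    return text.substring(0, 1).uppercase() + text.substring(1).lowercase()
}
```

For example, `capitalizeToken("hELLO")` returns `Hello`.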

Removes words that are too long or too short from the stream.

class ConcatenateGraphFilter(inputTokenStream: TokenStream, tokenSeparator: Char?, preservePositionIncrements: Boolean, maxGraphExpansions: Int) : TokenStream

Concatenates/joins every incoming token with a separator into one output token for every path through the token stream (which is a graph). In simple cases this yields one token, but in the presence of any tokens with a zero positionIncrement (e.g. synonyms) it will yield more. This filter uses the token bytes, position increment, and position length of the incoming stream. Other attributes are not used or manipulated.
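
To make the "one output per path" idea concrete, here is a simplified sketch. It assumes every token has position length 1, so tokens with a zero position increment are simply alternatives at the same position, and the paths are the Cartesian product of the alternatives. `GraphToken`, `concatenatePaths`, and the space separator are hypothetical choices for this example, not the library's API.

```kotlin
// Hypothetical token model: text plus position increment (position length is assumed to be 1).
data class GraphToken(val text: String, val positionIncrement: Int)

fun concatenatePaths(tokens: List<GraphToken>, separator: Char = ' '): List<String> {
    // Group tokens by position: an increment of 0 stacks onto the previous position.
    val positions = mutableListOf<MutableList<String>>()
    for (token in tokens) {
        if (token.positionIncrement > 0 || positions.isEmpty()) {
            positions.add(mutableListOf())
        }
        positions.last().add(token.text)
    }
    if (positions.isEmpty()) return emptyList()

    // Extend every partial path with every alternative at the next position.
    var paths = listOf("")
    for ((index, alternatives) in positions.withIndex()) {
        paths = paths.flatMap { prefix ->
            alternatives.map { alt -> if (index == 0) alt else prefix + separator + alt }
        }
    }
    return paths
}
```

For the stream `fast` (increment 1), `quick` (increment 0), `car` (increment 1), the sketch produces `fast car` and `quick car`; without the synonym it produces a single token.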

A TokenStream that takes an array of input TokenStreams as sources, and concatenates them together.

Allows skipping TokenFilters based on the current set of attributes.

Abstract parent class for analysis factories that create ConditionalTokenFilter instances.

fun interface DateRecognizer
Link copied to clipboard

Filters all tokens that cannot be recognized as a date.

class DelimitedTermFrequencyTokenFilter(input: TokenStream, delimiter: Char = DEFAULT_DELIMITER) : TokenFilter

Characters before the delimiter are the "token"; the textual integer after it is the term frequency. To use this TokenFilter, the field must be indexed with term frequencies but no positions or offsets.
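
The split itself can be sketched as follows. This is a standalone illustration, not the library's code: `TermWithFreq` and `splitTermFrequency` are hypothetical names, the `'|'` default delimiter is an assumption, and so is the choice to split at the first delimiter and to treat a token without a delimiter as frequency 1.

```kotlin
// Hypothetical result type for the example.
data class TermWithFreq(val term: String, val freq: Int)

// Splits "term|7" into ("term", 7). The '|' default is an assumption of this sketch.
fun splitTermFrequency(raw: String, delimiter: Char = '|'): TermWithFreq {
    val index = raw.indexOf(delimiter)
    if (index < 0) return TermWithFreq(raw, 1) // no delimiter: assume frequency 1
    // toInt() throws NumberFormatException if the suffix is not a valid integer.
    val freq = raw.substring(index + 1).toInt()
    return TermWithFreq(raw.substring(0, index), freq)
}
```

For example, `splitTermFrequency("hello|5")` gives the term `hello` with frequency 5.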

Factory for DelimitedTermFrequencyTokenFilter. The field must have omitPositions=true.

Allows Tokens with a given combination of flags to be dropped. If all flags specified are present the token is dropped, otherwise it is retained.
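
The "all specified flags are present" rule is a bitmask test. A minimal sketch, assuming flags are stored as bits in an `Int`; `shouldDrop` is a hypothetical helper name:

```kotlin
// Returns true only when every bit set in dropFlags is also set in tokenFlags.
fun shouldDrop(tokenFlags: Int, dropFlags: Int): Boolean =
    (tokenFlags and dropFlags) == dropFlags
```

With `dropFlags = 0b0101`, a token carrying `0b0111` is dropped, while a token carrying only `0b0001` is retained because one of the required flags is missing.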

Provides a filter that drops tokens matching a set of flags. This is useful when custom filters identify tokens to be removed but must run before other filters that still need to see those tokens. Alternatively, separate flag-setting filters can each mark tokens, and this filter then removes the tokens that match a particular combination of flags.

In Solr this might be configured such as

An always exhausted token stream.

Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens. This can be useful for clustering/linking use cases.
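
The sort, de-duplicate, and concatenate steps can be sketched on a list of strings. `fingerprint` is a hypothetical helper, and the space separator is an assumption of this example:

```kotlin
// De-duplicates the tokens, sorts them, and joins them into a single string.
fun fingerprint(tokens: List<String>, separator: Char = ' '): String =
    tokens.distinct().sorted().joinToString(separator.toString())
```

Two inputs that contain the same words in a different order, such as `b a b c` and `c b a`, produce the same fingerprint `a b c`, which is what makes the output usable as a clustering key.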

A filter to correct offsets that illegally go backwards.

When plain text is extracted from documents, many words are often hyphenated and broken across two lines. This is common in documents that use narrow text columns, such as newsletters. To increase search efficiency, this filter puts hyphenated words broken across two lines back together. This filter should be used at indexing time only. Example field definition in schema.xml:

A TokenFilter that only keeps tokens with text contained in the required words. This filter behaves like the inverse of StopFilter.

Marks terms as keywords via the KeywordAttribute.

This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with KeywordAttribute.isKeyword set to true and once with it set to false. This is useful when combined with a stem filter that respects the KeywordAttribute, so that both the stemmed and the unstemmed version of a term are indexed into the same field.
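
A small sketch of the effect, using plain data classes instead of the library's attribute API. `KeywordToken`, `repeatAsKeyword`, and `toyStem` are hypothetical; the toy stemmer only strips a trailing "s" and stands in for any stemmer that skips keyword tokens.

```kotlin
// Hypothetical token model: text plus a keyword flag.
data class KeywordToken(val text: String, val isKeyword: Boolean)

// Emits every term twice: first flagged as keyword, then as non-keyword.
fun repeatAsKeyword(terms: List<String>): List<KeywordToken> =
    terms.flatMap { listOf(KeywordToken(it, isKeyword = true), KeywordToken(it, isKeyword = false)) }

// Toy stemmer for illustration only: leaves keyword tokens untouched.
fun toyStem(token: KeywordToken): String =
    if (token.isKeyword) token.text else token.text.removeSuffix("s")
```

Running `repeatAsKeyword(listOf("cats")).map(::toyStem)` gives `cats` followed by `cat`, so both forms end up in the same field.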

Removes words that are too long or too short from the stream.

This Analyzer limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.gnit.lucenekmp.index.IndexWriter.

This TokenFilter limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.gnit.lucenekmp.index.IndexWriter.

Lets all tokens pass through while their start offset is <= a configured limit. The first token whose start offset exceeds the limit is not passed and ends the stream. This can be useful to limit highlighting, for example.

This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.

Marks terms as keywords via the org.gnit.lucenekmp.analysis.tokenattributes.KeywordAttribute. Each token that matches the provided pattern is marked as a keyword by setting org.gnit.lucenekmp.analysis.tokenattributes.KeywordAttribute.isKeyword to true.

class PerFieldAnalyzerWrapper(defaultAnalyzer: Analyzer, fieldAnalyzers: Map<String, Analyzer>? = null) : DelegatingAnalyzerWrapper

This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use the Map argument in PerFieldAnalyzerWrapper to add non-default analyzers for fields.
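
The per-field dispatch amounts to a map lookup with a fallback. The sketch below shows only that idea with stand-in types; `SimpleAnalyzer` and `PerFieldSketch` are hypothetical and are not the library's `Analyzer` or `PerFieldAnalyzerWrapper`.

```kotlin
// Hypothetical stand-in for an analyzer: turns text into a list of terms.
fun interface SimpleAnalyzer {
    fun analyze(text: String): List<String>
}

// Picks the analyzer registered for a field, or falls back to the default one.
class PerFieldSketch(
    private val defaultAnalyzer: SimpleAnalyzer,
    private val fieldAnalyzers: Map<String, SimpleAnalyzer> = emptyMap()
) {
    fun analyze(fieldName: String, text: String): List<String> =
        (fieldAnalyzers[fieldName] ?: defaultAnalyzer).analyze(text)
}
```

For instance, with a whitespace-splitting default and a whole-value analyzer registered for an `id` field, `id` values stay intact while every other field is split into words.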

class ProtectedTermFilter(protectedTerms: CharArraySet, input: TokenStream, inputFactory: (TokenStream) -> TokenStream) : ConditionalTokenFilter

A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained in a protected set.

A TokenFilter which filters out Tokens that have the same position and term text as the previous token in the stream.
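
One way to picture this is sketched below, under the assumption that "same position" means a position increment of 0 relative to the preceding token. `PositionedToken` and `removeDuplicates` are hypothetical names, and the sketch tracks every term seen at the current position rather than only the immediately preceding one.

```kotlin
// Hypothetical token model: text plus position increment.
data class PositionedToken(val text: String, val positionIncrement: Int)

fun removeDuplicates(tokens: List<PositionedToken>): List<PositionedToken> {
    val seenAtPosition = mutableSetOf<String>()
    val result = mutableListOf<PositionedToken>()
    for (token in tokens) {
        // A positive increment starts a new position, so earlier terms no longer count.
        if (token.positionIncrement > 0) seenAtPosition.clear()
        // add() returns false when the term was already seen at this position.
        if (seenAtPosition.add(token.text)) result.add(token)
    }
    return result
}
```

A repeated term at a later position is kept; only repeats stacked on the same position are removed.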

This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against use of double vowels aa, ae, ao, oe and oo, leaving just the first one.

This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

This Normalizer does the heavy lifting for a set of Scandinavian normalization filters, normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

Marks terms as keywords via the KeywordAttribute when they exist in the provided set.

Provides the ability to override any KeywordAttribute-aware stemmer with custom dictionary-based stemming.

Trims leading and trailing whitespace from Tokens in the stream.

A token filter for truncating terms to a specific length. Fixed prefix truncation, as a stemming method, produces good results for the Turkish language. It is reported that F5, which uses the first 5 characters, produced the best results in "Information Retrieval on Turkish Texts".
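
Fixed prefix truncation is simple enough to show in one line. `truncateTerm` is a hypothetical helper; the default of 5 mirrors the F5 setting mentioned above, and the length is counted in UTF-16 `Char` units.

```kotlin
// Keeps at most prefixLength characters of the term; shorter terms are returned unchanged.
fun truncateTerm(term: String, prefixLength: Int = 5): String =
    term.take(prefixLength)
```

For example, the Turkish plural `kitaplar` is reduced to `kitap`, while a short term such as `ev` is left as it is.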

Factory for TruncateTokenFilter. The following type is recommended for "diacritics-insensitive search" for Turkish.

Adds the TypeAttribute.type as a synonym, i.e. another token at the same position, optionally with a specified prefix prepended, optionally transferring flags, and optionally ignoring some types. See TypeAsSynonymFilterFactory for full details.

Splits words into subwords and performs optional transformations on subword groups.

Splits words into subwords and performs optional transformations on subword groups, producing a correct token graph so that e.g. PhraseQuery can work correctly when this filter is used in the search-time analyzer. Unlike the deprecated WordDelimiterFilter, this token filter produces a correct token graph as output. However, it cannot consume an input token graph correctly. Processing is suppressed by KeywordAttribute.isKeyword=true.

class WordDelimiterIterator(charTypeTable: ByteArray, val splitOnCaseChange: Boolean, val splitOnNumerics: Boolean, val stemEnglishPossessive: Boolean)

A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.