common/org.gnit.lucenekmp.analysis.miscellaneous

Package-level declarations

Types

ASCIIFoldingFilter

class ASCIIFoldingFilter(input: TokenStream, val isPreserveOriginal: Boolean = false) : TokenFilter

This class converts alphabetic, numeric, and symbolic Unicode characters that are not among the first 128 Unicode characters (the 7-bit ASCII range, i.e. the "Basic Latin" Unicode block) into their ASCII equivalents, where such equivalents exist.
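The folding idea can be illustrated with a minimal standalone sketch. It does not use the library: `sampleFoldings`, `foldToAscii`, and `foldTerm` are hypothetical names, the table is deliberately tiny, and the `preserveOriginal` behavior (emitting the unfolded term as well when folding changed it) is an assumption about what `isPreserveOriginal` means.

```kotlin
// Hypothetical, deliberately tiny folding table; a real folding filter covers far more characters.
val sampleFoldings: Map<Char, String> = mapOf(
    'é' to "e", 'ü' to "u", 'ñ' to "n", 'ß' to "ss", 'Æ' to "AE"
)

// Folds one term: ASCII characters pass through, mapped characters are replaced,
// and characters without a known equivalent are left unchanged.
fun foldToAscii(term: String): String = buildString {
    for (ch in term) {
        if (ch.code < 128) append(ch) else append(sampleFoldings[ch] ?: ch.toString())
    }
}

// Assumed meaning of isPreserveOriginal: when set and folding changed the term,
// the original term is emitted in addition to the folded one.
fun foldTerm(term: String, preserveOriginal: Boolean = false): List<String> {
    val folded = foldToAscii(term)
    return if (preserveOriginal && folded != term) listOf(folded, term) else listOf(folded)
}
```

Under these assumptions, `foldTerm("Straße", preserveOriginal = true)` yields both the folded and the original form, while a pure ASCII term yields a single entry.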

ASCIIFoldingFilterFactory

class ASCIIFoldingFilterFactory : TokenFilterFactory

Factory for ASCIIFoldingFilter.

CapitalizationFilter

class CapitalizationFilter(in: TokenStream, onlyFirstWord: Boolean = true, keep: CharArraySet? = null, forceFirstLetter: Boolean = true, okPrefix: Collection<CharArray>? = null, minWordLength: Int = 0, maxWordCount: Int = DEFAULT_MAX_WORD_COUNT, maxTokenLength: Int = DEFAULT_MAX_TOKEN_LENGTH) : TokenFilter

A filter to apply normal capitalization rules to Tokens. It will make the first letter capital and the rest lower case.

CapitalizationFilterFactory

class CapitalizationFilterFactory : TokenFilterFactory

Factory for CapitalizationFilter.

CodepointCountFilter

class CodepointCountFilter(in: TokenStream, min: Int, max: Int) : FilteringTokenFilter

Removes words that are too long or too short from the stream. Length is measured in Unicode code points.
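The difference between counting code points and counting UTF-16 code units can be sketched without the library. The helper names below are hypothetical, and the inclusive `min..max` bounds are an assumption.

```kotlin
// Counts Unicode code points by treating each valid surrogate pair as a single character.
fun codePointCount(text: String): Int {
    var count = 0
    var i = 0
    while (i < text.length) {
        val isPair = text[i].isHighSurrogate() &&
            i + 1 < text.length && text[i + 1].isLowSurrogate()
        i += if (isPair) 2 else 1
        count++
    }
    return count
}

// Keeps only the terms whose code point count lies in [min, max] (bounds assumed inclusive).
fun filterByCodePointCount(terms: List<String>, min: Int, max: Int): List<String> =
    terms.filter { codePointCount(it) in min..max }
```

A term such as `"a\uD83D\uDE00"` has a `length` of 3 UTF-16 code units but only 2 code points, so the two ways of measuring can keep or drop different tokens.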

CodepointCountFilterFactory

class CodepointCountFilterFactory : TokenFilterFactory

Factory for CodepointCountFilter.

CompositeDateRecognizer

class CompositeDateRecognizer(recognizers: DateRecognizer) : DateRecognizer

ConcatenateGraphFilter

class ConcatenateGraphFilter(inputTokenStream: TokenStream, tokenSeparator: Char?, preservePositionIncrements: Boolean, maxGraphExpansions: Int) : TokenStream

Concatenates/joins every incoming token with a separator into one output token for every path through the token stream (which is a graph). In simple cases this yields one token, but in the presence of any tokens with a zero positionIncrement (e.g. synonyms) it will yield more. This filter uses the token bytes, position increment, and position length of the incoming stream. Other attributes are not used or manipulated.
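The "one output token per path" idea can be sketched on plain data. This is a simplified illustration, not the filter's implementation: `GraphToken` and `concatenatePaths` are hypothetical, the space separator is an arbitrary choice, and position length is ignored (every token is assumed to span exactly one position).

```kotlin
// A token reduced to its text and position increment (0 = same position as the previous token).
data class GraphToken(val term: String, val positionIncrement: Int = 1)

fun concatenatePaths(tokens: List<GraphToken>, separator: String = " "): List<String> {
    // Group tokens by position: an increment of 0 stacks the token on the previous position.
    val positions = mutableListOf<MutableList<String>>()
    for (token in tokens) {
        if (token.positionIncrement == 0 && positions.isNotEmpty()) {
            positions.last().add(token.term)
        } else {
            positions.add(mutableListOf(token.term))
        }
    }
    // Every path picks exactly one term per position; join each path into one string.
    return positions
        .fold(listOf(emptyList<String>())) { paths, alternatives ->
            paths.flatMap { path -> alternatives.map { path + it } }
        }
        .map { it.joinToString(separator) }
}
```

With a synonym stacked on the first position, for example `fast` and `quick` followed by `car`, the sketch produces two concatenated outputs instead of one.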

ConcatenateGraphFilterFactory

class ConcatenateGraphFilterFactory : TokenFilterFactory

Factory for ConcatenateGraphFilter.

ConcatenatingTokenStream

class ConcatenatingTokenStream(sources: TokenStream) : TokenStream

A TokenStream that takes an array of input TokenStreams as sources, and concatenates them together.

ConditionalTokenFilter

abstract class ConditionalTokenFilter : TokenFilter

Allows skipping TokenFilters based on the current set of attributes.

ConditionalTokenFilterFactory

abstract class ConditionalTokenFilterFactory : TokenFilterFactory, ResourceLoaderAware

Abstract parent class for analysis factories that create ConditionalTokenFilter instances.

DateRecognizer

fun interface DateRecognizer

DateRecognizerFilter

class DateRecognizerFilter : FilteringTokenFilter

Filters all tokens that cannot be recognized as a date.

DateRecognizerFilterFactory

class DateRecognizerFilterFactory : TokenFilterFactory

Factory for DateRecognizerFilter.

DelimitedTermFrequencyTokenFilter

class DelimitedTermFrequencyTokenFilter(input: TokenStream, delimiter: Char = DEFAULT_DELIMITER) : TokenFilter

Characters before the delimiter are the "token"; the textual integer after it is the term frequency. To use this TokenFilter, the field must be indexed with IndexOptions.DOCS_AND_FREQS, but without positions or offsets.
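The token format can be illustrated with a small standalone parser. `parseTermFrequency` is a hypothetical helper; the `'|'` default delimiter and the frequency of 1 for tokens without a delimiter are assumptions, since the value of `DEFAULT_DELIMITER` is not shown here.

```kotlin
// Splits "term<delimiter>frequency" at the first delimiter.
// Tokens without a delimiter are assumed to have a term frequency of 1.
fun parseTermFrequency(token: String, delimiter: Char = '|'): Pair<String, Int> {
    val index = token.indexOf(delimiter)
    if (index < 0) return token to 1
    // toInt() throws NumberFormatException when the suffix is not a valid integer.
    val frequency = token.substring(index + 1).toInt()
    return token.substring(0, index) to frequency
}
```

For example, `parseTermFrequency("hello|5")` separates the term `hello` from its frequency `5`.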

DelimitedTermFrequencyTokenFilterFactory

class DelimitedTermFrequencyTokenFilterFactory : TokenFilterFactory

Factory for DelimitedTermFrequencyTokenFilter. The field must have omitPositions=true.

DropIfFlaggedFilter

class DropIfFlaggedFilter(input: TokenStream, dropFlags: Int) : FilteringTokenFilter

Allows Tokens with a given combination of flags to be dropped. If all flags specified are present the token is dropped, otherwise it is retained.
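The "all specified flags are present" rule is a bit-mask test. A minimal sketch, with `shouldDrop` as a hypothetical helper name:

```kotlin
// A token is dropped only when every bit set in dropFlags is also set in its flags.
fun shouldDrop(tokenFlags: Int, dropFlags: Int): Boolean =
    (tokenFlags and dropFlags) == dropFlags
```

With `dropFlags = 0b0101`, a token flagged `0b0111` is dropped, while a token flagged `0b0100` is retained because one of the required bits is missing.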

DropIfFlaggedFilterFactory

class DropIfFlaggedFilterFactory : TokenFilterFactory

Provides a filter that will drop tokens matching a set of flags. This is useful when custom filters identify tokens to be removed, but those filters must run before other filters that still need to see the tokens that will eventually be dropped. Alternatively, separate flag-setting filters can each mark tokens, and this filter then removes the tokens that match a particular combination of those flags.

In Solr this might be configured such as

EmptyTokenStream

class EmptyTokenStream : TokenStream

An always exhausted token stream.

EnglishDefaultDateRecognizer

object EnglishDefaultDateRecognizer : DateRecognizer

FingerprintFilter

class FingerprintFilter : TokenFilter

Filter outputs a single token which is a concatenation of the sorted and de-duplicated set of input tokens. This can be useful for clustering/linking use cases.
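The fingerprint described above amounts to de-duplicating, sorting, and joining. A minimal sketch on plain strings, where `fingerprint` and the space separator are assumptions made for illustration:

```kotlin
// Builds one token from the sorted, de-duplicated input terms.
fun fingerprint(tokens: List<String>, separator: String = " "): String =
    tokens.distinct().sorted().joinToString(separator)
```

Two inputs that contain the same words in a different order, or with repeats, produce the same fingerprint, which is what makes it usable as a clustering key.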

FingerprintFilterFactory

class FingerprintFilterFactory : TokenFilterFactory

Factory for FingerprintFilter.

FixBrokenOffsetsFilter

class ~~FixBrokenOffsetsFilter~~(in: TokenStream) : TokenFilter

A filter to correct offsets that illegally go backwards.

FixBrokenOffsetsFilterFactory

class ~~FixBrokenOffsetsFilterFactory~~ : TokenFilterFactory

Factory for FixBrokenOffsetsFilter.

HyphenatedWordsFilter

class HyphenatedWordsFilter(in: TokenStream) : TokenFilter

When plain text is extracted from documents, many words are often hyphenated and broken across two lines. This is common in documents that use narrow text columns, such as newsletters. To increase search efficiency, this filter joins hyphenated words that were broken across two lines back together. This filter should be used at indexing time only. Example field definition in schema.xml:

HyphenatedWordsFilterFactory

class HyphenatedWordsFilterFactory : TokenFilterFactory

Factory for HyphenatedWordsFilter.

KeepWordFilter

class KeepWordFilter(in: TokenStream, words: CharArraySet) : FilteringTokenFilter

A TokenFilter that only keeps tokens with text contained in the required words. This filter behaves like the inverse of StopFilter.

KeepWordFilterFactory

class KeepWordFilterFactory : AbstractWordsFileFilterFactory

Factory for KeepWordFilter.

KeywordMarkerFilter

abstract class KeywordMarkerFilter(input: TokenStream) : TokenFilter

Marks terms as keywords via the KeywordAttribute.

KeywordMarkerFilterFactory

class KeywordMarkerFilterFactory : TokenFilterFactory, ResourceLoaderAware

Factory for KeywordMarkerFilter.

KeywordRepeatFilter

class KeywordRepeatFilter(input: TokenStream) : TokenFilter

This TokenFilter emits each incoming token twice, once as a keyword and once as a non-keyword; in other words, once with KeywordAttribute.isKeyword set to true and once with it set to false. This is useful when combined with a stem filter that respects the KeywordAttribute, to index both the stemmed and the unstemmed version of a term into the same field.
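The interaction with a keyword-aware stemmer can be sketched on plain data. `MarkedToken`, `repeatAsKeyword`, and `stemUnlessKeyword` are hypothetical, and the "stemmer" is a stand-in that only strips a trailing "s".

```kotlin
// A token reduced to its text and keyword flag.
data class MarkedToken(val term: String, val isKeyword: Boolean)

// Emits every term twice: first marked as a keyword, then as a non-keyword.
fun repeatAsKeyword(terms: List<String>): List<MarkedToken> =
    terms.flatMap {
        listOf(MarkedToken(it, isKeyword = true), MarkedToken(it, isKeyword = false))
    }

// Stand-in for a stemmer that respects the keyword flag (it only strips a trailing "s").
fun stemUnlessKeyword(token: MarkedToken): String =
    if (token.isKeyword) token.term else token.term.removeSuffix("s")
```

Running `repeatAsKeyword(listOf("cats")).map(::stemUnlessKeyword)` keeps the keyword copy unchanged and stems the other copy, so both forms end up in the same field.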

KeywordRepeatFilterFactory

class KeywordRepeatFilterFactory : TokenFilterFactory

Factory for KeywordRepeatFilter.

LengthFilter

class LengthFilter(in: TokenStream, min: Int, max: Int) : FilteringTokenFilter

Removes words that are too long or too short from the stream. Length is measured in UTF-16 code units.

LengthFilterFactory

class LengthFilterFactory : TokenFilterFactory

Factory for LengthFilter.

LimitTokenCountAnalyzer

class LimitTokenCountAnalyzer : AnalyzerWrapper

This Analyzer limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.gnit.lucenekmp.index.IndexWriter.

LimitTokenCountFilter

class LimitTokenCountFilter : TokenFilter

This TokenFilter limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.gnit.lucenekmp.index.IndexWriter.

LimitTokenCountFilterFactory

class LimitTokenCountFilterFactory : TokenFilterFactory

Factory for LimitTokenCountFilter.

LimitTokenOffsetFilter

class LimitTokenOffsetFilter : TokenFilter

Lets all tokens pass through until it sees one with a start offset greater than a configured limit; that token does not pass and ends the stream. This can be useful to limit highlighting, for example.
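A minimal sketch of the cut-off, assuming the stream ends at the first token whose start offset exceeds the limit. `OffsetToken` and `limitByStartOffset` are hypothetical names.

```kotlin
// A token reduced to its text and start offset.
data class OffsetToken(val term: String, val startOffset: Int)

// Passes tokens through until one starts beyond maxStartOffset; that token and
// everything after it are discarded, even if a later token has a smaller offset.
fun limitByStartOffset(tokens: List<OffsetToken>, maxStartOffset: Int): List<OffsetToken> =
    tokens.takeWhile { it.startOffset <= maxStartOffset }
```

Because the stream is ended rather than filtered, a later token with a small offset is not emitted once the limit has been crossed.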

LimitTokenOffsetFilterFactory

class LimitTokenOffsetFilterFactory : TokenFilterFactory

Factory for LimitTokenOffsetFilter.

LimitTokenPositionFilter

class LimitTokenPositionFilter : TokenFilter

This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.
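Positions are derived from position increments, so the limit can be sketched by accumulating them. In this hypothetical helper each token is a pair of term and position increment, and positions are assumed to start at 1.

```kotlin
// Emits tokens while their accumulated position does not exceed maxPosition.
// Each input pair is (term, positionIncrement).
fun limitByPosition(tokens: List<Pair<String, Int>>, maxPosition: Int): List<String> {
    val kept = mutableListOf<String>()
    var position = 0
    for ((term, increment) in tokens) {
        position += increment
        if (position > maxPosition) break
        kept.add(term)
    }
    return kept
}
```

Tokens stacked on an allowed position (increment 0) still pass, because they do not advance the position.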

LimitTokenPositionFilterFactory

class LimitTokenPositionFilterFactory : TokenFilterFactory

Factory for LimitTokenPositionFilter.

PatternDateRecognizer

class PatternDateRecognizer(datePattern: String) : DateRecognizer

PatternKeywordMarkerFilter

class PatternKeywordMarkerFilter(in: TokenStream, pattern: Regex) : KeywordMarkerFilter

Marks terms as keywords via the org.gnit.lucenekmp.analysis.tokenattributes.KeywordAttribute. Each token that matches the provided pattern is marked as a keyword by setting org.gnit.lucenekmp.analysis.tokenattributes.KeywordAttribute.isKeyword to true.

PerFieldAnalyzerWrapper

class PerFieldAnalyzerWrapper(defaultAnalyzer: Analyzer, fieldAnalyzers: Map<String, Analyzer>? = null) : DelegatingAnalyzerWrapper

This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use the Map argument in PerFieldAnalyzerWrapper to add non-default analyzers for fields.
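The lookup-with-fallback behavior can be sketched without the library. `SimpleAnalyzer` and `PerFieldAnalysis` are hypothetical stand-ins in which an analyzer is just a function from text to terms.

```kotlin
// Stand-in for an analyzer: turns text into a list of terms.
typealias SimpleAnalyzer = (String) -> List<String>

// Picks the analyzer registered for a field, falling back to the default one.
class PerFieldAnalysis(
    private val defaultAnalyzer: SimpleAnalyzer,
    private val fieldAnalyzers: Map<String, SimpleAnalyzer> = emptyMap()
) {
    fun analyze(fieldName: String, text: String): List<String> =
        (fieldAnalyzers[fieldName] ?: defaultAnalyzer)(text)
}
```

A field such as `id` could be mapped to an analyzer that keeps the whole value as one term, while all other fields use a whitespace-splitting default.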

ProtectedTermFilter

class ProtectedTermFilter(protectedTerms: CharArraySet, input: TokenStream, inputFactory: (TokenStream) -> TokenStream) : ConditionalTokenFilter

A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained in a protected set.
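The conditional application can be sketched on plain strings. `applyUnlessProtected` is a hypothetical helper, and the wrapped filter is reduced to a function on a single term.

```kotlin
// Applies the wrapped transformation only to terms outside the protected set.
fun applyUnlessProtected(
    terms: List<String>,
    protectedTerms: Set<String>,
    wrapped: (String) -> String
): List<String> = terms.map { if (it in protectedTerms) it else wrapped(it) }
```

For example, lowercasing every term except a protected acronym leaves the acronym untouched.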

ProtectedTermFilterFactory

class ProtectedTermFilterFactory(args: MutableMap<String, String>) : ConditionalTokenFilterFactory

Factory for a ProtectedTermFilter.

RemoveDuplicatesTokenFilter

class RemoveDuplicatesTokenFilter(in: TokenStream) : TokenFilter

A TokenFilter that filters out Tokens with the same position and term text as the previous token in the stream.
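A minimal sketch of de-duplication at a position, where each token is a pair of term and position increment. `removeDuplicates` is hypothetical, and it assumes a term is compared against every term already seen at the current position, not only the immediately preceding one.

```kotlin
// Drops a token when the same term was already emitted at the current position.
// Each input pair is (term, positionIncrement).
fun removeDuplicates(tokens: List<Pair<String, Int>>): List<Pair<String, Int>> {
    val seenAtPosition = mutableSetOf<String>()
    val kept = mutableListOf<Pair<String, Int>>()
    for (token in tokens) {
        val (term, increment) = token
        if (increment > 0) seenAtPosition.clear() // a new position starts
        if (seenAtPosition.add(term)) kept.add(token)
    }
    return kept
}
```

The same term at a later position is kept, since duplicates are only detected within one position.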

RemoveDuplicatesTokenFilterFactory

class RemoveDuplicatesTokenFilterFactory : TokenFilterFactory

Factory for RemoveDuplicatesTokenFilter.

ScandinavianFoldingFilter

class ScandinavianFoldingFilter(input: TokenStream) : TokenFilter

This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
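Both rules can be sketched in a few lines of standalone Kotlin. `foldScandinavian` is a hypothetical helper; preserving the case of folded characters and matching the double vowels case-insensitively are assumptions.

```kotlin
// Folds å/ä/æ to a and ö/ø to o, and collapses aa, ae, ao, oe and oo to their first vowel.
fun foldScandinavian(term: String): String {
    val out = StringBuilder()
    var i = 0
    while (i < term.length) {
        val ch = term[i]
        out.append(
            when (ch) {
                'å', 'ä', 'æ' -> 'a'
                'Å', 'Ä', 'Æ' -> 'A'
                'ö', 'ø' -> 'o'
                'Ö', 'Ø' -> 'O'
                else -> ch
            }
        )
        // Only a plain a or o can start a double vowel; the second vowel is skipped.
        val next = term.getOrNull(i + 1)?.lowercaseChar()
        val isDoubleVowel = when (ch.lowercaseChar()) {
            'a' -> next == 'a' || next == 'e' || next == 'o'
            'o' -> next == 'e' || next == 'o'
            else -> false
        }
        i += if (isDoubleVowel) 2 else 1
    }
    return out.toString()
}
```

Under these assumptions, spellings such as `blåbærsyltetøj` and `blaabaarsyltetoej` fold to the same term.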

ScandinavianFoldingFilterFactory

class ScandinavianFoldingFilterFactory : TokenFilterFactory

Factory for ScandinavianFoldingFilter.

ScandinavianNormalizationFilter

class ScandinavianNormalizationFilter(input: TokenStream) : TokenFilter

This filter normalizes use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

ScandinavianNormalizationFilterFactory

class ScandinavianNormalizationFilterFactory : TokenFilterFactory

Factory for ScandinavianNormalizationFilter.

ScandinavianNormalizer

class ScandinavianNormalizer(foldings: Set<ScandinavianNormalizer.Foldings>)

This Normalizer does the heavy lifting for a set of Scandinavian normalization filters, normalizing use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.

SetKeywordMarkerFilter

class SetKeywordMarkerFilter(input: TokenStream, keywordSet: CharArraySet) : KeywordMarkerFilter

Marks terms as keywords via the KeywordAttribute when they exist in the provided set.

StemmerOverrideFilter

class StemmerOverrideFilter(input: TokenStream, stemmerOverrideMap: StemmerOverrideFilter.StemmerOverrideMap) : TokenFilter

Provides the ability to override any KeywordAttribute-aware stemmer with custom dictionary-based stemming.

StemmerOverrideFilterFactory

class StemmerOverrideFilterFactory : TokenFilterFactory, ResourceLoaderAware

Factory for StemmerOverrideFilter.

TrimFilter

class TrimFilter(in: TokenStream) : TokenFilter

Trims leading and trailing whitespace from Tokens in the stream.

TrimFilterFactory

class TrimFilterFactory : TokenFilterFactory

Factory for TrimFilter.

TruncateTokenFilter

class TruncateTokenFilter(input: TokenStream, length: Int) : TokenFilter

A token filter for truncating terms to a specific length. Fixed prefix truncation, as a stemming method, produces good results for the Turkish language. It is reported that F5, which uses the first 5 characters, produced the best results in "Information Retrieval on Turkish Texts".
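Fixed prefix truncation is simple enough to show directly. `truncateTerm` is a hypothetical helper, and measuring the prefix in UTF-16 characters is an assumption.

```kotlin
// Keeps at most `length` leading characters of the term; shorter terms are unchanged.
fun truncateTerm(term: String, length: Int): String = term.take(length)
```

With a prefix length of 5 (the F5 setting mentioned above), the Turkish word `kitaplarımızdan` is reduced to `kitap`.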

TruncateTokenFilterFactory

class TruncateTokenFilterFactory : TokenFilterFactory

Factory for TruncateTokenFilter. The following type is recommended for "diacritics-insensitive search" for Turkish.

TypeAsSynonymFilter

class TypeAsSynonymFilter : TokenFilter

Adds the TypeAttribute.type as a synonym, i.e. another token at the same position, optionally with a specified prefix prepended, optionally transferring flags, and optionally ignoring some types. See TypeAsSynonymFilterFactory for full details.
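The effect on the token stream can be sketched on plain data. `TypedToken` and `addTypeSynonyms` are hypothetical, flag transfer is left out, and the `<URL>` type and `_type_` prefix in the usage note are illustrative values only.

```kotlin
// A token reduced to its text, type, and position increment.
data class TypedToken(val term: String, val type: String, val positionIncrement: Int = 1)

// After each token, adds its type (with an optional prefix) as a synonym at the same
// position, which is expressed by a position increment of 0. Ignored types get no synonym.
fun addTypeSynonyms(
    tokens: List<TypedToken>,
    prefix: String = "",
    ignoredTypes: Set<String> = emptySet()
): List<TypedToken> = tokens.flatMap { token ->
    if (token.type in ignoredTypes) listOf(token)
    else listOf(token, TypedToken(prefix + token.type, token.type, positionIncrement = 0))
}
```

A token `example.com` of type `<URL>` with prefix `_type_` would be followed by a synonym token `_type_<URL>` at the same position.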

TypeAsSynonymFilterFactory

class TypeAsSynonymFilterFactory : TokenFilterFactory

Factory for TypeAsSynonymFilter.

WordDelimiterFilter

class ~~WordDelimiterFilter~~ : TokenFilter

Splits words into subwords and performs optional transformations on subword groups.

WordDelimiterFilterFactory

class ~~WordDelimiterFilterFactory~~ : TokenFilterFactory, ResourceLoaderAware

Factory for WordDelimiterFilter.

WordDelimiterGraphFilter

class WordDelimiterGraphFilter : TokenFilter

Splits words into subwords and performs optional transformations on subword groups, producing a correct token graph so that e.g. PhraseQuery can work correctly when this filter is used in the search-time analyzer. Unlike the deprecated WordDelimiterFilter, this token filter produces a correct token graph as output. However, it cannot consume an input token graph correctly. Processing is suppressed by KeywordAttribute.isKeyword=true.

WordDelimiterGraphFilterFactory

class WordDelimiterGraphFilterFactory : TokenFilterFactory, ResourceLoaderAware

Factory for WordDelimiterGraphFilter.

WordDelimiterIterator

class WordDelimiterIterator(charTypeTable: ByteArray, val splitOnCaseChange: Boolean, val splitOnNumerics: Boolean, val stemEnglishPossessive: Boolean)

A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
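A rough sketch of subword splitting, reusing the `splitOnCaseChange` and `splitOnNumerics` parameter names from the constructor. `splitSubwords` is hypothetical and covers only three assumed rules: non-alphanumeric characters act as delimiters, a lower-to-upper case transition starts a new subword, and a letter/digit transition starts a new subword. The actual rule set is richer (for example, it is driven by `charTypeTable`).

```kotlin
// Splits text into subwords at delimiters, case changes, and letter/digit boundaries.
fun splitSubwords(
    text: String,
    splitOnCaseChange: Boolean = true,
    splitOnNumerics: Boolean = true
): List<String> {
    val parts = mutableListOf<String>()
    val current = StringBuilder()
    fun flush() {
        if (current.isNotEmpty()) {
            parts.add(current.toString())
            current.clear()
        }
    }
    for (ch in text) {
        if (!ch.isLetterOrDigit()) { // delimiter: close the current subword
            flush()
            continue
        }
        val prev = current.lastOrNull()
        val caseChange = splitOnCaseChange && prev != null &&
            prev.isLowerCase() && ch.isUpperCase()
        val numericChange = splitOnNumerics && prev != null &&
            prev.isLetter() != ch.isLetter()
        if (caseChange || numericChange) flush()
        current.append(ch)
    }
    flush()
    return parts
}
```

Under these assumptions, `PowerShot500-42` is split into `Power`, `Shot`, `500`, and `42`; disabling both flags leaves only the hyphen as a split point.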