Package-level declarations
Types
This class converts alphabetic, numeric, and symbolic Unicode characters that are not among the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, where such equivalents exist.
Factory for ASCIIFoldingFilter.
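To make the idea of folding concrete, here is a minimal sketch in plain Java. It does not use this package's API and is not how ASCIIFoldingFilter is implemented: it swaps in a different technique, Unicode NFD decomposition followed by removal of combining marks, which covers only characters that decompose into an ASCII base letter plus marks. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the library.
final class FoldSketch {
    // Decompose accented characters (NFD), then strip the combining marks.
    // Characters without a canonical decomposition (for example 'ø') pass
    // through unchanged, so this is narrower than a full folding table.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

For example, `FoldSketch.fold("caf\u00e9")` yields `cafe`.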
A filter to apply normal capitalization rules to Tokens. It will make the first letter capital and the rest lower case.
Factory for CapitalizationFilter.
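The basic capitalization rule described above (first letter upper case, the rest lower case) can be sketched in plain Java as follows. This is an illustration of the rule on a single term only; it is independent of the library, ignores any configuration options the real filter may offer, and uses hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper, not part of the library.
final class CapitalizeSketch {
    // Upper-case the first code point and lower-case the remainder.
    static String capitalize(String term) {
        if (term.isEmpty()) {
            return term;
        }
        int firstEnd = term.offsetByCodePoints(0, 1); // end index of the first code point
        return term.substring(0, firstEnd).toUpperCase(Locale.ROOT)
            + term.substring(firstEnd).toLowerCase(Locale.ROOT);
    }
}
```

For example, `CapitalizeSketch.capitalize("hELLO")` yields `Hello`.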
Removes words that are too long or too short from the stream. Length is measured in Unicode code points.
Factory for CodepointCountFilter.
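CodepointCountFilter and LengthFilter share the same summary, so the difference is easy to miss: one counts Unicode code points, the other counts UTF-16 code units. The sketch below, in plain Java with hypothetical names, shows how the two measures diverge for characters outside the Basic Multilingual Plane. Treating both bounds as inclusive is an assumption of this sketch.

```java
// Hypothetical helper, not part of the library.
final class LengthSketch {
    // Keep the term if its length in code points lies in [min, max].
    static boolean keepByCodePoints(String term, int min, int max) {
        int n = term.codePointCount(0, term.length());
        return n >= min && n <= max;
    }

    // Keep the term if its length in UTF-16 code units lies in [min, max].
    static boolean keepByCodeUnits(String term, int min, int max) {
        int n = term.length();
        return n >= min && n <= max;
    }
}
```

The string `"a\uD83D\uDE00"` (the letter a followed by one emoji) has 2 code points but 3 code units, so with bounds 1 and 2 the first check keeps it and the second rejects it.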
Concatenates/Joins every incoming token with a separator into one output token for every path through the token stream (which is a graph). In simple cases this yields one token, but in the presence of any tokens with a zero positionIncrement (e.g. synonyms) it will be more. This filter uses the token bytes, position increment, and position length of the incoming stream. Other attributes are not used or manipulated.
Factory for ConcatenateGraphFilter.
A TokenStream that takes an array of input TokenStreams as sources, and concatenates them together.
Allows skipping TokenFilters based on the current set of attributes.
Abstract parent class for analysis factories that create ConditionalTokenFilter instances.
Filters all tokens that cannot be recognized as a date.
Factory for DateRecognizerFilter.
Factory for DelimitedTermFrequencyTokenFilter. The field must have omitPositions=true.
Allows Tokens with a given combination of flags to be dropped. If all of the specified flags are present, the token is dropped; otherwise it is retained.
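The "all specified flags are present" rule is a bitmask test. A minimal sketch in plain Java, assuming flags are stored as bits of an int; the class and method names are hypothetical and this is not the library's code:

```java
// Hypothetical helper, not part of the library.
final class FlagsSketch {
    // Drop the token only when every bit in dropFlags is also set in tokenFlags.
    static boolean shouldDrop(int tokenFlags, int dropFlags) {
        return (tokenFlags & dropFlags) == dropFlags;
    }
}
```

With `dropFlags = 0b0101`, a token carrying `0b0111` is dropped because both required bits are set, while a token carrying `0b0100` is retained because one of them is missing.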
Provides a filter that will drop tokens matching a set of flags. This might be useful if you have custom filters that identify tokens to be removed, but those filters need to run before other filters that want to see the token that will eventually be dropped. Alternatively, you might have separate flag-setting filters and then remove tokens that match a particular combination of the flags they set.
In Solr this might be configured such as
An always exhausted token stream.
This filter outputs a single token that is the concatenation of the sorted and de-duplicated set of input tokens. This can be useful for clustering and linking use cases.
Factory for FingerprintFilter.
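The fingerprint idea (sort, de-duplicate, concatenate) can be sketched in a few lines of plain Java. The names are hypothetical, the separator is passed in explicitly, and the natural String ordering used here is an assumption of the sketch that may differ from the ordering the filter applies.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of the library.
final class FingerprintSketch {
    // A TreeSet both sorts and de-duplicates; the result is joined into one token.
    static String fingerprint(List<String> tokens, char separator) {
        return String.join(String.valueOf(separator), new TreeSet<>(tokens));
    }
}
```

For the input tokens `the`, `quick`, `the`, `brown` and a space separator, the result is the single token `brown quick the`, regardless of the input order.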
A filter to correct offsets that illegally go backwards.
Factory for FixBrokenOffsetsFilter.
When plain text is extracted from documents, many words are often hyphenated and broken across two lines. This is common in documents that use narrow text columns, such as newsletters. To increase search efficiency, this filter puts hyphenated words that were broken across two lines back together. This filter should be used at indexing time only. Example field definition in schema.xml:
Factory for HyphenatedWordsFilter.
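As an illustration of the joining behaviour of HyphenatedWordsFilter, the following plain-Java sketch merges any token that ends with a hyphen into the token that follows it. It works on a list of strings rather than a token stream, uses hypothetical names, and its handling of a dangling fragment at the end of the input (emitted without the hyphen) is a choice made for this sketch only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the library.
final class HyphenJoinSketch {
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String t : tokens) {
            if (t.endsWith("-")) {
                // Remember the fragment without its trailing hyphen.
                pending.append(t, 0, t.length() - 1);
            } else {
                // Complete the word (or emit the token as-is if nothing is pending).
                out.add(pending.append(t).toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            out.add(pending.toString()); // dangling fragment at end of input
        }
        return out;
    }
}
```

Given the tokens `ecologi-`, `cal`, `devel-`, `op-`, `ment`, the sketch returns `ecological` and `development`.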
A TokenFilter that only keeps tokens with text contained in the required words. This filter behaves like the inverse of StopFilter.
Factory for KeepWordFilter.
Marks terms as keywords via the KeywordAttribute.
Factory for KeywordMarkerFilter.
This TokenFilter emits each incoming token twice: once as a keyword and once as a non-keyword, in other words once with KeywordAttribute.isKeyword set to true and once with it set to false. This is useful when combined with a stem filter that respects the KeywordAttribute, to index both the stemmed and the unstemmed version of a term into the same field.
Factory for KeywordRepeatFilter.
Removes words that are too long or too short from the stream. Length is measured in UTF-16 code units.
Factory for LengthFilter.
This Analyzer limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.gnit.lucenekmp.index.IndexWriter.
This TokenFilter limits the number of tokens while indexing. It is a replacement for the maximum field length setting inside org.gnit.lucenekmp.index.IndexWriter.
Factory for LimitTokenCountFilter.
Lets tokens pass through as long as their start offset is <= a configured limit. The first token whose start offset exceeds the limit is not emitted and ends the stream. This can be useful to limit highlighting, for example.
Factory for LimitTokenOffsetFilter.
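The cut-off rule of LimitTokenOffsetFilter can be sketched in plain Java as below. The sketch assumes the reading in which tokens are emitted while their start offset is less than or equal to the limit, and the first token beyond the limit ends the stream. It operates on an array of start offsets rather than a token stream, and the names are hypothetical.

```java
// Hypothetical helper, not part of the library.
final class OffsetLimitSketch {
    // Returns how many leading tokens would be emitted before the stream ends.
    static int emittedCount(int[] startOffsets, int maxStartOffset) {
        int emitted = 0;
        for (int start : startOffsets) {
            if (start > maxStartOffset) {
                break; // this token is not emitted and nothing after it is either
            }
            emitted++;
        }
        return emitted;
    }
}
```

For start offsets 0, 6, 12, 20 and a limit of 12, three tokens are emitted; with a limit of 5, only the first one is.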
This TokenFilter limits its emitted tokens to those with positions that are not greater than the configured limit.
Factory for LimitTokenPositionFilter.
Marks terms as keywords via the org.gnit.lucenekmp.analysis.tokenattributes.KeywordAttribute. Each token that matches the provided pattern is marked as a keyword by setting org.gnit.lucenekmp.analysis.tokenattributes.KeywordAttribute.isKeyword to true.
This analyzer is used to facilitate scenarios where different fields require different analysis techniques. Use the Map argument in PerFieldAnalyzerWrapper to add non-default analyzers for fields.
A ConditionalTokenFilter that only applies its wrapped filters to tokens that are not contained in a protected set.
Factory for ProtectedTermFilter.
A TokenFilter that filters out Tokens with the same position and term text as a previous token in the stream.
Factory for RemoveDuplicatesTokenFilter.
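One way to picture the duplicate check of RemoveDuplicatesTokenFilter: a token shares a position with its predecessor when its position increment is 0, so the terms seen at the current position can be tracked in a set that is cleared whenever the position advances. The plain-Java sketch below follows that idea. It takes parallel arrays of terms and position increments instead of a token stream, uses hypothetical names, and is not the library's implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the library.
final class DedupSketch {
    static List<String> dedup(String[] terms, int[] posIncrements) {
        List<String> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (int i = 0; i < terms.length; i++) {
            if (posIncrements[i] > 0) {
                seenAtPosition.clear(); // moved to a new position
            }
            // add() returns false when the term was already seen at this position.
            if (seenAtPosition.add(terms[i])) {
                out.add(terms[i]);
            }
        }
        return out;
    }
}
```

For the terms `fast`, `quick`, `fast`, `fast` with position increments 1, 0, 0, 1, the third token is dropped as a duplicate at the same position, while the fourth is kept because it sits at a new position.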
This filter folds Scandinavian characters åÅäæÄÆ->a and öÖøØ->o. It also discriminates against the use of double vowels aa, ae, ao, oe and oo, leaving just the first one.
Factory for ScandinavianFoldingFilter.
This filter normalizes the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
Factory for ScandinavianNormalizationFilter.
This Normalizer does the heavy lifting for a set of Scandinavian normalization filters, normalizing the use of the interchangeable Scandinavian characters æÆäÄöÖøØ and folded variants (aa, ao, ae, oe and oo) by transforming them to åÅæÆøØ.
Marks terms as keywords via the KeywordAttribute when they exist in the provided set.
Provides the ability to override any KeywordAttribute-aware stemmer with custom dictionary-based stemming.
Factory for StemmerOverrideFilter.
Trims leading and trailing whitespace from Tokens in the stream.
Factory for TrimFilter.
A token filter that truncates terms to a specific length. Fixed-prefix truncation, as a stemming method, produces good results for the Turkish language. It is reported that F5, which uses the first 5 characters, produced the best results in Information Retrieval on Turkish Texts.
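Fixed-prefix truncation is simple enough to show directly. The plain-Java sketch below keeps at most the first `prefixLength` characters of a term; it counts UTF-16 code units, which is an assumption of this sketch, and the names are hypothetical.

```java
// Hypothetical helper, not part of the library.
final class TruncateSketch {
    // Terms no longer than prefixLength are returned unchanged.
    static String truncate(String term, int prefixLength) {
        return term.length() <= prefixLength ? term : term.substring(0, prefixLength);
    }
}
```

With the F5 setting mentioned above, `TruncateSketch.truncate("kitaplar", 5)` yields `kitap`, and a short term such as `ev` is left as it is.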
Factory for TruncateTokenFilter. The following type is recommended for "diacritics-insensitive search" for Turkish.
Adds the TypeAttribute.type as a synonym, i.e. another token at the same position, optionally with a specified prefix prepended, optionally transferring flags, and optionally ignoring some types. See TypeAsSynonymFilterFactory for full details.
Factory for TypeAsSynonymFilter.
Splits words into subwords and performs optional transformations on subword groups.
Factory for WordDelimiterFilter.
Splits words into subwords and performs optional transformations on subword groups, producing a correct token graph so that e.g. PhraseQuery can work correctly when this filter is used in the search-time analyzer. Unlike the deprecated WordDelimiterFilter, this token filter produces a correct token graph as output. However, it cannot consume an input token graph correctly. Processing is suppressed by KeywordAttribute.isKeyword=true.
Factory for WordDelimiterGraphFilter.
A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.