Package-level declarations

Types

Abstract parent class for analysis factories TokenizerFactory, TokenFilterFactory and CharFilterFactory.

class AnalysisSPILoader<S : AbstractAnalysisFactory>(clazz: KClass<S>, classloader: ClassLoader? = null)

Helper class for loading named SPIs from classpath (e.g. Tokenizers, TokenStreams).

expect object AnalysisSPIReflection

abstract class Analyzer : AutoCloseable

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

abstract class AnalyzerWrapper : Analyzer

Extension to Analyzer suitable for Analyzers which wrap other Analyzers.

Converts an Automaton into a TokenStream.

This class can be used if the token attributes of a TokenStream are intended to be consumed more than once. It caches all token attribute states locally in a List on the first call to .incrementToken. Subsequent calls use the cache.
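
The caching behavior described above can be illustrated with a minimal, self-contained sketch. It does not use this package's TokenStream or attribute API; `CountingSource`, `CachingSketch` and `drain` are hypothetical names introduced only for illustration, and plain strings stand in for token attribute states.

```kotlin
// Hypothetical stand-in for a token source; not the library's TokenStream API.
class CountingSource(private val tokens: List<String>) {
    // Number of tokens pulled from this source so far.
    var reads = 0
        private set
    private var pos = 0

    // Returns the next token, or null when the source is exhausted.
    fun next(): String? {
        if (pos >= tokens.size) return null
        reads++
        return tokens[pos++]
    }
}

// Fills a cache on the first call to next(), then replays from it after reset().
class CachingSketch(private val source: CountingSource) {
    private var cache: MutableList<String>? = null
    private var cursor = 0

    fun next(): String? {
        val filled = cache ?: fillCache()
        return if (cursor < filled.size) filled[cursor++] else null
    }

    // Rewinds the replay position; the underlying source is not read again.
    fun reset() {
        cursor = 0
    }

    // Drains the whole source once and stores every token.
    private fun fillCache(): MutableList<String> {
        val list = mutableListOf<String>()
        while (true) {
            val token = source.next() ?: break
            list.add(token)
        }
        cache = list
        return list
    }
}

// Helper that reads a CachingSketch to the end.
fun drain(stream: CachingSketch): List<String> {
    val out = mutableListOf<String>()
    while (true) {
        out.add(stream.next() ?: break)
    }
    return out
}
```

Draining a `CachingSketch` twice, with `reset()` in between, yields the same tokens both times while `CountingSource.reads` stays at the number of tokens in the source.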

Utility class to write tokenizers or token filters.

A simple class that stores key Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the map, nor does it resize its hash table to be smaller, etc. It is designed to be quick to retrieve items by char[] keys without the necessity of converting to a String first.

A simple class that stores Strings as char[]'s in a hash table. Note that this is not a general purpose class. For example, it cannot remove items from the set, nor does it resize its hash table to be smaller, etc. It is designed to be quick to test if a char[] is in the set without the necessity of converting it to a String first.
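
As a rough illustration of why such a set is useful, the sketch below tests a slice of a char buffer for membership without building a String. It is not this package's implementation; `CharSliceSet` and its methods are hypothetical, and the hash function and growth policy are assumptions chosen for brevity. Like the class described above, it is add-only and never shrinks.

```kotlin
// Hypothetical sketch: an add-only, open-addressing set keyed by char arrays.
class CharSliceSet {
    // Table length is always a power of two so that (hash and mask) is a valid index.
    private var table = arrayOfNulls<CharArray>(8)
    var size = 0
        private set

    private fun hash(buf: CharArray, off: Int, len: Int): Int {
        var h = 0
        for (i in off until off + len) h = 31 * h + buf[i].code
        return h
    }

    private fun matches(entry: CharArray, buf: CharArray, off: Int, len: Int): Boolean {
        if (entry.size != len) return false
        for (i in 0 until len) if (entry[i] != buf[off + i]) return false
        return true
    }

    // Linear probing: returns the slot holding the key, or the first empty slot.
    private fun slot(buf: CharArray, off: Int, len: Int): Int {
        val mask = table.size - 1
        var i = hash(buf, off, len) and mask
        var entry = table[i]
        while (entry != null && !matches(entry, buf, off, len)) {
            i = (i + 1) and mask
            entry = table[i]
        }
        return i
    }

    // Membership test on a slice of a buffer; no String is allocated.
    fun contains(buf: CharArray, off: Int, len: Int): Boolean =
        table[slot(buf, off, len)] != null

    // Returns true if the text was newly added, false if it was already present.
    fun add(text: String): Boolean {
        val chars = text.toCharArray()
        val i = slot(chars, 0, chars.size)
        if (table[i] != null) return false
        table[i] = chars
        size++
        if (size * 2 > table.size) grow()
        return true
    }

    // Doubles the table and re-inserts every entry; the table never shrinks.
    private fun grow() {
        val old = table
        table = arrayOfNulls<CharArray>(old.size * 2)
        for (entry in old) {
            if (entry != null) table[slot(entry, 0, entry.size)] = entry
        }
    }
}
```

A tokenizer that already holds its term text in a char buffer could call `contains(buffer, start, length)` directly on that buffer, which is the access pattern the description above is aimed at.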

abstract class CharFilter(input: Reader) : Reader

Subclasses of CharFilter can be chained to filter a Reader. They can be used as a Reader with additional offset correction. Tokenizers will automatically use .correctOffset if a CharFilter subclass is used.
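
The idea behind offset correction can be sketched without the Reader machinery. The example below is a simplified illustration, not the library's CharFilter: `AmpersandUnescaper` is a hypothetical class that rewrites `&amp;` to `&` over a whole String and records, for each output character, the offset it came from in the original text.

```kotlin
// Hypothetical sketch: replaces "&amp;" with "&" and remembers source offsets.
class AmpersandUnescaper(private val original: String) {
    private val out = StringBuilder()
    // sourceOffsets[k] is the offset in the original text of output character k.
    private val sourceOffsets = ArrayList<Int>()

    init {
        var i = 0
        while (i < original.length) {
            sourceOffsets.add(i)
            if (original.startsWith("&amp;", i)) {
                out.append('&')
                i += 5  // skip the five characters of "&amp;"
            } else {
                out.append(original[i])
                i++
            }
        }
    }

    // The text as seen by a downstream consumer.
    val filtered: String get() = out.toString()

    // Maps an offset in the filtered text back to an offset in the original text.
    fun correctOffset(offset: Int): Int =
        if (offset < sourceOffsets.size) sourceOffsets[offset] else original.length
}
```

For the input `R&amp;D lab`, a token found at offsets 4 to 7 of the filtered text maps back to offsets 8 to 11 of the original, so highlighting against the original text still lands on `lab`.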

Abstract parent class for analysis factories that create CharFilter instances.

An analyzer wrapper that does not allow wrapping components or readers. Because wrapping is disallowed, the thread-local resources can be delegated to the delegate analyzer instead of also being allocated on this analyzer. This wrapper class is the base class of all analyzers that just delegate to another analyzer, e.g. per field name.

Abstract base class for TokenFilters that may remove tokens. You have to implement .accept and return a boolean indicating whether the current token should be preserved. .incrementToken uses this method to decide whether a token should be passed to the caller.
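
The accept-based pattern described above can be sketched as follows. This is a simplified model rather than the library's API: the real TokenStream is attribute-based, while the hypothetical `TokenSource`, `ListSource`, `FilteringSketch`, `MinLengthFilter` and `collect` below pass plain strings and use null to signal the end of the stream.

```kotlin
// Hypothetical stand-in for a token stream; null signals end of stream.
interface TokenSource {
    fun next(): String?
}

// Serves tokens from a fixed list.
class ListSource(tokens: List<String>) : TokenSource {
    private val iter = tokens.iterator()
    override fun next(): String? = if (iter.hasNext()) iter.next() else null
}

// Base class: subclasses only decide which tokens to keep.
abstract class FilteringSketch(private val input: TokenSource) : TokenSource {
    // Returns true if the token should be passed on to the caller.
    protected abstract fun accept(token: String): Boolean

    // Skips rejected tokens and returns the next accepted one, or null at the end.
    override fun next(): String? {
        var token = input.next()
        while (token != null && !accept(token)) {
            token = input.next()
        }
        return token
    }
}

// Example subclass: drops tokens shorter than minLength.
class MinLengthFilter(input: TokenSource, private val minLength: Int) : FilteringSketch(input) {
    override fun accept(token: String): Boolean = token.length >= minLength
}

// Reads a source to the end.
fun collect(source: TokenSource): List<String> =
    generateSequence { source.next() }.toList()
```

With this shape, a stop-word filter or a length filter differs only in its `accept` body; the skipping loop lives once in the base class.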

abstract class GraphTokenFilter(input: TokenStream) : TokenFilter

An abstract TokenFilter that exposes its input stream as a graph.

Normalizes token text to lower case.

Internal class to enable reuse of the string reader by

Removes stop words from a token stream.

Base class for Analyzers that need to make use of stopword sets.

A TokenFilter is a TokenStream whose input is another TokenStream.

Abstract parent class for analysis factories that create TokenFilter instances.

abstract class Tokenizer : TokenStream

A Tokenizer is a TokenStream whose input is a Reader.

Abstract parent class for analysis factories that create Tokenizer instances.

A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.

Consumes a TokenStream and creates an Automaton where the transition labels are UTF8 bytes (or Unicode code points if unicodeArcs is true) from the TermToBytesRefAttribute. Between tokens we insert POS_SEP and for holes we insert HOLE.

Loader for text files that represent a list of stopwords.