Package-level declarations

Types

Folds all Unicode digits in :General_Category=Decimal_Number: to Basic Latin digits (0-9).
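
The folding described above can be illustrated with a small standalone sketch. It assumes a Kotlin/JVM target and does not use this package's API; `foldDecimalDigits` is a hypothetical helper name introduced only for illustration.

```kotlin
// Hypothetical helper (not part of the documented package): maps every code point
// whose general category is Decimal_Number to its Basic Latin digit.
fun foldDecimalDigits(text: String): String {
    val out = StringBuilder(text.length)
    var i = 0
    while (i < text.length) {
        val cp = text.codePointAt(i)
        if (Character.getType(cp) == Character.DECIMAL_DIGIT_NUMBER.toInt()) {
            // Character.digit yields the decimal value (0-9) of a Decimal_Number code point.
            out.append('0' + Character.digit(cp, 10))
        } else {
            out.appendCodePoint(cp)
        }
        // Advance by one or two chars so supplementary code points are handled.
        i += Character.charCount(cp)
    }
    return out.toString()
}
```

For example, under these assumptions `foldDecimalDigits("\u0663\u0664 items")` (Arabic-Indic digits three and four) returns `"34 items"`.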

Converts an incoming graph token stream, such as one from SynonymGraphFilter, into a flat form so that all nodes form a single linear chain with no side paths.

"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.

Emits the entire input as a single token.

A LetterTokenizer is a tokenizer that divides text at non-letters. That is, it defines tokens as maximal strings of adjacent letters, as defined by the Character.isLetter() predicate.
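
As a rough illustration of "maximal strings of adjacent letters", the following standalone sketch splits a string the same way. It is not the tokenizer's implementation: `letterTokens` is a hypothetical name, and the sketch works on UTF-16 `Char` values via Kotlin's `Char.isLetter()` rather than on full code points.

```kotlin
// Hypothetical helper: collects maximal runs of letters, dropping everything else.
fun letterTokens(text: String): List<String> {
    val tokens = mutableListOf<String>()
    val current = StringBuilder()
    for (ch in text) {
        if (ch.isLetter()) {
            current.append(ch)
        } else if (current.isNotEmpty()) {
            // A non-letter ends the current run of letters.
            tokens += current.toString()
            current.clear()
        }
    }
    // Flush a trailing run that reaches the end of the input.
    if (current.isNotEmpty()) tokens += current.toString()
    return tokens
}
```

With this sketch, `letterTokens("don't stop-42now")` yields `["don", "t", "stop", "now"]`: the apostrophe, hyphen, digits, and space all act as token boundaries.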

Normalizes token text to lower case.

Removes stop words from a token stream.
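
Conceptually, stop-word removal is a filter over the token stream. The sketch below models the stream as a plain `Sequence<String>`, which is an assumption made for illustration; `removeStopWords` is a hypothetical name and not this package's API.

```kotlin
// Hypothetical helper: lazily drops every token that appears in the stop-word set.
fun removeStopWords(tokens: Sequence<String>, stopWords: Set<String>): Sequence<String> =
    tokens.filter { it !in stopWords }
```

For instance, `removeStopWords(sequenceOf("the", "quick", "fox"), setOf("the", "a")).toList()` gives `["quick", "fox"]`. Matching here is exact and case-sensitive, so a lower-casing step would typically run first.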

Removes tokens whose types appear in a set of blocked types from a token stream.

A UnicodeWhitespaceTokenizer is a tokenizer that divides text at whitespace, as defined by Unicode's WHITESPACE property. Adjacent sequences of non-whitespace characters form tokens.

Normalizes token text to UPPER CASE.

class WhitespaceAnalyzer(maxTokenLength: Int = WhitespaceTokenizer.DEFAULT_MAX_WORD_LEN) : Analyzer

An Analyzer that uses WhitespaceTokenizer.

A tokenizer that divides text at whitespace characters. Note: the whitespace definition it uses explicitly excludes the non-breaking space. Adjacent sequences of non-whitespace characters form tokens.
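
The non-breaking-space behavior can be seen in a small standalone sketch. It assumes a Kotlin/JVM target and uses `java.lang.Character.isWhitespace` as a stand-in whitespace definition, chosen because it is one definition that excludes U+00A0; whether the tokenizer uses exactly this predicate is an assumption. `whitespaceTokens` is a hypothetical name.

```kotlin
// Hypothetical helper: splits at chars for which java.lang.Character.isWhitespace is true.
// Note that Kotlin's own Char.isWhitespace() is deliberately not used here, because on
// the JVM it also treats the non-breaking space as whitespace.
fun whitespaceTokens(text: String): List<String> {
    val tokens = mutableListOf<String>()
    var start = -1 // index where the current token began, or -1 if outside a token
    for ((i, ch) in text.withIndex()) {
        if (Character.isWhitespace(ch)) {
            if (start >= 0) {
                tokens += text.substring(start, i)
                start = -1
            }
        } else if (start < 0) {
            start = i
        }
    }
    // Flush a token that runs to the end of the input.
    if (start >= 0) tokens += text.substring(start)
    return tokens
}
```

Under these assumptions, `whitespaceTokens("a\u00A0b c\td")` returns three tokens: `"a\u00A0b"`, `"c"`, and `"d"`. The non-breaking space stays inside the first token, whereas a tokenizer based on Unicode's White_Space property would split there.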