ClassicTokenizer

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

  • Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
  • Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
  • Recognizes email addresses and internet hostnames as one token.
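
The hyphen rule above can be illustrated with a small standalone sketch. It is not the JFlex grammar and does not call the library: `splitAtHyphens` is a hypothetical helper, and it assumes that any digit in the word is enough to treat it as a product number, which is coarser than what the real grammar checks.

```kotlin
// Simplified illustration of the hyphen rule only; this is not the JFlex grammar.
// `splitAtHyphens` is a hypothetical helper, not part of ClassicTokenizer's API.
fun splitAtHyphens(word: String): List<String> {
    // Assumption: a digit anywhere in the word marks it as a product number,
    // so the word is kept whole.
    if (word.any { it.isDigit() }) return listOf(word)
    // Otherwise break at each hyphen and drop empty pieces (e.g. from "--").
    return word.split('-').filter { it.isNotEmpty() }
}

fun main() {
    println(splitAtHyphens("state-of-the-art")) // [state, of, the, art]
    println(splitAtHyphens("XL-5000"))          // [XL-5000]
}
```

The sketch only covers single words; punctuation handling, email addresses, and hostnames are left to the actual grammar.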

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, org.gnit.lucenekmp.analysis.standard.StandardTokenizer implements Unicode text segmentation, as specified by UAX #29.

Constructors

constructor()

Creates a new instance of the org.gnit.lucenekmp.analysis.classic.ClassicTokenizer. Attaches the input to the newly created JFlex scanner.

constructor(factory: AttributeFactory)

Creates a new ClassicTokenizer with a given org.gnit.lucenekmp.util.AttributeFactory.

Types

object Companion

Properties

Functions

fun <T : Attribute> addAttribute(attClass: KClass<T>): T
open override fun close()
fun copyTo(target: AttributeSource)
open override fun end()
open operator override fun equals(obj: Any?): Boolean
fun <T : Attribute> getAttribute(attClass: KClass<T>): T?
fun hasAttribute(attClass: KClass<out Attribute>): Boolean
open override fun hashCode(): Int
open override fun incrementToken(): Boolean
fun reflectAsString(prependAttClass: Boolean): String
open override fun reset()
fun setMaxTokenLength(length: Int)

Sets the maximum allowed token length. Any token longer than this is skipped.
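
To make the "skipped" wording concrete, here is a minimal standalone sketch of the effect on a list of tokens that has already been split. `dropOverlongTokens` is a hypothetical helper, not a library function; it only mirrors the documented behaviour that overlong tokens are dropped rather than truncated.

```kotlin
// Hypothetical helper that mirrors the documented behaviour on a plain list of strings.
// It is not part of ClassicTokenizer's API.
fun dropOverlongTokens(tokens: List<String>, maxTokenLength: Int): List<String> {
    // Tokens longer than the limit are skipped entirely, not cut down to the limit.
    // A token whose length equals the limit is kept.
    return tokens.filter { it.length <= maxTokenLength }
}

fun main() {
    val tokens = listOf("a", "quick", "tokenizer")
    println(dropOverlongTokens(tokens, 5)) // [a, quick]
}
```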

fun setReader(input: Reader)
open override fun toString(): String