ClassicTokenizer

class ClassicTokenizer : Tokenizer

A grammar-based tokenizer constructed with JFlex.

This should be a good tokenizer for most European-language documents:

Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
Recognizes email addresses and internet hostnames as one token.
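The rules above can be observed by running a short string through the tokenizer. The following is a minimal sketch, not the library's own example: `classicTokens` is a hypothetical helper, and the import paths for `CharTermAttribute` and `StringReader` are assumptions based on how this port mirrors Lucene's packages. Only the calls listed on this page (`addAttribute`, `setReader`, `reset`, `incrementToken`, `end`, `close`) are taken from the source.

```kotlin
import org.gnit.lucenekmp.analysis.classic.ClassicTokenizer
// Assumed package paths; adjust them to match your version of the library.
import org.gnit.lucenekmp.analysis.tokenattributes.CharTermAttribute
import org.gnit.lucenekmp.jdkport.StringReader

// Hypothetical helper: collects the text of every token ClassicTokenizer emits.
fun classicTokens(text: String): List<String> {
    val tokenizer = ClassicTokenizer()
    // Register the term attribute up front; it is refreshed on each incrementToken().
    val term = tokenizer.addAttribute(CharTermAttribute::class)
    val tokens = mutableListOf<String>()
    tokenizer.setReader(StringReader(text))
    try {
        tokenizer.reset()
        while (tokenizer.incrementToken()) {
            tokens.add(term.toString())
        }
        tokenizer.end()
    } finally {
        tokenizer.close()
    }
    return tokens
}

fun main() {
    // An email address, a hyphenated token containing a number, and a hostname.
    println(classicTokens("Mail bob@example.com about model XL-1500 on example.org"))
}
```

Following the usual Lucene consumer workflow, the sketch calls `reset()` before the first `incrementToken()` and always closes the tokenizer, even if tokenization fails partway through.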

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

ClassicTokenizer was named StandardTokenizer in Lucene versions prior to 3.1. As of 3.1, org.gnit.lucenekmp.analysis.standard.StandardTokenizer implements Unicode text segmentation, as specified by UAX #29.

Constructors

ClassicTokenizer

constructor()

Creates a new instance of the org.gnit.lucenekmp.analysis.classic.ClassicTokenizer. Attaches the input to the newly created JFlex scanner.

constructor(factory: AttributeFactory)

Creates a new ClassicTokenizer with a given org.gnit.lucenekmp.util.AttributeFactory.

Types

object Companion

Properties

attributeClassesIterator

val attributeClassesIterator: Iterator<Any>

attributeFactory

val attributeFactory: AttributeFactory

attributeImplsIterator

val attributeImplsIterator: Iterator<AttributeImpl>

Functions

fun <T : Attribute> addAttribute(attClass: KClass<T>): T

addAttributeImpl

fun addAttributeImpl(att: AttributeImpl)

fun captureState(): AttributeSource.State?

clearAttributes

fun clearAttributes()

cloneAttributes

fun cloneAttributes(): AttributeSource

open override fun close()

fun copyTo(target: AttributeSource)

open override fun end()

fun endAttributes()

open operator override fun equals(obj: Any?): Boolean

fun <T : Attribute> getAttribute(attClass: KClass<T>): T?

getMaxTokenLength

fun getMaxTokenLength(): Int

fun hasAttribute(attClass: KClass<out Attribute>): Boolean

fun hasAttributes(): Boolean

open override fun hashCode(): Int

open override fun incrementToken(): Boolean

reflectAsString

fun reflectAsString(prependAttClass: Boolean): String

fun reflectWith(reflector: AttributeReflector)

removeAllAttributes

fun removeAllAttributes()

open override fun reset()

fun restoreState(state: AttributeSource.State?)

setMaxTokenLength

fun setMaxTokenLength(length: Int)

Sets the maximum allowed token length. Any token longer than this is skipped.
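As a rough illustration of the skipping behavior, the sketch below lowers the limit and tokenizes a string containing one over-long word. `tokensWithLimit` is a hypothetical helper, and the import paths for `CharTermAttribute` and `StringReader` are assumptions rather than something this page confirms.

```kotlin
import org.gnit.lucenekmp.analysis.classic.ClassicTokenizer
// Assumed package paths; adjust them to match your version of the library.
import org.gnit.lucenekmp.analysis.tokenattributes.CharTermAttribute
import org.gnit.lucenekmp.jdkport.StringReader

// Hypothetical helper: tokenizes text, skipping tokens longer than maxLength.
fun tokensWithLimit(text: String, maxLength: Int): List<String> {
    val tokenizer = ClassicTokenizer()
    tokenizer.setMaxTokenLength(maxLength)
    val term = tokenizer.addAttribute(CharTermAttribute::class)
    val tokens = mutableListOf<String>()
    tokenizer.setReader(StringReader(text))
    try {
        tokenizer.reset()
        while (tokenizer.incrementToken()) {
            tokens.add(term.toString())
        }
        tokenizer.end()
    } finally {
        tokenizer.close()
    }
    return tokens
}

fun main() {
    // "enormous" has 8 characters, so a limit of 5 should drop it.
    println(tokensWithLimit("tiny enormous word", 5))
}
```

Over-long tokens are dropped rather than truncated, so a caller that needs to keep them would have to raise the limit instead.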

fun setReader(input: Reader)

open override fun toString(): String