
StandardTokenizer

A grammar-based tokenizer constructed with JFlex.

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Constructors

StandardTokenizer

constructor()

Creates a new instance of org.gnit.lucenekmp.analysis.standard.StandardTokenizer and attaches the input to the newly created JFlex scanner.

constructor(factory: AttributeFactory)

Creates a new StandardTokenizer with the given org.gnit.lucenekmp.util.AttributeFactory.

Types

Companion

object Companion

Properties

attributeClassesIterator

val attributeClassesIterator: Iterator<Any>

attributeFactory

val attributeFactory: AttributeFactory

attributeImplsIterator

val attributeImplsIterator: Iterator<AttributeImpl>

Functions

addAttribute

fun <T : Attribute> addAttribute(attClass: KClass<T>): T

The caller must pass in a KClass value for an Attribute type. This method first checks whether an instance of that class is already in this AttributeSource and, if so, returns it. Otherwise a new instance is created, added to this AttributeSource, and returned.
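The get-or-create behaviour described above can be pictured with a small, self-contained sketch. It does not touch the real AttributeSource: AttributeRegistrySketch, getOrAdd, and has are hypothetical names, and the create lambda is an assumption standing in for whatever the AttributeFactory does to build an instance.

```kotlin
import kotlin.reflect.KClass

// Hypothetical registry, not part of lucenekmp: it only models
// "return the existing instance for this class, else create and store one".
class AttributeRegistrySketch {
    private val instances = mutableMapOf<KClass<*>, Any>()

    // Returns the stored instance for `type`, creating it on first use.
    @Suppress("UNCHECKED_CAST")
    fun <T : Any> getOrAdd(type: KClass<T>, create: () -> T): T =
        instances.getOrPut(type) { create() } as T

    // True if an instance for `type` has already been registered.
    fun has(type: KClass<*>): Boolean = type in instances
}
```

Calling getOrAdd twice with the same class returns the same object, which is why several components that ask for the same attribute end up sharing one instance.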

addAttributeImpl

fun addAttributeImpl(att: AttributeImpl)

Expert: Adds a custom AttributeImpl instance with one or more Attribute interfaces.

captureState

fun captureState(): AttributeSource.State?

Captures the state of all Attributes. The return value can be passed to restoreState to restore the state of this or another AttributeSource.
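A minimal sketch of the capture-then-restore idea, using a hypothetical StateSketch class rather than the real AttributeSource. The null handling (no state when nothing is stored, and restoring null being a no-op) is an assumption suggested by the nullable State? types in the signatures, not something this page confirms.

```kotlin
// Hypothetical stand-in, not part of lucenekmp: stores named values and
// supports taking a snapshot and copying it back later.
class StateSketch {
    private val values = mutableMapOf<String, String>()

    operator fun set(key: String, value: String) { values[key] = value }
    operator fun get(key: String): String? = values[key]

    // Returns an independent copy of the current values, or null if empty (assumed).
    fun captureState(): Map<String, String>? =
        if (values.isEmpty()) null else values.toMap()

    // Copies the snapshot's values back; a null snapshot is ignored (assumed).
    fun restoreState(state: Map<String, String>?) {
        if (state == null) return
        values.putAll(state)
    }
}
```

Because the snapshot is a copy, later changes to the source do not alter it, so it can be restored after the stream has moved on to other tokens.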

clearAttributes

fun clearAttributes()

Resets all Attributes in this AttributeSource by calling AttributeImpl.clear on each Attribute implementation.

cloneAttributes

fun cloneAttributes(): AttributeSource

Performs a clone of all AttributeImpl instances, returned in a new AttributeSource instance. This method can be used, for example, to create another TokenStream with exactly the same attributes (using the AttributeSource constructor). You can also use it as a (non-performant) replacement for captureState if you need to inspect or modify the captured state.

close

open override fun close()

Releases resources associated with this stream.

copyTo

fun copyTo(target: AttributeSource)

Copies the contents of this AttributeSource to the given target AttributeSource. The given instance has to provide all Attributes this instance contains. The actual attribute implementations must be identical in both AttributeSource instances; ideally both AttributeSource instances should use the same AttributeFactory. You can use this method as a replacement for restoreState if you use cloneAttributes instead of captureState.

end

open override fun end()

This method is called by the consumer after the last token has been consumed, that is, after incrementToken has returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.

endAttributes

fun endAttributes()

Resets all Attributes in this AttributeSource by calling AttributeImpl.end on each Attribute implementation.

equals

open operator override fun equals(obj: Any?): Boolean

getAttribute

fun <T : Attribute> getAttribute(attClass: KClass<T>): T?

Returns the instance of the passed-in Attribute contained in this AttributeSource.

getMaxTokenLength

fun getMaxTokenLength(): Int

Returns the current maximum token length.

hasAttribute

fun hasAttribute(attClass: KClass<out Attribute>): Boolean

The caller must pass in a KClass value for an Attribute type. Returns true if and only if this AttributeSource contains the passed-in Attribute.

hasAttributes

fun hasAttributes(): Boolean

Returns true if and only if this AttributeSource has any attributes.

hashCode

open override fun hashCode(): Int

incrementToken

open override fun incrementToken(): Boolean

Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.
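The consumer workflow implied by the reset, incrementToken, end, and close descriptions on this page can be sketched as follows. This is a self-contained illustration, not lucenekmp code: SketchTokenStream, WhitespaceSketch, and consume are hypothetical names, the term property stands in for a term attribute, and the splitter is a toy whitespace splitter rather than the UAX #29 grammar.

```kotlin
// Hypothetical stand-in for the token stream contract; only the call order matters.
interface SketchTokenStream {
    val term: String            // stand-in for a term attribute
    fun reset()
    fun incrementToken(): Boolean
    fun end()
    fun close()
}

// Toy implementation that splits on whitespace (NOT the Unicode word-break rules).
class WhitespaceSketch(private val text: String) : SketchTokenStream {
    private var parts: List<String> = emptyList()
    private var index = 0
    private var current = ""
    override val term: String get() = current

    override fun reset() {
        parts = text.split(Regex("\\s+")).filter { it.isNotEmpty() }
        index = 0
    }

    override fun incrementToken(): Boolean {
        if (index >= parts.size) return false
        current = parts[index++]    // update the "attribute" for the next token
        return true
    }

    override fun end() {}
    override fun close() {}
}

// Consumer loop: reset, advance until incrementToken returns false, end, then close.
fun consume(stream: SketchTokenStream): List<String> {
    val terms = mutableListOf<String>()
    try {
        stream.reset()
        while (stream.incrementToken()) terms.add(stream.term)
        stream.end()
    } finally {
        stream.close()
    }
    return terms
}
```

The try/finally is a design choice in this sketch so that close runs even if tokenization throws.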

reflectAsString

fun reflectAsString(prependAttClass: Boolean): String

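This method returns the current attribute values as a string in one of the following formats by calling the reflectWith method:

If prependAttClass is true: "AttributeClass#key=value,AttributeClass#key=value"
If prependAttClass is false: "key=value,key=value"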

reflectWith

fun reflectWith(reflector: AttributeReflector)

This method is for introspection of attributes; it should simply add the key/value pairs this AttributeSource holds to the given AttributeReflector.

removeAllAttributes

fun removeAllAttributes()

Removes all attributes and their implementations from this AttributeSource.

reset

open override fun reset()

This method is called by a consumer before it begins consumption using incrementToken.

restoreState

fun restoreState(state: AttributeSource.State?)

Restores this state by copying the values of all attribute implementations that this state contains into the attribute implementations of the targetStream. The targetStream must contain a corresponding instance for each argument contained in this state (e.g., it is not possible to restore the state of an AttributeSource containing a TermAttribute into an AttributeSource using a Token instance as implementation).

setMaxTokenLength

fun setMaxTokenLength(length: Int)

Sets the maximum allowed token length. Tokens longer than this are chopped at this length and emitted as multiple tokens. If you need to skip such long tokens instead, you could increase this maximum and then use LengthFilter to remove them. The default is StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH.
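The chopping behaviour described above can be illustrated with a tiny, self-contained sketch. It does not call the tokenizer: chopToken is a hypothetical helper that only models how one over-long token would be split at the maximum length. As a simplifying assumption it counts UTF-16 chars and ignores surrogate pairs.

```kotlin
// Hypothetical helper, not part of StandardTokenizer: splits one over-long
// token into consecutive pieces of at most maxTokenLength chars.
fun chopToken(token: String, maxTokenLength: Int): List<String> {
    require(maxTokenLength >= 1) { "maxTokenLength must be at least 1" }
    return token.chunked(maxTokenLength)
}
```

For example, chopToken("abcdefg", 3) yields the pieces "abc", "def", and "g", mirroring how a seven-character token would be emitted as three tokens under a maximum length of 3.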

setReader

fun setReader(input: Reader)

Expert: Set a new reader on the Tokenizer. Typically, an analyzer (in its tokenStream method) will use this to re-use a previously created tokenizer.

toString

open override fun toString(): String

Returns a string consisting of the class's simple name, the hex representation of the identity hash code, and the current reflection of all attributes.