StandardTokenizer

A grammar-based tokenizer constructed with JFlex.

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer.

Constructors

constructor()

Creates a new instance of org.gnit.lucenekmp.analysis.standard.StandardTokenizer and attaches the input to the newly created JFlex scanner.

constructor(factory: AttributeFactory)

Creates a new StandardTokenizer with the given org.gnit.lucenekmp.util.AttributeFactory.

Types

object Companion

Properties

Functions

fun <T : Attribute> addAttribute(attClass: KClass<T>): T

The caller must pass in the KClass of the attribute interface. This method first checks whether an instance of that class is already in this AttributeSource and, if so, returns it. Otherwise, a new instance is created, added to this AttributeSource, and returned.
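The get-or-create behavior described above can be illustrated with a small stand-alone model. This is only a sketch: `ToyAttribute`, `ToyTermAttribute`, `ToyAttributeSource`, and the `create` parameter are hypothetical names invented for the example, and the real AttributeSource resolves implementations through its AttributeFactory rather than a caller-supplied lambda.

```kotlin
import kotlin.reflect.KClass

// Hypothetical stand-ins for Attribute and one concrete attribute.
interface ToyAttribute
class ToyTermAttribute : ToyAttribute { var term: String = "" }

// Toy model of an attribute source: at most one instance per attribute class.
class ToyAttributeSource {
    private val attributes = mutableMapOf<KClass<*>, ToyAttribute>()

    // Returns the existing instance for attClass, or creates, stores, and returns a new one.
    @Suppress("UNCHECKED_CAST")
    fun <T : ToyAttribute> addAttribute(attClass: KClass<T>, create: () -> T): T =
        attributes.getOrPut(attClass) { create() } as T

    // True only if an instance for attClass has already been added.
    fun hasAttribute(attClass: KClass<out ToyAttribute>): Boolean = attClass in attributes
}

fun main() {
    val source = ToyAttributeSource()
    val first = source.addAttribute(ToyTermAttribute::class) { ToyTermAttribute() }
    val second = source.addAttribute(ToyTermAttribute::class) { ToyTermAttribute() }
    println(first === second) // true: the second call returns the existing instance
}
```

Because repeated calls return the same instance, a producer and a consumer that both request the same attribute class end up sharing one object, so values written by one are visible to the other.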


Expert: Adds a custom AttributeImpl instance with one or more Attribute interfaces.


Captures the state of all Attributes. The return value can be passed to restoreState to restore the state of this or another AttributeSource.


Resets all Attributes in this AttributeSource by calling AttributeImpl.clear on each Attribute implementation.


Clones all AttributeImpl instances and returns them in a new AttributeSource instance. This method can be used, for example, to create another TokenStream with exactly the same attributes (by passing the clone to the AttributeSource constructor). You can also use it as a slower replacement for captureState if you need to inspect or modify the captured state.

open override fun close()

See the documentation of the overridden close method.

fun copyTo(target: AttributeSource)

Copies the contents of this AttributeSource to the given target AttributeSource. The target must provide all Attributes that this instance contains. The actual attribute implementations must be identical in both AttributeSource instances; ideally, both instances should use the same AttributeFactory. You can use this method as a replacement for restoreState if you use cloneAttributes instead of captureState.

open override fun end()

This method is called by the consumer after the last token has been consumed, that is, after incrementToken has returned false (using the new TokenStream API). Streams implementing the old API should upgrade to use this feature.


Resets all Attributes in this AttributeSource by calling AttributeImpl.end on each Attribute implementation.

open operator override fun equals(obj: Any?): Boolean
fun <T : Attribute> getAttribute(attClass: KClass<T>): T?

Returns the instance of the passed-in Attribute contained in this AttributeSource.


Returns the current maximum token length.

fun hasAttribute(attClass: KClass<out Attribute>): Boolean

The caller must pass in the KClass of the attribute interface. Returns true if and only if this AttributeSource contains the passed-in Attribute.


Returns true if and only if this AttributeSource has any attributes.

open override fun hashCode(): Int
open override fun incrementToken(): Boolean

Consumers (e.g., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token.

fun reflectAsString(prependAttClass: Boolean): String

Returns the current attribute values as a formatted string by calling the reflectWith method. The prependAttClass flag controls whether each key is prefixed with its attribute class.


This method is for introspection of attributes. It should simply add the keys and values that this AttributeSource holds to the given AttributeReflector.


Removes all attributes and their implementations from this AttributeSource.

open override fun reset()

This method is called by a consumer before it begins consumption using incrementToken.
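The call order that a consumer follows (reset, then incrementToken until it returns false, then end, and finally close) can be sketched with a toy stream. `ToyWhitespaceStream` and `consume` are hypothetical names used only for this illustration; the class splits on single spaces and is not the StandardTokenizer grammar.

```kotlin
// Hypothetical stand-in for a token stream; only the call order mirrors the text above.
class ToyWhitespaceStream(private val text: String) {
    // Plays the role of a term attribute that incrementToken updates in place.
    var term: String = ""
        private set

    private var pending: Iterator<String> = emptyList<String>().iterator()

    // Prepares the stream for consumption.
    fun reset() {
        pending = text.split(" ").filter { it.isNotEmpty() }.iterator()
    }

    // Advances to the next token; returns false once the input is exhausted.
    fun incrementToken(): Boolean {
        if (!pending.hasNext()) return false
        term = pending.next()
        return true
    }

    fun end() { /* end-of-stream bookkeeping would go here */ }

    fun close() { /* resources would be released here */ }
}

// Consumer workflow: reset, loop over incrementToken, end, then close.
fun consume(stream: ToyWhitespaceStream): List<String> {
    val terms = mutableListOf<String>()
    stream.reset()
    try {
        while (stream.incrementToken()) terms += stream.term
        stream.end()
    } finally {
        stream.close()
    }
    return terms
}

fun main() {
    println(consume(ToyWhitespaceStream("the quick  fox"))) // [the, quick, fox]
}
```

Placing close in a finally block is a design choice of this sketch, so that resources are released even if tokenization throws.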


Restores this state by copying the values of all attribute implementations that this state contains into the attribute implementations of the targetStream. The targetStream must contain a corresponding instance for each attribute contained in this state (e.g., it is not possible to restore the state of an AttributeSource containing a TermAttribute into an AttributeSource using a Token instance as implementation).
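As a rough illustration of the capture-and-restore contract, the following toy model stores attribute values in a map. `ToyStateSource` is a hypothetical class invented for this sketch; the real state object and attribute implementations are more involved.

```kotlin
// Toy model: attribute values keyed by name. Not the library's AttributeSource.
class ToyStateSource(val values: MutableMap<String, String>) {
    // Returns a snapshot that later changes to this source do not affect.
    fun captureState(): Map<String, String> = values.toMap()

    // Copies the snapshot back; the target must already hold every captured attribute.
    fun restoreState(state: Map<String, String>) {
        for ((name, value) in state) {
            require(name in values) { "target has no attribute named $name" }
            values[name] = value
        }
    }
}

fun main() {
    val source = ToyStateSource(mutableMapOf("term" to "foo"))
    val saved = source.captureState()
    source.values["term"] = "bar"   // the stream moves on to another token
    source.restoreState(saved)      // rewind to the captured token
    println(source.values["term"])  // foo
}
```

The `require` check mirrors the rule that the target must contain a corresponding instance for everything held in the captured state.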

fun setMaxTokenLength(length: Int)

Sets the maximum allowed token length. Tokens longer than this are chopped at this length and emitted as multiple tokens. To skip such long tokens instead, increase this maximum and then use LengthFilter to remove them. The default is StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH.
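The chopping rule can be illustrated with a few lines of stand-alone Kotlin. `chopToken` is a hypothetical helper written for this example, not part of the tokenizer, and it models only the splitting of one over-long token.

```kotlin
// Toy model of the chopping rule described above; not the library's scanner.
fun chopToken(token: String, maxTokenLength: Int): List<String> {
    require(maxTokenLength > 0) { "maxTokenLength must be positive" }
    // Split into consecutive pieces of at most maxTokenLength characters.
    return token.chunked(maxTokenLength)
}

fun main() {
    println(chopToken("abcdefgh", 3)) // [abc, def, gh]
}
```

A token that already fits within the limit comes back unchanged as a single piece.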

fun setReader(input: Reader)

Expert: Set a new reader on the Tokenizer. Typically, an analyzer (in its tokenStream method) will use this to re-use a previously created tokenizer.

open override fun toString(): String

Returns a string consisting of the class's simple name, the hex representation of the identity hash code, and the current reflection of all attributes.