common/org.gnit.lucenekmp.analysis.email/UAX29URLEmailTokenizerImpl

UAX29URLEmailTokenizerImpl

class UAX29URLEmailTokenizerImpl(in: Reader)

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<URL>: A URL
<EMAIL>: An email address
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character
<KATAKANA>: A sequence of katakana characters
<HANGUL>: A sequence of Hangul characters
<EMOJI>: A sequence of Emoji characters
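
To make some of these categories concrete, the following sketch sorts a few strings into four of the types above using deliberately crude regular expressions. It is not the UAX #29 grammar that this class implements. The `TokenKind` enum and the `classify` function are hypothetical names introduced only for illustration.

```kotlin
// Hypothetical illustration only: crude patterns, not the real UAX #29 rules.
enum class TokenKind { ALPHANUM, NUM, URL, EMAIL, OTHER }

// Each pattern is applied with matches(), so it must cover the whole string.
val urlPattern = Regex("""[a-z][a-z0-9+.-]*://\S+""", RegexOption.IGNORE_CASE)
val emailPattern = Regex("""[^@\s]+@[^@\s]+\.[^@\s]+""")
val numPattern = Regex("""\d+([.,]\d+)*""")
val alphanumPattern = Regex("""[\p{L}\p{N}]+""")

// Checks the most specific shapes first, then falls back to broader ones.
fun classify(token: String): TokenKind = when {
    urlPattern.matches(token) -> TokenKind.URL
    emailPattern.matches(token) -> TokenKind.EMAIL
    numPattern.matches(token) -> TokenKind.NUM
    alphanumPattern.matches(token) -> TokenKind.ALPHANUM
    else -> TokenKind.OTHER
}

fun main() {
    for (t in listOf("hello123", "42", "https://example.com/a", "user@example.com")) {
        println("$t -> ${classify(t)}")
    }
}
```

The real scanner decides the type while it segments the input, so it never needs a separate classification pass like this one.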

Constructors

UAX29URLEmailTokenizerImpl

constructor(in: Reader)

Types

object Companion

Functions

fun getNextToken(): Int

Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
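
A scanner of this kind is usually driven by a loop that calls getNextToken() until it signals end of input. The sketch below shows that pattern against a hypothetical stand-in interface (`TokenSource`) and a list-backed fake (`FakeSource`), so it runs without the library. The end-of-input value `-1` follows the usual JFlex `YYEOF` convention and is an assumption here; check the Companion object for the actual constant.

```kotlin
// Hypothetical stand-in for the two scanner methods this loop needs.
interface TokenSource {
    fun getNextToken(): Int   // token type code, or EOF when input is exhausted
    fun yytext(): String      // text of the most recent match
}

// Assumed end-of-input marker (JFlex scanners conventionally use -1).
const val EOF = -1

// List-backed fake so the example runs on its own; not part of the library.
class FakeSource(private val tokens: List<Pair<Int, String>>) : TokenSource {
    private var index = -1
    override fun getNextToken(): Int {
        index++
        return if (index < tokens.size) tokens[index].first else EOF
    }
    override fun yytext(): String = tokens[index].second
}

// Drains the source, collecting (type, text) pairs until EOF is returned.
fun drain(source: TokenSource): List<Pair<Int, String>> {
    val out = mutableListOf<Pair<Int, String>>()
    while (true) {
        val type = source.getNextToken()
        if (type == EOF) break
        out += type to source.yytext()
    }
    return out
}

fun main() {
    // The type codes 1 and 2 are arbitrary placeholders.
    val source = FakeSource(listOf(1 to "mail", 2 to "user@example.com"))
    drain(source).forEach { (type, text) -> println("type=$type text=$text") }
}
```

With the real class, the same loop shape would apply, with getText(t) available as an alternative to yytext() for copying the match into a CharTermAttribute.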

fun getText(t: CharTermAttribute)

Fills CharTermAttribute with the current token text.

fun setBufferSize(numChars: Int)

Sets the scanner buffer size, in chars.

fun yyatEOF(): Boolean

Returns whether the scanner has reached the end of the reader it reads from.

fun yybegin(newState: Int)

Enters a new lexical state.

fun yychar(): Int

Returns the number of characters processed so far.

fun yycharat(position: Int): Char

Returns the character at the given position from the matched text.

fun yyclose()

Closes the input reader.

fun yylength(): Int

Returns the number of characters matched.

fun yypushback(number: Int)

Pushes the specified number of characters back into the input stream.
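
The relationship between yytext(), yylength(), yycharat() and yypushback() is easiest to see on a toy model. The sketch below assumes the common JFlex layout in which the current match is a window `[start, end)` over a character buffer and a pushback simply moves `end` to the left. `MatchWindow` and its fields are hypothetical and are not taken from this class's internals.

```kotlin
// Hypothetical toy model of a scanner's current match; not the library's code.
class MatchWindow(
    private val buffer: CharArray,
    private val start: Int,   // index of the first matched character
    private var end: Int      // index one past the last matched character
) {
    fun yytext(): String = buffer.concatToString(start, end)
    fun yylength(): Int = end - start
    fun yycharat(position: Int): Char = buffer[start + position]

    // Gives back the last `number` matched characters so they are scanned again.
    fun yypushback(number: Int) {
        require(number in 0..yylength()) { "cannot push back more than was matched" }
        end -= number
    }
}

fun main() {
    val window = MatchWindow("example.com.".toCharArray(), start = 0, end = 12)
    window.yypushback(1)           // drop the trailing '.' from the match
    println(window.yytext())       // example.com
    println(window.yylength())     // 11
}
```

In this model a pushback shortens the current match and leaves the returned characters in the buffer, which is why they become the start of the next match.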

fun yyreset(reader: Reader)

Resets the scanner to read from a new input stream.

fun yystate(): Int

Returns the current lexical state.

fun yytext(): String

Returns the text matched by the current regular expression.