UAX29URLEmailTokenizerImpl

This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.

Tokens produced are of the following types:

  • <ALPHANUM>: A sequence of alphabetic and numeric characters

  • <NUM>: A number

  • <URL>: A URL

  • <EMAIL>: An email address

  • <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer

  • <IDEOGRAPHIC>: A single CJKV ideographic character

  • <HIRAGANA>: A single hiragana character

  • <KATAKANA>: A sequence of katakana characters

  • <HANGUL>: A sequence of Hangul characters

  • <EMOJI>: A sequence of Emoji characters
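
To give a rough feel for how type labels relate to token text, the sketch below classifies an already-isolated token with simple regular expressions. It is a toy under stated assumptions: `TokenType` and `classify` are hypothetical names, and the patterns are far cruder than the UAX #29 word-break rules and RFC grammars that the real scanner applies.

```kotlin
// Hypothetical labels covering only a few of the token types listed above.
enum class TokenType { ALPHANUM, NUM, URL, EMAIL, OTHER }

// Deliberately simplified patterns; these are assumptions, NOT the scanner's real rules.
private val EMAIL_RE = Regex("""[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+""")
private val URL_RE = Regex("""(https?|ftp)://\S+""")
private val NUM_RE = Regex("""\d+([.,]\d+)*""")
private val ALPHANUM_RE = Regex("""[\p{L}\p{N}]+""")

// Checks the most specific patterns first so that "3.14" is a number, not alphanumeric.
fun classify(token: String): TokenType = when {
    EMAIL_RE.matches(token) -> TokenType.EMAIL
    URL_RE.matches(token) -> TokenType.URL
    NUM_RE.matches(token) -> TokenType.NUM
    ALPHANUM_RE.matches(token) -> TokenType.ALPHANUM
    else -> TokenType.OTHER
}

fun main() {
    for (t in listOf("lucene9", "3.14", "https://example.org/a?b=c", "user@example.org")) {
        println("$t -> ${classify(t)}")
    }
}
```

The ordering of the `when` branches matters in this sketch: a token such as `3.14` is tested against the number pattern before the more general alphanumeric one.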

Constructors

constructor(in: Reader)

Types

object Companion

Functions

Resumes scanning until the next regular expression is matched, the end of input is reached, or an I/O error occurs.
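
The usual way to drive a scanner of this kind is a loop that keeps requesting the next match until end of input is reported. The sketch below shows that loop against a stand-in class, since the real class's construction and token codes are not shown here. `ToyScanner`, `nextToken`, `TOKEN`, and `tokens` are hypothetical names; `YYEOF = -1` is assumed to follow the JFlex convention; and the toy splits on whitespace only.

```kotlin
import java.io.Reader
import java.io.StringReader

// Hypothetical stand-in for a JFlex-style scanner; NOT the real class.
// It treats each run of non-whitespace characters as one token.
class ToyScanner(private val input: Reader) {
    private val matched = StringBuilder()

    // Returns TOKEN when a token was matched, or YYEOF once the input is exhausted.
    fun nextToken(): Int {
        matched.setLength(0)
        var c = input.read()
        while (c != -1 && c.toChar().isWhitespace()) c = input.read() // skip separators
        while (c != -1 && !c.toChar().isWhitespace()) {
            matched.append(c.toChar())
            c = input.read()
        }
        return if (matched.isEmpty()) YYEOF else TOKEN
    }

    // Text of the most recent match.
    fun yytext(): String = matched.toString()

    companion object {
        const val YYEOF = -1 // assumed end-of-input code (JFlex convention)
        const val TOKEN = 0  // hypothetical token type code
    }
}

// Drains the scanner: request matches until end of input is reported.
fun tokens(text: String): List<String> {
    val scanner = ToyScanner(StringReader(text))
    val out = mutableListOf<String>()
    while (scanner.nextToken() != ToyScanner.YYEOF) {
        out += scanner.yytext()
    }
    return out
}

fun main() {
    println(tokens("mail me at user@example.org"))
}
```

Reading the matched text immediately after each call matters in this pattern, because the next call overwrites the match.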

Fills CharTermAttribute with the current token text.

fun setBufferSize(numChars: Int)

Sets the scanner buffer size, in chars.

Returns whether the scanner has reached the end of the reader it reads from.

fun yybegin(newState: Int)

Enters a new lexical state.

fun yychar(): Int

Returns the number of characters processed so far.

fun yycharat(position: Int): Char

Returns the character at the given position from the matched text.

fun yyclose()

Closes the input reader.

fun yylength(): Int

Returns the number of characters in the matched text.

fun yypushback(number: Int)

Pushes the specified number of characters back into the input stream, so that they are read again by the next scan.
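
The relationship between the matched text, its length, and pushback can be modelled with two indices into a buffer. The sketch below is a minimal illustration, not the real implementation: `MatchWindow` and its `match` helper are hypothetical, and a fixed-length "match" stands in for a regular-expression match. Only the `yy*` method names are borrowed from this page.

```kotlin
// Hypothetical model of a scanner's matched region over a fixed buffer; NOT the real class.
class MatchWindow(private val buffer: String) {
    private var start = 0 // index of the first matched character
    private var end = 0   // index just past the last matched character

    // Matches up to `n` characters starting where the previous match ended
    // (a stand-in for matching a regular expression).
    fun match(n: Int) {
        start = end
        end = minOf(buffer.length, start + n)
    }

    fun yytext(): String = buffer.substring(start, end)
    fun yylength(): Int = end - start
    fun yycharat(position: Int): Char = buffer[start + position]

    // Gives the last `number` matched characters back to the input,
    // so the next match starts with them.
    fun yypushback(number: Int) {
        require(number in 0..yylength()) { "cannot push back more than was matched" }
        end -= number
    }
}

fun main() {
    val w = MatchWindow("example.com.")
    w.match(12)          // matches the whole buffer, trailing dot included
    w.yypushback(1)      // return the trailing dot to the input
    println(w.yytext())  // the match no longer includes the dot
    w.match(5)           // the pushed-back dot is matched again here
    println(w.yytext())
}
```

In this model, pushing characters back shortens both `yytext()` and `yylength()`, and the returned characters become the start of the next match.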

fun yyreset(reader: Reader)

Resets the scanner to read from a new input stream.

fun yystate(): Int

Returns the current lexical state.

fun yytext(): String

Returns the text matched by the current regular expression.