UAX29URLEmailTokenizerImpl
This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
Tokens produced are of the following types:
<ALPHANUM>: A sequence of alphabetic and numeric characters
<NUM>: A number
<URL>: A URL
<EMAIL>: An email address
<SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
<IDEOGRAPHIC>: A single CJKV ideographic character
<HIRAGANA>: A single hiragana character
<KATAKANA>: A sequence of katakana characters
<HANGUL>: A sequence of Hangul characters
<EMOJI>: A sequence of Emoji characters
Functions
Resumes scanning until the next regular expression is matched, the end of input is encountered, or an I/O error occurs.
Fills CharTermAttribute with the current token text.
Sets the scanner buffer size in chars.
Pushes the specified number of characters back into the input stream, so that they are read again by the next scan.
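The scan-and-pushback behaviour described above can be illustrated with a minimal sketch. This is not the generated Lucene scanner: the class `PushbackSketch` and its methods `next`, `text`, and `pushback` are hypothetical names chosen for illustration, and plain whitespace splitting stands in for the real UAX #29 rules. It only shows the idea that pushing characters back shortens the current match, so the following scan starts from those characters again.

```java
/** Toy scanner illustrating pushback; hypothetical, not the Lucene implementation. */
final class PushbackSketch {
    private final String input;
    private int startPos = 0;   // start of the current match
    private int markedPos = 0;  // end of the current match; the next scan resumes here

    PushbackSketch(String input) {
        this.input = input;
    }

    /** Matches the next run of non-space characters; returns false at end of input. */
    boolean next() {
        int i = markedPos;
        while (i < input.length() && input.charAt(i) == ' ') {
            i++;
        }
        if (i == input.length()) {
            return false;
        }
        startPos = i;
        while (i < input.length() && input.charAt(i) != ' ') {
            i++;
        }
        markedPos = i;
        return true;
    }

    /** Text of the current match. */
    String text() {
        return input.substring(startPos, markedPos);
    }

    /** Gives back the last {@code number} characters of the current match. */
    void pushback(int number) {
        if (number < 0 || number > markedPos - startPos) {
            throw new IllegalArgumentException("cannot push back " + number + " chars");
        }
        markedPos -= number;
    }

    public static void main(String[] args) {
        PushbackSketch s = new PushbackSketch("example.com. next");
        s.next();
        System.out.println(s.text()); // example.com.
        s.pushback(1);                // return the trailing '.' to the input
        System.out.println(s.text()); // example.com
        s.next();
        System.out.println(s.text()); // .
    }
}
```

In this sketch the pushed-back character is not lost: it becomes the start of the next match, which is how a scanner can trim trailing punctuation from a candidate token and still tokenize that punctuation afterwards.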