UAX29URLEmailTokenizer
This class implements Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29 URLs and email addresses are also tokenized according to the relevant RFCs.
Constructors
Link copied to clipboard
constructor()
Creates a new instance of the UAX29URLEmailTokenizer. Attaches the input to the newly created JFlex scanner.
Creates a new UAX29URLEmailTokenizer with a given AttributeFactory
Functions
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Link copied to clipboard
Set the max allowed token length. Tokens larger than this will be chopped up at this token length and emitted as multiple tokens. If you need to skip such large tokens, you could increase this max length, and then use LengthFilter to remove long tokens. The default is UAX29URLEmailAnalyzer.DEFAULT_MAX_TOKEN_LENGTH.