DictionaryBasedBreakIterator
A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.
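The dictionary pass described above can be pictured with a small sketch. This is not the library's implementation: the class name DictionarySplitSketch, the in-memory word set, and the longest-match-first search with backtracking are all assumptions made for illustration. It shows only the general idea of dividing an unbroken run of letters into known words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the actual DictionaryBasedBreakIterator code.
final class DictionarySplitSketch {

    // Returns the break offsets (excluding 0, including run.length()) that divide
    // `run` into dictionary words, or an empty list if no complete division exists.
    static List<Integer> split(String run, Set<String> dictionary) {
        List<Integer> breaks = new ArrayList<>();
        boolean[] deadEnd = new boolean[run.length() + 1];
        if (search(run, 0, dictionary, breaks, deadEnd)) {
            return breaks;
        }
        return List.of();
    }

    private static boolean search(String run, int start, Set<String> dictionary,
                                  List<Integer> breaks, boolean[] deadEnd) {
        if (start == run.length()) {
            return true;                       // consumed the whole run
        }
        if (deadEnd[start]) {
            return false;                      // already known to fail from here
        }
        // Try the longest candidate word first, then progressively shorter ones.
        for (int end = run.length(); end > start; end--) {
            if (dictionary.contains(run.substring(start, end))) {
                breaks.add(end);
                if (search(run, end, dictionary, breaks, deadEnd)) {
                    return true;
                }
                breaks.remove(breaks.size() - 1);   // backtrack
            }
        }
        deadEnd[start] = true;
        return false;
    }

    public static void main(String[] args) {
        Set<String> words = Set.of("the", "them", "theme", "me", "men", "end");
        // "theme" matches first but leaves "nd", so the search backs up to "them" + "end".
        System.out.println(split("themend", words));   // [4, 7]
    }
}
```

The backtracking step matters because the longest match at one position can leave a remainder that is not in the dictionary.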
DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name, <dictionary>, which identifies the characters that occur in dictionary words.
DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It follows a prescribed search path to locate the dictionary: at present it looks in /com/ibm/text/resources under each directory on the classpath and does not search inside JAR files, although this location is likely to change. The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but it is not currently public. Contact us for help.
Properties
Functions
Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".
Advances the iterator to the next boundary position.
Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
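The navigation calls above can be tried with the JDK's java.text.BreakIterator, which exposes isBoundary(int), next(), and next(int). The sketch below assumes a plain English word iterator rather than a DictionaryBasedBreakIterator instance, and the class name BoundaryNavigationSketch is made up for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration using the JDK's BreakIterator; not DictionaryBasedBreakIterator itself.
final class BoundaryNavigationSketch {

    // Collects every boundary by calling first() and then next() until DONE.
    static List<Integer> boundaries(BreakIterator it) {
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("Hello world");

        System.out.println(boundaries(words));     // [0, 5, 6, 11]
        System.out.println(words.isBoundary(5));   // true: end of "Hello"
        System.out.println(words.isBoundary(3));   // false: inside "Hello"

        words.first();
        System.out.println(words.next(2));         // 6: two steps forward from 0
        System.out.println(words.next(-1));        // 5: one step back, like previous()
    }
}
```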
Sets a new text string to be scanned. The current scan position is reset to first().
Sets the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.
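As a rough illustration of the reset behavior, the sketch below reuses one iterator for several texts. It assumes the JDK's java.text.BreakIterator, whose setText accepts either a String or a CharacterIterator. The class and helper names are hypothetical.

```java
import java.text.BreakIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

// Illustration using the JDK's BreakIterator; names here are made up for the example.
final class SetTextSketch {

    // Swaps in new text and reports where the iterator is positioned afterwards.
    static int positionAfterReset(BreakIterator it, String newText) {
        it.setText(newText);
        return it.current();
    }

    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("first text");
        words.last();                                   // move to the end of the text

        // Setting new text puts the position back at the beginning.
        System.out.println(positionAfterReset(words, "second text"));   // 0

        // The CharacterIterator overload behaves the same way.
        words.setText(new StringCharacterIterator("third text"));
        System.out.println(words.current());                            // 0
    }
}
```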
Validates the magic number, version, and the length of the given data.
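A check of this kind might look like the sketch below. The header layout (a 4-byte magic number, a 4-byte version, and a 4-byte body length, all big-endian), the constant values, and the class name HeaderCheckSketch are assumptions for illustration only; the real dictionary file format is not described here.

```java
import java.nio.ByteBuffer;

// Hypothetical header layout for illustration; not the real dictionary file format.
final class HeaderCheckSketch {
    static final int MAGIC = 0xD1C7DA7A;       // made-up magic number
    static final int SUPPORTED_VERSION = 1;    // made-up version
    static final int HEADER_BYTES = 12;        // magic + version + body length

    // Returns true only if the magic number, version, and declared length all match.
    static boolean isValid(byte[] data) {
        if (data == null || data.length < HEADER_BYTES) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(data);    // big-endian by default
        int magic = buf.getInt();
        int version = buf.getInt();
        int bodyLength = buf.getInt();
        return magic == MAGIC
                && version == SUPPORTED_VERSION
                && bodyLength == data.length - HEADER_BYTES;
    }

    public static void main(String[] args) {
        byte[] data = ByteBuffer.allocate(HEADER_BYTES + 3)
                .putInt(MAGIC).putInt(SUPPORTED_VERSION).putInt(3)
                .put(new byte[] {1, 2, 3})
                .array();
        System.out.println(isValid(data));     // true

        data[0] ^= 0x01;                       // corrupt the magic number
        System.out.println(isValid(data));     // false
    }
}
```

Checking the declared length against the actual array size catches truncated data before any later read runs past the end of the buffer.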