DictionaryBasedBreakIterator

A subclass of RuleBasedBreakIterator that uses a dictionary to subdivide ranges of text further than the state-table-based algorithm alone can. This is necessary, for example, for word and line breaking in Thai, which does not put spaces between words. The state-table-based algorithm of RuleBasedBreakIterator first divides the text as far as it can; contiguous ranges of letters are then repeatedly compared against a list of known words (the dictionary) to divide them into words.
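
The dictionary step described above can be illustrated with a minimal sketch. This is not the class's actual implementation: the function name dictionaryBreaks, the Set<String> dictionary, and the longest-match-with-backtracking strategy are all assumptions made for illustration. It takes one unspaced run of letters and returns the break offsets if the run can be covered entirely by known words.

```kotlin
// Sketch only: splits an unspaced run of letters into dictionary words.
// Returns the break offsets (including 0 and run.length), or null if the
// run cannot be covered completely by dictionary words.
// dictionaryBreaks and its parameters are hypothetical names.
fun dictionaryBreaks(run: String, dictionary: Set<String>): List<Int>? {
    val maxWordLength = dictionary.maxOfOrNull { it.length } ?: return null
    // deadEnds[i] == true means no segmentation exists for run.substring(i).
    val deadEnds = BooleanArray(run.length + 1)

    fun search(start: Int): List<Int>? {
        if (start == run.length) return listOf(start)
        if (deadEnds[start]) return null
        // Try the longest candidate word first, then back off to shorter ones.
        val longest = minOf(maxWordLength, run.length - start)
        for (length in longest downTo 1) {
            val end = start + length
            if (run.substring(start, end) in dictionary) {
                val rest = search(end)
                if (rest != null) return listOf(start) + rest
            }
        }
        deadEnds[start] = true
        return null
    }

    return search(0)
}
```

With the words the, them, men, end, dine, in, and here, the run "themendinehere" yields the offsets 0, 3, 6, 10, 14: the greedy choice "them" leads to a dead end, so the search backs off to "the" and continues with "men", "dine", and "here".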

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name: <dictionary>. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in <dictionary>, it goes back through that range and derives additional break positions (if possible) using the dictionary.
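
As a rough sketch of the "two or more characters in a row" condition, the function below finds the ranges that would qualify for a second, dictionary-driven pass. The name dictionaryRuns and the predicate parameter are hypothetical; in the real class the character set comes from the rule data, not from a lambda.

```kotlin
// Sketch only: returns the [start, end) ranges in which two or more
// consecutive characters satisfy isDictionaryChar.
// dictionaryRuns and isDictionaryChar are hypothetical names.
fun dictionaryRuns(
    text: String,
    isDictionaryChar: (Char) -> Boolean
): List<Pair<Int, Int>> {
    val runs = mutableListOf<Pair<Int, Int>>()
    var runStart = -1
    // Iterate one position past the end so a run touching the end is closed.
    for (i in 0..text.length) {
        val inRun = i < text.length && isDictionaryChar(text[i])
        if (inRun && runStart < 0) {
            runStart = i
        } else if (!inRun && runStart >= 0) {
            // Single characters are skipped; only runs of length >= 2 count.
            if (i - runStart >= 2) runs += runStart to i
            runStart = -1
        }
    }
    return runs
}
```

Using letters as a stand-in for the dictionary character set, the text "ab 1 cde" produces the ranges (0, 2) and (5, 8), while "a b" produces none.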

DictionaryBasedBreakIterator is also constructed with the filename of a dictionary file. It follows a prescribed search path to locate the dictionary (right now, it looks for it in /com/ibm/text/resources in each directory in the classpath, and won't find it in JAR files, but this location is likely to change). The dictionary file is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.

Constructors

constructor(ruleData: ByteArray, dictionaryData: ByteArray)

Properties

A table for additional data. May be used by a subclass of RuleBasedBreakIterator.

val next: Int

open override val text: CharacterIterator

The character iterator through which this BreakIterator accesses the text

Functions

open override fun clone(): Any

Clones this iterator.

open override fun current(): Int

Returns the current iteration position.

open operator override fun equals(other: Any?): Boolean

Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.

open override fun first(): Int

Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

open override fun following(offset: Int): Int

Sets the current iteration position to the first boundary position after the specified position.

open override fun hashCode(): Int

Returns the hash code for this BreakIterator.

open override fun isBoundary(offset: Int): Boolean

Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".

open override fun last(): Int

Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

open override fun next(): Int

Advances the iterator to the next boundary position.

open override fun next(n: Int): Int

Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
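
A typical traversal can be sketched as follows. Because constructing a DictionaryBasedBreakIterator needs rule and dictionary data that this page does not supply, the sketch accepts any java.text.BreakIterator; it assumes this class honors the same first()/next()/DONE contract. The helper names allBoundaries and stepTwice are hypothetical.

```kotlin
import java.text.BreakIterator

// Collects every boundary by calling first() once, then next() until DONE.
fun allBoundaries(iterator: BreakIterator, text: String): List<Int> {
    iterator.setText(text)
    val boundaries = mutableListOf<Int>()
    var position = iterator.first()
    while (position != BreakIterator.DONE) {
        boundaries += position
        position = iterator.next()
    }
    return boundaries
}

// Compares next(2) with two single next() calls from the same start.
// Returns (position via next(2), position via two single steps).
fun stepTwice(iterator: BreakIterator, text: String): Pair<Int, Int> {
    iterator.setText(text)
    iterator.first()
    val viaCount = iterator.next(2)
    iterator.first()
    iterator.next()
    val viaSingleSteps = iterator.next()
    return viaCount to viaSingleSteps
}
```

For experimentation without dictionary data, BreakIterator.getWordInstance(java.util.Locale.US) from the JDK can be passed as the iterator; both components of the pair returned by stepTwice should then be equal.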

open override fun preceding(offset: Int): Int

Sets the current iteration position to the last boundary position before the specified position.
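
The random-access functions following(), preceding(), and isBoundary() are often combined to find the segment around an arbitrary offset. The sketch below does this against the java.text.BreakIterator contract, on the assumption that this class follows it; segmentAround is a hypothetical helper name.

```kotlin
import java.text.BreakIterator

// Returns the [start, end) range of the segment that contains offset.
// following(offset) finds the first boundary after offset; previous()
// then steps back to the boundary at or before offset.
fun segmentAround(iterator: BreakIterator, text: String, offset: Int): Pair<Int, Int> {
    require(offset in text.indices) { "offset must lie inside the text" }
    iterator.setText(text)
    val end = iterator.following(offset)
    val start = iterator.previous()
    return start to end
}
```

With the JDK word iterator and the text "Hello, world!", offset 8 falls inside "world", so the result is (7, 12). On the same text, preceding(8) returns 7, isBoundary(7) is true, and isBoundary(8) is false.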

open override fun previous(): Int

Moves the iterator backward to the previous boundary position.

fun setText(newText: String)

Sets a new text string to be scanned. The current scan position is reset to first().

open override fun setText(newText: CharacterIterator)

Sets the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.

open override fun toString(): String

Returns a text representation of this iterator.

Validates the magic number, version, and the length of the given data.
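
The kind of check described here can be sketched as below. The actual binary layout is not documented on this page, so the header layout, the constant values, and the name validateHeader are all hypothetical and chosen only to show the three checks (magic number, version, length) in order.

```kotlin
import java.nio.ByteBuffer

// Hypothetical layout, invented for this sketch only:
//   bytes 0..3  magic number (big-endian Int)
//   byte  4     format version
//   bytes 5..8  payload length in bytes (big-endian Int)
val HYPOTHETICAL_MAGIC = 0x42494454
val SUPPORTED_VERSION: Byte = 1
val HEADER_SIZE = 9

// Throws IllegalArgumentException if any of the three checks fails.
fun validateHeader(data: ByteArray) {
    require(data.size >= HEADER_SIZE) { "data is shorter than the header" }
    val buffer = ByteBuffer.wrap(data) // ByteBuffer is big-endian by default
    val magic = buffer.getInt()
    require(magic == HYPOTHETICAL_MAGIC) { "unexpected magic number" }
    val version = buffer.get()
    require(version == SUPPORTED_VERSION) { "unsupported version $version" }
    val declaredLength = buffer.getInt()
    require(declaredLength == data.size - HEADER_SIZE) {
        "declared length $declaredLength does not match the payload size"
    }
}
```

Failing fast on a bad header keeps a corrupt or mismatched data file from producing wrong break positions later.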