core/org.gnit.lucenekmp.jdkport/DictionaryBasedBreakIterator

DictionaryBasedBreakIterator

class DictionaryBasedBreakIterator(ruleData: ByteArray, dictionaryData: ByteArray) : RuleBasedBreakIterator

A subclass of RuleBasedBreakIterator that adds the ability to use a dictionary to further subdivide ranges of text beyond what is possible using just the state-table-based algorithm. This is necessary, for example, to handle word and line breaking in Thai, which doesn't use spaces between words. The state-table-based algorithm used by RuleBasedBreakIterator is used to divide up text as far as possible, and then contiguous ranges of letters are repeatedly compared against a list of known words (i.e., the dictionary) to divide them up into words.
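The subdivision step described above can be pictured with a toy example. This is a minimal sketch, not the class's actual algorithm: it uses greedy longest-match over a plain Set<String>, whereas the real iterator works from compiled rule and dictionary tables and may choose breaks differently. The function name splitWithDictionary and the sample word list are hypothetical.

```kotlin
// Illustrative only: greedy longest-match over a made-up dictionary.
// The real iterator uses compiled rule/dictionary data, not a Set<String>.
fun splitWithDictionary(run: String, dictionary: Set<String>): List<Int> {
    val maxWordLength = dictionary.maxOfOrNull { it.length } ?: 0
    val boundaries = mutableListOf(0)
    var pos = 0
    while (pos < run.length) {
        // Try the longest candidate first, shrinking until a known word matches.
        var end = minOf(run.length, pos + maxWordLength)
        while (end > pos && run.substring(pos, end) !in dictionary) {
            end--
        }
        // No dictionary word starts here: give up and close the run as one piece.
        if (end == pos) {
            end = run.length
        }
        boundaries.add(end)
        pos = end
    }
    return boundaries
}
```

With the dictionary setOf("the", "quick", "brown", "fox"), calling splitWithDictionary("thequickbrownfox", dictionary) returns [0, 3, 8, 13, 16], the offsets at which the unbroken run would be divided.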

DictionaryBasedBreakIterator uses the same rule language as RuleBasedBreakIterator, but adds one more special substitution name: <dictionary>. This substitution name is used to identify characters in words in the dictionary. The idea is that if the iterator passes over a chunk of text that includes two or more characters in a row that are included in <dictionary>, it goes back through that range and derives additional break positions (if possible) using the dictionary.
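The "two or more characters in a row" condition can be sketched as a scan for qualifying runs. This is a simplified illustration under assumptions: isDictionaryChar is a hypothetical predicate standing in for membership in the dictionary character category defined by the rules, and dictionaryRuns is not part of the documented API.

```kotlin
// Illustrative only: 'isDictionaryChar' is a hypothetical stand-in for the
// character category that the rule data marks as dictionary characters.
fun dictionaryRuns(text: String, isDictionaryChar: (Char) -> Boolean): List<IntRange> {
    val runs = mutableListOf<IntRange>()
    var i = 0
    while (i < text.length) {
        if (!isDictionaryChar(text[i])) {
            i++
            continue
        }
        val start = i
        while (i < text.length && isDictionaryChar(text[i])) {
            i++
        }
        // Only runs of two or more characters qualify for a dictionary pass.
        if (i - start >= 2) {
            runs.add(start until i)
        }
    }
    return runs
}
```

For example, with a predicate that accepts characters in the Thai block ('\u0E00'..'\u0E7F'), a three-character Thai run is reported, while an isolated Thai character surrounded by spaces is not.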

The inherited description states that DictionaryBasedBreakIterator is constructed with the filename of a dictionary file, which it locates along a prescribed search path (/com/ibm/text/resources in each directory on the classpath, excluding JAR files, although this location is likely to change). The constructor signature above, however, takes the dictionary contents directly as a ByteArray (dictionaryData). The dictionary is in a serialized binary format. We have a very primitive (and slow) BuildDictionaryFile utility for creating dictionary files, but aren't currently making it public. Contact us for help.

Constructors

DictionaryBasedBreakIterator

constructor(ruleData: ByteArray, dictionaryData: ByteArray)

Properties

additionalData

var additionalData: ByteArray?

A table for additional data. May be used by a subclass of RuleBasedBreakIterator.

current

val current: Int

next

val next: Int

text

open override val text: CharacterIterator

The character iterator through which this BreakIterator accesses the text.

Functions

clone

open override fun clone(): Any

Clones this iterator.

current

open override fun current(): Int

Returns the current iteration position.

equals

open operator override fun equals(other: Any?): Boolean

Returns true if both BreakIterators are of the same class, have the same rules, and iterate over the same text.

first

open override fun first(): Int

Sets the current iteration position to the beginning of the text (i.e., the CharacterIterator's starting offset).

following

open override fun following(offset: Int): Int

Sets the current iteration position to the first boundary position after the specified position.

hashCode

open override fun hashCode(): Int

Returns the hash code for this BreakIterator.

isBoundary

open override fun isBoundary(offset: Int): Boolean

Returns true if the specified position is a boundary position. As a side effect, leaves the iterator pointing to the first boundary position at or after "offset".
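The behavior of following() and the side effect of isBoundary() can be observed with a small probe. This sketch uses the JDK's java.text.BreakIterator (JVM only) as a stand-in, on the assumption that this port keeps the same contract; constructing the class documented here would require compiled rule and dictionary data that this page does not show. The helper name probeBoundaries is hypothetical.

```kotlin
import java.text.BreakIterator
import java.util.Locale

// Stand-in: java.text.BreakIterator, assumed to share the documented contract.
fun probeBoundaries(text: String, offset: Int): Triple<Int, Boolean, Int> {
    val words = BreakIterator.getWordInstance(Locale.ROOT)
    words.setText(text)
    val nextBoundary = words.following(offset)  // first boundary strictly after offset
    val onBoundary = words.isBoundary(offset)   // also repositions the iterator
    val positionAfterCheck = words.current()    // where isBoundary left the iterator
    return Triple(nextBoundary, onBoundary, positionAfterCheck)
}
```

For "Hello world", probeBoundaries(text, 3) yields (5, false, 5): offset 3 is inside a word, so isBoundary returns false and leaves the iterator at 5, the first boundary at or after the offset.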

last

open override fun last(): Int

Sets the current iteration position to the end of the text (i.e., the CharacterIterator's ending offset).

next

open override fun next(): Int

Advances the iterator to the next boundary position.
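A typical forward traversal starts at first() and calls next() until the iterator is exhausted. The sketch below uses java.text.BreakIterator (JVM only) as a stand-in and assumes the port signals the end of iteration the same way, with a DONE sentinel; that sentinel is not described on this page. The helper name collectBoundaries is hypothetical.

```kotlin
import java.text.BreakIterator
import java.util.Locale

// Stand-in: java.text.BreakIterator; the DONE sentinel is an assumption for the port.
fun collectBoundaries(text: String): List<Int> {
    val words = BreakIterator.getWordInstance(Locale.ROOT)
    words.setText(text)
    val boundaries = mutableListOf<Int>()
    var pos = words.first()
    while (pos != BreakIterator.DONE) {
        boundaries.add(pos)
        pos = words.next()
    }
    return boundaries
}
```

For "Hello world" this collects [0, 5, 6, 11]: the start of the text, both edges of the space, and the end of the text.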

open override fun next(n: Int): Int

Advances the iterator either forward or backward the specified number of steps. Negative values move backward, and positive values move forward. This is equivalent to repeatedly calling next() or previous().
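The stated equivalence between next(n) and repeated single steps can be checked directly. As a sketch under assumptions, this uses java.text.BreakIterator (JVM only) as a stand-in for the class documented here; stepsAgree is a hypothetical helper.

```kotlin
import java.text.BreakIterator
import java.util.Locale
import kotlin.math.abs

// Stand-in: java.text.BreakIterator, assumed to share the documented contract.
fun stepsAgree(text: String, n: Int): Boolean {
    val bulk = BreakIterator.getWordInstance(Locale.ROOT)
    val single = BreakIterator.getWordInstance(Locale.ROOT)
    bulk.setText(text)
    single.setText(text)
    // Start from the end when moving backward so there is room to move.
    if (n < 0) {
        bulk.last()
        single.last()
    } else {
        bulk.first()
        single.first()
    }
    val bulkResult = bulk.next(n)
    var singleResult = single.current()
    repeat(abs(n)) {
        singleResult = if (n > 0) single.next() else single.previous()
    }
    return bulkResult == singleResult
}
```

Both iterators should land on the same position for positive, negative, and zero step counts.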

preceding

open override fun preceding(offset: Int): Int

Sets the current iteration position to the last boundary position before the specified position.

previous

open override fun previous(): Int

Advances the iterator one step backwards.
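Backward traversal combines preceding() to find a starting boundary with previous() to keep stepping back. This sketch again relies on java.text.BreakIterator (JVM only) as a stand-in and assumes the port uses the same DONE sentinel once the beginning of the text has been passed. The helper name boundariesBefore is hypothetical.

```kotlin
import java.text.BreakIterator
import java.util.Locale

// Stand-in: java.text.BreakIterator; the DONE sentinel is an assumption for the port.
fun boundariesBefore(text: String, offset: Int): List<Int> {
    val words = BreakIterator.getWordInstance(Locale.ROOT)
    words.setText(text)
    val boundaries = mutableListOf<Int>()
    var pos = words.preceding(offset)  // last boundary strictly before offset
    while (pos != BreakIterator.DONE) {
        boundaries.add(pos)
        pos = words.previous()
    }
    return boundaries
}
```

For "Hello world", boundariesBefore(text, 11) returns [6, 5, 0], the boundaries before the end of the text in reverse order.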

setText

fun setText(newText: String)

Set a new text string to be scanned. The current scan position is reset to first().

open override fun setText(newText: CharacterIterator)

Set the iterator to analyze a new piece of text. This function resets the current iteration position to the beginning of the text.

toString

open override fun toString(): String

Returns a string representation of this iterator.

validateRuleData

fun validateRuleData(bb: ByteBuffer)

Validates the magic number, version, and the length of the given data.
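The three checks named above (magic number, version, length) follow a common pattern for binary headers. The sketch below illustrates that pattern only: the header layout, the constant values, and the names validateHeader and readIntBE are all hypothetical, and the real rule data format may differ. It operates on a ByteArray rather than a ByteBuffer to stay self-contained.

```kotlin
// Hypothetical layout for illustration: 4-byte magic, 1-byte version,
// 4-byte big-endian payload length, then the payload. Not the real format.
val hypotheticalMagic = 0x42494454   // made-up value for this sketch
val hypotheticalVersion = 1          // made-up value for this sketch
val headerSize = 9

// Reads a big-endian 32-bit integer starting at the given offset.
fun readIntBE(data: ByteArray, offset: Int): Int =
    ((data[offset].toInt() and 0xFF) shl 24) or
        ((data[offset + 1].toInt() and 0xFF) shl 16) or
        ((data[offset + 2].toInt() and 0xFF) shl 8) or
        (data[offset + 3].toInt() and 0xFF)

// Throws IllegalArgumentException if any of the three checks fails.
fun validateHeader(data: ByteArray) {
    require(data.size >= headerSize) { "header truncated: ${data.size} bytes" }
    require(readIntBE(data, 0) == hypotheticalMagic) { "wrong magic number" }
    require(data[4].toInt() == hypotheticalVersion) { "unsupported version ${data[4]}" }
    val declared = readIntBE(data, 5)
    val actual = data.size - headerSize
    require(declared == actual) { "length mismatch: declared $declared, actual $actual" }
}
```

Failing fast on a bad header keeps corrupted or mismatched data from being interpreted as state tables later on.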