TokenStreamToAutomaton

Consumes a TokenStream and creates an Automaton whose transition labels are UTF-8 bytes (or Unicode code points if unicodeArcs is true) taken from the TermToBytesRefAttribute. A POS_SEP label is inserted between tokens, and a HOLE label is inserted for each position hole.
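The label layout described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the class's implementation: `Token` and `labelPath` are hypothetical names, the numeric values given to `POS_SEP` and `HOLE` are assumed from the upstream Lucene class, and the model covers only a linear stream (every position increment is at least 1 and every token spans one position).

```kotlin
// Simplified model of the label sequence; NOT the library's implementation.
const val POS_SEP = 0x001f // assumed value, taken from upstream Lucene
const val HOLE = 0x001e    // assumed value, taken from upstream Lucene

// Hypothetical stand-in for one token of a TokenStream.
data class Token(val term: String, val positionIncrement: Int = 1)

// Returns the labels along the single path of a linear token stream.
fun labelPath(tokens: List<Token>, preservePositionIncrements: Boolean = true): List<Int> {
    val labels = mutableListOf<Int>()
    for ((index, token) in tokens.withIndex()) {
        require(token.positionIncrement >= 1) { "stacked tokens are out of scope for this sketch" }
        // Separator between this token (or its holes) and the previous token.
        if (index > 0) labels.add(POS_SEP)
        // Each skipped position becomes a HOLE followed by a separator.
        val holes = if (preservePositionIncrements) token.positionIncrement - 1 else 0
        repeat(holes) {
            labels.add(HOLE)
            labels.add(POS_SEP)
        }
        // The term itself contributes one label per UTF-8 byte.
        for (b in token.term.encodeToByteArray()) labels.add(b.toInt() and 0xFF)
    }
    return labels
}

fun main() {
    val tokens = listOf(Token("a"), Token("b", positionIncrement = 2))
    println(labelPath(tokens))                                     // [97, 31, 30, 31, 98]
    println(labelPath(tokens, preservePositionIncrements = false)) // [97, 31, 98]
}
```

The real class builds a full automaton and also handles token graphs through PositionLengthAttribute, which this model deliberately leaves out.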

Constructors

constructor()

Types

object Companion

Functions

fun setFinalOffsetGapAsHole(finalOffsetGapAsHole: Boolean)

If true, any final offset gaps will result in adding a position hole.

fun setPreservePositionIncrements(enablePositionIncrements: Boolean)

Whether to generate holes in the automaton for missing positions, true by default.

fun setUnicodeArcs(unicodeArcs: Boolean)

Whether to make transition labels Unicode code points instead of UTF-8 bytes, false by default.
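To make the difference between the two label alphabets concrete, the following sketch computes both label sequences for a single term. It is independent of this class: `utf8Labels` and `codePointLabels` are hypothetical helper names introduced only for illustration.

```kotlin
// One label per UTF-8 byte, as an unsigned value in 0..255.
fun utf8Labels(term: String): List<Int> =
    term.encodeToByteArray().map { it.toInt() and 0xFF }

// One label per Unicode code point, combining surrogate pairs.
fun codePointLabels(term: String): List<Int> {
    val out = mutableListOf<Int>()
    var i = 0
    while (i < term.length) {
        val c = term[i]
        if (c.isHighSurrogate() && i + 1 < term.length && term[i + 1].isLowSurrogate()) {
            // Decode a UTF-16 surrogate pair into a supplementary code point.
            out.add(0x10000 + ((c.code - 0xD800) shl 10) + (term[i + 1].code - 0xDC00))
            i += 2
        } else {
            out.add(c.code)
            i += 1
        }
    }
    return out
}

fun main() {
    println(utf8Labels("\u00e9"))      // [195, 169]
    println(codePointLabels("\u00e9")) // [233]
}
```

For a term outside ASCII, byte labels produce a longer path with labels limited to 0..255, while code point labels produce one transition per character with a much larger label range.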

Pulls the graph (including PositionLengthAttribute) from the provided TokenStream, and creates the corresponding automaton where arcs are bytes (or Unicode code points if unicodeArcs = true) from each term.