FSTCompiler

class FSTCompiler<T>

Builds a minimal FST (which maps an IntsRef term to an arbitrary output) from pre-sorted terms with outputs. The FST becomes an FSA if you use NoOutputs. The FST is written on the fly into a compact, serialized byte array, which can be saved to or loaded from a Directory, or used directly for traversal. The FST is always finite (no cycles).

NOTE: The algorithm is described at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698
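The construction idea can be sketched with a toy, self-contained builder for a minimal acyclic automaton (an FSA, so no outputs) over pre-sorted strings. This is only an illustration of incremental construction from sorted input, not the FSTCompiler implementation: ToyNode, ToyFsaBuilder, buildToyFsa, accepts and countNodes are hypothetical names, nodes are kept as plain objects rather than serialized bytes, and equivalent suffixes are detected with a simple string signature.

```kotlin
// Toy sketch only: hypothetical names, no outputs, no serialization.
class ToyNode(val id: Int) {
    // Inputs arrive sorted, so insertion order equals label order.
    val arcs: MutableMap<Char, ToyNode> = mutableMapOf()
    var isFinal = false

    // Two nodes are equivalent when finality, labels and (canonical) targets match.
    fun signature(): String =
        (if (isFinal) "F" else "N") + arcs.entries.joinToString("") { "|${it.key}>${it.value.id}" }
}

class ToyFsaBuilder {
    private var nextId = 0
    private val root = ToyNode(nextId++)
    private val registry = HashMap<String, ToyNode>() // signature -> canonical frozen node
    private var previous: String? = null

    fun add(word: String) {
        val prev = previous
        require(prev == null || word > prev) { "words must be added in strictly increasing order" }

        // Length of the prefix shared with the previously added word.
        var prefixLen = 0
        if (prev != null) {
            while (prefixLen < minOf(word.length, prev.length) && word[prefixLen] == prev[prefixLen]) prefixLen++
        }

        // Walk the shared prefix; these nodes are still mutable.
        var node = root
        for (i in 0 until prefixLen) node = node.arcs.getValue(word[i])

        // The rest of the previous word can never change again: freeze and share it.
        freezeLastChild(node)

        // Append the new suffix as fresh, mutable nodes.
        for (i in prefixLen until word.length) {
            val next = ToyNode(nextId++)
            node.arcs[word[i]] = next
            node = next
        }
        node.isFinal = true
        previous = word
    }

    fun finish(): ToyNode {
        freezeLastChild(root)
        return root
    }

    // Freeze the most recently added child chain bottom-up, replacing each node
    // with an equivalent node from the registry when one already exists.
    private fun freezeLastChild(node: ToyNode) {
        val label = node.arcs.keys.lastOrNull() ?: return
        val child = node.arcs.getValue(label)
        freezeLastChild(child)
        node.arcs[label] = registry.getOrPut(child.signature()) { child }
    }
}

fun buildToyFsa(sortedWords: List<String>): ToyNode {
    val builder = ToyFsaBuilder()
    for (w in sortedWords) builder.add(w)
    return builder.finish()
}

fun accepts(root: ToyNode, word: String): Boolean {
    var node = root
    for (c in word) node = node.arcs[c] ?: return false
    return node.isFinal
}

fun countNodes(root: ToyNode): Int {
    val seen = HashSet<Int>()
    fun visit(n: ToyNode) {
        if (seen.add(n.id)) for (t in n.arcs.values) visit(t)
    }
    visit(root)
    return seen.size
}
```

For instance, buildToyFsa(listOf("tap", "taps", "top", "tops")) should end up with five reachable nodes, because the 'a' and 'o' arcs lead to the same shared suffix chain.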

The parameterized type T is the output type. See the subclasses of Outputs.

FSTs larger than 2.1 GB are possible (as of Lucene 4.2). FSTs containing more than 2.1 billion nodes are also possible; however, they cannot be packed.

FSTCompiler supports three workflows:

  • Build the FST, use it immediately and entirely in RAM, then discard it.

  • Build the FST, use it immediately and entirely in RAM, and also save it to another DataOutput so that it can be loaded and used later.

  • Build the FST but stream it immediately to disk (except the FSTMetaData, which is saved at the end). To use it, construct the corresponding DataInput and use the FST constructor to read it.

Types

class Arc<T>

Expert: holds a pending (seen but not yet serialized) arc.

class Builder<T>(inputType: FST.INPUT_TYPE, outputs: Outputs<T>)

Fluent-style constructor for FSTCompiler.

object Companion

Reusable buffer for building nodes with fixed length arcs (binary search or direct addressing).

interface Node
class UnCompiledNode<T>(val owner: FSTCompiler<T>, depth: Int) : FSTCompiler.Node

Expert: holds a pending (seen but not yet serialized) Node.

Properties

val fst: FST<T>

Functions

fun add(input: IntsRef, output: T)

Adds the next input/output pair. The provided input must sort after the previous one according to IntsRef.compareTo. It is also OK to add the same input twice in a row with different outputs, as long as Outputs implements the Outputs.merge method. Note that input is fully consumed once this method returns (so the caller is free to reuse it), but output is not. So if your outputs are mutable (e.g. ByteSequenceOutputs), you cannot reuse them across calls.
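The calling contract described above (sorted inputs, merge on repeated inputs, input copied but output kept by reference) can be illustrated with a small stand-in class. This sketch does not use the real FSTCompiler, IntsRef or Outputs types: ToyPairCollector and its merge parameter are hypothetical, and inputs are modeled as plain IntArray values compared lexicographically.

```kotlin
// Hypothetical stand-in that mimics the add() contract; not the real FSTCompiler.
class ToyPairCollector<T>(private val merge: (T, T) -> T) {
    private var lastInput: IntArray? = null
    val entries = mutableListOf<Pair<List<Int>, T>>()

    fun add(input: IntArray, output: T) {
        val prev = lastInput
        val cmp = if (prev == null) 1 else compare(input, prev)
        require(cmp >= 0) { "inputs must be added in sorted order" }
        if (cmp == 0) {
            // Same input twice in a row: combine the two outputs.
            val (key, old) = entries.removeAt(entries.lastIndex)
            entries.add(key to merge(old, output))
        } else {
            // The input is copied (caller may reuse its buffer);
            // the output is stored by reference (caller must not mutate it).
            entries.add(input.toList() to output)
        }
        lastInput = input.copyOf()
    }

    // Lexicographic comparison of two int sequences.
    private fun compare(a: IntArray, b: IntArray): Int {
        for (i in 0 until minOf(a.size, b.size)) {
            if (a[i] != b[i]) return a[i].compareTo(b[i])
        }
        return a.size.compareTo(b.size)
    }
}
```

Reusing one scratch IntArray across calls is safe in this sketch because each input is copied, whereas reusing one mutable output object makes every stored entry observe later mutations, which is the hazard the note above warns about.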


Returns the metadata of the final FST. NOTE: this returns null if the FST accepts nothing.

fun finish(newStartNode: Long)

Gets the FSTReader that corresponds to the DataOutput. To call this method, you must use the default DataOutput or getOnHeapReaderWriter; otherwise an exception is thrown.

@JvmName(name = "getNodeCountKt")
fun getNodeCount(): Long