ShingleFilter

A ShingleFilter constructs shingles (token n-grams) from a token stream. In other words, it combines runs of adjacent tokens into single tokens.

For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
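The example above can be reproduced with a minimal sketch in plain Kotlin. It is an illustration of the idea only, not the filter's implementation: the function name `shingle` is hypothetical, it works on a list of strings rather than a TokenStream, and it emits only the shingles (the real filter can also emit the original unigrams).

```kotlin
// Simplified illustration only: operates on a plain list of strings,
// not on a TokenStream, and emits shingles without the unigrams.
// The name `shingle` and its parameters are hypothetical.
fun shingle(tokens: List<String>, size: Int = 2, separator: String = " "): List<String> {
    require(size >= 2) { "shingle size must be at least 2" }
    // windowed() yields every run of `size` adjacent tokens;
    // incomplete windows at the end are dropped by default.
    return tokens.windowed(size).map { it.joinToString(separator) }
}
```

Calling `shingle("please divide this sentence into shingles".split(" "))` yields the five two-word shingles listed above.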

This filter handles position increments greater than 1 by inserting filler tokens (tokens with the term text "_"). It does not handle a position increment of 0.
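The filler behavior can be pictured with a small model, assuming each token carries its text and a position increment. The `PositionedToken` class and the `fillGaps` function are hypothetical names introduced for this sketch; the filter itself reads these values from the token stream's attributes.

```kotlin
// Hypothetical model of a token: its text plus its position increment.
data class PositionedToken(val term: String, val positionIncrement: Int = 1)

// Expands gaps (position increment > 1) into filler strings so that
// every position holds exactly one term. Sketch only, not the filter's code.
fun fillGaps(tokens: List<PositionedToken>, filler: String = "_"): List<String> {
    val out = mutableListOf<String>()
    for (token in tokens) {
        // Mirrors the documented limitation: an increment of 0 is not handled.
        require(token.positionIncrement >= 1) { "position increment must be >= 1" }
        // One filler for each skipped position before this token.
        repeat(token.positionIncrement - 1) { out.add(filler) }
        out.add(token.term)
    }
    return out
}
```

For instance, if an upstream stop filter removed "this" so that "sentence" arrives with a position increment of 2, `fillGaps` turns "please", "divide", "sentence" into "please", "divide", "_", "sentence", and shingles are then built over that expanded sequence.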

Constructors

constructor(input: TokenStream, minShingleSize: Int, maxShingleSize: Int)

Constructs a ShingleFilter with the specified minimum and maximum shingle sizes from the TokenStream input.

constructor(input: TokenStream, maxShingleSize: Int)

Constructs a ShingleFilter with the specified maximum shingle size from the TokenStream input.

constructor(input: TokenStream)

Constructs a ShingleFilter with the default shingle size: 2.

constructor(input: TokenStream, tokenType: String)

Constructs a ShingleFilter with the specified token type for shingle tokens and the default shingle size: 2.

Types

object Companion

Properties


true if no shingles have been output yet (for outputUnigramsIfNoShingles).

Functions

fun <T : Attribute> addAttribute(attClass: KClass<T>): T
open override fun close()
fun copyTo(target: AttributeSource)
open override fun end()
open operator override fun equals(obj: Any?): Boolean
fun <T : Attribute> getAttribute(attClass: KClass<T>): T?
fun hasAttribute(attClass: KClass<out Attribute>): Boolean
open override fun hashCode(): Int
open override fun incrementToken(): Boolean
fun reflectAsString(prependAttClass: Boolean): String
open override fun reset()
fun setFillerToken(fillerToken: String?)

Sets the string to insert for each position at which there is no token (i.e., when position increment is greater than one).

fun setMaxShingleSize(maxShingleSize: Int)

Set the max shingle size (default: 2).

fun setMinShingleSize(minShingleSize: Int)

Set the min shingle size (default: 2).
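How the minimum and maximum sizes interact can be sketched as follows. This is a simplified model under stated assumptions, not the filter's code: the function name `shinglesInRange` is hypothetical, the argument checks are this sketch's own, and the emission order (all sizes for one start position before moving to the next) is assumed for illustration.

```kotlin
// Emits every shingle whose length lies in minSize..maxSize, grouped by
// start position. Sketch only; name, checks, and ordering are assumptions.
fun shinglesInRange(
    tokens: List<String>,
    minSize: Int = 2,
    maxSize: Int = 2,
    separator: String = " "
): List<String> {
    require(minSize >= 2) { "minSize must be at least 2" }
    require(maxSize >= minSize) { "maxSize must not be smaller than minSize" }
    val out = mutableListOf<String>()
    for (start in tokens.indices) {
        for (size in minSize..maxSize) {
            val end = start + size
            // Longer sizes cannot fit either once one size runs past the input.
            if (end > tokens.size) break
            out.add(tokens.subList(start, end).joinToString(separator))
        }
    }
    return out
}
```

With the tokens "a", "b", "c", "d", a minimum of 2, and a maximum of 3, this sketch produces "a b", "a b c", "b c", "b c d", "c d".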

fun setOutputUnigrams(outputUnigrams: Boolean)

Sets whether the output stream contains the input tokens (unigrams) as well as shingles (default: true).

fun setOutputUnigramsIfNoShingles(outputUnigramsIfNoShingles: Boolean)

Sets whether to override the behavior of outputUnigrams==false when no shingles are available (because there are fewer than minShingleSize tokens in the input stream), so that the unigrams are output in that case (default: false).
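The interplay of the two unigram flags can be modeled with a short sketch. It assumes a fixed shingle size of 2 and a plain list of strings; the function name `shingleOutput` is hypothetical, and the interleaving of unigrams with shingles is an assumed ordering chosen for illustration rather than a statement about the filter's internals.

```kotlin
// Models the two flags for bigram shingles over a list of strings.
// Sketch only: name, fixed size of 2, and output ordering are assumptions.
fun shingleOutput(
    tokens: List<String>,
    outputUnigrams: Boolean = true,
    outputUnigramsIfNoShingles: Boolean = false,
    separator: String = " "
): List<String> {
    val shingles = tokens.windowed(2).map { it.joinToString(separator) }
    return when {
        // Unigrams requested: emit each token, followed by the shingle starting at it.
        outputUnigrams -> tokens.flatMapIndexed { i, token ->
            if (i < shingles.size) listOf(token, shingles[i]) else listOf(token)
        }
        // Unigrams disabled, but the input was too short to form any shingle.
        shingles.isEmpty() && outputUnigramsIfNoShingles -> tokens
        // Unigrams disabled: shingles only (possibly none).
        else -> shingles
    }
}
```

In this model, a single-token input such as "hello" with outputUnigrams set to false produces no output at all unless outputUnigramsIfNoShingles is true, in which case "hello" is passed through.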

fun setTokenSeparator(tokenSeparator: String?)

Sets the string to use when joining adjacent tokens to form a shingle.

fun setTokenType(tokenType: String)

Sets the type of the shingle tokens produced by this filter (default: "shingle").

open override fun toString(): String
open override fun unwrap(): TokenStream