Package-level declarations

Types

A TokenFilter that reorders the Khmer characters within each token into a canonical order and applies regex-based normalizations for split vowels and coeng sequences.

Tokenizes a string into Khmer grapheme clusters (not phonetic syllables) using a simple state machine. For instance, "ខ្ញុំចង់ធ្វើការ" is tokenized as "ខ្ញុំ", "ច", "ង់", "ធ្វើ", "កា", "រ", not as "ខ្ញុំ", "ចង់", "ធ្វើ", "ការ".
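
The cluster splitting described above can be sketched in plain Java. This is a minimal illustration, not the library's tokenizer: the class name `KhmerClusterSketch`, its `split` method, and the exact character ranges are assumptions made for this example, and it does not use Lucene's `Tokenizer` API. It assumes that a consonant or independent vowel starts a cluster, that a coeng (U+17D2) followed by a consonant attaches a subscript consonant, and that dependent vowels and signs attach to the current cluster.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; the names and character ranges here are assumptions. */
final class KhmerClusterSketch {
    private static final char COENG = '\u17D2';

    // Consonants occupy U+1780..U+17A2.
    static boolean isConsonant(char c) {
        return c >= '\u1780' && c <= '\u17A2';
    }

    // Consonants and independent vowels (up to U+17B3) can start a cluster.
    static boolean isBase(char c) {
        return c >= '\u1780' && c <= '\u17B3';
    }

    // Dependent vowels and signs attach to the cluster in progress.
    static boolean isDependent(char c) {
        return (c >= '\u17B4' && c <= '\u17D1') || c == '\u17D3' || c == '\u17DD';
    }

    /** Splits text into clusters; any other character becomes its own token. */
    static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inCluster = false; // state: are we inside a Khmer cluster?
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            boolean coengPair = c == COENG
                    && i + 1 < text.length()
                    && isConsonant(text.charAt(i + 1));
            if (inCluster && coengPair) {
                // Subscript consonant: consume the coeng and the consonant together.
                current.append(c).append(text.charAt(i + 1));
                i += 2;
            } else if (inCluster && isDependent(c)) {
                current.append(c);
                i++;
            } else {
                // Anything else closes the current cluster and starts a new token.
                if (current.length() > 0) {
                    clusters.add(current.toString());
                    current.setLength(0);
                }
                current.append(c);
                inCluster = isBase(c);
                i++;
            }
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }
}
```

Under these assumptions, calling `KhmerClusterSketch.split` on the example string yields the six clusters listed above: the final consonants "ង" and "រ" start new clusters because nothing marks them as belonging to the preceding syllable.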

Analyzer for Khmer text.

Applies character normalization for Khmer text. Wraps MappingCharFilter and applies a set of confusable-character mappings that vary by normalization level.
