CustomAnalyzer

A general-purpose Analyzer that can be created with a builder-style API. Under the hood it uses the factory classes TokenizerFactory, TokenFilterFactory, and CharFilterFactory.

You can create an instance of this Analyzer with the builder by passing it the SPI names of the components (the names under which the factories are registered through Java's `ServiceLoader` mechanism):

Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
    .withTokenizer(StandardTokenizerFactory.NAME)
    .addTokenFilter(LowerCaseFilterFactory.NAME)
    .addTokenFilter(StopFilterFactory.NAME, "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
    .build();

The parameters passed to each component are given as alternating key/value strings. They are the same parameters used by Apache Solr and are documented on the corresponding factory classes. Refer to the documentation of the subclasses of TokenizerFactory, TokenFilterFactory, and CharFilterFactory.
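The key/value convention used for component parameters can be pictured with a small stand-alone sketch. It is an illustration only: ParamPairs and toMap are hypothetical names that do not belong to the library, and rejecting an odd number of arguments is an assumption about sensible behavior, not a statement of what the builder does internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the library: pairs up alternating
// key/value strings in the order they are written in the builder calls.
final class ParamPairs {

    // Turns ("k1", "v1", "k2", "v2", ...) into an insertion-ordered map.
    static Map<String, String> toMap(String... params) {
        if (params.length % 2 != 0) {
            // Assumption: an unpaired key is treated as a caller error.
            throw new IllegalArgumentException(
                "Expected key/value pairs, got an odd number of arguments: " + params.length);
        }
        Map<String, String> map = new LinkedHashMap<>();
        for (int i = 0; i < params.length; i += 2) {
            map.put(params[i], params[i + 1]);
        }
        return map;
    }

    public static void main(String[] args) {
        // Same arguments as the stop filter in the example above.
        Map<String, String> stopArgs =
            toMap("ignoreCase", "false", "words", "stopwords.txt", "format", "wordset");
        stopArgs.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Read this way, the stop filter in the example receives three settings: ignoreCase, words, and format. All values are strings, including "false"; interpreting them is left to the factory.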

The following is equivalent to the example above, written with the SPI names as string literals:

Analyzer ana = CustomAnalyzer.builder(Paths.get("/path/to/config/dir"))
    .withTokenizer("standard")
    .addTokenFilter("lowercase")
    .addTokenFilter("stop", "ignoreCase", "false", "words", "stopwords.txt", "format", "wordset")
    .build();

The names available for each component type can be looked up through [TokenizerFactory.availableTokenizers], [TokenFilterFactory.availableTokenFilters], and [CharFilterFactory.availableCharFilters].

You can create conditional branches in the analyzer by using [Builder.when] and [Builder.whenTerm]:

Analyzer ana = CustomAnalyzer.builder()
    .withTokenizer("standard")
    .addTokenFilter("lowercase")
    .whenTerm(t -> t.length() > 10)
      .addTokenFilter("reversestring")
    .endwhen()
    .build();
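
The effect of the conditional branch above can be approximated without the library. The sketch below is a simplified model that works on plain strings instead of a token stream; ConditionalBranchSketch, applyWhen, and analyze are hypothetical names introduced for illustration, and the lowercase and reverse steps are stand-ins for the "lowercase" and "reversestring" filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a conditional branch; it operates on plain
// strings rather than on a real token stream.
final class ConditionalBranchSketch {

    // Applies `filter` only to terms accepted by `condition`;
    // every other term passes through unchanged.
    static List<String> applyWhen(List<String> terms,
                                  Predicate<CharSequence> condition,
                                  UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>(terms.size());
        for (String term : terms) {
            out.add(condition.test(term) ? filter.apply(term) : term);
        }
        return out;
    }

    // Models the chain: lowercase every term, then reverse only the
    // terms longer than 10 characters.
    static List<String> analyze(List<String> terms) {
        List<String> lowered = new ArrayList<>(terms.size());
        for (String term : terms) {
            lowered.add(term.toLowerCase(Locale.ROOT));
        }
        return applyWhen(lowered,
            t -> t.length() > 10,
            t -> new StringBuilder(t).reverse().toString());
    }

    public static void main(String[] args) {
        System.out.println(analyze(List.of("Short", "Internationalization")));
    }
}
```

In this model the unconditional step runs on every term, while the step inside the branch runs only on terms that satisfy the predicate. A term of exactly 10 characters is left as it is, because the condition is a strict comparison.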