CASE_INSENSITIVE

Allows case-insensitive matching of most Unicode characters.

In general, the goal is parity with the Pattern.CASE_INSENSITIVE and Pattern.UNICODE_CASE flags of java.util.regex.Pattern when doing a case-insensitive match. We support common case folding in addition to simple case folding, as defined by the common (C), simple (S) and special (T) mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. This is in line with [ ] and means that the characters representing the Greek letter sigma (Σ, σ, ς) all match one another, even though σ and ς are both lowercase characters, as detailed here: https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt.
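As a rough illustration of the reference behavior this flag aims to match, the sketch below uses java.util.regex.Pattern itself (not this RegExp class) with both flags set. The class and method names are hypothetical, and the literals are written as Unicode escapes so the result does not depend on source-file encoding.

```java
import java.util.regex.Pattern;

public class SigmaFoldingSketch {
    // The two java.util.regex.Pattern flags named as the parity target.
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // True when the whole input matches the pattern text under both flags.
    // Assumes patternText contains no regex metacharacters.
    static boolean matchesIgnoringCase(String patternText, String input) {
        return Pattern.compile(patternText, FLAGS).matcher(input).matches();
    }

    public static void main(String[] args) {
        String capitalSigma = "\u03A3"; // GREEK CAPITAL LETTER SIGMA
        String smallSigma = "\u03C3";   // GREEK SMALL LETTER SIGMA
        String finalSigma = "\u03C2";   // GREEK SMALL LETTER FINAL SIGMA

        System.out.println(matchesIgnoringCase(smallSigma, capitalSigma));
        System.out.println(matchesIgnoringCase(smallSigma, finalSigma));
        System.out.println(matchesIgnoringCase(finalSigma, capitalSigma));
    }
}
```

On a standard JDK each of the three checks is expected to report a match, because all three forms fold to the same small sigma.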

For some Unicode characters, casing is difficult to decode correctly. In some cases Java's String class handles these correctly but Java's java.util.regex.Pattern class does not. We make only a best effort to maintain consistency with [ ], and there may be differences.

There are three known special classes of these characters:

1. The set of characters whose casing matches across multiple characters, such as the Greek sigma character mentioned above (Σ, σ, ς). We support these. Notably, some of these characters fall into the ASCII range and so will behave differently when this flag is enabled.

2. The set of characters that are in neither a stable upper nor lower case state and can be both uppercased and lowercased from their current code point, such as ǅ, which produces Ǆ when uppercased and ǆ when lowercased. We support these.

3. The set of characters that produce more than one character when uppercased. For performance reasons we ignore these characters for now, which is consistent with [ ].
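The three classes above can be probed with the JDK's own casing methods. This is a minimal sketch with hypothetical helper names. It approximates the classes using java.lang.Character and String and is not how this library classifies characters internally. The Kelvin sign is used as one example of a character that folds into the ASCII range.

```java
import java.util.Locale;

public class CasingClassesSketch {
    // Class 1: approximate a simple fold by uppercasing, then lowercasing.
    static int simpleFold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    // Class 2: the code point changes under both uppercasing and lowercasing.
    static boolean isBetweenCases(int codePoint) {
        return Character.toUpperCase(codePoint) != codePoint
            && Character.toLowerCase(codePoint) != codePoint;
    }

    // Class 3: full (String-level) uppercasing yields more than one code point.
    static boolean uppercasesToMultiple(int codePoint) {
        String upper = new String(Character.toChars(codePoint)).toUpperCase(Locale.ROOT);
        return upper.codePointCount(0, upper.length()) > 1;
    }

    public static void main(String[] args) {
        // Class 1: capital sigma (U+03A3) and final sigma (U+03C2) share one folded form.
        System.out.println(simpleFold(0x03A3) == simpleFold(0x03C2));
        // Class 1 reaching into ASCII: KELVIN SIGN (U+212A) folds to 'k'.
        System.out.println(simpleFold(0x212A) == 'k');
        // Class 2: U+01C5 sits between U+01C4 (upper) and U+01C6 (lower).
        System.out.println(isBetweenCases(0x01C5));
        // Class 3: U+00DF uppercases to the two-letter string "SS".
        System.out.println(uppercasesToMultiple(0x00DF));
    }
}
```

Note that the class 3 check relies on String.toUpperCase rather than Character.toUpperCase, since only the String-level method can return more than one character.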

Sometimes these classes of characters overlap. If a character is in both class 3 and any other class listed above, it is ignored; this is consistent with java.util.regex.Pattern and with the C, S and T mappings in https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt. Support for class 3 is only possible with the full (F) mappings, which are not supported. For instance, the character ῼ will match its lowercase form ῳ but not its uppercase form ΩΙ.
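The ῼ example can be reproduced against java.util.regex.Pattern, which the text names as the reference for this behavior. The sketch below is an illustration of that reference behavior only. It does not exercise this RegExp class, and its class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class OverlappingClassSketch {
    static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;

    // True when the whole input matches the pattern text under both flags.
    // Assumes patternText contains no regex metacharacters.
    static boolean matchesIgnoringCase(String patternText, String input) {
        return Pattern.compile(patternText, FLAGS).matcher(input).matches();
    }

    public static void main(String[] args) {
        String titleOmega = "\u1FFC";       // GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
        String lowerOmega = "\u1FF3";       // GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
        String upperOmega = "\u03A9\u0399"; // two-letter uppercase form: OMEGA + IOTA

        // One-to-one mapping: the lowercase form is reachable.
        System.out.println(matchesIgnoringCase(titleOmega, lowerOmega));
        // One-to-many mapping: the two-letter uppercase form is not.
        System.out.println(matchesIgnoringCase(titleOmega, upperOmega));
    }
}
```

On a standard JDK the first check is expected to succeed and the second to fail, since a single-character pattern cannot cover a two-character input without full (F) folding.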

Class 3 characters generate multiple characters when uppercased, such as ﬗ (0xFB17), which produces ՄԽ (code points: 0x0544 0x053D), and are therefore ignored. However, lowercase matching on these values is supported: 0x00DF, 0x0130, 0x0149, 0x01F0, 0x0390, 0x03B0, 0x0587, 0x1E96-0x1E9A, 0x1F50, 0x1F52, 0x1F54, 0x1F56, 0x1F80-0x1FAF, 0x1FB2-0x1FB4, 0x1FB6, 0x1FB7, 0x1FBC, 0x1FC2-0x1FC4, 0x1FC6, 0x1FC7, 0x1FCC, 0x1FD2, 0x1FD3, 0x1FD6, 0x1FD7, 0x1FE2-0x1FE4, 0x1FE6, 0x1FE7, 0x1FF2-0x1FF4, 0x1FF6, 0x1FF7, 0x1FFC, 0xFB00-0xFB06, 0xFB13-0xFB17
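To see why a per-code-point mapping cannot express the uppercase form of such a character, the following sketch prints the full uppercase expansion of 0xFB17 using Java's String class. The helper name is hypothetical, and the code only illustrates the JDK's casing data, not this library's internals.

```java
import java.util.Locale;

public class LigatureUppercaseSketch {
    // Full uppercasing of a single code point, returned as its code points.
    static int[] fullUppercase(int codePoint) {
        return new String(Character.toChars(codePoint))
            .toUpperCase(Locale.ROOT)
            .codePoints()
            .toArray();
    }

    public static void main(String[] args) {
        int ligature = 0xFB17; // ARMENIAN SMALL LIGATURE MEN XEH

        // String-level uppercasing expands the ligature into separate letters.
        for (int cp : fullUppercase(ligature)) {
            System.out.printf("0x%04X%n", cp);
        }

        // The single-code-point API has no one-to-one uppercase mapping to offer,
        // so it returns the input unchanged.
        System.out.println(Character.toUpperCase(ligature) == ligature);
    }
}
```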