UnicodeUtil

Encodes Java's UTF-16 char[] into UTF-8 byte[] without always allocating a new byte[], as String.getBytes(StandardCharsets.UTF_8) does.

Types

UTF8CodePoint

class UTF8CodePoint

Holds a code point along with the number of bytes required to represent it in UTF-8.

Properties

BIG_TERM

val BIG_TERM: BytesRef

A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.
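The ordering claim can be checked with a small sketch. It assumes terms are compared byte by byte as unsigned values, which this page does not state. Both `compareUnsignedBytes` and the ten-byte length are hypothetical and used for illustration only. The byte 0xFF never occurs in well-formed UTF-8, whose largest lead byte is 0xF4.

```kotlin
// Hypothetical helper: lexicographic comparison treating bytes as unsigned (0..255).
fun compareUnsignedBytes(a: ByteArray, b: ByteArray): Int {
    val n = minOf(a.size, b.size)
    for (i in 0 until n) {
        val x = a[i].toInt() and 0xFF
        val y = b[i].toInt() and 0xFF
        if (x != y) return x - y
    }
    return a.size - b.size
}

// Assumption: ten 0xFF bytes; the real length of BIG_TERM is not given on this page.
val bigTermSketch = ByteArray(10) { 0xFF.toByte() }

// U+10FFFF, the largest code point, encodes to the bytes F4 8F BF BF.
val largestUtf8 = "\uDBFF\uDFFF".encodeToByteArray()

// Positive when the first argument sorts after the second.
val bigTermSortsLast = compareUnsignedBytes(bigTermSketch, largestUtf8) > 0
```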

HALF_MASK

const val HALF_MASK: Long = 1023

HALF_SHIFT

const val HALF_SHIFT: Long = 10

MAX_UTF8_BYTES_PER_CHAR

const val MAX_UTF8_BYTES_PER_CHAR: Int = 3

Maximum number of UTF8 bytes per UTF16 character.
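The bound is 3 rather than 4 because a code point that needs four UTF-8 bytes also occupies two UTF-16 chars. The sketch below illustrates this with the Kotlin standard library only; `utf8BytesPerChar` is a hypothetical helper, not part of UnicodeUtil.

```kotlin
// Hypothetical helper: UTF-8 bytes produced per UTF-16 char for a given string.
fun utf8BytesPerChar(s: String): Double =
    s.encodeToByteArray().size.toDouble() / s.length

val asciiRatio = utf8BytesPerChar("a")              // 1 byte for 1 char
val bmpRatio = utf8BytesPerChar("\u20AC")           // 3 bytes for 1 char (U+20AC)
val astralRatio = utf8BytesPerChar("\uD83D\uDE00")  // 4 bytes for 2 chars (U+1F600)
```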

MIN_SUPPLEMENTARY_CODE_POINT

const val MIN_SUPPLEMENTARY_CODE_POINT: Int = 65536

SURROGATE_OFFSET

val SURROGATE_OFFSET: Int

UNI_MAX_BMP

const val UNI_MAX_BMP: Long = 65535

UNI_REPLACEMENT_CHAR

const val UNI_REPLACEMENT_CHAR: Int = 65533

UNI_SUR_HIGH_END

const val UNI_SUR_HIGH_END: Int = 56319

UNI_SUR_HIGH_START

const val UNI_SUR_HIGH_START: Int = 55296

UNI_SUR_LOW_END

const val UNI_SUR_LOW_END: Int = 57343

UNI_SUR_LOW_START

const val UNI_SUR_LOW_START: Int = 56320
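The surrogate constants fit together when a UTF-16 surrogate pair is combined into one code point. The sketch below uses local Int copies of the constants. It assumes SURROGATE_OFFSET is defined as in Apache Lucene, namely MIN_SUPPLEMENTARY_CODE_POINT - (UNI_SUR_HIGH_START shl HALF_SHIFT) - UNI_SUR_LOW_START. This page does not show its value, so treat that definition as an assumption. `combineSurrogatesSketch` is a hypothetical name.

```kotlin
// Local copies of the constants listed above, as Int for shift arithmetic.
val halfShiftSketch = 10
val surHighStartSketch = 0xD800   // 55296
val surLowStartSketch = 0xDC00    // 56320
val minSupplementarySketch = 0x10000

// Assumed definition, taken from Apache Lucene; not confirmed by this page.
val surrogateOffsetSketch =
    minSupplementarySketch - (surHighStartSketch shl halfShiftSketch) - surLowStartSketch

// Combines a high and a low surrogate into a supplementary code point.
fun combineSurrogatesSketch(high: Char, low: Char): Int =
    (high.code shl halfShiftSketch) + low.code + surrogateOffsetSketch
```

With these definitions, the pair U+D83D U+DE00 combines to U+1F600.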

utf8CodeLength

val utf8CodeLength: IntArray

v

const val v: Int

Functions

calcUTF16toUTF8Length

fun calcUTF16toUTF8Length(s: CharSequence, offset: Int, len: Int): Int

Calculates the number of UTF8 bytes necessary to write a UTF16 string.
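As an illustration of the calculation, the sketch below counts bytes without producing any output. It is not the library's code. `utf8LengthSketch` is a hypothetical name, and counting an unpaired surrogate as three bytes (the size of U+FFFD) is an assumption borrowed from Apache Lucene.

```kotlin
// Sketch: number of UTF-8 bytes needed for s[offset until offset + len].
fun utf8LengthSketch(s: CharSequence, offset: Int, len: Int): Int {
    val end = offset + len
    var bytes = 0
    var i = offset
    while (i < end) {
        val code = s[i].code
        if (code < 0x80) {
            bytes += 1                       // ASCII
        } else if (code < 0x800) {
            bytes += 2
        } else if (code < 0xD800 || code > 0xDFFF) {
            bytes += 3                       // rest of the BMP, excluding surrogates
        } else if (code < 0xDC00 && i + 1 < end && s[i + 1].code in 0xDC00..0xDFFF) {
            bytes += 4                       // valid surrogate pair
            i++                              // consume the low surrogate too
        } else {
            bytes += 3                       // assumption: unpaired surrogate counts as U+FFFD
        }
        i++
    }
    return bytes
}
```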

codePointAt

fun codePointAt(utf8: ByteArray, pos: Int, reuse: UnicodeUtil.UTF8CodePoint?): UnicodeUtil.UTF8CodePoint

Computes the code point, and its length in bytes, at the specified offset in the provided UTF-8 byte array. Like other related methods in this class, it assumes valid UTF-8 input and does not perform full UTF-8 validation. Passing invalid UTF-8, or a position that is not a valid header byte position, may result in undefined behavior. The method makes no attempt to synchronize or validate.
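A minimal sketch of decoding from a header byte, including the reuse pattern that avoids allocating a holder per call. `DecodedCodePoint` and `decodeAtSketch` are hypothetical stand-ins for UTF8CodePoint and codePointAt. Like the method described above, the sketch assumes valid input and does no validation.

```kotlin
// Hypothetical stand-in for UTF8CodePoint: a mutable holder that callers can reuse.
class DecodedCodePoint(var codePoint: Int = 0, var numBytes: Int = 0)

fun decodeAtSketch(utf8: ByteArray, pos: Int, reuse: DecodedCodePoint?): DecodedCodePoint {
    val result = reuse ?: DecodedCodePoint()
    val lead = utf8[pos].toInt() and 0xFF
    // Sequence length from the header byte; assumes pos is a valid header position.
    val numBytes = when {
        lead < 0x80 -> 1
        lead < 0xE0 -> 2
        lead < 0xF0 -> 3
        else -> 4
    }
    // Payload bits carried by the header byte.
    var v = when (numBytes) {
        1 -> lead
        2 -> lead and 0x1F
        3 -> lead and 0x0F
        else -> lead and 0x07
    }
    // Each continuation byte contributes its low 6 bits.
    for (i in 1 until numBytes) {
        v = (v shl 6) or (utf8[pos + i].toInt() and 0x3F)
    }
    result.codePoint = v
    result.numBytes = numBytes
    return result
}
```

Passing the same holder on every call lets a loop advance by `numBytes` without further allocation.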

codePointCount

fun codePointCount(utf8: BytesRef): Int

Returns the number of code points in this UTF8 sequence.
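For valid UTF-8, one simple way to count code points is to count every byte that is not a continuation byte (binary 10xxxxxx). The sketch below uses that technique, which may differ from the library's implementation. It models the BytesRef as an array, offset, and length triple, which is an assumption, and `countCodePointsSketch` is a hypothetical name.

```kotlin
// Sketch: counts code points by counting non-continuation bytes.
// Assumes the input range holds valid UTF-8.
fun countCodePointsSketch(utf8: ByteArray, offset: Int, length: Int): Int {
    var count = 0
    for (i in offset until offset + length) {
        // Continuation bytes have the top two bits set to 10.
        if ((utf8[i].toInt() and 0xC0) != 0x80) count++
    }
    return count
}
```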

createString

fun createString(value: CharArray, offset: Int, count: Int): String

maxUTF8Length

fun maxUTF8Length(utf16Length: Int): Int

Returns the maximum number of UTF-8 bytes required to encode a UTF-16 sequence (e.g., a Java char[] or String) of the given length.
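A typical use is sizing a destination buffer before encoding. The sketch below multiplies by three and rejects lengths whose product would overflow an Int. The overflow check and the name `maxUtf8LengthSketch` are assumptions, not behavior confirmed by this page.

```kotlin
// Local copy of MAX_UTF8_BYTES_PER_CHAR for the sketch.
val maxBytesPerCharSketch = 3

// Sketch: worst-case UTF-8 size for a UTF-16 length, with an explicit overflow check.
fun maxUtf8LengthSketch(utf16Length: Int): Int {
    val bytes = utf16Length.toLong() * maxBytesPerCharSketch
    require(bytes <= Int.MAX_VALUE) { "UTF-8 length overflows Int: $bytes" }
    return bytes.toInt()
}

// Usage: a buffer that is always large enough for the encoded text.
val text = "caf\u00E9"
val buffer = ByteArray(maxUtf8LengthSketch(text.length))
```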

newString

fun newString(codePoints: IntArray, offset: Int, count: Int): String

Covers the JDK 1.5 API String(int[] codePoints, int offset, int count): creates a String from an array of code points.
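The conversion can be sketched as follows: code points below U+10000 become one char, and supplementary code points are split into a surrogate pair. `stringFromCodePointsSketch` is a hypothetical name, and rejecting out-of-range values with an exception is an assumption about error handling.

```kotlin
// Sketch: builds a String from codePoints[offset until offset + count].
fun stringFromCodePointsSketch(codePoints: IntArray, offset: Int, count: Int): String {
    val sb = StringBuilder(count)
    for (i in offset until offset + count) {
        val cp = codePoints[i]
        require(cp in 0..0x10FFFF) { "invalid code point: $cp" }
        if (cp < 0x10000) {
            sb.append(cp.toChar())
        } else {
            // Split into high and low surrogates: 10 bits each.
            val half = cp - 0x10000
            sb.append((0xD800 + (half shr 10)).toChar())
            sb.append((0xDC00 + (half and 0x3FF)).toChar())
        }
    }
    return sb.toString()
}
```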

toHexString

fun toHexString(s: String): String

UTF16toUTF8

fun UTF16toUTF8(source: CharArray, offset: Int, length: Int, out: ByteArray): Int

Encode characters from a char[] source, starting at offset for length chars. It is the responsibility of the caller to make sure that the destination array is large enough.

fun UTF16toUTF8(s: CharSequence, offset: Int, length: Int, out: ByteArray): Int

Encode characters from the CharSequence s, starting at offset for length characters. It is the responsibility of the caller to make sure that the destination array is large enough.

fun UTF16toUTF8(s: CharSequence, offset: Int, length: Int, out: ByteArray, outOffset: Int): Int

Encode characters from the CharSequence s, starting at offset for length characters. Output to the destination array will begin at outOffset. It is the responsibility of the caller to make sure that the destination array is large enough.
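A minimal sketch of an encoder that writes into a caller-supplied array, in the spirit of the overloads above. It is not the library's code. `encodeUtf16ToUtf8Sketch` is a hypothetical name, and two behaviors are assumptions borrowed from Apache Lucene: the return value is the position just past the last byte written, and an unpaired surrogate is written as U+FFFD.

```kotlin
// Sketch: encodes s[offset until offset + length] into out, starting at outOffset.
// Returns the index just past the last byte written (assumed convention).
fun encodeUtf16ToUtf8Sketch(s: CharSequence, offset: Int, length: Int, out: ByteArray, outOffset: Int): Int {
    var upto = outOffset
    val end = offset + length
    var i = offset
    while (i < end) {
        val code = s[i].code
        if (code < 0x80) {
            out[upto++] = code.toByte()
        } else if (code < 0x800) {
            out[upto++] = (0xC0 or (code shr 6)).toByte()
            out[upto++] = (0x80 or (code and 0x3F)).toByte()
        } else if (code < 0xD800 || code > 0xDFFF) {
            out[upto++] = (0xE0 or (code shr 12)).toByte()
            out[upto++] = (0x80 or ((code shr 6) and 0x3F)).toByte()
            out[upto++] = (0x80 or (code and 0x3F)).toByte()
        } else if (code < 0xDC00 && i + 1 < end && s[i + 1].code in 0xDC00..0xDFFF) {
            // Valid surrogate pair: combine, then write four bytes.
            val cp = 0x10000 + ((code - 0xD800) shl 10) + (s[i + 1].code - 0xDC00)
            out[upto++] = (0xF0 or (cp shr 18)).toByte()
            out[upto++] = (0x80 or ((cp shr 12) and 0x3F)).toByte()
            out[upto++] = (0x80 or ((cp shr 6) and 0x3F)).toByte()
            out[upto++] = (0x80 or (cp and 0x3F)).toByte()
            i++
        } else {
            // Assumption: an unpaired surrogate becomes U+FFFD (EF BF BD).
            out[upto++] = 0xEF.toByte()
            out[upto++] = 0xBF.toByte()
            out[upto++] = 0xBD.toByte()
        }
        i++
    }
    return upto
}
```

Because the caller owns the array, one buffer sized at three bytes per char can be reused across many strings.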

UTF8toUTF16

fun UTF8toUTF16(bytesRef: BytesRef, chars: CharArray): Int

Utility method for UTF8toUTF16(utf8: ByteArray, offset: Int, length: Int, out: CharArray) that takes its input from a BytesRef.

fun UTF8toUTF16(utf8: ByteArray, offset: Int, length: Int, out: CharArray): Int

Interprets the given byte array as UTF-8 and converts to UTF-16. It is the responsibility of the caller to make sure that the destination array is large enough.
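The decoding direction can be sketched the same way. The code below assumes valid UTF-8 and that the return value is the number of chars written; both are assumptions, and `decodeUtf8ToUtf16Sketch` is a hypothetical name. A destination of `length` chars is always large enough, since no UTF-8 sequence decodes to more chars than it has bytes.

```kotlin
// Sketch: decodes utf8[offset until offset + length] into out, starting at index 0.
// Returns the number of chars written. Assumes valid UTF-8; no validation is done.
fun decodeUtf8ToUtf16Sketch(utf8: ByteArray, offset: Int, length: Int, out: CharArray): Int {
    var inPos = offset
    var outPos = 0
    val limit = offset + length
    while (inPos < limit) {
        val b = utf8[inPos++].toInt() and 0xFF
        if (b < 0xC0) {
            out[outPos++] = b.toChar()                       // single byte (ASCII)
        } else if (b < 0xE0) {
            val cp = ((b and 0x1F) shl 6) or (utf8[inPos].toInt() and 0x3F)
            inPos += 1
            out[outPos++] = cp.toChar()
        } else if (b < 0xF0) {
            val cp = ((b and 0x0F) shl 12) or
                ((utf8[inPos].toInt() and 0x3F) shl 6) or
                (utf8[inPos + 1].toInt() and 0x3F)
            inPos += 2
            out[outPos++] = cp.toChar()
        } else {
            val cp = ((b and 0x07) shl 18) or
                ((utf8[inPos].toInt() and 0x3F) shl 12) or
                ((utf8[inPos + 1].toInt() and 0x3F) shl 6) or
                (utf8[inPos + 2].toInt() and 0x3F)
            inPos += 3
            // Supplementary code point: emit a surrogate pair.
            val half = cp - 0x10000
            out[outPos++] = (0xD800 + (half shr 10)).toChar()
            out[outPos++] = (0xDC00 + (half and 0x3FF)).toChar()
        }
    }
    return outPos
}
```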

UTF8toUTF32

fun UTF8toUTF32(utf8: BytesRef, ints: IntArray): Int

Converts the UTF-8 bytes in utf8 to UTF-32 code points written to ints. This method assumes valid UTF-8 input and does not perform full UTF-8 validation: it checks only the first byte of each code point, and for multi-byte sequences any bytes after the head are skipped. It is the responsibility of the caller to make sure that the destination array is large enough.
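A sketch of the conversion, again assuming valid input and modelling the BytesRef as an array, offset, and length triple. `decodeUtf8ToCodePointsSketch` is a hypothetical name, and returning the number of code points written is an assumption. An IntArray with as many slots as input bytes is always large enough.

```kotlin
// Sketch: decodes utf8[offset until offset + length] into code points in ints.
// Returns the number of code points written. Only the head byte decides the length.
fun decodeUtf8ToCodePointsSketch(utf8: ByteArray, offset: Int, length: Int, ints: IntArray): Int {
    var pos = offset
    val limit = offset + length
    var count = 0
    while (pos < limit) {
        val lead = utf8[pos].toInt() and 0xFF
        // Sequence length and initial payload bits, both taken from the head byte.
        val (numBytes, initial) = when {
            lead < 0x80 -> 1 to lead
            lead < 0xE0 -> 2 to (lead and 0x1F)
            lead < 0xF0 -> 3 to (lead and 0x0F)
            else -> 4 to (lead and 0x07)
        }
        var v = initial
        for (i in 1 until numBytes) {
            v = (v shl 6) or (utf8[pos + i].toInt() and 0x3F)
        }
        ints[count++] = v
        pos += numBytes
    }
    return count
}
```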

validUTF16String

fun validUTF16String(s: CharSequence): Boolean

fun validUTF16String(s: CharArray, size: Int): Boolean