UnicodeUtil

Utility class that encodes Java's UTF-16 char[] into a UTF-8 byte[] without always allocating a new byte[], as String.getBytes(StandardCharsets.UTF_8) does.

Types

Holds a code point along with the number of bytes required to represent it in UTF-8.

Properties

A binary term consisting of a number of 0xff bytes, likely to be bigger than other terms (e.g. collation keys) one would normally encounter, and definitely bigger than any UTF-8 terms.

const val HALF_MASK: Long = 1023
const val HALF_SHIFT: Long = 10

Maximum number of UTF-8 bytes per UTF-16 character (that is, per 16-bit code unit).

const val UNI_MAX_BMP: Long = 65535
const val UNI_REPLACEMENT_CHAR: Int = 65533
const val UNI_SUR_HIGH_END: Int = 56319
const val UNI_SUR_HIGH_START: Int = 55296
const val UNI_SUR_LOW_END: Int = 57343
const val UNI_SUR_LOW_START: Int = 56320
const val v: Int
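
The surrogate range constants, HALF_SHIFT, and HALF_MASK listed above are the usual ingredients for converting between a UTF-16 surrogate pair and a supplementary code point. The following is a minimal sketch of that arithmetic, assuming the constant values shown above. The helper names `combineSurrogates` and `splitCodePoint`, the `SKETCH_`-prefixed constants, and the use of Int instead of Long are assumptions of this example, not part of UnicodeUtil.

```kotlin
// Values copied from the constants listed above; Int is used here for simplicity.
const val SKETCH_HALF_SHIFT = 10
const val SKETCH_HALF_MASK = 0x3FF            // 1023
const val SKETCH_SUR_HIGH_START = 0xD800      // 55296
const val SKETCH_SUR_LOW_START = 0xDC00       // 56320
const val SKETCH_HALF_BASE = 0x10000          // first supplementary code point (assumed helper)

// Hypothetical helper: combine a high/low surrogate pair into one code point.
fun combineSurrogates(high: Char, low: Char): Int =
    ((high.code - SKETCH_SUR_HIGH_START) shl SKETCH_HALF_SHIFT) +
        (low.code - SKETCH_SUR_LOW_START) + SKETCH_HALF_BASE

// Hypothetical helper: split a supplementary code point (above 0xFFFF) into a pair.
fun splitCodePoint(codePoint: Int): Pair<Char, Char> {
    val v = codePoint - SKETCH_HALF_BASE
    val high = (SKETCH_SUR_HIGH_START + (v shr SKETCH_HALF_SHIFT)).toChar()
    val low = (SKETCH_SUR_LOW_START + (v and SKETCH_HALF_MASK)).toChar()
    return high to low
}
```

For example, the pair U+D83D U+DE00 combines to U+1F600, and splitting U+1F600 gives the same pair back.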

Functions

fun calcUTF16toUTF8Length(s: CharSequence, offset: Int, len: Int): Int

Calculates the number of UTF8 bytes necessary to write a UTF16 string.
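
To illustrate how such a length calculation can work without encoding anything, here is a minimal standalone sketch. The function name `utf8LengthSketch` is hypothetical, and counting an unpaired surrogate as three bytes (the size of U+FFFD, UNI_REPLACEMENT_CHAR) is an assumption of this example rather than documented behavior.

```kotlin
// Hypothetical sketch: count the UTF-8 bytes needed for s[offset, offset + len).
// Assumption: an unpaired surrogate is counted as U+FFFD, which takes 3 bytes.
fun utf8LengthSketch(s: CharSequence, offset: Int, len: Int): Int {
    val end = offset + len
    var bytes = 0
    var i = offset
    while (i < end) {
        val code = s[i].code
        when {
            code < 0x80 -> bytes += 1
            code < 0x800 -> bytes += 2
            code in 0xD800..0xDBFF && i + 1 < end && s[i + 1].code in 0xDC00..0xDFFF -> {
                bytes += 4      // valid surrogate pair: one 4-byte sequence
                i++             // skip the low surrogate
            }
            else -> bytes += 3  // rest of the BMP, or a lone surrogate
        }
        i++
    }
    return bytes
}
```

A caller could use a count like this to allocate an exactly sized destination array before encoding.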

Computes the codepoint and codepoint length (in bytes) of the specified offset in the provided utf8 byte array, assuming UTF8 encoding. As with other related methods in this class, this assumes valid UTF8 input and does not perform full UTF8 validation. Passing invalid UTF8 or a position that is not a valid header byte position may result in undefined behavior. This makes no attempt to synchronize or validate.
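
The following sketch shows one way to decode a code point and its byte length from a header byte position, under the same "valid input only" assumption described above. The function name `decodeCodePointAt` and the Pair return type are hypothetical; the signature of the documented method is not shown on this page.

```kotlin
// Hypothetical sketch: decode the code point whose header byte is at utf8[pos].
// Returns (codePoint, numBytes). Assumes valid UTF-8 and a valid header position;
// continuation bytes are not validated.
fun decodeCodePointAt(utf8: ByteArray, pos: Int): Pair<Int, Int> {
    val lead = utf8[pos].toInt() and 0xFF
    // The header byte alone determines the sequence length.
    val numBytes = when {
        lead < 0x80 -> 1
        lead < 0xE0 -> 2
        lead < 0xF0 -> 3
        else -> 4
    }
    // Keep only the payload bits of the header byte.
    var codePoint = when (numBytes) {
        1 -> lead
        2 -> lead and 0x1F
        3 -> lead and 0x0F
        else -> lead and 0x07
    }
    // Each continuation byte contributes its low 6 bits.
    for (i in 1 until numBytes) {
        codePoint = (codePoint shl 6) or (utf8[pos + i].toInt() and 0x3F)
    }
    return codePoint to numBytes
}
```

Because only the header byte is inspected to pick the length, calling this on a continuation byte or on malformed input yields a meaningless result, which matches the caveat above.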

Returns the number of code points in this UTF8 sequence.
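
One simple way to count code points in valid UTF-8 is to count every byte that is not a continuation byte. The sketch below uses that technique; it is not necessarily how this method is implemented, and the name `countCodePointsSketch` and its parameters are hypothetical.

```kotlin
// Hypothetical sketch: count code points in valid UTF-8 by counting the bytes
// that are not continuation bytes (continuation bytes have the form 10xxxxxx).
fun countCodePointsSketch(utf8: ByteArray, offset: Int, length: Int): Int {
    var count = 0
    for (i in offset until offset + length) {
        if ((utf8[i].toInt() and 0xC0) != 0x80) count++
    }
    return count
}
```

This version never fails on malformed input; it simply returns a count that may not be meaningful.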

fun createString(value: CharArray, offset: Int, count: Int): String
fun maxUTF8Length(utf16Length: Int): Int

Returns the maximum number of UTF-8 bytes required to encode a UTF-16 sequence (e.g., a Java char[] or String) of the given length.
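
A worst-case bound like this follows from the fact that a single UTF-16 code unit never needs more than three UTF-8 bytes (a surrogate pair is two units and four bytes, so it stays under the bound). A minimal sketch, assuming that factor of three; the names `maxUtf8LengthSketch` and `MAX_BYTES_PER_UTF16_UNIT` and the overflow handling are assumptions of this example.

```kotlin
// Assumed bound: a UTF-16 code unit needs at most 3 UTF-8 bytes.
const val MAX_BYTES_PER_UTF16_UNIT = 3

// Hypothetical sketch of a worst-case size that fails loudly on Int overflow
// instead of silently wrapping around.
fun maxUtf8LengthSketch(utf16Length: Int): Int {
    val result = utf16Length.toLong() * MAX_BYTES_PER_UTF16_UNIT
    require(result in 0..Int.MAX_VALUE) { "length out of range: $utf16Length" }
    return result.toInt()
}
```

Sizing a reusable buffer with this bound avoids a separate pass over the input to compute the exact length, at the cost of some unused capacity.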

fun newString(codePoints: IntArray, offset: Int, count: Int): String

Covers the JDK 1.5 API: creates a String from an array of code points.

fun UTF16toUTF8(source: CharArray, offset: Int, length: Int, out: ByteArray): Int

Encodes length characters from the char[] source, starting at offset, into out. It is the caller's responsibility to make sure that the destination array is large enough.

fun UTF16toUTF8(s: CharSequence, offset: Int, length: Int, out: ByteArray): Int

Encodes length characters from the CharSequence s, starting at offset, into out. It is the caller's responsibility to make sure that the destination array is large enough.

fun UTF16toUTF8(s: CharSequence, offset: Int, length: Int, out: ByteArray, outOffset: Int): Int

Encodes length characters from the CharSequence s, starting at offset. Output to the destination array begins at outOffset. It is the caller's responsibility to make sure that the destination array is large enough.
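
To make the caller-supplied-buffer contract concrete, the sketch below encodes into a preallocated array starting at a given output offset. It is a standalone illustration, not the library's implementation. The name `encodeUtf8Sketch`, the choice to return the index just past the last byte written, and the replacement of unpaired surrogates with U+FFFD are all assumptions of this example.

```kotlin
// Hypothetical sketch: encode s[offset, offset + length) as UTF-8 into out,
// starting at outOffset. Returns the index just past the last byte written
// (an assumption of this sketch). Unpaired surrogates are written as U+FFFD.
// The caller must make sure out is large enough.
fun encodeUtf8Sketch(s: CharSequence, offset: Int, length: Int, out: ByteArray, outOffset: Int): Int {
    val end = offset + length
    var pos = outOffset
    var i = offset
    while (i < end) {
        var code = s[i].code
        if (code in 0xD800..0xDBFF && i + 1 < end && s[i + 1].code in 0xDC00..0xDFFF) {
            // Valid surrogate pair: combine into a supplementary code point.
            code = ((code - 0xD800) shl 10) + (s[i + 1].code - 0xDC00) + 0x10000
            i++
        } else if (code in 0xD800..0xDFFF) {
            code = 0xFFFD // unpaired surrogate
        }
        when {
            code < 0x80 -> out[pos++] = code.toByte()
            code < 0x800 -> {
                out[pos++] = (0xC0 or (code shr 6)).toByte()
                out[pos++] = (0x80 or (code and 0x3F)).toByte()
            }
            code < 0x10000 -> {
                out[pos++] = (0xE0 or (code shr 12)).toByte()
                out[pos++] = (0x80 or ((code shr 6) and 0x3F)).toByte()
                out[pos++] = (0x80 or (code and 0x3F)).toByte()
            }
            else -> {
                out[pos++] = (0xF0 or (code shr 18)).toByte()
                out[pos++] = (0x80 or ((code shr 12) and 0x3F)).toByte()
                out[pos++] = (0x80 or ((code shr 6) and 0x3F)).toByte()
                out[pos++] = (0x80 or (code and 0x3F)).toByte()
            }
        }
        i++
    }
    return pos
}
```

A caller would typically allocate the buffer once, sized for the worst case of three bytes per UTF-16 character plus outOffset, and reuse it across calls, which is the allocation saving this class is meant to provide.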

fun UTF8toUTF16(bytesRef: BytesRef, chars: CharArray): Int

Convenience overload of UTF8toUTF16(utf8, offset, length, out) that takes its input from a BytesRef.

fun UTF8toUTF16(utf8: ByteArray, offset: Int, length: Int, out: CharArray): Int

Interprets the given byte array as UTF-8 and converts to UTF-16. It is the responsibility of the caller to make sure that the destination array is large enough.
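
The reverse direction can be sketched in the same style. This standalone example decodes valid UTF-8 into a caller-supplied char array and returns the number of chars written; the name `decodeUtf16Sketch` and that return convention are assumptions of this example, and no validation is attempted.

```kotlin
// Hypothetical sketch: decode valid UTF-8 from utf8[offset, offset + length)
// into UTF-16 code units in out. Returns the number of chars written
// (an assumption of this sketch). The caller must make sure out is large enough.
fun decodeUtf16Sketch(utf8: ByteArray, offset: Int, length: Int, out: CharArray): Int {
    val end = offset + length
    var i = offset
    var outPos = 0
    while (i < end) {
        val lead = utf8[i].toInt() and 0xFF
        val numBytes = when {
            lead < 0x80 -> 1
            lead < 0xE0 -> 2
            lead < 0xF0 -> 3
            else -> 4
        }
        // Strip the length marker bits from the header byte.
        var codePoint = if (numBytes == 1) lead else lead and (0xFF shr (numBytes + 1))
        for (k in 1 until numBytes) {
            codePoint = (codePoint shl 6) or (utf8[i + k].toInt() and 0x3F)
        }
        i += numBytes
        if (codePoint < 0x10000) {
            out[outPos++] = codePoint.toChar()
        } else {
            // Supplementary code point: emit a surrogate pair.
            val v = codePoint - 0x10000
            out[outPos++] = (0xD800 + (v shr 10)).toChar()
            out[outPos++] = (0xDC00 + (v and 0x3FF)).toChar()
        }
    }
    return outPos
}
```

Since every UTF-8 sequence produces at most one char per input byte, a destination of at least length chars is always sufficient for this sketch.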

fun UTF8toUTF32(utf8: BytesRef, ints: IntArray): Int

This method assumes valid UTF-8 input and does not perform full UTF-8 validation: it checks only the first byte of each code point, and for multi-byte sequences any bytes after the head are skipped. It is the caller's responsibility to make sure that the destination array is large enough.
