Java Encoding Issues

Orange Summer

March 28, 2023

Background on character encoding

Character Encoding

Byte: 1 byte represents 8 bits; for example, 00001111 is 1 byte. Character: any letter or symbol is a character, but...

String.length()

public class StringBytes {
    public static void main(String[] args) {
        String temp1 = "𝄞";
        // The same character (U+1D11E) written as its UTF-16 surrogate pair
        String temp2 = "\uD834\uDD1E";
        System.out.println(temp2);
        // Number of UTF-16 code units
        System.out.println(temp1.length());
        // Number of Unicode code points
        System.out.println(temp1.codePointCount(0, temp1.length()));
    }
}
// Output
𝄞
2
1

/**
 * Returns the length of this string.
 * The length is equal to the number of Unicode code units in the string.
 *
 * @return  the length of the sequence of characters represented by this
 *          object.
 */
public int length() {
    return value.length;
}

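The Javadoc in the Java source of String.length() states that the returned length equals the number of Unicode code units in the string. Java's String API exposes text as a sequence of UTF-16 code units (this is the in-memory representation, not a "default encoding" for I/O). The character above, U+1D11E, lies outside the Basic Multilingual Plane, so UTF-16 encodes it as a surrogate pair: 32 bits in total, made of two 16-bit code units. The string length is therefore 2, which does not match the number of characters in the string.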
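To make the surrogate-pair arithmetic concrete, here is a minimal sketch that splits U+1D11E by hand and compares the result with Character.toChars. The class name SurrogatePairDemo and the helper toSurrogates are made up for this illustration; they are not part of the JDK.

```java
public class SurrogatePairDemo {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 surrogate pair. Hypothetical helper, written for illustration.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;               // leaves a 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int codePoint = 0x1D11E; // the character used in the example above
        char[] manual = toSurrogates(codePoint);
        char[] library = Character.toChars(codePoint);
        System.out.printf("manual:  %04X %04X%n", (int) manual[0], (int) manual[1]);
        System.out.printf("library: %04X %04X%n", (int) library[0], (int) library[1]);
        // Two code units, so length() reports 2
        System.out.println(new String(manual).length());
    }
}
```

Both lines should print D834 DD1E, matching the escape sequence used in the example.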

String.codePointCount()

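As the name suggests, this method returns the number of code points in the string; if you are not sure what a code point is, see the article linked at the top. Every Unicode character corresponds to exactly one code point, so this method counts characters rather than code units. One caveat: a symbol that the reader perceives as a single character but that is built from several code points (for example a base letter plus a combining mark) is still counted as several.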
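As a rough illustration of the difference between code units and code points, the following sketch walks a mixed string with String.codePoints() (available since Java 8). The class name CodePointDemo and the helper countCodePoints are assumptions made for this example.

```java
public class CodePointDemo {
    // Counts Unicode code points; for a whole string this gives the same
    // result as s.codePointCount(0, s.length()).
    static int countCodePoints(String s) {
        return (int) s.codePoints().count();
    }

    public static void main(String[] args) {
        // 'a' (U+0061), U+1D11E as a surrogate pair, and U+5B57
        String s = "a\uD834\uDD1E\u5B57";
        System.out.println(s.length());          // code units
        System.out.println(countCodePoints(s));  // code points
        // Show how many UTF-16 code units each code point occupies
        s.codePoints().forEach(cp ->
            System.out.printf("U+%04X uses %d code unit(s)%n", cp, Character.charCount(cp)));
    }
}
```

For this string, length() is 4 while the code point count is 3, because only U+1D11E needs two code units.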

String.getBytes().length

/**
 * Encodes this {@code String} into a sequence of bytes using the
 * platform's default charset, storing the result into a new byte array.
 *
 * <p> The behavior of this method when this string cannot be encoded in
 * the default charset is unspecified.  The {@link
 * java.nio.charset.CharsetEncoder} class should be used when more control
 * over the encoding process is required.
 *
 * @return  The resultant byte array
 *
 * @since      JDK1.1
 */
public byte[] getBytes() {
    return StringCoding.encode(value, 0, value.length);
}
public byte[] getBytes(String charsetName)
        throws UnsupportedEncodingException {
    if (charsetName == null) throw new NullPointerException();
    return StringCoding.encode(charsetName, value, 0, value.length);
}

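According to the Javadoc, the first method encodes the string into a byte sequence using the platform's default charset. Letting the environment's default charset decide the result is clearly risky, so the usual advice is to use the second method and pass the encoding explicitly, which guarantees the result you expect.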
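A small sketch of that advice, using the getBytes(Charset) overload with StandardCharsets so that no checked exception is involved. The class name ExplicitCharsetDemo and the helper utf8Length are invented for this example; the no-argument call is included only to show that its result depends on the machine it runs on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Byte length with an explicitly chosen charset: same result on every platform.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u5B57"; // U+5B57
        // The default charset varies with the JVM version, OS, and locale settings
        System.out.println("default charset: " + Charset.defaultCharset());
        // Platform-dependent result
        System.out.println("getBytes(): " + s.getBytes().length + " byte(s)");
        // Deterministic result
        System.out.println("getBytes(UTF_8): " + utf8Length(s) + " byte(s)");
    }
}
```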

import java.io.UnsupportedEncodingException;

public class StringBytes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String temp = "字"; // Unicode code point U+5B57
        System.out.println(temp.getBytes("UTF-8").length);
        System.out.println(temp.getBytes("UTF-16").length);
    }
}
// Output
3
4

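In this example, looking up the code point range (U+0800 to U+FFFF) shows that UTF-8 represents the character with 3 bytes, which explains the first result.

But U+5B57 needs only 2 bytes in UTF-16, so why does the output show 4 bytes?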
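The range lookup can be sketched as a small function and checked against the real encoder. The class name Utf8LengthDemo and the helper utf8Length are assumptions for this illustration, and the helper expects a valid code point that is not a surrogate.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Number of bytes UTF-8 uses for a code point, by range.
    // Assumes a valid, non-surrogate code point up to U+10FFFF.
    static int utf8Length(int codePoint) {
        if (codePoint <= 0x7F) return 1;    // U+0000..U+007F
        if (codePoint <= 0x7FF) return 2;   // U+0080..U+07FF
        if (codePoint <= 0xFFFF) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x5B57, 0x1D11E};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: table says %d, encoder says %d%n",
                    cp, utf8Length(cp), actual);
        }
    }
}
```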

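UTF-16 endianness

When a code unit occupies more than one byte and has to be stored or transmitted, byte order becomes an issue.

The UTF-8 encoding rules show that the leading bits of each byte are enough to tell where that byte sits in a sequence (lead byte or continuation byte), so UTF-8 has no byte-order problem. UTF-16 has no such property: swapping the two bytes of a code unit yields a different character, so the endianness of UTF-16 data must be stated when it is transmitted.

The two extra bytes in the earlier getBytes("UTF-16") result come from exactly this: when the chosen charset does not specify an endianness, the encoder prepends two bytes, the byte order mark (BOM, U+FEFF), to indicate it.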
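The following sketch illustrates both points for U+5B57: it prints the UTF-8 bytes in binary so the leading bits are visible, then decodes the UTF-16 bytes in both orders. The class name ByteOrderDemo and the helpers bits and decodeBE are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // Formats a byte as 8 binary digits, padded with leading zeros.
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    // Interprets two bytes as a single big-endian 16-bit code unit.
    static int decodeBE(byte first, byte second) {
        return ((first & 0xFF) << 8) | (second & 0xFF);
    }

    public static void main(String[] args) {
        String s = "\u5B57";
        // UTF-8: 1110xxxx marks the lead byte of a 3-byte sequence,
        // 10xxxxxx marks each continuation byte
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.println(bits(b));
        }
        // UTF-16: the same two bytes read in the opposite order
        // give a different code point
        byte[] be = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.printf("%04X%n", decodeBE(be[0], be[1])); // original order
        System.out.printf("%04X%n", decodeBE(be[1], be[0])); // swapped order
    }
}
```

The binary lines should read 11100101, 10101101, 10010111, and the two hex lines 5B57 and 575B.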
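On the decoding side, the byte order mark is what lets a reader pick the right byte order. A minimal sketch, assuming the standard behavior of Java's UTF-16 charset (the BOM is honored and consumed when decoding); the class name BomDecodeDemo and the helper decodeUtf16 are invented for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Decodes with the generic UTF-16 charset, which reads a leading BOM
    // to decide the byte order.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // FE FF announces big-endian, FF FE announces little-endian
        byte[] bigEndianWithBom = {(byte) 0xFE, (byte) 0xFF, 0x5B, 0x57};
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x57, 0x5B};
        String fromBig = decodeUtf16(bigEndianWithBom);
        String fromLittle = decodeUtf16(littleEndianWithBom);
        // Both byte sequences decode to U+5B57
        System.out.println(fromBig.equals("\u5B57"));
        System.out.println(fromLittle.equals("\u5B57"));
        // The BOM is not part of the decoded text
        System.out.println(fromBig.length());
    }
}
```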

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class StringBytes {
    public static void main(String[] args) {
        // Unicode code point U+5B57
        String temp = "字";
        // No endianness specified: a BOM is prepended
        System.out.println(temp.getBytes(StandardCharsets.UTF_16).length);
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_16)));
        // Little-endian, no BOM
        System.out.println(temp.getBytes(StandardCharsets.UTF_16LE).length);
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_16LE)));
        // Big-endian, no BOM
        System.out.println(temp.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println(Arrays.toString(temp.getBytes(StandardCharsets.UTF_16BE)));
    }
}
// Output
4
[-2, -1, 91, 87]
2
[87, 91]
2
[91, 87]

The code above covers the following UTF-16 cases: