range over a string
Iterate over bytes
You can iterate over bytes in a string:
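s := "a 世"
// Index the string directly: range s would step by code point, not by byte.
for i := 0; i < len(s); i++ {
	b := s[i]
	fmt.Printf("idx: %d, byte: %d\n", i, b)
}
idx: 0, byte: 97
idx: 1, byte: 32
idx: 2, byte: 228
idx: 3, byte: 184
idx: 4, byte: 150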
Iterate over runes
Things are more complicated when you want to iterate over logical characters (runes) in a string:
s := "Hey 世界"
// r is used instead of rune to avoid shadowing the built-in type name.
for i, r := range s {
	fmt.Printf("idx: %d, rune: %d\n", i, r)
}
idx: 0, rune: 72
idx: 1, rune: 101
idx: 2, rune: 121
idx: 3, rune: 32
idx: 4, rune: 19990
idx: 7, rune: 30028
In Go, strings are immutable sequences of bytes. Think of a string as a read-only []byte slice.
Each byte holds a value in the range 0 to 255.
The world's writing systems contain far more characters than that.
The Unicode standard defines a unique value for every known character. Unicode calls these values code points; they are integers that fit in 32 bits.
To represent Unicode code points, Go has a rune type. It is an alias for int32.
Literal strings in Go source code are UTF-8 encoded.
In UTF-8, every Unicode code point is encoded with 1 to 4 bytes.
When you range over a string, Go assumes that the string is UTF-8 encoded. range decodes each code point from UTF-8 and returns the decoded rune together with its byte index in the string.
In the output above, the byte index jumps by 3 for the last code point because the code point before it is a Chinese character that takes 3 bytes in UTF-8.
Strings and UTF-8
Go strings are sequences of bytes. You can put arbitrary binary data in them.
How the bytes are interpreted is up to your code.
Most of the time, a string holds Unicode text in UTF-8 encoding, but outside of string literals in Go source code, Go doesn't check or ensure that string data forms a valid UTF-8 sequence.
That being said, Go provides functionality for working with UTF-8 encoded data.
The behavior of range is one example of that.