Python for Programmers: Bytes
Welcome to the Bytes lesson!
When we first learn to program, we're often told that everything in a computer is made of bits: "just ones and zeros". We use those bits to build 8-bit bytes, then use the bytes to build strings, integers, floats, lists, etc. Python lets us ignore bits and bytes most of the time, but sometimes we need to work with them directly.
bytes objects serve that purpose. They're sequences of plain bytes, 8 bits each, with possible values of 0 through 255. (Each bit is 1 or 0, so 8 bits can represent 2**8, or 256, different values.)
We create bytes objects by calling bytes with a list of integer values, one per byte.
>
data = bytes([104, 101, 108, 108, 111])
The indexing operator gives us individual byte values as integers.
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
data[0]
Result:
104
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
data[4]
Result:
111
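As a brief sketch of the ideas above (the variable names are illustrative and not part of the lesson), a bytes object behaves like a read-only sequence of integers: it has a length, it can be iterated, and Python rejects any value that doesn't fit in 8 bits.

```python
# Build the same five bytes used in the lesson's example.
data = bytes([104, 101, 108, 108, 111])

# A bytes object has a length and yields integers when iterated.
length = len(data)
values = [byte for byte in data]

# Values outside 0..255 don't fit in one byte, so bytes() refuses them.
try:
    bytes([256])
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(length, values, out_of_range_rejected)
```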
Bytes are important when we communicate over networks or work with files. Network connections transmit raw bytes, and files also store raw bytes. When data is written to a file or sent over a network connection, it must be converted into bytes somehow.
For strings, that means an "encoding", a bidirectional mapping between characters like "h" and byte values like 104.
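To make "bidirectional" concrete, here is a minimal sketch of a round trip through one encoding. UTF-8 is chosen only for illustration, and the variable names are hypothetical.

```python
text = "h"

# str -> bytes: encoding maps each character to one or more byte values.
encoded = text.encode("utf_8")

# bytes -> str: decoding with the same encoding reverses the mapping.
decoded = encoded.decode("utf_8")

print(list(encoded))    # the byte values for "h"
print(decoded == text)  # the round trip gives back the original string
```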
Here's the string "hello" encoded using a few different encodings. We'll print the individual byte values as integers, like when we built the bytes object above.
>
print(list("hello".encode("utf_8")))
print(list("hello".encode("utf_16")))
print(list("hello".encode("cp037")))
console output
[104, 101, 108, 108, 111]
[255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0]
[136, 133, 147, 147, 150]
In Python, there's only one string "hello". But to send the string "hello" over a network connection, the software on both ends of the connection must agree on an encoding. Is it UTF-8? UTF-16? IBM037 (the "cp037" above)? Each of the byte sequences above is a different encoding of the same string, "hello". None of them is the "correct" or "real" encoding; they're all equally legitimate.
What happens when we try to read text using the wrong encoding? Sometimes, we'll get an answer that seems correct, but only because we got lucky: that particular string happens to encode to the same bytes in both encodings.
Here are some bytes encoded using UTF-8. But we decode them with ShiftJIS, an encoding sometimes used for Japanese text. We get lucky this time: these five characters encode to the same bytes in both UTF-8 and ShiftJIS.
>
bytes([104, 101, 108, 108, 111]).decode("shift_jis")
Result:
'hello'
Here are the same bytes again, but decoded using the "cp037" encoding. This gives us nonsense, even though the bytes are identical.
>
bytes([104, 101, 108, 108, 111]).decode("cp037")
Result:
'ÇÁ%%?'
Here are the same bytes yet again, but decoded using UTF-16, which is another Unicode encoding. This raises an exception, because these bytes aren't valid UTF-16 data.
>
bytes([104, 101, 108, 108, 111]).decode("utf_16")
Result:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x6f in position 4: truncated data
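The three outcomes above (a lucky match, silent nonsense, and an exception) can be compared side by side. The sketch below uses a hypothetical helper, try_decode, that is not part of the lesson; it returns None when the bytes are invalid in the requested encoding.

```python
raw = bytes([104, 101, 108, 108, 111])

def try_decode(raw_bytes, encoding):
    """Return the decoded string, or None if decoding raises an error."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        return None

# Same bytes, four different interpretations.
for encoding in ["utf_8", "shift_jis", "cp037", "utf_16"]:
    print(encoding, repr(try_decode(raw, encoding)))
```

Note that only the UTF-16 case fails loudly. The cp037 case returns a string without any complaint, which is why wrong-encoding bugs can go unnoticed.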
Text encodings are like measurement units. An air temperature of 55 means very different things in Fahrenheit (you might want a light jacket) vs. Celsius (your life is in danger due to heatstroke) vs. Kelvin (you are ice). And the bytes [104, 101, 108, 108, 111] mean very different things in UTF-8 vs. UTF-16 vs. IBM037.
In practice, most text is now Unicode. But that doesn't solve the problem! As we saw above, UTF-8 and UTF-16 give very different bytes for the same string. Both are Unicode encodings, and both are widely used.
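A small sketch of how much the Unicode encodings differ for the same string. As an assumption for this example, it uses the "utf_16_le" and "utf_32_le" codec variants, which fix the byte order and omit the byte order mark so the sizes are easy to compare.

```python
text = "hello"
encodings = ["utf_8", "utf_16_le", "utf_32_le"]

# Encode the same five characters three ways and record the byte counts.
sizes = {}
for encoding in encodings:
    encoded = text.encode(encoding)
    sizes[encoding] = len(encoded)
    print(encoding, len(encoded), list(encoded))
```

For this ASCII-only string, each character takes 1 byte in UTF-8, 2 bytes in UTF-16, and 4 bytes in UTF-32, so a receiver that guesses the wrong one will misread every character.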
Python has distinct types for bytes (bytes) and strings (str) because bytes and strings are separate in practice. Having separate types ensures that the encoding and decoding steps are explicit, which reduces mistakes like the ones shown above.
So far, we've created bytes objects by calling bytes(...). We can also write them with a string-like syntax, but with a b prefix. This syntax allows ASCII characters, which are converted into their equivalent ASCII codes. (Roughly speaking, ASCII includes English letters, numbers, and punctuation.)
>
b"hello"
Result:
b'hello'
If we look inside that bytes object, we'll see the same familiar array of byte values from some earlier code examples.
>
list(b"hello")
Result:
[104, 101, 108, 108, 111]
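The following sketch illustrates how Python keeps the two types apart, using illustrative variable names. A bytes literal equals the bytes built from a list of integers, but it never equals a str, and mixing the two in an operation raises a TypeError.

```python
literal = b"hello"
from_list = bytes([104, 101, 108, 108, 111])

# The b"..." literal and the list-based constructor produce equal objects.
same_bytes = literal == from_list

# A bytes object is not equal to a str, even if it looks similar.
equals_str = literal == "hello"

# Combining bytes with str is an error; we must encode or decode first.
try:
    literal + "!"
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

print(same_bytes, equals_str, mixing_allowed)
```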
We can use the \x escape code to specify any byte as a hexadecimal value. (Hexadecimal is the base-16 number system. Hex digits have the values 0-15, written using the digits 0-9 and the letters a-f. For example, \x00 is 0, \x0f is 15, \xf0 is 240 (15 * 16), and \xff is the maximum byte, 255.)
>
b"\x00\x01\xff"
Result:
b'\x00\x01\xff'
>
list(b"\x00\x01\xff")
Result:
[0, 1, 255]
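Beyond the \x escape, Python also has built-in helpers for moving between hex text and bytes. The lesson doesn't cover them; this is a short sketch using bytes.fromhex, the hex method, and int with base 16.

```python
# Build bytes from a string of hex digits, two digits per byte.
raw = bytes.fromhex("0001ff")
matches_literal = raw == b"\x00\x01\xff"

# Convert bytes back into a string of hex digits.
hex_text = raw.hex()

# Check the arithmetic from the text: f0 is 15 * 16, and ff is 255.
high_nibble = int("f0", 16)
max_byte = int("ff", 16)

print(matches_literal, hex_text, high_nibble, max_byte)
```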
If some of our escaped bytes are printable ASCII characters, Python will display them as those characters when we look at the bytes object. This is a bit misleading, because it makes bytes look like strings. But it's important to remember that bytes are not strings, even when they're rendered in a similar way!
>
b"\x68ello\x00"
Result:
b'hello\x00'
Although bytes objects aren't strings, they do support some convenience methods and operators that are familiar from str.
>
b"TTP" in b"HTTP GET"
Result:
True
>
b"Hello " + b"World"
Result:
b'Hello World'
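A few more str-like operations, sketched here with an illustrative request variable. The arguments must themselves be bytes, and indexing differs from str: a single index gives an integer, while a slice gives a bytes object.

```python
request = b"HTTP GET"

# Familiar methods from str, but every argument is a bytes object.
starts_with_http = request.startswith(b"HTTP")
parts = request.split(b" ")
lowered = request.lower()

# Indexing returns an integer; slicing returns bytes.
first_value = request[0]
first_slice = request[0:1]

print(starts_with_http, parts, lowered, first_value, first_slice)
```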
Like strings, bytes are immutable. Trying to modify them after they're created raises a TypeError.
>
data = b"GTTP GET"
data[0] = 72
data
Result:
TypeError: 'bytes' object does not support item assignment
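Since bytes can't be changed in place, one way to fix the typo in that example is to build a new object. The sketch below shows two possible approaches: slicing and concatenation, or copying into a bytearray. The bytearray type is Python's mutable counterpart to bytes and is not covered in this lesson.

```python
data = b"GTTP GET"

# Option 1: build a new bytes object from a literal and a slice.
fixed = b"H" + data[1:]

# Option 2: copy into a mutable bytearray, edit it, then convert back.
buffer = bytearray(data)
buffer[0] = 72  # 72 is the ASCII code for "H"
fixed_again = bytes(buffer)

print(fixed, fixed_again, data)
```

In both cases the original data object is left unchanged.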
The distinction between bytes and strings may not be the most exciting programming topic, but it's very important. Newer programming environments usually draw a distinction between the two, like Python does. For example, modern versions of JavaScript have Uint8Array vs. regular JavaScript strings. (Uint8Array is JavaScript's closest equivalent of bytes.)
Knowing the difference between bytes and strings will save you a lot of frustration when working with text! Most programmers today will never encounter the IBM037 or ShiftJIS encodings that we saw above, but most of us do eventually work with more than one of UTF-8, UTF-16, and UTF-32.