Execute Program

Python for Programmers: Bytes

Welcome to the Bytes lesson!

  • When we first learn to program, we're often told that everything in a computer is made of bits: "just ones and zeros". We use those bits to build 8-bit bytes, then use the bytes to build strings, integers, floats, lists, etc. Python lets us ignore bits and bytes most of the time, but sometimes we need to work with them directly.

  • bytes objects serve that purpose. They're sequences of plain bytes, 8 bits each, with possible values of 0 through 255. (Each bit is 1 or 0, so 8 bits can represent 2**8, or 256, different values.)
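
  • As a minimal sketch of that range, we can build bytes at both ends of it and then try a value that doesn't fit in 8 bits. The variable names here are only for illustration.

```python
# Each value must fit in one byte: 0 through 255.
smallest = bytes([0])
largest = bytes([255])

# 8 bits can represent 2**8 different values.
value_count = 2 ** 8

# A value outside that range is rejected with a ValueError.
try:
    bytes([256])
    error_message = None
except ValueError as error:
    error_message = str(error)

print(value_count)
print(error_message)
```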

  • We create bytes objects by calling bytes with a list of integer values, one per byte.

  • >
    data = bytes([104, 101, 108, 108, 111])
  • The indexing operator gives us individual byte values as integers.

  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    data[0]
    Result:
    104
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    data[4]
    Result:
    111
  • Bytes are important when we communicate over networks or work with files. Network connections transmit raw bytes, and files also store raw bytes. When data is written to a file or sent over a network connection, it must be converted into bytes somehow.
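
  • Here's a small sketch of the file case. It writes a bytes object to a scratch file in binary mode, then reads it back. The file name example.bin and the temporary directory are assumptions made only for this example.

```python
import os
import tempfile

data = bytes([104, 101, 108, 108, 111])

# A scratch directory that is deleted when the block ends.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "example.bin")

    # "wb" and "rb" open the file in binary mode, so we write and read bytes.
    with open(path, "wb") as file:
        file.write(data)
    with open(path, "rb") as file:
        stored = file.read()

# The file gave back exactly the bytes we wrote.
print(list(stored))
```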

  • For strings, that means an "encoding", a bidirectional mapping between characters like "h" and byte values like 104.
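
  • To see that the mapping goes in both directions, here's a short sketch that encodes the character "h" and then decodes the result. It assumes UTF-8, but any encoding that includes "h" would round-trip in the same way.

```python
char = "h"

# str -> bytes
encoded = char.encode("utf_8")

# bytes -> str, using the same encoding
decoded = encoded.decode("utf_8")

print(list(encoded))
print(decoded == char)
```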

  • Here's the string "hello" encoded using a few different encodings. We'll print the individual byte values as integers, like when we built the bytes object above.

  • >
    print(list("hello".encode("utf_8")))
    print(list("hello".encode("utf_16")))
    print(list("hello".encode("cp037")))
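    console output
    [104, 101, 108, 108, 111]
    [255, 254, 104, 0, 101, 0, 108, 0, 108, 0, 111, 0]
    [136, 133, 147, 147, 150]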
  • In Python, there's only one string "hello". But to send the string "hello" over a network connection, the software on both ends of the connection must agree on an encoding. Is it UTF-8? UTF-16? IBM037 (the "cp037" above)? Each of the byte sequences above is a different encoding of the same string, "hello". None of them is the "correct" or "real" encoding; they're all equally legitimate.

  • What happens when we try to read text using the wrong encoding? Sometimes, we'll get an answer that seems correct, but only because we got lucky: the string we're reading happens to encode to the same bytes in both encodings.

  • Here are some bytes encoded using UTF-8. But we decode them with ShiftJIS, an encoding sometimes used for Japanese text. We get lucky this time: these five characters encode to the same bytes in both UTF-8 and ShiftJIS.

  • >
    bytes([104, 101, 108, 108, 111]).decode("shift_jis")
    Result:
    'hello'
  • Here are the same bytes again, but decoded using the "cp037" encoding. This gives us nonsense, even though the bytes are identical.

  • >
    bytes([104, 101, 108, 108, 111]).decode("cp037")
    Result:
    'ÇÁ%%?'
  • Here are the same bytes yet again, but decoded using UTF-16, which is another Unicode encoding. This raises an exception, because these bytes aren't valid UTF-16 data. (UTF-16 stores text in 2-byte units, so five bytes can't be a complete sequence.)

  • >
    bytes([104, 101, 108, 108, 111]).decode("utf_16")
    Result:
    UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x6f in position 4: truncated data
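  • When we aren't sure that some bytes are valid in a given encoding, we can catch the exception. This is only a sketch: try_decode is a hypothetical helper written for this example, not a built-in function.

```python
raw = bytes([104, 101, 108, 108, 111])

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes aren't valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Try the same bytes with three different encodings.
results = {name: try_decode(raw, name) for name in ["utf_8", "shift_jis", "utf_16"]}
print(results)
```

    Note that a successful decode doesn't prove that we chose the right encoding. It only means that the bytes happened to be valid in the encoding that we tried.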
  • Text encodings are like measurement units. An air temperature of 55 means very different things in Fahrenheit (you might want a light jacket) vs. Celsius (your life is in danger due to heatstroke) vs. Kelvin (you are ice). And the bytes [104, 101, 108, 108, 111] mean very different things in UTF-8 vs. UTF-16 vs. IBM037.

  • In practice, most text is now Unicode. But that doesn't solve the problem! As we saw above, UTF-8 and UTF-16 give very different bytes for the same string. Both are Unicode encodings, and both are widely used.

  • Python has distinct types for bytes (bytes) and strings (str) because bytes and strings are separate in practice. Having separate types ensures that the encoding and decoding steps are explicit, which reduces mistakes like the ones shown above.
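
  • Here's a brief sketch of what that separation looks like in practice. A str and a bytes object never compare as equal, and trying to combine them raises a TypeError. The variable names are only for illustration.

```python
text = "hello"
raw = bytes([104, 101, 108, 108, 111])

# A str and a bytes object are never equal, even when raw holds
# the UTF-8 encoding of text.
same = (text == raw)

# Combining the two types directly raises a TypeError.
try:
    text + raw
    mixed_error = None
except TypeError as error:
    mixed_error = type(error).__name__

# Once we encode explicitly, the comparison is bytes vs. bytes.
explicit_match = (text.encode("utf_8") == raw)

print(same, mixed_error, explicit_match)
```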

  • So far, we've created bytes objects by calling bytes(...). We can also write them with a string-like syntax, but with a b prefix. This syntax allows ASCII characters, which are converted into their equivalent ASCII codes. (Roughly speaking, ASCII includes English letters, numbers, and punctuation.)

  • >
    b"hello"
    Result:
    b'hello'
  • If we look inside that bytes object, we'll see the same familiar array of byte values from some earlier code examples.

  • >
    list(b"hello")
    Result:
    [104, 101, 108, 108, 111]
  • We can use the \x escape code to specify any byte as a two-digit hexadecimal value. (Hexadecimal is the base-16 number system. Hex digits have the values 0-15, written using the digits 0-9 and the letters a-f. For example, \x00 is 0, \x0f is 15, \xf0 is 240 (15 * 16), and \xff is the maximum byte value, 255 (15 * 16 + 15).)

  • >
    b"\x00\x01\xff"
    Result:
    b'\x00\x01\xff'
  • >
    list(b"\x00\x01\xff")
    Result:
    [0, 1, 255]
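  • The hex arithmetic can be checked with a short sketch. Each two-digit hex value is the first digit times 16, plus the second digit. The pairs dictionary is just an illustration that lists a few values by hand.

```python
# Each entry maps two hex digits to (first digit * 16) + second digit.
pairs = {
    "00": 0 * 16 + 0,
    "0f": 0 * 16 + 15,
    "f0": 15 * 16 + 0,
    "ff": 15 * 16 + 15,
}

for digits, expected in pairs.items():
    # int(..., 16) parses a string of base-16 digits.
    print(digits, int(digits, 16), expected)

# The same four values written with \x escapes.
hex_bytes = b"\x00\x0f\xf0\xff"
print(list(hex_bytes))
```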
  • If some of our escaped bytes are printable ASCII characters, Python will print them in that way when we look at the bytes object. Other bytes, like \x00, are still shown as escapes. This is a bit misleading, because it makes bytes look like strings. But it's important to remember that bytes are not strings, even when they're rendered in a similar way!

  • >
    b"\x68ello\x00"
    Result:
    b'hello\x00'
  • Although bytes objects aren't strings, they do support some convenience methods and operators that are familiar from str.

  • >
    b"TTP" in b"HTTP GET"
    Result:
    True
  • >
    b"Hello " + b"World"
    Result:
    b'Hello World'
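  • A few more str-like operations also work on bytes. This sketch isn't a complete list, and the request value is made up for the example. Note that the arguments are bytes too, not strings.

```python
request = b"HTTP GET"

# Method arguments must be bytes, not str.
starts = request.startswith(b"HTTP")
parts = request.split(b" ")
lowered = request.lower()

# len counts bytes, and slicing gives back a bytes object.
length = len(request)
prefix = request[0:4]

print(starts, parts, lowered, length, prefix)
```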
  • Like strings, bytes are immutable. Trying to modify them after they're created raises a TypeError.

  • >
    data = b"GTTP GET"
    data[0] = 72
    data
    Result:
    TypeError: 'bytes' object does not support item assignment
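  • Because bytes are immutable, one way to "change" a byte is to build a new bytes object instead. This sketch uses slicing and concatenation. It's one possible approach, not the only one.

```python
data = b"GTTP GET"

# Build a new object: the byte 72 ("H"), followed by everything after the first byte.
fixed = bytes([72]) + data[1:]

print(fixed)
print(data)  # The original object is unchanged.
```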
  • The distinction between bytes and strings may not be the most exciting programming topic, but it's very important. Newer programming environments usually draw a distinction between the two, like Python does. For example, modern versions of JavaScript have Uint8Array vs. regular JavaScript strings. (Uint8Array is JavaScript's equivalent of bytes.)

  • Knowing the difference between bytes and strings will save you a lot of frustration when working with text! Most programmers today will never encounter the IBM037 or ShiftJIS encodings that we saw above, but most of us do eventually work with more than one of UTF-8, UTF-16, and UTF-32.
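
  • As a closing sketch, here's the same string in all three of those Unicode encodings. We use the explicit little-endian variants ("utf_16_le" and "utf_32_le"), which is a choice made for this example so that no byte order mark is added and the byte counts are easy to compare.

```python
text = "hello"

sizes = {}
for encoding in ["utf_8", "utf_16_le", "utf_32_le"]:
    encoded = text.encode(encoding)
    sizes[encoding] = len(encoded)
    print(encoding, len(encoded), list(encoded))
```

    For this ASCII-only string, UTF-8 uses one byte per character, UTF-16 uses two, and UTF-32 uses four. The bytes differ, but all three decode back to the same string.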