Execute Program

Python in Detail: Hash

Welcome to the Hash lesson!

This lesson is shown as static text below. However, it's designed to be used interactively. Click the button below to start!

  • How can Python allow so many different data types as dictionary keys? And how does Python let us look up keys that are equal but not identical?

  • >
    tuple1 = (1, 2)
    tuple2 = tuple([1, 2])
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    tuple1 is tuple2
    Result:
    FalsePass Icon
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    tuple1 == tuple2
    Result:
    TruePass Icon
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    a_dict = {
    tuple1: "some value"
    }

    # When we look up by either tuple, we get the same result. Dictionary lookup
    # works by equality, not object identity.
    a_dict[tuple1] == a_dict[tuple2]
    Result:
    TruePass Icon
  • This seems intuitive: if we ask for a key, and the dictionary has a key that's equal to it, then we should get the value at that key.

  • But this intuition is deceptive. It only works in practice because dicts are built on Python's hashing system. This lesson covers the hash function, the first topic on our path toward understanding dictionary key lookup.

  • In programming (not just in Python), a hash function is a one-way function that takes a value and returns a hash value, often simply called a hash. Crucially, the function always returns the same value for the same input.

  • Python's hash function takes any "hashable" data type, and it returns an integer. It always returns the same value for a given argument. If hash("hello") is -181332398, then it'll always be -181332398, at least until we terminate and restart the Python runtime.

  • >
    hash("hello")
    Result:
  • >
    hash((1, 2, 3))
    Result:
  • >
    a_tuple = (1, 2, 3)
    (hash(a_tuple), hash(a_tuple), hash(a_tuple))
    Result:
    (-2022708474, -2022708474, -2022708474)Pass Icon
  • Hash values aren't consistent between Python interpreters, different computers, or even different runs of Python on a single computer. For security reasons, hash values are randomized when Python starts up.

  • In Execute Program, we disable that randomization so that our code examples return consistent values. If you run the code examples locally, you'll probably get different hash values.

  • Hash values don't tell us anything about the original value, so we should never make assumptions about them. For example, the hash of "hello" above was -181332398. That's true today, on a specific computer, for a specific Python version, configured in a certain way. But it's not guaranteed.

  • Although hash values don't mean anything, they do still follow rules.

    1. A value's hash doesn't change unless we terminate and restart Python itself.
    2. Passing the same value to hash twice always gives the same hash.
    3. If two objects are equal, then their hashes must also be equal. If x == y, then hash(x) == hash(y).
  • >
    hash(2) == hash(2)
    Result:
    TruePass Icon
  • Any two equal objects will always have the same hash, even if they're not identical. For example, we can build two equal tuples through different code paths. They're different objects in memory, so they're not identical. But they're still equal, so their hashes are also equal.

  • >
    tuple1 = (1, 2, 3)

    def build_tuple_2(x):
    return (1, 2, x)

    tuple2 = build_tuple_2(3)
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    tuple1 == tuple2
    Result:
    TruePass Icon
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    tuple1 is tuple2
    Result:
    FalsePass Icon
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    hash(tuple1) == hash(tuple2)
    Result:
    TruePass Icon
  • We said that equal values have equal hashes. But the reverse isn't true! It's possible for two non-equal values to end up with equal hashes, simply by chance. That's called a hash collision.

  • >
    # Unlike most of our code examples, this output is fake.
    #
    # You almost certainly won't get this result if you try the code yourself.
    # Still, this could happen by chance!
    hash("a string") == hash("a different string")
    Result:
  • Hash collisions happen due to a simple fact: there are infinitely many possible Python values, but hashes are 64-bit integers. We can't map an infinite number of possible objects onto a finite number of possible hash values, so duplicates must happen sometimes.

  • Some data types are "unhashable", so we can't pass them to hash at all. In most cases, data types are unhashable because they're mutable (they can be changed after they're created).

  • For example, lists and dictionaries are mutable: we can re-assign values to their indexes or keys, like some_list[0] = some_value. Trying to hash a list or dict raises a TypeError.

  • >
    hash([1, 2, 3])
    Result:
    TypeError: unhashable type: 'list'Pass Icon
  • >
    hash({
    "a": 1,
    "b": 2
    })
    Result:
    TypeError: unhashable type: 'dict'Pass Icon
  • Hashing a mutable value would violate the rules we saw above. Here's an example:

  • >
    x = [1, 2, 3]
    y = [1, 2, 3, 4]

    assert x != y
    x.append(4)

    assert x == y
    Result:
  • The lists only became equal after we ran .append(4). According to the hashing rules, the lists' hashes must be equal when their contents are equal.

  • This leads to a contradiction. Because a value's hash isn't allowed to change, that means that the lists' hashes must've been equal the whole time, even before their values were equal. But Python can't see into the future, so it can't know in advance that we'll make the lists equal later. (Unfortunately, Python isn't that developer-friendly!) So, mutable values aren't hashable.

  • Now that we've learned about what hash does, let's see why hashes are so useful.

  • Suppose that we have some very large tuples, which themselves contain other tuples, which contain more tuples, in a deeply-nested structure. To decide whether two of these large tuples are equal, we have to look at the tuples inside of them, then the tuples inside of those, etc. As the tuples get large, this can become very slow.

  • However, calling hash(x) is usually fast. Because the hash doesn't change, Python can compute it once, then store it for any future hash(...) calls.

  • hash's fast performance lets us optimize many operations, most notably dictionary lookups. We know that equal values have equal hashes. We can also flip that: if hash(x) != hash(y), then x and y must not be equal.

  • Here's an equal function that uses that fact to do comparisons quickly:

  • >
    def equal(x, y):
    if hash(x) != hash(y):
    # The hashes aren't equal, so the values can't possibly be equal.
    return False
    else:
    # The hashes are equal, so the values MIGHT be equal. We still have to
    # check, because it's always possible that the hashes were equal purely
    # by chance.
    return x == y
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    equal(1, 2)
    Result:
    FalsePass Icon
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    equal("hello", "h" + "ello")
    Result:
    TruePass Icon
  • Our equal function is fast when comparing larger compound data structures. It sees that the hashes are different, so it doesn't have to do the full comparison.

  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    equal((1, 2, 3), (1, 2, 4000))
    Result:
    FalsePass Icon
  • But equal function relies on hashing, so it errors when given unhashable values like lists or dicts.

  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    equal([1], [1])
    Result:
    TypeError: unhashable type: 'list'Pass Icon
  • Is equal actually faster than a regular x == y comparison? Sometimes. In the common case of comparing simple primitive values like ints or strings, it's significantly slower than ==. Comparing primitives is so fast that the overhead of calling equal and checking the hash take more time than letting Python do a regular equality check.

  • For large data structures, equal could theoretically outperform ==. But those data structures often have an __eq__ method that already checks the hash. In effect, the data structures already contain the optimization that we wrote in equal.

  • Simply put, hash doesn't exist to speed up ==. If that worked, Python itself would already do it.

  • Instead, hash exists to speed up more complex and slow operations, most notably dict indexing. We'll see more on that in a future lesson.