Python in Detail: Hash
Welcome to the Hash lesson!
This lesson is shown as static text below. However, it's designed to be used interactively. Click the button below to start!
How can Python allow so many different data types as dictionary keys? And how does Python let us look up keys that are equal but not identical?
>
tuple1 = (1, 2)tuple2 = tuple([1, 2])- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
tuple1 is tuple2Result:
False
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
tuple1 == tuple2Result:
True
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
a_dict = {tuple1: "some value"}# When we look up by either tuple, we get the same result. Dictionary lookup# works by equality, not object identity.a_dict[tuple1] == a_dict[tuple2]Result:
True
This seems intuitive: if we ask for a key, and the dictionary has a key that's equal to it, then we should get the value at that key.
But this intuition is deceptive. It only works in practice because dicts are built on Python's hashing system. This lesson covers the
hashfunction, the first topic on our path toward understanding dictionary key lookup.In programming (not just in Python), a hash function is a one-way function that takes a value and returns a hash value, often simply called a hash. Crucially, the function always returns the same value for the same input.
Python's
hashfunction takes any "hashable" data type, and it returns an integer. It always returns the same value for a given argument. Ifhash("hello")is -181332398, then it'll always be -181332398, at least until we terminate and restart the Python runtime.>
hash("hello")Result:
>
hash((1, 2, 3))Result:
>
a_tuple = (1, 2, 3)(hash(a_tuple), hash(a_tuple), hash(a_tuple))Result:
(-2022708474, -2022708474, -2022708474)
Hash values aren't consistent between Python interpreters, different computers, or even different runs of Python on a single computer. For security reasons, hash values are randomized when Python starts up.
In Execute Program, we disable that randomization so that our code examples return consistent values. If you run the code examples locally, you'll probably get different hash values.
Hash values don't tell us anything about the original value, so we should never make assumptions about them. For example, the hash of
"hello"above was-181332398. That's true today, on a specific computer, for a specific Python version, configured in a certain way. But it's not guaranteed.Although hash values don't mean anything, they do still follow rules.
- A value's hash doesn't change unless we terminate and restart Python itself.
- Passing the same value to
hashtwice always gives the same hash. - If two objects are equal, then their hashes must also be equal.
If
x == y, thenhash(x) == hash(y).
>
hash(2) == hash(2)Result:
True
Any two equal objects will always have the same hash, even if they're not identical. For example, we can build two equal tuples through different code paths. They're different objects in memory, so they're not identical. But they're still equal, so their hashes are also equal.
>
tuple1 = (1, 2, 3)def build_tuple_2(x):return (1, 2, x)tuple2 = build_tuple_2(3)- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
tuple1 == tuple2Result:
True
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
tuple1 is tuple2Result:
False
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
hash(tuple1) == hash(tuple2)Result:
True
We said that equal values have equal hashes. But the reverse isn't true! It's possible for two non-equal values to end up with equal hashes, simply by chance. That's called a hash collision.
>
# Unlike most of our code examples, this output is fake.## You almost certainly won't get this result if you try the code yourself.# Still, this could happen by chance!hash("a string") == hash("a different string")Result:
Hash collisions happen due to a simple fact: there are infinitely many possible Python values, but hashes are 64-bit integers. We can't map an infinite number of possible objects onto a finite number of possible hash values, so duplicates must happen sometimes.
Some data types are "unhashable", so we can't pass them to
hashat all. In most cases, data types are unhashable because they're mutable (they can be changed after they're created).For example, lists and dictionaries are mutable: we can re-assign values to their indexes or keys, like
some_list[0] = some_value. Trying to hash a list or dict raises aTypeError.>
hash([1, 2, 3])Result:
TypeError: unhashable type: 'list'
>
hash({"a": 1,"b": 2})Result:
TypeError: unhashable type: 'dict'
Hashing a mutable value would violate the rules we saw above. Here's an example:
>
x = [1, 2, 3]y = [1, 2, 3, 4]assert x != yx.append(4)assert x == yResult:
The lists only became equal after we ran
.append(4). According to the hashing rules, the lists' hashes must be equal when their contents are equal.This leads to a contradiction. Because a value's hash isn't allowed to change, that means that the lists' hashes must've been equal the whole time, even before their values were equal. But Python can't see into the future, so it can't know in advance that we'll make the lists equal later. (Unfortunately, Python isn't that developer-friendly!) So, mutable values aren't hashable.
Now that we've learned about what
hashdoes, let's see why hashes are so useful.Suppose that we have some very large tuples, which themselves contain other tuples, which contain more tuples, in a deeply-nested structure. To decide whether two of these large tuples are equal, we have to look at the tuples inside of them, then the tuples inside of those, etc. As the tuples get large, this can become very slow.
However, calling
hash(x)is usually fast. Because the hash doesn't change, Python can compute it once, then store it for any futurehash(...)calls.hash's fast performance lets us optimize many operations, most notably dictionary lookups. We know that equal values have equal hashes. We can also flip that: ifhash(x) != hash(y), thenxandymust not be equal.Here's an
equalfunction that uses that fact to do comparisons quickly:>
def equal(x, y):if hash(x) != hash(y):# The hashes aren't equal, so the values can't possibly be equal.return Falseelse:# The hashes are equal, so the values MIGHT be equal. We still have to# check, because it's always possible that the hashes were equal purely# by chance.return x == y- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
equal(1, 2)Result:
False
- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
equal("hello", "h" + "ello")Result:
True
Our
equalfunction is fast when comparing larger compound data structures. It sees that the hashes are different, so it doesn't have to do the full comparison.- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
equal((1, 2, 3), (1, 2, 4000))Result:
False
But
equalfunction relies on hashing, so it errors when given unhashable values like lists or dicts.- Note: this code example reuses elements (variables, etc.) defined in earlier examples.
>
equal([1], [1])Result:
TypeError: unhashable type: 'list'
Is
equalactually faster than a regularx == ycomparison? Sometimes. In the common case of comparing simple primitive values like ints or strings, it's significantly slower than==. Comparing primitives is so fast that the overhead of callingequaland checking the hash take more time than letting Python do a regular equality check.For large data structures,
equalcould theoretically outperform==. But those data structures often have an__eq__method that already checks the hash. In effect, the data structures already contain the optimization that we wrote inequal.Simply put,
hashdoesn't exist to speed up==. If that worked, Python itself would already do it.Instead,
hashexists to speed up more complex and slow operations, most notably dict indexing. We'll see more on that in a future lesson.