Execute Program

Python in Detail: Itertools

Welcome to the Itertools lesson!

  • Iterators are common in real-world Python code, so we often find ourselves combining them, splitting them, and filtering them. Fortunately, Python ships with itertools, a standard-library module of utility functions for working with iterators. We'll explore a small sampling of itertools functions in this lesson.

  • Sometimes we need an iterator that counts up from a given starting value. We could write it as a generator.

  • >
    def count(initial):
        i = initial
        while True:
            yield i
            i += 1

    counter = count(22)
    (next(counter), next(counter), next(counter))
    Result:
    (22, 23, 24)
  • However, we don't have to write that code because itertools.count already does the same thing. It starts from 0 by default, or we can provide a starting value.

  • >
    import itertools

    counter = itertools.count(5)
    (next(counter), next(counter), next(counter))
    Result:
    (5, 6, 7)
  • >
    import itertools

    counter = itertools.count()
    (next(counter), next(counter), next(counter))
    Result:
    (0, 1, 2)
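  • Beyond the starting value, itertools.count also accepts an optional step as its second argument. The short sketch below shows that; the names by_fives and first_three are ours, chosen only for illustration.

```python
import itertools

# count(start, step): begin at 10 and add 5 each time.
by_fives = itertools.count(10, 5)
first_three = (next(by_fives), next(by_fives), next(by_fives))
print(first_three)  # (10, 15, 20)
```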
  • Let's see itertools.count in practice. In an earlier lesson, we used defaultdict to generate IDs for users. The IDs themselves are created in a get_next_id function, which modifies a global variable. Here's the code we wrote in that lesson:

  • >
    from collections import defaultdict

    next_id = 1

    def get_next_id():
        global next_id
        id = next_id
        next_id += 1
        return id

    user_ids = defaultdict(get_next_id)

    first_amir_id = user_ids["Amir"]
    betty_id = user_ids["Betty"]
    second_amir_id = user_ids["Amir"]
    (first_amir_id, betty_id, second_amir_id)
    Result:
    (1, 2, 1)
  • To generate IDs, we had to import defaultdict, define a global variable, and write a function to increment that variable. The next example solves the same problem, but uses itertools.count instead of our get_next_id function. That lets us remove 5 lines of code.

  • >
    import collections, itertools

    next_id_iterator = itertools.count(1)
    user_ids = collections.defaultdict(lambda: next(next_id_iterator))

    first_amir_id = user_ids["Amir"]
    betty_id = user_ids["Betty"]
    second_amir_id = user_ids["Amir"]
    (first_amir_id, betty_id, second_amir_id)
    Result:
    (1, 2, 1)
  • We made the code shorter, but that didn't require anything clever. We just made better use of the tools that Python gives us.
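
  • As a possible variation on the itertools.count version of the ID generator, the lambda can be dropped too. This is only a sketch of one alternative, not the lesson's recommended form: an iterator's bound __next__ method takes no arguments, so it can act as the defaultdict factory directly. The variable ids is a name we introduce for the demonstration.

```python
import collections
import itertools

# The bound __next__ method is a zero-argument callable, which is exactly
# what defaultdict expects as its default factory.
user_ids = collections.defaultdict(itertools.count(1).__next__)

ids = (user_ids["Amir"], user_ids["Betty"], user_ids["Amir"])
print(ids)  # (1, 2, 1)
```

    Whether this reads better than the lambda is a matter of taste; the lambda makes the call to next more visible.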

  • For convenience, the next few examples call list on iterators to see their contents. But remember that itertools functions return iterators, not lists!

  • The itertools.count iterator is infinitely long, so we can't convert it into a list. Trying to build that infinite list would eventually consume all available memory, then crash.

  • However, we can slice the iterator to get only the section of it that we want. We've already seen list slicing: some_list[start:end]. For iterators, we use the itertools.islice function to do the same thing.

  • >
    import itertools

    list(itertools.islice(itertools.count(), 2, 5))
    Result:
    [2, 3, 4]
  • >
    import itertools

    def letters():
        for char in "Ms. Fluff":
            yield char

    list(itertools.islice(letters(), 4, 9))
    Result:
    ['F', 'l', 'u', 'f', 'f']
  • Strictly speaking, we don't need that letters function. The string itself is an iterable.

  • >
    import itertools

    list(itertools.islice("Ms. Fluff", 4, 9))
    Result:
    ['F', 'l', 'u', 'f', 'f']
  • The repeat function repeats a single value a certain number of times.

  • >
    import itertools

    list(itertools.repeat("a", 3))
    Result:
    ['a', 'a', 'a']
  • If we don't provide a number, the iterator always returns the same value.

  • >
    import itertools

    always_a = itertools.repeat("a")
    next(always_a)
    Result:
    'a'
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    (next(always_a), next(always_a))
    Result:
    ('a', 'a')
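  • One place an endless repeat can be handy is as a constant argument stream for map. The sketch below assumes we want to square a few numbers with the built-in pow; the name squares is ours.

```python
import itertools

# map stops as soon as its shortest input is exhausted, so pairing the
# infinite repeat(2) with the finite range(5) is safe.
squares = list(map(pow, range(5), itertools.repeat(2)))
print(squares)  # [0, 1, 4, 9, 16]
```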
  • The cycle function creates an infinite iterator that endlessly cycles through another iterator's values. When it gets to the end of the source iterator, it starts back at the beginning again.

  • >
    list(range(3))
    Result:
    [0, 1, 2]
  • >
    import itertools

    cycling = itertools.cycle(range(3))
    list(itertools.islice(cycling, 0, 22))
    Result:
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
  • Note: this code example reuses elements (variables, etc.) defined in earlier examples.
    >
    list(itertools.islice(cycling, 9184, 9188))
    Result:
  • In that example, we only got four list elements. But to get to those elements, islice first had to consume 9,184 other iterator elements.
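
  • The same consumption is easier to see on a small scale. In this sketch (the names numbers, middle, and next_value are ours), the values that islice skips and the values it yields are all gone from the underlying iterator afterward.

```python
import itertools

numbers = iter(range(10))

# islice skips 0, 1, and 2, then yields 3 and 4.
middle = list(itertools.islice(numbers, 3, 5))

# The underlying iterator has moved past everything islice touched.
next_value = next(numbers)
print(middle, next_value)  # [3, 4] 5
```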

  • The takewhile function uses another function to decide where to cut an iterator short. It calls the function on each iterated value, and returns an iterator that gives us values for as long as the function returns True. The first time the function returns False, the iterator immediately ends, even if later values would have passed the test.

  • >
    import itertools

    list(itertools.takewhile(lambda n: n < 5, itertools.count()))
    Result:
    [0, 1, 2, 3, 4]
  • >
    import itertools

    # Remember: takewhile stops as soon as the function returns False!
    list(itertools.takewhile(lambda n: n % 2 == 0, range(0, 10)))
    Result:
    [0]
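  • It can help to compare takewhile with the built-in filter, which tests every value instead of stopping at the first failure. The sketch below uses the same predicate and range as the example above; the helper is_even and the variable names are ours.

```python
import itertools

def is_even(n):
    return n % 2 == 0

# takewhile gives up at the first odd number, which is 1.
taken = list(itertools.takewhile(is_even, range(10)))

# filter keeps going and tests every value in the range.
filtered = list(filter(is_even, range(10)))
print(taken, filtered)  # [0] [0, 2, 4, 6, 8]
```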
  • One important thing about takewhile: to find the first False value, it needs to actually consume that value from the underlying iterator. (In the example above, that final value was 1.) The passed function returns False for that value, so it doesn't appear in the final iterator. But it also isn't left in the original iterator. It's simply lost.

  • >
    import itertools

    my_iter = itertools.count()
    list(itertools.takewhile(lambda n: n <= 4, my_iter))
    first_unconsumed_value = next(my_iter)
    first_unconsumed_value
    Result:
    6
  • Sometimes we want the opposite: we only want the values starting from the first point where the function returns False. For that, we can use dropwhile. It consumes (drops) all of the iterated values for as long as the function returns True. Once the function returns False for a value, dropwhile gives us that value and all remaining values, regardless of what the function would return for them.

  • >
    import itertools

    my_iter = itertools.dropwhile(lambda n: n < 5, itertools.count())
    (next(my_iter), next(my_iter))
    Result:
    (5, 6)
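  • With itertools.count the values only grow, so it's hard to see that dropwhile stops checking its function after the dropping ends. This sketch uses a small list of our own (and our own variable name, remaining) where small values reappear later.

```python
import itertools

# The function returns True for 1 and 2, so they are dropped. It first
# returns False at 3. After that, dropwhile no longer calls the function,
# so the later 1 and 2 pass through.
remaining = list(itertools.dropwhile(lambda n: n < 3, [1, 2, 3, 1, 2]))
print(remaining)  # [3, 1, 2]
```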
  • The tee function duplicates an iterator into two iterators. Each of the resulting iterators iterates over all of the values from the original iterator. In other words, consuming values from one iterator won't affect the other iterator.

  • This is called "tee" by analogy to a "tee joint" in plumbing: a T-shaped section of pipe that splits one pipe into two pipes. It's an imperfect analogy: a given water molecule can either go left or right in a pipe, but not both, whereas tee sends each value to both iterators.

  • >
    import itertools

    numbers = itertools.count()
    (my_iter_1, my_iter_2) = itertools.tee(numbers)
    (next(my_iter_1), next(my_iter_2))
    Result:
    (0, 0)
  • We can provide an integer as an optional second argument to tee to split it into even more iterators.

  • >
    import itertools

    numbers = itertools.count()
    (my_iter_1, my_iter_2, my_iter_3, my_iter_4) = itertools.tee(numbers, 4)
    (next(my_iter_1), next(my_iter_2), next(my_iter_3), next(my_iter_4))
    Result:
    (0, 0, 0, 0)
  • Normally, iterators don't store all of their values in memory. That's why they're so useful for representing infinite sequences of data. However, tee introduces a new complication. Here's a demonstration that we'll analyze after seeing it.

  • >
    import itertools

    numbers = itertools.count()
    (my_iter_1, my_iter_2) = itertools.tee(numbers)

    # Consume 1,000 numbers from my_iter_1.
    for _ in range(1000):
        next(my_iter_1)

    # my_iter_2 is still waiting at the first number.
    next(my_iter_2)
    Result:
    0
  • If we continue to call next(my_iter_2), we'll get 1, 2, etc. But where are those numbers coming from? We know that the original itertools.count iterator was already iterated all the way to 1,000, so the numbers aren't coming from that original iterator.

  • The answer is that tee stores those values in memory: it has to hold on to every value that one of its iterators has already consumed but another hasn't reached yet. But note that the two iterators returned by tee are still independent. Both of them will iterate over each value from the original iterator. Iterating one of the tee iterators won't affect the other tee iterator.

  • Sometimes this can cause serious memory usage problems. Imagine that the code above continues to iterate over my_iter_1, but never iterates my_iter_2. Python has to keep all of the iterated values in memory forever, just in case we start iterating my_iter_2 in the future. Eventually, Python will run out of memory and crash. That's a problem, but it's also expected: when working with large data sets, memory management is always a concern, even in a language like Python.
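
  • The memory cost stays small when the tee iterators advance together. As a sketch of that pattern, here is a pairwise helper modeled on a recipe from the itertools documentation (Python 3.10 and later also ship a built-in itertools.pairwise). The second iterator stays exactly one step ahead of the first, so tee only ever needs to remember a single pending value.

```python
import itertools

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    first, second = itertools.tee(iterable)
    # Advance the second iterator once so it stays one step ahead.
    next(second, None)
    return zip(first, second)

# The source is infinite, so we slice off only the first three pairs.
pairs = list(itertools.islice(pairwise(itertools.count()), 3))
print(pairs)  # [(0, 1), (1, 2), (2, 3)]
```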

  • We've covered a few of the many itertools functions here. There are many more, so memorizing them all up front isn't a reasonable goal. Instead, we recommend checking the docs whenever you find yourself writing a function that transforms an iterator into another iterator. It's possible that the function is already written for you!