
Caching A Generator

A recent similar question (isinstance(foo, types.GeneratorType) or inspect.isgenerator(foo)?) got me curious about how to implement this generically. It seems like a generally useful thing to have: a wrapper that caches a generator's output the first time through, so that the input can safely be iterated more than once.
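For context, here is a minimal illustration (my own sketch, not part of the original question) of the underlying problem: a generator is exhausted after a single pass, while a list can be iterated repeatedly.

gen = (x * x for x in range(3))
print(list(gen))   # [0, 1, 4]
print(list(gen))   # [] - the generator is already spent

data = [0, 1, 4]
print(list(data))  # [0, 1, 4]
print(list(data))  # [0, 1, 4] - a list can be traversed again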

Solution 1:

Based on this comment:

My intention here is that this would only be used if the user knows they want to iterate multiple times over the 'iterable', but doesn't know whether the input is a generator or an iterable. This lets you ignore that distinction while not losing (much) performance.

This simple solution does exactly that:

def ensure_list(it):
    if isinstance(it, (list, tuple, dict)):
        return it
    else:
        return list(it)

Now ensure_list(a_list) is practically a no-op - two function calls - while ensure_list(a_generator) will turn it into a list and return it, which turned out to be faster than any other approach.
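A quick usage sketch (my own illustration, assuming the ensure_list above):

squares_list = [x * x for x in range(5)]
squares_gen = (x * x for x in range(5))

a = ensure_list(squares_list)
assert a is squares_list        # the list comes back unchanged, no copy

b = ensure_list(squares_gen)    # the generator is drained into a new list
first_pass = sum(b)
second_pass = sum(b)            # safe to iterate b as many times as needed
assert first_pass == second_pass == 30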

Solution 2:

What you want is not an iterator, but an iterable. An iterator can only iterate once through its contents. You want something which takes an iterator and over which you can then iterate multiple times, producing the same values from the iterator, even if the iterator doesn't remember them, like a generator. Then it's just a matter of special-casing those inputs which don't need caching. Here's a non-thread-safe example (EDIT: updated for efficiency):

import itertools

class AsYouGoCachingIterable(object):
    def __init__(self, iterable):
        self.iterable = iterable
        self.iter = iter(iterable)
        self.done = False
        self.vals = []

    def __iter__(self):
        if self.done:
            return iter(self.vals)
        # chain vals so far & then gen the rest
        return itertools.chain(self.vals, self._gen_iter())

    def _gen_iter(self):
        # gen new vals, appending as it goes
        for new_val in self.iter:
            self.vals.append(new_val)
            yield new_val
        self.done = True
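A quick check of the behavior (my own example, in Python 3 syntax): values are cached as they are produced, so a second pass replays them without re-running the generator.

def noisy():
    for i in range(3):
        print("producing", i)
        yield i

cached = AsYouGoCachingIterable(noisy())
print(list(cached))  # prints "producing 0".."producing 2", then [0, 1, 2]
print(list(cached))  # replays from the cache: just [0, 1, 2]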

And some timings:

class ListCachingIterable(object):
    def __init__(self, obj):
        self.vals = list(obj)

    def __iter__(self):
        return iter(self.vals)

def cube_generator(max=1000):
    i = 0
    while i < max:
        yield i*i*i
        i += 1

def runit(iterable_factory):
    for i in xrange(5):
        for what in iterable_factory():
            pass

def puregen():
    runit(lambda: cube_generator())

def listtheniter():
    res = list(cube_generator())
    runit(lambda: res)

def listcachingiterable():
    res = ListCachingIterable(cube_generator())
    runit(lambda: res)

def asyougocachingiterable():
    res = AsYouGoCachingIterable(cube_generator())
    runit(lambda: res)

Results are:

In [59]: %timeit puregen()
1000 loops, best of 3: 774 us per loop

In [60]: %timeit listtheniter()
1000 loops, best of 3: 345 us per loop

In [61]: %timeit listcachingiterable()
1000 loops, best of 3: 348 us per loop

In [62]: %timeit asyougocachingiterable()
1000 loops, best of 3: 630 us per loop

So the simplest approach in terms of a class, ListCachingIterable, works just about as well as doing the list manually. The "as-you-go" variant is almost twice as slow, but has advantages if you don't consume the entire list, e.g. say you're only looking for the first cube over 100:

def first_cube_past_100(cubes):
    for cube in cubes:
        if cube > 100:
            return cube
    raise ValueError("No cube > 100 in this iterable")

Then:

In [76]: %timeit first_cube_past_100(cube_generator())
100000 loops, best of 3: 2.92 us per loop

In [77]: %timeit first_cube_past_100(ListCachingIterable(cube_generator()))
1000 loops, best of 3: 255 us per loop

In [78]: %timeit first_cube_past_100(AsYouGoCachingIterable(cube_generator()))
100000 loops, best of 3: 10.2 us per loop

Solution 3:

I just made a library that solves exactly this problem -- it supports caching for functions that return iterators:

from typing import *
from cacheable_iter import iter_cache

@iter_cache
def iterator_function(n: int) -> Iterator[int]:
    yield from range(n)

An example of usage:

from typing import *
from cacheable_iter import iter_cache

@iter_cache
def my_iter(n: int) -> Iterator[int]:
    print(" * my_iter called")
    for i in range(n):
        print(f" * my_iter step {i}")
        yield i

gen1 = my_iter(4)
print("Creating an iterator...")
print(f"The first value of gen1 is {next(gen1)}")
print(f"The second value of gen1 is {next(gen1)}")

gen2 = my_iter(4)
print("Creating an iterator...")
print(f"The first value of gen2 is {next(gen2)}")
print(f"The second value of gen2 is {next(gen2)}")
print(f"The third value of gen2 is {next(gen2)}")

Which would print:

Creating an iterator...
 * my_iter called
 * my_iter step 0
The first value of gen1 is 0
 * my_iter step 1
The second value of gen1 is 1
Creating an iterator...
The first value of gen2 is 0
The second value of gen2 is 1
 * my_iter step 2
The third value of gen2 is 2

It also supports caching awaitable iterators and asynchronous iterators.
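For the curious, here is a minimal sketch of how such a decorator could work. This is my own illustration of the general technique (lazy, argument-keyed caching of generator output), not cacheable_iter's actual implementation; it only handles positional arguments and single-threaded use.

import functools

def simple_iter_cache(func):
    # Memoize a generator function by its positional arguments. Each call
    # returns a fresh iterator that replays already-cached values, then
    # pulls new ones on demand from the single shared source generator.
    # Hypothetical sketch: not thread-safe, no kwargs support.
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = {"source": func(*args), "vals": [], "done": False}
        state = cache[args]

        def replay():
            i = 0
            while True:
                if i < len(state["vals"]):
                    yield state["vals"][i]       # replay a cached value
                elif state["done"]:
                    return                       # source is exhausted
                else:
                    try:
                        state["vals"].append(next(state["source"]))
                    except StopIteration:
                        state["done"] = True
                        return
                    yield state["vals"][i]
                i += 1

        return replay()

    return wrapper

Decorating my_iter with simple_iter_cache instead of iter_cache reproduces the trace above: gen2 replays the two cached steps silently and only then advances the underlying generator.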
