
Generator Functions



What are generator functions?

Generator functions are a special kind of function that lets you suspend execution, go do something else, and then resume from where you left off. They look and act much like regular functions, but with one defining characteristic: they use the yield keyword instead of return. Calling a generator function returns a lazy iterator, also called a generator object, which simplifies the creation of iterators: objects that you can loop over like a list.
Unlike lists, which store all of their contents in memory, the lazy iterators returned by generator functions produce values one at a time, on demand, which lets you work with large datasets much more efficiently.
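To make the memory difference concrete, here is a minimal sketch comparing a list comprehension with the equivalent generator expression. The exact sizes will vary by platform, but the gap is dramatic:

```python
import sys

# A list comprehension builds all one million values in memory at once.
squares_list = [n * n for n in range(1_000_000)]

# The equivalent generator expression produces values lazily, on demand.
squares_gen = (n * n for n in range(1_000_000))

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # on the order of a couple hundred bytes
```

The generator expression only stores the state needed to compute the next value, no matter how many values it will eventually produce.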

When do we use generator functions?

As you may have guessed by now, a common use case for generator functions is working with large files or datasets. For instance, if you try to load a CSV file larger than your available memory using a regular function, you will get a MemoryError.
Similarly, another use case is building a data pipeline, for instance loading a large dataset to train a machine learning model. What you usually want in this case is a dataset class that returns a lazy iterator, so you can yield individual items or batches of data one at a time without overwhelming your memory.
These are use cases that take advantage of the memory efficiency of generator functions. There are a few other situations where generator functions come in handy, such as generating an endless stream of unique identifiers. Personally, I have only used generator functions while developing training schemes for deep learning models in Python.
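To sketch the batching idea from the data-pipeline use case, here is a minimal example (the name batch_generator is mine, not a standard API) that yields fixed-size batches from any iterable:

```python
def batch_generator(items, batch_size):
    """Yield successive fixed-size batches from an iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # yield the final, possibly smaller batch
        yield batch

for batch in batch_generator(range(7), 3):
    print(batch)
# [0, 1, 2]
# [3, 4, 5]
# [6]
```

Because each batch is yielded as soon as it is full, only one batch ever lives in memory at a time, regardless of how large the underlying dataset is.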

How to use generator functions?

While regular functions use the return keyword, generator functions use the yield keyword.
Let's assume that you want to read a text file and return the data using the get_data() function below:

def get_data(filename):
    with open(filename) as file:
        data = file.read().split('\n')
    return data

Assuming that the provided text file contains lines separated by a newline character (\n), the function above returns a list containing each line as a separate item, all of it loaded into memory at once. Note that the file object returned by open() is itself a lazy iterator, but calling .read() pulls the entire file into memory before .split() breaks it into a list. Therefore, using this function on a sufficiently large file will raise a MemoryError, or at the very least start to slow your computer down. So how do we handle large data files?

Let's modify our function above so that we can lazily iterate through the data line by line. To do that we can rewrite the function in the following way:

def get_data(filename):
    for line in open(filename, 'r'):
        yield line

With this version of the function we can open a large file and loop through it line by line, yielding each line instead of returning it. Calling the function now produces a generator object; if we had used return inside the loop instead, the function would have returned only the first line of the file.
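To see the laziness in action, the sketch below writes a small sample file (so the example is self-contained) and then pulls lines from the generator one at a time:

```python
import os
import tempfile

# Create a small sample file to stand in for a large one.
with tempfile.NamedTemporaryFile('w', delete=False, suffix='.txt') as f:
    f.write('first\nsecond\nthird\n')
    path = f.name

def get_data(filename):
    for line in open(filename, 'r'):
        yield line

lines = get_data(path)   # nothing has been read from the file yet
print(next(lines))       # 'first\n' — only now is the first line read
print(next(lines))       # 'second\n'

os.remove(path)
```

No matter how large the file is, only one line is held in memory at a time.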

The yield keyword controls the flow of a generator function and makes some magic happen! We can assign the generator object to a variable, which gives us access to the built-in next() function. Each call to next() runs the code inside the generator function up to the next yield statement. When yield executes, the function is suspended and the yielded value is returned to the caller; crucially, the function's state is saved while it is suspended, which is what lets execution resume where it left off. You can see this in the simple example below:

>>> def generator_example():
...   yield 1
...   yield 2
...   yield 3

>>> generator_obj = generator_example()
>>> print(next(generator_obj))
1
>>> print(next(generator_obj))
2
>>> print(next(generator_obj))
3
>>> print(next(generator_obj))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Notice the last next() call: execution terminates with a traceback because the generator object has been exhausted; once all the values have been yielded, iteration stops. The StopIteration exception raised on that final call is how Python signals the end of an iterator.
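In everyday code you rarely call next() or catch StopIteration yourself: a for loop (or a constructor like list()) calls next() behind the scenes and handles StopIteration for you, so iteration simply ends cleanly:

```python
def generator_example():
    yield 1
    yield 2
    yield 3

# The for loop calls next() internally and stops when
# StopIteration is raised, without propagating the exception.
for value in generator_example():
    print(value)

# Draining a generator into a list works the same way.
print(list(generator_example()))  # [1, 2, 3]
```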


