Introducing the groupby function of the itertools module

The itertools module in the Python Standard Library is used for creating iterators for efficient looping. In this post, we are going to explore the groupby function in the itertools module.

In simple words, the groupby function takes an iterable and an optional key function as parameters and groups items in the iterable based on the value the key function returns for each item.

itertools.groupby(iterable[, key])

If key is not specified or is None, it defaults to an identity function that returns each element unchanged. Now let us see groupby in action.

Consider a list of animals like so:

animals = ['Antelope', 'Alligator', 'Anaconda', 'Albatross', 'Anteater', 'Aardvark']

If we use groupby on this list, what will happen?

>>> from itertools import groupby
>>> groupby(animals)
<itertools.groupby object at 0x7f216a5a9468>

Alrighty, groupby returns an iterator. Let's see what happens if we iterate over it:

>>> for group in groupby(animals):
        print(group)
('Antelope', <itertools._grouper object at 0x7f216bce2400>)
('Alligator', <itertools._grouper object at 0x7f216bce2e48>)
('Anaconda', <itertools._grouper object at 0x7f216bce2400>)
('Albatross', <itertools._grouper object at 0x7f216bce2e48>)
('Anteater', <itertools._grouper object at 0x7f216bce2400>)
('Aardvark', <itertools._grouper object at 0x7f216bce2e48>)

Okay, so groupby essentially returns an iterator over tuples. The first element of each tuple is the key value identifying that group. In our case no key function was given and all animals are different, so each animal is its own key and forms its own group.

What about the second element of the tuple? Yes, it is another iterator. Let's loop over it using the list constructor and see what it holds.

>>> first_group = {
        key: list(value)
        for key, value in groupby(animals)
    }
>>> first_group
{'Antelope': ['Antelope'],
 'Alligator': ['Alligator'],
 'Anaconda': ['Anaconda'],
 'Albatross': ['Albatross'],
 'Anteater': ['Anteater'],
 'Aardvark': ['Aardvark']}

Since each animal in the list is unique, each group is populated with only one animal.
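To see the default (identity) key actually group something, here is a minimal sketch using a made-up list with adjacent repeats. The list `letters` is an assumption for illustration only and does not appear elsewhere in this post.

```python
from itertools import groupby

# Hypothetical input (not from the post): note the adjacent repeated values.
letters = ['a', 'a', 'b', 'b', 'b', 'a']

# With no key function, each element acts as its own key,
# so every run of equal neighbouring elements becomes one group.
runs = [(key, list(group)) for key, group in groupby(letters)]
print(runs)
# [('a', ['a', 'a']), ('b', ['b', 'b', 'b']), ('a', ['a'])]
```

The trailing 'a' lands in a group of its own because it is not adjacent to the first two.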

The above use-case is not very helpful. Usually, we would like to group elements based on some condition. For instance, group elements by their length. In our case, the strings 'Antelope', 'Anaconda', 'Anteater', and 'Aardvark' have length 8 while 'Alligator' and 'Albatross' have length 9. So if we group them based on length, we should get 2 groups:

{ 8: ['Antelope', 'Anaconda', 'Anteater', 'Aardvark'], 9: ['Alligator', 'Albatross'] }

Let's see if groupby works as expected. We specify the grouping criterion using the key parameter.

>>> second_group = {
        key: list(value)
        for key, value in groupby(animals, key=len)
    }
>>> second_group
{8: ['Anteater', 'Aardvark'], 9: ['Albatross']}

Uh-oh! What happened there? This is where groupby can be slightly confusing. Let's look at our original list again:

animals = ['Antelope', 'Alligator', 'Anaconda', 'Albatross', 'Anteater', 'Aardvark']

Let's trace how groupby actually works:

  • It first finds 'Antelope' with length 8 and creates a first group 8: ['Antelope'].
  • Then it finds 'Alligator' with length 9 and creates a second group 9: ['Alligator'].
  • Then it finds 'Anaconda' with length 8 and creates a third group 8: ['Anaconda']. It does not append 'Anaconda' to the first group.
  • Then it finds 'Albatross' with length 9 and creates a fourth group 9: ['Albatross']. It does not append 'Albatross' to the second group.
  • Then it finds 'Anteater' with length 8 and creates a fifth group 8: ['Anteater']. It does not append 'Anteater' to the third group.
  • Then it finds 'Aardvark' with length 8 and appends it to the fifth group, because its key matches that of the previous item.

So ultimately, the groups are as given below:

{
    8: ['Antelope'],
    9: ['Alligator'],
    8: ['Anaconda'],
    9: ['Albatross'],
    8: ['Anteater', 'Aardvark']
}

Since a dictionary cannot have duplicate keys, each later group overwrites the earlier group with the same key, and we finally get the result shown above (second_group).
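One way to observe all five groups without the dictionary collapsing them is to collect the results into a list of (key, members) pairs instead. This is only a small sketch; the name `pairs` is chosen here for illustration.

```python
from itertools import groupby

animals = ['Antelope', 'Alligator', 'Anaconda', 'Albatross', 'Anteater', 'Aardvark']

# A list keeps every (key, members) pair, even when the same key shows up again,
# so nothing is overwritten the way it is in a dict comprehension.
pairs = [(key, list(group)) for key, group in groupby(animals, key=len)]

for key, members in pairs:
    print(key, members)
# 8 ['Antelope']
# 9 ['Alligator']
# 8 ['Anaconda']
# 9 ['Albatross']
# 8 ['Anteater', 'Aardvark']
```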

So, on looping through an iterable, groupby appends the consecutive items into a group as long as the key function returns the same value for the items. As soon as the key function returns a new value, it creates a new group.
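The rule above can be sketched in plain Python. The function below, `consecutive_groups`, is a hypothetical helper written for this post, not part of itertools. It is a simplified stand-in that builds all groups eagerly as lists, whereas the real groupby produces them lazily.

```python
def consecutive_groups(iterable, key=None):
    """Simplified, eager imitation of itertools.groupby (illustrative only)."""
    if key is None:
        # Default to the identity function, as groupby does.
        key = lambda item: item

    groups = []  # list of (key_value, [items]) pairs, in order of appearance
    for item in iterable:
        k = key(item)
        if groups and groups[-1][0] == k:
            # Same key as the previous item: extend the current group.
            groups[-1][1].append(item)
        else:
            # Key changed (or first item): start a new group.
            groups.append((k, [item]))
    return groups


animals = ['Antelope', 'Alligator', 'Anaconda', 'Albatross', 'Anteater', 'Aardvark']
print(consecutive_groups(animals, key=len))
```

Only the most recent group is ever compared against, which is why earlier groups with the same key are never revisited.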

So, to collect all items with the same key into a single group, the iterable generally needs to be sorted on the same key function first.

>>> sorted_animals = sorted(animals, key=len)
>>> sorted_animals
['Antelope', 'Anaconda', 'Anteater', 'Aardvark', 'Alligator', 'Albatross']

Let's pass this sorted list into the groupby function:

>>> third_group = {
        key: list(value)
        for key, value in groupby(sorted_animals, key=len)
    }
>>> third_group
{8: ['Antelope', 'Anaconda', 'Anteater', 'Aardvark'],
 9: ['Alligator', 'Albatross']}

Now we are getting the groups based on the length as expected.
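If this sort-then-group pattern comes up often, the two steps could be wrapped in a small helper. The function name `group_sorted` is hypothetical and used here only as a sketch. It assumes the key values can be compared with each other, since they are used for sorting.

```python
from itertools import groupby


def group_sorted(items, key):
    """Sort items by key, then group them by the same key (illustrative helper)."""
    # sorted() is stable, so items sharing a key keep their original relative order.
    ordered = sorted(items, key=key)
    return {k: list(group) for k, group in groupby(ordered, key=key)}


animals = ['Antelope', 'Alligator', 'Anaconda', 'Albatross', 'Anteater', 'Aardvark']
by_length = group_sorted(animals, key=len)
print(by_length)
```

Passing the same key function to both sorted and groupby is the important part: sorting by one criterion and grouping by another would bring back the fragmented groups seen earlier.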

Conclusion

In this post, I have touched upon the following:

  • How the groupby function in the itertools module can be used to group an iterable based on a key.
  • What the groupby function returns.
  • How the groupby function works.
  • Why, in general, the iterable needs to be sorted on the same key function before it is passed to groupby.