
Recipe 19.21. Computing a Summary Report with itertools.groupby

Credit: Paul Moore, Raymond Hettinger

Problem

You have a list of data grouped by a key value, typically read from a spreadsheet or the like, and want to generate a summary of that information for reporting purposes.

Solution

The itertools.groupby function introduced in Python 2.4 helps with this task:

from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), field=itemgetter(1)):
    """ Summarise the given data (a sequence of rows), grouped by the
        given key (default: the first item of each row), giving totals
        of the given field (default: the second item of each row).
        The key and field arguments should be functions which, given a
        data record, return the relevant value.
    """
    for k, group in groupby(data, key):
        yield k, sum(field(row) for row in group)

if __name__ == "__main__":
    # Example: given a sequence of sales data for city within region,
    # _sorted on region_, produce a sales report by region
    sales = [('Scotland', 'Edinburgh', 20000),
             ('Scotland', 'Glasgow', 12500),
             ('Wales', 'Cardiff', 29700),
             ('Wales', 'Bangor', 12800),
             ('England', 'London', 90000),
             ('England', 'Manchester', 45600),
             ('England', 'Liverpool', 29700)]
    for region, total in summary(sales, field=itemgetter(2)):
        print "%10s: %d" % (region, total)

Discussion

In many situations, data is available in tabular form, with the information naturally grouped by a subset of the data values (e.g., recordsets obtained from database queries and data read from spreadsheets, typically with the csv module of the Python Standard Library). It is often useful to be able to produce summaries of the detail data.

The new groupby function (added in Python 2.4 to the itertools module of the Python Standard Library) is designed exactly for the purpose of handling such grouped data. It takes as arguments an iterable, whose items are to be thought of as records, along with a function to extract the key value from each record. itertools.groupby walks through the records in order and, for each run of consecutive records that share a key, yields that key along with a new iterator that runs through the records in the run.
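The behavior described above can be seen in a small sketch. The records and the region_of helper are made up for illustration, and print is called with a single parenthesized argument so that the snippet also runs on current Python 3:

```python
from itertools import groupby

# Hypothetical records: (region, city) pairs, already sorted by region.
records = [('Scotland', 'Edinburgh'),
           ('Scotland', 'Glasgow'),
           ('Wales', 'Cardiff'),
           ('England', 'London'),
           ('England', 'Manchester')]

def region_of(record):
    # Key-extraction function: the region is the first item of each record.
    return record[0]

# groupby yields (key, group) pairs. Each group is an iterator that shares
# the underlying data with groupby, so copy it into a list to keep it.
grouped = [(key, list(group)) for key, group in groupby(records, region_of)]

for key, rows in grouped:
    print("%s: %d record(s)" % (key, len(rows)))
```

Copying each group with list matters only when the records are needed after the loop moves on; the recipe's summary function consumes each group immediately, so it needs no copy.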

The groupby function is often used to generate summary totals for a dataset. The summary function defined in this recipe shows one simple way of doing this. For a summary report, two extraction functions are required: one function to extract the key, which is the function that you pass to the groupby function, and another function to extract the values to be summarized. The recipe uses another innovation of Python 2.4 for these purposes: the operator.itemgetter higher-order function. Called with an index i as its argument, itemgetter produces a function f such that f(x) extracts the i-th item from x, operating just like an indexing x[i].
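As a quick check of what itemgetter returns, here is a minimal sketch; the row and the names get_region and get_sales are illustrative only:

```python
from operator import itemgetter

# Hypothetical record in the same shape as the recipe's sales rows.
row = ('Wales', 'Cardiff', 29700)

get_region = itemgetter(0)   # behaves like: lambda x: x[0]
get_sales = itemgetter(2)    # behaves like: lambda x: x[2]

# Calling the produced function is equivalent to indexing the record.
same_as_indexing = (get_region(row) == row[0] and get_sales(row) == row[2])

print("%s sold %d" % (get_region(row), get_sales(row)))
```

Any callable that takes a record and returns a value can serve in place of itemgetter, which is useful when the field needs converting or computing rather than plain indexing.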

The input records must be sorted by the given key; if you're uncertain about that condition, you can use groupby(sorted(data, key=key), key) to ensure it, exploiting the built-in function sorted, also new in Python 2.4. It's quite convenient that the same key-extraction function can be passed to both sorted and groupby in this idiom. The groupby function itself does not sort its input, which gives it extra flexibility that may come in handy, although most of the time you will want to use groupby only on sorted data. See Recipe 19.10 for a case in which it's quite handy to use groupby on nonsorted data.
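A short sketch of why the sort matters, using made-up data in which the rows for one region are not adjacent:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical data: the two 'Wales' rows are separated by an 'England' row.
unsorted_sales = [('Wales', 'Cardiff', 29700),
                  ('England', 'London', 90000),
                  ('Wales', 'Bangor', 12800)]

key = itemgetter(0)

# Without sorting, groupby starts a new group whenever the key changes,
# so 'Wales' shows up as two separate groups.
raw_keys = [k for k, group in groupby(unsorted_sales, key)]

# Sorting first, with the same key function, makes each key appear once.
sorted_keys = [k for k, group in groupby(sorted(unsorted_sales, key=key), key)]

print(raw_keys)      # ['Wales', 'England', 'Wales']
print(sorted_keys)   # ['England', 'Wales']
```

Fed the unsorted list, a summary report would print two partial totals for Wales instead of one combined total.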

For example, if the sales data was in a CSV file sales.csv, the usage example in the recipe's if __name__ == '__main__' section might become:

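    import csv
    # csv.reader yields rows of strings: sort on the region column (the
    # grouping key) and convert the sales column to int before summing
    sales = sorted(csv.reader(open('sales.csv', 'rb')),
                   key=itemgetter(0))
    for region, total in summary(sales, field=lambda row: int(row[2])):
        print "%10s: %d" % (region, total)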
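The CSV variant can be tried without a file on disk. The following is only a sketch: the csv_text contents and the sales_of helper are invented for illustration, and summary is repeated from the Solution so that the snippet stands alone:

```python
import csv
from itertools import groupby
from operator import itemgetter

def summary(data, key=itemgetter(0), field=itemgetter(1)):
    # Same logic as the recipe's summary function.
    for k, group in groupby(data, key):
        yield k, sum(field(row) for row in group)

# Hypothetical CSV content standing in for sales.csv; rows are not sorted.
csv_text = """Scotland,Edinburgh,20000
Wales,Cardiff,29700
Scotland,Glasgow,12500
Wales,Bangor,12800
"""

# csv.reader accepts any iterable of lines, so no file is needed here.
rows = sorted(csv.reader(csv_text.splitlines()), key=itemgetter(0))

def sales_of(row):
    # csv.reader yields every field as a string, so convert before summing.
    return int(row[2])

report = list(summary(rows, field=sales_of))
for region, total in report:
    print("%10s: %d" % (region, total))
```

Passing a converting function as field keeps summary itself unchanged; the same approach would work for float amounts or for fields that need cleaning first.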

Overall, this recipe provides a vivid illustration of how the new Python 2.4 features work well together: in addition to the groupby function, the operator.itemgetter used to provide field extraction functions, and the potential use of the built-in function sorted, the recipe also uses a generator expression as the argument to the sum built-in function. If you need to implement this recipe's functionality in Python 2.3, you can start by implementing your own approximate version of groupby, for example as follows:

class groupby(dict):
    def __init__(self, seq, key):
        for value in seq:
            k = key(value)
            self.setdefault(k, []).append(value)
    __iter__ = dict.iteritems

This version doesn't include all the features of Python 2.4's groupby, but it's very simple and may be sufficient for your purposes. Similarly, you can write your own simplified versions of functions itemgetter and sorted, such as:

def itemgetter(i):
    def getter(x): return x[i]
    return getter

def sorted(seq, key):
    aux = [(key(x), i, x) for i, x in enumerate(seq)]
    aux.sort()
    return [x for k, i, x in aux]

As for the generator expression, you can simply use a list comprehension in its place: just call sum([field(row) for row in group]) where the recipe has the same call without the additional square brackets, [ ]. Each of these substitutions will cost a little performance, but, overall, you can build the same functionality in Python 2.3 as you can in version 2.4; the latter is just slicker, simpler, faster, neater!
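To see the substitutions working together, here is a sketch that combines them. The *_fallback names are hypothetical, chosen so the stand-ins do not shadow the real functions, and __iter__ is spelled with items() rather than dict.iteritems so that the code also runs on Python 3:

```python
class groupby_fallback(dict):
    # Dict-based stand-in: collects ALL records with equal keys, whether
    # or not they are adjacent, unlike itertools.groupby.
    def __init__(self, seq, key):
        for value in seq:
            self.setdefault(key(value), []).append(value)

    def __iter__(self):
        # Yield (key, list_of_records) pairs, as the recipe's version does.
        return iter(self.items())

def itemgetter_fallback(i):
    def getter(x):
        return x[i]
    return getter

def sorted_fallback(seq, key):
    # Decorate-sort-undecorate; the index i breaks ties between equal keys,
    # so the records themselves are never compared.
    aux = [(key(x), i, x) for i, x in enumerate(seq)]
    aux.sort()
    return [x for k, i, x in aux]

def summary(data, key=itemgetter_fallback(0), field=itemgetter_fallback(1)):
    for k, group in groupby_fallback(data, key):
        # List comprehension in place of the generator expression.
        yield k, sum([field(row) for row in group])

sales = [('Scotland', 'Edinburgh', 20000),
         ('Scotland', 'Glasgow', 12500),
         ('Wales', 'Cardiff', 29700),
         ('Wales', 'Bangor', 12800),
         ('England', 'London', 90000),
         ('England', 'Manchester', 45600),
         ('England', 'Liverpool', 29700)]

# Group order follows dict ordering, which older Pythons do not guarantee,
# so sort the finished report by region before printing.
totals = sorted_fallback(list(summary(sales, field=itemgetter_fallback(2))),
                         key=itemgetter_fallback(0))
for region, total in totals:
    print("%10s: %d" % (region, total))
```

One difference from the real groupby is visible here: because the stand-in gathers every record with a given key, the input does not need to be sorted, but the whole dataset is held in memory and the order of the groups is not under your control.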
