I pick up where I left last time, with the second problem of Day 0 in the 10 Days of Statistics.
Problem 2
Given an array, , of integers and an array, , representing the respective weights of ‘s elements, calculate and print the weighted mean of ‘s elements. Your answer should be rounded to a scale of decimal place (i.e., format).
Given all the work done for the first problem, in which I defined all the basics for the future solutions, this was much easier. Since all the important decisions (refactoring into different files, deciding how to scale numbers) had already been taken, there was little to do. By shifting all the main implementation to the class Stats, the main programs are very small and easy. Then, all that is left, is to add a new function to Stats
to compute a weighted average.
- Weighted Mean: This requires an extra series with the same number of elements as the main series. Each point in the second series represents a weight of a point in the main series. This average is obtaining by multiplying a point’s value by its weight and dividing by the total weight.
- In real world terms, imagine a main series corresponding again to a series of persons, each contributing with some amount to some worthy cause. The normal mean (aka arithmetic mean) would sum these values exactly once. The weighted mean, instead, considers each person is representing a number of other persons, and contributing that amount for each of them. It’s as if the real series were not the actual persons in the series, but instead all the persons they represent, each of them contributing that point’s value.
In a language like C, you’d create a for loop with an index variable that traverses simultaneously two vectors: the points and the weights. Something like this:
int weightedSum = 0; int totalWeight = 0; for (int i = 0; i < N; i++) { weightedSum += points[i] * weights[i]; totalWeight += weights[i]; } return weightedSum * 1.0 / totalWeight;
Such a construction is not considered Pythonic. Python favours a functional approach, without any side-effects. I won’t go into many details about the definition of side-effect, as I would have to do a fair bit of research. But in general, I can say a side-effect is some change to state outside the function. A pure function can be seen as a black box that receives an input and returns an output. Nothing else in the world is modified by the function, the only way it communicates with us is via its output. This is quite convenient in terms of code correction because it makes it easier to reason about the flow of the program and the values taken by its variables. This in turn makes it feasible to provide an automatic verification of correctness. In general, this is good practice: changing global variables is a well-known source of troubles because it makes reasoning about the value of a variable through time much more difficult, since there are many more ways in which that could have changed. And because the functions affecting that global variable will inevitably spread all over the place, we lose the advantages of locality, where we can easily see in close proximity all the uses of a certain piece of data.
These advantages comes with limitations. For example, writing to a file or printing to a screen is a side effect, since that cannot be represented by the output of the function, and because of this, there are problems that are not easily handled a pure functional approach.
The above C code has no side effects. All the variables it needs are local to that piece of code, which could conceivably be wrapped inside a function. This means it can be written in a form that can be expressed in a single expression, with some help of standard constructions to manipulate and summarize vectors ((the famous map/reduce paradigm so popular in Big Data). This post is about how to do this transformation.
There are some basic operations that are useful for this: Map, Filter, Zip and Reduce (these are generic names. Python has functions with similar names that implement these concepts). If you know anything about Relational Databases, and more to the point Relational Algebras, this will be familiar to you. They correspond to, in this order: Projection, Selection, a restricted kind of Join and Aggregation.
Relational Algebras (
A tuple is a group of a fixed number of values, typically of different types. There is no sense of order or sequence in the elements of a tuple, aside from their meaning being defined by their position. And in a collection as I’m considering here, all tuples have the same format. This is how the above operations affect such a collection. In what follows, uppercase letters represent sequences, and lower-case letters represent functions or values (normally scalars, but can be generalized to tuples).
- Z = Map(S,f): Transforms a sequence into another. Z is a new sequence with the same size of S. Each element of Z is the result of applying function f to the corresponding element of S. Example: a sequence
[("a", 2), ("b",3), ("c", 1)]
is transformed in another sequence[("2a", 4), ("3b", 9), ("1c", 1)]
by the function f(s, n) = (str(n) + s, n^2). - Z = Filter(S,f): Obtains a sub-sequence. Z is a sub-sequence of S: all elements of Z are also elements of S. Only those elements s of S for which f(s) == True are included in Z.
- Z = Zip(S,T): Combines several sequences into one with the same information. Z is a new sequence that contains a tuple with an element of each of S and T. It thus turns two sequences into one, conserving the same information. The result is never larger than the largest of S and T. There are other ways of combining two sequences (eg Cartesian Product) such that the result is larger than both the original sequences (eg:
itertools.product
). - v = Reduce(S, f, g): Summarizes a sequence into a single value (or tuple). v is the value resulting of applying a function f to each tuple, and aggregating all the resulting values with function g, pair-wise.
I had to find a way to remove the index variable in the above loop. It’s fairly easy to filter, map and reduce a single vector, but here I have two, and at the start that presented a difficulty. I was still not thinking in functional terms, nor imagining the vectors in front of me as sets in a relational database. But my needs were obvious from the start: a single list of tuples replacing the original two.
The solution was to zip the two lists into one single list, where each term is a tuple with the element from each list at the corresponding position. For example, if I have lists
1,2,3
a,b,c
,
the zipped result will be
(1,a), (2,b), (3,c)
.
Now, I have still only one list, which I can map and reduce, but it contains all the information of the original two. This allows me to iterate over all the tuples and use them to compute the sum and the weight. Something like this:
weightedSum = 0 totalWeight = 0 for tuple in zip(points, weights): t_point, t_weight = tuple weightedSum = weightedSum + t_point * t_weight totalWeight = totalWeight + t_weight print(weightedSum/totalWeight)
Python has another way to do this, by using list comprehensions. That is a really powerful construction that is a pleasure to use. We can create a new list one element at a time, specifying one or more lists as a base, and a rule to create a new element derived from that base. A list comprehension is another way to specify a Map combined with a Filter, and possibly using the result of a Zip. It has this format
[f(x1, ..., xn) for x1 in seq1 ... for xn in seqn if]
Alternatively, we could have written also
[f(x1, ..., xn) for x1,...,xn in zip(seq1,...seqn) if]
but note that these produce two very different results. The zipped list will be only as long as the smallest of the lists it receives, that is, it produces a tuple only while it can fill all its positions. As soon as a list is exhausted, no more tuples can be produced. Zip can receive more than 2 lists.
On the other hand, the construction for x1 in seq1 for x2 in seq2
produces a cartesian product: it pairs each element of a list to all the elements of the second list, and in the general case, to all the tuples resulting from the zip of the remaining lists.
The function f determines how to produce the elements in the list (Map), the for
clauses determine the source sequences (Cartesian Product) and the if
clause determines which elements to include (Filter).
This is how we can use them here: we zip the original lists to obtain a single list of tuples; for each of these, we multiply both its members and add that to a new list; finally, we compute the list’s sum and divide by the sum of all the weights.
weightedSum = sum([a*b for a,b in zip(points, weights)]) totalWeight = sum(weights) return weightedSum / totalWeight
This is my final solution for this problem, and the code can be found here , commit 04bf4ae
.