
HackerRank’s Statistics Path – Day 1

In this post, I return to my series explaining solutions to the “10 Days of Statistics” tutorial on HackerRank.

Day 0 was dedicated to basic definitions; I covered it here and here, and it also suggested this diversion on functional programming.

Day 1 had two sets of problems: the first about quartiles and the second about standard deviation.

A quartile is an extension of the concept of the median: if the median is the point at which the sample is divided in half, according to the value of the statistical variable (which has to be sortable), the quartiles are the points at which the sample is divided into quarters.

A more general concept that is also frequently used is that of the “percentile” (either accompanied by a number from 0 to 100, as in “percentile 75”, or expressed as “the nth percentile”). The nth percentile indicates the highest value of the variable found in the lowest n% of the population.

A common example that will be familiar to all parents of young children is weight and height tables. These summarize the heights observed in a sample at each age. For each age, a sample of heights (or weights) is collected and their frequencies (the number of people with that height) are recorded. The people in the sample can then be ordered by their height, as if they were lined up in a queue.

In the tables I link here, you can see percentiles 0.4, 2, 9, 25, 50, 75, 91, 98 and 99.6. To pick just one example, the 75th percentile is the maximum height in the lowest 75% of the population. Another famous example is Mensa, which accepts as members only people who score at or above the 98th percentile on certain tests, and so are within the top 2% of scorers in a given population.

Naturally, quartiles can be expressed as percentiles. In order, they are percentile 25, percentile 50 and percentile 75. To compute the first quartile, we take the quarter of the sample with the lowest heights; the height of the tallest person in this group is the first quartile. Then we take the second-lowest quarter of the sample, and the second quartile is the height of the tallest person in that group. We compute the third quartile in the same way. So quartiles are just points that indicate the maximum value in the corresponding fourth of the population. Conceptually, there could even be a fourth quartile, which would simply be the maximum value in the population.

But the easiest way to compute them is by using the median:

  • the second quartile is the median of the original sample
  • the first quartile is the median of the lower half of the original sample
  • the third quartile is the median of the upper half of the original sample

There is a slight adjustment to be made in each group depending on whether the number of points is odd or even, as was the case for the median as well.
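
As a rough sketch of what this median-based computation might look like (the helpers below are my own illustration, not necessarily the code in my Stats module):

def median(sorted_points):
    # median of an already sorted list
    n = len(sorted_points)
    mid = n // 2
    if n % 2 == 1:
        return sorted_points[mid]
    return (sorted_points[mid - 1] + sorted_points[mid]) / 2

def quartiles(sorted_points):
    # Q1, Q2 and Q3 of an already sorted list
    n = len(sorted_points)
    mid = n // 2
    lower = sorted_points[:mid]                                        # lower half, excluding the median when n is odd
    upper = sorted_points[mid + 1:] if n % 2 else sorted_points[mid:]  # upper half, same adjustment
    return median(lower), median(sorted_points), median(upper)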

The questions in this set were these:

Task 1
Given an array, X, of n integers, calculate the respective first quartile (Q1), second quartile (Q2), and third quartile (Q3). It is guaranteed that Q1, Q2, and Q3 are integers.

Task 2
The interquartile range of an array is the difference between its first (Q1) and third (Q3) quartiles (i.e., Q3 − Q1).

Given an array, X, of n integers and an array, F, representing the respective frequencies of X’s elements, construct a data set, S, where each x_i occurs at frequency f_i. Then calculate and print S’s interquartile range, rounded to a scale of 1 decimal place (i.e., 12.3 format).

In keeping with the structure I laid out for the code on Day 0, the main programs for these tasks are very simple:

from utils import Utils
from stats import Stats

input()                            # the first line (the number of points) is not needed here
tokens = input().split()
N = len(tokens)
points = list(map(int, tokens))    # convert the tokens to integers
points.sort()                      # the quartile functions expect a sorted list

Q1 = Stats.Q1(points)
Q2 = Stats.Q2(points)
Q3 = Stats.Q3(points)

print(Utils.scale(Q1, 0))
print(Utils.scale(Q2, 0))
print(Utils.scale(Q3, 0))
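
For instance, with nine made-up points (and assuming Utils.scale(x, 0) simply prints the value rounded to zero decimal places, which is my reading of it), a run would look like this:

Input:
9
3 7 8 5 12 14 21 13 18

Output:
6
12
16

The sorted sample is 3 5 7 8 12 13 14 18 21, so the median is 12, the lower half 3 5 7 8 gives Q1 = 6, and the upper half 13 14 18 21 gives Q3 = 16.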

and

from utils import Utils
from stats import Stats

input()                              # the first line (the number of distinct values) is not needed here
points = map(int, input().split())   # the distinct values
freqs = map(int, input().split())    # their respective frequencies
S = Stats.expandPopulation(points, freqs)

S.sort()

Q1 = Stats.Q1(S)
Q3 = Stats.Q3(S)

print(Utils.scale(Q3 - Q1, 1))

A couple of observations here:
I have changed the way in which I convert the input to integers. Previously, I used list comprehensions; here I use the map function. In Python 3, map returns an iterator, which has a very limited interface: it can only be traversed once, in one direction. The advantage is that the map can be lazily evaluated: each member is generated only as it is needed, in the order of enumeration. Contrast this with producing a whole list at creation time, which must allocate storage space for all elements. The map is more conservative, as it only needs to keep a reference to the original list plus a rule to create each successive element. In this case, I do want to use the whole result at once (to sort it), and so I create a list from the map with the list function.
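
To make the difference concrete, here is a tiny made-up example:

tokens = ["3", "1", "2"]

lazy = map(int, tokens)            # an iterator: nothing has been converted yet
eager = [int(t) for t in tokens]   # a list, built immediately: [3, 1, 2]

# The iterator is consumed as it is traversed, so to sort the values
# we first have to materialize it into a list:
points = list(lazy)
points.sort()                      # [1, 2, 3]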

It may be useful to know that the creator of Python, Guido van Rossum (also known as Benevolent Dictator for Life), prefers list comprehensions and similar constructions to map or filter, so this is not the best style and I may return to list comprehensions in future posts.

The other thing to notice is that, beyond the expected new functions to compute each quartile, there is also another new function for Task 2: expandPopulation. This is because the input is different in the two tasks. In the first case, we input the list explicitly, but in the second, we input frequency buckets: for each value, we indicate how many points have that specific value. I used the median to compute the quartiles, and to compute the median I need to access the middle position of the series. I could have done some clever maths to sum the frequencies and find which bucket holds the middle value, but the logic is much simpler (and makes better reuse of the existing code) if I just create a fully expanded series, where each point appears as many times as its frequency dictates. The obvious drawback, of course, is that this is wasteful, taking unnecessary storage.

However, this is one of agile’s recommendations: don’t over-engineer; solve the problem you have today. This solution is cleaner, and easier to maintain and reason about; plus, importantly, it passes HackerRank’s tests. Granted, HackerRank does not seem to have many tests compared to Codility or CodeFights (which always include speed tests), so it may not be testing hard performance cases, but in any case I prefer to save my time for more serious problems.

This function has a particularly functional feel: we have two lists that we want to pass through some kind of black box to produce a single list. So I looked for a solution in that style, instead of an imperative-style loop. This means making use of some functional-style construction, whether list comprehensions or some map/reduce. It turns out I used both. The first thing I had to realize was that there are two implicit loops here: we have to add all buckets to the result, after each bucket has been expanded.

Expanding a single bucket is easy: we take the element x and its frequency f and write [x] * f, which is a succinct and intuitive way of creating a list with the value x repeated f times.
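
For example, in the interpreter:

>>> x, f = 7, 3
>>> [x] * f
[7, 7, 7]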

This can be done with a single list comprehension, but we need to iterate over two lists. As described here, I had to bring both lists together first with a zip, so that they can be iterated in parallel:

[[x]*f for x,f in zip(elements, weights)]
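
With some made-up numbers, this step produces a list of lists:

>>> elements = [10, 20, 30]
>>> weights = [1, 3, 2]
>>> [[x] * f for x, f in zip(elements, weights)]
[[10], [20, 20, 20], [30, 30]]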

After this step, I have a list of lists, each representing the population in one bucket. Now I had to find a way to bring all the lists together, still avoiding the loop. The key realization here is that:

  • I have several sub-lists in a list
  • I want to end up with a single list
  • Sub-lists can be added one by one.

This looks like an ideal job for reduce, and that is indeed what I used. Once you see it is a good fit, the implementation is very easy to write:

  • Initial value: []
  • Iterable: the list comprehension above
  • Accumulating function: simple concatenation.

The final result is this:

# note: in Python 3, reduce has to be imported from functools
return reduce(
    lambda a, l: a + l,
    [[x] * f for x, f in zip(elements, weights)],
    []
)
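
Continuing the made-up numbers from above, the expanded data set comes out as:

>>> Stats.expandPopulation([10, 20, 30], [1, 3, 2])
[10, 20, 20, 20, 30, 30]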

The third problem of the day was less interesting: it asked to compute the standard deviation of a series. This is a common and important statistic, but it does not inspire any interesting tidbits or another post.

Task 3
Given an array, X, of n integers, calculate and print the standard deviation. Your answer should be in decimal form, rounded to a scale of 1 decimal place (i.e., 12.3 format). An error margin of ±0.1 will be tolerated for the standard deviation.

This requires computing the mean first, then the sum of the squared deviations of all points with respect to the mean, then dividing by the number of elements and finally taking the square root. It can be written like this:

# note: this relies on the math module being imported in stats.py
@staticmethod
def standardDeviation(points):
    m = Stats.mean(points)                   # mean of the sample
    N = len(points)
    s = sum([(x - m)**2 for x in points])    # sum of squared deviations
    return math.sqrt(s / N)
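
As a quick sanity check with made-up numbers: for the points 1 to 5, the mean is 3, the squared deviations sum to 4 + 1 + 0 + 1 + 4 = 10, and the standard deviation is sqrt(10 / 5) ≈ 1.4.

>>> Stats.standardDeviation([1, 2, 3, 4, 5])
1.4142135623730951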

Full code is available on my GitHub, commit 50caa57.
