
Statistics Hello World – Part 2

The first of the “10 Days of Statistics” (day 0) covers very basic definitions that are nonetheless easy to mix up. It has 2 problems to solve.

Problem 1

Given an array of integers, calculate and print the respective mean, median, and mode on separate lines. If your array contains more than one modal value, choose the numerically smallest one.

These problems are always accompanied by a tutorial that explains in basic terms what each problem is about. For this task, the tutorial gives mathematical definitions of the 3 quantities in the question, plus a short explanation of precision and scale. This is relevant because all of these problems require the solution to be presented at a given scale. There is also a video explaining mean, median and mode. I’m not a big fan of this: the definitions are well stated and should be enough.

For the benefit of those who don’t remember the definitions, here is a refresher. All of these quantities are measures of a group of “things with a value”. Let’s call these things points and the group a series (where each point is numbered from 1 to n). Each point has a value associated with it, for example, the height of a person. Each person, in this case, would be a point in the series. We can call this value the variable.

  • Mean (also called average, or expected value): This can only be computed when the variable can be added (numbers can, colours can’t). It is computed by summing the values of all points and dividing by the number of points. In real world terms, you can think of it like an extreme socialist experiment: if a group of people share a certain resource, and some have more than others, the mean is: “how much each person would have, if everyone had exactly the same as everyone else, without increasing or decreasing the total value in the group”.
  • Median: this only makes sense when the variable can be ordered in some way (numbers can be ordered; colours, usually, cannot). To find the median, first sort the series according to the value of each point. If there is an odd number of points, the median is the value of the middle one; if not, the median is the average of the values of the two points in the centre of the series.
  • Mode: this can always be computed. It is any value that appears most often in the series. If there are several such values, the series is multi-modal, and has several valid modes.
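As a quick worked example of the three definitions, here is a minimal sketch that uses Python's standard statistics module (it is not the code from this post) on a made-up series of six values. The sample values and variable names are assumptions chosen for illustration, and statistics.multimode needs Python 3.8 or later.

```python
import statistics

# Hypothetical sample series, chosen only to illustrate the definitions.
values = [3, 1, 2, 10, 3, 2]

mean = statistics.mean(values)        # (3 + 1 + 2 + 10 + 3 + 2) / 6
median = statistics.median(values)    # sorted: 1 2 2 3 3 10 -> average of 2 and 3
modes = statistics.multimode(values)  # every value that has the highest count
smallest_mode = min(modes)            # the problem asks for the smallest mode

print(mean)           # 3.5
print(median)         # 2.5
print(sorted(modes))  # [2, 3]
print(smallest_mode)  # 2
```

The series is multi-modal (2 and 3 both appear twice), so the tie-breaking rule from the problem statement picks 2.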

My first approach was very blunt, just trying to get my Python wings flying. My objective was merely to submit a valid answer that HackerRank (HR) would accept. Accordingly, it was all in one file, without much in the way of structure:

def scale(n, scale):
    """Returns n with a number of decimal places equal to scale, after rounding.
    """
    return int(n*10**scale + .5) / 10**scale


_ = input()
line = input()
values = line.split()
numbers = [int(i) for i in values]

# mean
average = sum(numbers) / len(numbers)

# median
numbers.sort()
N = len(numbers)
if (N%2 == 0):          
    leftOfCentre = numbers[int (N/2)-1]
    rightOfCentre = numbers[int (N/2)]
    median = (leftOfCentre + rightOfCentre) / 2
else:
    median = numbers[int (N/2)]
    
# mode
count = {}
max = None
for item in numbers:
    if item in count:
        count[item] = count[item] + 1
    else:
        count[item] = 1
    if max is None:
        max = (item, 1)
    else:
        if count[item] > max[1]:
            max = (item, count[item])
mode = max[0]

print (scale(average,1))
print (scale(median,1))
print (mode)

This works, but it is just a first iteration. There are several things I don’t like here. First of all, none of this code is reusable. One of the things I want to take from this tutorial is a small statistical library that I can use later. For that, I decided to create a new class in a separate file (Stats) and to place in it, as a function, everything that could even remotely be used in other challenges.
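To make the idea concrete, here is a rough sketch of what such a reusable class could look like. It is only an illustration under my own assumptions: the method names, the use of static methods and the use of collections.Counter are hypothetical, and this is not the actual Stats class from the repository.

```python
from collections import Counter


class Stats:
    """Hypothetical reusable helpers; not the actual Stats class from the repo."""

    @staticmethod
    def mean(numbers):
        return sum(numbers) / len(numbers)

    @staticmethod
    def median(numbers):
        ordered = sorted(numbers)   # sorted copy, so the caller's list is untouched
        middle = len(ordered) // 2  # integer division gives a valid list index
        if len(ordered) % 2 == 0:
            return (ordered[middle - 1] + ordered[middle]) / 2
        return ordered[middle]

    @staticmethod
    def mode(numbers):
        counts = Counter(numbers)
        highest = max(counts.values())
        # Several values may share the highest count; return the smallest one.
        return min(value for value, count in counts.items() if count == highest)


sample = [4, 1, 3, 1, 4, 2]  # made-up data
print(Stats.mean(sample))    # 2.5
print(Stats.median(sample))  # 2.5
print(Stats.mode(sample))    # 1
```

Keeping the functions free of input() and print() is what makes them reusable: the reading and formatting stay in the script for each challenge.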

Also, I expect some of these functions may be usable in other problems. I considered creating utility functions to encapsulate frequent actions, for example, reading a line or scaling, and for that I created another class in another file: Utils. In my mind, Utils.readline() is more readable than input(), but then I figured that input() is shorter and obvious enough to anyone familiar with Python that there wouldn’t be a big advantage in doing that.
(Cue ominous music): it is often said that premature optimization is the root of all evil in programming. I’m often guilty of that, so this is me taking care not to fall into that trap.

Then there is that scale function. It formats the number according to the demands of the problem. The reason I use it is that I just didn’t find the documentation for format early enough: when time is so short, I have to make choices, for example: do I write some code now, or do I go searching for documentation? So, instead, I wrote a simple scaling function. I won’t be using it in later solutions, unless I realize that, actually,

print (scale(average,1))
is more readable than
print("{:0.1f}".format(average)),
which it probably is. Hmmmm…
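Readability aside, the two approaches do not always produce the same digits. The sketch below copies the scale helper from above (with the second parameter renamed to places, my own choice, so it does not shadow the function name) and compares it with format on a few made-up inputs. The with_format wrapper is a hypothetical name used only for this comparison.

```python
def scale(n, places):
    """Copy of the helper above: adds .5 and truncates with int()."""
    return int(n * 10**places + .5) / 10**places


def with_format(n):
    """Hypothetical wrapper around the format call, for side-by-side comparison."""
    return "{:0.1f}".format(n)


# An exact tie: scale rounds half up, format rounds half to even.
print(scale(2.25, 1), with_format(2.25))    # 2.3 2.2

# A negative value: int() truncates toward zero, so scale rounds the wrong way.
print(scale(-2.26, 1), with_format(-2.26))  # -2.2 -2.3

# A whole number: both show one decimal place.
print(scale(7, 1), with_format(7))          # 7.0 7.0
```

For the non-negative inputs of this problem the negative case never shows up, but it is one more reason to be careful about reusing scale elsewhere.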

There is also that awkward int(N/2). This is something I did not expect. Isn’t N/2 an integer, given that N and 2 also are? Well, no. Not any more, at least.

In C and the languages derived from it (and in Python before version 3), the division of two integers yields an integer. I found out during these exercises that this is no longer the case in Python 3: PEP 238 deliberately changed / so that dividing two integers always returns a float. This is relevant for the several uses of N/2 as an array index. Obviously, an array index is an integer, and Python does not like receiving a float as an index, even if it is something like 2.0. Because it does not automatically convert to an integer, I did it myself. It turns out that Python compensated for changing the behaviour of operator / by giving us operator // (already in version 2.2, preparing an easy transition for the breaking change ahead), which does exactly what the old / did with integers: floor division, which for non-negative values such as N is the same as truncation. Therefore, in my final code I replaced all occurrences of int(N/2) with N//2.
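The behaviour described above is easy to check in a Python 3 interpreter. This is a small sketch with a made-up list; it also shows one detail worth knowing: for negative numbers // rounds down while int() truncates toward zero, a difference that does not matter for a non-negative N.

```python
numbers = [10, 20, 30, 40, 50]  # made-up data
N = len(numbers)

print(N / 2)   # 2.5  true division gives a float in Python 3
print(4 / 2)   # 2.0  still a float, even when the result is whole
print(N // 2)  # 2    floor division of two ints gives an int

# A float is rejected as a list index, even a whole one like 2.0.
try:
    numbers[4 / 2]
except TypeError as error:
    print("TypeError:", error)

print(numbers[N // 2])  # 30

# The two spellings only differ for negative values.
print(-7 // 2)      # -4  rounds toward negative infinity
print(int(-7 / 2))  # -3  truncates toward zero
```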

Even finding out that I had to do this was not as straightforward as I wished. I ran the basic test in HR and my code passed it. It was only when I tried to submit that I got back the reply:

This solution gives an error:
Runtime Error
Ask your friends for help:
Test Case #0
Test Case #1
Test Case #2

There was no information about the error. This is a feature of at least HackerRank and CodeFights (CF): some tests are hidden from you, but your solution is only accepted if they pass. If you want to know what is in those tests, you can pay with points that you earned by submitting correct solutions beforehand. It’s not bad: it makes you think before outright asking for a tip.
I have some grievances against CF tests (the last ones tend to be huge, random and generally unhelpful), but that is for another time. There is one thing they do better than HackerRank, though: they at least give you the error produced by the interpreter. HR doesn’t, and I can’t understand why. The error doesn’t reveal much about the test case, and in any case knowing why something failed just seems fair to me. In this case, I simply tried some different test cases locally and found what and where the error was, but that trial process would have been unnecessary with just that little bit of missing help.

The final code is available at https://github.com/alxmirandap/Coding/, in commit a1646f8.
