Following several days dedicated to probability distributions, the statistics tutorial changes tack and turns to correlation coefficients. A correlation coefficient measures how closely two random variables move together. The values of maximum correlation are 1 and -1, meaning that the variables are fully related: when one variable moves, the other moves in lockstep, in the same direction for 1 and in the opposite direction for -1. Moreover, if we draw a plot with each variable on a different axis, and mark a point at the coordinates given by each pair of values, the resulting chart will be a straight line. Pearson’s coefficient measures this kind of linear correlation.
At the other extreme, a value of 0 means that the variables are not related at all: a change in one variable is not consistently associated with a similar or opposite change in the other.
It is easy to find related variables in the world around us, especially if we look for variables where one causes the other. An example might be the variables “amount of precipitation” and “lushness of vegetation”, as the second is probably caused by the first. Another pair might be “average temperature” and “latitude”, which will be inversely related for latitudes in the northern hemisphere.
This is a good point to recall, though, that correlation only means that the variables are somehow associated. It does not mean there is any reason explaining that association, and in some cases it may be pure coincidence. Tyler Vigen, for example, has compiled a whole book’s worth of such spurious pairs on his site Spurious Correlations. The associations he found are so ridiculous that one can’t help but smile when imagining possible explanations for a causal relationship, even if so many of them relate to death statistics. There are some striking ones, especially around US scientific and educational output. What do you think of “US Spending on Science, Space and Technology” being almost 98% positively correlated with “Suicides by hanging, strangulation and suffocation”? Or the relation (around 95%) between “US Crude oil imports from Norway” and “Drivers killed in collision with railway train”?
We could go mad weaving conspiracy theories to explain all of these. The way out is simply to remember this mantra:
Correlation is not Causation
But let’s forget that for now, and instead focus on the HackerRank problems for this day.
Task
Given two n-element data sets, X and Y, calculate the value of the Pearson correlation coefficient.
This is a simple task: it is just implementing a formula. This is the correlation coefficient I remember learning in college, and the first one mentioned in Wikipedia, so I’m going to assume it is the standard measure of “correlation”. It is defined as the covariance of the two variables over the product of their standard deviations.
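In symbols, that definition reads:

$$ r_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} $$

where $\operatorname{cov}(X,Y)$ is the covariance of the two variables and $\sigma_X$, $\sigma_Y$ are their standard deviations.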
One novelty of this problem is that it requires reading 2 lines of input; most of the other problems simply had a fixed input that we could hard-code. And before those two lines, we have to read an integer specifying how many entries each line will have. This value is for the benefit of older languages, where the size of an array must be declared upon creation. Python lets us grow a list flexibly, so I am simply going to discard this first value with
_ = input()
This is a nice touch of Python: the underscore acts as a placeholder for an assignment target we don’t care about and want to discard. I could have just called input() without the assignment, but I believe this is clearer and more elegant, since it shows that the function does return some data.
Note: ordinarily, I would pass a string argument to input() to give a prompt to the user. However, that would make the HackerRank test fail, so I removed those messages from my code.
For reading each line, I defer to a new method in the Utils class:
```python
X = Utils.readFloatsLine()
Y = Utils.readFloatsLine()
```
```python
class Utils:
    @staticmethod
    def readFloatsLine():
        tokens = input().split(" ")
        floatTuples = [Utils.tryParseFloat(s) for s in tokens]
        return [value for (value, valid) in floatTuples if valid]

    [...]
```
The function input() returns a line from the standard input, that is, a string. I then use split to extract the individual terms of this string. These values are then passed through a two-step pipeline, implemented by two list comprehensions, to obtain a list of floating-point values.
Let’s go over these a bit more slowly. The first list comprehension above takes each string extracted from the input line and returns a float value. In most languages, we’d first check whether the string properly encodes a float, and only then convert it; C# even gives us the TryParse methods that do both in one go, at the expense of having to pass an out variable into the function. That is called the “Look-Before-You-Leap” (LBYL) approach. The general wisdom around it is that exceptions should be reserved for unpredictable circumstances, things the programmer cannot avoid, like a disk failure or memory exhaustion. Parsing and arithmetic failures are usually avoidable, since we can check beforehand for the conditions that lead to them. Exceptions in such languages also tend to be slow and heavy, and are therefore best used only when there is no alternative.
Python has a different approach, called “Easier-to-Ask-for-Forgiveness-than-Permission” (EAFP), which encourages just the opposite behaviour: go ahead with what you want to do, check whether it succeeds, and apply corrective measures if not. The advantages are that the code is easier to understand, since we avoid lots of preparatory checks that are not part of the main logic of the program; and we generally avoid doing the work twice: once to verify the arguments are correct, and once to operate on them. To parse a single string, Python style would have us enclose the conversion in a single try/except block. But here we have a whole array of values: if we do all the parsing inside one try/except, a single bad string will invalidate all the others. I wanted to be a bit more lenient, and reject only the values that are actually wrong, keeping all the rest.
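To make the contrast concrete, here is a minimal sketch of both styles applied to float parsing (my own illustration, not part of the solution; note how crude the LBYL validity check has to be, as it misses forms like "1e3" or "inf"):

```python
# LBYL: inspect the string before converting.
def parse_float_lbyl(s):
    if s.lstrip("+-").replace(".", "", 1).isdigit():  # crude validity check
        return float(s)
    return None

# EAFP: attempt the conversion and handle the failure.
def parse_float_eafp(s):
    try:
        return float(s)
    except ValueError:
        return None
```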
To implement this lenient approach, I created a method in the Utils class that returns two values: the result of the conversion, and a flag telling whether it was successful. The code is this:
```python
@staticmethod
def tryParseFloat(s):
    try:
        return (float(s), True)
    except:
        return (None, False)
```
This function does the conversion inside a try/except block: if all goes well, it returns the converted value and a flag indicating success. Otherwise, it catches the exception and returns a null value together with a flag indicating failure.
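In use, it behaves like this (illustrative session):

```python
>>> Utils.tryParseFloat("3.14")
(3.14, True)
>>> Utils.tryParseFloat("abc")
(None, False)
```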
This shows another cool feature of Python: functions can return tuples. For comparison, the TryParse methods of C# (e.g. Double.TryParse) also return two values, but they are clearly asymmetric: the main return value is a boolean indicating success, whereas the actual result we are interested in (the double value representing the input) is returned as a side effect on an out variable passed alongside the input string, differing from an ordinary argument only by a single keyword. I was never happy with that signature, and find Python’s a clear improvement.
The second step uses a list comprehension with the if keyword. This filters the list: only the tuples carrying a valid flag are considered for inclusion in the result. The comprehension extracts just the parsed value and ignores the flag, producing a final list of all and only the substrings that could be parsed as float numbers.
The rest of the main code is standard:
```python
pcc = Stats.PearsonCC(X, Y)
print(Utils.scale(pcc, 3))
```
We invoke a new function that computes the Pearson Correlation Coefficient, and present the result with the requested scale. The key calculation for this coefficient is the covariance. The tutorial presents two formulas:
- The first uses a single sum, but requires previously computing the means of both variables.
- The second expands the means, giving a formula with two nested sums and hence O(n^2) complexity.
The second form may be clearer in mathematical terms, but it is inferior for computation, being much slower for large n. Therefore, I implemented the first one.
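Written out, that first formula is

$$ \operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y) $$

where $\mu_X$ and $\mu_Y$ are the respective means; the code below is a direct transcription of it: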
```python
@staticmethod
def covariance(X, Y):
    n = len(X)
    mx = Stats.mean(X)
    my = Stats.mean(Y)
    return 1/n * sum([(x - mx) * (y - my) for x, y in zip(X, Y)])

@staticmethod
def PearsonCC(X, Y):
    return Stats.covariance(X, Y) / (Stats.standardDeviation(X) * Stats.standardDeviation(Y))
```
Task
Given two n-element data sets, X and Y, calculate the value of the Spearman’s rank correlation coefficient.
This is almost the same statement as in the previous task; the only difference is the correlation coefficient requested. The Pearson coefficient measures linear correlation between the variables, that is, how well their relationship is approximated by a line. Spearman’s rank coefficient, in turn, does not consider the actual values of the variables, but only their relative ordering. It achieves this by working with the rank of each point (its position in sorted order) instead of its value.
This smooths out the differences between adjacent points. For example, consider variables X, Y1 and Y2 that take these points:
X = 1, 2, 3, 4, 5, 6
Y1 = 4, 8, 12, 16, 20, 24
Y2 = 2, 4, 8, 16, 32, 64
We can see there is a linear relationship between X and Y1: indeed, Y1 = 4X, and so its Pearson coefficient is 1. On the other hand, the relationship between X and Y2 is Y2 = 2^X, which is not linear; its Pearson correlation is 0.906. The more values we include in the series, the lower the correlation becomes, as the exponential shape of the curve grows more apparent. For example, if I extend the X series to 10 (and Y2 correspondingly up to 1024), the correlation drops to 0.799.
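These numbers are easy to check with a quick self-contained script (my own sketch, independent of the article’s Stats class):

```python
import math

def pearson(xs, ys):
    # Pearson coefficient: covariance over the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

X = list(range(1, 7))
print(round(pearson(X, [4 * x for x in X]), 3))   # 1.0
print(round(pearson(X, [2 ** x for x in X]), 3))  # 0.906
```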
The difference between these two correlations ultimately comes from the slopes of the segments between consecutive points. In the linear relationship, the slope is always the same: for example, (20 – 16) / (5 – 4) = (12 – 8) / (3 – 2) = 4. It is this constancy that defines a straight line. But in the second case, the slope grows as X advances: using the same points, the slopes would be (32 – 16) / (5 – 4) = 16 and (8 – 4) / (3 – 2) = 4.
Spearman’s coefficient ignores these differences. All that matters is the order of the points, and whether the slope from one point to the next is negative or positive. That is, it checks whether the two variables are monotonically correlated: whether one rises (or falls) as the other rises (or falls).
To compute this coefficient for the above variables, we first replace the actual numbers by their ranks, and then compute the Pearson coefficient between the results. In this case, that is dead easy: all of the series are monotonically increasing, and so their ranks are simply 1 to 6 in increasing order.
Xr = 1, 2, 3, 4, 5, 6
Y1r = 1, 2, 3, 4, 5, 6
Y2r = 1, 2, 3, 4, 5, 6
They are maximally correlated.
A different thing happens if we instead consider a cubic function and extend the domain to the negative side. Take Y3 = (W – 3) * (W – 1) * (W + 2), where W plays the role of X:
W = -2, -1, 0, 1, 2, 3
Y3 = 0, 8, 6, 0, -4, 0
Y3r = 2, 4, 3, 2, 1, 2
The Pearson correlation is now -0.504, signalling a relationship that is far from linear.
The rank coefficient will also be very far from 1, as we can see Y3 has two turning points while X has none. In fact, it is around -0.517.
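Again this can be verified with a small self-contained sketch (mine, not the article’s code), ranking over the distinct values of the series so that equal values share a rank:

```python
import math

# pearson() repeated from the earlier sketch, for self-containment
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    return cov / (sx * sy)

def dense_rank(xs):
    # equal values share a rank; ranks run over the distinct values only
    order = sorted(set(xs))
    return [order.index(x) + 1 for x in xs]

W = [-2, -1, 0, 1, 2, 3]
Y3 = [0, 8, 6, 0, -4, 0]
print(dense_rank(Y3))                                    # [2, 4, 3, 2, 1, 2]
print(round(pearson(W, Y3), 3))                          # -0.504
print(round(pearson(dense_rank(W), dense_rank(Y3)), 4))  # -0.5175
```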
Let’s now take a look at the code.
The only difference in the computation of these two coefficients is that in the second case, we must do some precomputation over the data before computing the Pearson’s coefficient.
That can simply be written like this:
```python
@staticmethod
def SpearmanRCC(X, Y):
    rankX = Stats.Rank(X)
    rankY = Stats.Rank(Y)
    return Stats.PearsonCC(rankX, rankY)
```
The Rank function takes a list and replaces each member by its rank. As per the definition in the tutorial, only the distinct values in the whole collection count when computing ranks: all points with the same value get the same rank. In the above Y3 example, we have 6 points in the set, but only 4 different values, and so the ranks go from 1 to 4. We obtain the unique values of X by building the corresponding set and sorting it, with rankSet = sorted(set(X)). Since sorted returns a list, we can then use its index method to look up the rank of each value and build the new list.
```python
@staticmethod
def Rank(X):
    rankSet = sorted(set(X))
    return [rankSet.index(x) + 1 for x in X]
```
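One design note: index performs a linear scan for every element, so this is quadratic in the worst case. A possible variant (my own suggestion, not in the original code) precomputes a value-to-rank dictionary, making the ranking pass linear after the sort:

```python
@staticmethod
def Rank(X):
    # map each distinct value to its 1-based rank, computed once
    ranks = {v: i + 1 for i, v in enumerate(sorted(set(X)))}
    return [ranks[x] for x in X]
```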
With these two new methods in the Stats class, the main program becomes easy:
```python
_ = input()
X = Utils.readFloatsLine()
Y = Utils.readFloatsLine()
rcc = Stats.SpearmanRCC(X, Y)
print(Utils.scale(rcc, 3))
```
The whole code for these two tasks can be found at my GitHub, in commit 25bdeb9.