
HackerRank Days 4 and 5: Probability Distributions

In my last post about HackerRank, I left you with a cliffhanger: I covered the approach strategy, showing how to apply distribution functions to solve specific questions about the probability of a certain event. The resulting code was very short and high-level, usually with a single line of code to obtain the desired result.

That means I left out all the plumbing, the actual code doing the computation. For me, this is the boring part, a bit of drudgery: routine and uncreative work that has to be done by someone and that you’ll likely find in some good library. Still, for completeness, I’d like to present it here as well, so that you can run the full examples and see the results… and then go back to the problem-solving part, play with the numbers and see how the probabilities change.

Following my principle of keeping concerns separated, all the distribution functions are collected in the stats.py file. For the most part, these functions are very simple implementations of the formulas presented in the site’s training tutorials. I have not looked for optimized implementations: that would be of interest to scientific-library authors, but it is not my goal here.
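The snippets below are static methods of a class in that file; the Stats.geomDist call used later suggests the class is named Stats. A minimal skeleton of the file (my reconstruction, not the exact contents of the commit) would be:

import math
from functools import reduce  # reduce is no longer a builtin in Python 3


class Stats:
    """Plain, unoptimized implementations of the distribution formulas."""
    # The static methods shown in the rest of this post go here.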

The first function, the probability mass function of the geometric distribution, is straightforward:

@staticmethod
def geomDist(p, n):
    # Probability that the first success occurs on trial n,
    # with success probability p on each independent trial.
    q = 1 - p
    return q**(n - 1) * p
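As a quick sanity check, with numbers of my own rather than from a HackerRank problem: with p = 0.4, the chance that the first success arrives exactly on the third trial should be 0.6 × 0.6 × 0.4 = 0.144.

print(Stats.geomDist(0.4, 3))   # ≈ 0.144 = 0.6 * 0.6 * 0.4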

The cumulative function of a discrete probability is just the sum of the probability mass at each value of the variable less than or equal to the argument. Therefore, we need some kind of loop to compute it. In the functional approach, that corresponds to summing a series of terms computed for each point in a range. That sounds like two functions: one computing the value of each term and another combining all of them. This corresponds to the reduce framework, explained here, and leads to the code below.

Here, the combining function is a sum, while the “worker” function is the probability at every point from 1 (we must have at least 1 trial) to n. Notice that the sequence produced by the range function is half-open: the lower bound is included, but the upper bound is not, which is why the code calls range(1, n+1).

@staticmethod
def geomDistCumulative(p, n):
    # Sum the geometric probabilities for 1, 2, ..., n trials.
    # Note: reduce lives in functools in Python 3.
    return reduce(
        lambda a, x: a + Stats.geomDist(p, x),
        range(1, n + 1),
        0)
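For the geometric distribution the loop is not strictly necessary, since the cumulative probability has the well-known closed form 1 − (1 − p)^n. A quick check (my own, not part of the original file) confirms the two agree:

p, n = 0.4, 3
# 0.4 + 0.24 + 0.144 = 0.784, which equals 1 - 0.6**3
assert abs(Stats.geomDistCumulative(p, n) - (1 - (1 - p)**n)) < 1e-12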

The functions for the Poisson distribution are also straightforward. I ended up coding only those that were needed for the specific problems, because I realized, as I progressed through the series, that there was not much point in writing library functions for myself: there are excellent and faster options available out there, like SciPy, NumPy and pandas. Because of that, I did not implement the Poisson cumulative distribution, opting instead for the function that gives the expected value of the square of a Poisson variable, E[X²] = λ² + λ.

@staticmethod
def poissonDist(l, k):
    # Probability of exactly k events when the mean rate is l:
    # e^(-l) * l^k / k!
    return math.exp(-l) * l**k / math.factorial(k)

@staticmethod
def poissonSquareExpectedValue(l):
    # E[X^2] = Var(X) + E[X]^2 = l + l^2 for a Poisson variable
    return l**2 + l
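A small usage example, with values chosen by me for illustration: with a mean rate of λ = 2, the probability of seeing exactly 3 events and the expected squared value come out directly.

print(Stats.poissonDist(2, 3))              # e^-2 * 8 / 6 ≈ 0.1804
print(Stats.poissonSquareExpectedValue(2))  # 2^2 + 2 = 6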

The normal distribution was the hardest to implement. It is also, perhaps, the hardest to understand. First of all, it’s not a discrete distribution: its domain is the real axis, and because of that the probability of hitting any particular real number is 0. Real numbers are uncountable, and it’s impossible to ever identify any real number with infinite precision. Even if you mean the number 20 (an integer), you are implicitly assuming an infinity of zeros beyond the decimal point, or else some precision at which you stop. In fact, 20.0000000… is the same number as 19.9999999…. And if you specify some number of decimal places for your precision, say 20.000000, this actually identifies a slice of width 0.000001, between 20.000000 and 20.000001, in which there is still an infinity of real numbers.

For that reason, there is no probability mass function for the normal distribution, but only a probability density function (PDF). This in turn does not directly translate to the probability we want. Rather, it is a curve whose integral over an interval of the axis represents the probability that the random variable takes a value in that range.
The probability that the random variable takes a value not greater than X is given by the integral of the PDF from −∞ to X: this is the cumulative distribution function (CDF). Subtracting the cumulative probabilities at two points X and Y gives the probability that the random variable falls between those two boundaries.

The implementation can be complicated. Following the tutorial, I looked for ways to compute an integral, and briefly considered installing a library capable of doing that. I eventually found that erf, the error function, is already available in Python’s math module, which makes everything much simpler.

@staticmethod
def normDistCumulative(mean, stdDev, x):
    # CDF of the normal distribution, expressed through the error function:
    # F(x) = 0.5 * (1 + erf((x - mean) / (stdDev * sqrt(2))))
    return 0.5 * (1 + math.erf((x - mean) / (stdDev * math.sqrt(2))))
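To see it in action (again, with illustrative numbers of my own): for a standard normal variable, the probability of falling below the mean is exactly 0.5, and the probability of landing within one standard deviation of the mean is about 0.6827.

print(Stats.normDistCumulative(0, 1, 0))    # 0.5, since erf(0) = 0
within_one_sigma = (Stats.normDistCumulative(0, 1, 1)
                    - Stats.normDistCumulative(0, 1, -1))
print(within_one_sigma)                     # ≈ 0.6827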

The complete code is available on my GitHub, commit 7718278.
