business, computer, connection

HackerRank Statistics Day 4: Binomial Distribution

On this post, I go on with my trail through the HackerRank series “10 Days of Statistics”.

After two days dedicated to Probabilities, Day 4 starts a long subseries dedicated to statistical distributions that lasts until Day 6. And this is a return to Python programming.

A Statistical Distribution is a model that tries to predict, for a given type of experiment associated to a random variable, the probability that the variable will take a certain value or, more commonly, fall in a certain range. Most of the questions in these days are very similar, consisting mostly of implementing the distribution functions, and then applying these to a specific scenario.

Day 4

Task

The ratio of boys to girls for babies born in Russia is 1.09:1. If there is 1 child born per birth, what proportion of Russian families with exactly 6 children will have at least 3 boys?

Write a program to compute the answer using the above parameters. Then print your result, rounded to a scale of 3 decimal places (i.e., 1.234 format).

There are two parts to this question: the first is to identify the experiment and how to model it; the second is the implementation of the needed calculations.

The problem statement confines us to families of 6 childen, each of which can be either a boy or a girl. This is similar to an experiment in which we toss a coin 6 times and count the probability of coming up Tails at least 3 times. The difference is that this coin would not be fair. This is clearly a binomial experiment: we have a repetition of independent trials, and the result is of a binary nature.

The Tutorial gives us the formulas for this distribution. In this case, we want some kind of cumulative distribution. But there is no compact easy formula for that, in the case of the binomial distribution. We have to sum the values for the probability at each point.

So, we could simply compute the sum of the variable taking values 3 to 6. In fact, using a principle I outlined in this post on negative logic, it’s more efficient to compute the probabilities for 0, 1 and 2 instead, and then subtract the result from 1.

This gives us the following main program:

from stats import Stats
from utils import Utils

boysWeight = 1.09
girlsWeight = 1.0
probBoy = boysWeight / (boysWeight + girlsWeight)

print (Utils.scale( 1 - Stats.binomDistCumulative(2,6,probBoy), 3))

The crucial part, of course, is to implement binomDistCumulative.

We can do it by simply running a loop and summing the probabilities computed at each stage. Naively, we could use the exact formula for each point, computing the binomial coefficient and the powers everytime. But that would be very wasteful, as at each stage, the value for each one of these quantities can be computed from their value in the previous iteration.
The followoing code makes use of this observation and optimizes the calculation by removing any direct computation of the binomial coefficient and computing the powers only gradually, at each step.

def binomDistCumulative(maxX,n,p):
    binom = 1
    pSuc = 1
    pFail = (1-p)**n
    prob = 0
    for x in range(0,maxX+1, 1):
        prob = prob + binom * pSuc * pFail
        pSuc = pSuc * p
        pFail = pFail / (1-p)
        binom = binom * (n-x) / (x+1)
    return prob

For those who prefer a functional write-up, this can be written with a single reduce call. In my opinion, this is at the cost of some readability, and more difficulty in debugging, but if your team is used to writing and debugging functional code, then this may be more attractive.

def binomDistCumulative(maxX,n,p):
    tuple = reduce(lambda a, x: 
            (a[0] + a[1], 
             a[1] * p / (1-p) * (n-x) / (x+1)
            ), 
            range(0,maxX+1, 1), 
            (0, (1-p)**n)
          )
    return tuple[0]

Task

A manufacturer of metal pistons finds that, on average, 12% of the pistons they manufacture are rejected because they are incorrectly sized. What is the probability that a batch of 10 pistons will contain:

  1. No more than 2 rejects?
  2. At least 2 rejects?

After having implemented the distribution functions , this problem becomes only a matter of finding how to apply it to the question.

Again, we have a binomial experiment repeated 10 times independently: we can assume the quality of each piston is not affected by the quality of the others. In this case, we want to measure the probability of a certain number of defects occurring.

The first question translates to the cumulative function for the variable taking a value up to 2. In the second case, we apply negative logic again, and find the complement of the variable taking values 1 or less, which means that it must have a value equal to 2 or more. This gives the following main program:

from utils import Utils
from stats import Stats

pReject = 0.12
n = 10

pNoMoreThan2 = Stats.binomDistCumulative(2,n,pReject)
pAtLeast2 = 1 - Stats.binomDistCumulative(1,n,pReject)

print( Utils.scale (pNoMoreThan2, 3))
print( Utils.scale (pAtLeast2, 3))

Next time, I’ll go on with the Geometric Distribution, still on Day 4. If I’m on a roll, I may add some more distributions, as the problems for these days are very very similar in conception.
See you then…

Leave a Reply