This post covers days 4 and 5 of the “10 Days of Statistics” path on HackerRank together, because they are very similar. The tasks fall into two types:
- Compute the probability function for a certain probability distribution
- Apply the probability distribution to solve a problem.
Personally, I prefer the second type of question, as it is closer to a real-world question: “how do I apply some general knowledge to a particular problem?” The general knowledge is, of course, the probability distribution formula, which you will easily find implemented in some library anyway.
Let’s start, then, with the creative questions and leave the implementation of the distributions to the end. Just assume, for now, that I have appropriately named functions in my ongoing stats.py module.
The essential knowledge for all these questions is the following:
- some random event happens with a probability dictated by some probability distribution
- a probability distribution assigns a probability to each possible value of a random variable: $p(x) = P(X = x)$
- the above function is called the “Probability Mass Function” (PMF)
For continuous distributions, where the variable can take any value in an interval of real numbers, the probability of any single exact value is 0, so a PMF is of no use. Because of this, it’s often more useful to use a Cumulative Distribution Function (CDF).
- the Cumulative Distribution Function computes the probability that the random variable takes a value less than or equal to a given value.
- It is written like this: $F(x) = P(X \le x)$.
- Where the variable X is discrete, this is the sum of the PMF for the relevant values of X. For real variables, this is the integral of a function known as the Probability Density Function.
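To make the PMF/CDF relationship concrete, here is a minimal sketch for a discrete variable. The fair die and the names `die_pmf` and `cdf` are hypothetical illustrations, not part of the stats.py module mentioned above.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die, each face with probability 1/6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(pmf, x):
    """P(X <= x): sum the PMF over every possible value not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

print(die_pmf[4])       # P(X = 4)  -> 1/6
print(cdf(die_pmf, 4))  # P(X <= 4) -> 2/3
```

Exact fractions are used here only so the sums come out without floating-point noise.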
Task (day 4)
The probability that a machine produces a defective product is 1/3. What is the probability that the 1st defect is found during the 5th inspection?
Let us now see how to frame the problem statement as a CDF or PMF calculation. We have a binary event (the product either has a defect or not) that occurs with probability 1/3 (this is the only parameter of the distribution). The problem asks for the probability of the first defective product being detected in the 5th trial. Let’s break this down. There are two parts here:
- “The first defective product”: this is our event, and we want to know when it happens. We define our variable as X = “The number of the trial in which the first defect is found”
- “the fifth trial”: this gives us the range of values we want to consider.
- The problem can be written as $P(X = 5)$, which can be answered by the Probability Mass Function.
This is all we need to write our code.
print(Utils.scale(Stats.geomDist(1/3, 5), 3))
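For reference, the geometric PMF has a simple closed form: the first defect lands on trial n when the first n - 1 products are fine and the n-th is defective. The sketch below uses a hypothetical standalone function, `geometric_pmf`, and plain `round`; it is not the author's `Stats.geomDist`, whose implementation is left for later.

```python
def geometric_pmf(p, n):
    """P(X = n): n - 1 non-defective products (probability 1 - p each), then one defect."""
    return (1 - p) ** (n - 1) * p

# (2/3)^4 * (1/3) = 16/243
print(round(geometric_pmf(1 / 3, 5), 3))  # 0.066
```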
Task (day 4)
The probability that a machine produces a defective product is 1/3. What is the probability that the 1st defect is found during the first 5 inspections?
This is the same setup, but the question differs in the values the variable can take. This is now a range of values, and can be written as $P(X \le 5)$. This calls for the Cumulative Distribution Function.
print(Utils.scale(Stats.geomDistCumulative(1/3, 5), 3))
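As a sanity check, the cumulative probability can be obtained either by summing the PMF over trials 1 to 5 or from the closed form 1 - (1 - p)^n, which is the complement of "no defect in the first n trials". The function names below are hypothetical and independent of the stats.py module.

```python
def geometric_pmf(p, n):
    """P(X = n) for the trial number of the first defect."""
    return (1 - p) ** (n - 1) * p

def geometric_cdf(p, n):
    """P(X <= n) = 1 - P(no defect in the first n trials)."""
    return 1 - (1 - p) ** n

by_sum = sum(geometric_pmf(1 / 3, k) for k in range(1, 6))
closed_form = geometric_cdf(1 / 3, 5)
print(round(by_sum, 3), round(closed_form, 3))  # 0.868 0.868
```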
Task (day 5)
A random variable, X, follows the Poisson distribution with mean of 2.5. Find the probability with which the random variable X is equal to 5.
The problem statement is very clear. We want this probability: $P(X = 5)$. This is evidently a PMF.
print(Utils.scale(Stats.poissonDist(2.5, 5), 3))
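The Poisson PMF is P(X = k) = λ^k e^(-λ) / k!, where λ is the mean. A minimal sketch follows, using only the standard `math` module and a hypothetical function name, `poisson_pmf`, rather than the author's `Stats.poissonDist`.

```python
import math

def poisson_pmf(lam, k):
    """P(X = k) for a Poisson variable with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

print(round(poisson_pmf(2.5, 5), 3))  # 0.067
```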
Task (day 5)
The manager of an industrial plant is planning to buy a machine of either type A or type B. For each day’s operation:
- The number of repairs, X, that machine A needs is a Poisson random variable with mean 0.88. The daily cost of operating A is $C_A = 160 + 40X^2$.
- The number of repairs, Y, that machine B needs is a Poisson random variable with mean 1.55. The daily cost of operating B is $C_B = 128 + 40Y^2$.
Assume that the repairs take a negligible amount of time and the machines are maintained nightly to ensure that they operate like new at the start of each day. Find and print the expected daily cost for each machine.
This is a rather more interesting problem. It feels very school-like, but involves a number of steps rather than a direct application of a function. In an exam, you’d probably be asked which machine the manager should buy, and left to figure out that you have to compute the expected cost of each.
The problem already says we have two Poisson distributions. What it doesn’t say, but is absolutely needed to solve this problem, is that expectation is linear: the expected value of a sum of random variables is the sum of their expected values.
This should be broken down for clarity:
- we know the expected values (the mean) of random variables X and Y
- we have formulas for new random variables, the cost of each machine: $C_A = 160 + 40X^2$ and $C_B = 128 + 40Y^2$
- we must return the expected value of each of these new random variables
- $C_A$ is defined as the sum of a constant value and a random variable multiplied by a scalar.
This is where linearity comes in. If we have two random variables A and B, any two random variables, even possibly dependent, then $E[A + B] = E[A] + E[B]$, where $E[\cdot]$ stands for “expected value”.
In the expression above, we sum two things, but they don’t quite look like A or B. But we can make them do so:
- A constant c can be seen as the result of a deterministic variable. This is a special case of a random variable, whose probability mass function assigns probability 1 to the value c and 0 to all other possible values. In this case, let’s define “random” variables F, which always takes the value 160, and G, which always takes the value 128. Then $E[F] = 160$ and $E[G] = 128$.
- A random variable A multiplied by a scalar k is, for a positive integer k, nothing more than a repeated sum of A: $kA = A + A + \dots + A$ (k times). By linearity, we get $E[2A] = E[A] + E[A] = 2E[A]$, and in the general case $E[kA] = kE[A]$ (which in fact holds for any real k).
- Finally, we have to deal with the fact that in this case the cost depends on $X^2$ rather than $X$: the formula is given in the tutorial for this day, and we find that $E[X^2] = \lambda + \lambda^2$ for a Poisson variable X, where $\lambda = E[X]$ is its mean.
We now have everything we need to compute the expected cost of each machine.
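Written out for machine A, and assuming the cost formula $C_A = 160 + 40X^2$ with $\lambda = 0.88$, the steps above chain together roughly as follows (machine B is analogous):

```latex
\begin{aligned}
E[C_A] &= E[160 + 40X^2] \\
       &= E[160] + E[40X^2]            && \text{(linearity)} \\
       &= 160 + 40\,E[X^2]             && \text{(constant and scalar rules)} \\
       &= 160 + 40\,(\lambda + \lambda^2) && \text{(Poisson second moment)} \\
       &= 160 + 40\,(0.88 + 0.7744) = 226.176
\end{aligned}
```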
CA = 160 + 40 * Stats.poissonSquareExpectedValue(0.88)
CB = 128 + 40 * Stats.poissonSquareExpectedValue(1.55)
print(Utils.scale(CA, 3))
print(Utils.scale(CB, 3))
By the way, the expected cost for machine A is 226.176, and for machine B it is 286.1, so the manager should choose machine A.
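If you would rather not take the second-moment formula on faith, it can be checked numerically by summing k² · P(X = k) directly. The sketch below is self-contained, and its function names are hypothetical stand-ins rather than the author's `Stats.poissonSquareExpectedValue`.

```python
import math

def poisson_second_moment(lam):
    """E[X^2] for a Poisson variable with mean lam: Var(X) + E[X]^2 = lam + lam^2."""
    return lam + lam ** 2

def poisson_second_moment_by_series(lam, terms=100):
    """Approximate E[X^2] by summing k^2 * P(X = k) for k = 1 .. terms - 1."""
    pmf = math.exp(-lam)  # P(X = 0); contributes 0 to the sum
    total = 0.0
    for k in range(1, terms):
        pmf *= lam / k    # P(X = k) obtained from P(X = k - 1)
        total += k * k * pmf
    return total

cost_a = 160 + 40 * poisson_second_moment(0.88)
cost_b = 128 + 40 * poisson_second_moment(1.55)
print(round(cost_a, 3))
print(round(cost_b, 3))
```

The two functions should agree to many decimal places for the means used here, since the Poisson tail beyond 100 terms is negligible.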
Task (day 5)
In a certain plant, the time taken to assemble a car is a random variable, X, having a normal distribution with a mean of 20 hours and a standard deviation of 2 hours. What is the probability that a car can be assembled at this plant in:
- Less than 19.5 hours?
- Between 20 and 22 hours?
Day 5 also introduces the normal distribution, which is probably the most important distribution in statistics. Much of that importance comes from the Central Limit Theorem, which states that, under fairly general conditions, the sum (or average) of a large number of independent observations is approximately normally distributed. For this reason, the normal distribution is often used to model natural phenomena.
Normal distributions are also known as Gaussian distributions, and often as bell curves because of their symmetric, bell-like shape. They are characterized by two quantities: the mean (which determines where the peak of the curve sits) and the standard deviation (which determines how wide it is). They are defined over the real numbers, so the probability of the variable taking any one exact value is 0. The relevant function for us, then, is the cumulative function, and typical questions are of the form $P(X \le x)$.
That is precisely the nature of the first question: $P(X < 19.5)$.
As for the second question, notice that if you count the probability of X being less than 22 hours, you also include all the cases in which it is less than 20. You can break this probability into two parts:
$$P(X \le 22) = P(X \le 20) + P(20 < X \le 22)$$
Now the solution follows by simply rearranging the terms:
$$P(20 < X \le 22) = P(X \le 22) - P(X \le 20)$$
The code for these two questions is then:
mean = 20
stdDev = 2
pLess19_5 = Stats.normDistCumulative(mean, stdDev, 19.5)
pBetween22_20 = Stats.normDistCumulative(mean, stdDev, 22) - Stats.normDistCumulative(mean, stdDev, 20)
print(Utils.scale(pLess19_5, 3))
print(Utils.scale(pBetween22_20, 3))
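The normal CDF has no elementary closed form, but it can be expressed through the error function: Φ(x) = ½ · (1 + erf((x − μ) / (σ√2))). Here is a minimal sketch built on `math.erf` from the standard library; `normal_cdf` is a hypothetical name, and this is not necessarily how `Stats.normDistCumulative` is implemented.

```python
import math

def normal_cdf(mean, std_dev, x):
    """P(X <= x) for a normal variable, written in terms of the error function."""
    return 0.5 * (1 + math.erf((x - mean) / (std_dev * math.sqrt(2))))

p_less_19_5 = normal_cdf(20, 2, 19.5)
p_between_20_22 = normal_cdf(20, 2, 22) - normal_cdf(20, 2, 20)
print(round(p_less_19_5, 3))      # 0.401
print(round(p_between_20_22, 3))  # 0.341
```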
Task (day 5)
The final grades for a Physics exam taken by a large group of students have a mean of 70 and a standard deviation of 10. If we can approximate the distribution of these grades by a normal distribution, what percentage of the students:
- Scored higher than 80?
- Passed the test (i.e., scored at least 60)?
- Failed the test (i.e., scored less than 60)?
The cumulative distribution gives us the probability of the variable being at most some value. The probability of a variable being at least a certain value is exactly the complement (remember that for the normal distribution, the probability of being exactly equal to a value is 0). That is, $P(X \ge x) = 1 - P(X \le x)$. This observation is enough to solve these three questions.
mean = 70
stdDev = 10
pMoreThan80 = 1 - Stats.normDistCumulative(mean, stdDev, 80)
pPassed = 1 - Stats.normDistCumulative(mean, stdDev, 60)
pFailed = Stats.normDistCumulative(mean, stdDev, 60)
print(Utils.scale(pMoreThan80*100, 2))
print(Utils.scale(pPassed*100, 2))
print(Utils.scale(pFailed*100, 2))
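If you just want to check the numbers without a hand-written CDF, Python 3.8+ ships `statistics.NormalDist`, whose `cdf` method gives P(X ≤ x). The sketch below uses it in place of the author's `Stats` module, so the variable names are illustrative only.

```python
from statistics import NormalDist

grades = NormalDist(mu=70, sigma=10)

p_more_than_80 = 1 - grades.cdf(80)  # complement of "at most 80"
p_passed = 1 - grades.cdf(60)        # complement of "less than 60"
p_failed = grades.cdf(60)

print(round(p_more_than_80 * 100, 2))  # 15.87
print(round(p_passed * 100, 2))        # 84.13
print(round(p_failed * 100, 2))        # 15.87
```

The first and last answers coincide because 80 and 60 sit one standard deviation either side of the mean, and the normal curve is symmetric.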
And that is it for today. I hope the above examples give you some insight into how to use probability distributions. Some work remains to be done, namely the implementation of the distribution functions themselves. This, I leave for another day.
See you then.