
Statistics Hello World – Part 1

My first serious attempt at learning Machine Learning was in 2016, when I took the Machine Learning course taught by Andrew Ng on Coursera. I made a point of doing all the coursework, and through that actually learned Octave / MATLAB. Of course, what you don’t practice you forget, so by now I have only faint memories of the language and have kept just the key concepts of the course. That is par for the course everywhere, I believe, and when I start actually meddling with real ML problems these foundations will again be useful and will also be the blocks I’ll build on.

For now, though, I’ll start easy with the 10 Days of Statistics tutorial on HackerRank, because statistics seems to be a fundamental tool in ML. Also, it gives me an opportunity to retrain my Python skills (which, like one’s weight on a bad diet, I periodically gain and lose, because I don’t get to use Python regularly).

There are always hard parts in every challenge when one is beginning, and this was no exception. While the math is trivial, I did spend a lot of time on the basics of Python. The first problems were prosaic or a matter of personal preference:

  1. what is the correct structure of a Python file?
  2. how do I receive the program input?
  3. how do I write code with the proper style (i.e., pythonically)?
  4. should I try to write code as minimal as possible?
  5. how shall I organize my code?

As for 1, there is not much to it. I had to get reacquainted with the way to define classes and functions, and debate whether I should have a main section or not. That may some day be a theme for another post, but not today.

Question 2 is easy to answer, but it’s impossible to make any headway on HackerRank without having it settled. On some other sites, we get all the necessary data in the command line arguments, that is, as inputs to our main function. Not so on HR: we have to read them from standard input. Python has an object representing standard input, called stdin. It lives in the sys module, so to use it we have to do, for example:

import sys
line = sys.stdin.readline()  # reads one line, including the trailing newline

But there is a much easier approach, one that reminds me of my early BASIC days: input().
When you call input(), the program waits for the user to enter a line of text and returns it without the trailing newline. If you give it an argument, that text is shown to the user as a prompt first:

reply = input("Please enter some text\n")
print(reply)
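
As a rough sketch of how this plays out on a typical challenge, suppose the input is a count on the first line followed by that many space-separated numbers on the second. That layout is an assumption for illustration, not taken from a specific problem, and read_numbers is a hypothetical helper name. Passing the stream as a parameter makes it possible to try the function without typing anything:

```python
import io


def read_numbers(stream):
    """Read a count line, then a line of space-separated numbers."""
    count = int(stream.readline())
    values = [float(token) for token in stream.readline().split()]
    # Keep only as many values as the first line declared.
    return values[:count]


# On HackerRank one would pass sys.stdin (after "import sys");
# here an in-memory stream stands in for standard input.
sample = io.StringIO("3\n10 20 30\n")
numbers = read_numbers(sample)
print(numbers)  # prints [10.0, 20.0, 30.0]
```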

The final three questions are more personal, and so they may not have objective answers.

Question number 3 is something I’ll have to work out as I go, by visiting forums like Stack Overflow and Stack Exchange, or reading Python code online. Since my main language is C#, and I am still not accustomed to Python idioms, I’m sure I’ll write many non-pythonic things, but hopefully I’ll be corrected and learn from that. Some useful links:
http://docs.python-guide.org/en/latest/writing/style/
http://pep8.org/

Question number 4 is prompted by my experience with other competitive coding sites, where there frequently seems to be an arms race for the shortest possible answer. In Codefights, there are even competitions where that is the very goal. I’m not running this race. My goal with these exercises is to learn both Python and the subject matter, in a way that lets me reuse the code. I have three principles I want to keep:

  1. the code has to be as immediately understandable for the reader as possible.
  2. code must eventually be debugged, so make it possible to obtain the intermediate results of complex calculations (via breakpoint or a simple print)
  3. the code must be as performant (in terms of time and space complexity) as required in the problem statement or, missing that, as performant as possible.

Point 3, for example, is often overlooked in, say, Codefights. There is a battery of tests that checks whether the code runs within some time limit, but the complexity itself is not estimated, so the site can accept solutions whose asymptotic complexity is worse than the one requested in the problem statement.

These principles may lead me to write more code than necessary, or possibly more complex code (for performance) than an intuitive but slow solution. And often, point number 2 will lead to a tension between a concise functional-style expression and an imperative-style program that allows inspection of each intermediate calculation. All these struggles were already evident on the first day.
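
To make that tension concrete, here is a small sketch using a weighted mean. The function names and sample data are made up for illustration and are not my actual solutions. Both versions compute the same thing; the second simply gives every intermediate result a name that can be printed or inspected at a breakpoint:

```python
def weighted_mean_concise(values, weights):
    """One expression: compact, but nothing to inspect midway."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_mean_stepwise(values, weights):
    """Same calculation, with each intermediate result named."""
    products = [v * w for v, w in zip(values, weights)]
    weighted_total = sum(products)
    weight_total = sum(weights)
    # A breakpoint or a print() here would show every partial result.
    return weighted_total / weight_total


values = [10, 40, 30]
weights = [1, 2, 3]
print(weighted_mean_concise(values, weights))   # prints 30.0
print(weighted_mean_stepwise(values, weights))  # prints 30.0
```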

Finally, question 5 is about how I want to write my code outside HackerRank, for my continued use. I like code that is organized with good structure. A particular aspect of this is that I want functions to have their own namespaces, so that I can have variables and functions with similar names where that makes sense. For example, I don’t like having to name a function getMax() just because I want to use a variable max. I feel this should make sense:

max = max([1,2,3,4])

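But this is bad for a compiler or interpreter, because one name now refers to both a function and a variable in the same scope. In Python, the assignment rebinds max, so the built-in function is no longer reachable by that name in that scope. The solution (evident in languages like C#, Java and JavaScript) is that max should be a function in a particular namespace, allowing us to write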

max = Math.max([1,2,3,4])
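
In Python, one way to get a similar effect is a class used purely as a namespace. The sketch below is only an illustration: Stats and largest_of_sample are hypothetical names, and a plain module would serve the same purpose.

```python
class Stats:
    """Hypothetical namespace class grouping helper functions."""

    @staticmethod
    def max(values):
        """Return the largest element of a non-empty sequence."""
        largest = values[0]
        for value in values[1:]:
            if value > largest:
                largest = value
        return largest


def largest_of_sample():
    # 'max' is just a local variable here; the function is still
    # reachable through its namespace as Stats.max.
    max = Stats.max([1, 2, 3, 4])
    return max


print(largest_of_sample())  # prints 4
```

Note that the local variable still hides Python's built-in max inside largest_of_sample, which many style guides discourage.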

This also allows me to put that namespace in a separate file so that I can reuse it from challenge to challenge and save myself some hideous rewriting. For that reason, I’ll break the code into different classes, and then join them together into a single file for submission to HackerRank. From the point of view of solving problems on HR, this has two drawbacks:

  • having to select the parts of your codebase that are relevant to the solution
  • producing final code that is way larger than it needed to be.

But it is just right for my goals.

All of these decisions happened as I was going through the Day 0 exercises. These are Hello World exercises of a sort, and although they could be answered with very short and simple solutions, it pays to think about these things if you’re in it for the long term. Next time, I’ll talk about my solutions to the actual questions.
