“Relying entirely on gut-feel and intuition can’t take a startup too far”, says Prasant Bhattacharji, Data Scientist & Product Planner, HackerRank
In conversation with Prasant Bhattacharji, Data Scientist & Product Planner, HackerRank
Please tell us about your professional & academic background. Can you also brief us about your role in your current organization?
I graduated from the Electrical Engineering department of IIT Kharagpur in 2006 from the Energy Engineering program with a minor degree in Computer Science and Engineering. Computer Science courses equipped me with programming, software design and algorithmic thinking skills. Electrical Engineering courses, especially those related to signal processing, information theory, control and optimization, taught me a lot about applying mathematics and statistics. This combination of skills has been quite helpful while working with data.
After graduating, I worked for a year developing Fixed Income Analytics software at Lehman Brothers (now Nomura) in Mumbai, followed by a couple of years at the Microsoft Headquarters in Redmond, working as a Software Engineer in the SharePoint Server team. I then worked as a Data Engineer for Factual.com, a big data company which mined, aggregated and processed large data sets related to geo-data and consumer products. As a side project, I also created an educational portal: TheLearningPoint.net.
My current role is that of a Data Scientist cum Product Planner at HackerRank.com, a fast-growing startup with a twin focus: helping programmers sharpen their skills in various domains of Computer Science and Math, and connecting them with companies that might be interested in hiring them for their skill set. I analyze metrics related to our product and users, try to glean insights from the data and then use those insights to make informed decisions about how our product and content should evolve. Clustering, classification, regression, predictive analytics and cohort analysis are all tasks which often need to be executed as we try to make sense of our usage data and feel the pulse of our product.
Do you think the hype around data science is warranted? If so, why?
To be honest, I don’t think I am qualified to comment on something as broad as that! But when the Chief Economist of Google says that statistician is going to be an important job in the years to come, perhaps there is some element of truth in the hype about data.
It isn’t as if the ‘data age’ has arrived suddenly – after all, search engines have been around for well over a decade, and they are the most commonly used manifestations of ‘big data’. Similarly, students in Electrical Engineering and Computer Science courses have always been exposed to a lot of data, linear algebra, statistics and mathematics. I personally think that the hype around it is partly because the field is now becoming vast and widespread enough to have a demarcated zone of its own, similar to how Computer Science initially evolved out of the Electrical Engineering and Mathematics departments in universities across the world.
How did you get interested in working with data? What was the first data set you remember working with? What did you do with it?
I did some coursework and project-work with a focus on visual data (Image Processing), text and natural language processing and machine learning. All of these made me quite interested in topics which could broadly fall under the Data Science umbrella.
The first time I worked professionally with data was at a start-up, Factual.com, a big data shop which aggregated, classified and processed large data sets related to geo-data and listings of consumer products.
How do TheLearningPoint and HackerRank use data science?
LearningPoint was just a personal side project of mine, on which I hosted academic content, visualizations and quizzes created by student contributors. On this portal, I also hosted my own analysis of data mined from the results of school-leaving exams. The analysis gave us factual insights into the relative performance of different schools. It also highlighted that the data-based ranking of schools differed significantly from perception-based rankings and ratings, which are often influenced by the money certain schools spend on advertising and can mislead the public. The analysis was based on data for a ten-year period and also revealed that exam scores were being manipulated or rigged. The report is available here.
A surprising score distribution
At HackerRank.com, the founders, Vivek and Hari, are doing their very best to ensure that HackerRank, both as a product and as a company, functions in an extremely data-driven manner. All employees in the company – engineers, content curators, managers and members of the sales team – are encouraged to calibrate the impact of their work by continuously monitoring metrics which relate to it. It is true that not everything can be accurately captured in the form of a quantified metric – but imperfect measurements are preferable to none. One cannot fix a problem without measuring it in some way. Relying entirely on gut-feel and intuition can’t really take a startup too far.
HackerRank also has a full-fledged section designed for programmers and students to learn and sharpen their skills in different areas related to data science – statistics, machine learning, natural language processing and even visual data (Image Analysis).
What are your favorite tools / applications to work with?
For scripting, text processing and even data acquisition tasks I feel most comfortable using Ruby.
R is also an excellent tool for statistical analysis which I’ve used often. I’ve also used a fair bit of Python, for the excellent support it offers through libraries like NumPy, SciPy, scikit-learn and Matplotlib. For processing large data sets I’ve used MapReduce on Hadoop infrastructure.
What personal/professional projects have you been working on this year, and why/how are they interesting to you?
I don’t get much time for a lot of personal projects, but this year I did spend a couple of days on a very large data acquisition task: I compiled the school-wise performance data for over 7000 schools affiliated to CBSE and CISCE, and released the information on LearningPoint. To the best of my knowledge, this has been the biggest data mining initiative focused on the Indian schooling system. It is in line with practices worldwide, where there is recognition that open data drives accountability. In this particular case, schools will have an incentive to pull up their teachers and students once they are conscious that data about their academic standards is out in the public domain and available to parents while they pick a school for their ward.
Professionally, I analyze data about the users of HackerRank, with a focus on their behavior, engagement patterns and skills. This enables us to evolve our product in a way that we offer greater value to our users and paying customers.
Can you name some publications, websites, blogs, conferences and/or books that you read/attend and have been helpful to your work?
Coursera’s online classes have helped me a lot. I have used them extensively to brush up my knowledge of statistics, machine learning, NLP and databases, and even to learn powerful techniques like GPU programming, which enable one to deal with compute-intensive tasks a lot more efficiently. Udacity is another website which offers plenty of classes related to programming, databases and machine learning, though I have personally stuck to Coursera. Blogs such as Priceonomics often analyze real-world data sets and identify trends in them. Priceonomics has several posts which make simple but good examples of how one could inspect data and dig up the stories it tells.
Any words of wisdom for Data Science students or practitioners who are just starting out?
There’s no fixed formula for how ‘best’ to start out, and everyone has their own style of learning. What I personally find effective is to play around a lot with real-world data sets, even while I learn from books or online classes. At the risk of sounding a bit self-promotional, I’d encourage new learners to check out the Statistics and Machine Learning challenges on HackerRank.com, which have been created for this very purpose. There are challenges, tutorials, editorials and real-world data sets which learners may experiment with.
Plus, we often host company-sponsored contests or challenges, such as the Quora contests, in which companies create data-driven challenges relevant to their needs. Top performers in these challenges often end up getting recruited by the companies! Many skilled data scientists and engineers have been recruited by Palantir and Quora through HackerRank.