“A data scientist is like a storyteller who uses data instead of words to get the point across”, says Rahul Kapil, Data Scientist, Snapdeal

In conversation with Rahul Kapil, Data Scientist, Snapdeal

Please tell us about your professional & academic background. Can you also brief us about your role in your current organization?

It was always clear to me, even during my school days, that my interest lay in Mathematics and Physics. That led me to join IIT Delhi as an Engineering Physics major. Obviously the curriculum consisted of a lot of Physics, but it also featured quite a few advanced mathematical concepts, which are generally required to handle theoretical physics domains like Quantum Mechanics, the General Theory of Relativity, and Thermal and Statistical Physics. Additionally, I completed a few courses on Computational Physics, Machine Learning and Intelligent Systems that gave me some foundation in the world of numerical simulations and self-learning algorithms.

I was able to apply those techniques in my first internship, as a Numerical Scientist at Xlim (France), where I simulated optical experiments using mathematical models. This skill of creating complex mathematical algorithms and using them to simulate expensive and cumbersome physical experiments was enhanced when I completed another 4-month internship with a materials science innovation firm, BIAS, in Germany.

I was always fascinated by the power of predictive analytics and the different kinds of products it can power. That led me to create a full-fledged Virtual Doctor app that tries to learn from and diagnose patients based on their medical history and symptoms. Little did I know at the time that this is a problem even juggernauts like IBM are trying to solve with all their multi-billion-dollar big data capabilities! I guess sometimes you have to be ignorant of the boundaries to challenge what’s possible.

I made my foray into non-research-based industry in the summer of 2011, solving forecasting and operational analytics problems for Fortune 500 clients with Parametric Technologies. But I always had a zest for consumer-facing business. That led me to join the Ibibo Group, in the fast-growing e-commerce industry, as a Data Scientist. There I was immediately thrown in at the deep end, solving complex data science problems for their travel and commerce sites and developing in-house analytics capabilities, which were deeply lacking at the time. After an exciting year with Ibibo, I joined another leading e-commerce player in India, Snapdeal.com, in a similar role, where I am leading the Payments Data Science team.

Do you think the hype around data science is warranted? If so, why?

The tech industry always seems to get hold of certain buzzwords and blow them out of proportion with hype and hysteria. I certainly don’t think that is the case with Data Science. I say that because Data Science is not a new concept; it is only lately that it has become more organized as a field of study.

Data Scientists have been working under different titles for ages in Finance, Operations and Academic Research. They were probably too busy to call themselves something as fancy as ‘Data Scientist’.

I believe anything that draws methodological inference from a set of empirical observations is essentially Data Science. Almost every well-run business or organization measures some data about itself. However, at the end of the day, if they are not using that data to predict, analyze and fine-tune their business goals, they are definitely at a competitive disadvantage or, in some cases, leaving easy money on the table.

Hence, companies from social networks to e-commerce marketplaces, from philanthropic organizations to big banks, from email marketers to mobile service providers, are all seeking a data or decision scientist in some shape or form.

How did you get interested in working with data? What was the first data set you remember working with? What did you do with it?

As I said earlier, I have been interested in Mathematics and Statistics since my school days, and in the kind of (seeming) magic you can create by arranging numbers in rows and columns. Funnily enough, the first set of real data I worked with was in 2003, when I was gripped by Cricket World Cup fever. I did a lot of manual data entry in Excel, painstakingly putting in the facts and figures about all participating players: everything from matches, strike rates, averages, form guide and opponent stats, taken from sources like Cricinfo and Wikipedia. As far as I remember, it was some 500 rows of custom player data. I used it to predict the result of any given match based on past performances in the given conditions, previous World Cup record, performance against opponents and current form. As the tournament progressed, the predictions got better.

My model did predict an Australian win over India in the final, and we all know how that turned out!

How does your company use data science?

The Payments team at Snapdeal uses data science extensively. We are constantly improving our recommendation engine to provide more relevant products and items related to things users are already clicking on. We also have to deal with the problem of fake addresses and potentially fraudulent transactions. For that we do a lot of learning from legacy data, and have implemented custom algorithms that trigger our verification systems whenever there is a potential alert.

I have done a lot of interesting work on improving search relevancy and giving new SKUs (Stock Keeping Units) more of a chance to shine in the catalog. Additionally, we are tackling a lot of marketing-related problems like personalized emails and user behavior tracking.

What are your favorite tools / applications to work with?

My favorite applications are R and Octave, as I am all for rapid prototyping, testing and visualization. For visualizations and charts I use the ggplot2 and plyr packages.

I use a lot of Python as well, with the NumPy, SciPy, pandas and scikit add-ons. It is also good to use some high-level BI tools like Tableau, Jasper and Pentaho for data consolidation and for creating easy-to-understand dashboards. These are especially useful for communicating your analysis to C-level decision makers in your organization.

What personal/professional projects have you been working on this year, and why/how are they interesting to you?

Apart from the things I already mentioned above about recommendation systems, I have been working on user personalization and fraud detection. Personally, I am always trying out different things in this domain.

I am quite fond of Twitter as a social network because of its ability to aggregate social behavior data, and also because of its amazing public API. I used it extensively over the summer to do semantic analysis during the general elections. It was a great exercise to find out the swings in public perception of various political leaders during various stages of the campaigns, and which issues are priorities for the general public in different geographies.

Last year, I developed a Twitter application that was able to anticipate which trends were going to be popular during the day, before they started trending. I was able to use this information to generate some witty tweets that were likely to be discovered and subsequently retweeted. Needless to say, I gained quite a few followers from this exercise!

Can you name some publications, websites, blogs, conferences and/or books that you read/attend and have been helpful to your work?

Plenty of good people are giving a lot back to the community in the form of blogs and knowledge base. I have bookmarked some of the good ones:

I would strongly suggest the following books for someone looking to become a Data Scientist:

  • Introduction to Probability and Statistics – Sheldon Ross
  • The Elements of Statistical Learning – Data Mining, Inference, and Prediction
  • How to Solve It – G. Polya (a general book for any problem solver)
  • Doing Data Science: An insightful book, based on Columbia University’s Introduction to Data Science class

Any words of wisdom for Data Science students or practitioners who are just starting out?

Data Science is a convergence of multiple technical domains. As per Hilary Mason’s awesome definition – Data Scientists are usually a mixture of Applied Mathematics, Computer Science, Engineering and Hacking.

Very few data scientists have the same job requirements. It is like horses for courses. So it becomes important for any aspiring professional in this field to focus on inputs rather than outputs. That is, you should focus on enhancing your skills in one of R/Python/Octave, broadening your mathematical base and strengthening your concepts of probability and statistics. Good places to start are UC Berkeley’s Intro to Data Science course, the similarly titled course on Coursera, and Stanford’s Data Mining and Analysis course.

Another important point (which is slightly unpopular with engineering people) is that a Data Scientist is as much a business person as a technical person. They should be able to see the holistic picture with respect to product, engineering and business goals. They should also be excellent at communicating with different stakeholders.

In my experience, being a data scientist is equivalent to being a storyteller, in which, instead of words, you are using data to get your point across.