Young Achievers – In conversation with Rohan Anand, Data Scientist, Delhivery
Please tell us about your professional & academic background. Can you also brief us about your role in your current organization?
I graduated with an integrated five-year M.Tech in Mathematics and Computing from IIT Delhi in 2013. It’s a multi-disciplinary course covering theoretical and applied mathematics as well as computer science. After that, I joined Citibank straight out of campus. However, I soon moved to Delhivery as a Junior Data Scientist because the work was much closer to my academic background, and they were just setting up the division, which offered a lot of learning opportunities. My role at Delhivery involves using machine learning techniques, operations research and heuristic algorithms to solve business and operations problems.
Do you think the hype around data science is warranted? If so, why?
I think the hype exists largely because many companies have benefited a great deal from data science, so every organization that captures a lot of data wants to consume that data in a beneficial way.
One can’t follow a hammer-and-nail approach with Data Science. Making operations smarter, recommendations meaningful, ads contextual and customer targeting rewarding is something every organization does. Some do it manually, some use a software suite and some crunch the data by writing raw code, but the solutions make a huge impact.
When these things are done by a piece of code that reduces the Excel work, the A/B tests and the field experiments, many people with a traditional analytics background do not acknowledge it and write it off as hype. It may sound like magic, but the magic here is quite transparent.
How did you get interested in working with data? What was the first data set you remember working with? What did you do with it?
I did three courses related to Data Science at IIT Delhi, namely Soft Computing, Pattern Recognition and Data Mining. They generated the academic interest, because even the synthetic data you use in such courses is derived from some real-world problem.
The first assignment I did used Self-Organizing Fuzzy Neural Networks on the (much-abused) Iris Data Set. My major thesis was on “Multi-Sentence Summarization of Twitter News Data”, which helped me grasp Natural Language Processing concepts. I came up with a novel algorithm that produces a ranked list of news items maximizing information coverage, based on the number of items a person can read or the time they can spare (a rough sketch of that greedy-coverage idea appears below).
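The thesis algorithm itself isn’t spelled out here, so the following is only a minimal, hypothetical Python sketch of the greedy coverage idea: keep picking the news item that adds the most not-yet-covered words until the reader’s budget runs out. The function name and sample items are invented for illustration.

```python
# Hypothetical sketch of greedy coverage-based ranking -- not the actual
# thesis algorithm. Repeatedly pick the item that adds the most unseen
# words until the reader's budget (number of items) is exhausted.

def rank_by_coverage(news_items, budget):
    """Return up to `budget` items that greedily maximize word coverage."""
    covered = set()
    ranked = []
    remaining = list(news_items)
    while remaining and len(ranked) < budget:
        # Pick the item contributing the most not-yet-covered words.
        best = max(remaining,
                   key=lambda item: len(set(item.lower().split()) - covered))
        gain = set(best.lower().split()) - covered
        if not gain:
            break  # nothing new left to cover
        covered |= gain
        ranked.append(best)
        remaining.remove(best)
    return ranked

if __name__ == "__main__":
    items = [
        "delhivery expands delivery network in tier two cities",
        "delivery network expands",
        "machine learning speeds up address translation",
    ]
    print(rank_by_coverage(items, budget=2))
```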
How does Delhivery use data science?
We use it in a lot of transformative ways. We are changing how we deliver our packages through machine translation of addresses, which makes deliveries faster and the network more efficient. It also makes expansion easier. We have alarm systems to highlight high-stress centers hit by abrupt load changes.
We are making the package lifecycle more transparent for both clients and customers through path optimization across the entire network, so that customers know exactly when their packages will be delivered and clients can answer similar queries. A simplified sketch of the idea follows.
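The interview doesn’t describe the actual optimization, so this is only a toy illustration of the concept: hubs as graph nodes, transit hours as edge weights, and Dijkstra’s algorithm for the fastest route. All hub names and times below are made up.

```python
# Toy illustration of path optimization for delivery ETAs (not
# Delhivery's actual system): find the fastest route through a
# weighted hub network using Dijkstra's algorithm.
import heapq

def fastest_route(network, source, target):
    """Dijkstra's shortest path; returns (total_hours, hub_sequence)."""
    queue = [(0, source, [source])]
    seen = set()
    while queue:
        hours, hub, path = heapq.heappop(queue)
        if hub == target:
            return hours, path
        if hub in seen:
            continue
        seen.add(hub)
        for nxt, t in network.get(hub, {}).items():
            if nxt not in seen:
                heapq.heappush(queue, (hours + t, nxt, path + [nxt]))
    return float("inf"), []

network = {  # transit time in hours between hubs (illustrative only)
    "Delhi": {"Jaipur": 6, "Lucknow": 8},
    "Jaipur": {"Mumbai": 18},
    "Lucknow": {"Mumbai": 22},
    "Mumbai": {},
}
print(fastest_route(network, "Delhi", "Mumbai"))  # (24, ['Delhi', 'Jaipur', 'Mumbai'])
```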
We use custom algorithms to site new facilities or relocate existing ones. We have done a lot of actual physical network planning sitting in a war room. All of this has been achieved within less than a year of the division being set up. There are a lot of other exciting projects in the pipeline, but nothing can be divulged at the moment.
What personal/professional projects have you been working on this year, and why/how are they interesting to you?
We are a very small and niche team right now, so the projects I mentioned above are all things I have worked on. “Interesting” to me means something that is intellectually stimulating and makes a big impact. Address Translation is something I worked on, and we already have a couple of other services built on top of it. It was a project where we started from a business problem and converted it into something that beats Google’s services in many instances. It was a very intense project, and I had to get a good grasp of Graphical Models, of which I only had academic knowledge; using such new techniques on practical problems keeps up my interest in the work I do.
What are your favorite tools / applications to work with?
We mainly code in Python. We use C and R when things need to be done faster or differently. We don’t use any proprietary applications right now because the problems we work on are quite complex and the people on the team are not code-shy, so the good GUI that most applications offer doesn’t make us any more efficient; on the contrary, the learning curve is a cost in itself. IPython combined with pandas, NumPy, SciPy, matplotlib and scikit-learn helps me get into a very speedy experimentation mode, along the lines of the snippet below.
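As a small illustration of that workflow (not any Delhivery code, just the kind of quick loop this stack enables): load a toy data set, eyeball it with pandas, and fit a baseline model with scikit-learn, all inside an interactive IPython session.

```python
# Quick experimentation loop with the stack mentioned above:
# inspect a toy data set with pandas, then cross-validate a
# baseline classifier with scikit-learn.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

print(df.describe())  # quick sanity check on the features

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```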
Can you name some publications, websites, blogs, conferences and/or books that you read/attend and have been helpful to your work?
I have joined some Machine Learning and Natural Language Processing groups on LinkedIn. I use Google Alerts for getting new information on select topics (I’m sad that Google Reader is no more). Mashable and TechCrunch are other favorites as you get to know how some leading companies are using machine learning.
In terms of books, An Introduction to Statistical Learning, Introduction to Algorithms (CLRS) and The Elements of Statistical Learning have been helpful to my work, though I still wish to finish them one day. We end up reading a lot of research literature while solving problems, and that remains the major source of knowledge at work. My interest in genetic algorithms developed because I kept coming across them in that research, and we have also used them in one of our operations problems (a bare-bones sketch follows).
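The operations problem itself isn’t described, so here is only a minimal, generic genetic-algorithm sketch in Python: evolve bit-strings toward a toy fitness function via selection, crossover and mutation. Every name and parameter here is invented for illustration.

```python
# Bare-bones genetic algorithm (illustrative only): maximize the
# number of ones in a bit-string via selection, crossover, mutation.
import random

random.seed(0)

def fitness(bits):
    return sum(bits)  # toy objective: count of ones

def evolve(pop_size=20, genome_len=16, generations=50, mutation_rate=0.05):
    pop = [[random.randint(0, 1) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # Crossover + mutation to refill the population.
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genome_len)
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
print(best, fitness(best))
```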
Any words of wisdom for Data Science students or practitioners who are just starting out?
- Get a good hold of the theory. For example, one should know what parameters an SVM optimizes and what they mean mathematically, one should be able to derive the formula for the highest-probability state sequence in a Hidden Markov Model, and so on. This matters because only when you know the theory will you clearly understand the limitations and assumptions of each model, which is what rids you of the hammer approach. (Both examples are written out after this list.)
- Also be inquisitive about anything you read, in terms of relating it to practical issues, because those challenges reappear when you apply it for real. For example, you should know why you would use a generative classifier over a discriminative one. Most people know what overfitting is, but only a few know how to overcome it in a systematic manner. What are the ways of doing feature selection? How is feature selection different from dimensionality reduction? Is it different at all? And many other questions along the same lines. (A small illustration of the last two questions follows the list.)
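To make the first point concrete, here are the two results mentioned above, written out as the standard textbook formulas (nothing proprietary): the soft-margin SVM objective and the Viterbi recursion for an HMM’s most probable state sequence.

```latex
% Soft-margin SVM: the parameters (w, b) define the separating hyperplane,
% the slack variables xi_i absorb margin violations, and C trades off
% margin width against training error.
\min_{w,\,b,\,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.

% Viterbi recursion: delta_t(j) is the probability of the most likely
% state sequence ending in state j at time t, given initial distribution
% pi, transition probabilities a_{ij} and emissions b_j(o_t).
\delta_1(j) = \pi_j\, b_j(o_1), \qquad
\delta_t(j) = \max_i \big[\delta_{t-1}(i)\, a_{ij}\big]\, b_j(o_t).
```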
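And for the second point, a small hypothetical scikit-learn example of why feature selection and dimensionality reduction are related but not the same: univariate selection keeps a subset of the original features, while PCA builds new features that mix all of them.

```python
# Feature selection vs. dimensionality reduction on the same data
# (illustrative): SelectKBest keeps 2 of the original columns, while
# PCA produces 2 new features combining every original column.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(f_classif, k=2).fit(X, y)
print("selected original columns:", selector.get_support(indices=True))

pca = PCA(n_components=2).fit(X)
print("PCA component loadings:\n", pca.components_)  # mixes every column
```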
I talk to two or three people every week for such a role, and I have realized that many have taken a third-party or open-source library, fed it some input and got results, without bothering to ponder why the approach worked in the first place. Libraries are there to reduce coding effort, but they are no excuse for not knowing what the algorithm does.
I also come across a lot of people who have started doing courses on MOOC platforms like Coursera or Stanford’s online offerings, but one needs to give them genuine attention. These courses are immensely useful, but the certification doesn’t signify anything more than interest; you still need to work hard to learn something meaningful from them. Knowing k-means clustering and basic linear regression does not amount to Machine Learning knowledge; it barely scratches the surface of a huge ocean. If you have time on your hands, participate in Kaggle competitions and download random data sets from UCI and try them out. It will help a lot!
Interested in this field? Check out some exciting Data Scientist jobs @ Delhivery –
Delhivery – Data Developer – Python/R (0-3 yrs)
Delhivery – Data Scientist – Machine Learning (4-10 yrs)
You can also reach out to them @ jobs@delhivery.com!