It’s a key question for many data scientists, especially those who are new to the field: is Python or R better for data science?
For those first venturing into the world of data science, it’s important to master one language first rather than trying to be a jack of all trades from the outset. Your processes and techniques are what matter most, and mastering these in one language before branching out is what will give you a strong footing in the data science world. Once you have a solid set of skills and techniques under your belt, moving into other languages is a great way of skilling up and staying competitive in your field, but your first programming language should allow you to learn as much as you can. There’s no shortage of languages to pick as your weapon of choice: when it comes to data science, there are plenty on offer, including (but not limited to) Java, C, C++, Scala, Perl, Clojure, and Julia. However, Python and R are undoubtedly the frontrunners for the majority of the data science world.
So which do you pick?
For years, R has been the obvious choice for those going into data science: it was designed with statisticians in mind, has a long history of success in the industry, has thousands of publicly released packages, and integrates well with programming languages such as C, C++, and Java. First released in the 1990s, R is common across a whole range of sectors. It’s used by leading commercial companies such as Google and Facebook, and can be found from Wall Street to Silicon Valley as a good alternative to software such as Matlab and SAS.
On the other hand, Python offers plenty of benefits, and an increasing number of people are adopting it for their work. As one of the most popular mainstream programming languages, it’s a practical choice for tech types of all kinds, data scientists included. In particular, Python is taking off in the financial sector: it’s now Bank of America’s tool of choice for crunching financial data. Python is clearly challenging R’s long-established position as the lingua franca for data scientists, but why? Here are six reasons why you might choose Python for data science.
Python is easy to use
Python has earned a reputation for being easy to learn. With its readable syntax, Python is great for beginners and for data scientists who want to build up their skillset. Data science encompasses a number of predictive modelling techniques, which you can apply with plenty of different data mining tools. Applying these techniques in an unfamiliar tool can prove difficult, so you’ll want something with a gentle learning curve. Python’s simplicity appeals to a range of different people: whether you’re an experienced data scientist or analyst, a software engineer who’s starting to work more closely with machine learning, or even a complete beginner, Python is an easy programming language to pick up.
Python is versatile
As a general-purpose programming language, Python is a quick and powerful tool with plenty of capability. Whatever problem you want to solve, Python can help you do the job. From building web services to data mining, Python gives you the opportunity to solve data problems end-to-end.
Python is better for building analytics tools
R and Python are both pretty good if you want to find outliers in a dataset, but when it comes to creating a web service to allow others to find outliers in their datasets, Python is the way forward. At a time when self-service analytics is more and more important, this is really valuable.
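To make that concrete, here is a minimal sketch of the idea using only Python's standard library. It assumes one common rule of thumb for outliers (values more than 1.5 times the interquartile range beyond the quartiles), and the names find_outliers and handle_request are hypothetical, chosen purely for illustration. The handler takes a JSON request body and returns a JSON response, which is the part a web framework of your choice would call for you.

```python
import json
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles.

    The 1.5 * IQR rule is an assumption made for this sketch;
    other definitions of "outlier" would work just as well.
    """
    q1, _, q3 = quantiles(values, n=4)  # first, second and third quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def handle_request(body):
    """Hypothetical service handler: JSON text in, JSON text out.

    Expects a body such as '{"values": [1, 2, 3]}'.
    """
    payload = json.loads(body)
    outliers = find_outliers(payload["values"])
    return json.dumps({"outliers": outliers})


if __name__ == "__main__":
    request_body = '{"values": [10, 12, 11, 13, 12, 11, 95]}'
    print(handle_request(request_body))
```

The same find_outliers function serves both the analyst working at their own desk and the service that lets other people submit their datasets, which is the end-to-end appeal in a nutshell.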
Data visualisation with Python
Okay, so this is where R usually wins out against Python. It has an impressive range of visualisation packages, such as ggplot2, rCharts, and googleVis. But although Python doesn’t naturally lend itself to visualisation in the same way as R, it does have a large range of powerful visualisation libraries available, such as Matplotlib, Plot.ly, and Seaborn.
The Python community is growing
Python has a huge community around it, including a strong and growing presence in the data science community. PyPI (the Python Package Index) is a useful place to explore the full extent of what is being developed by the Python community. For example, NumPy, which was established in 2006, recently received a $645,000 grant to support its development as a core library for scientific computing in Python.
Python is better for deep learning
There are plenty of packages, such as Theano, Keras, and TensorFlow, which make it really easy to create deep neural networks in Python. While some of these packages are being ported to R, the support available in Python is far superior.
So, should you use Python for data science? Python is a powerful and versatile tool that allows you to do more in less time. R, meanwhile, is a specialised tool, designed specifically for data analysis. In a market where diversifying is increasingly key to development, adding Python to your repertoire, whether it’s your first language of choice or your second, can only be a good thing. Python is one of the hottest tools in tech right now, and not learning it could leave you in the dust. The great thing about Python is that it’s versatile and easy to pick up, meaning that you can incorporate it into your own workflow and make it work alongside the tools you already use or might use further down the line. Yes, including R.
Richard Gall, Communications Manager, Packt
Image Credit: Flickr / janneke staaks