Q&A: Yandex CatBoost, the open-sourced machine learning algorithm

On July 18th, Yandex announced the launch of a state-of-the-art open-sourced machine learning algorithm called CatBoost that can be easily integrated with deep learning frameworks like Google’s TensorFlow and Apple’s Core ML. Unlike deep learning tools that support only certain types of data, CatBoost works with more diverse types of data to help solve a wide range of problems that businesses face today with best-in-class accuracy. It is especially powerful in two ways: it yields state-of-the-art results without the extensive data training typically required by other machine learning methods, and it provides powerful out-of-the-box support for the more descriptive data formats that accompany many business problems. Developed by Yandex researchers and engineers, it is the successor to the MatrixNet algorithm that is widely used within Yandex’s services for ranking tasks, weather forecasting, fraud detection and making recommendations. Yandex believes that it can be applied across a wide range of industrial machine learning tasks, in domains ranging from finance to scientific research. It is now available to the open source community and will be integrated across Yandex products and services in the coming months.

1. What is gradient boosting?

Gradient boosting is a machine learning algorithm that is widely applied to the kinds of problems businesses encounter every day, such as detecting fraud, predicting customer engagement and ranking recommended items like top web pages or the most relevant ads. It builds a strong predictive model by adding many simple models, usually decision trees, one after another, with each new model trained to correct the errors of those before it. Gradient boosting is ideal for predictive models that analyse many different forms of data, including descriptive data formats with categorical features. In most applications, it serves as the most powerful “ultimate” model that integrates inputs from many different machine learning techniques, including deep learning models. It delivers highly accurate results even in situations where there is relatively little data, unlike deep learning frameworks that need to learn from a massive amount of data. Therefore, it is the most important method in a practitioner’s toolkit, one that can be used to leverage a wide range of data formats and combine a variety of more specialised models.
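To make the idea concrete, here is a minimal sketch of gradient boosting for regression with squared loss, written in plain Python. It is an illustration only and is not how CatBoost is implemented: the one-split "stumps", the function names (fit_stump, fit_boosting, predict, mse) and the toy data are all assumptions made for this example.

```python
# Minimal gradient boosting for regression with squared loss.
# Illustrative sketch only: every name here is hypothetical, not CatBoost's API.

def fit_stump(xs, residuals):
    """Fit a one-split tree (stump) on a single feature by least squares."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # All feature values are identical: predict the mean residual everywhere.
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_boosting(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from a constant prediction
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data, made up for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

weak = fit_boosting(xs, ys, n_rounds=1)
strong = fit_boosting(xs, ys, n_rounds=100)
print("training error after 1 round:   ", mse(weak, xs, ys))
print("training error after 100 rounds:", mse(strong, xs, ys))
```

Each round fits a small model to what the current ensemble still gets wrong, then adds a damped version of it to the ensemble. Production libraries use deeper trees, other loss functions and regularisation, but the loop has the same shape.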

2. Why has Yandex open sourced CatBoost?

Machine learning powers more than 70 per cent of Yandex products and services. Yandex, like many other tech companies, uses various forms of machine learning, both homegrown and open source. We have benefited from the open source tools available to us and feel it’s now our duty to share our expertise in machine learning with the open-source community.

Given the fundamental importance and widespread use of gradient boosting, we wanted to contribute to a core need and create something that's easy for data scientists to integrate with other machine learning frameworks. We anticipate that a great out-of-the-box gradient boosting tool will be widely used and highly beneficial to the community.

3. How do you hope CatBoost will impact the global tech community?

By making CatBoost available as an open-source library, we hope to enable data scientists and engineers to obtain highly accurate models with minimal effort, and ultimately to define a new standard of excellence in machine learning.

As a global technology company, we find it invaluable to contribute more broadly to the larger tech community. We hope to see CatBoost impact the tech community in a positive way across a variety of sectors and countries, whether that is for retail or insurance or any other commercial use. Yandex is privileged to have a wealth of developer talent in Russia, and it's important to us to share this expertise and help advance technology across the world.

4. What about the consumer-oriented part of Yandex - where does CatBoost fit there?

Machine learning, and in particular gradient boosting, is used for a wide range of Yandex products such as web and image search, advertising, personalisation, weather forecasting, speech recognition, and fraud and spam detection. As the successor to MatrixNet, CatBoost will be implemented across all of these areas in the near future.

5. How does CatBoost compare to its competitors?

There are a number of open source gradient boosting tools available. CatBoost differentiates itself in three ways: its ability to be used out-of-the-box without extensive hyperparameter tuning, its accuracy, and its ability to effectively leverage categorical features.

To illustrate its accuracy gains, CatBoost’s performance was compared to three competing libraries (LightGBM, XGBoost and H2O) by running tests on standard benchmark datasets. The comparison table on the CatBoost website shows how CatBoost’s performance stacks up against the other libraries. The log-loss values reported on test data, which measure how far the predicted probabilities deviate from the true labels (lower is better), are lowest for CatBoost.
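For readers unfamiliar with the metric, the following is a small sketch of how binary log-loss is computed, in plain Python. The function name, the clipping constant and the sample numbers are assumptions chosen for illustration; they are not taken from the CatBoost benchmarks.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary log-loss: -mean(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predicted probabilities of the positive class.
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]  # mostly right, and fairly sure about it
hedging = [0.5, 0.5, 0.5, 0.5]    # always predicts "50/50"

print("confident model:", log_loss(labels, confident))
print("hedging model:  ", log_loss(labels, hedging))
```

A model that always answers "50/50" scores ln 2 (about 0.693), while a model that assigns high probability to the correct label scores closer to zero. Confident wrong answers are penalised heavily, which is why the metric rewards well-calibrated probabilities and not just correct classifications.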

6. Can CatBoost integrate with deep learning frameworks like Google's TensorFlow and Apple's Core ML? If so, how does it work?

We want CatBoost to be easy to use, and integration is key to that goal. CatBoost can be integrated with any deep learning framework, including Google’s TensorFlow and Keras, as demonstrated in the accompanying tutorials, where TensorFlow-trained models for text provide inputs to CatBoost. Models trained by CatBoost can also be shipped on iOS devices via Apple’s Core ML framework. Thus, apps can be built with CatBoost-trained models, bringing intelligent features directly to customers’ devices. Data scientists can train a CatBoost model using Python or R scripts, or the command-line interface, and then convert it to Apple's Core ML format in order to ship it on Apple devices.

7. How does Yandex fit into the machine learning ecosystem?

For 20 years now, Yandex has been pioneering innovation in machine learning and artificial intelligence to build intelligent products and services that help consumers and businesses better navigate the online and offline world. Machine learning has been a focus for us both academically and in our products.

Yandex has always been excited to share its work and research with the machine learning ecosystem. We’re lucky enough to have access to some of the most talented engineers and data scientists in the world. And over the past 20 years, we’ve been contributing to the machine learning community. Whether it is teaching and welcoming new students to the Yandex School of Data Analysis, sharing papers at conferences, or publishing our research, we feel it’s our duty to contribute to innovation in the field. We have just celebrated the 10-year anniversary of the Yandex School of Data Analysis, a huge milestone.

We were an early leader in machine learning. We have been developing machine learning and big data processing technologies for our search engine and other internet services since Yandex was established in 1997. In 2009, we also developed our own proprietary machine learning method, MatrixNet, the predecessor to CatBoost. The open-sourcing of CatBoost marks another huge milestone for us. We’re excited to have open-sourced a tool that can be applied widely across the machine learning ecosystem.

8. Where does Yandex see AI evolving?

In the past few years, we have seen artificial intelligence and machine learning move from an area of study to technology that is used across a number of applications that people, both end users and businesses, utilise every day. And while deep learning has been one of the most buzzed-about methods in artificial intelligence and machine learning, we feel that gradient boosting is really one of the unsung heroes in this space. While deep learning frameworks focus very narrowly on certain types of tasks, we believe that the future of AI will require tools that can integrate these frameworks across a wide range of use cases. Gradient boosting may not be as flashy as some of the other methods (it’s been around for years), but it is the workhorse of the machine learning landscape. As this landscape evolves and the applications for artificial intelligence and machine learning increase, the need for industrial-grade gradient boosting tools like CatBoost will only grow.

Misha Bilenko, Head of machine intelligence and research, Yandex