Today, we are open-sourcing our gradient-boosting library, CatBoost. It is well suited to training machine learning models on heterogeneous data, i.e., data described by a variety of inputs such as content, historical statistics, and the outputs of other machine learning models. The new gradient-boosting algorithm is now available on GitHub under the Apache License 2.0.
Developed by Yandex data scientists and engineers, it is the successor to the MatrixNet algorithm, which is used within the company for a wide range of tasks, from ranking search results and advertisements to weather forecasting, fraud detection, and recommendations. In contrast to MatrixNet, which uses only numeric data, CatBoost can work with non-numeric information, such as cloud types or state/province. It can use this information directly, without requiring categorical features to be converted into numbers first, which saves time and may yield better results than other gradient-boosting algorithms. CatBoost can be applied across a variety of fields and industries, from banking and weather forecasting to recommendation systems and steel manufacturing.
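To illustrate the conversion step mentioned above, here is a minimal sketch in plain Python of one-hot encoding, one common way to turn a categorical feature into numbers before training a model that accepts only numeric data. The function name `one_hot_encode` and the sample weather observations are hypothetical and used for illustration only; this is not CatBoost code, and it makes no claim about how CatBoost handles categories internally.

```python
def one_hot_encode(rows, column):
    """Replace one categorical column with a 0/1 indicator column per category.

    Hypothetical helper for illustration; not part of any library.
    """
    # Collect the distinct category values, sorted for a stable column order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        new_row = {key: value for key, value in row.items() if key != column}
        # Add one indicator column per category.
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up sample data: one numeric feature and one categorical feature.
observations = [
    {"temperature": 21.5, "cloud_type": "cumulus"},
    {"temperature": 12.0, "cloud_type": "stratus"},
    {"temperature": 18.3, "cloud_type": "cirrus"},
    {"temperature": 19.9, "cloud_type": "cumulus"},
]

encoded_observations = one_hot_encode(observations, "cloud_type")
for row in encoded_observations:
    print(row)
```

Each distinct cloud type becomes its own 0/1 column, so the number of columns grows with the number of categories. This is the kind of manual preprocessing that, according to the description above, CatBoost makes unnecessary.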
CatBoost supports Linux, Windows, and macOS, and can be run from the command line or through a user-friendly API for Python or R. In addition to open-sourcing our gradient-boosting algorithm, we are releasing our visualization tool, CatBoost Viewer, which makes it possible to monitor training in IPython Notebook or in standalone mode. We are also providing all CatBoost users with a tool for comparing the results of popular gradient-boosting algorithms.
“Yandex has a long history in machine learning. We have the best experts in the field. By open-sourcing CatBoost, we are hoping that our contribution to machine learning will be appreciated by the expert community, who will help us advance its development,” says Name Surname, position at Yandex.
CatBoost has already been successfully tested in a variety of applications across a whole range of Yandex services, including weather forecasting for the Meteum technology, content ranking for the personal recommendations service Yandex Zen, and improving search results. Eventually, this algorithm will be rolled out to benefit the majority of Yandex services. Outside of Yandex, CatBoost is already being used by data scientists at the European Organization for Nuclear Research (CERN) to improve data processing performance in their Large Hadron Collider beauty experiment.