CatBoost is an open-source gradient boosting library with categorical features support.

Version 0.2 released

Sep 14, 2017

Over the last few weeks, the CatBoost team has implemented a number of improvements.

  • Training speedups: we have sped up training by 20-30%.
  • Accuracy improvement with categoricals: we have changed how statistics are computed for categorical features, which leads to better quality.
  • New type of overfitting detector: Iter. This detector type was requested by our users, so you can now stop training with a simple criterion: if your evaluation function has not improved after a fixed number of iterations, training stops (see the sketch below).
  • TensorBoard support: this is another way of looking at the graphs of different error functions, both during training and after training has finished. To see the metrics, provide train_dir when training your model and then run "tensorboard --logdir={train_dir}".

TensorBoard
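
Below is a minimal Python sketch showing both of the features above: the Iter overfitting detector and TensorBoard-compatible logging. The od_type, od_wait and train_dir parameter names follow the CatBoost Python API; the toy data and directory name are placeholders.

    from catboost import CatBoostClassifier, Pool

    # Toy data: two numeric features and a binary target (placeholder values).
    train_pool = Pool(data=[[1, 4], [2, 5], [3, 6], [4, 7]], label=[0, 0, 1, 1])
    eval_pool = Pool(data=[[2, 4], [3, 7]], label=[0, 1])

    model = CatBoostClassifier(
        iterations=500,
        od_type='Iter',        # the new Iter overfitting detector
        od_wait=20,            # stop if the eval metric has not improved for 20 iterations
        train_dir='train_dir'  # metric logs written here, readable by TensorBoard
    )
    model.fit(train_pool, eval_set=eval_pool)

Once training is running, "tensorboard --logdir=train_dir" in a shell shows the error curves as they evolve.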

  • Jupyter notebook improvements: for our Python library users who experiment with Jupyter notebooks, we have improved our visualization tool. It is now possible to save an image of the graph, and we have changed the scrolling behaviour to make scrolling through the notebook more convenient.
  • NaN features support: we have also added a simple but effective way of dealing with NaN features. If there are NaNs in the train set, they are mapped to a value that is less than the minimum or greater than the maximum value in the dataset (this is configurable), so they are guaranteed to fall into their own bin and a split can separate NaN values from all other values. By default, NaNs are not allowed, so you need to enable this with the nan_mode option. When applying a model, NaNs are treated the same way for the features where NaN values were seen in train. NaN values are not allowed in the test set for features that had no NaNs in train. (See the sketch after this list.)
  • Snapshotting: we have added snapshotting to our Python and R libraries. If your training might be interrupted, for example by a machine reboot, use the snapshot_file parameter; after you restart training, it will resume from the last completed iteration.
  • R library changes: we have changed the R library interface and added a tutorial.
  • Logging customization: we have added the allow_writing_files parameter. By default, some logging and diagnostic files are written to disk; you can turn this off by setting the flag to False.
  • Multiclass mode improvements: we have added a new objective for multiclass mode, MultiClassOneVsAll. We have also added the class_names parameter, so you no longer have to renumber your classes to use multiclass mode. Finally, we have added two new multiclass metrics, TotalF1 and MCC; you can use them to watch how their values change during training, for overfitting detection, or for truncating the model at the best value of a given metric.
  • Cross-validation parameter changes: we have changed the overfitting detector parameters of CV in Python so that they match those used in training.
  • CTR types: MeanValue has been renamed to BinarizedTargetMeanValue.
  • Any delimiter support: in addition to datasets in TSV format, CatBoost now supports files with any delimiter.
  • New model format: the CatBoost v0.2 model binary is not compatible with previous versions.
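
Here is a hedged sketch pulling several of these options together. The parameter names (delimiter, nan_mode, class_names, custom_metric, snapshot_file, allow_writing_files) follow the current CatBoost Python API and may differ slightly across versions; the file names, class labels and data are made-up placeholders.

    from catboost import CatBoostClassifier, Pool

    # Load a comma-separated file instead of the default tab-separated format;
    # 'train.csv' and the column description file 'train.cd' are placeholders.
    train_pool = Pool('train.csv', column_description='train.cd', delimiter=',')

    model = CatBoostClassifier(
        iterations=1000,
        loss_function='MultiClassOneVsAll',  # new multiclass objective
        class_names=['cat', 'dog', 'bird'],  # class labels used as-is, no renumbering
        custom_metric=['TotalF1', 'MCC'],    # new multiclass metrics
        nan_mode='Min'                       # NaNs get their own bin below the minimum
    )

    # If training is interrupted, for example by a reboot, rerunning this call
    # resumes from the last completed iteration stored in the snapshot file.
    model.fit(train_pool, save_snapshot=True, snapshot_file='training.snapshot')

    # Logging and diagnostic files can also be switched off entirely; note that
    # snapshotting needs to write to disk, so the two should not be combined.
    quiet_model = CatBoostClassifier(iterations=100, allow_writing_files=False)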

We have also improved the stability of the library.

Feel free to open issues on GitHub and contribute to the project!

CatBoost at ICML 2017

July 20, 2017

We will be happy to meet everyone at ICML in Sydney, Australia on August 6-11, 2017, where we will be showcasing CatBoost in the Yandex booth #16.

Our team will be there to showcase the usage and applications of our new gradient-boosting machine learning library. We’ll be happy to demonstrate training CatBoost on a variety of datasets, and go through the tricks CatBoost uses to work well on categorical features. You will learn how to access the CatBoost library from the command line, or via API for Python, sklearn, R or caret, and how to monitor training in iPython Notebook using our visualization tool CatBoost Viewer. We will also let you in on the secret of how to score well in a Kaggle contest!

ICML

We look forward to meeting you at our ICML stand in Sydney. Please drop by – we’ll even have some goodies to share!

Large Hadron Collider particle identification

July 18, 2017

The Large Hadron Collider beauty (LHCb) experiment is one of the four major experiments running at the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator, operating at the European Organization for Nuclear Research (CERN). In order to perform high-level physics measurements, scientists need to analyse data from particle collisions recorded at a rate of 40 million times per second.

These data represent “snapshots” of all the particles generated by collisions of LHC protons and flying through the volume of particle detectors placed around the proton-proton interaction region. In order to understand the entire picture of the underlying physics laws ruling the processes taking place in the collisions, it is extremely important to reconstruct the identity of each particle whose passage is recorded by the detectors. This is the main role of particle identification (PID) algorithms.

Collider

Fast, reliable and accurate PID algorithms are crucial to selecting interesting data, and their role is evident in almost all of the roughly 400 papers published by the LHCb collaboration, including its important discoveries.

To combine the information from the various subcomponents of the LHCb detector and achieve more efficient PID performance, a team from the Yandex School of Data Analysis proposed using the new CatBoost algorithm. CatBoost is well suited to improving the accuracy of the PID response because it handles different feature types (including binary observables) and formats with state-of-the-art precision. The algorithm meets LHCb requirements well, working seamlessly with all of these inputs.

The algorithm was trained using simulated collisions resembling those taking place at the LHCb proton-proton interaction point. The algorithm uses about 60 input features describing the geometrical position of the detected particles, the aggregated detector response and the kinematic properties of the detected tracks.

After its implementation and deployment into the LHCb codebase and event processing pipeline in June 2017, CatBoost's best-in-class performance proved to improve accuracy without compromising efficiency. Initial tests show encouraging improvements, relative to other available machine learning approaches, in identifying charged particles from the information they release in the LHCb detector. Ultimately, this new approach will lead to cleaner data for particle physics experiments, making physicists' work more efficient.

After seeing these initial positive results, the LHCb team is planning further experimentation with CatBoost in other LHCb projects.

CatBoost Now Available in Open Source

July 18, 2017

Today, we are open-sourcing our gradient boosting library CatBoost. It is well-suited for training machine learning models on tasks where data is heterogeneous, i.e., is described by a variety of inputs, such as contents, historical statistics and outputs of other machine learning models. The new gradient-boosting algorithm is now available on GitHub under Apache License 2.0.

Developed by Yandex data scientists and engineers, it is the successor of the MatrixNet algorithm that is used within the company for a wide range of tasks, ranging from ranking search results and advertisements to weather forecasting, fraud detection, and recommendations. In contrast to MatrixNet, which uses only numeric data, CatBoost can work with non-numeric information, such as cloud types or state/province. It can use this information directly, without requiring conversion of categorical features into numbers, which may yield better results compared with other gradient-boosting algorithms and also saves time. The range of CatBoost applications includes a variety of spheres and industries, from banking and weather forecasting, to recommendation systems and steel manufacturing.
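
As a brief illustration of this point, here is a minimal Python sketch in which string-valued categorical data is passed to CatBoost directly; the toy weather data is invented, and the cat_features parameter follows the CatBoost Python API.

    from catboost import CatBoostClassifier

    # Toy data mixing a categorical column (index 0) with a numeric one;
    # the string values are used as-is, with no manual encoding step.
    X_train = [['sunny', 21.0], ['cloudy', 15.5], ['rainy', 12.0], ['sunny', 25.0]]
    y_train = [1, 0, 0, 1]

    model = CatBoostClassifier(iterations=100)
    model.fit(X_train, y_train, cat_features=[0])  # column 0 is categorical

    print(model.predict([['cloudy', 14.0]]))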

CatBoost supports Linux, Windows, and macOS, and can be operated from the command line or via a user-friendly API for Python or R. In addition to open-sourcing our gradient-boosting algorithm, we are releasing our visualization tool CatBoost Viewer, which enables monitoring training processes in iPython Notebook or in a standalone mode. We are also equipping all CatBoost users with a tool for comparing results of popular gradient-boosting algorithms.

“Yandex has a long history in machine learning. We have the best experts in the field. By open-sourcing CatBoost, we are hoping that our contribution into machine learning will be appreciated by the expert community, who will help us to advance its further development,” says Misha Bilenko, Head of Machine Intelligence and Research at Yandex.

CatBoost has already been successfully tested in a variety of applications across a whole range of Yandex services, including weather forecasting for the Meteum technology, content ranking for the personal recommendations service Yandex Zen, and improving search results. Eventually, this algorithm will be rolled out to benefit the majority of Yandex services. Outside of Yandex, CatBoost is already being used by data scientists at the European Organization for Nuclear Research (CERN) to improve data processing performance in their Large Hadron Collider beauty experiment.