CatBoost is an open-source gradient boosting library with support for categorical features


New ways to explore your data

Apr 20, 2018
It’s time to release CatBoost v0.8. This release focuses on efficient tools for data and model exploration.

First of all, CatBoost now calculates per-object feature importances using the SHAP values algorithm from the ‘Consistent feature attribution for tree ensembles’ paper. As you can see in the picture below, it is very easy to understand the influence of each feature on a given object. See the tutorial for more details.

SHAP values

Secondly, CatBoost now has a new algorithm for finding the most influential training samples for a given object. This mode calculates the effect of objects from the train dataset on the optimized metric values for the objects from the input dataset:
- Positive values reflect that the optimized metric increases.
- Negative values reflect that the optimized metric decreases.
The higher the deviation from 0, the bigger the impact an object has on the optimized metric. The method is an implementation of the approach described in the ‘Finding Influential Training Samples for Gradient Boosted Decision Trees’ paper. See the get_object_importance model method in the Python package and the ostr mode in the CLI version. A tutorial for Python is also available.

The third cool feature in the 0.8 release is ‘save model as code’. For now you can save a model as Python code with categorical features, and as C++ code without categorical features (categorical features support for C++ is coming soon). Use --model-format CPP,Python in the CLI version and model.save_model(OUTPUT_PYTHON_MODEL_PATH, format="python") in Python.

For more details, check out the release notes on GitHub.

As usual, we are eager to see your feedback and contributions.

CatBoost on GPU talk at GTC 2018

Mar 26, 2018
Vasily Ershov, CatBoost lead developer, will talk at GTC 2018 about the fastest gradient boosting implementation on GPU.

He'll give a brief overview of problems that can be solved with CatBoost, discuss challenges and key optimizations in the most significant computation blocks, and describe how one can efficiently build histograms in shared memory to construct decision trees while avoiding atomic operations during this step. He'll also show benchmarks demonstrating that our GPU implementation is 5 to 40 times faster than the CPU one. Finally, he'll compare performance against GPU implementations of gradient boosting in other open-source libraries.

GPU comparison
The picture compares GPU learning speed of CatBoost, XGBoost and LightGBM on the Epsilon dataset. Note that the XGBoost GPU implementation doesn’t support V100 cards.

The talk is scheduled for March 27, 1:00 PM, Room 231. Don’t miss the chance to hear about a best-in-class gradient boosting implementation on GPU and to ask any questions.

Best in class inference and a ton of speedups

Jan 31, 2018
CatBoost version 0.6 brings a lot of speedups and improvements. The most valuable one at the moment is the release of the industry's fastest inference implementation.

FastInference

Fast inference

CatBoost uses oblivious trees as base predictors. In oblivious trees each leaf index can be encoded as a binary vector with length equal to the depth of the tree. This fact is widely used in the CatBoost model evaluator: we first binarize all used float features, statistics and one-hot encoded features, and then use these binary features to calculate model predictions. The vectors can be built in a data-parallel manner with SSE intrinsics. This results in a much faster applier than all existing ones, as shown in our comparison below.
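The leaf-index trick above can be sketched in pure Python. This is an illustrative toy, not CatBoost's actual SSE implementation; the function and variable names are invented for the example:

```python
# Illustrative sketch of oblivious-tree evaluation: every level of the
# tree applies the same split to all objects, so the leaf index is just
# the per-level comparison bits packed into an integer.
def oblivious_tree_predict(x, splits, leaf_values):
    """splits: list of (feature_index, border) pairs, one per level;
    leaf_values: 2**depth leaf values."""
    index = 0
    for level, (feature, border) in enumerate(splits):
        bit = 1 if x[feature] > border else 0  # binarized feature
        index |= bit << level                  # pack bit into leaf index
    return leaf_values[index]

splits = [(0, 0.5), (1, 2.0)]        # a depth-2 tree
leaves = [10.0, 20.0, 30.0, 40.0]    # 4 leaves
print(oblivious_tree_predict([0.7, 1.0], splits, leaves))  # -> 20.0
```

Because each level's bit depends only on one binarized feature, all objects can compute their bits in parallel, which is what the SIMD applier exploits.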

CatBoost applier vs LightGBM vs XGBoost

We used LightGBM, XGBoost and CatBoost models for the Epsilon (400K samples, 2000 features) dataset, trained as described in our previous benchmarks. For each model we limit the number of trees used for evaluation to 8000 to make the results comparable, for the reasons described above. Thus this comparison only gives some insight into how fast the models can be applied. For each algorithm we loaded the test dataset in Python, converted it to the algorithm's internal representation, and measured the wall time of model predictions on an Intel Xeon E5-2660 CPU with 128GB RAM. The results are presented in the table below.

              1 thread         32 threads
XGBoost       71 sec (x39)     4.5 sec (x31)
LightGBM      88 sec (x48)     17.1 sec (x118)
CatBoost      1.83 sec         0.145 sec

From this we can see that for ensembles of similar size, CatBoost can be applied about 35 and 83 times faster than XGBoost and LightGBM respectively.

Speedups

The CatBoost team has spent a lot of effort speeding up different parts of the library. The current list is below:
  • 43% speedup for training on large datasets.
  • 15% speedup for QueryRMSE and calculation of querywise metrics.
  • Large speedups when using binary categorical features.
  • Significant (x200 on a 5k-tree model and a 50k-line dataset) speedup for plot and staged predict calculations in the cmdline version.
  • Compilation time speedup.
Please note that we have added many synonyms for our parameter names, so it is now more convenient to try CatBoost if you are used to another library.

Other improvements and bug fixes, as well as builds, can be found in the release on GitHub.

Feel free to file an issue or contribute to the project.

Extremely fast learning on GPU has arrived!

Nov 2, 2017

We're excited to announce the CatBoost 0.3 release with GPU support. It's incredibly fast!

  • Training on a single GPU outperforms CPU training by up to 40x on large datasets.
  • CatBoost efficiently supports multi-card configurations. A single server with 8 GPUs outperforms a configuration with hundreds of CPUs in execution time.
  • We compared our GPU implementation with competitors: it is 2 times faster than LightGBM and more than 20 times faster than XGBoost.
  • Finally, the CatBoost GPU Python wrapper is very easy to use.

To back up our words, we prepared the set of benchmarks below. There we compared:

  • CPU vs GPU training speed of CatBoost
  • GPU training performance of CatBoost, XGBoost and LightGBM

CatBoost CPU vs. GPU

Configuration: dual-socket server with 2 Intel Xeon CPU (E5-2650v2, 2.60GHz) and 256GB RAM

Methodology: CatBoost was run in 32 threads (equal to the number of logical cores). The GPU implementation was run on several servers with different GPU types. Our GPU implementation doesn’t require a multi-core server for high performance, so different CPUs and machines should not significantly affect the GPU benchmark results.

Dataset #1: Criteo (36M samples, 26 categorical, 13 numerical features) to benchmark our categorical features support.

Type                   128 bins (sec)
CPU                    1060
K40                    373
GTX 1080Ti (11GB)      301
2x GTX 1080 (8GB)      285
P40                    123
P100-PCI               82
V100-PCI               69.8

As you can see, the CatBoost GPU version significantly outperforms CPU training even on an old-generation GPU (Tesla K40), and gains an impressive 15x speedup on the flagship NVIDIA V100 card.

Dataset #2: Epsilon (400K samples, 2000 features), to benchmark our performance on a dense numerical dataset. In the table we report training time for different levels of binarization: the default 128 bins, and 32 bins, which is often sufficient. Note that the Epsilon dataset does not have enough samples to fully utilize the GPU; on bigger datasets we observe up to 40x speedups.

Type                   128 bins (sec)    32 bins (sec)
CPU                    713               653
K40                    547               248
GTX 1080 (8GB)         194               120
P40                    162               91
GTX 1080Ti (11GB)      145               88
P100-PCI               127               70
V100-PCI               77                49

GPU training performance: comparison with baselines

Configuration: NVIDIA P100 accelerator, dual-socket Intel Xeon E5-2660 CPU and 128GB RAM

Dataset: Epsilon (400K samples for train, 100K samples for test).

Libraries: CatBoost, LightGBM, XGBoost (we use the histogram-based version of XGBoost, since the exact version is very slow)

Methodology: We measured the mean tree construction time one can achieve without using feature subsampling and/or bagging. For XGBoost and CatBoost we used the default tree depth of 6; for LightGBM we set the leaf count to 64 to make the results more comparable. We set the bin count to 15 for all three methods. This bin count gives the best performance and the lowest memory usage for LightGBM and CatBoost (a bin count of 128-255 usually makes both algorithms run 2-4 times slower). For XGBoost we could use an even smaller bin count, but the performance gains compared to 15 bins are too small to matter. All algorithms were run with 16 threads, equal to the hardware core count.

By default CatBoost uses a bias-fighting scheme. This scheme is by design 2-3 times slower than the classical boosting approach. The CatBoost GPU implementation contains a mode based on the classic scheme for those who need the best training performance. We used the classic-scheme mode in our benchmark.

AUCvsNumber

Figure 1. AUC vs Number of trees

AUCvsTime

Figure 2. AUC vs Time

We set the learning rate so that the algorithms start to overfit after approximately 8000 rounds (the learning curves are displayed in the figures above; the quality of the obtained models differs by approximately 0.5%). We measured the time to train ensembles of 8000 trees. The mean tree construction time was 17.9ms for CatBoost, 488ms for XGBoost, and 40ms for LightGBM. As you can see, CatBoost is 2 times faster than LightGBM and 20 times faster than XGBoost.

Don’t forget to examine the CatBoost GPU documentation. As usual, you can find all the code on GitHub.

Any contribution and issues are appreciated!

Version 0.2 released

Sep 14, 2017

In the last few weeks, the CatBoost team has implemented a bunch of improvements.

  • Training speedups: we have sped up training by 20-30%.
  • Accuracy improvement with categoricals: we have changed computation of statistics for categorical features, which leads to better quality.
  • New overfitting detector type: Iter. This detector type was requested by our users. Now you can also stop training by a simple criterion: if there is no improvement of your evaluation function after a fixed number of iterations.
  • TensorBoard support: this is another way of looking at the graphs of different error functions, both during training and after it has finished. To see the metrics, provide train_dir when training your model and then run "tensorboard --logdir={train_dir}".

TensorBoard

  • Jupyter notebook improvements: for our Python library users who experiment with Jupyter notebooks, we have improved our visualisation tool. It is now possible to save an image of the graph. We have also changed the scrolling behaviour to make notebook scrolling more convenient.
  • NaN features support: we have also added a simple but effective way of dealing with NaN features. If there are NaNs in the train set, they are mapped to a value that is less than the minimum or greater than the maximum value in the dataset (this is configurable), so they are guaranteed to fall into their own bin and a split can separate NaN values from all other values. By default no NaNs are allowed, so you need to use the nan_mode option for this. When applying a model, NaNs are treated the same way for features where NaN values were seen in train. NaN values are not allowed in test for a feature if no NaNs were provided in train for it.
  • Snapshotting: we have added snapshotting to our Python and R libraries. If you think something might happen to your training, for example the machine might reboot, you can use the snapshot_file parameter; after you restart training, it will continue from the last completed iteration.
  • R library changes: we have changed an R library interface and added tutorial.
  • Logging customization: we have added the allow_writing_files parameter. By default some files with logging and diagnostics are written to disk, but you can turn this off by setting the flag to False.
  • Multiclass mode improvements: we have added a new objective for multiclass mode, MultiClassOneVsAll. We have also added the class_names parameter, so you no longer have to renumber your classes to use multiclass mode, and two new multiclass metrics: TotalF1 and MCC. You can use these metrics to watch how their values change during training, for overfitting detection, or for cutting the model at the best value of a given metric.
  • Cross-validation parameter changes: we have changed the overfitting detector parameters of CV in Python so that they match those used in training.
  • CTR types: the MeanValue CTR type has been renamed to BinarizedTargetMeanValue.
  • Any delimiter support: in addition to datasets in TSV format, CatBoost now supports files with any delimiter.
  • New model format: the CatBoost v0.2 model binary is not compatible with previous versions.

We have also improved the stability of the library.

Feel free to write us with issues on GitHub and contribute to the project!

CatBoost at ICML 2017

July 20, 2017

We will be happy to meet everyone at ICML in Sydney, Australia on August 6-11, 2017, where we will be showcasing CatBoost in the Yandex booth #16.

Our team will be there to showcase the usage and applications of our new gradient-boosting machine learning library. We’ll be happy to demonstrate training CatBoost on a variety of datasets, and go through the tricks CatBoost uses to work well on categorical features. You will learn how to access the CatBoost library from the command line, or via API for Python, sklearn, R or caret, and how to monitor training in iPython Notebook using our visualization tool CatBoost Viewer. We will also let you in on the secret of how to score well in a Kaggle contest!

ICML

We look forward to meeting you at our ICML stand in Sydney. Please drop by – we’ll even have some goodies to share!

Large Hadron Collider particle identification

July 18, 2017

The Large Hadron Collider beauty (LHCb) experiment is one of the four major experiments running at the Large Hadron Collider (LHC), the world’s largest and most powerful particle accelerator, operating at the European Organization for Nuclear Research (CERN). In order to perform high-level physics measurements, scientists need to analyse data from particle collisions recorded at a rate of 40 million times per second.

These data represent “snapshots” of all the particles generated by collisions of LHC protons and flying through the volume of particle detectors placed around the proton-proton interaction region. In order to understand the entire picture of the underlying physics laws ruling the processes taking place in the collisions, it is extremely important to reconstruct the identity of each particle whose passage is recorded by the detectors. This is the main role of particle identification (PID) algorithms.

Collider

Fast, reliable and accurate PID algorithms are crucial to selecting interesting data. In almost all of the 400 or so papers published by the LHCb collaboration, these aspects of PID algorithms have played a crucial role in important discoveries.

To combine the information from the various subcomponents of the LHCb detector in an effort to achieve a more efficient PID performance, a team from the Yandex School of Data Analysis proposed the use of the new algorithm CatBoost. CatBoost is well suited to improving the accuracy of the PID response because it handles different feature types (including binary observables) and formats with state-of-the-art precision. The algorithm ideally meets LHCb requirements, working as a seamless complement with all inputs.

The algorithm was trained using simulated collisions resembling those taking place at the LHCb proton-proton interaction point. The algorithm uses about 60 input features describing the geometrical position of the detected particles, the aggregated detector response and the kinematic properties of the detected tracks.

After its implementation and deployment into the LHCb codebase and event processing pipeline in June 2017, CatBoost’s best-in-class performance proved to improve accuracy without compromising efficiency. Initial tests show encouraging improvements in the identification of charged particles from the information they release in the LHCb detector, with respect to other machine learning approaches available on the market. Ultimately, this new approach will lead to cleaner data for all particle physics experiments, making physicists’ work more efficient.

After seeing these initial positive results, the LHCb team is planning further experimentation with CatBoost in other LHCb projects.

CatBoost Now Available in Open Source

July 18, 2017

Today, we are open-sourcing our gradient boosting library CatBoost. It is well-suited for training machine learning models on tasks where data is heterogeneous, i.e., is described by a variety of inputs, such as contents, historical statistics and outputs of other machine learning models. The new gradient-boosting algorithm is now available on GitHub under Apache License 2.0.

Developed by Yandex data scientists and engineers, it is the successor of the MatrixNet algorithm that is used within the company for a wide range of tasks, ranging from ranking search results and advertisements to weather forecasting, fraud detection, and recommendations. In contrast to MatrixNet, which uses only numeric data, CatBoost can work with non-numeric information, such as cloud types or state/province. It can use this information directly, without requiring conversion of categorical features into numbers, which may yield better results compared with other gradient-boosting algorithms and also saves time. The range of CatBoost applications includes a variety of spheres and industries, from banking and weather forecasting, to recommendation systems and steel manufacturing.

CatBoost supports Linux, Windows and macOS and can also be operated from a command line or via a user-friendly API for Python or R. In addition to open-sourcing our gradient-boosting algorithm, we are releasing our visualization tool CatBoost Viewer, which enables monitoring training processes in iPython Notebook or in a standalone mode. We are also equipping all CatBoost users with a tool for comparing results of popular gradient-boosting algorithms.

“Yandex has a long history in machine learning. We have the best experts in the field. By open-sourcing CatBoost, we are hoping that our contribution into machine learning will be appreciated by the expert community, who will help us to advance its further development,” says Misha Bilenko, Head of Machine Intelligence and Research at Yandex.

CatBoost has already been successfully tested in a variety of applications across a whole range of Yandex services, including weather forecasting for the Meteum technology, content ranking for the personal recommendations service Yandex Zen, and improving search results. Eventually, this algorithm will be rolled out to benefit the majority of Yandex services. Outside of Yandex, CatBoost is already being used by data scientists at the European Organization for Nuclear Research (CERN) to improve data processing performance in their Large Hadron Collider beauty experiment.