Wednesday, September 5, 2012

Scikit-learn 0.12 released

Last night I uploaded the new version 0.12 of scikit-learn to PyPI. The updated website is also up and running, and development now starts toward 0.13.

The new release has some nifty new features (see whatsnew):
* Multidimensional scaling
* Multi-Output random forests (like these)
* Multi-task Lasso
* More loss functions for ensemble methods and SGD
* Better text feature extraction
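To give a flavor of two of the new features, here is a small sketch (with made-up data, so shapes and numbers are illustrative only): random forests can now fit a 2d target array directly, and MultiTaskLasso fits several regression tasks with joint sparsity across them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import MultiTaskLasso

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
Y = np.dot(X, rng.rand(5, 3))  # three related outputs per sample

# Multi-output random forest: pass a 2d y, and predict returns
# one row of outputs per sample.
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, Y)
print(forest.predict(X[:2]).shape)  # (2, 3)

# Multi-task Lasso: one coefficient row per task, with the same
# features selected (or dropped) across all tasks.
lasso = MultiTaskLasso(alpha=0.1).fit(X, Y)
print(lasso.coef_.shape)  # (3, 5)
```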

Even so, the majority of changes in this release are somewhat "under the hood".
Vlad developed and set up a continuous performance benchmark for the main algorithms during his Google Summer of Code. I am sure this will help improve performance further.
A lot of work on performance has already gone into this release, by Vlad, Immanuel, Gilles and others.

Another improvement was the introduction of a set of common tests that are applied to all our estimators. This led to some improvements in stability, but arguably more importantly to a more consistent interface, more robust input validation (checking that the test data has as many features as the training data, that there are as many labels as data points, etc.) and better error messages.
Work in this direction is not over, but I think much progress has been made. And while this is no shiny new algorithm, I think that an error message of the form

"A sparse matrix was passed, but dense data is required. Use X.todense() to convert to dense."
will help users a lot more than some "invalid index" error deep in the code (thanks, Gael).
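A hypothetical illustration of the kind of validation described above: fitting a dense-only estimator (GaussianNB, say, which I'm using here only as an example of an estimator that needs dense input) on a sparse matrix now fails up front with a readable message rather than an obscure indexing error.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.naive_bayes import GaussianNB  # expects dense arrays

X_sparse = sp.csr_matrix(np.eye(4))
y = np.array([0, 0, 1, 1])

try:
    GaussianNB().fit(X_sparse, y)
except TypeError as e:
    msg = str(e)

# The message tells you what went wrong and how to fix it,
# e.g. "A sparse matrix was passed, but dense data is required. ..."
print(msg)
```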

Even more behind the scenes, to make this possible, the API of scikit-learn objects is now a bit better defined and stricter.
The number of mixin classes, from which algorithms derive, has been extended to:

* ClusterMixin
* TransformerMixin
* ClassifierMixin
* RegressorMixin
* MetaEstimatorMixin (e.g. RFE, GridSearchCV; needs another estimator to be instantiated)

These now give a very good handle on how estimators behave and how they should be used (for example, clustering algorithms all implement "fit" and "fit_predict", but not necessarily "predict").
I think we are not far from a very unified interface with intuitive behavior and parameter names.
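As a small sketch of these conventions (using KMeans and LogisticRegression purely as examples): clustering estimators derive from ClusterMixin and offer fit_predict, while classifiers derive from ClassifierMixin and additionally offer predict on new data.

```python
import numpy as np
from sklearn.base import ClusterMixin, ClassifierMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [0.1], [10.0], [10.1]])

# Clustering: fit the model and get cluster labels in one step.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(len(labels))  # one label per sample: 4

# The mixins tell you what kind of estimator you are holding.
print(isinstance(km, ClusterMixin))                        # True
print(isinstance(LogisticRegression(), ClassifierMixin))   # True
```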


Not completely related to the release, but worth noting:
over the last couple of weeks I have had the feeling that more and more users are contributing, and several improvements in this release are due to first-time contributors.

Also, people from other packages have been reaching out to join forces.
I talked to Ryan Curtin from mlpack, and someone from shogun joined us on IRC to celebrate the release :) Even with different foci, I hope we can all collaborate a bit more in the future for even better software.
Btw, shogun also released yesterday, at the same time as we did. Congratulations!
They also have a pretty new website; be sure to check it out.

That's all, enjoy the release and give us a shout if you have any trouble.
