Sunday, January 27, 2013

pystruct: more structured prediction with python

A while ago I wrote about pystruct, a structured learning project I have been working on.
After a break from it, I picked up work on structured SVMs again, and I think it has come quite a long way in the last couple of weeks. So here is a quick update on what you can do with it.

To the best of my knowledge this is the only tool with ready-to-use functionality to learn structural SVMs (or max-margin CRFs) on loopy graphs - even though this is pretty standard in the (computer vision) literature.

Friday, January 25, 2013

Machine Learning Cheat Sheet (for scikit-learn)

As you hopefully have heard, we at scikit-learn are doing a user survey (which is still open by the way).
One of the requests there was to provide some sort of flow chart on how to do machine learning.

As this is clearly impossible, I went to work straight away.

This is the result:



[edit2]
clarification: By ensemble classifiers and ensemble regressors I mean random forests, extremely randomized trees, gradient boosted trees, and the soon-to-come weight boosted trees (adaboost).
[/edit2]
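
For reference, here is a small sketch of how the estimators named in the clarification map to classes in sklearn.ensemble. It assumes a current scikit-learn install (AdaBoost only arrived in a later release than the one this post was written against), and the toy dataset and parameter values are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

# Toy data, purely illustrative.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The four families of "ensemble classifiers" meant on the sheet.
ensembles = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extremely randomized trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
    "gradient boosted trees": GradientBoostingClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

# Training accuracy only; a real comparison would use held-out data.
scores = {name: est.fit(X, y).score(X, y) for name, est in ensembles.items()}
```

Each classifier has a regressor counterpart in the same module (for example RandomForestRegressor).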

Needless to say, this sheet is completely authoritative.
 

Monday, January 21, 2013

Scikit-Learn 0.13 released! We want your feedback.

After a little delay, the team finished work on the 0.13 release of scikit-learn.
There is also a user survey that we launched in parallel with the release, to get some feedback from our users.

There is a list of changes and new features on the website.
You can upgrade using pip or easy_install:

pip install -U scikit-learn
or
easy_install -U scikit-learn


There were more than 60 people contributing to this release, with 24 people having 10 commits or more.

Again, many improvements are behind the scenes or only slightly noticeable. We improved test coverage a lot and we have much more consistent parameter names now. There is now also a user guide entry for the classification metrics, and their naming was improved.

This was one of the many improvements by Arnaud Joly, who joined the project very recently but nevertheless wound up being the one with the second most commits in this release!

Now let me get to some of the more visible highlights of this release from my perspective:

- Thanks to Lars and Olivier, the Hashing Trick finally made it into scikit-learn.
  This allows for very fast vectorization of large text corpora and stateless transformers for the same.

- Sample weights were added to the tree module thanks to Noel and Gilles. This enabled the implementation of a smarter resampling for random forests, which leads to a speed-up of random forests of up to a factor of two! Also, this is the basis of including AdaBoost with Trees in the next release.

- I added a method to use totally randomized trees for hashing / embedding features to a high-dimensional, sparse binary representation. It goes along the lines of my last blog post on using non-linear embeddings followed by simple linear classifiers.

- I also added Nystroem kernel approximations, which are really easy to do but should come in quite handy. They still need some more work, though. For details, see my post on kernel approximations.
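
As a rough illustration of two of the highlights above, here is a minimal sketch of the hashing vectorizer and the random trees embedding. The class names are the ones in current scikit-learn; the example documents, the toy dataset, and all parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.feature_extraction.text import HashingVectorizer

# Hashing trick: the transformer is stateless, so there is no fit step
# and no vocabulary to keep in memory.
docs = ["the hashing trick is fast", "no vocabulary is kept in memory"]
vectorizer = HashingVectorizer(n_features=2 ** 10)
X_text = vectorizer.transform(docs)  # sparse matrix with 1024 columns

# Totally randomized trees: each sample is encoded by the leaf it lands in,
# one active entry per tree, giving a sparse binary representation.
X, y = make_circles(n_samples=100, factor=0.5, noise=0.05, random_state=0)
embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
X_sparse = embedder.fit_transform(X)
```

The sparse output of either transformer can be fed directly to a linear classifier.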


Thanks to the team for working on this together. I am really happy with the way everybody joins forces. This is an amazing project!

Wednesday, December 26, 2012

Kernel Approximations for Efficient SVMs (and other feature extraction methods) [update]

Recently we added another method for kernel approximation, the Nyström method, to scikit-learn, which will be featured in the upcoming 0.13 release.
Kernel approximations were my first somewhat bigger contribution to scikit-learn and I have been thinking about them for a while.
To dive into kernel approximations, first recall the kernel trick.
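
To give a feeling for how this is used in practice, here is a minimal sketch that puts a Nyström approximation of an RBF kernel in front of a linear SVM. The Nystroem class is the one in sklearn.kernel_approximation; the toy dataset and the values of gamma and n_components are assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy two-class problem that is not linearly separable in input space.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

# Build an approximate RBF feature map from 50 sampled training points,
# then train a linear SVM on the approximate features.
model = make_pipeline(
    Nystroem(kernel="rbf", gamma=2.0, n_components=50, random_state=0),
    LinearSVC(),
)
model.fit(X, y)

# Training accuracy only; use held-out data for a real evaluation.
train_accuracy = model.score(X, y)
```

The point of the design is that training cost stays that of a linear model, while n_components controls how closely the feature map follows the exact kernel.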

Saturday, December 15, 2012

Another look at MNIST

I'm a bit obsessed with MNIST.
Mainly because I think it should not be used in any papers any more - it is weird for a lot of reasons.
When preparing the workshop we held yesterday I noticed one that I wasn't aware of yet: most of the 1-vs-1 subproblems are really easy!

Basically all pairs of numbers can be separated perfectly using a linear classifier!
And even if you just do a PCA to two dimensions, they can pretty much still be linearly separated! It doesn't get much easier than that. This makes me even more sceptical about "feature learning" results on this dataset.

To illustrate my point, here are all pairwise PCA projections. The image is pretty huge. Otherwise you wouldn't be able to make out individual data points.
You can generate it using this very simple gist.
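
The gist itself is not reproduced here. The following is a minimal sketch of the same idea under some loud assumptions: it uses the small 8x8 digits dataset that ships with scikit-learn instead of MNIST (so nothing needs to be downloaded), it skips the plotting, and it adds a logistic regression on the 2D projection as a rough stand-in for "can be linearly separated".

```python
from itertools import combinations

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Assumption: load_digits (8x8 images) stands in for MNIST here.
digits = load_digits()
X, y = digits.data, digits.target

pair_accuracy = {}
for a, b in combinations(range(10), 2):
    # Keep only the samples of the two classes in this 1-vs-1 subproblem.
    mask = (y == a) | (y == b)
    # Project this pair onto its own first two principal components.
    X_2d = PCA(n_components=2).fit_transform(X[mask])
    # Training accuracy of a linear classifier in the 2D projection.
    clf = LogisticRegression().fit(X_2d, y[mask])
    pair_accuracy[(a, b)] = clf.score(X_2d, y[mask])

# The pair that is least separable after projecting to two dimensions.
hardest = min(pair_accuracy, key=pair_accuracy.get)
```

Scatter-plotting X_2d for every pair in a grid gives a picture of the kind shown here.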



There are some classes that are not obviously separated: 3 vs 5, 4 vs 9, 5 vs 8 and 7 vs 9. But keep in mind, this is just a PCA to two dimensions. It doesn't mean that they couldn't be separated linearly in the original space.

Interestingly the "1"s are very easy to identify, even with seven and nine there is basically no way to confuse them. The ones have a somewhat peculiar shape, though. It would be fun to see what a tour along the "bow" (see img at [2, 2]) would look like.
Manifold-people should be delighted ;)

I think this plot emphasizes again: look at your data!
I hope you enjoyed this perspective.

Friday, December 14, 2012

Workshop on Python, Machine Learning and Scikit-Learn

Today there was a workshop at my uni, organized by my Professor Sven Behnke, together with my colleagues Hannes Schulz, Nenard Birešev and me.

The target group was a local graduate school with a general scientific background, but not much CS or machine learning.

The workshop consisted of us explaining the methods and the students then playing around with them and answering some questions using IPython notebooks that we provided (if you still don't know about IPython Notebooks, watch this talk now).

Using the notebooks worked out great! There is only so much you can teach in a 5-hour workshop, but I think we got across some basic concepts of machine learning and working with data in Python.

We got some positive feedback and the students really went exploring.
We covered PCA, k-means, linear regression, logistic regression and nearest neighbors, including some real-world examples.


You can find all resources, including tex and notebooks for generating figures etc. on github.

You are welcome to reuse our material, though dropping us a line would be nice.

I haven't asked my coauthors about licensing but I think it shouldn't be a problem as long as you attribute.

Tuesday, November 6, 2012

A Wordcloud in Python

Last week I was at Pycon DE, the German Python conference. After hacking on scikit-learn a lot last week, I decided to do something different on my way back, something I had planned for quite a while:
doing a Wordle-like word cloud.

I know, word clouds are a bit out of style but I kind of like them anyway. My motivation to think about word clouds was that I thought these could be combined with topic models to give somewhat more interesting visualizations.

So I looked around to find a nice open-source implementation of word clouds ... only to find none. (This was a while ago; maybe it has changed since.)

While I was bored on the train last week, I came up with this code.
A little today-themed taste: