On July 25th, I’ll be presenting at the Seattle R Meetup about implementing Bayesian nonparametrics in R. If you’re not sure what Bayesian nonparametric methods are, they’re a family of methods that allow you to fit traditional statistical models, such as mixture models or latent factor models, without having to fully specify the number of clusters or latent factors in advance.

Instead of predetermining the number of clusters or latent factors to prevent a statistical algorithm from using as many clusters as there are data points in a data set, Bayesian nonparametric methods prevent overfitting by using a family of flexible priors, including the Dirichlet Process, the Chinese Restaurant Process or the Indian Buffet Process, that allow for a potentially infinite number of clusters, but nevertheless favor using a small numbers of clusters unless the data demands using more.

This flexibility has several virtues:

- You don’t have to decide how many clusters are present in your data before seeing the results of your analysis. You just specify the strength of a penalty for creating more clusters.
- The results of a fitted Bayesian nonparametric model can be used as informed priors when you start processing new data. This is particularly important if you are writing an online algorithm: the use of a prior like the Chinese Restaurant Process that, in principle, allows the addition of new clusters at any time means that you can take advantage of all the information you have about existing clusters without arbitrarily forcing new data into those clusters: if a new data point is unique enough, it will create its own cluster.

To give you a sense of how Bayesian nonparametric methods work, I recently coded up the newly published dp-Means algorithm, which provides a deterministic implementation of a k-means like algorithm that allows for a potentially infinite number of clusters, but, in practice, only uses a small number of clusters by penalizing the creation of additional clusters. This penalty comes from a parameter, `lambda`

, that the user specifies before fitting the model.

The series of images below shows how the dp-Means algorithm behaves when you turn it loose on a data set with four distinct Gaussian clusters. Each image represents a higher penalty for using more clusters.

As you can see, you can shift the model from a preference for using many clusters to using only one by manipulating the lambda parameter. If you’re interested in learning more and will be in the Seattle area in late July, come out.

This looks very interesting, thanks for the post. Just a question, as someone just delving into the Bayesian viewpoint, shouldn’t it also be possible to make it completely ‘nonparametric’ without even needing to specify lambada? E.g. use some information criteria like AIC to select the most informative value of lambada that would then in turn penalize models that add more complexity than ‘information’.

Your thoughts on this would be much appreciated :)

Put a hyperprior on lambda (typically exponential, although a gamma might make more sense, as it would tend towards a mode if there is one) and simulate until convergence between the model and the updates.

To do this on-line you’d want to do mini-batches with resampling, I’d suppose, but the idea is the same.

Upon thinking about the advantages of DP-means over sampling and variational approaches, I would imagine that for the first batch of information, the investigator would run until convergence with a range of values for lambda, then choose the one that minimizes the ratio of the within-clusters sum of squares to the between-clusters sum of squares. Silhouette width, dip tests, etc. would be alternative ways to do the same thing, but for a mixture of Gaussians, the partitioning of the variance is arguably the most useful and universal way to choose a parameter. This could obviously be extended across batches.

Beautiful pics!

I don’t understand the λ=0 case though. Am I correct in thinking that the colours do get re-used and this is a cluster for every dot?

Would love to see your slides after you’ve given the talk.

Hi Pete,

Tim is right in thinking that the standard Bayesian way to avoid setting lambda would involve setting a hyperparameter. While I’m sympathetic to your idea that it would be nice to use something like AIC to avoid setting any parameters, that’s actually not how Bayesian modeling is typically done. Bayesian models gather strength from making some assumptions before you see data: as Dave Blei likes to say, every Bayesian model has a moment of truth when you have to hardcode some parameters into the model.

That said, there are empirical approaches like those Tim describes in his second comment. Those ideas are surprisingly hard to articulate in terms of Bayesian models.

Isomorphismes, I think you’re right that the lambda = 0 case is one in which every point has its own cluster. You can confirm that by looking at the original paper or just running the code I’ve posted.

I’ll post the slides for my talk online once the talk is done.

Hi John,

Thanks for your post. Regarding DP-Means, it’s not immediately obvious what advantage it offers over K-Means, as lambda seems to be monotonically related to the number of clusters. Does DP-Means offer some other advantages?

Chris

Chris,

You’ve got a very good point. I was wondering when someone was going to call me on praising dp-Means over k-Means when you could get the same results using k-Means. Here are some responses:

(1) The dp-Means algorithm’s derivation sets it in a context that’s more easily extended than the k-Means algorithm’s standard derivations.

(2) Lambda does uniquely determine the number of clusters in a fixed data set, but different data sets will get different numbers of clusters for the same lambda. In short, lambda doesn’t hard code the decision about how many clusters will show up, although it does uniquely determine the outcome.

(3) Lambda is continous, so you might hope to try some sort of gradient descent trick inside a resampling loop as a way of setting lambda to an optimal value.

Thanks for your response John.

I always keep an eye out for practical model selection approaches. At first glance, the DP-Means algorithm sounded like it offered something new. Based on your description though, it sounds like K-Means and DP-Means are practically equivalent. Both algorithms yield a countable family of partitions indexed by a single parameter that spans the range of possible clusters. So the model selection problem presented by K-Means remains with DP-Means. It will be interesting to see if anyone expands on this line of thinking to pursue ideas such as your third point. The computational expense of model selection is something that often seems to get swept under the rug.

John, thanks for the post and the pointer to DP-means. I coded this in Python and ran a few experiments comparing timing the behavior of error with k and lambda. I am curious about your point (3) in the comment response above, but don’t see any hints in the error vs. lambda behavior that indicates optimal. I wrote my experiments up on my blog at http://blog.drskippy.com/2012/07/06/comparing-k-means-and-dp-means-clustering/.

Hi Scott,

Your blog post looks really great. Mind if I pull over your Python code in my dp-Means implementation repo?

Optimality for clustering is a very tricky thing to define. From what I gather from a quick reading of your blog post, the reason that you’re not seeing anything interesting going on as you decrease lambda or increase k is that you’re not assessing the clustering’s performance on held out data. That’s essential, because your performance on the training data will always improve with a more complex model. I’d recommend splitting out 20% of your data, fitting dp-Means and k-Means to the remaining 80%, and then calculating the squared errors estimated for that held out 20% when you use the parameters dp-Means and k-Means selects based on the 80% used for fitting. The big idea is that you want to estimate cluster centers that are going to be close to new data points. When you put a cluster center somewhere it’s not needed, you’ll do worse on the held out data.

John-

Yes, please feel free to copy and evolve the code as you need. I agree, a useful next step would be to optimize and test on separate random samples from the data set. With the framework I have, I may get around running this in the next few days.

Cheers.

John-

Yes, please feel free to copy and evolve the code as you need. I agree, a useful next step would be to optimize and test on separate random samples from the data set. With the framework so far, I may get around running this in the next few days.

Cheers.

So upon reading the paper, the real advantage to DP-means seems to be that it’s a principled way to attach multiple-dataset clustering to a hierarchical Dirichlet process (where the base measure from which each “view” of the clusters is drawn, can depend across the clusterings upon a common measure sampled from its own Dirichlet process — the analogy that I’m familiar with is “group Lasso vs. straight L1 penalty”). This provides an advantage not seen in k-means or consensus NMF type clusterings, in that one can do principled subtype discoveries across different views of the data. Of course this presumes that there is a reasonable expectation of shared structure, which seems to loop back to Blei’s point. (Another point is that you pay a steep price for being too lazy to choose K)

The “group Lasso vs. regular Lasso” analogy leads directly into [another Jordan paper](http://arxiv.org/abs/1204.4243), this one with several Chinese collaborators (including the first author), on mixtures of exponential power distributions with generalized inverse Gaussian densities. The principle contribution of the paper, as best as I can tell, is that both nonconvex penalties (MC+, bridge, SCAD, etc.) and Bayesian grouping of variables (via hierarchical structure) can be accomodated by the conjugacy between exponential power distributions (Laplace, Gaussian, and further generalizations in the paper) and the generalized inverse Gaussian distribution.

As with a parsimonious setting for lambda, I have always hated the idea of enforcing grouping on variables when I had no real reason to choose a structure. It’s the same reason I like the empirical Bayesian “cheat” of setting priors based on a first pass at the data itself. Let the data tell you what the model structure ought to be, and use priors to steer clear of obviously insane posteriors.

Anyways, I’m coding up the HDP and empirical-lambda versions of dp-means to see if this is worthwhile. I’ll be in Seattle around the time of the meetup, although I need to catch an 8:15pm flight that night, so I may not be able to make it to the actual presentation. Would like to stop by however.

Hi! I’m new here, coming via your Setup interview: thanks for the great post about this interesting new algorithm!

I feel stupid for not having read the thread comments fully however, because it would have spared me of implementing my own Python translation of your code. Scott’s approach of coding DP-means in terms of a subclassed version of K-means is much more elegant than mine, which is a literal translation of your code (using Numpy and Matplotlib):

https://github.com/cjauvin/dp-means

Hi Christian,

Nice to have a new reader. Thanks for the link to your Python code!

I finally had some time to work out the optimization step. The post linked below demonstrates DP-means clustering code that optimizes lambda by minimizing the DP-means cost function on a cross-validation data set (following up on John’s comment above). It is a little rough, but shows it is possible and fairly robust. It seems neat.

http://blog.drskippy.com/2012/07/19/dp-mean-optimizing-to-get-the-number-of-clusters/

https://github.com/DrSkippy27/Python-DP-Means-Clustering

Thanks for coding this up, Scott. Glad to see it turned out so well. Coincidentally, we’ve had some great talks at MSR recently about algorithms for hyper-parameter tuning, which I’m planning to write up to help diffuse that knowledge. This is a great starting point for that.