Announcing Junebug

Today, my friend Jim Keller and I are officially opening our new dating site, Junebug, to the public. For the moment, we’re focusing on building a user base in the areas around Philadelphia and New York City, but the site is open to everyone in the US. If you’d like to check it out, please go to junebugdating.com.

I decided to get involved with this project because I feel that existing dating sites are not taking advantage of the tools that machine learning provides for building better recommendation systems, which a dating site effectively is at its core. The contrast between the quality of suggestions that you get from Netflix and the quality of matches you get on existing dating sites is so great that I felt compelled to get involved, if only to see if we could do something to improve the state of the art in the field.

While there are many reasons why matching people with other people that they’d be interested in dating is harder than matching people with movies they’d be interested in watching or webpages they’d be interested in reading, I’ve become increasingly convinced that the big players in the online dating market are either unaware of or simply indifferent to the tools that computer scientists can offer them. I’d really like to help remedy that. With the so-called Age of Big Data upon us, I think dating sites need to exploit the amount of information they collect by turning things over to statistical algorithms that can do a more effective job of inferring people’s preferences.

We’ve built Junebug to do just that. We’ve got a lot of great algorithms ready to churn through the user data we’re hoping to collect. We need your help, though, because we don’t have much real data yet. If you’re reading this and are single, please sign up, try out the site, and tell us what we can do to make it more enjoyable to use. And it would be even better if you told your friends to try out the site as well, since that would start to provide us with the amount of data we need to really show much more effective a systematic statistical approach can be for building a better dating site.

If you think that online dating can be improved by the introduction of more sophisticated mathematical analysis, please helps us prove that’s true by spreading the word and getting new people to participate on our site. We can promise to do everything we know how to do to turn the data we get into a valuable service for our users.

Doing Maximum Likelihood Estimation by Hand in R

Lately I’ve been writing maximum likelihood estimation code by hand for some economic models that I’m working with. It’s actually a fairly simple task, so I thought that I would write up the basic approach in case there are readers who haven’t built a generic estimation system before.

First, let’s start with a toy example for which there is a closed-form analytic solution. We’ll ignore that solution and use optimization functions to do the estimation. Starting with this toy example makes it easy to see how well an approximation system can be expected to perform under the best circumstances — and also where it goes wrong if you make poor programming decisions.

Suppose that you’ve got a sequence of values from an unknown Bernoulli variable like so:

1
2
3
p.parameter <- 0.8
sequence <- rbinom(10, 1, p.parameter)
# [1] 0 1 1 1 1 1 1 0 1 0

Given the sequence, we want to estimate the value of the parameter, p, which is not known to us. The maximum likelihood approach says that we should select the parameter that makes the data most probable. For a Bernoulli variable, this is simply a search through the space of values for p (i.e [0, 1]) that makes the data most probable to have observed.

It’s worth pointing out that the analytic solution to the maximum likelihood estimation problem is to use the sample mean. We’ll therefore use mean(sequence) as a measure of the accuracy of our approximation algorithm.

How do we find the parameter numerically? First, we want to define a function that specifies the probability of our entire data set. We assume that each observation in the data is independently and identically distributed, so that the probability of the sequence is the product of the probabilities of each value. For the Bernoulli variables, this becomes the following function:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
likelihood <- function(sequence, p.parameter)
{
  likelihood <- 1
 
  for (i in 1:length(sequence))
  {
    if (sequence[i] == 1)
    {
      likelihood <- likelihood * p.parameter
    }
    else
    {
      likelihood <- likelihood * (1 - p.parameter)
    }
  }
 
  return(likelihood)
}

To do maximum likelihood estimation, we therefore only need to use an optimization function to maximize this function. A quick examination of the likelihood function as a function of p makes it clear that any decent optimization algorithm should be able to find the maximum:

1
2
3
4
5
6
7
8
9
10
possible.p <- seq(0, 1, by = 0.001)
jpeg('Likelihood_Concavity.jpg')
library('ggplot2')
qplot(possible.p,
      sapply(possible.p, function (p) {likelihood(sequence, p)}),
      geom = 'line',
      main = 'Likelihood as a Function of P',
      xlab = 'P',
      ylab = 'Likelihood')
dev.off()
Likelihood_Concavity.jpg

For single variable cases, I find that it’s easiest to use R’s base function optimize to solve the optimization problem:

1
2
3
4
5
6
7
8
9
10
mle.results <- optimize(function(p) {likelihood(sequence, p)},
                        interval = c(0, 1),
                        maximum = TRUE)
 
mle.results
# $maximum
# [1] 0.6999843
#
# $objective
# [1] 0.002223566

Here I’ve used an anonymous function that returns the likelihood of our current data given a value of p; I’ve also specified that the values of p must lie in the interval [0, 1] and asked optimize to maximize the result, rather than minimize, which is the default behavior. Examining the output of optimize, we can see that the likelihood of the data set was maximized very near 0.7, the sample mean. This suggests that the optimization approximation can work. It’s worth noting that the objective value is the likelihood of the data set for the specified value of p. The smallness of the objective for large problems can become a major problem. To understand why, it’s worth seeing what happens as the size of the sample grows from 10 to 2500 samples:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
error.behavior <- data.frame()
 
for (n in 10:2500)
{
  sequence <- rbinom(n, 1, p.parameter)
 
  likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                                 interval = c(0, 1),
                                 maximum = TRUE)
 
  true.mle <- mean(sequence)
 
  likelihood.error <- true.mle - likelihood.results$maximum
 
  error.behavior <- rbind(error.behavior,
                          data.frame(N = n,
                                     Error = likelihood.error,
                                     Algorithm = 'Likelihood'))
}
Likelihood_Problems.jpg

As you can see, our approximation approach works great until our data set grows, and then it falls apart. This is exactly the opposite of what asymptotical statistical theory tells us should be happening, so it’s clear that something is going very wrong. A quick examination of the results from the last pass through our loop makes clear what’s wrong:

1
2
3
4
5
6
7
8
9
10
11
12
sequence <- rbinom(2500, 1, p.parameter)
 
likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                               interval = c(0, 1),
                               maximum = TRUE)
 
likelihood.results
# $maximum
# [1] 0.9999339
#
# $objective
# [1] 0

The likelihood of our data is numerically indistinguishable from 0 given the precision of my machine’s floating point values. Multiplying thousands of probabilities together is simply not a viable approach without infinite precision. Thankfully, there’s a very simple solution: replace all of the probabilities with their logarithms. Instead of maximizing the likelihood, we maximize the log likelihood, which involves summing rather than multiplying, and therefore stays numerically stable:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
log.likelihood <- function(sequence, p)
{
  log.likelihood <- 0
 
  for (i in 1:length(sequence))
  {
    if (sequence[i] == 1)
    {
      log.likelihood <- log.likelihood + log(p)
    }
    else
    {
      log.likelihood <- log.likelihood + log(1 - p)
    }
  }
 
  return(log.likelihood)
}

You can check that this problem is as easily solved numerically as the original problem by graphing the log likelihood:

1
2
3
4
5
6
7
8
9
10
sequence <- c(0, 1, 1, 1, 1, 1, 1, 0, 1, 0)
possible.p <- seq(0, 1, by = 0.001)
jpeg('Log_Likelihood_Concavity.jpg')
qplot(possible.p,
      sapply(possible.p, function (p) {log.likelihood(sequence, p)}),
      geom = 'line',
      main = 'Log Likelihood as a Function of P',
      xlab = 'P',
      ylab = 'Log Likelihood')
dev.off()
Log_Likelihood_Concavity.jpg

And then you can rerun our error diagnostics using both approaches to confirm that the log likelihood approach does not suffer from the same numerical problems:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
error.behavior <- data.frame()
 
for (n in 10:2500)
{
  sequence <- rbinom(n, 1, p.parameter)
 
  likelihood.results <- optimize(function(p) {likelihood(sequence, p)},
                                 interval = c(0, 1),
                                 maximum = TRUE)
 
  log.likelihood.results <- optimize(function(p) {log.likelihood(sequence, p)},
                                     interval = c(0, 1),
                                     maximum = TRUE)
 
  true.mle <- mean(sequence)
 
  likelihood.error <- true.mle - likelihood.results$maximum
  log.likelihood.error <- true.mle - log.likelihood.results$maximum
 
  error.behavior <- rbind(error.behavior,
                          data.frame(N = n,
                                     Error = likelihood.error,
                                     Algorithm = 'Likelihood'),
                          data.frame(N = n,
                                     Error = log.likelihood.error,
                                     Algorithm = 'Log Likelihood'))
}
 
jpeg('Long-Term_Error_Behavior.jpg')
ggplot(error.behavior, aes(x = N, y = Error)) + geom_line(aes(group = Algorithm, color = Algorithm)) + opts(title = 'Long-Term Error Behavior of Two Numerical Approaches', xlab = 'Sample Size', ylab = 'Deviation from True MLE')
dev.off()
Long-Term_Error_Behavior.jpg

More generally, given any data set and any model, you can — at least in principle — solve the maximum likelihood estimation problem using numerical optimization algorithms. The general algorithm requires that you specify a more general log likelihood function analogous to the R-like pseudocode below:

1
2
3
4
5
6
7
8
9
10
11
log.likelihood <- function(sequence.as.data.frame, likelihood.function, parameters)
{
  log.likelihood <- 0
 
  for (i in 1:nrow(sequence.as.data.frame))
  {
    log.likelihood <- log.likelihood + log(likelihood.function(sequence.as.data.frame[i,], parameters))
  }
 
  return(log.likelihood)
}

Then you need to apply multivariable, constrained optimization tools to find your maximum likelihood estimates. This actually turns out to be a hard problem in general, so I’m going to bail out on the topic here.

Theme Change

I’ve changed the theme on this blog to the Hybrid theme. I was tired of how busy the old theme was, and I wanted more space for graphics and code snippets. If you have any suggestions for cleaning things up even more, please let me know.

The Price of Calculation

In a world in which the price of calculation continues to decrease rapidly, but the price of theorem proving continues to hold steady or increase, elementary economics indicates that we ought to spend a larger and larger fraction of our time on calculation.1

Over the next ten years, I hope that more and more mathematically minded hackers, empowered by open source tools like the R programming language and emboldened by the popularization of statistical analyses by people like Steve Levitt, will follow Tukey’s suggestion.

  1. J. W. Tukey : The American Statistician : Sunset Salvo

An Alternative to Occam’s Razor

In light of human foibles, I would suggest that this decision rule be used in lieu of Occam’s Razor: of several possible explanations for an observation, the most boring one is probably the most accurate.

iBad: The FSF Kool-Aid and Other Dystopian Hallucinations

The people who worry that the iPad will bring about a dystopian future for home computing keep forgetting something: for the rest of humanity, their ideal world of perfectly hackable machines is already a dystopian nightmare. It’s a world in which nothing works without spending hours setting it up, in which basic features are missing while the manual lists thousands of irrelevant options, in which a million hardware extensions are available for their machine, but none of them help to solve a single one of their day-to-day problems. While being something of a hacker myself, I feel that the hacker’s vision of totally open computing probably should become a niche market, in much the same way that chemistry sets represent a niche market. The fact that not every person has a set of tools in his house that, by default, allows him to conduct arbitrary chemistry experiments has not substantially slowed down the progress of chemistry from what I can tell. The arrival of a world in which the most popular computers are closed to arbitrary hardware extensions and all applications are required to run within a sandbox probably won’t slow down the progress of personal computing much either.

Hackers of the world, your priorities are not simply different from the average user’s: they often represent a direct attack on the average user’s preferences. You keep asserting that you have the normal person’s interests in mind, but I think you’re often simply concealing your own self-interest underneath politicized rhetoric about freedom and openness.

Gay Marriage: Another Data Point

Relevant to my earlier post about the relationship between direct democracy and laws prohibiting gay marriage, Pew Research just published poll data showing that a majority of Americans disapprove of same-sex marriage.

Academics’ Slang: Orthogonal

H. G. Wells famously said that, “statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.” I think we’re getting closer to that day: even the Supreme Court of the United States plans to start using the word ‘orthogonal’ colloquially.

Outlawing Gay Marriage

Given the recent votes on same-sex marriage in New Jersey and Portugal, I wanted to test a seemingly innocuous claim that touches upon very broad issues in political theory: does the degree of directness of a “democratic” vote predict whether the vote will promote or prohibit same-sex marriage? Naively, it seemed clear to me that this was true: every single direct vote in my memory has outlawed same-sex marriage, while the only decisions that have allowed same-sex marriage have originated among elected representatives or unelected judges.

I decided to compile some rough data to test this idea using Wikipedia’s articles on the topic. I took what seemed to be the non-redundant decisions since the 1990′s from the following three articles:

  1. Same-Sex Marriage in the Netherlands
  2. Same-Sex Marriage in Spain
  3. Same-Sex Marriage in the United States

And I arrived at this table:

Region Date Type of Decision Promoted
Hawaii 1993-05-05 Court Ruling 1
U.S.A. 1996-09-21 Congressional Decision 0
Hawaii 1998-11-03 Direct Vote 0
Vermont 1999-12-20 Court Ruling 1
Holland 2000-12-19 Congressional Decision 1
Massachusetts 2003-11-18 Court Ruling 1
Spain 2005-06-30 Congressional Decision 1
California 2008-05-15 Court Ruling 1
Connecticut 2008-10-10 Court Ruling 1
California 2008-11-04 Direct Vote 0
New Hampshire 2009-03-26 Congressional Decision 1
Iowa 2009-04-03 Court Ruling 1
Vermont 2009-04-07 Congressional Decision 1
Connecticut 2009-04-23 Congressional Decision 1
Maine 2009-05-06 Congressional Decision 1
New York 2009-05-12 Congressional Decision 1
Texas 2009-10-02 Court Ruling 1
Maine 2009-11-03 Direct Vote 0
Washington, D.C. 2009-12-01 Court Ruling 1
New York 2009-12-02 Congressional Decision 1
New Jersey 2010-01-07 Congressional Decision 0

You can find a CSV file with this data at the GitHub repository where I’ve stored all of the analyses I’ve completed so far. I’d love for people to add more data or contribute some visualizations of the patterns I’m pointing out. And, as always, I’d love to hear criticisms or suggestions about my approach.

The summary statistics provide pretty clear support for my intuition, though they’re rarely statistically significant, given the small sample sizes:

  • 100% of court rulings promoted same-sex marriage.
  • 80% of congressional decisions promoted same-sex marriage.
  • 0% of direct votes promoted same-sex marriage.

For me the conclusion to be drawn from all of this is not that our respect for the will of the majority should compel us to prohibit same-sex marriage, but rather that we should be more hostile to direct democracy, because it is a powerful force for promoting intolerance in American society. I imagine this will sound un-American to many readers, but I think it’s quite clear that this sentiment was foundational for the American experiment in government. You see this distrust of direct democracy repeatedly reiterated in the Federalist Papers, especially in Federalist Number 10. Indeed, one can argue that the major purpose of our Constitution is to limit the ability of direct democracy to harm minorities within our society.

Killing Yourself: An Addendum

In further support of the claim that a lot of deaths are partly self-induced, here’s a fascinating piece by Wired on the extraordinary rise in the percent of deaths among the young caused by their own poor decisions.

It’s remarkable that, for the young, modern science has already made the world so safe that humanity, rather than nature, is now responsible for a majority of its own suffering.