Jan 6 2009

April May Be the Cruelest, But January Is the Strangest Month

I always find January a strange month, because the weather tends to get colder over the course of the month, even though the days get progressively longer. Given that I had already scrounged up data on the temperature in New York City a while back, I thought I should plot a graph showing the strange disconnect between day length and temperature that characterizes January. I was able to find an Excel spreadsheet that calculated the number of hours of daylight New York City received each day of the year; I used my previous weather data for the average temperatures for each day in 2003, 2005, 2006 and 2007. For the three days for which I had no temperatures (1/13/2003, 3/1/2003 and 8/28/2007), I used linear interpolation to estimate that day’s average temperature. I skipped 2004 in my analysis because it was a leap year.
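
For anyone curious, gap filling of this kind can be sketched in R with approx(). This is only an illustration under my own assumptions: the temperatures are made-up values rather than the New York data, and the missing day is estimated from its two neighbors.

```r
# Hypothetical daily mean temperatures (degrees F); NA marks the missing day.
temps <- c(30, 28, NA, 24, 25)
days <- seq_along(temps)

# approx() interpolates linearly between the known points.
known <- !is.na(temps)
filled <- approx(days[known], temps[known], xout = days)$y
```

Here the missing third day ends up halfway between its neighbors, and the known days are left unchanged.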

The graphs below make clear, some more so than others, that January is a distinctive month, because the mean temperature noticeably lags behind the mean length of the day. You can also see a similar pattern for the temperatures in July, which are warmer than June’s temperatures even though the days are already getting shorter. Both of these imply that it is not the mere presence of sunlight that determines the temperature each day, but rather the accumulated warmth due to the sunlight of the previous month.

To be clear about the construction of the graphs and their interpretation, the rank of a data point is its relative position in the order of the entire data set. The warmest day in a year is ranked 365 and the coldest day is ranked 1; similarly, the longest day in a year is ranked 365 and the shortest day is ranked 1. Day 1 is January 1st; day 365 is December 31st.
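
The ranking described above can be sketched with R's rank() function. The numbers are hypothetical values for five days, not the real data, and the variable names are mine.

```r
# Hypothetical values for five days (not the real New York data).
temperature <- c(31.5, 29.0, 35.2, 40.1, 38.7)
day.length <- c(9.3, 9.4, 9.5, 9.6, 9.7)

# rank() assigns 1 to the smallest value and length(x) to the largest.
temp.rank <- rank(temperature)
length.rank <- rank(day.length)
```

With a full year of 365 values the same calls would give ranks from 1 to 365. Note that rank() averages the ranks of tied values by default, which matters if two days share the same temperature.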

2003.png
2005.png
2006.png
2007.png

Jan 4 2009

Linear Regression Sampling Techniques Revisited

While thinking about linear regression today, I believe that I’ve realized why using clustered sampling and two point average slope calculation works better than least squares regression with scattered sampling. Specifically, it is the least squares formula that itself causes the problem, because squaring the errors gives undue weight to certain errors, skewing the results. This skew quickly goes away as the sample size grows, but in small samples it can make a large difference.

It will probably take me another few weeks before I can find time to prove this mathematically.


Dec 27 2008

Data Collection Strategies Revisited

To follow up on my post earlier today on two approaches to linear model fitting, I decided to do some Monte Carlo simulations to test the relative strength of my two proposals for data collection strategies. To test the merits of sampling clustered data points versus sampling scattered data points, I generated 100,000 data sets of four sizes (N = 10, N = 100, N = 1,000, N = 10,000) using each of these two approaches. For clustered data sets, I calculated the slope of the regression line using the slope formula everyone learns from remedial algebra; for the scattered data sets, I calculated the regression coefficients using standard linear model algorithms. I then compared these slopes with the true value and calculated the absolute error for each approach. Using these individual errors, I calculated the mean absolute error for each approach. The results are plotted in the graph below and the code I used to run the simulations is at the end of this post.

Monte Carlo Results.png

As you can see from this graph, the two approaches seem to have indistinguishable performance for large data sets, but the clustered data set approach seems to perform slightly better for data sets of size N = 10. I was quite surprised by this, as I assumed my slightly ad hoc approach would perform worse than the standard approach. I therefore would appreciate any/all of the following: theoretical analyses of the two approaches’ merits for small data sets, the discovery of errors in my code, or an insight about the R function rnorm() that implies these results regardless of the intrinsic quality of the two approaches. One other conceivable source of error in these results, which I had hoped would wash out with the large number of iterations in my simulations, is that the data sets used to perform the analysis were simply incomparable, which is problematic because I compared classical regression on scattered point data sets to slope estimation on clustered point data sets. As a follow up, I should probably consider the performance of classical regression on clustered point data sets relative to scattered point data sets, though this approach would not itself answer the question of which data collection strategy is best.

# Compare performance of scattered point regression with slope estimation.
 
# See whether one algorithm does better with certain size data sets.
 
# Compare difference in performance on 100,000 samples of four sizes:
# 10, 100, 1000, 10000.
sample.sizes = c(10, 100, 1000, 10000);
classical = c();
dual.point = c();
 
for (i in 1:length(sample.sizes))
{
  slopes = c();
  alt.slopes = c();
 
  errors = c();
  alt.errors = c();
 
  # Classical regression.
  for (iteration in 1:100000)
  {
    x = 1:sample.sizes[i];
    y = 2 * x + rnorm(sample.sizes[i]);
    slopes[iteration] = coef(lm(y ~ x))[2];
    errors[iteration] = abs(slopes[iteration] - 2);
  }
 
  # Dual point slope estimation.
  for (iteration in 1:100000)
  {
    x = c(rep(1, sample.sizes[i] / 2), rep(sample.sizes[i], sample.sizes[i] / 2));
    y = 2 * x + rnorm(sample.sizes[i]);
    alt.slopes[iteration] = (mean(y[(sample.sizes[i] / 2 + 1):sample.sizes[i]]) - mean(y[1:(sample.sizes[i] / 2)])) / (sample.sizes[i] - 1);
    alt.errors[iteration] = abs(alt.slopes[iteration] - 2);
  }
 
  classical[i] = mean(errors);
  dual.point[i] = mean(alt.errors);
}
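
As one possible starting point for the theoretical analysis asked for above, the textbook standard error of a least squares slope, sigma / sqrt(sum((x - mean(x))^2)), can be evaluated for both designs. This is a sketch under stated assumptions rather than a proof: it takes the noise standard deviation to be 1 (the rnorm() default used in the simulation), and it relies on the fact that with only two distinct x values the two-point slope coincides with the least squares slope, so the same formula covers both cases. The helper name slope.se is my own.

```r
# Standard error of a least squares slope for design x and noise sd sigma.
slope.se <- function(x, sigma = 1) sigma / sqrt(sum((x - mean(x))^2))

sample.sizes <- c(10, 100, 1000, 10000)

# Scattered design: x = 1, 2, ..., n.
scattered.se <- sapply(sample.sizes, function(n) slope.se(1:n))

# Clustered design: half the points at x = 1, half at x = n.
clustered.se <- sapply(sample.sizes,
                       function(n) slope.se(rep(c(1, n), each = n / 2)))

# For normally distributed errors, the expected absolute error of the
# slope is sqrt(2 / pi) times its standard error.
scattered.mae <- sqrt(2 / pi) * scattered.se
clustered.mae <- sqrt(2 / pi) * clustered.se
```

Under these assumptions the clustered design has the smaller standard error at every sample size, by a factor of sqrt(3 * (n - 1) / (n + 1)), which approaches sqrt(3) as n grows. Both errors shrink so quickly with n that the gap would be hard to see on a plot for the larger sizes.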

Dec 26 2008

Again with the Null Hypothesis Significance Testing

As I was finishing reading “The Cult of Statistical Significance” yesterday, the following passage struck me as particularly important:

Rothman computed a p-value function — a continuous function of p-values mapped against a range of effect sizes. The range of effect sizes was here again measured by the relative risk ratio and includes both beneficial and nonbeneficial effects. He shows that another hypothesis, a fantastically beneficial risk ratio, RR = 4.1, shares the same p-value, .14, as the null, RR = 1.0 (2002, 125). This is common in medicine and all the sciences. To think that p-values have a 1-to-1 correspondence with a unique risk ratio is to ignore the symmetry of the p-function.1

I wish the symmetry of the distributions used for testing significance, especially the t distribution, were emphasized to students during their introduction to statistics. We generally set up a test so that the t-value is positive and then ask whether we can reject the null hypothesis of zero difference between the means of two sets of observations. But it is always possible to test another null hypothesis, in which the difference between the means of the two groups is much larger than the difference we observed, and which we will fail to reject every time that we fail to reject the primary null hypothesis of zero difference. (For a two-sided t-test, a null placed at twice the observed difference yields exactly the same p-value as the null of zero.) Yet we never test this hypothesis, despite there being no good mathematical reason for the omission. The only justification is an implicit Bayesian prior in favor of the null hypothesis rather than its doppelgänger hypothesis, in which the difference is much larger than we have seen in practice. Is this implicit underweighting of the alternative null hypothesis really sound? That is an empirical question that is, unfortunately, not likely to be answered soon, but it suggests that conventional statistical practice may consistently underestimate the effects being examined using significance testing.
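
The symmetry is easy to check numerically. The sketch below uses simulated data (not anything from the book) and R's t.test(), whose mu argument sets the hypothesized difference between the two group means; the group names and sample sizes are arbitrary choices of mine.

```r
set.seed(1)

# Simulated measurements for two hypothetical groups.
group.a <- rnorm(20, mean = 0.4)
group.b <- rnorm(20, mean = 0)

observed.diff <- mean(group.a) - mean(group.b)

# Usual null hypothesis: the true difference between the means is zero.
p.zero <- t.test(group.a, group.b, mu = 0)$p.value

# Mirror-image null hypothesis: the true difference is twice the observed one.
p.double <- t.test(group.a, group.b, mu = 2 * observed.diff)$p.value
```

The two t statistics are equal in size and opposite in sign, so the two-sided p-values in p.zero and p.double agree: whenever the data fail to reject zero difference, they equally fail to reject a difference twice as large as the one observed.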

Of course, this problem is itself tied to the erroneous conflation of a failure to reject the null hypothesis with its acceptance — with the result that statistically insignificant effects are treated as non-existent, rather than inconclusively determined by the data at hand. Statistically insignificant differences tend not to be tested empirically a second time, so that it is hard to know how often they are really larger than our first experiments suggested.

  1. Stephen T. Ziliak and Deirdre N. McCloskey : The Cult of Statistical Significance : On Drugs, Disability and Death

Dec 13 2008

Proving the Obvious and Understanding the Not-So Obvious

Continuing on with my exploration of the National Survey of Drug Use and Health, I thought that I should calculate some simple conditional frequency statistics. The graph below strikes me as a very good example of how conditional probabilities play out in the real world. From it, you can see how the right piece of information can radically improve your ability to make guesses about the answer to another question.

Cigarettes and Cocaine.png

To quantify the pattern that you can see in the chart, only 4% of those who’ve tried cocaine have not also tried cigarettes at some point in their lives. In contrast, 49% of those who’ve tried cigarettes have never tried cocaine. In general, people are unlikely to try cocaine, but those who do are almost certain to have tried cigarettes as well. In other words, cocaine use tells you a lot about cigarette use, but cigarette use tells you effectively nothing about cocaine use. If you meet someone who’s tried cocaine, and you assume that they’ve also tried cigarettes, these statistics suggest that your assumption will be wrong less than 5% of the time.
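
The two directions of conditioning can be sketched in R with prop.table(). The counts below are hypothetical, chosen only to roughly echo the percentages quoted above; they are not the actual survey tabulation.

```r
# Hypothetical counts (not the actual NSDUH tabulation).
# Rows: ever tried cigarettes; columns: ever tried cocaine.
counts <- matrix(c(96, 4, 92, 100), nrow = 2,
                 dimnames = list(cigarettes = c("yes", "no"),
                                 cocaine = c("yes", "no")))

# P(cigarettes | cocaine): normalize within each cocaine column.
given.cocaine <- prop.table(counts, margin = 2)

# P(cocaine | cigarettes): normalize within each cigarettes row.
given.cigarettes <- prop.table(counts, margin = 1)

# Share of cocaine triers who never tried cigarettes.
p.nocig.given.coc <- given.cocaine["no", "yes"]

# Share of cigarette triers who never tried cocaine.
p.nococ.given.cig <- given.cigarettes["yes", "no"]
```

The same table yields very different answers depending on which margin you normalize by, which is the whole point of the asymmetry described above.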


Dec 11 2008

National Survey of Drug Use and Health

Lately, I’ve been exploring the data set that was recently released by the National Survey of Drug Use and Health. There’s enough raw data in it to spend months trying to make sense of it all. That said, for the moment I thought that I would simply post the following chart I generated using a very quick calculation of the relative frequencies of substance abuse broken down by substance.

Substance Abuse.png

The variables used in this analysis were ABUSEALC, ABUSECOC, ABUSEHAL, ABUSEHER, ABUSEINH, ABUSEMRJ, ABUSEANL, ABUSESED, ABUSESTM, ABUSETRN. The meanings of these variables are somewhat obscure, but my hope is that the definition of abuse is similar enough across substances to allow for a relative frequency analysis. For each substance, I counted the subjects classified as abusing it and then divided that count by the total number of subjects in the data set to find a frequency of abuse per substance.
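
A minimal sketch of that calculation in R, on a toy data frame rather than the survey itself. I am assuming here that a value of 1 marks a subject classified as abusing the substance and 0 marks everyone else; the real coding of these variables may differ.

```r
# Hypothetical toy data: 1 = classified as abusing, 0 = not (assumed coding).
survey <- data.frame(
  ABUSEALC = c(1, 0, 0, 1, 0, 0, 0, 0),
  ABUSEMRJ = c(0, 0, 1, 0, 0, 0, 0, 0),
  ABUSECOC = c(0, 0, 0, 0, 0, 0, 0, 0)
)

# Count the subjects flagged for each substance, then divide by the
# total number of subjects.
abuse.freq <- colSums(survey == 1) / nrow(survey)
```

The result is a named vector with one frequency per substance, which can be passed straight to barplot().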


Dec 9 2008

Breast Cancer and Early First Pregnancy?

Reading David Freedman’s book “Statistical Models: Theory and Practice” today, I was very struck by this passage:

Example 1. In cross-national comparisons, there is a striking correlation between the number of telephone lines per capita in a country and the death rate from breast cancer in that country. This is not because talking on the telephone causes cancer. Richer countries have more phones and higher cancer rates. The probable explanation for the excess cancer risk is that women in richer countries have fewer children. Pregnancy — especially early first pregnancy — is protective.1

Is Freedman correct about the protective benefits of pregnancy? This would be remarkable if true. And, in the absence of evidence to the contrary, I am likely to believe Freedman’s claims.

  1. David Freedman : Statistical Models: Theory and Practice : Chapter I

Dec 4 2008

Masquerading as Rigorous Science

In our days, serious arguments have been made from data. Beautiful, delicate theorems have been proved; although the connection with data analysis often remains to be established. And an enormous amount of fiction has been produced, masquerading as rigorous science.1

I would like to believe that, if only more statisticians wrote like David Freedman, we might succeed in ridding ourselves of so much of the fashionable nonsense that masquerades as science today.

Hat tip to Jiaying Zhao for bringing this truly amazing article to my attention.

  1. David Freedman : Foundations of Science : Some Issues in the Foundation of Statistics

Nov 30 2008

Suicide Rates and GDP

As part of an ongoing project on the behavioral consequences of tryptophan depletion, I read an article today that claimed to have found a positive correlation between high levels of corn consumption and homicide across many nations. The researchers claimed that corn, being deficient in tryptophan, chronically depletes serotonin levels, thereby increasing incidents of physical violence.

I was fascinated by the claim, albeit rather incredulous. But, rather than pursue the question of tryptophan’s effects on suicide, I decided to look into a question I’ve often wondered about: the correlation of GDP and suicide rates.

After some data diving of my own, using GDP data from the IMF and suicide data from WHO, I found no meaningful correlation between suicide rates and GDP. Interestingly, a simple scatterplot of the relevant data sets reveals that, for each gender separately, there are several very substantial outliers that make any such correlation impossible to find, as you can see below.
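
To illustrate how a few outliers can swamp a correlation, here is a small R sketch on made-up numbers (not the IMF or WHO figures). It also computes a rank-based Spearman correlation, which is not something I used in the analysis above; it is shown only as one common way to reduce the influence of extreme values.

```r
# Hypothetical values for eight countries (not the IMF or WHO figures).
gdp.per.capita <- c(2, 5, 9, 14, 22, 30, 38, 45)
suicide.rate <- c(6, 8, 7, 9, 8, 10, 70, 9)

# Pearson correlation is sensitive to the single extreme rate.
pearson <- cor(gdp.per.capita, suicide.rate)

# Spearman correlation works on ranks, so outliers carry less weight.
spearman <- cor(gdp.per.capita, suicide.rate, method = "spearman")
```

Comparing the two numbers on real data would be one quick check on how much of the apparent lack of correlation is due to the outliers alone.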

Male Suicides and GDP.png
Female Suicides and GDP.png

So the question I’m left with is, “what variables explain the very different suicide rates seen across nations in this data set?”


Nov 8 2008

GRE Scores and Political Correctness

This week a table of average GRE scores for different academic disciplines has been circulating around the economic blogosphere. You can see it at Greg Mankiw’s blog here.

After a colleague pointed the chart out to me today, I decided that I would combine the GRE scores with self-identification scores of political correctness from another chart that made its way around my circle of friends a few months ago. Below you’ll find the resulting scatterplot and regression line.

correlation.png

The correlation coefficient in the chart is -0.63, indicating a noticeable decline in GRE performance as one transitions to groups that are more PC. You can figure out the implications of that statement for yourself.