Jan 6 2009

April May Be the Cruelest, But January Is the Strangest Month

I always find January a strange month, because the weather tends to get colder over the course of the month even though the days get progressively longer. Given that I had already scrounged up data on the temperature in New York City a while back, I thought I should plot a graph showing the strange disconnect between day length and temperature that characterizes January. I was able to find an Excel spreadsheet that calculated the number of hours of daylight New York City received each day of the year; I used my previous weather data for the average temperatures for each day in 2003, 2005, 2006 and 2007. For the three days for which I had no temperatures (1/13/2003, 3/1/2003 and 8/28/2007), I used linear interpolation to estimate that day’s average temperature. I skipped 2004 in my analysis because it was a leap year.
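As a rough illustration of the interpolation step, here is a minimal sketch in R. The temperatures are invented, and the use of approx() is an assumption about how one might fill a single missing day; it is not necessarily how the gaps in the real data were filled.

```r
# Hypothetical daily mean temperatures with one missing day (not real data).
temps <- c(30.5, 28.0, NA, 25.0, 27.5)
days <- seq_along(temps)

# approx() interpolates linearly between the known neighbouring days.
known <- !is.na(temps)
filled <- temps
filled[!known] <- approx(days[known], temps[known], xout = days[!known])$y

print(filled)
```

With a single missing day, the estimate is just the average of the day before and the day after.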

The graphs below make clear, some more so than others, that January is a distinctive month, because the mean temperature noticeably lags behind the mean length of the day. You can also see a similar pattern for the temperatures in July, which are warmer than June’s temperatures even though the days are already getting shorter. Both of these imply that it is not the mere presence of sunlight that determines the temperature each day, but rather the accumulated warmth due to the sunlight of the previous month.

To be clear about the construction of the graphs and their interpretation, the rank of a data point is its relative position in the order of the entire data set. The warmest day in a year is ranked 365 and the coldest day is ranked 1; similarly, the longest day in a year is ranked 365 and the shortest day is ranked 1. Day 1 is January 1st; day 365 is December 31st.
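The ranking described above can be sketched with R’s rank() function. The five values below are invented stand-ins for a full year of data, and the handling of ties (rank() averages tied ranks by default) is an assumption, since the original calculation is not shown.

```r
# Hypothetical values for a five-day "year"; not the New York data.
day.length <- c(9.3, 9.4, 9.6, 9.9, 10.3)  # hours of daylight
temperature <- c(35, 31, 28, 30, 33)       # mean temperature

# rank() gives 1 to the smallest value and length(x) to the largest.
length.rank <- rank(day.length)
temperature.rank <- rank(temperature)

print(data.frame(day = 1:5, length.rank, temperature.rank))
```

In this toy example the day-length rank rises steadily while the temperature rank first falls and then recovers, which is the kind of lag the graphs are meant to show.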

2003.png
2005.png
2006.png
2007.png

Jan 4 2009

Linear Regression Sampling Techniques Revisited

While thinking about linear regression today, I believe that I’ve realized why using clustered sampling and two point average slope calculation works better than least squares regression with scattered sampling. Specifically, it is the least squares formula that itself causes the problem, because squaring the errors gives undue weight to certain errors, skewing the results. This skew quickly goes away as the sample size grows, but in small samples it can make a large difference.

It will probably take me another few weeks before I can find time to prove this mathematically.


Dec 31 2008

Democracy in Action

The Pew Research Center reported yesterday that the voters of 29 states have already approved bans on same-sex marriage.

For me, this observation highlights the absurdity of the naïve apotheosis of populism and democratic institutions that constitutes a core element of the contemporary Western zeitgeist. We tend to take for granted that democracy is something intrinsically good, an assumption that gives strength to the growing scorn we see in our society for “elitism” or any other movement that threatens to usurp the will of the people. We Americans seem to invariably forget that, while democratic institutions may sculpt our society in accord with the will of the people, this in no way implies that the people’s vision of a perfect society is something we should wish to see given form. Democracy does indeed give power to the people, but it does not and can never give the people the moral integrity to put that power to proper use.

Indeed, if the age-old adage that “power corrupts” is true, then democracy might even contribute to the moral and intellectual degradation of the populations of democratic nations. Or, as seems more likely, the age-old adage is simply wrong: corruption is a part of the human inheritance, and power, like alcohol, simply brings that latent vice to the forefront.

Before I close, I should note that this is not a peculiarly American problem, though I know many people who would like to claim so. After all, the Swiss are about to vote on a law that would permanently ban the construction of minarets.

Really, when I think of all the crimes that democratic nations commit against their own moral codes, it’s enough to make me wonder if William Henry Vanderbilt was onto something when he said, “the people be damned.”


Dec 29 2008

And the Teslas Just Keep on Coming

I think this YouTube video does a far better job of showcasing the dangers of MRI machines than “The Magnetic Zone” video that Siemens distributes. I particularly enjoy the “take off” sound that the air cylinder makes three seconds into the clip.


Dec 27 2008

Data Collection Strategies Revisited

To follow up on my post earlier today on two approaches to linear model fitting, I decided to do some Monte Carlo simulations to test the relative strength of my two proposals for data collection strategies. To test the merits of sampling clustered data points versus sampling scattered data points, I generated 100,000 data sets of four sizes (N = 10, N = 100, N = 1,000, N = 10,000) using each of these two approaches. For clustered data sets, I calculated the slope of the regression line using the slope formula everyone learns from remedial algebra; for the scattered data sets, I calculated the regression coefficients using standard linear model algorithms. I then compared these slopes with the true value and calculated the absolute error for each approach. Using these individual errors, I calculated the mean absolute error for each approach. The results are plotted in the graph below and the code I used to run the simulations is at the end of this post.

Monte Carlo Results.png

As you can see from this graph, the two approaches seem to have indistinguishable performance for large data sets, but the clustered data set approach seems to perform slightly better for data sets of size N = 10. I was quite surprised by this, as I assumed my slightly ad hoc approach would perform worse than the standard approach. I therefore would appreciate any or all of the following: theoretical analyses of the two approaches’ merits for small data sets, the discovery of errors in my code, or an insight about the R function rnorm() that implies these results regardless of the intrinsic quality of the two approaches. One other conceivable source of error in these results, which I had hoped would wash out with the large number of iterations in my simulations, is that the data sets used to perform the analysis were simply incomparable, which is problematic because I compared classical regression on scattered point data sets to slope estimation on clustered point data sets. As a follow-up, I should probably consider the performance of classical regression on clustered point data sets relative to scattered point data sets, though this approach would not itself answer the question of which data collection strategy is best.

# Compare performance of scattered point regression with slope estimation.
 
# See whether one algorithm does better with certain size data sets.
 
# Compare difference in performance on 100,000 samples of four sizes:
# 10, 100, 1000, 10000.
sample.sizes = c(10, 100, 1000, 10000);
classical = c();
dual.point = c();
 
for (i in 1:length(sample.sizes))
{
  slopes = c();
  alt.slopes = c();
 
  errors = c();
  alt.errors = c();
 
  # Classical regression.
  for (iteration in 1:100000)
  {
    x = 1:sample.sizes[i];
    y = 2 * x + rnorm(sample.sizes[i]);
    slopes[iteration] = coef(lm(y ~ x))[2];
    errors[iteration] = abs(slopes[iteration] - 2);
  }
 
  # Dual point slope estimation.
  for (iteration in 1:100000)
  {
    x = c(rep(1, sample.sizes[i] / 2), rep(sample.sizes[i], sample.sizes[i] / 2));
    y = 2 * x + rnorm(sample.sizes[i]);
    alt.slopes[iteration] = (mean(y[(sample.sizes[i] / 2 + 1):sample.sizes[i]]) - mean(y[1:(sample.sizes[i] / 2)])) / (sample.sizes[i] - 1);
    alt.errors[iteration] = abs(alt.slopes[iteration] - 2);
  }
 
  classical[i] = mean(errors);
  dual.point[i] = mean(alt.errors);
}
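One possible angle on the theoretical question, offered as a sketch rather than a proof: under the model the simulation uses (independent noise with equal variance), the textbook standard error of a least squares slope is sigma / sqrt(sum((x - mean(x))^2)). When x takes only two values in equal-sized clusters, the difference of the cluster means divided by the gap between the clusters is itself the least squares slope, so the same formula covers both designs. The helper name slope.sd is invented here for illustration.

```r
# Standard error of the least squares slope for a given set of x values,
# assuming independent errors with standard deviation sigma.
slope.sd <- function(x, sigma = 1) {
  sigma / sqrt(sum((x - mean(x))^2))
}

sample.sizes <- c(10, 100, 1000, 10000)

# Scattered design: x = 1, 2, ..., n.
scattered <- sapply(sample.sizes, function(n) slope.sd(1:n))

# Clustered design: half the points at x = 1, half at x = n.
clustered <- sapply(sample.sizes, function(n) {
  slope.sd(c(rep(1, n / 2), rep(n, n / 2)))
})

# Ratio of standard errors; algebraically sqrt(3 * (n - 1) / (n + 1)).
print(data.frame(n = sample.sizes, scattered, clustered,
                 ratio = scattered / clustered))
```

If this reasoning applies, the clustered design should have the smaller error at every sample size, by a factor that approaches sqrt(3) as n grows. The difference may simply be hard to see on a plot for large n, because both errors are already tiny there.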

Dec 26 2008

Linear Regression and Decisions about Sampling

Lately I’ve been thinking about the optimal strategy for data collection when you plan to run a linear regression. Clearly, you want a sample of widely distributed points if you’re unsure that a strict linearity assumption is appropriate. If you already know from theoretical reasons that linearity is appropriate, then you know that you only need two correct (x, y) data points to uniquely define the regression line. To get this, one conventionally samples many (x, y) pairs and then computes the regression line’s slope and intercept. Why not sample only two x data points over and over again instead? If you are trying to find the formula for the line E[y | x], it seems reasonable to assume that high quality estimates of the points (a, E[y | x = a]) and (b, E[y | x = b]) would be a good way to do this.

Is this a reasonable approach to two variable linear regressions? Is this approach less efficient statistically than sampling at many points? Or is the reason to avoid this strategy in practice that one is uncertain of the validity of the linearity assumption in all but exceptional cases?


Dec 26 2008

Making the Most of My Mac

For literally years I’ve been meaning to write a post about my favorite programs and utilities for the Mac, but I’ve always managed to put it off. Given that I recently sent my girlfriend my old Powerbook, I thought that I should finally write down a list of the programs and tools that I’ve found worth having as a Mac user. This list is definitely idiosyncratic, with a heavy bias towards programming and scientific tools, but I think that there are still a lot of very good programs on this list that do not always get as much publicity as they deserve. All that said, here’s my list.

1. Adium: The best chat client for the Mac that I’m aware of. I use it as a client for GMail chat, AIM and MSN. I’d use iChat if it worked with all of those services as well as Adium does, but it just doesn’t as far as I can tell. There are things that iChat does that Adium can’t do (e.g. video chat), but I don’t have any use for those features. I could also use the AIM and MSN programs provided by AOL and Microsoft, but I much prefer a single integrated program over several separate programs. (site)

2. Caffeine: A simple little program that keeps your Mac from going to sleep, turning off the monitor or activating the screen saver. Very useful when you’re giving presentations. (site)

3. Carbon Emacs: The only build of Emacs that I find reliably renders the keys on my Western Spanish keyboard. It is also the only one that seems to respect traditional Emacs key bindings, which is very important to me. (site)

4. Cubase: My favorite music composition software for the Mac. Cubase is well-deservedly famous as a MIDI sequencer and I’ve found that it’s equally good as a multitrack audio recording system. I use it along with a Toneport UX2 to record guitar and Superior Drummer 2.0 for drum tracking, and I’ve gotten great results so far. (site)

5. Cyberduck: The FTP/SFTP client I use. I’m sure there are better tools than Cyberduck (such as Transmit 3), but Cyberduck is free and does the job more than well enough for my needs. (site)

6. Delicious Library: A program to help you keep a record of all of the books, DVD’s and CD’s you own. I find it especially useful for keeping track of the books I lend to people. (site)

7. Flickr Uploader: If I’m going to upload a lot of photos to Flickr, I really don’t want to have to use a Web interface. Flickr Uploader lets me do all of the editing on my machine and then send the labelled and tagged photos as a single group to Flickr. Most importantly, the progress I’ve made in tagging photos isn’t lost when my Internet connection flakes out. (site)

8. Gimp: I’m too cheap to buy Photoshop, but the recent builds of Gimp for the Mac work well enough for my purposes. (site)

9. Graphviz: If I have to draw any sort of graph, I always use Graphviz. It’s a great interface to compilers for the DOT language developed at Bell Labs to describe graphs. If you can program at all and ever need to write up flowcharts or diagrams of any sort, I think Graphviz is the way to go. You should also know that the Pixelglow build for Macs is much better than the default. (site)

10. Growl: Growl provides one of those clever little hacks to the basic Mac user interface that Windows users always find impressive: it creates a service for displaying notifications on your screen that quickly fade away after you’ve seen them. But the truth is that Growl’s usefulness is only obvious after you’ve used it for a while. (site)

11. Handbrake: The best video transcoder I know of for the Mac. Whenever I need to change one video format into another, Handbrake’s been able to do it for me. (site)

12. Hazel: Another great service for Macs: install Hazel and you have a simple daemon that will regularly move files according to a set of rules you define yourself. I use it to sort every file on my desktop into folders specific to filetypes — moving MP3’s to one folder and PDF’s to another. It’s been a major part of my efforts to be more organized with my files. (site)

13. KeePassX: A password manager that I find very helpful for navigating the mass of passwords I need to remember without leaving the passwords as plain text anywhere on my system. (site)

14. MacFreePOPS: A simple little program that will let you access your Hotmail account from Mail as if it were a POP server. Extremely useful. (site)

15. MacFUSE: Probably the most impressive of all of the hacks created by the Mac user community. MacFUSE allows you to install new file system drivers that run entirely in user space. The result is that you’ll get easy access to NTFS (i.e. Windows) hard drives and a slew of other formats. I think everyone should put MacFUSE on their machine the day they buy it. (site)

16. Mac The Ripper: If I need to make a copy of a DVD I’ve made, Mac The Ripper makes it much easier for me to do so. Unfortunately only the older version is still freely distributed, but it works for most DVD’s. (site)

17. MarsEdit: Just as I don’t like using a web interface to upload photos to Flickr, I don’t much like using one when writing blog posts. So I do all of my writing in MarsEdit, which then handles uploading my finished posts to my server. (site)

18. Mathematica: I use Mathematica fairly frequently when I want to get a quick sense of how functions behave or when I need to evaluate an integral I’ve forgotten how to solve by hand. (site)

19. Matlab: If I need to do a lot of basic number crunching involving matrices, I always use Matlab. Additionally, I tend to use it along with PsychToolBox and DotsX for coding experiments in neuroscience and psychology. (site)

20. MySQL: I always use MySQL as the database system for every dynamic web site I build. It works perfectly on Mac OS X these days, so I tend to demo things on my own machine before moving them off to a stand-alone server. (site)

21. NetNewsWire: My favorite RSS reader for the Mac. I can’t speak highly enough of NetNewsWire’s interface or the fact that the iPhone application is just as great as the desktop version. (site)

22. OpenOffice: Again, I’m too cheap to buy a copy of Office, so I use OpenOffice. It’s managed to serve me pretty well so far. It’s still a little lacking on the Mac, but it’s getting much better with time. (site)

23. Papers: My means for storing and organizing all of my PDF files. Think of it as iTunes for PDF’s. If you read journal articles, Papers will improve your life more than you could possibly expect. (site)

24. Perian: Perian will outfit your Quicktime player with almost all of the codecs you could want. Without it, I find Quicktime almost useless. (site)

25. Perl: The classic programming language needs no introduction, but I think it’s worth noting that you’re always better off building your own version of Perl and storing it in /usr/local/, where you won’t be able to destroy the version that OS X ships with by default. I’ve also found it nearly impossible to get many modules to build without some customization. (site)

26. Python: Again, I don’t think Python needs an introduction, but building your own copy seems like a very good idea to me. (site)

27. Quicksilver: A great tool for getting easy access to programs. I don’t use nearly as many of Quicksilver’s features as a lot of people do, but I find it really helpful to be able to avoid using Finder when I don’t need to. (site)

28. R: My language of choice for statistical computing and data analysis. Great tools for producing graphs and an amazing set of facilities for any statistical computation you could ever want to perform. If you want to do statistics like a grown up statistician would, R is the way to go. (site)

29. ReadIris: My favorite OCR software for the Mac. I use this every time I want to copy a long section of text I’ve scanned. I invariably have to make corrections by hand, but that’s much faster than typing everything myself from scratch. Given how well ReadIris performs for me, I have high hopes that one day in my life we’ll see a properly Bayesian piece of OCR software that gets everything right. (site)

30. Ruby: Another programming language that needs no introduction, but which I’d recommend building from source and storing in /usr/local. (site)

31. ScreenFlow: The best screencasting software for the Mac I could find. Given the number of features, the quality of the interface and its relatively low cost, I doubt one could find something better for a few years to come. (site)

32. Scrivener: An amazing application that makes writing extended works (for me those are mostly translations) much, much easier. I don’t use it as often these days, but Scrivener is a brilliant tool if you do a lot of writing that can be broken into sections and outlined carefully. (site)

33. ScummVM: When I’m not working, I like to play some old LucasArts games. ScummVM makes that possible. (site)

34. Senuti: If I need to transfer a file off of an iPod (which iTunes makes impossible), Senuti is there for me. It was free for a long time, so I’m somewhat surprised to find that you’re supposed to pay for it now. (site)

35. Sequel Pro: My favorite database client system. The heir to the great CocoaMySQL application. A perfect complement to MySQL on the Mac. (site)

36. Skype: Who doesn’t use Skype as their VoIP program? (site)

37. TexLive: When I want documents to look clean, I always use LaTeX. TexLive is the current standard distribution of LaTeX for UNIX systems and it has some great tools specifically made for the Mac. (site)

38. TextMate: They claim it, and I agree: TextMate is Emacs for the 21st century. If you are young enough that you find GUI’s helpful and don’t think touching the mouse is a crime against nature, TextMate is the best text editor you will ever find. Every Rails person worth his salt is a TextMate user and there is an endless supply of bundles to customize TextMate for the language of your choice. (I’ve recently used it a lot with Matlab, R and Erlang.) (site)

39. The Unarchiver: If you’ve ever received a compressed file you couldn’t open, get The Unarchiver and your problems will be solved. Everything else is a waste of time and/or money. (site)

40. Twitterrific: My favorite Twitter client for the Mac. (site)

41. Unison: Probably my single favorite tool for the Mac. Unison lets me keep all of the files that matter to me in perfect sync between my laptop and my desktop. In practice, that amounts to a brilliant back-up system as well as making my life much easier when I do some work on my laptop and then some more work on my desktop. In the end, Unison seems to be the program destined to replace rsync one day. (site)

42. VLC: The system I always use to watch anything that doesn’t open in Quicktime with Perian installed. (site)

43. VMWare Fusion: Sometimes I need to run Windows or Linux. VMWare Fusion makes it incredibly easy to do so and runs both of those operating systems with remarkable efficiency. (site)

44. Zenmap: If I need to figure out the structure of the network I’m on, nmap is the tool for doing so. Zenmap provides nmap for the Mac and also a (sometimes) helpful GUI. (site)


Dec 26 2008

Again with the Null Hypothesis Significance Testing

As I was finishing reading “The Cult of Statistical Significance” yesterday, the following passage struck me as particularly important:

Rothman computed a p-value function — a continuous function of p-values mapped against a range of effect sizes. The range of effect sizes was here again measured by the relative risk ratio and includes both beneficial and nonbeneficial effects. He shows that another hypothesis, a fantastically beneficial risk ratio, RR = 4.1, shares the same p-value, .14, as the null, RR = 1.0 (2002, 125). This is common in medicine and all the sciences. To think that p-values have a 1-to-1 correspondence with a unique risk ratio is to ignore the symmetry of the p-function.1

I wish the symmetry of the distributions used for testing significance, especially the t distribution, would be emphasized to students during their introduction to statistics. We generally set up the comparison so that the t-value is positive and then ask whether we can reject the null hypothesis of zero difference between the means of two sets of observations. But it is always possible to test another null hypothesis, in which the difference between the means of the two groups is much larger than the difference we observed, and we will fail to reject that hypothesis every time we fail to reject the primary null hypothesis of zero difference. Yet we never test this hypothesis, despite there being no good mathematical reason for the omission. The only justification is an implicit Bayesian prior in defense of the null hypothesis rather than its doppelgänger hypothesis in which the difference is much larger than we have seen in practice. Is this implicit underweighting of the alternative null hypothesis really sound? That is an empirical question that is, unfortunately, not likely to be answered soon, but it suggests that conventional statistical practice may consistently underestimate the effects being examined using significance testing.
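To make the symmetry concrete, here is a small sketch in R using simulated data. The two samples and the seed are arbitrary choices made for illustration. It tests the usual null of zero difference and then a mirror-image null in which the true difference is exactly twice the observed one.

```r
set.seed(1)

# Two hypothetical groups of observations (simulated, not real data).
x <- rnorm(20, mean = 0.5)
y <- rnorm(20, mean = 0)
observed <- mean(x) - mean(y)

# Usual null hypothesis: the true difference in means is zero.
p.zero <- t.test(x, y, mu = 0)$p.value

# Mirror-image null: the true difference is twice the observed difference.
p.mirror <- t.test(x, y, mu = 2 * observed)$p.value

print(c(p.zero = p.zero, p.mirror = p.mirror))
```

The two p-values agree because the t statistics are equal in size and opposite in sign, and the two-sided test only looks at the size.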

Of course, this problem is itself tied to the erroneous conflation of a failure to reject the null hypothesis with its acceptance — with the result that statistically insignificant effects are treated as non-existent, rather than inconclusively determined by the data at hand. Statistically insignificant differences tend not to be tested empirically a second time, so that it is hard to know how often they are really larger than our first experiments suggested.

  1. Stephen T. Ziliak and Deirdre N. McCloskey: The Cult of Statistical Significance: On Drugs, Disability and Death

Dec 13 2008

Proving the Obvious and Understanding the Not-So-Obvious

Continuing on with my exploration of the National Survey of Drug Use and Health, I thought that I should calculate some simple conditional frequency statistics. The graph below strikes me as a very good example of how conditional probabilities play out in the real world. From it, you can see how the right piece of information can radically improve your ability to make guesses about the answer to another question.

Cigarettes and Cocaine.png

To quantify the pattern that you can see in the chart, only 4% of those who’ve tried cocaine have not also tried cigarettes at some point in their lives. In contrast, 49% of those who’ve tried cigarettes have never tried cocaine. In general, people are unlikely to try cocaine, but those who do are almost certain to have tried cigarettes as well. In other words, cocaine use tells you a lot about cigarette use, but cigarette use tells you effectively nothing about cocaine use. If you meet someone who’s tried cocaine, and you assume that they’ve also tried cigarettes, these statistics suggest that your assumption will be wrong less than 5% of the time.
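The asymmetry between the two conditional frequencies can be sketched with a small two-way table in R. The counts below are made up purely for illustration and are not the survey’s numbers; only the mechanics of conditioning on rows versus columns carry over.

```r
# Made-up counts for illustration only; these are NOT the survey's numbers.
counts <- matrix(c(150,   6,
                   450, 394),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(cocaine = c("tried", "never"),
                                 cigarettes = c("tried", "never")))

# P(cigarettes | cocaine): normalise within each cocaine row.
given.cocaine <- prop.table(counts, margin = 1)

# P(cocaine | cigarettes): normalise within each cigarette column.
given.cigarettes <- prop.table(counts, margin = 2)

print(round(given.cocaine, 3))
print(round(given.cigarettes, 3))
```

The same cell of the table gives very different answers depending on which margin it is divided by, which is why the two conditional statements are not interchangeable.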


Dec 11 2008

National Survey of Drug Use and Health

Lately, I’ve been exploring the data set that was recently released by the National Survey of Drug Use and Health. There’s enough raw data in it to spend months trying to make sense of it all. That said, for the moment I thought that I would simply post the following chart I generated using a very quick calculation of the relative frequencies of substance abuse broken down by substance.

Substance Abuse.png

The variables used in this analysis were ABUSEALC, ABUSECOC, ABUSEHAL, ABUSEHER, ABUSEINH, ABUSEMRJ, ABUSEANL, ABUSESED, ABUSESTM, ABUSETRN. The meanings of these variables are somewhat obscure, but my hope is that the definition of abuse is similar enough across substances to allow for a relative frequency analysis. For each substance, I counted the subjects classified as abusing it and then divided that count by the total number of subjects in the data set to find a frequency of abuse per substance.
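The calculation described above might look something like the following sketch in R. The data frame is a toy stand-in for the survey file, and the 0/1 coding of the ABUSE variables is an assumption; the real file may code them differently.

```r
# Toy data standing in for the survey file. The 0/1 coding is an assumption.
survey <- data.frame(
  ABUSEALC = c(1, 0, 0, 1, 0, 0, 0, 0),
  ABUSEMRJ = c(0, 0, 1, 0, 0, 0, 0, 0),
  ABUSECOC = c(0, 0, 0, 0, 0, 0, 0, 0)
)

# Count the subjects flagged for each substance, then divide by all subjects.
abuse.frequency <- colSums(survey == 1) / nrow(survey)

print(abuse.frequency)
```

Extending this to all ten variables would only require adding the remaining columns to the data frame.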