<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Myles White &#187; Statistics</title>
	<atom:link href="http://www.johnmyleswhite.com/notebook/category/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johnmyleswhite.com</link>
	<description>&#34;He who refuses to do arithmetic is doomed to talk nonsense.&#34;</description>
	<lastBuildDate>Wed, 26 Oct 2011 11:36:05 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>The Psychology of Music and the &#8216;tuneR&#8217; Package</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/10/25/the-psychology-of-music-and-the-tuner-package/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/10/25/the-psychology-of-music-and-the-tuner-package/#comments</comments>
		<pubDate>Wed, 26 Oct 2011 01:28:41 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Music]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4311</guid>
		<description><![CDATA[Introduction This semester I&#8217;m TA&#8217;ing a course on the Psychology of Music taught by Phil Johnson-Laird. It&#8217;s been a great course to teach because (i) so much of the material is new to me and (ii) because the study of the psychology of music brings together so many of the intellectual tools I enjoy, including [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>This semester I&#8217;m TA&#8217;ing a course on the <a href="http://psych.princeton.edu/psychology/research/johnson_laird/music.php">Psychology of Music</a> taught by <a href="http://psych.princeton.edu/psychology/research/johnson_laird/">Phil Johnson-Laird</a>. It&#8217;s been a great course to teach because (i) so much of the material is new to me and (ii) because the study of the psychology of music brings together so many of the intellectual tools I enjoy, including music theory, psychophysics and Fourier analysis.</p>
<p>One topic this semester that was completely new to me was the theory of tuning: I had known about the invention of the <a href="http://en.wikipedia.org/wiki/Well_temperament">well-tempered system of tuning</a>, but had never heard of <a href="http://en.wikipedia.org/wiki/Pythagorean_tuning">Pythagorean tuning</a> or <a href="http://en.wikipedia.org/wiki/Just_intonation">just tuning</a> &#8212; and certainly was not aware that the well-tempered system <a href="http://en.wikipedia.org/wiki/The_Well-Tempered_Clavier">Bach celebrated</a> was not identical to our current equal-tempered system of tuning.</p>
<p>As a way of consolidating some of the knowledge I&#8217;ve gained, I decided I&#8217;d write a blog entry after several months of neglecting this blog. (For that neglect, I&#8217;ll blame a combination of grant writing, book writing, ongoing research projects and personal life developments.) In what follows, I&#8217;ll give a brief overview of the theory of tuning at a level that should be accessible to anyone who&#8217;s familiar with the names of intervals and feels comfortable thinking quantitatively.</p>
<p>After surveying the field, I&#8217;ll turn to a discussion of some code I&#8217;ve written in R that implements these ideas using the &#8216;tuneR&#8217; package, which is one of my favorite hidden gems from CRAN. Along the way, I&#8217;ll introduce some of the simplest tools from the &#8216;tuneR&#8217; package that can be used for generating computer music.</p>
<h3>Tuning Systems: Pythagorean, Just and 12-Tet</h3>
<p>It&#8217;s worth noting right at the start that tuning is a misleading name for the topic we&#8217;ll be discussing: we&#8217;re not talking about how one tunes a fixed instrument so that it sounds in tune, but rather we&#8217;re interested in how one defines the very notes that the instrument should be able to produce when it&#8217;s perfectly in tune.</p>
<p>To make that clear, let&#8217;s assume that we&#8217;ve accepted as a given that a frequency of 440 Hz will be called A. Our problem then becomes one of deciding which of the infinitely many frequencies we could produce actually deserve the labels A#, B, C, C#, and so on.</p>
<h4>Pythagorean Tuning</h4>
<p>The simplest solution to this problem I know of is the <a href="http://en.wikipedia.org/wiki/Pythagorean_tuning">Pythagorean tuning system</a>. It&#8217;s based on constructing all of the possible notes using a series of perfect fifths. If you remember the <a href="http://en.wikipedia.org/wiki/Circle_of_fifths">Circle of Fifths</a>, you&#8217;ll remember that you can reach every chromatic note by ascending fifths: if you start at A, you&#8217;ll proceed through E, B, F# and so on.</p>
<p>The Pythagorean system implements the Circle of Fifths directly using repeated multiplication of a base frequency. To do this, you first declare that a perfect fifth is at a frequency 3/2 times your base frequency. For example, this definition implies that the perfect fifth above the A at 440 Hz has to be at a frequency of 3/2 * 440 = 660 Hz. Once you do this, you&#8217;ve defined the frequency we&#8217;ll call E.</p>
<p>Following the same logic, the fifth above E gives a B at 3/2 * 660 = 990 Hz. Of course, this B lies above the A at 880 Hz, outside the octave that starts at our base A of 440 Hz, so you transpose it down an octave to produce the B you&#8217;ll actually use. To do this, you need to assume that an octave is at a frequency 2 times the base frequency. Since we&#8217;ve accepted that 990 Hz is a B, we divide 990 by 2 and conclude that 495 Hz should be B.</p>
<p>With these three notes defined, we have the following table of frequency/note pairs:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio with 440 Hz</th>
</tr>
<tr>
<td>A</td>
<td>440 Hz</td>
<td>1</td>
</tr>
<tr>
<td>E</td>
<td>660 Hz</td>
<td>3/2</td>
</tr>
<tr>
<td>B</td>
<td>495 Hz</td>
<td>9/8</td>
</tr>
</table>
<p>If we continue on with this logic and calculate many more multiplications by 3/2 and divisions by 2, we will eventually produce a complete table for all of the notes in the chromatic scale that looks like the following:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio</th>
</tr>
<tr>
<td>A</td>
<td>440</td>
<td>1</td>
</tr>
<tr>
<td>A#</td>
<td>463.5391</td>
<td>256/243</td>
</tr>
<tr>
<td>B</td>
<td>495</td>
<td>9/8</td>
</tr>
<tr>
<td>C</td>
<td>521.4815</td>
<td>32/27</td>
</tr>
<tr>
<td>C#</td>
<td>556.875</td>
<td>81/64</td>
</tr>
<tr>
<td>D</td>
<td>586.6667</td>
<td>4/3</td>
</tr>
<tr>
<td>D#</td>
<td>626.4844</td>
<td>729/512</td>
</tr>
<tr>
<td>E</td>
<td>660</td>
<td>3/2</td>
</tr>
<tr>
<td>F</td>
<td>695.3086</td>
<td>128/81</td>
</tr>
<tr>
<td>F#</td>
<td>742.5</td>
<td>27/16</td>
</tr>
<tr>
<td>G</td>
<td>782.2222</td>
<td>16/9</td>
</tr>
<tr>
<td>G#</td>
<td>835.3125</td>
<td>243/128</td>
</tr>
<tr>
<td>A</td>
<td>880</td>
<td>2</td>
</tr>
</table>
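<p>As a rough sketch of how the table above can be reproduced in R: the ratios match what you get by stacking fifths from five below the base note to six above it and folding each result back into a single octave. The helper name <code>fold_into_octave</code> and the variable names are made up for this illustration, and the choice of five descending and six ascending fifths is an assumption inferred from the ratios in the table.</p>

```r
# Fold a frequency ratio into the octave [1, 2) by repeated doubling or halving.
fold_into_octave <- function(ratio) {
  ratio / 2^floor(log2(ratio))
}

# Stack fifths from 5 below the base note to 6 above it (an assumption that
# reproduces the ratios in the table above).
fifth_steps <- -5:6
ratios <- fold_into_octave((3 / 2)^fifth_steps)

# Note names in the order the fifths visit them, from A# (5 below) to D# (6 above).
names(ratios) <- c("A#", "F", "C", "G", "D", "A",
                   "E", "B", "F#", "C#", "G#", "D#")

# Sort into ascending pitch order and scale by the 440 Hz base.
pythagorean <- sort(ratios)
frequencies <- 440 * pythagorean
print(round(frequencies, 4))
```

<p>Dividing by <code>2^floor(log2(ratio))</code> is just a compact way of saying &#8220;halve or double until the ratio lands between 1 and 2&#8221;.</p>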
<p>One thing about this table might strike you as odd if you&#8217;re mathematically savvy: the octave, which we&#8217;ve defined by fiat as a ratio of 2:1, could never have been produced by successive multiplication by 3/2, since no power of 3 is ever equal to a power of 2. This is the one flub in the Pythagorean system: you can&#8217;t really produce the entire chromatic scale using only powers of 3/2. Here we&#8217;ve solved that problem by replacing the note we would have called A with a true octave generated using multiplication by 2. Because the near-octave you reach by stacking twelve fifths is slightly out of tune with our preferred definition of an octave, you may hear people refer to this discrepancy as <a href="http://en.wikipedia.org/wiki/Pythagorean_comma">the Pythagorean comma</a>.</p>
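<p>The size of that discrepancy is easy to check numerically. The sketch below compares twelve stacked fifths with seven octaves; the variable names are illustrative, and the conversion to cents (1200 times the base-2 logarithm of a ratio) is a standard unit that this post does not otherwise use.</p>

```r
# Twelve perfect fifths should land on the same note as seven octaves,
# but the two products are not quite equal.
twelve_fifths <- (3 / 2)^12
seven_octaves <- 2^7

# The Pythagorean comma is the ratio between them.
comma <- twelve_fifths / seven_octaves

# Express the ratio in cents, where 100 cents is one equal-tempered semitone.
comma_cents <- 1200 * log2(comma)

print(comma)
print(comma_cents)
```

<p>The ratio comes out a little above 1 (about 1.0136), which is a bit less than a quarter of a semitone.</p>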
<h4>Just Tuning</h4>
<p>Given that we had to cheat a bit to create a proper octave using the Pythagorean tuning system based on multiples of 3/2, it makes sense to ask why we shouldn&#8217;t just allow ourselves to use other multipliers than 3/2. Looking at the Pythagorean tuning table, we see some pretty ugly fractions like 729/512. What if we forced these fractions to be simpler by employing ratios like 4/3 and 5/4 to build up the whole system?</p>
<p>The result of allowing ourselves several fractions beyond just those derived from 3/2 is called the <a href="http://en.wikipedia.org/wiki/Just_intonation">just tuning system</a>. Here we assume that perfect fifths occur at a frequency ratio of 3/2 and that perfect fourths occur at a frequency ratio of 4/3. Continuing on with this process, we eventually end up with the following tuning table:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio</th>
</tr>
<tr>
<td>A</td>
<td>440</td>
<td>1</td>
</tr>
<tr>
<td>A#</td>
<td>469.3333</td>
<td>16/15</td>
</tr>
<tr>
<td>B</td>
<td>495</td>
<td>9/8</td>
</tr>
<tr>
<td>C</td>
<td>528</td>
<td>6/5</td>
</tr>
<tr>
<td>C#</td>
<td>550</td>
<td>5/4</td>
</tr>
<tr>
<td>D</td>
<td>586.6667</td>
<td>4/3</td>
</tr>
<tr>
<td>D#</td>
<td>625.7778</td>
<td>64/45</td>
</tr>
<tr>
<td>E</td>
<td>660</td>
<td>3/2</td>
</tr>
<tr>
<td>F</td>
<td>704</td>
<td>8/5</td>
</tr>
<tr>
<td>F#</td>
<td>733.3333</td>
<td>5/3</td>
</tr>
<tr>
<td>G</td>
<td>782.2222</td>
<td>16/9</td>
</tr>
<tr>
<td>G#</td>
<td>825</td>
<td>15/8</td>
</tr>
<tr>
<td>A</td>
<td>880</td>
<td>2</td>
</tr>
</table>
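<p>Since the just tuning table is defined entirely by its ratios, the frequency column can be recomputed in a couple of lines of R. This is only a sketch that copies the ratios from the table above; the variable names are illustrative.</p>

```r
# Ratios copied from the just tuning table, relative to the base note A.
just_ratios <- c("A" = 1, "A#" = 16 / 15, "B" = 9 / 8, "C" = 6 / 5,
                 "C#" = 5 / 4, "D" = 4 / 3, "D#" = 64 / 45, "E" = 3 / 2,
                 "F" = 8 / 5, "F#" = 5 / 3, "G" = 16 / 9, "G#" = 15 / 8)

# Multiply every ratio by the 440 Hz base to get the frequency of each note.
just_frequencies <- 440 * just_ratios
print(round(just_frequencies, 4))
```

<p>Changing the 440 to another base frequency rebuilds the whole system around a different note, which is exactly why this tuning favors the key it was built from.</p>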
<p>This is the tuning that early Classical music was written in. Looking at the table you can immediately appreciate the theoretical assertion that the relative dissonance of an interval is determined by the simplicity of the ratio of frequencies between the two notes: perfect fifths are 3/2 and major thirds are 5/4, while minor seconds are 16/15 and major sevenths are 15/8. This is one of the things I most enjoy about the theory of harmony: there&#8217;s a match between the aesthetics of fractions and the aesthetics of sounds that, for me, helps to justify my sense that certain fractions are more beautiful than others.</p>
<h4>12 Tet / Equal-Temperament</h4>
<p>Now, if you know the history of Bach&#8217;s Well-Tempered Clavier, you know that there is a problem with the just tuning system: it sounds great in the key you used as the base (here A), but it sounds a bit out of tune in other keys. The modern <a href="http://en.wikipedia.org/wiki/Equal_temperament">12-tet system</a> is the most recent approach to solving this problem: you assume the ratio between two notes a semitone apart (e.g. A to A# or A# to B) is always exactly the same. Since you&#8217;ll repeat this multiplication 12 times before reaching an octave, you can conclude that two notes that are a semitone apart must be separated by a ratio equal to the 12th root of 2. Building a tuning system using that ratio alone gives us our modern system of tuning, which is shown in the table below using the decimal expansion of the ratios instead of their representation as powers of the 12th root of 2:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio</th>
</tr>
<tr>
<td>A</td>
<td>440</td>
<td>1.000000</td>
</tr>
<tr>
<td>A#</td>
<td>466.1638</td>
<td>1.059463</td>
</tr>
<tr>
<td>B</td>
<td>493.8833</td>
<td>1.122462</td>
</tr>
<tr>
<td>C</td>
<td>523.2511</td>
<td>1.189207</td>
</tr>
<tr>
<td>C#</td>
<td>554.3653</td>
<td>1.259921</td>
</tr>
<tr>
<td>D</td>
<td>587.3295</td>
<td>1.334840</td>
</tr>
<tr>
<td>D#</td>
<td>622.2540</td>
<td>1.414214</td>
</tr>
<tr>
<td>E</td>
<td>659.2551</td>
<td>1.498307</td>
</tr>
<tr>
<td>F</td>
<td>698.4565</td>
<td>1.587401</td>
</tr>
<tr>
<td>F#</td>
<td>739.9888</td>
<td>1.681793</td>
</tr>
<tr>
<td>G</td>
<td>783.9909</td>
<td>1.781797</td>
</tr>
<tr>
<td>G#</td>
<td>830.6094</td>
<td>1.887749</td>
</tr>
<tr>
<td>A</td>
<td>880</td>
<td>2.000000</td>
</tr>
</table>
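<p>The equal-tempered table above follows directly from the 12th root of 2, so it can be regenerated with a short R sketch. The variable names here are illustrative rather than taken from any package.</p>

```r
# One equal-tempered semitone is the 12th root of 2.
semitone <- 2^(1 / 12)

# Ratios for 0 to 12 semitones above the base note.
steps <- 0:12
ratios <- semitone^steps

# Frequencies relative to A at 440 Hz, labelled with note names.
frequencies <- 440 * ratios
names(frequencies) <- c("A", "A#", "B", "C", "C#", "D", "D#",
                        "E", "F", "F#", "G", "G#", "A")

print(round(ratios, 6))
print(round(frequencies, 4))
```

<p>Twelve multiplications by the same ratio land on the octave at 880 Hz, up to floating-point rounding, which is the property the system was designed around.</p>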
<h3>Listening to the Results</h3>
<p>We&#8217;ve just described three ways to define the notes used in Western music. But how different do they sound? To answer that, I decided to produce a series of simple sine wave audio samples that were tuned using each of the three tuning systems. To produce those audio samples, I used the &#8216;tuneR&#8217; package, which I&#8217;ll describe now. Before you read on, you should install it from CRAN using the standard <code>install.packages('tuneR')</code> invocation.</p>
<h3>A tuneR Tutorial</h3>
<p>The <a href="http://cran.r-project.org/web/packages/tuneR/index.html">tuneR</a> package is an extremely convenient tool for generating audio files from R based on a numeric description of the audio stream. For the purposes of this discussion of tuning systems, we simply need to produce basic sine waves. Thankfully, that&#8217;s very easy to do with tuneR. Here&#8217;s an example:</p>

<div class="wp_codebox"><table><tr id="p43115"><td class="code" id="p4311code5"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'440.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Here we&#8217;ve loaded the tuneR package, created a 1s snippet of sine wave audio at 16 bits resolution using the <code>sine</code> function, and then written out the audio to a WAV file using <code>writeWave</code>. If you look at your current directory and listen to this file, you&#8217;ll hear a sine wave at 440 Hz.</p>
<p>If you want to explore the use of <code>sine</code>, you can easily play with the duration of the sound by changing the <code>duration</code> parameter. If you want to, you can also change the sample rate and the bit depth, but I don&#8217;t see any reason to do that while exploring ideas about tuning.</p>
<p>More important is knowing that you can superimpose two sine waves using the <code>`+`</code> operator and that you can concatenate them using the <code>bind</code> function. To show off producing octaves, for example, you might use the following code to hear an A at 440 Hz, then an A an octave above it, and finally the harmony they produce together:</p>

<div class="wp_codebox"><table><tr id="p43116"><td class="code" id="p4311code6"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> bind<span style="color: #009900;">&#40;</span>sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'octaves.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Unfortunately, this sample code produces an error because of the naive addition we&#8217;ve implemented using the <code>`+`</code> operator. Adding two sine waves directly together can push the summed samples beyond the range that the 16-bit depth we&#8217;re using can represent. To safely perform addition of two sine waves, we need to normalize the results of our summation using the <code>normalize</code> function. This gives us just one more line of code:</p>

<div class="wp_codebox"><table><tr id="p43117"><td class="code" id="p4311code7"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> bind<span style="color: #009900;">&#40;</span>sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> normalize<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> unit <span style="color: #339933;">=</span> <span style="color: #ff0000;">'16'</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'octaves.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>For reasons that are not clear to me, you have to specify the bit depth to <code>normalize</code> using the <code>unit</code> parameter rather than the <code>bit</code> parameter.</p>
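<p>To see why a plain sum of two sine waves needs rescaling at all, here is a base-R sketch of the arithmetic. It does not use tuneR and is not a description of how <code>normalize</code> is implemented; dividing by the peak is simply one assumed way of bringing the sum back into range, and all of the variable names are illustrative.</p>

```r
# One second of sample times at a 44.1 kHz sample rate.
sample_rate <- 44100
t <- (0:(sample_rate - 1)) / sample_rate

# Two unit-amplitude sine waves an octave apart.
a440 <- sin(2 * pi * 440 * t)
a880 <- sin(2 * pi * 880 * t)

# Each wave stays within [-1, 1], but their sum does not.
mixed <- a440 + a880
peak <- max(abs(mixed))
print(peak)

# Rescale by the peak so the loudest sample sits exactly at full scale.
rescaled <- mixed / peak
print(range(rescaled))
```

<p>The peak of the sum is well above 1, so any fixed-range sample format would clip or overflow without a rescaling step of some kind.</p>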
<h3>Demoing Tuning Systems</h3>
<p>Our little octave demo is cute, but we really want to know what more interesting harmonies like major thirds and minor seconds sound like in the various tuning systems we described. To do that, I first wrote a function called <code>interval</code> that spits out the multiplier you need to use to produce a given interval for any of the three tuning systems. That function is in a <a href="https://github.com/johnmyleswhite/computer_music">GitHub repository</a> I&#8217;ve set up with code for making these demos. If you download that repository, you could load my <code>interval</code> function using a simple call to <code>source</code> like the one seen below. And using this <code>interval</code> function, we can generate demos of various intervals as follows:</p>

<div class="wp_codebox"><table><tr id="p43118"><td class="code" id="p4311code8"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
source<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'interval.R'</span><span style="color: #009900;">&#41;</span>
&nbsp;
base <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">440</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> sine<span style="color: #009900;">&#40;</span>base<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> sine<span style="color: #009900;">&#40;</span>interval<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'minor-second'</span><span style="color: #339933;">,</span>
                                    tuning <span style="color: #339933;">=</span> <span style="color: #ff0000;">'pythagorean'</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> base<span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> normalize<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> unit <span style="color: #339933;">=</span> <span style="color: #ff0000;">'16'</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'minor_second_pythagorean.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>On GitHub there&#8217;s a file called <code>test_intervals.R</code> that will go through and generate all of the intervals in all three tuning systems. If you run that file, you&#8217;ll generate a lot of audio files you can listen to as demos of the three tuning systems we&#8217;ve described. For me, these tuning systems all produce intervals that sound surprisingly similar, though at high volumes I find it moderately easy to hear slight differences between the tuning systems. That said, I very much doubt I would pick up on them in a normal musical context.</p>
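<p>One way to put numbers on how close the three systems are is to convert the ratios from the tables above into cents, where 100 cents is one equal-tempered semitone. The sketch below does this for the major third and the perfect fifth; the <code>cents</code> helper and the variable names are made up for this illustration and are not part of the GitHub repository.</p>

```r
# Convert a frequency ratio to cents: 1200 cents per octave.
cents <- function(ratio) {
  1200 * log2(ratio)
}

# Ratios for the major third (A to C#) in each tuning system.
major_third <- c(pythagorean = 81 / 64, just = 5 / 4, equal = 2^(4 / 12))

# Ratios for the perfect fifth (A to E) in each tuning system.
perfect_fifth <- c(pythagorean = 3 / 2, just = 3 / 2, equal = 2^(7 / 12))

print(round(cents(major_third), 1))
print(round(cents(perfect_fifth), 1))
```

<p>The major thirds come out at roughly 408, 386 and 400 cents, so the largest gap is about a fifth of a semitone, while the fifths differ by only about 2 cents. That fits with the intervals sounding similar unless you listen closely.</p>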
<p>That&#8217;s the end of my little introduction to tuning systems and the use of the tuneR package to explore them. If you&#8217;re interested in thinking computationally about music, I highly recommend playing around with tuneR until you feel like you can produce interesting results. I&#8217;m already working on trying to build up some interesting timbres to work with.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/10/25/the-psychology-of-music-and-the-tuner-package/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Twitter Math Puzzle and Solution</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/07/07/twitter-math-puzzle-and-solution/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/07/07/twitter-math-puzzle-and-solution/#comments</comments>
		<pubDate>Thu, 07 Jul 2011 13:48:07 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4302</guid>
		<description><![CDATA[Yesterday I posted a very simple math puzzle to Twitter that I found in Jonathan Baron&#8217;s book, Thinking and Deciding. The puzzle is the following: Show that every number of the form ABC,ABC is divisible by 13. The puzzle comes up in Baron&#8217;s book as an example of an &#8220;insight problem&#8221; in which one goes [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I posted a very simple math puzzle to Twitter that I found in Jonathan Baron&#8217;s book, <a href="http://amzn.to/npM5Uk">Thinking and Deciding</a>. The puzzle is the following:</p>
<blockquote><p>
Show that every number of the form ABC,ABC is divisible by 13.
</p></blockquote>
<p>The puzzle comes up in Baron&#8217;s book as an example of an &#8220;insight problem&#8221; in which one goes from not knowing the answer at all to knowing the complete answer in a sudden moment of insight.</p>
<p>Several people replied to my tweet with solutions: I especially like <a href="https://twitter.com/#!/willtownes/status/88735472028876800">Will Townes&#8217;s</a> solution. In particular, if you&#8217;re familiar with <a href="http://en.wikipedia.org/wiki/Modular_arithmetic">modular arithmetic</a>, I like the logic of Will&#8217;s answer because it gives a simple generalization. First, represent ABC,ABC as ABC * 1000 + ABC * 1 rather than as ABC * 1001. Then notice that</p>
<ol>
<li>1 = 1 mod 13</li>
<li>1000 = -1 mod 13</li>
</ol>
<p>Thus ABC,ABC = ABC * -1 + ABC * 1 = 0 mod 13. This logic can be easily extended to show that (ABC,ABC,)*ABC,ABC = 0 mod 13 no matter how many times you repeat the ABC,ABC pattern.</p>
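<p>If you would rather check the claim by brute force, the following R sketch tests every three-digit block, including blocks with leading zeros, and then the repeated pattern with four blocks. The variable names are illustrative.</p>

```r
# Every possible three-digit block ABC, from 000 to 999.
abc <- 0:999

# Build ABC,ABC as ABC * 1000 + ABC * 1.
abcabc <- abc * 1000 + abc

# 1000 leaves a remainder of 12 when divided by 13, which is -1 mod 13.
print(1000 %% 13)

# Every number of the form ABC,ABC is divisible by 13.
print(all(abcabc %% 13 == 0))

# The same holds when the ABC,ABC pattern is repeated: ABC,ABC,ABC,ABC.
repeated <- abc * (1000^3 + 1000^2 + 1000 + 1)
print(all(repeated %% 13 == 0))
```

<p>The powers of 1000 alternate between -1 and 1 mod 13, so an even number of blocks always cancels out in pairs.</p>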
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/07/07/twitter-math-puzzle-and-solution/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Visualizing Periodic Data</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/06/28/visualizing-periodic-data/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/06/28/visualizing-periodic-data/#comments</comments>
		<pubDate>Tue, 28 Jun 2011 18:27:03 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4298</guid>
		<description><![CDATA[Yesterday the Princeton machine learning reading group went through a paper by Tukey on &#8220;Some graphic and semigraphic displays&#8221;. One issue we talked about at length was Tukey&#8217;s idiosyncratic approach to visualizing periodic data in a circular format to emphasize the connections between the &#8220;start&#8221; and the &#8220;end&#8221; of the data set. Allison Chaney pointed [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday the Princeton machine learning reading group went through a paper by Tukey on <a href="http://www.edwardtufte.com/tufte/tukey">&#8220;Some graphic and semigraphic displays&#8221;</a>. One issue we talked about at length was Tukey&#8217;s idiosyncratic approach to visualizing periodic data in a circular format to emphasize the connections between the &#8220;start&#8221; and the &#8220;end&#8221; of the data set.</p>
<p>Allison Chaney pointed out that many fields (for instance, environmental engineering) might want to consider using these circular displays to make periodic trends clear to the viewer. That inspired me to try plotting periodic weather data using both a standard x-y plane display and a polar coordinates display. The results are shown below in two videos that I&#8217;ve uploaded to Vimeo:</p>
<div style="text-align:center;"><iframe src="http://player.vimeo.com/video/25716170?title=0&amp;byline=0&amp;portrait=0" width="400" height="300" frameborder="0"></iframe>
<p><a href="http://vimeo.com/25716170">Visualizing Periodic Data: NYC Weather from 1995 to 2008</a> from <a href="http://vimeo.com/user698502">John Myles White</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
</div>
<div style="text-align:center;"><iframe src="http://player.vimeo.com/video/25717081?title=0&amp;byline=0&amp;portrait=0" width="400" height="300" frameborder="0"></iframe>
<p><a href="http://vimeo.com/25717081">Visualizing Periodic Data: NYC Weather from 1995 to 2008 (Take 2)</a> from <a href="http://vimeo.com/user698502">John Myles White</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
</div>
<p>There&#8217;s a clear tradeoff that&#8217;s being made when choosing between these two approaches: the polar coordinates plot, as promised, correctly connects the two &#8220;ends&#8221; of the data set. But it also makes it much harder to see the height of the graph at each point in time, so that the sinusoidal shape that can easily be seen in the x-y plane display is basically hidden in the polar coordinates display.</p>
<p>Since making these videos, it occurred to me that another potential visualization technique would be to project the data onto a cylinder, rather than a plane, and then progressively rotate the cylinder to reveal the time trend. This would allow heights to be seen properly, while emphasizing the periodicity. The problem with this cylindrical projection is that the entire data set is never fully visible at one time, but can only be seen by completing a full rotation of the data.</p>
<p>In his paper, Tukey describes one other approach: draw the periodic data twice so that the period is clearly visible. It wasn&#8217;t clear to me how to do this without some numeric hacks in ggplot2, so I&#8217;ll leave it to the reader to search for Tukey&#8217;s example in <a href="http://www.edwardtufte.com/tufte/tukey">the original paper</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/06/28/visualizing-periodic-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ProjectTemplate News</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/06/25/projecttemplate-news/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/06/25/projecttemplate-news/#comments</comments>
		<pubDate>Sat, 25 Jun 2011 16:43:14 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4294</guid>
		<description><![CDATA[The news below was recently reported on the ProjectTemplate mailing list. For completeness, I&#8217;m also reporting it here. The first piece of ProjectTemplate news is that I won&#8217;t be the exclusive maintainer for ProjectTemplate anymore. Allen Goodman, who works at BankSimple, is now my co-maintainer and he has full commit privileges. In the next few [...]]]></description>
			<content:encoded><![CDATA[<p>The news below was recently reported on the ProjectTemplate mailing list. For completeness, I&#8217;m also reporting it here.</p>
<ul>
<li>The first piece of ProjectTemplate news is that I won&#8217;t be the exclusive maintainer for ProjectTemplate anymore. Allen Goodman, who works at BankSimple, is now my co-maintainer and he has full commit privileges. In the next few months, the emerging group with commit privileges is likely to grow beyond the two of us, but hopefully just having one more person in charge of ProjectTemplate&#8217;s development will help to keep things moving forward.</li>
<li>There&#8217;s a <a href="https://github.com/johnmyleswhite/ProjectTemplate">new draft of ProjectTemplate available on GitHub</a>. v0.3-1 fixes problems with the YAML configuration system not working on Windows 64 machines by switching over to the DCF format that R naturally supports. Editing your configuration scripts should be trivial, but be prepared for ProjectTemplate to break on your existing v0.2-1 projects until you&#8217;ve updated them to use DCF instead of YAML.</li>
<li>In addition to switching the configuration system over to DCF, ProjectTemplate v0.3-1 now uses namespaces and separate functions to implement all of the automatic data loading functions that were previously nested inside of <code>load.project()</code>. Hopefully this will make it easier for end users to override ProjectTemplate&#8217;s defaults, while allowing ProjectTemplate releases to automatically roll out bug fixes to less advanced users. On that note, the list of supported file formats for automatic data loading is growing and new patches on that front are always welcome.</li>
<li>A minimal project format: Some people have asked for the option to create projects without some of the clutter that the standard project format creates, such as the diagnostics and profiling directories. There&#8217;s now a minimal project format that you can use by calling <code>create.project(minimal = TRUE)</code>.</li>
<li>Starting in two weeks, the version of ProjectTemplate available on CRAN will keep pace with the version on GitHub. If you&#8217;re still using v0.1-3, please consider upgrading or forking.</li>
<li>There is now an official ProjectTemplate website at <a href="http://projecttemplate.net/">http://projecttemplate.net/</a> that will hopefully be the start of a new era of better documentation for ProjectTemplate. While the material on the site is still in noticeably draft form, I expect the documentation to improve considerably in the near future. If anyone out there is a graphic designer and would like to make the new site look better, please let me know by e-mailing me at <a href="mailto:jmw@johnmyleswhite.com">jmw@johnmyleswhite.com</a>.</li>
</ul>
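<p>For anyone who hasn&#8217;t met DCF before: it&#8217;s the plain <code>key: value</code> format used by R&#8217;s <code>DESCRIPTION</code> files, and base R can parse it with <code>read.dcf()</code>. The sketch below only shows the general idea; the field names are illustrative assumptions and not necessarily the settings that ProjectTemplate itself reads.</p>

```r
# Write a tiny DCF file to a temporary location.
# NOTE: the field names below are hypothetical examples.
config.file <- tempfile(fileext = '.dcf')
writeLines(c('data_loading: on',
             'munging: on',
             'logging: off'),
           config.file)

# read.dcf() returns a character matrix with one row per record
# and one column per field.
config <- read.dcf(config.file)

# Turn the single record into a named list for easier access.
settings <- as.list(config[1, ])
print(settings$logging)
```

<p>Because every value comes back as a character string, any code that consumes a configuration like this has to decide for itself how to interpret values such as <code>on</code> and <code>off</code>.</p>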
<p>For now that&#8217;s all, but there&#8217;s more ProjectTemplate news coming soon. Stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/06/25/projecttemplate-news/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Speeding Up MLE Code in R</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/06/18/speeding-up-mle-code-in-r/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/06/18/speeding-up-mle-code-in-r/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 00:02:29 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4264</guid>
		<description><![CDATA[Recently, I&#8217;ve been fitting some models from the behavioral economics literature to choice data. Most of these models amount to non-linear variants of logistic regression in which I want to infer the parameters of a utility function. Because several of these models aren&#8217;t widely used, I&#8217;ve had to write my own maximum likelihood code to [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I&#8217;ve been fitting some models from the behavioral economics literature to choice data. Most of these models amount to non-linear variants of logistic regression in which I want to infer the parameters of a utility function. Because several of these models aren&#8217;t widely used, I&#8217;ve had to write my own maximum likelihood code to estimate the parameters of these models.</p>
<p>In the process, I&#8217;ve started to learn something about how to write code that runs quickly in R. In this post, I&#8217;ll try to share some of that knowledge by describing three ways of performing maximum likelihood estimation in R whose runtimes differ by two orders of magnitude. The differences seem to depend upon two factors: (1) how I access the entries of a data frame and (2) whether I use loops or vectorized operations to perform basic arithmetic.</p>
<p>To simplify things, I&#8217;ll present a model that should be familiar to people with a background in economics: the exponentially discounted utility model. To implement it in R, we define the discounted value of <code>x</code> dollars at time <code>t</code> as:</p>

<div class="wp_codebox"><table><tr id="p426416"><td class="line_numbers"><pre>1
2
3
4
</pre></td><td class="code" id="p4264code16"><pre class="c" style="font-family:monospace;">discounted.<span style="color: #202020;">value</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>x<span style="color: #339933;">,</span> t<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">*</span> delta <span style="color: #339933;">^</span> t<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In addition to the discounted utility model, we assume that choices originate from a stochastic choice model with logistic noise. To invert this noise during inference, we&#8217;ll use the inverse logit transform:</p>

<div class="wp_codebox"><table><tr id="p426417"><td class="line_numbers"><pre>1
2
3
4
</pre></td><td class="code" id="p4264code17"><pre class="c" style="font-family:monospace;">invlogit <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>z<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span> <span style="color: #339933;">/</span> <span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span> <span style="color: #339933;">+</span> exp<span style="color: #009900;">&#40;</span><span style="color: #339933;">-</span>z<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>To test my inference routine, I need to generate &#8220;stochastic&#8221; data of the sort you would expect to see from an exponentially discounting agent that&#8217;s indifferent between having $1 at time t = 0 and $3 at time t = 1. I&#8217;ll refer to the first good as (X1, T1) and the second good as (X2, T2). If the agent chooses (X2, T2), I&#8217;ll write that as <code>C == 1</code>; if they choose (X1, T1), I&#8217;ll write that as <code>C == 0</code>. With those conventions, the sample data is generated as:</p>

<div class="wp_codebox"><table><tr id="p426418"><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code" id="p4264code18"><pre class="c" style="font-family:monospace;">n <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">100</span>
&nbsp;
choices <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>X1 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      T1 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      X2 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">3</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      T2 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      C <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> times <span style="color: #339933;">=</span> n <span style="color: #339933;">/</span> <span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>
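
<p>Under this model, indifference between $1 at t = 0 and $3 at t = 1 pins down the discount factor: 1 &#215; delta^0 = 3 &#215; delta^1 gives delta = 1/3. At that value the utility difference is zero, so the predicted probability of choosing (X2, T2) is 0.5 whatever the value of <code>a</code>, which is consistent with sample data in which half of the choices are <code>C == 1</code>. A minimal check, with the three values of <code>a</code> picked arbitrarily for illustration:</p>

```r
discounted.value <- function(x, t, delta)
{
  return(x * delta ^ t)
}

invlogit <- function(z)
{
  return(1 / (1 + exp(-z)))
}

# Discount factor implied by indifference between (1, 0) and (3, 1).
delta <- 1 / 3

u1 <- discounted.value(1, 0, delta)
u2 <- discounted.value(3, 1, delta)

# Arbitrary values of a, chosen only to show that p does not depend on a here.
a.values <- c(0.5, 1, 10)
p <- invlogit(a.values * (u2 - u1))
print(p)
```

<p>One consequence worth keeping in mind is that this particular data set should not be able to pin down <code>a</code>: any value of <code>a</code> fits equally well once delta is 1/3.</p>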

<p>To fit the exponential model to this data set, we&#8217;ll use the <code>optim</code> function to minimize the negative log likelihood of the data by adjusting two parameters: <code>a</code>, which scales the utility difference before it passes through the inverse logit and so controls how noisy choices are (larger values of <code>a</code> mean less noise); and <code>delta</code>, the discount factor in the discounted utility model. The three implementations of this model that I&#8217;ll show differ only in the definition of the log likelihood function, so the final call to <code>optim</code> to perform maximum likelihood estimation is the same across all examples:</p>

<div class="wp_codebox"><table><tr id="p426419"><td class="line_numbers"><pre>1
2
3
4
5
6
</pre></td><td class="code" id="p4264code19"><pre class="c" style="font-family:monospace;">logit.<span style="color: #202020;">estimator</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span> 
  wrapper <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><span style="color: #339933;">-</span>log.<span style="color: #202020;">likelihood</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> x<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> x<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span>
  optimization.<span style="color: #202020;">results</span> <span style="color: #339933;">&lt;-</span> optim<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> wrapper<span style="color: #339933;">,</span> method <span style="color: #339933;">=</span> <span style="color: #ff0000;">'L-BFGS-B'</span><span style="color: #339933;">,</span> lower <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> upper <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span>Inf<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>optimization.<span style="color: #202020;">results</span>$par<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>Here, I had to specify bounds for the parameters, <code>a</code> and <code>delta</code>, because it&#8217;s assumed that <code>a</code> must be positive and that <code>delta</code> must lie in the interval [0, 1]. To deal with these bounds, one has to use the L-BFGS-B method in <code>optim</code>.</p>
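<p>To see how box constraints behave in <code>optim</code> in isolation, here is a toy sketch that reuses the same bounds on a simple quadratic. The objective function is an assumption made up for this illustration and has nothing to do with the choice model; its unconstrained minimum sits at (2, -1), outside the allowed region.</p>

```r
# A made-up objective whose unconstrained minimum is at (2, -1).
objective <- function(x)
{
  return((x[1] - 2) ^ 2 + (x[2] + 1) ^ 2)
}

# Same bounds as in the estimator: first parameter in [0, Inf),
# second parameter in [0, 1].
results <- optim(c(0.5, 0.5),
                 objective,
                 method = 'L-BFGS-B',
                 lower = c(0, 0),
                 upper = c(Inf, 1))

# The second parameter cannot go below 0, so the solution should
# end up near (2, 0) rather than at (2, -1).
print(results$par)
print(results$convergence)
```

<p>A <code>convergence</code> code of 0 indicates that <code>optim</code> believes it finished successfully; checking it is a cheap safeguard before trusting <code>par</code>.</p>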
<p>The first implementation I&#8217;ll show is the one I find most natural to write, even though it turns out to be the least efficient by far:</p>

<div class="wp_codebox"><table><tr id="p426420"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="code" id="p4264code20"><pre class="c" style="font-family:monospace;">log.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  ll <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">0</span>
&nbsp;
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>nrow<span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    u2 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'X2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'T2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
    u1 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'X1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'T1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
&nbsp;
    p <span style="color: #339933;">&lt;-</span> invlogit<span style="color: #009900;">&#40;</span>a <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>u2 <span style="color: #339933;">-</span> u1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'C'</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">==</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#123;</span>
      ll <span style="color: #339933;">&lt;-</span> ll <span style="color: #339933;">+</span> log<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span>
    <span style="color: #b1b100;">else</span>
    <span style="color: #009900;">&#123;</span>
      ll <span style="color: #339933;">&lt;-</span> ll <span style="color: #339933;">+</span> log<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span> <span style="color: #339933;">-</span> p<span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>ll<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In the second implementation, I define a row-level likelihood function that I call through <code>apply</code>, so that the logarithmic transform and the summation are vectorized:</p>

<div class="wp_codebox"><table><tr id="p426421"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code" id="p4264code21"><pre class="c" style="font-family:monospace;">rowwise.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>row<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  u2 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'X2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'T2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  u1 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'X1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'T1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  p <span style="color: #339933;">&lt;-</span> invlogit<span style="color: #009900;">&#40;</span>a <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>u2 <span style="color: #339933;">-</span> u1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>ifelse<span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'C'</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">==</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> p<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span> <span style="color: #339933;">-</span> p<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
log.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  likelihoods <span style="color: #339933;">&lt;-</span> apply<span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>rowwise.<span style="color: #202020;">likelihood</span><span style="color: #009900;">&#40;</span>row<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>sum<span style="color: #009900;">&#40;</span>log<span style="color: #009900;">&#40;</span>likelihoods<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In the third implementation, I define a fully vectorized log likelihood function that avoids any explicit iteration and therefore removes most of the data frame indexing operations:</p>

<div class="wp_codebox"><table><tr id="p426422"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code" id="p4264code22"><pre class="c" style="font-family:monospace;">log.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  u2 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices$X2<span style="color: #339933;">,</span> choices$T2<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  u1 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices$X1<span style="color: #339933;">,</span> choices$T1<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  p <span style="color: #339933;">&lt;-</span> invlogit<span style="color: #009900;">&#40;</span>a <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>u2 <span style="color: #339933;">-</span> u1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  likelihoods <span style="color: #339933;">&lt;-</span> ifelse<span style="color: #009900;">&#40;</span>choices$C <span style="color: #339933;">==</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> p<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span> <span style="color: #339933;">-</span> p<span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>sum<span style="color: #009900;">&#40;</span>log<span style="color: #009900;">&#40;</span>likelihoods<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>The code I used to call all of these implementations and compare them is up on <a href="https://github.com/johnmyleswhite/fastR">GitHub</a> for those interested. The results, which strike me as remarkable, are below:</p>
<ol>
<li>On my laptop, implementation 1 takes ~1.0 second to run.</li>
<li>On my laptop, implementation 2 takes ~0.25 seconds to run.</li>
<li>On my laptop, implementation 3 takes ~0.01 seconds to run.</li>
</ol>
<p>In short, the third implementation is 100x faster than the first implementation with only minor changes to the code I originally wrote. Hopefully this example will help inspire others who have R code they&#8217;d like to speed up, but aren&#8217;t sure where to start.</p>
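<p>Timings like these can be collected with <code>system.time()</code>. The following is a minimal sketch on a toy problem, not the benchmark code from the GitHub repository: it times a loop and a vectorized call that both sum the logs of a vector of probabilities. The vector length and the number of repetitions are arbitrary assumptions, and the actual timings will vary from machine to machine.</p>

```r
set.seed(1)

# Toy input: 1,000 probabilities kept away from 0 and 1.
p <- runif(1000, min = 0.01, max = 0.99)

# Loop version: accumulate one element at a time.
loop.sum <- function(p)
{
  total <- 0
  for (i in seq_along(p))
  {
    total <- total + log(p[i])
  }
  return(total)
}

# Vectorized version: a single call over the whole vector.
vectorized.sum <- function(p)
{
  return(sum(log(p)))
}

# Repeat each call so that the elapsed time is large enough to measure.
reps <- 200
loop.time <- system.time(for (r in 1:reps) loop.sum(p))['elapsed']
vectorized.time <- system.time(for (r in 1:reps) vectorized.sum(p))['elapsed']

print(c(loop = unname(loop.time), vectorized = unname(vectorized.time)))
```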
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/06/18/speeding-up-mle-code-in-r/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Norvig and the Nature of Modern Science</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/05/27/norvig-and-the-nature-of-modern-science/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/05/27/norvig-and-the-nature-of-modern-science/#comments</comments>
		<pubDate>Fri, 27 May 2011 15:20:36 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Citations]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4260</guid>
		<description><![CDATA[In this, Chomsky is in complete agreement with O&#8217;Reilly. (I recognize that the previous sentence would have an extremely low probability in a probabilistic model trained on a newspaper or TV corpus.)1 Anyone who considers themself an intellectual should be required to read this new essay by Peter Norvig. It&#8217;s the best summary I&#8217;ve ever [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
In this, Chomsky is in complete agreement with O&#8217;Reilly. (I recognize that the previous sentence would have an extremely low probability in a probabilistic model trained on a newspaper or TV corpus.)<sup><a href="http://www.johnmyleswhite.com/notebook/2011/05/27/norvig-and-the-nature-of-modern-science/#footnote_0_4260" id="identifier_0_4260" class="footnote-link footnote-identifier-link" title="On Chomsky and the Two Cultures of Statistical Learning">1</a></sup>
</p></blockquote>
<p>Anyone who considers themself an intellectual should be required to read this new essay by Peter Norvig. It&#8217;s the best summary I&#8217;ve ever seen of the many types of science that now exist in our world &#8212; almost all of which are moving away from the simple algebraic, deterministic models of the world that fill high school science textbooks.</p>
<ol class="footnotes"><li id="footnote_0_4260" class="footnote"><a href="http://norvig.com/chomsky.html">On Chomsky and the Two Cultures of Statistical Learning</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/05/27/norvig-and-the-nature-of-modern-science/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Problems with ggplot2 0.8.9 and R 2.13.0 on Mac OS X via plyr 1.5</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/04/14/problems-with-ggplot2-0-8-9-and-r-2-13-0-on-mac-os-x-via-plyr-1-5/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/04/14/problems-with-ggplot2-0-8-9-and-r-2-13-0-on-mac-os-x-via-plyr-1-5/#comments</comments>
		<pubDate>Fri, 15 Apr 2011 02:15:54 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Mac OS X]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4254</guid>
		<description><![CDATA[This morning I tried to completely update my R installation. I first dumped a list of all the packages I have on my system using the installed.packages() function. Then I installed R 2.13.0 using the OS X disk image. And finally I reinstalled all of my packages from scratch. Unfortunately, I ran into some serious [...]]]></description>
			<content:encoded><![CDATA[<p>This morning I tried to completely update my R installation. I first dumped a list of all the packages I have on my system using the <code>installed.packages()</code> function. Then I installed R 2.13.0 using the OS X disk image. And finally I reinstalled all of my packages from scratch.</p>
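<p>The dump-and-reinstall routine described above can be sketched roughly as follows. The file location is a hypothetical placeholder (in practice it should live somewhere that survives the upgrade), and the final <code>install.packages()</code> call is left commented out because it needs network access.</p>

```r
# Hypothetical location for the saved package list.
package.file <- tempfile(fileext = '.txt')

# Before upgrading: record the names of all installed packages.
saved.packages <- rownames(installed.packages())
writeLines(saved.packages, package.file)

# After upgrading: read the list back and work out what is missing.
wanted.packages <- readLines(package.file)
missing.packages <- setdiff(wanted.packages, rownames(installed.packages()))

# Reinstall whatever is missing (requires a CRAN mirror and network access).
# install.packages(missing.packages)
print(length(missing.packages))
```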
<p>Unfortunately, I ran into some serious problems along the way. After installing everything from scratch, &#8216;ggplot2&#8217; 0.8.9 was broken. Specifically, I couldn&#8217;t get error bars to work with <code>stat_summary()</code>. For example, this code wouldn&#8217;t work on my system:</p>

<div class="wp_codebox"><table><tr id="p425425"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code" id="p4254code25"><pre class="c" style="font-family:monospace;"><span style="color: #339933;"># Problem with ggplot2 Version &quot;0.8.9&quot;</span>
&nbsp;
library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'ggplot2'</span><span style="color: #009900;">&#41;</span>
&nbsp;
set.<span style="color: #202020;">seed</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
&nbsp;
example.<span style="color: #202020;">data</span> <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>Measurement <span style="color: #339933;">=</span> rnorm<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">5</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> Class <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'A'</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">5</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
ggplot<span style="color: #009900;">&#40;</span>example.<span style="color: #202020;">data</span><span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> Class<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> Measurement<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  stat_summary<span style="color: #009900;">&#40;</span>fun.<span style="color: #202020;">data</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">'mean_cl_boot'</span><span style="color: #339933;">,</span> geom <span style="color: #339933;">=</span> <span style="color: #ff0000;">'bar'</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  stat_summary<span style="color: #009900;">&#40;</span>fun.<span style="color: #202020;">data</span> <span style="color: #339933;">=</span> <span style="color: #ff0000;">'mean_cl_boot'</span><span style="color: #339933;">,</span> geom <span style="color: #339933;">=</span> <span style="color: #ff0000;">'errorbar'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Thankfully, I managed to enlist <a href="http://dirk.eddelbuettel.com/blog/">Dirk Eddelbuettel&#8217;s</a> help through Twitter and he ran the code on his own recently updated system. Things worked fine for him, which suggested that the problem was in my system configuration. We compared package versions and discovered that he had &#8216;plyr&#8217; 1.5.1 on his Ubuntu machine, while I had &#8216;plyr&#8217; 1.5 on my OS X machine. A look at CRAN made it clear that the Mac OS X binary of &#8216;plyr&#8217; 1.5.1 wasn&#8217;t available yet.</p>
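<p>A version comparison like this one can also be done from within R using <code>packageVersion()</code>. The helper below, <code>has.min.version</code>, is a hypothetical convenience function written for this illustration; it returns <code>FALSE</code> when the package isn&#8217;t installed at all.</p>

```r
# Hypothetical helper: is pkg installed at min.version or later?
has.min.version <- function(pkg, min.version)
{
  if (!(pkg %in% rownames(installed.packages())))
  {
    return(FALSE)
  }

  # package_version() parses the string so that 1.5.1 sorts after 1.5.
  return(packageVersion(pkg) >= package_version(min.version))
}

# The result depends on what is installed on the machine running this.
print(has.min.version('plyr', '1.5.1'))
```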
<p>To fix this, I grabbed the source for &#8216;plyr&#8217; 1.5.1 and tried to install it myself. That led to the following error:</p>

<div class="wp_codebox"><table><tr id="p425426"><td class="line_numbers"><pre>1
2
</pre></td><td class="code" id="p4254code26"><pre class="sh" style="font-family:monospace;">** preparing package for lazy loading
Error: package 'plyr' is required by 'reshape' so will not be detached</pre></td></tr></table></div>

<p>The problem was that &#8216;reshape&#8217; was being loaded automatically when R was starting up. Since &#8216;reshape&#8217; depends on &#8216;plyr&#8217;, R wasn&#8217;t willing to overwrite my old &#8216;plyr&#8217; 1.5 with the new &#8216;plyr&#8217; 1.5.1. The solution was to edit my <code>.Rprofile</code> file to prevent &#8216;reshape&#8217; from being autoloaded. Once I did this, I was able to run the standard <code>R CMD INSTALL</code> and get the new version of &#8216;plyr&#8217; on my system. And after that &#8216;ggplot2&#8217; 0.8.9 started working properly.</p>
<p>Hopefully no one else will come up against the same issue after the binary for &#8216;plyr&#8217; 1.5.1 gets pushed through all of the CRAN mirrors. But if you get errors while using &#8216;ggplot2&#8217; 0.8.9, look into installing &#8216;plyr&#8217; 1.5.1 from source on your system.</p>
<p>Many thanks to Dirk for giving me so much help today.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/04/14/problems-with-ggplot2-0-8-9-and-r-2-13-0-on-mac-os-x-via-plyr-1-5/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Request for Foursquare Data</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/03/25/a-request-for-foursquare-data/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/03/25/a-request-for-foursquare-data/#comments</comments>
		<pubDate>Sat, 26 Mar 2011 03:06:01 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4240</guid>
		<description><![CDATA[[UPDATE 3/28/2011: Fixed an enormous bug in the R code.] I&#8217;m trying to collect data sets that showcase how the classical statistical distributions appear in modern contexts. I&#8217;ve already got some data that shows how the gamma distribution appears in video game scores, and now I&#8217;m hoping to find an example where the exponential distribution [...]]]></description>
			<content:encoded><![CDATA[<p>[UPDATE 3/28/2011: Fixed an enormous bug in the R code.]</p>
<p>I&#8217;m trying to collect data sets that showcase how the classical statistical distributions appear in modern contexts. I&#8217;ve <a href="http://www.johnmyleswhite.com/notebook/2011/03/16/canabalt-revisited-gamma-distributions-multinomial-distributions-and-more-jags-goodness/">already got some data</a> that shows how the <a href="http://en.wikipedia.org/wiki/Gamma_distribution">gamma distribution</a> appears in video game scores, and now I&#8217;m hoping to find an example where the <a href="http://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a> shows up. I think that checkins for Foursquare might be a good place to start.</p>
<p>To test this intuition, I&#8217;m hoping to collect some pilot data. Below you&#8217;ll find some code that you can use to help me gather data.</p>
<p>First, there&#8217;s a shell command to gather your own checkin data from Foursquare. To use it, you need to substitute your e-mail address where EMAIL appears and your password where PASSWORD appears in the code below:</p>

<div class="wp_codebox"><table><tr id="p424029"><td class="line_numbers"><pre>1
</pre></td><td class="code" id="p4240code29"><pre class="sh" style="font-family:monospace;">curl -u 'EMAIL:PASSWORD' 'https://api.foursquare.com/v1/history?l=250' &gt; checkin_history.xml</pre></td></tr></table></div>

<p>And second there&#8217;s an R script you can use to preprocess the data from the last step into a nice format before sending it to me. If you&#8217;re not an R user, you can easily skip this step and send the data you have in its raw XML format.</p>

<div class="wp_codebox"><table><tr id="p424030"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code" id="p4240code30"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'plyr'</span><span style="color: #009900;">&#41;</span>
library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'XML'</span><span style="color: #009900;">&#41;</span>
filename <span style="color: #339933;">&lt;-</span> <span style="color: #ff0000;">'checkin_history.xml'</span>
tree <span style="color: #339933;">&lt;-</span> xmlTreeParse<span style="color: #009900;">&#40;</span>filename<span style="color: #339933;">,</span> asTree <span style="color: #339933;">=</span> TRUE<span style="color: #009900;">&#41;</span>
checkins <span style="color: #339933;">&lt;-</span> tree$doc$children$checkins
venue.<span style="color: #202020;">names</span> <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
latitudes <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
longitudes <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>length<span style="color: #009900;">&#40;</span>checkins<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  venue.<span style="color: #202020;">names</span> <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span>venue.<span style="color: #202020;">names</span><span style="color: #339933;">,</span> as.<span style="color: #202020;">character</span><span style="color: #009900;">&#40;</span>checkins<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>$checkin<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'venue'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'name'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'text'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">6</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
  latitudes <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span>latitudes<span style="color: #339933;">,</span> as.<span style="color: #202020;">numeric</span><span style="color: #009900;">&#40;</span>unclass<span style="color: #009900;">&#40;</span>checkins<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>$checkin<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'venue'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'geolat'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'text'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>$value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  longitudes <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span>longitudes<span style="color: #339933;">,</span> as.<span style="color: #202020;">numeric</span><span style="color: #009900;">&#40;</span>unclass<span style="color: #009900;">&#40;</span>checkins<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span>$checkin<span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'venue'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'geolong'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'text'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>$value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
checkin.<span style="color: #202020;">data</span> <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>Venue <span style="color: #339933;">=</span> factor<span style="color: #009900;">&#40;</span>venue.<span style="color: #202020;">names</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> Latitude <span style="color: #339933;">=</span> as.<span style="color: #202020;">numeric</span><span style="color: #009900;">&#40;</span>latitudes<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> Longitude <span style="color: #339933;">=</span> as.<span style="color: #202020;">numeric</span><span style="color: #009900;">&#40;</span>longitudes<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
count.<span style="color: #202020;">data</span> <span style="color: #339933;">&lt;-</span> ddply<span style="color: #009900;">&#40;</span>checkin.<span style="color: #202020;">data</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'Venue'</span><span style="color: #339933;">,</span> nrow<span style="color: #009900;">&#41;</span>
names<span style="color: #009900;">&#40;</span>count.<span style="color: #202020;">data</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Venue'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'TotalCheckins'</span><span style="color: #009900;">&#41;</span>
write.<span style="color: #202020;">csv</span><span style="color: #009900;">&#40;</span>count.<span style="color: #202020;">data</span><span style="color: #339933;">,</span> file <span style="color: #339933;">=</span> <span style="color: #ff0000;">'count_data.csv'</span><span style="color: #339933;">,</span> row.<span style="color: #202020;">names</span> <span style="color: #339933;">=</span> FALSE<span style="color: #009900;">&#41;</span></pre></td></tr></table></div>
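<p>The counting step at the end of the script doesn&#8217;t strictly need &#8216;plyr&#8217;. Here is a minimal sketch of the same tabulation in base R, using made-up venue names in place of the vector built by the loop above:</p>
<pre>
# Made-up venue names standing in for the real checkin history.
venue.names &lt;- c('Brooklyn Boulders', 'Cafe Example', 'Brooklyn Boulders')

# table() counts how often each venue occurs; as.data.frame() turns the
# counts into a two-column data frame with the column names used above.
count.data &lt;- as.data.frame(table(Venue = venue.names),
                            responseName = 'TotalCheckins',
                            stringsAsFactors = FALSE)
print(count.data)
</pre>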

<p>After running these two pieces of code, the output file, <code>count_data.csv</code>, should look like this:</p>
<table>
<tr>
<th>Venue</th>
<th>TotalCheckins</th>
</tr>
<tr>
<td>&#8220;Brooklyn Boulders&#8221;</td>
<td>13</td>
</tr>
<tr>
<td>&#8230;</td>
<td>&#8230;</td>
</tr>
</table>
<p>Once you&#8217;ve got data, you can send it to me by e-mail at <a href="mailto:jmw@johnmyleswhite.com">jmw@johnmyleswhite.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/03/25/a-request-for-foursquare-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A 3D Version of R&#8217;s curve() Function</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/03/21/a-3d-version-of-rs-curve-function/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/03/21/a-3d-version-of-rs-curve-function/#comments</comments>
		<pubDate>Mon, 21 Mar 2011 22:05:45 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4228</guid>
		<description><![CDATA[I like exploring the behavior of functions of a single variable using the curve() function in R. One thing that seems to be missing from R&#8217;s base functions is a tool for exploring functions of two variables. I asked for examples of such a function on Twitter today and didn&#8217;t get any answers, so I [...]]]></description>
			<content:encoded><![CDATA[<p>I like exploring the behavior of functions of a single variable using the <code>curve()</code> function in R. One thing that seems to be missing from R&#8217;s base functions is a tool for exploring functions of two variables.</p>
<p>I asked for examples of such a function on Twitter today and didn&#8217;t get any answers, so I decided to build my own. As I see it, there are two ways to visualize a function of two variables:</p>
<ol>
<li>Use a 3D surface.</li>
<li>Use a heatmap.</li>
</ol>
<p>But 3D surfaces aren&#8217;t currently available in ggplot2, so I decided to work with heatmaps. The function below provides a very simple implementation of my 3D version of <code>curve()</code>, which I call <code>curve3D()</code>:</p>

<div class="wp_codebox"><table><tr id="p422832"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code" id="p4228code32"><pre class="c" style="font-family:monospace;">curve3D <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>f<span style="color: #339933;">,</span> from.<span style="color: #202020;">x</span><span style="color: #339933;">,</span> to.<span style="color: #202020;">x</span><span style="color: #339933;">,</span> from.<span style="color: #202020;">y</span><span style="color: #339933;">,</span> to.<span style="color: #202020;">y</span><span style="color: #339933;">,</span> n <span style="color: #339933;">=</span> <span style="color: #0000dd;">101</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	x.<span style="color: #202020;">seq</span> <span style="color: #339933;">&lt;-</span> seq<span style="color: #009900;">&#40;</span>from.<span style="color: #202020;">x</span><span style="color: #339933;">,</span> to.<span style="color: #202020;">x</span><span style="color: #339933;">,</span> length.<span style="color: #202020;">out</span> <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span>
	y.<span style="color: #202020;">seq</span> <span style="color: #339933;">&lt;-</span> seq<span style="color: #009900;">&#40;</span>from.<span style="color: #202020;">y</span><span style="color: #339933;">,</span> to.<span style="color: #202020;">y</span><span style="color: #339933;">,</span> length.<span style="color: #202020;">out</span> <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span>
	eval.<span style="color: #202020;">points</span> <span style="color: #339933;">&lt;-</span> expand.<span style="color: #202020;">grid</span><span style="color: #009900;">&#40;</span>x.<span style="color: #202020;">seq</span><span style="color: #339933;">,</span> y.<span style="color: #202020;">seq</span><span style="color: #009900;">&#41;</span>
	names<span style="color: #009900;">&#40;</span>eval.<span style="color: #202020;">points</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'x'</span><span style="color: #339933;">,</span> <span style="color: #ff0000;">'y'</span><span style="color: #009900;">&#41;</span>
	eval.<span style="color: #202020;">points</span> <span style="color: #339933;">&lt;-</span> transform<span style="color: #009900;">&#40;</span>eval.<span style="color: #202020;">points</span><span style="color: #339933;">,</span> z <span style="color: #339933;">=</span> apply<span style="color: #009900;">&#40;</span>eval.<span style="color: #202020;">points</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>r<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>f<span style="color: #009900;">&#40;</span>r<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'x'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> r<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'y'</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
	p <span style="color: #339933;">&lt;-</span> ggplot<span style="color: #009900;">&#40;</span>eval.<span style="color: #202020;">points</span><span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> x<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> y<span style="color: #339933;">,</span> fill <span style="color: #339933;">=</span> z<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> geom_tile<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
	print<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>
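<p>The row-by-row <code>apply()</code> call above works for any <code>f</code>, but when <code>f</code> is vectorized the whole grid can be filled in a single call. A minimal sketch under that assumption; the helper name <code>eval.grid</code> is hypothetical:</p>
<pre>
# Hypothetical helper: evaluate a vectorized f(x, y) on an n-by-n grid.
eval.grid &lt;- function(f, from.x, to.x, from.y, to.y, n = 101)
{
  x.seq &lt;- seq(from.x, to.x, length.out = n)
  y.seq &lt;- seq(from.y, to.y, length.out = n)
  # outer() calls f once with every pairing of the grid coordinates.
  list(x = x.seq, y = y.seq, z = outer(x.seq, y.seq, f))
}

grid &lt;- eval.grid(function(x, y) x * y, 0, 1, 0, 1)
dim(grid$z)
</pre>
<p>The <code>x</code>, <code>y</code> and <code>z</code> components can be handed to base R&#8217;s <code>image(grid$x, grid$y, grid$z)</code> for a heatmap or to <code>persp(grid$x, grid$y, grid$z)</code> for a surface.</p>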

<p>Here&#8217;s an example of the use of <code>curve3D</code> to explore the behavior of <a href="http://ideas.repec.org/a/tpr/qjecon/v107y1992i2p573-97.html">Loewenstein and Prelec&#8217;s Generalized Hyperbolic discounting function</a>:</p>
<pre>
g &lt;- function(x, y) {(1 + y * 2) ^ (-x / y) * (1 + y * 1) ^ (x / y)}

curve3D(g, from.x = 0.01, to.x = 1, from.y = 0.01, to.y = 1)
</pre>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2011/03/example.png" alt="example.png" border="0" width="480" height="480" /></div>
<p>I&#8217;d love suggestions for cleaning this function up. Two obvious improvements are:</p>
<ol>
<li>Allow the function to accept arbitrary expressions and not just functions as inputs.</li>
<li>Allow the user to see 3D surfaces or heatmaps.</li>
</ol>
<p>I suspect that the first problem would be a great way to learn about metaprogramming in R, especially R&#8217;s methods for quoting, parsing and deparsing expressions.</p>
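<p>As a starting point for the first improvement, here is a minimal sketch of capturing an unevaluated expression with <code>substitute()</code> and turning it into a function of <code>x</code> and <code>y</code>. The helper name <code>as.surface.function</code> is hypothetical, and the sketch assumes the expression only uses the names <code>x</code> and <code>y</code> plus whatever is visible where it was written.</p>
<pre>
# Hypothetical helper: wrap an expression in x and y as a two-argument function.
as.surface.function &lt;- function(expr)
{
  captured &lt;- substitute(expr)   # the expression itself, not its value
  env &lt;- parent.frame()          # where any other variables are looked up
  function(x, y) eval(captured, list(x = x, y = y), env)
}

g &lt;- as.surface.function((1 + y * 2) ^ (-x / y) * (1 + y * 1) ^ (x / y))
g(1, 1)
</pre>
<p>The resulting <code>g</code> can be passed to <code>curve3D()</code> exactly like a hand-written function.</p>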
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/03/21/a-3d-version-of-rs-curve-function/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Canabalt Revisited: Gamma Distributions, Multinomial Distributions and More JAGS Goodness</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/03/16/canabalt-revisited-gamma-distributions-multinomial-distributions-and-more-jags-goodness/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/03/16/canabalt-revisited-gamma-distributions-multinomial-distributions-and-more-jags-goodness/#comments</comments>
		<pubDate>Thu, 17 Mar 2011 00:30:45 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4213</guid>
		<description><![CDATA[Introduction Neil Kodner recently got me interested again in analyzing Canabalt scores statistically by writing a great post in which he compared the average scores across iOS devices. Thankfully, Neil&#8217;s made his code and data freely available, so I&#8217;ve been revising my original analyses using his new data whenever I can find a free minute. [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p><a href="http://www.neilkodner.com/">Neil Kodner</a> recently got me interested again in <a href="http://www.johnmyleswhite.com/notebook/2009/11/12/canabalt/">analyzing Canabalt scores statistically</a> by writing <a href="http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/">a great post</a> in which he compared the average scores across iOS devices. Thankfully, Neil&#8217;s made his <a href="https://github.com/neilkod/canabalt">code and data</a> freely available, so I&#8217;ve been revising my original analyses using his new data whenever I can find a free minute.</p>
<p>Returning to Canabalt has been a lot of fun, especially because my grasp of statistical theory is a lot stronger now than it was when I published <a href="http://www.johnmyleswhite.com/notebook/2009/11/15/the-top-scores-for-canabalt-take-2/">my last post on Canabalt scores</a>. For example, I actually know now what I was trying to say when I publicly described my search for a <a href="http://en.wikipedia.org/wiki/Poisson_distribution">Poisson distribution</a> in the posted scores. At the time, I had just read about <a href="http://www.ats.ucla.edu/stat/r/dae/poissonreg.htm">Poisson regressions</a> and was therefore eager to fit Poisson models to the scores data, even though the Poisson model gave very poor results. In retrospect, it&#8217;s clear that I was misled by superficial similarities in statistical terminology. What I was really looking for in the data was something closer to a <a href="http://en.wikipedia.org/wiki/Poisson_process">Poisson process</a> than to a Poisson distribution. Unfortunately, I didn&#8217;t really understand Poisson processes at the time I wrote my original posts, so I only succeeded in showing that Canabalt scores could not reasonably be claimed to be Poisson distributed.</p>
<h3>Generating Process</h3>
<p>But now that I have more data and more experience, it&#8217;s easy to see what I was struggling to articulate before: Canabalt scores seem to be generated by something like a truncated Poisson process. From my current perspective, the generating process for a Canabalt score is essentially the following:</p>
<ol>
<li>Initialize the player&#8217;s score to zero.</li>
<li>Iterate the following steps repeatedly:
<ol>
<li>Draw the number of meters before the next obstacle from a Poisson distribution. Add this value to the player&#8217;s current score.</li>
<li>Draw the identity of the next obstacle from a multinomial distribution.</li>
<li>For every type of obstacle, there is a constant probability p that a player will die when they come up against an instance of that type of obstacle. Given the value of p, draw a Bernoulli variable with probability p of coming up heads. If the result is heads, the player dies and their score is the current value of score, which is therefore simply the sum of several Poisson variables. If the result is tails, go back to the top of this loop.</li>
</ol>
</li>
</ol>
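<p>The steps above can be sketched as a small simulation. All of the parameter values here (the Poisson mean, the obstacle frequencies and the per-obstacle death probabilities) are made up for illustration and are not estimated from the Canabalt data.</p>
<pre>
# Simulate one score from the generating process described above.
simulate.score &lt;- function(lambda, obstacle.probs, death.probs)
{
  score &lt;- 0
  repeat
  {
    # Meters covered before the next obstacle appears.
    score &lt;- score + rpois(1, lambda)
    # Type of the next obstacle (a single multinomial draw).
    obstacle &lt;- sample(length(obstacle.probs), 1, prob = obstacle.probs)
    # The player dies with the probability attached to that obstacle type.
    if (rbinom(1, 1, death.probs[obstacle]) == 1)
    {
      return(score)
    }
  }
}

set.seed(1)
scores &lt;- replicate(1000, simulate.score(lambda = 50,
                                         obstacle.probs = c(0.5, 0.3, 0.2),
                                         death.probs = c(0.05, 0.10, 0.20)))
summary(scores)
</pre>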
<p>This intuition, unfortunately, isn&#8217;t trivial to express as a model that I can quickly fit to the data I have. I haven&#8217;t tried very systematically to build a formal version of this model because there seem to be obvious problems of identification. For example, one can imagine lowering all of the obstacle-specific probabilities described above by a constant proportion and decreasing the mean of the underlying Poisson distribution to compensate, leaving the distribution of outcome scores essentially unchanged. My intuition tells me that the only way around this would be to exploit the variance of the data and the restriction of the Poisson distribution to integer values, but I haven&#8217;t pushed very hard on this. If others are interested in pursuing it, I think there&#8217;s an interesting open problem here.</p>
<p>Thankfully, there&#8217;s another approach to modeling the data that&#8217;s simpler. As <a href="http://www.econinfo.de/">Owe Jessen</a> noted in a comment on <a href="http://www.johnmyleswhite.com/notebook/2009/11/15/the-top-scores-for-canabalt-take-2/">my earlier post</a>, the distribution of scores looks like one parameterization of the gamma distribution. When Owe made this suggestion originally, I tried to use <code>fitdistr</code> from the <code>MASS</code> package to fit a gamma model to the scores data, but was never able to get the code to work properly. But now that I&#8217;m reasonably fluent in BUGS, it&#8217;s quite easy to fit a gamma distribution to the empirical scores data.</p>
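<p>For a quick sanity check that needs neither <code>fitdistr</code> nor BUGS, the gamma parameters can be estimated by the method of moments. This is a swapped-in technique, not the approach used in the rest of this post, and the scores below are simulated stand-ins with made-up true values.</p>
<pre>
# Simulated scores stand in for the real data.
set.seed(1)
scores &lt;- rgamma(5000, shape = 2, rate = 0.001)

# For a gamma distribution, mean = shape / rate and variance = shape / rate^2,
# so rate = mean / variance and shape = mean * rate.
rate.hat &lt;- mean(scores) / var(scores)
shape.hat &lt;- mean(scores) * rate.hat
c(shape = shape.hat, rate = rate.hat)
</pre>
<p>Estimates like these may also be useful as starting values, for example through the <code>start</code> argument of <code>fitdistr</code>.</p>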
<h3>Gamma Distribution</h3>
<p>Using JAGS, fitting a gamma distribution to the scores data is basically a trivial problem that requires only a few lines of code:</p>

<div class="wp_codebox"><table><tr id="p421337"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code" id="p4213code37"><pre class="c" style="font-family:monospace;">model
<span style="color: #009900;">&#123;</span>
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>N<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		score<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>shape<span style="color: #339933;">,</span> rate<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	shape ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
	rate ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>Invoking this BUGS model from R using code that I&#8217;ve put on <a href="https://github.com/johnmyleswhite/bayesian_canabalt">GitHub</a> allows us to estimate the shape and rate parameters for a gamma distribution in a few minutes. The resulting parameterization of the gamma distribution looks very similar to the empirical density that we can estimate using a KDE:</p>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2011/03/density_comparison1.png" alt="density_comparison.png" border="0" width="480" height="480" /></div>
<p>Beyond visual comparisons, we can formally test the fit of the gamma model using a <a href="http://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test">K/S test</a>. The K/S test tells us that we should reject the gamma model, but it&#8217;s a fairly weak rejection at p = 0.005. Given that we have over a thousand data points, I think this weak rejection suggests that the gamma model is not such a bad approximation to the true score distribution.</p>
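<p>A minimal sketch of running this kind of test with base R&#8217;s <code>ks.test</code>. The scores are simulated and the fitted values are hypothetical placeholders rather than the estimates from the model above. Note that the reported p-value is only approximate when the parameters were estimated from the same data being tested.</p>
<pre>
set.seed(1)
scores &lt;- rgamma(1000, shape = 2, rate = 0.001)   # made-up stand-in data

shape.hat &lt;- 2       # hypothetical fitted shape
rate.hat &lt;- 0.001    # hypothetical fitted rate

# Compare the empirical distribution with the fitted gamma CDF.
result &lt;- ks.test(scores, 'pgamma', shape = shape.hat, rate = rate.hat)
result$statistic
result$p.value
</pre>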
<p>Where the gamma model seems to noticeably fail is in the tail of the scores distribution:</p>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2011/03/tail_density_comparison1.png" alt="tail_density_comparison.png" border="0" width="480" height="480" /></div>
<p>To exaggerate the differences here, I&#8217;ve used a square root transform on the y axis so that we can see the bumps in the estimated density plot that are missing from the theoretical gamma model.</p>
<p>Since generating these images, I&#8217;ve had a chance to read a bit about heavy tailed distributions, but haven&#8217;t yet tried fitting any of them to this data set. I&#8217;ll probably start with a Pareto distribution, though I&#8217;d really like to find a discrete heavy tailed distribution over the natural numbers rather than a continuous distribution over the non-negative reals.</p>
<p>Looking further past the gamma model&#8217;s misfit in the tails, there&#8217;s another reason that I like the gamma model: the gamma distribution has an origins story that has some points of connection to the generative model I outlined above. Specifically, adding together a fixed number of independent exponential variables that share a common rate will give you a gamma distribution (also called an <a href="http://en.wikipedia.org/wiki/Erlang_distribution">Erlang distribution</a> in this context, where the shape parameter is an integer). While I&#8217;m skeptical that the distribution of meters between obstacles can be reasonably treated as if it were an exponential distribution, the summation origin for the gamma distribution is a nice point of connection to my intuitions about a data generating mechanism that behaves like a Poisson process.</p>
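<p>The summation property is easy to check by simulation. In this sketch the number of summed variables and their rate are made up; the simulated mean and variance are compared with the gamma values of k / rate and k / rate^2.</p>
<pre>
set.seed(1)
k &lt;- 5          # number of exponential gaps summed (made up)
rate &lt;- 0.02    # common rate of each gap (made up)
n &lt;- 10000      # number of simulated totals

# Each row holds k independent exponential draws; row sums are the totals.
totals &lt;- rowSums(matrix(rexp(n * k, rate = rate), ncol = k))

# Compare simulated moments with those of a gamma(shape = k, rate = rate).
c(simulated = mean(totals), theoretical = k / rate)
c(simulated = var(totals), theoretical = k / rate^2)
</pre>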
<h3>Hierarchical Gamma Model</h3>
<p>The gamma model has another point in its favor: it&#8217;s easy to write down a hierarchical model that fits a distinct gamma distribution to subsets of the original data. By fitting multiple gamma distributions, we can easily make comparisons between the estimated score distributions for the different devices that Neil analyzed. As Neil showed, there are enough rows in the current data to do this in a principled way across devices without resorting to gamma distributions, but a hierarchical model provides us with tools for thinking about comparisons across different types of deaths, where we don&#8217;t have enough data to use density estimates or other non-parametric methods.</p>
<p>Without a distributional model, I&#8217;d be skeptical about estimating differences between groups with such unequal sample sizes. (Even with a model, I don&#8217;t think we can make strong conclusions about differences between all of the groups in this data set.) That said, within a hierarchical model, I feel more comfortable estimating several conditional distributions from a small data set, because the hierarchical model provides enough shrinkage to prevent us from arriving at extreme conclusions simply because some groups were undersampled in our data set. (Of course, shrinking the model parameters for small groups towards the global mean may lead us to miss differences that are real.)</p>
<p>All that said, the analyses I describe below implement a hierarchical model that I&#8217;ve used to estimate the mean and standard deviation of the score distribution for each of the iOS devices and each of the death types in our data set.</p>
<p>First, let&#8217;s model the expected score for a player who died because of each of the various possible obstacles they might come across. I&#8217;ll refer to these different types of deaths as the death types. My JAGS code for estimating the shape and rate parameters for each death type is below:</p>

<div class="wp_codebox"><table><tr id="p421338"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="code" id="p4213code38"><pre class="c" style="font-family:monospace;">model
<span style="color: #009900;">&#123;</span>
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>N<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		score<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>shape<span style="color: #009900;">&#91;</span>death<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> rate<span style="color: #009900;">&#91;</span>death<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>j in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>K<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		shape<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>alpha.<span style="color: #202020;">shape</span><span style="color: #339933;">,</span> beta.<span style="color: #202020;">shape</span><span style="color: #009900;">&#41;</span>
		rate<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>alpha.<span style="color: #202020;">rate</span><span style="color: #339933;">,</span> beta.<span style="color: #202020;">rate</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	alpha.<span style="color: #202020;">shape</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
	beta.<span style="color: #202020;">shape</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
&nbsp;
	alpha.<span style="color: #202020;">rate</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
	beta.<span style="color: #202020;">rate</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>As you can see, I estimate group-level hyperparameters that partially pool information across all of the death types; these hyperparameters are themselves given vague gamma priors in the last four lines of the model. The fit gives us the following estimates of the mean score for each death type, along with the estimated standard deviation:</p>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2011/03/death_hierarchical_gamma_1.png" alt="death_hierarchical_gamma_1.png" border="0" width="480" height="480" /></div>
<p>I&#8217;ve chosen to use means and standard deviations rather than the shape and rate parameters that we&#8217;re actually fitting, because the shape and rate parameters are on such different scales that only one of them can be visualized effectively at a time. There are simple formulas for translating between these parameterizations of the gamma distribution: the translation scheme can be found on <a href="http://en.wikipedia.org/wiki/Gamma_distribution">Wikipedia</a>.</p>
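<p>As a minimal sketch of that translation (written in Python rather than JAGS, with hypothetical function names and made-up parameter values), a gamma distribution with a given shape and rate has mean shape / rate and standard deviation sqrt(shape) / rate, and the mapping can be inverted:</p>

```python
import math


def gamma_mean_sd(shape, rate):
    """Convert (shape, rate) to (mean, standard deviation)."""
    mean = shape / rate
    sd = math.sqrt(shape) / rate
    return mean, sd


def gamma_shape_rate(mean, sd):
    """Convert (mean, standard deviation) back to (shape, rate)."""
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate


# Illustrative values only; these are not estimates from the Canabalt data.
mean, sd = gamma_mean_sd(shape=4.0, rate=0.002)
print(mean, sd)
print(gamma_shape_rate(mean, sd))
```

<p>Applying a conversion like this to each posterior draw of the shape and rate, rather than to their point estimates, would presumably be the way to carry the uncertainty over to the mean and standard deviation scale.</p>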
<p>Having estimated these parameters across death types, I can also approach the question that Neil addressed in his post by comparing the score distributions across iOS devices. The JAGS code to do so is almost identical to the code for estimating the gamma parameters across death types:</p>

<div class="wp_codebox"><table><tr id="p421339"><td class="code" id="p4213code39"><pre class="c" style="font-family:monospace;">model
<span style="color: #009900;">&#123;</span>
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>N<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		score<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>shape<span style="color: #009900;">&#91;</span>device<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> rate<span style="color: #009900;">&#91;</span>device<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>j in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>J<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		shape<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>alpha.<span style="color: #202020;">shape</span><span style="color: #339933;">,</span> beta.<span style="color: #202020;">shape</span><span style="color: #009900;">&#41;</span>
		rate<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> ~ dgamma<span style="color: #009900;">&#40;</span>alpha.<span style="color: #202020;">rate</span><span style="color: #339933;">,</span> beta.<span style="color: #202020;">rate</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	alpha.<span style="color: #202020;">shape</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
	beta.<span style="color: #202020;">shape</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
&nbsp;
	alpha.<span style="color: #202020;">rate</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
	beta.<span style="color: #202020;">rate</span> ~ dgamma<span style="color: #009900;">&#40;</span><span style="color:#800080;">0.0001</span><span style="color: #339933;">,</span> <span style="color:#800080;">0.0001</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In this case, the results tell a much simpler story about the differences between the three devices in our data set:</p>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2011/03/device_gamma_1.png" alt="device_gamma_1.png" border="0" width="480" height="480" /></div>
<h3>Multinomial Model</h3>
<p>As a closing point, there&#8217;s one more modeling project for which I&#8217;ve used JAGS on the new Canabalt data: I&#8217;ve tried to estimate the probability of suffering each type of death along with an indication of the uncertainty in our estimates. Of course, this is a simple problem to solve using maximum likelihood methods: you can just plug in the empirical frequencies. But, given my growing love for Bayesian methods and the use of credible intervals to summarize uncertainty, I decided that I would estimate the <a href="http://en.wikipedia.org/wiki/Multinomial_distribution">multinomial model</a> for death types using a <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet prior</a> centered on a uniform multinomial distribution. Here&#8217;s my JAGS code for estimating the multinomial model:</p>

<div class="wp_codebox"><table><tr id="p421340"><td class="code" id="p4213code40"><pre class="c" style="font-family:monospace;">model
<span style="color: #009900;">&#123;</span>
	deaths<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>K<span style="color: #009900;">&#93;</span> ~ dmulti<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>K<span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> N<span style="color: #009900;">&#41;</span>
	p<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>K<span style="color: #009900;">&#93;</span> ~ ddirch<span style="color: #009900;">&#40;</span>alpha<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>K<span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
&nbsp;
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>K<span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		alpha<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">1</span> <span style="color: #339933;">/</span> K
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>And here are the final results:</p>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2011/03/death_type_probabilities.png" alt="death_type_probabilities.png" border="0" width="480" height="480" /></div>
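<p>Because the Dirichlet prior is conjugate to the multinomial likelihood, the posterior for this particular model is also available in closed form: it is a Dirichlet distribution whose parameters are the prior parameters plus the observed counts. The following Python sketch is not part of the original analysis, and the counts in it are hypothetical; it compares the maximum likelihood estimates with the posterior means under the same alpha[i] = 1 / K prior:</p>

```python
def mle_probabilities(counts):
    """Maximum likelihood estimate: the empirical frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def dirichlet_posterior_means(counts, alpha):
    """Posterior mean of p under a Dirichlet(alpha) prior.

    The posterior is Dirichlet(alpha + counts), so the posterior mean of
    p[i] is (alpha[i] + counts[i]) / (sum(alpha) + sum(counts)).
    """
    posterior = [a + c for a, c in zip(alpha, counts)]
    total = sum(posterior)
    return [x / total for x in posterior]


# Hypothetical death counts for K = 4 death types (not the real data).
deaths = [60, 25, 10, 5]
K = len(deaths)
alpha = [1 / K] * K  # same prior as in the JAGS model above

print(mle_probabilities(deaths))
print(dirichlet_posterior_means(deaths, alpha))
```

<p>With this many observations the two sets of point estimates are close; what the Bayesian fit adds is the full posterior, from which credible intervals can be read off.</p>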
<h3>Conclusion</h3>
<p>It&#8217;s been a lot of fun coming back to this topic. I still want to understand the outliers in the data better, but I&#8217;ll leave that for another day. If you&#8217;re interested in exploring this data set yourself, I encourage you to go to the <a href="https://github.com/johnmyleswhite/bayesian_canabalt">GitHub repository I&#8217;ve set up</a> and explore the JAGS code that I&#8217;ve used to fit these models. There you can also find higher-resolution PDFs of the graphs that you see here, which are admittedly a bit hard to read at this resolution.</p>
<p>Finally, I&#8217;d like to thank Neil Kodner for having put so much work into collecting more Canabalt data and analyzing it. Essentially all of the work I&#8217;ve done here is just a Bayesian reformulation of the questions that Neil addressed already in <a href="http://www.neilkodner.com/2011/02/visualizations-of-canabalt-scores-scraped-from-twitter/">his own post</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/03/16/canabalt-revisited-gamma-distributions-multinomial-distributions-and-more-jags-goodness/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Review of R Graphs Cookbook</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/03/01/review-of-r-graphs-cookbook/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/03/01/review-of-r-graphs-cookbook/#comments</comments>
		<pubDate>Wed, 02 Mar 2011 03:31:40 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4206</guid>
		<description><![CDATA[The kind people at Packt Publishing recently asked me to review one of their newest R books: the R Graphs Cookbook. In general, I think pretty highly of the book: it provides a nice overview of the basic tools for visualizing data in R. If you&#8217;re just getting started with creating graphs in R, this [...]]]></description>
			<content:encoded><![CDATA[<p>The kind people at <a href="http://www.packtpub.com">Packt Publishing</a> recently asked me to review one of their newest R books: the <a href="http://link.packtpub.com/FQqhZX">R Graphs Cookbook</a>. In general, I think pretty highly of the book: it provides a nice overview of the basic tools for visualizing data in R. If you&#8217;re just getting started with creating graphs in R, this book could be a very valuable resource. It&#8217;s clearly targeted at beginners, though it does seem to assume at least a little familiarity with R&#8217;s basic data structures and control flow. Perhaps more importantly for potential readers, the book actually works through some comprehensible, extended examples, which makes it substantially more readable than the default R documentation. I&#8217;d probably have benefited from having this book when I was first starting to learn to program in R.</p>
<p>Another plus for the book is that it covers some of the graphing functionality that&#8217;s provided by the base, lattice and ggplot2 packages. That last bit is particularly important to me: I hope this book represents one of the first members of a new generation of books on R that will treat ggplot2 as a default tool for building graphs in R &#8212; if not the default tool. That said, I&#8217;d actually have been happy if the book had proselytized more strongly for ggplot2, even though I can imagine that it&#8217;s probably a good thing to expose beginner users to the traditional tools in R for visualizing data in addition to the newest favorite contender.</p>
<p>For intermediate users, I suspect that the most useful part of the book will be the discussion in the second chapter of some of the lower level graphic parameters that <code>par()</code> gives you access to. I&#8217;d never learned to use those features very effectively, so the chapter covering those details was particularly helpful for me. Also, there&#8217;s a later chapter on creating maps that&#8217;s quite nice, as well as a final chapter that&#8217;s focused on producing publication-ready graphics. Both of those chapters could be very useful for more advanced readers: for example, I knew absolutely nothing about working with fonts in R before reading those sections.</p>
<p>In short, the <a href="http://link.packtpub.com/FQqhZX">R Graphs Cookbook</a> seems like a useful book to have around for most R hackers and a potentially very valuable resource for new programmers. If you&#8217;re interested in finding a starting point for learning to build graphs in R, I&#8217;d suggest considering this book.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/03/01/review-of-r-graphs-cookbook/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Modern Science and the Bayesian-Frequentist Controversy</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/02/14/modern-science-and-the-bayesian-frequentist-controversy/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/02/14/modern-science-and-the-bayesian-frequentist-controversy/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 18:14:49 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Citations]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4202</guid>
		<description><![CDATA[The Bayesian-Frequentist debate reflects two different attitudes to the process of doing science, both quite legitimate. Bayesian statistics is well-suited to individual researchers, or a research group, trying to use all the information at its disposal to make the quickest possible progress. In pursuing progress, Bayesians tend to be aggressive and optimistic with their modeling [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
The Bayesian-Frequentist debate reflects two different attitudes to the process of doing science, both quite legitimate. Bayesian statistics is well-suited to individual researchers, or a research group, trying to use all the information at its disposal to make the quickest possible progress. In pursuing progress, Bayesians tend to be aggressive and optimistic with their modeling assumptions. Frequentist statisticians are more cautious and defensive. One definition says that a frequentist is a Bayesian trying to do well, or at least not too badly, against any possible prior distribution. The frequentist aims for universally acceptable conclusions, ones that will stand up to adversarial scrutiny. The FDA for example doesn’t care about Pfizer’s prior opinion of how well it’s new drug will work, it wants objective proof. Pfizer, on the other hand may care very much about its own opinions in planning future drug development.<sup><a href="http://www.johnmyleswhite.com/notebook/2011/02/14/modern-science-and-the-bayesian-frequentist-controversy/#footnote_0_4202" id="identifier_0_4202" class="footnote-link footnote-identifier-link" title="
Bradley Efron : Modern Science and the Bayesian-Frequentist Controversy">1</a></sup>
</p></blockquote>
<p>To me, it&#8217;s amazing how similar the ambiguous regions of behavioral decision theory are to the major questions of theoretical statistics: people seem largely unable to systematically decide whether they want to be minimaxing (which seems very close to Efron&#8217;s vision of frequentist thought as stated here) or whether they want to be minimizing expected risk (which is closer to my own vision of Bayesian thinking). My own sense is that we learn as a global culture, over time, which error functions are least erroneous &#8212; and we do so largely by trial and error.</p>
<p>Most interesting to me is to consider individual differences in the error functions people effectively use: I suspect political preferences correlate with a propensity to focus on worst-case thinking rather than average-case thinking. Also, I&#8217;m fascinated by the way that a single person switches between worst-case and average-case thinking: I suspect there&#8217;s as much to be learned here as there was in understanding what drives risk-seeking behavior and what drives risk-averse behavior.</p>
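<p>To make the distinction between worst-case and average-case thinking concrete, here is a small Python sketch with an entirely hypothetical loss table and prior: the minimax rule picks the action whose worst-case loss is smallest, while the expected-risk rule picks the action whose prior-weighted average loss is smallest. With these made-up numbers the two rules disagree:</p>

```python
# Hypothetical losses: LOSS[action][state]. None of these numbers come
# from Efron's paper; they only illustrate the two decision rules.
LOSS = {
    "cautious": {"good": 4.0, "bad": 5.0},
    "aggressive": {"good": 1.0, "bad": 9.0},
}

# Hypothetical prior over the states of the world.
PRIOR = {"good": 0.8, "bad": 0.2}


def minimax_action(loss):
    """Pick the action with the smallest worst-case loss."""
    return min(loss, key=lambda action: max(loss[action].values()))


def expected_risk_action(loss, prior):
    """Pick the action with the smallest prior-weighted average loss."""
    def expected_loss(action):
        return sum(prior[state] * loss[action][state] for state in prior)
    return min(loss, key=expected_loss)


print(minimax_action(LOSS))
print(expected_risk_action(LOSS, PRIOR))
```

<p>The minimax rule never consults the prior at all, which is one way of reading the description of the frequentist as someone trying to do well against any possible prior distribution.</p>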
<p>HT: <a href="http://www.johndcook.com/blog/2011/02/14/the-end-of-hard-edged-science/">John D. Cook</a></p>
<ol class="footnotes"><li id="footnote_0_4202" class="footnote"><br />
Bradley Efron : <a href="http://www-stat.stanford.edu/~ckirby/brad/papers/2005NEWModernScience.pdf">Modern Science and the Bayesian-Frequentist Controversy</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/02/14/modern-science-and-the-bayesian-frequentist-controversy/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Inconsistencies in Bayesian Models of Decision-Making</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/01/20/inconsistencies-in-bayesian-models-of-decision-making/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/01/20/inconsistencies-in-bayesian-models-of-decision-making/#comments</comments>
		<pubDate>Fri, 21 Jan 2011 03:48:04 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4199</guid>
		<description><![CDATA[But modeling devices that make sense for an unbiased decisionmaker may not make sense for a biased one. For example, why would individuals have priors and posteriors if they are destined to apply Bayes&#8217; law incorrectly? A question I often ask myself. Wolfgang Pesendorfer: Behavioral Economics Comes of Age: A Review Essay on Advances [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
But modeling devices that make sense for an unbiased decisionmaker may not make sense for a biased one. For example, why would individuals have priors and posteriors if they are destined to apply Bayes&#8217; law incorrectly?<sup><a href="http://www.johnmyleswhite.com/notebook/2011/01/20/inconsistencies-in-bayesian-models-of-decision-making/#footnote_0_4199" id="identifier_0_4199" class="footnote-link footnote-identifier-link" title="Wolfgang Pesendorfer : Behavioral Economics Comes of Age: A Review Essay on Advances in Behavioral Economics">1</a></sup>
</p></blockquote>
<p>A question I often ask myself.</p>
<ol class="footnotes"><li id="footnote_0_4199" class="footnote"><a href="http://www.princeton.edu/~pesendor/">Wolfgang Pesendorfer</a> : <a href="http://www.jstor.org/stable/30032350">Behavioral Economics Comes of Age: A Review Essay on Advances in Behavioral Economics</a></li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/01/20/inconsistencies-in-bayesian-models-of-decision-making/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Academic Jargon: Field-Specific Insults</title>
		<link>http://www.johnmyleswhite.com/notebook/2010/12/12/academic-jargon-field-specific-insults/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2010/12/12/academic-jargon-field-specific-insults/#comments</comments>
		<pubDate>Sun, 12 Dec 2010 23:22:49 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4191</guid>
		<description><![CDATA[Every academic field seems to develop a set of generic insults based on their intellectual toolkit. Here are two examples I hear often: Probabilists and Statisticians: &#8220;I think that&#8217;s an interesting case, but it&#8217;s in a set with measure zero.&#8221; Economists: &#8220;X group&#8217;s behavior is clearly rent-seeking.&#8221; Do any readers have good examples from other [...]]]></description>
			<content:encoded><![CDATA[<p>Every academic field seems to develop a set of generic insults based on their intellectual toolkit. Here are two examples I hear often:</p>
<ol>
<li><b>Probabilists and Statisticians</b>: &#8220;I think that&#8217;s an interesting case, but it&#8217;s in a set with measure zero.&#8221;</li>
<li><b>Economists</b>: &#8220;X group&#8217;s behavior is clearly rent-seeking.&#8221;</li>
</ol>
<p>Do any readers have good examples from other fields?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2010/12/12/academic-jargon-field-specific-insults/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>A Draft of ProjectTemplate v0.2-1</title>
		<link>http://www.johnmyleswhite.com/notebook/2010/12/03/a-draft-of-projecttemplate-v0-2-1/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2010/12/03/a-draft-of-projecttemplate-v0-2-1/#comments</comments>
		<pubDate>Sat, 04 Dec 2010 01:31:37 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4186</guid>
		<description><![CDATA[I&#8217;ve just uploaded a new binary of ProjectTemplate to GitHub. This is a draft version of the next release, v0.2-1, which includes some fairly substantial changes and is backwards incompatible in several ways with previous versions of ProjectTemplate. Foremost of the changes is that most of the logic for load.project() is now built into the [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve just uploaded a new binary of <a href="http://cran.r-project.org/web/packages/ProjectTemplate/index.html">ProjectTemplate</a> to <a href="https://github.com/johnmyleswhite/ProjectTemplate">GitHub</a>. This is a draft version of the next release, v0.2-1, which includes some fairly substantial changes and is backwards incompatible in several ways with previous versions of ProjectTemplate.</p>
<p>Foremost of the changes is that most of the logic for <code>load.project()</code> is now built into the <code>load.project()</code> function directly, rather than spread out into autogenerated scripts that you can edit by hand. While this makes ProjectTemplate harder for non-experts to modify, the change will make it much easier to make revisions to ProjectTemplate in the future without having to worry about existing projects falling behind because of vestigial code that&#8217;s not being automatically updated when you install a new version of ProjectTemplate.</p>
<p>Because more system logic is now hardcoded into functions, each project&#8217;s configuration is handled through a YAML file in <code>config/global.yaml</code>. Incidentally, this introduces the new directory, <code>config/</code>, where configuration files will go from now on.</p>
<p>The data loading system is also more complex than it was before. First, there&#8217;s a new hierarchy of data sources: now the system will look for data in a <code>cache/</code> directory before moving on to the <code>data/</code> directory. This makes it possible for you to permanently store a processed version of your data set in <code>cache/</code>, so that you can skip loading the raw data set. This is helpful when the original data set is enormous and your future analyses only need a radically reduced form of it.</p>
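<p>The lookup order can be sketched as follows. This is an illustrative Python sketch of the idea only, not ProjectTemplate&#8217;s actual R implementation; the function name, the file name and the exact-filename matching are all assumptions made for the example:</p>

```python
import os
import tempfile


def find_data_file(project_dir, filename):
    """Return the path to `filename`, preferring cache/ over data/.

    Hypothetical helper: returns None when neither directory has the file.
    """
    for subdir in ("cache", "data"):
        candidate = os.path.join(project_dir, subdir, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Small demonstration inside a throwaway project directory.
with tempfile.TemporaryDirectory() as project:
    for subdir in ("cache", "data"):
        os.makedirs(os.path.join(project, subdir))
    # Raw data only: the lookup falls through to data/.
    open(os.path.join(project, "data", "scores.csv"), "w").close()
    print(find_data_file(project, "scores.csv"))
    # Once a cached copy exists, it takes precedence.
    open(os.path.join(project, "cache", "scores.csv"), "w").close()
    print(find_data_file(project, "scores.csv"))
```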
<p>In addition, preprocessing is now handled through a series of ordered scripts in a <code>munge/</code> directory rather than just a single preprocessing script in the <code>lib/</code> directory. There&#8217;s also a <code>log/</code> directory, used by the new integrated <a href="https://github.com/johnmyleswhite/log4r">log4r</a> support, which is off by default, but can be easily set up after installing <a href="http://cran.r-project.org/web/packages/log4r/index.html">log4r from CRAN</a>.</p>
<p>Finally, there&#8217;s a <code>src/</code> directory where we&#8217;re going to encourage users to place their primary analyses, so that the main directory always has the same files and directories across all projects.</p>
<p>In addition to all of these changes, many of which were inspired by conversations with <a href="http://mikedewar.org/">Mike Dewar</a>, I&#8217;ve incorporated some very helpful patches in this release. Specifically, <a href="http://www.diegovalle.net/">Diego Valle-Jones</a> fixed a bug in <code>clean.variable.name()</code> that led to trouble when filenames in the <code>data/</code> directory began with numbers, and <a href="http://www.patrickschalk.com/">Patrick D. Schalk</a> contributed code that adds support for <a href="http://www.sqlite.org/">SQLite</a> to ProjectTemplate along with general improvements to the database access codebase.</p>
<p>Thanks for all of the support since the last release. Please let me know if there are any changes that need to be made before I turn v0.2-1 loose on CRAN.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2010/12/03/a-draft-of-projecttemplate-v0-2-1/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>


