<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Myles White</title>
	<atom:link href="http://www.johnmyleswhite.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johnmyleswhite.com</link>
	<description>&#34;He who refuses to do arithmetic is doomed to talk nonsense.&#34;</description>
	<lastBuildDate>Mon, 14 May 2012 17:54:25 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Criticism 3 of NHST: Essential Information is Lost When Transforming 2D Data into a 1D Measure</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/05/14/criticism-3-of-nhst-essential-information-is-lost-when-transforming-2d-data-into-a-1d-measure/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/05/14/criticism-3-of-nhst-essential-information-is-lost-when-transforming-2d-data-into-a-1d-measure/#comments</comments>
		<pubDate>Mon, 14 May 2012 14:30:52 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4448</guid>
		<description><![CDATA[Introduction Continuing on with my series on the weaknesses of NHST, I&#8217;d like to focus on an issue that&#8217;s not specific to NHST, but rather one that&#8217;s relevant to all quantitative analysis: the destruction caused by an inappropriate reduction of dimensionality. In our case, we&#8217;ll be concerned with the loss of essential information caused by [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Continuing on with my series on the weaknesses of NHST, I&#8217;d like to focus on an issue that&#8217;s not specific to NHST, but rather one that&#8217;s relevant to all quantitative analysis: the destruction caused by an inappropriate reduction of dimensionality. In our case, we&#8217;ll be concerned with the loss of essential information caused by the reduction of a two-dimensional world of uncertain measurements into the one-dimensional world of p-values.</p>
<h3>p-Values Mix Up the Strength of Effects with the Precision of Their Measurement</h3>
<p>For NHST, the two independent dimensions of measurement are (1) the strength of an effect, measured using the distance of a point estimate from zero; and (2) the uncertainty we have about the effect&#8217;s true strength, measured using something like the expected variance of our measurement device. These two dimensions are reduced into a single p-value in a way that discards much of the meaning of the original data. </p>
<p>When using confidence intervals, the two dimensions I&#8217;ve described are equivalent to the position of the center of the confidence interval relative to zero and the width of the confidence interval. Clearly these two dimensions can vary independently. Working through the math, it is easy to show that p-values are simply a one-dimensional representation of these two dimensions.<sup><a href="http://www.johnmyleswhite.com/notebook/2012/05/14/criticism-3-of-nhst-essential-information-is-lost-when-transforming-2d-data-into-a-1d-measure/#footnote_0_4448" id="identifier_0_4448" class="footnote-link footnote-identifier-link" title="Indeed, p-values are effectively constructed by dividing the distance of the point estimate from zero by the width of the confidence interval and then passing this normalized distance through a non-linear function. I&amp;#8217;m particularly bewildered by the use of this non-linear function: most people have trouble interpreting numbers already, and this transformation seems almost designed to make the numbers harder to interpret.">1</a></sup></p>
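<p>As a rough illustration of the construction described in the footnote, the following Python sketch computes a two-sided p-value from a point estimate and a standard error. It assumes a normal approximation (a z-test); the function name and the three example studies are hypothetical and are not the data behind the figure below. Any studies with the same ratio of estimate to standard error receive the same p-value:</p>

```python
import math

def two_sided_p_value(estimate, standard_error):
    """Two-sided p-value for H0: effect = 0, assuming a normal approximation."""
    z = abs(estimate) / standard_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Hypothetical studies: (point estimate, standard error), all with estimate / se = 2.5.
studies = {
    "small effect, precise": (0.1, 0.04),
    "moderate effect, moderate precision": (1.0, 0.4),
    "large effect, imprecise": (10.0, 4.0),
}

p_values = {name: two_sided_p_value(est, se) for name, (est, se) in studies.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

<p>The estimates differ by a factor of one hundred, yet the p-value cannot tell the three studies apart.</p>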
<p>To illustrate how many different kinds of data sets receive the same p-value under NHST, let&#8217;s consider three very different data sets in which we test for a difference across two groups and then get the same p-value out of our analysis:</p>
<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/05/three_studies1.png" alt="three_studies.png" border="0" width="1000" height="750" /></div>
<p>Clearly these data sets are substantively different, despite producing identical p-values. Really, we&#8217;ve seen three qualitatively different types of effects under study:</p>
<ol>
<li>An effect that is probably trivial, but which has been measured with considerable precision.</li>
<li>An effect with moderate importance that has been measured moderately well.</li>
<li>An effect that could be quite important, but which has been measured fairly poorly.</li>
</ol>
<p>No one can argue that these situations are not objectively different. Importantly, I think many of us also feel that the scientific merits of these three types of research are very different: we have some use for the last two types of studies and no real use for the first. Sadly, I suspect that the scientific literature increasingly focuses on the first category, because it is always possible to measure anything precisely if you are willing to invest enough time and money. If the community&#8217;s metric for scientific quality is a p-value, which can be no more than a statement about the precision of measurements, then you will find that scientists produce precise measurements of banalities rather than tentative measurements of important effects.</p>
<h3>How Do We Solve This Problem?</h3>
<p>Unlike the problems described in previous posts, this problem with the use of NHST can be solved without any great effort to teach people to use better methods: to compute a p-value, you already need to estimate both the strength of an effect and the precision of its measurement. Moving forward, we must be certain that we report both of these quantities instead of the one-number p-value summary.</p>
<p>Sadly, people have been arguing for this change for years without much success. To break this impasse, I think we need to push our community to impose a flat-out ban on p-values: going forward, researchers should only be allowed to report confidence intervals. Given that p-values can always be derived from the more informative confidence intervals while the opposite transformation is not possible, how compelling could any argument be for continuing to tolerate p-values?</p>
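<p>This asymmetry can be sketched in Python. The helper below is hypothetical and assumes a symmetric 95% interval built from a normal approximation. Under that assumption the interval determines the p-value, while two very different intervals can share one p-value, so the p-value alone cannot recover the interval:</p>

```python
import math
from statistics import NormalDist

def p_value_from_interval(lower, upper, level=0.95):
    """Recover a two-sided p-value for H0: effect = 0 from a symmetric normal-theory interval."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% interval
    estimate = (lower + upper) / 2                    # center of the interval
    standard_error = (upper - lower) / (2 * z_crit)   # half-width divided by the critical value
    z = abs(estimate) / standard_error
    return math.erfc(z / math.sqrt(2))

# Two hypothetical intervals: a tiny, precisely measured effect and a large, noisy one.
p_precise = p_value_from_interval(0.02, 0.18)
p_noisy = p_value_from_interval(2.0, 18.0)
print(p_precise, p_noisy)
```

<p>Both calls return the same p-value, even though the intervals describe effects that differ by two orders of magnitude.</p>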
<h3>References</h3>
<p>Ziliak, S.T. and McCloskey, D.N. (2008), &#8220;The cult of statistical significance: How the standard error costs us jobs, justice, and lives&#8221;, University of Michigan Press</p>
<ol class="footnotes"><li id="footnote_0_4448" class="footnote">Indeed, p-values are effectively constructed by dividing the distance of the point estimate from zero by the width of the confidence interval and then passing this normalized distance through a non-linear function. I&#8217;m particularly bewildered by the use of this non-linear function: most people have trouble interpreting numbers already, and this transformation seems almost designed to make the numbers harder to interpret.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/05/14/criticism-3-of-nhst-essential-information-is-lost-when-transforming-2d-data-into-a-1d-measure/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Criticism 2 of NHST: NHST Conflates Rare Events with Evidence Against the Null Hypothesis</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/05/12/criticism-2-of-nhst-nhst-conflates-rare-events-with-evidence-against-the-null-hypothesis/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/05/12/criticism-2-of-nhst-nhst-conflates-rare-events-with-evidence-against-the-null-hypothesis/#comments</comments>
		<pubDate>Sat, 12 May 2012 14:40:09 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Science]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4440</guid>
		<description><![CDATA[Introduction This is my second post in a series describing the weaknesses of the NHST paradigm. In the first post, I argued that NHST is a dangerous tool for a community of researchers because p-values cannot be interpreted properly without perfect knowledge of the research practices of other scientists &#8212; knowledge that we cannot hope [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>This is my second post in a series describing the weaknesses of the NHST paradigm. <a href="http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/">In the first post</a>, I argued that NHST is a dangerous tool for a community of researchers because p-values cannot be interpreted properly without perfect knowledge of the research practices of other scientists &#8212; knowledge that we cannot hope to attain.</p>
<p>In this post, I will switch gears and focus on a weakness of NHST that afflicts individual researchers as severely as it afflicts communities of researchers. This post focuses on the concern that NHST is not robust to &#8220;absolutely rare&#8221; events, by which I mean events that have low probability under all possible hypotheses. NHST erroneously treats such &#8220;absolutely rare&#8221; events as evidence against the null hypothesis. In practice, this means that many researchers will treat such &#8220;absolutely rare&#8221; events as evidence in support of the alternative hypothesis, even when the null hypothesis assigns higher probability to the observed data than the alternative hypothesis and should therefore be considered the superior model.</p>
<p>To explain this concern using a detailed example, I will borrow an idea from Jacob Cohen. Before I do that, I need to outline a version of NHST that I believe is an accurate description of the way many practicing scientists actually use NHST.</p>
<p>In this formulation, the scientist observes an event; then posits a model, called the null hypothesis, that accounts for this and similar events using only chance mechanisms; and then estimates the probability of observing this sort of event under the null hypothesis. If the event and its kind are assigned low probability, the null hypothesis is rejected. For many scientists, this rejection is taken as evidence in favor of their preferred hypothesis, which we call the alternative hypothesis.</p>
<h3>Cohen&#8217;s Example: Americans and Members of Congress</h3>
<p>In our example, we will imagine meeting a new person named John Smith. We entertain two hypotheses about him: one, the null hypothesis, asserts that John Smith is an American. The other, the alternative hypothesis, asserts that John Smith is not an American. Because these hypotheses are mutually exclusive, it seems acceptable to say that any evidence against the null hypothesis should constitute evidence in support of the alternative hypothesis. We will demonstrate that p-values cannot constitute such evidence.</p>
<p>We do this by creating a simple null model: under the null hypothesis, the probability that John Smith is a Member of Congress is roughly <code>535 / 311,000,000</code>, which is close to two in a million. In short, John Smith being a Member of Congress is a very improbable event under the null hypothesis.</p>
<p>Under the alternative hypothesis, the probability that John Smith is a Member of Congress is <code>0</code>, because non-Americans cannot serve in Congress. This is an uncommon virtue in probabilistic models, because the existence of an event with zero probability means that the alternative hypothesis is actually falsifiable.</p>
<p>Now suppose that we do not know John Smith&#8217;s citizenship, but we do find out that he is a Member of Congress. Using NHST, we will perversely reject the null hypothesis that John is an American because Americans are very rarely Members of Congress. If we continue on from this rejection of the null to an acceptance of the alternative hypothesis, we will conclude that John Smith is not an American.</p>
<p>This is troubling, because a simple calculation using Bayes&#8217; Theorem will demonstrate that the probability that John Smith is an American given that he is a Member of Congress is <code>1</code>: we are absolutely certain that John Smith is an American. Yet NHST would have us reject this absolutely certain conclusion. For those who harbor an inveterate suspicion of Bayes&#8217; Theorem, we can note that the likelihood ratio in this example is infinite in support of the null hypothesis and that a significance test of this ratio leads us to fail to reject the null hypothesis.<sup><a href="http://www.johnmyleswhite.com/notebook/2012/05/12/criticism-2-of-nhst-nhst-conflates-rare-events-with-evidence-against-the-null-hypothesis/#footnote_0_4440" id="identifier_0_4440" class="footnote-link footnote-identifier-link" title="In passing, I note that it should trouble readers that two different types of NHST give opposite answers. This is just one example of what Bayesians would call the incoherence of NHST as a paradigm.">1</a></sup></p>
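<p>The calculation can be sketched in Python with exact fractions. The counts of Members of Congress and Americans come from the example above; the world population used for the prior is a rough assumption of mine, and the result does not depend on it as long as the prior probability of being an American is positive:</p>

```python
from fractions import Fraction

# Figures from the example; the world population is a rough assumption.
members_of_congress = 535
americans = 311_000_000
world_population = 7_000_000_000  # assumed, used only for the prior

prior_american = Fraction(americans, world_population)
prior_not_american = 1 - prior_american

# Likelihood of the observation "is a Member of Congress" under each hypothesis.
likelihood_given_american = Fraction(members_of_congress, americans)
likelihood_given_not_american = Fraction(0)

# Bayes' Theorem: P(American | Member of Congress).
evidence = (likelihood_given_american * prior_american
            + likelihood_given_not_american * prior_not_american)
posterior_american = likelihood_given_american * prior_american / evidence

print(posterior_american)  # 1
```

<p>Because the alternative hypothesis assigns zero probability to the observation, its term vanishes from the denominator and the posterior is exactly one.</p>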
<h3>How Can NHST Make Such An Obvious Mistake?</h3>
<p>For many readers, I suspect that this example will seem too outlandish to be trusted: it seems hard to believe that the basic machinery of NHST could be so fragile as to allow an example like ours to break it. Such readers are right to suspect that some trickery is required to power our thought experiment: we must assume that we have observed data that is &#8220;absolutely rare&#8221;, in the sense that the probability that a person is a Member of Congress, across both the null hypothesis and the alternative hypothesis, is only one in ten million. As such, when applied to randomly selected human beings, our hypothesis test will perform well: we will make the mistake I have highlighted only one in ten million times. As a bet, NHST is safe: indeed, there are few safer bets if we only select randomly among Americans while using a test with such a low false positive rate. But when a rare event does occur, NHST will produce erroneous results. In other words, NHST treats rare events interchangeably with evidence against the null hypothesis, even though this equivalence is not defensible when the data is viewed from another perspective. Moreover, if we begin to select randomly from the true population of the Earth, our use of a procedure with such an incredibly low false positive rate is actually a serious error, because our method has very low power to detect non-Americans &#8212; even though they are the vast majority of all human beings. For those who believe that the null hypothesis is almost always false in research studies, the use of underpowered methods is particularly troubling.</p>
<p>Sadly, unambiguous examples like this one are harder to construct using the types of data typically employed in the modern sciences, because our hypotheses and measurements are typically continuous rather than binary and are rarely absolutely falsifiable. And because we typically lack verifiable base rate information, we cannot expect to use Bayes&#8217; Theorem to calculate the relative probability of our hypotheses without subjective decisions about the a priori plausibility of our hypotheses. For some readers, this may make our example seem misleading: NHST only looks bad when clearly superior methods are available, which they seldom are.</p>
<p>But I think that this example nevertheless reveals a deep weakness with NHST: absolutely rare events will happen. A method that rejects the null hypothesis when rare events occur without even testing whether the alternative hypothesis is plausible seems very questionable to me, especially when we use a liberal threshold like p &lt; 0.05. Because events only need to be very weakly rare to merit rejection under conventional NHST standards, the sort of hidden multiple testing I discussed in the last post may ensure that rare events are reported much more frequently in the literature than our nominal false positive rate suggests. If an event only needs to be rare at the p = 0.05 level to be publishable, then one needs to conduct only about twenty studies, on average, to produce such an event. Surely people can work hard enough to produce that sort of rarity when their careers depend upon it.</p>
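<p>The arithmetic behind the twenty-studies remark can be sketched in Python, assuming that the studies are independent and that the null hypothesis is true in every one of them (both are simplifying assumptions):</p>

```python
alpha = 0.05     # conventional significance threshold
n_studies = 20

# Expected number of "significant" results among n_studies true-null studies.
expected_false_positives = alpha * n_studies

# Expected number of studies until the first false positive (geometric distribution).
expected_studies_until_first = 1 / alpha

# Probability that at least one of the n_studies studies reaches p below alpha.
prob_at_least_one = 1 - (1 - alpha) ** n_studies

print(expected_false_positives, expected_studies_until_first, round(prob_at_least_one, 3))
```

<p>Twenty studies yield one spurious result on average, and the chance of at least one is well above one half.</p>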
<h3>How Can We Do Better?</h3>
<p>What lesson should we take away from this bizarre example in which NHST leads us to reject a hypothesis that the data incontrovertibly supports? In my mind, the lesson that we should take away is that we cannot hope to learn about the relative plausibility of two hypotheses without assessing both hypotheses explicitly. We should not fault the null hypothesis for failing to predict data that the alternative hypothesis would not have predicted any better. Our interest as scientists should always lie in creating and selecting models with superior predictive power: we learn little to nothing by defeating a hypothesis that may not have been given a fair chance because of the quirky data being used to test it.</p>
<p>NHST does not satisfy the demand to evaluate both hypotheses explicitly, because it does all calculations using only the null hypothesis while entirely ignoring the alternative hypothesis. This is particularly troubling when so many practicing scientists treat a low p-value as evidence in support of the alternative hypothesis, even though this conclusion cannot generally be supported and was not part of Fisher&#8217;s original intention when introducing NHST. In short, NHST too often leads us to reject true hypotheses and too easily leads us to transform our rejection of true hypotheses into an acceptance of inferior hypotheses.</p>
<h3>References</h3>
<p>Cohen, J. (1994), &#8216;The Earth is Round (p &lt; .05)&#8217;, American Psychologist <a href="http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf">Ungated Copy</a></p>
<ol class="footnotes"><li id="footnote_0_4440" class="footnote">In passing, I note that it should trouble readers that two different types of NHST give opposite answers. This is just one example of what Bayesians would call the incoherence of NHST as a paradigm.</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/05/12/criticism-2-of-nhst-nhst-conflates-rare-events-with-evidence-against-the-null-hypothesis/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Criticism 1 of NHST: Good Tools for Individual Researchers are not Good Tools for Research Communities</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/#comments</comments>
		<pubDate>Thu, 10 May 2012 13:58:49 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Academia]]></category>
		<category><![CDATA[Economics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4437</guid>
		<description><![CDATA[Introduction Over my years as a graduate student, I have built up a long list of complaints about the use of Null Hypothesis Significance Testing (NHST) in the empirical sciences. In the next few weeks, I&#8217;m planning to publish a series of blog posts, each of which will articulate one specific weakness of NHST. The [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Over my years as a graduate student, I have built up a long list of complaints about the use of Null Hypothesis Significance Testing (NHST) in the empirical sciences. In the next few weeks, I&#8217;m planning to publish a series of blog posts, each of which will articulate one specific weakness of NHST. The weaknesses I will discuss are not novel observations about NHST: people have been complaining about the use of p-values since the 1950s. My intention is simply to gather all of the criticisms of NHST in a single place and to articulate each of the criticisms in a way that permits no confusion. I&#8217;m hoping that readers will comment on these pieces and give me enough feedback to sharpen the points into a useful resource for the community.</p>
<p>In the interest of absolute clarity, I should note at the start of this series that I am primarily unhappy with the use of p-values as (1) a threshold that scientific results are expected to pass before they are considered publishable and (2) a measure of the evidence in defense of a hypothesis. I believe that p-values cannot be used for either of these purposes, but I will concede upfront that p-values can be useful to researchers who wish to test their own private hypotheses.</p>
<p>With that limitation of scope in mind, let&#8217;s get started.</p>
<h3>Communities of Researchers Face Different Problems than Individual Researchers</h3>
<p>Many scientists who defend the use of p-values as a threshold for publication employ an argument that, in broad form, can be summarized as follows: &#8220;a community of researchers can be thought of as if it were a single decision-maker who must select a set of procedures for coping with the inherent uncertainties of empiricism &#8212; foremost of which is the risk that purely chance processes will give rise to data supporting false hypotheses. To prevent our hypothetical decision-maker from believing in every hypothesis for which there exists some supporting data, we must use significance testing to separate results that could plausibly be the product of randomness from those which provide strong evidence of some underlying regularity in Nature.&#8221;</p>
<p>While I agree with part of the argument above (p-values, when used appropriately, can help an individual researcher resist their all-too-human inclination to discover patterns in noise), I do not think that this sort of argument applies with similar force to a community of researchers. The types of information necessary for correctly interpreting p-values are always available to individual researchers acting in isolation, but are seldom available to the members of a community of researchers who learn about each other&#8217;s work from published reports. For example, the community will frequently be ignorant of the exact research procedures used by its members, even though the details of these procedures can have profound effects on the interpretation of published p-values. To illustrate this concern, let&#8217;s work through a specific hypothetical example of a reported p-value that cannot be taken at face value.</p>
<h3>The Hidden Multiple Testing Problem</h3>
<p>Imagine that Researcher A has measured twenty variables, which we will call X1 through X20. After collecting data, Researcher A attempts to predict one other variable, Y, using these twenty variables as predictors in a standard linear regression model in which Y ~ X1 + &#8230; + X20. Imagine, for the sake of argument, that Researcher A finds that X17 has a statistically significant effect on Y at p &lt; .05 and rushes to publish this result in the new hit paper: &#8220;Y Depends upon X17!&#8221; How will Researcher B, who sees only this result and no mention of the 19 variables that failed to predict Y, react?</p>
<p>If Researcher B embraces NHST as a paradigm without misgivings or suspicion, B must react to A&#8217;s findings with a credulity that could never be defended in the face of perfect information about Researcher A&#8217;s research methods. As I imagine most scientists are already aware, Researcher A&#8217;s result is statistically invalid, because the significance threshold that has been passed depended upon a set of assumptions violated by the search through twenty different variables for a predictive relationship. When you use standard NHST p-values to evaluate a hypothesis, you must acquire a new set of data and then test exactly one hypothesis on the entire data set. In our case, each of the twenty variables that was evaluated as a potential predictor of Y constitutes a separate hypothesis, so that Researcher A has not conducted one hypothesis test, but rather twenty. This is conventionally called multiple testing; in this case, the result of multiple testing is that the actual probability of at least one variable being found to predict Y due purely to luck is roughly 64% if the twenty tests are treated as independent, far above the 5% level suggested by a reported p-value of p &lt; 0.05.</p>
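<p>A small Python simulation gives a feel for how often twenty tests on pure noise produce at least one &#8220;significant&#8221; predictor. This is a sketch under simplifying assumptions: each test statistic is drawn as an independent standard normal under a true null, which is not exactly how coefficient tests behave in a single regression, and the function name, seed and number of simulated papers are arbitrary choices of mine:</p>

```python
import math
import random

def any_false_positive(n_tests, alpha, rng):
    """Simulate n_tests independent z-tests on pure noise; report whether any is 'significant'."""
    for _ in range(n_tests):
        z = rng.gauss(0.0, 1.0)                  # test statistic under a true null
        p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
        if alpha > p:
            return True
    return False

rng = random.Random(17)  # fixed seed so the sketch is reproducible
n_papers = 5000
hits = sum(any_false_positive(20, 0.05, rng) for _ in range(n_papers))
simulated_rate = hits / n_papers
exact_rate = 1 - 0.95 ** 20  # rate implied by twenty independent tests

print(f"simulated: {simulated_rate:.3f}, exact: {exact_rate:.3f}")
```

<p>The simulated share of papers with at least one spurious finding should land close to the value implied by twenty independent tests at the 5% level.</p>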
<p>What is worrisome is that this sort of multiple testing can be effortlessly hidden from Researcher B, our hypothetical reader of a scientific article. If Researcher A does not report the tests that failed, how can Researcher B know that they were conducted? Must Researcher B learn to live in fear of his fellow scientists, lest he be betrayed by their predilection to underreport their methods?</p>
<p>As I hope is clear from our example, NHST as a method depends upon a faith in the perfection of our fellow researchers that will easily fall victim to any mixture of incompetence or malice on their part. Unlike a descriptive statistic such as a mean, a p-value purports to tell us something that it cannot tell us without perfect information about the exact scientific methods used by every researcher in our community. An individual researcher will necessarily have this sort of perfect information about their own work, but a community will typically not. The imperfect information available to the community implies that reasoning about the community&#8217;s ideal standards for measuring evidence based on the ideal standards for a hypothetical individual will be systematically misleading.</p>
<p>If an individual researcher conducts multiple tests without correcting p-values for this search through hypotheses, the individual researcher will develop false hypotheses and harm only themselves. But if even one member of a community of researchers conducts multiple tests and publishes results whose interpretation cannot be sustained in the light of knowledge of the hidden tests that took place, the community as a whole will have only a permanent record of a hypothesis supported by illusory evidence. And this illusion of evidence cannot be easily discovered after the fact without investing effort into explicit replication studies. Indeed, after Researcher A dies, any evidence of their statistical errors will likely disappear, except for the puzzling persistence of a paper reporting a relationship between Y and X17 that has not been found again.</p>
<h3>Conclusion</h3>
<p>What should we take away from this example? We should acknowledge that there are deep problems with the theoretical framework used to justify NHST as a scientific institution. NHST, as it stands, is based upon an inappropriate analogy between a community of researchers and a hypothetical decision-maker who evaluates the research of a whole community using NHST. The actual community of researchers suffers from imperfect information about the research methods being used by its members. The sort of fishing through data for positive results described above may result from either statistical naivete or a genuine lack of scruples on the part of our fellow scientists, but it is almost certainly occurring. NHST is only exacerbating the problem, because there is no credible mechanism for ensuring that we know how many hypotheses have been tested before discovering a hypothesis that satisfies our community&#8217;s threshold.</p>
<p>Because the framework of NHST is not appropriate for use by a community with imperfect information, I suspect that the core objective of NHST &#8212; the prevention of false positive results &#8212; is not being achieved. At times, I even suspect that NHST has actually increased the frequency of reporting false positive results, because the universality of the procedure encourages blind searching through hypotheses for one that passes a community&#8217;s p-value threshold.</p>
<p>This is an unfortunate situation, because I am very sympathetic to those proponents of NHST who feel that it is an unambiguous, algorithmic procedure that diminishes the extent of subjective opinion in evaluating research work. While I agree that diminishing the dependence of science on subjectivity and personal opinion is always valuable, we should not, in our quest to remove subjectivity, substitute in its stead a method that depends upon an assumption of the perfect wisdom and honesty of our fellow scientists. Despite our strong desires to the contrary, human beings make mistakes. As Lincoln might have said, some researchers make mistakes all of the time and all researchers make mistakes some of the time. Because NHST is being used by a community of researchers rather than the theoretical individual for which it was designed, NHST is not robust to the imperfections of our fellow scientists.</p>
<h3>References</h3>
<p>Simmons et al. (2011), &#8216;False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant&#8217; <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704">SSRN</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/05/10/criticism-1-of-nhst-good-tools-for-individual-researchers-are-not-good-tools-for-research-communities/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>cumplyr: Extending the plyr Package to Handle Cross-Dependencies</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/05/03/cumplyr-extending-the-plyr-package-to-handle-cross-dependencies/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/05/03/cumplyr-extending-the-plyr-package-to-handle-cross-dependencies/#comments</comments>
		<pubDate>Thu, 03 May 2012 14:44:49 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4427</guid>
		<description><![CDATA[Introduction For me, Hadley Wickham&#8217;s reshape and plyr packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the Split-Apply-Combine strategy, in which you split up your data into several subsets, perform some computation [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>For me, <a href="http://had.co.nz/">Hadley Wickham</a>&#8217;s <a href="http://had.co.nz/reshape/">reshape</a> and <a href="http://plyr.had.co.nz/">plyr</a> packages are invaluable because they encapsulate omnipresent design patterns in statistical computing: reshape handles switching between the different possible representations of the same underlying data, while plyr automates what Hadley calls the <a href="http://www.jstatsoft.org/v40/i01/paper">Split-Apply-Combine strategy</a>, in which you split up your data into several subsets, perform some computation on each of these subsets and then combine the results into a new data set. Many of the computations implicit in traditional statistical theory are easily described in this fashion: for example, comparing the means of two groups is computationally equivalent to splitting a data set of individual observations up into subsets based on the group assignments, applying <code>mean</code> to those subsets and then pooling the results back together again.</p>
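<p>The group-means example can be sketched in a few lines. The sketch below is plain Python rather than R, it does not use or imitate plyr&#8217;s interface, and the observations are made up for illustration:</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical observations: (group label, measured value).
observations = [("control", 1.0), ("treatment", 3.0), ("control", 2.0), ("treatment", 5.0)]

# Split: partition the rows into disjoint subsets, one per group.
subsets = defaultdict(list)
for group, value in observations:
    subsets[group].append(value)

# Apply: compute the mean within each subset.
# Combine: collect the per-group results into one new data set.
group_means = {group: mean(values) for group, values in subsets.items()}

print(group_means)
```

<p>Each observation lands in exactly one subset, so the split here is disjoint.</p>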
<h3>The Split-Apply-Combine Strategy is Broader than plyr</h3>
<p>The only weakness of plyr, which automates so many of the computations that instantiate the Split-Apply-Combine strategy, is that plyr implements one very specific version of the Split-Apply-Combine strategy: plyr always splits your data into disjoint subsets. By disjoint, I mean that any row of the original data set can occur in only one of the subsets created by the splitting function. For computations that involve cross-dependencies between observations, this makes plyr inapplicable: cumulative quantities like running means and broadly local quantities like kernelized means cannot be computed using plyr. To highlight that concern, let&#8217;s consider three very simple data analysis problems.</p>
<h4>Computing Forward-Running Means</h4>
<p>Suppose that you have the following data set:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<p>To compute a forward-running mean, you need to split this data into three subsets:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
</table>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
</table>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<p>In each of these clearly non-disjoint subsets, you would then compute the mean of <code>Value</code> and combine the results to give:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>3</td>
<td>3</td>
</tr>
</table>
<p>This sort of computation occurs often enough in a simpler form that R provides tools like <code>cumsum</code> and <code>cumprod</code> to deal with cumulative quantities. But the splitting problem in our example is not addressed by those tools, nor by plyr, because the cumulative quantities have to be computed on subsets that are not disjoint.</p>
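<p>As a rough illustration in Python (the function name is made up for this sketch), the forward-running mean can be written as an explicit Split-Apply-Combine over overlapping subsets:</p>

```python
rows = [(1, 1), (2, 3), (3, 5)]  # (Time, Value) pairs from the example above


def forward_running_means(rows):
    """Split into overlapping subsets, apply the mean, combine the results."""
    results = []
    for t, _ in rows:
        # Split: every row whose Time is at or before t (the subsets overlap).
        subset = [value for time, value in rows if time <= t]
        # Apply and combine: one mean per subset, keyed by its Time.
        results.append((t, sum(subset) / len(subset)))
    return results


print(forward_running_means(rows))  # [(1, 1.0), (2, 2.0), (3, 3.0)]
```

<p>The row with <code>Time = 1</code> appears in all three subsets, which is exactly what a disjoint split cannot express.</p>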
<h4>Computing Backward-Running Means</h4>
<p>Consider performing the same sort of calculation as described above, but moving in the opposite direction. In that case, the three non-disjoint subsets are:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<p>And the final result is:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>3</td>
</tr>
<tr>
<td>2</td>
<td>4</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
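<p>For this particular statistic there is a shortcut, since a running mean is just a cumulative sum divided by a count. The Python sketch below (with hypothetical function names) checks that the subset-based definition and the cumulative-sum shortcut agree on the example data. The shortcut is specific to means, whereas the subset-based version works for any function of the subset.</p>

```python
from itertools import accumulate

values = [1, 3, 5]  # the Value column, ordered by Time = 1, 2, 3


def backward_means_by_subsets(values):
    """Mean of each subset containing a row and every later row."""
    return [sum(values[i:]) / len(values[i:]) for i in range(len(values))]


def backward_means_by_cumsum(values):
    """Same result from cumulative sums taken from the end of the data."""
    suffix_sums = list(accumulate(reversed(values)))[::-1]
    counts = range(len(values), 0, -1)
    return [s / c for s, c in zip(suffix_sums, counts)]


print(backward_means_by_subsets(values))  # [3.0, 4.0, 5.0]
print(backward_means_by_cumsum(values))   # [3.0, 4.0, 5.0]
```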
<h4>Computing Local Means (AKA Kernelized Means)</h4>
<p>Imagine that, instead of looking forward or backward, we only want to know something about data that is close to the current observation being examined. For example, we might want to know the mean value of each row when pooled with its immediately preceding and succeeding neighbors. This computation must create the following subsets of data:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
</table>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>5</td>
</tr>
</table>
<p>Within these non-disjoint subsets, means are computed and the result is:</p>
<table>
<tr>
<th>Time</th>
<th>Value</th>
</tr>
<tr>
<td>1</td>
<td>2</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
</tr>
</table>
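<p>A minimal Python sketch of the same local mean, assuming a window of radius 1 around each <code>Time</code> value (the function name and the <code>radius</code> parameter are made up for this illustration):</p>

```python
rows = [(1, 1), (2, 3), (3, 5)]  # (Time, Value) pairs from the example above


def local_means(rows, radius=1):
    """Mean of Value over all rows whose Time lies within radius of each row."""
    results = []
    for t, _ in rows:
        window = [value for time, value in rows if abs(time - t) <= radius]
        results.append((t, sum(window) / len(window)))
    return results


print(local_means(rows))  # [(1, 2.0), (2, 3.0), (3, 4.0)]
```

<p>At the edges the window simply contains fewer rows, which is why the first and last subsets above have two rows rather than three.</p>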
<h3>A Strategy for Handling Non-Disjoint Subsets</h3>
<p>How can we build a general purpose tool to handle these sorts of computations? One way is to rethink how plyr works and then extend it with some trivial variations on its core principles. We can envision plyr as a system that uses a splitting operation that partitions our data into subsets in which each subset satisfies a group of equality constraints: you split the data into groups in which <code>Variable 1 = Value 1 AND Variable 2 = Value 2</code>, etc. Because you consider the conjunction of several equality constraints, the resulting subsets are disjoint.</p>
<p>Seen in this fashion, there is a simple relaxation of the equality constraints that allows us to solve the three problems described a moment ago: instead of looking at the conjunction of equality constraints, we use a conjunction of inequality constraints. For the time being, I&#8217;ll describe just three instantiations of this broader strategy.</p>
<h3>Using Upper Bounds</h3>
<p>Here, we divide data into groups in which <code>Variable 1 <= Value 1 AND Variable 2 <= Value 2</code>, etc. We will also allow equality constraints, so that the operations of plyr are a strict subset of the computations in this new model. For example, we might use the constraint <code>Variable 1 = Value 1 AND Variable 2 <= Value 2</code>. If the upper bound is the <code>Time</code> variable, these constraints will allow us to compute the forward-running mean we described earlier.</p>
<h3>Using Lower Bounds</h3>
<p>Instead of using upper bounds, we can use lower bounds to divide data into groups in which <code>Variable 1 >= Value 1 AND Variable 2 >= Value 2</code>, etc. This allows us to implement the backward-running mean described earlier.</p>
<h3>Using Norm Balls</h3>
<p>Finally, we can consider a combination of upper and lower bounds. For simplicity, we'll assume that these bounds have a fixed tightness around the "center" of each subset of our split data. To articulate this tightness formally, we look at a specific hypothetical equality constraint like <code>Variable 1 = Value 1</code> and then loosen it so that <code>norm(Variable 1 - Value 1) <= r</code>. When <code>r = 0</code>, this system gives the original equality constraint. But when <code>r > 0</code>, we produce a "ball" of data around the constraint whose tightness is <code>r</code>. This lets us estimate the local means from our third example.</p>
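<p>To make the three relaxations concrete, here is a rough Python sketch of a splitter that accepts equality, lower-bound, upper-bound and norm-ball constraints. The function name <code>constrained_apply</code> and its arguments are hypothetical and are not cumplyr's API. Rows are plain dictionaries, and every distinct combination of the constrained variables is treated as the "center" of one subset.</p>

```python
def constrained_apply(rows, func, equality=(), lower=(), upper=(), norm_ball=None):
    """Apply func to one (possibly overlapping) subset of rows per center.

    Hypothetical helper for illustration only; not part of plyr or cumplyr.
    """
    norm_ball = norm_ball or {}
    keys = [*equality, *lower, *upper, *norm_ball]
    # One center per distinct combination of the constrained variables.
    centers = sorted({tuple(row[k] for k in keys) for row in rows})
    results = []
    for values in centers:
        center = dict(zip(keys, values))
        subset = [
            row for row in rows
            if all(row[k] == center[k] for k in equality)
            and all(row[k] >= center[k] for k in lower)
            and all(row[k] <= center[k] for k in upper)
            and all(abs(row[k] - center[k]) <= r for k, r in norm_ball.items())
        ]
        results.append((values, func(subset)))
    return results


def mean_value(subset):
    return sum(row["Value"] for row in subset) / len(subset)


data = [{"Time": t, "Value": v} for t, v in zip(range(1, 6), range(1, 10, 2))]

forward = constrained_apply(data, mean_value, upper=["Time"])
backward = constrained_apply(data, mean_value, lower=["Time"])
local = constrained_apply(data, mean_value, norm_ball={"Time": 1})
print(forward)
print(backward)
print(local)
```

<p>With only equality constraints the subsets are disjoint, as in plyr. This naive version rescans every row for every center, so its cost grows with the product of the two.</p>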
<h3>Implementation</h3>
<p>To demo these ideas in a usable fashion, I've created a draft package for R called <a href="http://bit.ly/IEPGnW"><code>cumplyr</code></a>. Here is an extended example of its usage in solving simple variants of the problems described in this post:</p>

<div class="wp_codebox"><table><tr id="p44272"><td class="code" id="p4427code2"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'cumplyr'</span><span style="color: #009900;">&#41;</span>
&nbsp;
data <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>Time <span style="color: #339933;">=</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span><span style="color: #0000dd;">5</span><span style="color: #339933;">,</span> Value <span style="color: #339933;">=</span> seq<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">9</span><span style="color: #339933;">,</span> by <span style="color: #339933;">=</span> <span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
iddply<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span>
       equality.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Time'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       lower.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       upper.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       norm.<span style="color: #202020;">ball</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> list<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       func <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>df<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>with<span style="color: #009900;">&#40;</span>df<span style="color: #339933;">,</span> mean<span style="color: #009900;">&#40;</span>Value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
&nbsp;
iddply<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span>
       equality.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       lower.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Time'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       upper.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       norm.<span style="color: #202020;">ball</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> list<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       func <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>df<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>with<span style="color: #009900;">&#40;</span>df<span style="color: #339933;">,</span> mean<span style="color: #009900;">&#40;</span>Value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
&nbsp;
iddply<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span>
       equality.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       lower.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       upper.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Time'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       norm.<span style="color: #202020;">ball</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> list<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       func <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>df<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>with<span style="color: #009900;">&#40;</span>df<span style="color: #339933;">,</span> mean<span style="color: #009900;">&#40;</span>Value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
&nbsp;
iddply<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span>
       equality.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       lower.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       upper.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       norm.<span style="color: #202020;">ball</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> list<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Time'</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       func <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>df<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>with<span style="color: #009900;">&#40;</span>df<span style="color: #339933;">,</span> mean<span style="color: #009900;">&#40;</span>Value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
&nbsp;
iddply<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span>
       equality.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       lower.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       upper.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       norm.<span style="color: #202020;">ball</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> list<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Time'</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       func <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>df<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>with<span style="color: #009900;">&#40;</span>df<span style="color: #339933;">,</span> mean<span style="color: #009900;">&#40;</span>Value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
&nbsp;
iddply<span style="color: #009900;">&#40;</span>data<span style="color: #339933;">,</span>
       equality.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       lower.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       upper.<span style="color: #202020;">bound</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       norm.<span style="color: #202020;">ball</span>.<span style="color: #202020;">variables</span> <span style="color: #339933;">=</span> list<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'Time'</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">5</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
       func <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>df<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>with<span style="color: #009900;">&#40;</span>df<span style="color: #339933;">,</span> mean<span style="color: #009900;">&#40;</span>Value<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>You can download this package from <a href="http://bit.ly/IEPGnW">GitHub</a> and play with it to see whether it helps you. Please submit feedback using GitHub if you have any comments, complaints or patches.</p>
<h3>Comparing plyr with cumplyr</h3>
<p>In the long run, I'm hoping to make the functions in <a href="http://bit.ly/IEPGnW">cumplyr</a> robust enough to submit a patch to plyr. I see these tools as one logical extension of plyr to encompass more of the framework described in Hadley's paper on the Split-Apply-Combine strategy.</p>
<p>For the time being, I would advise any users of <a href="http://bit.ly/IEPGnW">cumplyr</a> to make sure that you do not use cumplyr for anything that plyr could already do. cumplyr is very much demo software and I am certain that both its API and implementation will change. In contrast, plyr is fast and stable software that can be trusted to perform its job.</p>
<p>But, if you have a problem that cumplyr will solve and plyr will not, I hope you'll try cumplyr out and submit patches when it breaks.</p>
<p>Happy hacking!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/05/03/cumplyr-extending-the-plyr-package-to-handle-cross-dependencies/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Implementing the Exact Binomial Test in Julia</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/04/14/implementing-the-exact-binomial-test-in-julia/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/04/14/implementing-the-exact-binomial-test-in-julia/#comments</comments>
		<pubDate>Sat, 14 Apr 2012 15:19:24 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4390</guid>
		<description><![CDATA[One major benefit of spending my time recently adding statistical functionality to Julia is that I&#8217;ve learned a lot about the inner guts of algorithmic null hypothesis significance testing. Implementing Welch&#8217;s two-sample t-test last week was a trivial task because of the symmetry of the null hypothesis, but implementing the exact binomial test has proven [...]]]></description>
			<content:encoded><![CDATA[<p>One major benefit of spending my time recently adding <a href="https://github.com/johnmyleswhite/stats.jl">statistical functionality to Julia</a> is that I&#8217;ve learned a lot about the inner guts of algorithmic null hypothesis significance testing.</p>
<p>Implementing <a href="https://github.com/johnmyleswhite/stats.jl/blob/master/src/t_test.jl">Welch&#8217;s two-sample t-test</a> last week was a trivial task because of the symmetry of the null hypothesis, but implementing the exact binomial test has proven to be more challenging because the asymmetry of a skewed null defined over a bounded set means that one has to think a bit more carefully about what is being computed.</p>
<p>To see why, let&#8217;s first recap the logic of the standard two-sided hypothesis test. In all NHST situations, you assign a p-value to the observed data by working under the null hypothesis and using this assumption to calculate the probability of observing data sets that are as extreme or more extreme than the observed data.</p>
<p>For the normal, this calculation is easy: you find the probability of seeing a z-score at least as large as the absolute value of the observed data&#8217;s z-score and then you double this probability to account for the opposite tail, in which the hypothetical z-score lies at least as far below zero.</p>
<p>But for the binomial, the right quantity to use as a definition of extremity is less obvious (at least to me). Suppose that you&#8217;ve seen <code>x</code> successes after <code>n</code> samples from a Bernoulli variable with probability <code>p</code> of success.</p>
<p>You might try defining extremity by saying that a hypothetical data set <code>y</code> is at least as extreme as <code>x</code> if <code>abs(y - n * p) >= abs(x - n * p)</code>, where <code>n * p</code> is the expected count: in short, you could use the count space to assess extremity.</p>
<p>This approach will not work. Consider the case in which <code>x = 4</code>, <code>n = 10</code> and <code>p = 0.2</code>, so that the expected count is <code>2</code>. Under this definition <code>y = 0</code> would be as extreme as <code>x = 4</code>, but <code>p(y = 0) > p(y = 4)</code>, so <code>0</code> should not be considered as extreme as <code>4</code>. You need to use probability space and not count space to assess extremity.</p>
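<p>Here is a rough Python check of that example, assuming that distance in count space means distance from the expected count <code>n * p</code>. The helper <code>binom_pmf</code> is written out by hand for this sketch, and <code>math.comb</code> needs Python 3.8 or later.</p>

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p, x = 10, 0.2, 4
expected = n * p

for y in (0, 4):
    # Print the count, its distance from the expected count, and its probability.
    print(y, abs(y - expected), round(binom_pmf(y, n, p), 4))
```

<p>Both counts sit the same distance from the expected count, yet their probabilities are roughly 0.107 and 0.088, so count space and probability space disagree about which outcome is more extreme.</p>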
<p>This logic leads to the conclusion that the proper definition is one in which <code>y</code> is as extreme or more extreme than <code>x</code> if <code>p(y, n, p) &lt;= p(x, n, p)</code>. This is the correct definition for the exact binomial test. Implementing it leads to this piece of code in Julia for computing p-values for the binomial test:</p>

<div class="wp_codebox"><table><tr id="p43906"><td class="code" id="p4390code6"><pre class="python" style="font-family:monospace;">load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;extras/Rmath.jl&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
function binom_p_value<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>
  <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span><span style="color: #008000;">filter</span><span style="color: black;">&#40;</span>d -<span style="color: #66cc66;">&gt;</span> d <span style="color: #66cc66;">&lt;</span>= dbinom<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>,
             <span style="color: #008000;">map</span><span style="color: black;">&#40;</span>i -<span style="color: #66cc66;">&gt;</span> dbinom<span style="color: black;">&#40;</span>i, n, p<span style="color: black;">&#41;</span>,
                 <span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span>:n<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
end
&nbsp;
binom_p_value<span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">10</span>, <span style="color: #ff4500;">0.8</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>
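<p>For readers without an old Julia build at hand, the same computation can be sketched in plain Python. This is a translation under my own assumptions, not the code from stats.jl: <code>binom_pmf</code> stands in for <code>dbinom</code> and is written with <code>math.comb</code> (Python 3.8 or later).</p>

```python
from math import comb


def binom_pmf(k, n, p):
    """Stand-in for dbinom: probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_p_value(x, n, p):
    """Sum the probabilities of all counts no more probable than x."""
    observed = binom_pmf(x, n, p)
    return sum(d for d in (binom_pmf(i, n, p) for i in range(n + 1)) if d <= observed)


print(binom_p_value(2, 10, 0.8))
```

<p>The comparison <code>d <= observed</code> is exact, so two counts that tie mathematically can be separated by rounding. A small relative tolerance on the threshold is one way to guard against that.</p>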

<p>As far as I know, this procedure may be as efficient as possible, but it seems odd to me that we should need to assess the PDF at <code>n + 1</code> numbers when, in principle, we should only need to assess the CDF of the binomial distribution at two points to find a p-value.</p>
<p>For that reason, you might hope to replace your loop over <code>n + 1</code> numbers with one that, for extreme data sets, is more efficient because it first estimates lower and upper bounds on the count values whose PDF values are no higher than that of the observed data. For the moment, I&#8217;m experimenting with doing this as follows:</p>

<div class="wp_codebox"><table><tr id="p43907"><td class="code" id="p4390code7"><pre class="python" style="font-family:monospace;">load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;extras/Rmath.jl&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
function binom_p_value<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>
  lower_bound = floor<span style="color: black;">&#40;</span>n <span style="color: #66cc66;">*</span> p - <span style="color: #008000;">abs</span><span style="color: black;">&#40;</span>n <span style="color: #66cc66;">*</span> p - x<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
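  upper_bound = <span style="color: #008000;">max</span><span style="color: black;">&#40;</span>ceil<span style="color: black;">&#40;</span>n <span style="color: #66cc66;">*</span> p + <span style="color: #008000;">abs</span><span style="color: black;">&#40;</span>n <span style="color: #66cc66;">*</span> p - x<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>, lower_bound + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span> <span style="color: #808080; font-style: italic;"># keep the two ranges disjoint when x equals n * p</span>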
&nbsp;
  <span style="color: #008000;">sum</span><span style="color: black;">&#40;</span><span style="color: #008000;">filter</span><span style="color: black;">&#40;</span>d -<span style="color: #66cc66;">&gt;</span> d <span style="color: #66cc66;">&lt;</span>= dbinom<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>,
             <span style="color: #008000;">map</span><span style="color: black;">&#40;</span>i -<span style="color: #66cc66;">&gt;</span> dbinom<span style="color: black;">&#40;</span>i, n, p<span style="color: black;">&#41;</span>,
                 vcat<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span>:lower_bound<span style="color: black;">&#93;</span>, <span style="color: black;">&#91;</span>upper_bound:n<span style="color: black;">&#93;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
end
&nbsp;
binom_p_value<span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">10</span>, <span style="color: #ff4500;">0.8</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>Unfortunately, I haven&#8217;t yet done the analytic work to demonstrate that these bounds are actually correct. (One of them must be, since one of them is <code>x</code> and the monotonicity of the probability function on either side of <code>n * p</code> guarantees that <code>x</code> must be either a lower or an upper bound for itself.) Of course, if these bounds are sufficiently conservative, they&#8217;ll function to save computation without any risk of giving corrupt answers &#8212; even if they&#8217;re not the tightest possible bounds.</p>
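<p>Whether the bounds are conservative can be probed by brute force. The rough Python sketch below (the helper names are made up for this check) lists any counts that the two heuristic ranges skip even though their probability is no higher than that of the observed count. For <code>n = 100</code>, <code>p = 0.1</code> and <code>x = 20</code> it reports the count <code>1</code>, which suggests that the symmetric bounds are not conservative in general when the distribution is skewed.</p>

```python
from math import ceil, comb, floor


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def kept_counts(x, n, p):
    """Counts covered by the heuristic ranges 0..lower_bound and upper_bound..n."""
    spread = abs(n * p - x)
    lower_bound = floor(n * p - spread)
    upper_bound = ceil(n * p + spread)
    return {i for i in range(n + 1) if i <= lower_bound or i >= upper_bound}


def missed_counts(x, n, p):
    """Counts outside the heuristic ranges that are no more probable than x."""
    observed = binom_pmf(x, n, p)
    kept = kept_counts(x, n, p)
    return [i for i in range(n + 1)
            if i not in kept and binom_pmf(i, n, p) <= observed]


print(missed_counts(20, 100, 0.1))  # [1]
print(missed_counts(2, 10, 0.8))    # []
```

<p>In the first case the ranges cover only <code>0</code> and <code>20</code> through <code>100</code>, but one success is also less probable than twenty successes, so the restricted sum would understate the p-value.</p>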
<p>Note that, in principle, it should be possible to go further: if you know exact bounds, then the summing and filtering operations are entirely superfluous and we can run:</p>

<div class="wp_codebox"><table><tr id="p43908"><td class="code" id="p4390code8"><pre class="python" style="font-family:monospace;">load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;extras/Rmath.jl&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
function binom_p_value<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>
  lower_bound = exact_lower_bound<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>
  upper_bound = exact_upper_bound<span style="color: black;">&#40;</span>x, n, p<span style="color: black;">&#41;</span>
&nbsp;
  pbinom<span style="color: black;">&#40;</span>lower_bound, n, p<span style="color: black;">&#41;</span> + <span style="color: #ff4500;">1</span> - pbinom<span style="color: black;">&#40;</span>upper_bound - <span style="color: #ff4500;">1</span>, n, p<span style="color: black;">&#41;</span>
end
&nbsp;
binom_p_value<span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">10</span>, <span style="color: #ff4500;">0.8</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<p>I don&#8217;t know whether these bounds can be computed exactly without a lot of work, but if they can be, they give a much more efficient implementation of the exact binomial test.</p>
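<p>One possible approach, offered as a sketch and not as a settled answer: because the binomial probabilities rise up to the mode and fall after it, each bound can be located by a binary search on one side of the mode, which needs on the order of <code>log(n)</code> probability evaluations. In the Python below, <code>exact_bounds</code>, <code>REL_TOL</code> and the hand-written <code>binom_cdf</code> are assumptions of this sketch. The CDF here is a plain sum, so a real gain in speed would need a proper <code>pbinom</code>-style routine.</p>

```python
from math import comb

REL_TOL = 1e-7  # assumed tolerance so that rounding does not break exact ties


def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(Y <= k) by direct summation; returns 0 when k is negative."""
    return sum(binom_pmf(i, n, p) for i in range(min(k, n) + 1))


def exact_bounds(x, n, p):
    """Return (lower, upper): counts 0..lower and upper..n are as extreme as x."""
    threshold = binom_pmf(x, n, p) * (1 + REL_TOL)
    mode = min(n, int((n + 1) * p))

    # Largest k at or below the mode with pmf(k) <= threshold, or -1 if none.
    lo, hi = -1, mode + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if binom_pmf(mid, n, p) <= threshold:
            lo = mid
        else:
            hi = mid
    lower = lo

    # Smallest k at or above the mode with pmf(k) <= threshold, or n + 1 if none.
    lo, hi = mode - 1, n + 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if binom_pmf(mid, n, p) <= threshold:
            hi = mid
        else:
            lo = mid
    upper = max(hi, lower + 1)  # never count the same value in both tails
    return lower, upper


def binom_p_value_tails(x, n, p):
    lower, upper = exact_bounds(x, n, p)
    return binom_cdf(lower, n, p) + 1 - binom_cdf(upper - 1, n, p)


def binom_p_value_filter(x, n, p):
    """Reference version: filter all n + 1 probabilities, as in the first listing."""
    threshold = binom_pmf(x, n, p) * (1 + REL_TOL)
    return sum(d for d in (binom_pmf(i, n, p) for i in range(n + 1)) if d <= threshold)


print(exact_bounds(2, 10, 0.8))
print(binom_p_value_tails(2, 10, 0.8), binom_p_value_filter(2, 10, 0.8))
```

<p>The two functions should agree up to rounding. The binary search relies on the probabilities being monotone on each side of the mode, which is the same property the argument above depends on.</p>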
<p>I wrote all of this up because (a) I&#8217;d appreciate knowing if the exact bounds can be computed efficiently and (b) I thought it was a very nice example of (1) thinking through the logic of hypothesis testing in detail (including considerations of what extremity really means) and (2) the constant problem that mathematically equivalent definitions suggest algorithms with very different computational costs.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/04/14/implementing-the-exact-binomial-test-in-julia/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Floating Point Arithmetic and The Descent into Madness</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/04/13/floating-point-arithmetic-and-the-descent-into-madness/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/04/13/floating-point-arithmetic-and-the-descent-into-madness/#comments</comments>
		<pubDate>Sat, 14 Apr 2012 02:36:13 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4387</guid>
		<description><![CDATA[While I should confess upfront that I&#8217;ve always had a weaker command of the details of floating point arithmetic than I feel I ought to have, this sort of thing still blows my mind when I stumble upon it. These moments invariably make me realize that floating point math will simply never satisfy my naive [...]]]></description>
			<content:encoded><![CDATA[<p>While I should confess upfront that I&#8217;ve always had a weaker command of the details of floating point arithmetic than I feel I ought to have, this sort of thing still blows my mind when I stumble upon it. These moments invariably make me realize that floating point math will simply never satisfy my naive hopes as a mathematician:</p>

<div class="wp_codebox"><table><tr id="p438710"><td class="code" id="p4387code10"><pre class="c" style="font-family:monospace;"><span style="color:#800080;">0.1</span> <span style="color: #339933;">+</span> <span style="color:#800080;">0.1</span> <span style="color: #339933;">==</span> <span style="color:#800080;">0.2</span> <span style="color: #339933;"># True</span>
<span style="color:#800080;">0.1</span> <span style="color: #339933;">+</span> <span style="color:#800080;">0.1</span> <span style="color: #339933;">+</span> <span style="color:#800080;">0.1</span> <span style="color: #339933;">==</span> <span style="color:#800080;">0.3</span> <span style="color: #339933;"># False</span>
<span style="color:#800080;">0.1</span> <span style="color: #339933;">+</span> <span style="color:#800080;">0.1</span> <span style="color: #339933;">+</span> <span style="color:#800080;">0.1</span> <span style="color: #339933;">+</span> <span style="color:#800080;">0.1</span> <span style="color: #339933;">==</span> <span style="color:#800080;">0.4</span> <span style="color: #339933;"># True</span></pre></td></tr></table></div>

<p>On my Intel Core 2 Duo machine running OS X, those statements have the indicated truth values in all three of Julia, R and Python.</p>
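<p>A minimal Python sketch of where the surprise comes from, offered as an illustration rather than a full account of floating point rounding. It assumes standard IEEE 754 doubles: <code>Decimal</code> prints the exact binary value a double stores, and <code>math.isclose</code> is one common way to compare with a tolerance instead of <code>==</code>. The variable names are illustrative only.</p>

```python
from decimal import Decimal
import math

# Decimal(float) converts the stored double exactly, with no rounding,
# so it reveals that the literal 0.1 is not exactly one tenth.
exact_tenth = Decimal(0.1)
print("0.1 is stored as", exact_tenth)

# Each addition rounds its result to the nearest double, so a running sum
# may or may not land on the same double as the decimal literal.
two_ok = (0.1 + 0.1 == 0.2)
three_ok = (0.1 + 0.1 + 0.1 == 0.3)
four_ok = (0.1 + 0.1 + 0.1 + 0.1 == 0.4)
print(two_ok, three_ok, four_ok)

# The two sides of the failing comparison are neighbouring doubles.
print("sum of three:", Decimal(0.1 + 0.1 + 0.1))
print("literal 0.3: ", Decimal(0.3))

# Comparing with a relative tolerance sidesteps the one-bit difference.
close_enough = math.isclose(0.1 + 0.1 + 0.1, 0.3)
print(close_enough)
```

<p>Whether a tolerance is appropriate depends on the computation; the point here is only that exact equality tests the stored bits, not the decimal numbers we had in mind.</p>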
<p>Consider this evidence for the truth of the combined propositions, &#8220;God created the integers. All else is the work of man,&#8221; and &#8220;Out of the crooked timber of humanity no straight thing was ever made.&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/04/13/floating-point-arithmetic-and-the-descent-into-madness/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Comparing Julia and R&#8217;s Vocabularies</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/#comments</comments>
		<pubDate>Mon, 09 Apr 2012 14:00:19 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4369</guid>
		<description><![CDATA[While exploring the Julia manual recently, I realized that it might be helpful to put the basic vocabularies of Julia and R side-by-side for easy comparison. So I took Hadley Wickham&#8217;s R Vocabulary section from the book he&#8217;s putting together on the devtools wiki, put all of the functions Hadley listed into a CSV file, [...]]]></description>
			<content:encoded><![CDATA[<p>While exploring the <a href="http://julialang.org/manual">Julia manual</a> recently, I realized that it might be helpful to put the basic vocabularies of Julia and R side-by-side for easy comparison. So I took <a href="http://had.co.nz/">Hadley Wickham&#8217;s</a> <a href="https://github.com/hadley/devtools/wiki/vocabulary">R Vocabulary</a> section from the book he&#8217;s putting together on the <a href="https://github.com/hadley/devtools/wiki">devtools wiki</a>, put all of the functions Hadley listed into a CSV file, and proceeded to fill in entries where I knew of an obvious Julia equivalent to an R function.</p>
<p>The results are on <a href="https://github.com/johnmyleswhite/JuliaVsR/blob/master/vocab.csv">GitHub</a> and, as they stand today, are shown below:</p>
<table summary="Julia and R's Vocabularies">
<tr>
<th>
			R
		</th>
<th>
			Julia
		</th>
<th>
			Category
		</th>
<th>
			Subcategory
		</th>
</tr>
<tr>
<td>
<p>https://github.com/hadley/devtools/wiki/vocabulary</p>
</td>
<td>
<p>http://julialang.org/manual/standard-library-reference/</p>
</td>
<td>
			Resources
		</td>
<td>
			Vocabulary
		</td>
</tr>
<tr>
<td>
			?
		</td>
<td>
			help
		</td>
<td>
			Basics
		</td>
<td>
			First Functions
		</td>
</tr>
<tr>
<td>
			str
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			First Functions
		</td>
</tr>
<tr>
<td>
			%in%
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			match
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			=
		</td>
<td>
			=
		</td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			&lt;-
		</td>
<td>
			=
		</td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			&lt;&lt;-
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			assign
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			$
		</td>
<td>
			[]
		</td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			[]
		</td>
<td>
			[]
		</td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			[[]]
		</td>
<td>
			[]
		</td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			replace
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			head
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			tail
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			subset
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			with
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			within
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Operators
		</td>
</tr>
<tr>
<td>
			all.equal
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			identical
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			!=
		</td>
<td>
			!=
		</td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			==
		</td>
<td>
			==
		</td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			&gt;
		</td>
<td>
			&gt;
		</td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			&gt;=
		</td>
<td>
			&gt;=
		</td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			&lt;
		</td>
<td>
			&lt;
		</td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			&lt;=
		</td>
<td>
			&lt;=
		</td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			is.na
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			is.nan
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			is.finite
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			complete.cases
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Comparison
		</td>
</tr>
<tr>
<td>
			*
		</td>
<td>
			*
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			+
		</td>
<td>
			+
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			-
		</td>
<td>
			-
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			/
		</td>
<td>
			/
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			^
		</td>
<td>
			^
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			%%
		</td>
<td>
			mod (%%)
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			%/%
		</td>
<td>
			div
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			abs
		</td>
<td>
			abs
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			sign
		</td>
<td>
			sign
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			acos
		</td>
<td>
			acos
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			acosh
		</td>
<td>
			acosh
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			asin
		</td>
<td>
			asin
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			asinh
		</td>
<td>
			asinh
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			atan
		</td>
<td>
			atan
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			atan2
		</td>
<td>
			atan2
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			atanh
		</td>
<td>
			atanh
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			sin
		</td>
<td>
			sin
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			sinh
		</td>
<td>
			sinh
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cos
		</td>
<td>
			cos
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cosh
		</td>
<td>
			cosh
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			tan
		</td>
<td>
			tan
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			tanh
		</td>
<td>
			tanh
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			ceiling
		</td>
<td>
			ceil
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			floor
		</td>
<td>
			floor
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			round
		</td>
<td>
			round
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			trunc
		</td>
<td>
			trunc
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			signif
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			exp
		</td>
<td>
			exp
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			log
		</td>
<td>
			log
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			log10
		</td>
<td>
			log10
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			log1p
		</td>
<td>
			log1p
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			log2
		</td>
<td>
			log2
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			logb
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			sqrt
		</td>
<td>
			sqrt
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cummax
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cummin
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cumprod
		</td>
<td>
			cumprod
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cumsum
		</td>
<td>
			cumsum
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			diff
		</td>
<td>
			diff
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			max
		</td>
<td>
			max
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			min
		</td>
<td>
			min
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			prod
		</td>
<td>
			prod
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			sum
		</td>
<td>
			sum
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			range
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			mean
		</td>
<td>
			mean
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			median
		</td>
<td>
			median
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cor
		</td>
<td>
			cor_pearson
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			cov
		</td>
<td>
			cov_pearson
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			sd
		</td>
<td>
			std
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			var
		</td>
<td>
			var
		</td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			pmax
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			pmin
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			rle
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Basic Math
		</td>
</tr>
<tr>
<td>
			function
		</td>
<td>
			function
		</td>
<td>
			Basics
		</td>
<td>
			Functions
		</td>
</tr>
<tr>
<td>
			missing
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Functions
		</td>
</tr>
<tr>
<td>
			on.exit
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Functions
		</td>
</tr>
<tr>
<td>
			return
		</td>
<td>
			return
		</td>
<td>
			Basics
		</td>
<td>
			Functions
		</td>
</tr>
<tr>
<td>
			invisible
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Functions
		</td>
</tr>
<tr>
<td>
			&amp;
		</td>
<td>
			&amp;
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			|
		</td>
<td>
			|
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			!
		</td>
<td>
			!
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			xor
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			all
		</td>
<td>
			all
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			any
		</td>
<td>
			any
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			intersect
		</td>
<td>
			intersect
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			union
		</td>
<td>
			union
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			setdiff
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			setequal
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			which
		</td>
<td>
			find
		</td>
<td>
			Basics
		</td>
<td>
			Logical &amp; Set Operations
		</td>
</tr>
<tr>
<td>
			c
		</td>
<td>
			[] ({})
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			matrix
		</td>
<td>
			[] ({})
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			length
		</td>
<td>
			size (length)
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			dim
		</td>
<td>
			size
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			ncol
		</td>
<td>
			size(x, 2)
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			nrow
		</td>
<td>
			size(x, 1)
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			cbind
		</td>
<td>
			hcat
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			rbind
		</td>
<td>
			vcat
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			names
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			colnames
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			rownames
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			t
		</td>
<td>
			'
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			diag
		</td>
<td>
			eye
		</td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			sweep
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			as.matrix
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			data.matrix
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Vectors and Matrices
		</td>
</tr>
<tr>
<td>
			c
		</td>
<td>
			[] ({})
		</td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			rep
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			seq
		</td>
<td>
			[from:by:to]
		</td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			seq_along
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			seq_len
		</td>
<td>
			[1:len]
		</td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			rev
		</td>
<td>
			reverse
		</td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			sample
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			choose
		</td>
<td>
			binomial
		</td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			factorial
		</td>
<td>
			factorial
		</td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			combn
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			(is/as).(character/numeric/logical)
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Making Vectors
		</td>
</tr>
<tr>
<td>
			list
		</td>
<td>
			HashTable ([])
		</td>
<td>
			Basics
		</td>
<td>
			Lists &amp; Data Frames
		</td>
</tr>
<tr>
<td>
			unlist
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Lists &amp; Data Frames
		</td>
</tr>
<tr>
<td>
			data.frame
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Lists &amp; Data Frames
		</td>
</tr>
<tr>
<td>
			as.data.frame
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Lists &amp; Data Frames
		</td>
</tr>
<tr>
<td>
			split
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Lists &amp; Data Frames
		</td>
</tr>
<tr>
<td>
			expand.grid
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Lists &amp; Data Frames
		</td>
</tr>
<tr>
<td>
			if
		</td>
<td>
			if
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			&amp;&amp;
		</td>
<td>
			&amp;&amp;
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			||
		</td>
<td>
			||
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			for
		</td>
<td>
			for
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			while
		</td>
<td>
			while
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			next
		</td>
<td>
			continue
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			break
		</td>
<td>
			break
		</td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			switch
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			ifelse
		</td>
<td></td>
<td>
			Basics
		</td>
<td>
			Control Flow
		</td>
</tr>
<tr>
<td>
			fitted
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			predict
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			resid
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			rstandard
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			lm
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			glm
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			hat
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			influence.measures
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			logLik
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			df
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			deviance
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			formula
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			~
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			I
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			anova
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			coef
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			confint
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			vcov
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			contrasts
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Linear Models
		</td>
</tr>
<tr>
<td>
			apropos('\\.test$')
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Miscellaneous Statistical Tests
		</td>
</tr>
<tr>
<td>
			beta
		</td>
<td>
			beta
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			binom
		</td>
<td>
			binom
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			cauchy
		</td>
<td>
			cauchy
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			chisq
		</td>
<td>
			chisq
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			exp
		</td>
<td>
			exp
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			f
		</td>
<td>
			f
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			gamma
		</td>
<td>
			gamma
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			geom
		</td>
<td>
			geom
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			hyper
		</td>
<td>
			hyper
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			lnorm
		</td>
<td>
			lnorm
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			logis
		</td>
<td>
			logis
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			multinom
		</td>
<td>
			multinom
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			nbinom
		</td>
<td>
			nbinom
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			norm
		</td>
<td>
			norm
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			pois
		</td>
<td>
			pois
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			signrank
		</td>
<td>
			signrank
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			t
		</td>
<td>
			t
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			unif
		</td>
<td>
			unif (rand)
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			weibull
		</td>
<td>
			weibull
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			wilcox
		</td>
<td>
			wilcox
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			birthday
		</td>
<td>
			birthday
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			tukey
		</td>
<td>
			tukey
		</td>
<td>
			Statistics
		</td>
<td>
			Random Numbers
		</td>
</tr>
<tr>
<td>
			crossprod
		</td>
<td>
			*
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			tcrossprod
		</td>
<td>
			*
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			eigen
		</td>
<td>
			eig
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			qr
		</td>
<td>
			qr
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			svd
		</td>
<td>
			svd
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			%*%
		</td>
<td>
			*
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			%o%
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			outer
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			rcond
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			solve
		</td>
<td>
			\
		</td>
<td>
			Statistics
		</td>
<td>
			Matrix Algebra
		</td>
</tr>
<tr>
<td>
			duplicated
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			unique
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			merge
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			order
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			rank
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			quantile
		</td>
<td>
			quantile
		</td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			sort
		</td>
<td>
			sort
		</td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			table
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			ftable
		</td>
<td></td>
<td>
			Statistics
		</td>
<td>
			Ordering and Tabulating
		</td>
</tr>
<tr>
<td>
			ls
		</td>
<td>
			whos
		</td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			exists
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			get
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			rm
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			getwd
		</td>
<td>
			getcwd
		</td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			setwd
		</td>
<td>
			setcwd
		</td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			q
		</td>
<td>
			Ctrl-D
		</td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			source
		</td>
<td>
			load
		</td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			install.packages
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			library
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			require
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Workspace
		</td>
</tr>
<tr>
<td>
			help
		</td>
<td>
			help
		</td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			?
		</td>
<td>
			help
		</td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			help.search
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			apropos
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			RSiteSearch
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			citation
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			demo
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			example
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			vignette
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Help
		</td>
</tr>
<tr>
<td>
			traceback
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			browser
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			recover
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			options(error =)
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			stop
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			warning
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			message
		</td>
<td></td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			tryCatch
		</td>
<td>
			try/catch
		</td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			try
		</td>
<td>
			try
		</td>
<td>
			Working with R
		</td>
<td>
			Debugging
		</td>
</tr>
<tr>
<td>
			print
		</td>
<td>
			print (println)
		</td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			cat
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			message
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			warning
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			dput
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			format
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			sink
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Output
		</td>
</tr>
<tr>
<td>
			data
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			count.fields
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			read.csv
		</td>
<td>
			csvread
		</td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			read.delim
		</td>
<td>
			dlmread
		</td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			read.fwf
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			read.table
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			library(foreign)
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			write.table
		</td>
<td>
			dlmwrite
		</td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			readLines
		</td>
<td>
			readlines
		</td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			writeLines
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			load
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			save
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			readRDS
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			saveRDS
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Reading and Writing Data
		</td>
</tr>
<tr>
<td>
			dir
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			basename
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			dirname
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.path
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			path.expand
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.choose
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.copy
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.create
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.remove
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.rename
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			dir.create
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			file.exists
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			tempdir
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			tempfile
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			download.file
		</td>
<td></td>
<td>
			I/O
		</td>
<td>
			Files and Directories
		</td>
</tr>
<tr>
<td>
			ISOdate
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			ISOdatetime
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			strftime
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			strptime
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			date
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			difftime
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			julian
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			months
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			quarters
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			weekdays
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			library(lubridate)
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Date / Time
		</td>
</tr>
<tr>
<td>
			grep
		</td>
<td>
			match
		</td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			agrep
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			gsub
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			strsplit
		</td>
<td>
			split
		</td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			chartr
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			nchar
		</td>
<td>
			strlen
		</td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			tolower
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			toupper
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			substr
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			paste
		</td>
<td>
			join
		</td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			library(stringr)
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Character Manipulation
		</td>
</tr>
<tr>
<td>
			factor
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			levels
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			nlevels
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			reorder
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			relevel
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			cut
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			findInterval
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			interaction
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			options(stringsAsFactors = FALSE)
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Factors
		</td>
</tr>
<tr>
<td>
			array
		</td>
<td>
			[]
		</td>
<td>
			Special Data
		</td>
<td>
			Array Manipulation
		</td>
</tr>
<tr>
<td>
			dim
		</td>
<td>
			size
		</td>
<td>
			Special Data
		</td>
<td>
			Array Manipulation
		</td>
</tr>
<tr>
<td>
			dimnames
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Array Manipulation
		</td>
</tr>
<tr>
<td>
			aperm
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Array Manipulation
		</td>
</tr>
<tr>
<td>
			library(abind)
		</td>
<td></td>
<td>
			Special Data
		</td>
<td>
			Array Manipulation
		</td>
</tr>
</table>
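<p>As a rough illustration of how a few of the filled-in rows line up, here is a minimal sketch written against current Julia 1.x. That version is an assumption on top of the table: some names have changed since this list was compiled (for example, <code>strlen</code> is spelled <code>length</code> today), and the sample string is made up for the example.</p>

```julia
# Modern Julia (1.x) counterparts for a few rows of the table above.
# Assumption: `strlen` from the table is spelled `length` in current Julia.

s = "alpha,beta,gamma"           # made-up sample data

m = match(r"b\w+", s)            # R: grep (first regex match)
parts = split(s, ",")            # R: strsplit
n = length(s)                    # R: nchar (the table's `strlen`)
joined = join(parts, "-")        # R: paste(..., collapse = "-")

A = reshape(1:6, 2, 3)           # R: array(1:6, dim = c(2, 3))
dims = size(A)                   # R: dim

println(m.match)
println(n)
println(joined)
println(dims)
```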
<p>I&#8217;d like to note that holes in the list of Julia functions can exist for several reasons:</p>
<ol>
<li>The language does not yet have the relevant features. This is true of things like <code>factor()</code> or <code>data.frame()</code>.</li>
<li>The language has draft implementations of the relevant features, but they are not yet ready to make their way into this list. This is true of Doug Bates&#8217; GLM code, for example.</li>
<li>I simply don&#8217;t know what the Julia equivalent is for an R function, but it may well exist. If you know of one, please fork the GitHub repository I&#8217;m using and revise the CSV file appropriately. I&#8217;ll integrate relevant pull requests as soon as I can find time.</li>
</ol>
<p>In addition to explaining the presence of the many holes you can see in this list, I&#8217;d also like to note how quickly these holes are being filled in: Doug Bates has already finished a wrapper for the Rmath library, which means that Julia now has tools for calculating the PDFs, CDFs, and inverse CDFs of most statistical distributions, as well as the ability to draw random samples from them. That means that almost any sort of MCMC you&#8217;d like to do is already possible in Julia. (I, for one, am really interested to see if someone will use Julia&#8217;s sparse matrix support and these new Rmath functions to build MCMC code that&#8217;s easy on the eyes while also running at an appropriately fast speed on complicated, big data problems like matrix factorizations.)</p>
<p>On my end, I&#8217;ve been working on filling some of the missing entries in this list by adding in pieces that I think I understand well enough to implement from scratch, such as:</p>
<ul>
<li>Optimization algorithms (<a href="https://github.com/johnmyleswhite/optim.jl">optim.jl</a>):</li>
<ul>
<li>Simulated annealing</li>
<li>Gradient descent</li>
<li>Newton&#8217;s method</li>
</ul>
<li>Statistical hypothesis tests (<a href="https://github.com/johnmyleswhite/stats.jl">stats.jl</a>):</li>
<ul>
<li>t-Tests</li>
</ul>
<li>Utility functions (<a href="https://github.com/johnmyleswhite/utils.jl">utils.jl</a>):</li>
<ul>
<li>range</li>
<li>keys</li>
<li>cummax</li>
<li>cummin</li>
</ul>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/04/09/comparing-julia-and-rs-vocabularies/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Simulated Annealing in Julia</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/#comments</comments>
		<pubDate>Wed, 04 Apr 2012 20:38:53 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4366</guid>
		<description><![CDATA[Building Optimization Functions for Julia In hopes of adding enough statistical functionality to Julia to make it usable for my day-to-day modeling projects, I&#8217;ve written a very basic implementation of the simulated annealing (SA) algorithm, which I&#8217;ve placed in the same JuliaVsR GitHub repository that I used for the code for my previous post about [...]]]></description>
			<content:encoded><![CDATA[<h3>Building Optimization Functions for Julia</h3>
<p>In hopes of adding enough statistical functionality to <a href="http://julialang.org/">Julia</a> to make it usable for my day-to-day modeling projects, I&#8217;ve written a very basic implementation of the simulated annealing (SA) algorithm, which I&#8217;ve placed in the same <a href="https://github.com/johnmyleswhite/JuliaVsR">JuliaVsR GitHub</a> repository that I used for the code for <a href="http://www.johnmyleswhite.com/notebook/2012/03/31/julia-i-love-you/">my previous post about Julia</a>. For easier reading, my current code for SA is shown below:</p>
<h3>The Simulated Annealing Algorithm</h3>

<div class="wp_codebox"><table><tr id="p436614"><td class="code" id="p4366code14"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># simulated_annealing()</span>
<span style="color: #808080; font-style: italic;"># Arguments:</span>
<span style="color: #808080; font-style: italic;"># * cost: Function from states to the real numbers. Often called an energy function, but this algorithm works for both positive and negative costs.</span>
<span style="color: #808080; font-style: italic;"># * s0: The initial state of the system.</span>
<span style="color: #808080; font-style: italic;"># * neighbor: Function from states to states. Produces what the Metropolis algorithm would call a proposal.</span>
<span style="color: #808080; font-style: italic;"># * temperature: Function specifying the temperature at time i.</span>
<span style="color: #808080; font-style: italic;"># * iterations: How many iterations of the algorithm should be run? This is the only termination condition.</span>
<span style="color: #808080; font-style: italic;"># * keep_best: Do we return the best state visited or the last state visited? (Should default to true.)</span>
<span style="color: #808080; font-style: italic;"># * trace: Do we show a trace of the system's evolution?</span>
&nbsp;
function simulated_annealing<span style="color: black;">&#40;</span>cost,
                             s0,
                             neighbor,
                             temperature,
                             iterations,
                             keep_best,
                             trace<span style="color: black;">&#41;</span>
&nbsp;
  <span style="color: #808080; font-style: italic;"># Set our current state to the specified initial state.</span>
  s = s0
&nbsp;
  <span style="color: #808080; font-style: italic;"># Set the best state we've seen to the initial state.</span>
  best_s = s0
&nbsp;
  <span style="color: #808080; font-style: italic;"># We always perform a fixed number of iterations.</span>
  <span style="color: #ff7700;font-weight:bold;">for</span> i = <span style="color: #ff4500;">1</span>:iterations
    t = temperature<span style="color: black;">&#40;</span>i<span style="color: black;">&#41;</span>
    s_n = neighbor<span style="color: black;">&#40;</span>s<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;$i: s = $s&quot;</span><span style="color: black;">&#41;</span> end
    <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;$i: s_n = $s_n&quot;</span><span style="color: black;">&#41;</span> end
    y = cost<span style="color: black;">&#40;</span>s<span style="color: black;">&#41;</span>
    y_n = cost<span style="color: black;">&#40;</span>s_n<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;$i: y = $y&quot;</span><span style="color: black;">&#41;</span> end
    <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;$i: y_n = $y_n&quot;</span><span style="color: black;">&#41;</span> end
    <span style="color: #ff7700;font-weight:bold;">if</span> y_n <span style="color: #66cc66;">&lt;</span>= y
      s = s_n
      <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Accepted&quot;</span><span style="color: black;">&#41;</span> end
    <span style="color: #ff7700;font-weight:bold;">else</span>
      p = exp<span style="color: black;">&#40;</span>- <span style="color: black;">&#40;</span><span style="color: black;">&#40;</span>y_n - y<span style="color: black;">&#41;</span> / t<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
      <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;$i: p = $p&quot;</span><span style="color: black;">&#41;</span> end
      <span style="color: #ff7700;font-weight:bold;">if</span> rand<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">&lt;</span>= p
        s = s_n
        <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Accepted&quot;</span><span style="color: black;">&#41;</span> end
      <span style="color: #ff7700;font-weight:bold;">else</span>
        s = s
        <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Rejected&quot;</span><span style="color: black;">&#41;</span> end
      end
    end
    <span style="color: #ff7700;font-weight:bold;">if</span> trace println<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> end
    <span style="color: #ff7700;font-weight:bold;">if</span> cost<span style="color: black;">&#40;</span>s<span style="color: black;">&#41;</span> <span style="color: #66cc66;">&lt;</span> cost<span style="color: black;">&#40;</span>best_s<span style="color: black;">&#41;</span>
      best_s = s
    end
  end
&nbsp;
  <span style="color: #ff7700;font-weight:bold;">if</span> keep_best
    best_s
  <span style="color: #ff7700;font-weight:bold;">else</span>
    s
  end
end</pre></td></tr></table></div>

<p>I&#8217;ve tried to implement the algorithm as it was presented by <a href="http://www.mit.edu/~dbertsim/">Bertsimas</a> and <a href="http://www.mit.edu/~jnt/home.html">Tsitsiklis</a> in their <a href="http://www.mit.edu/~jnt/Papers/J045-93-ber-anneal.pdf">1993 review paper in Statistical Science</a>. The major differences between my implementation and their description of the algorithm are that (1) I&#8217;ve made it possible to keep the best solution found during the search rather than always use the last solution found and (2) I&#8217;ve made no effort to select a value for their <code>d</code> parameter carefully: I&#8217;ve simply set it to 1 for all of my examples, though my code will allow you to specify a rule for setting the temperature of the annealer at time t other than the <code>1 / log(t)</code> cooling scheme I&#8217;ve been using. (In fact, the code currently forces you to select a cooling scheme, since there are no default arguments in Julia yet.)</p>
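<p>The cooling scheme described above can be sketched as follows. This assumes current Julia 1.x syntax rather than the 2012 syntax of the listing, and the body of <code>log_temperature</code> is a guess at the rule described in the text, with the <code>d</code> parameter exposed as a keyword; <code>acceptance</code> is a helper name introduced only for this example.</p>

```julia
# Logarithmic cooling schedule T(i) = d / log(i).
# `d` is the constant from the Bertsimas and Tsitsiklis description; the
# post fixes it at 1. The function body is an assumption, not the repo's code.
log_temperature(i; d = 1.0) = d / log(i)

# Probability of accepting a move that raises the cost by `delta > 0`
# at temperature `t` (hypothetical helper for this illustration).
acceptance(delta, t) = exp(-delta / t)

for i in (1, 2, 10, 100, 10_000)
    t = log_temperature(i)
    println("i = $i: T = $t, P(accept delta = 1) = $(acceptance(1.0, t))")
end
```

<p>With <code>d = 1</code> and a cost increase of 1, the acceptance probability works out to <code>exp(-log(i)) = 1 / i</code>, so uphill moves are always accepted at the first iteration (where the temperature is infinite) and become rare as the search proceeds.</p>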
<p>I chose simulated annealing as my first optimization algorithm to implement in Julia because it&#8217;s a natural relative of the <a href="http://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis algorithm</a> that I used in the previous post. Indeed, coding up an implementation of SA made me appreciate the fact that the Metropolis algorithm as used in Bayesian statistics can be considered a special case of the SA algorithm in which the temperature is always 1 and in which the cost function for a state with posterior probability <code>p</code> is <code>-log(p)</code>.</p>
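<p>The special-case relationship can be checked numerically. The following is a small sketch in current Julia 1.x; the two probabilities are made-up values, and <code>sa_accept</code>, <code>metropolis_accept</code> and <code>cost</code> are names introduced only for this illustration.</p>

```julia
# Acceptance probability implied by the SA listing above: downhill moves
# are always accepted, uphill moves with probability exp(-(y_n - y) / t).
sa_accept(y, y_n, t) = min(1.0, exp(-(y_n - y) / t))

# Metropolis acceptance probability for a symmetric proposal.
metropolis_accept(p, p_n) = min(1.0, p_n / p)

# Hypothetical posterior probabilities of the current and proposed states.
p, p_n = 0.30, 0.12

# Cost of a state is -log(posterior probability); temperature is fixed at 1.
cost(prob) = -log(prob)

println(sa_accept(cost(p), cost(p_n), 1.0))
println(metropolis_accept(p, p_n))
```

<p>Both lines print the same value up to floating-point rounding, because <code>exp(-(-log(p_n) + log(p)))</code> simplifies to <code>p_n / p</code>.</p>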
<p>Coding up the SA algorithm for myself also made me realize why it&#8217;s important that it uses an additive comparison of cost functions rather than a ratio (as in the Metropolis algorithm): the ratio goes haywire when the cost function can take on both positive and negative values (which, of course, doesn&#8217;t matter for Bayesian methods because probabilities are strictly non-negative). I discovered this when I initially tried to code up SA from my inaccurate memory without first consulting the literature and discovered that I couldn&#8217;t get a ratio-based algorithm to work no matter how many times I tried changing the cooling schedule.</p>
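<p>To make the failure mode concrete, here is a small sketch in current Julia 1.x. The ratio rule shown is a hypothetical stand-in chosen by analogy with Metropolis, not a reconstruction of the discarded code, and both function names are invented for the example.</p>

```julia
# Difference-based rule from the SA listing above.
diff_accept(y, y_n, t) = min(1.0, exp(-(y_n - y) / t))

# A hypothetical ratio-based rule: accept with "probability" y / y_n.
# This is only an illustration of how a ratio can misbehave.
ratio_accept(y, y_n) = y / y_n

# The same uphill step of size 1, taken at three different cost levels.
for (y, y_n) in ((1.0, 2.0), (-2.0, -1.0), (-0.5, 0.5))
    println("y = $y, y_n = $y_n: difference rule = ",
            diff_accept(y, y_n, 1.0), ", ratio rule = ", ratio_accept(y, y_n))
end
```

<p>The difference rule gives the same answer for all three steps, since it only depends on <code>y_n - y</code>. The ratio rule gives 0.5 when both costs are positive, 2.0 when both are negative (so a worse state would always be accepted), and -1.0 when the costs straddle zero, which is not a probability at all.</p>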
<p>To test out the SA implementation I&#8217;ve written, I&#8217;ve written two test cases that attempt to minimize the <a href="http://en.wikipedia.org/wiki/Rosenbrock_function">Rosenbrock</a> and <a href="http://en.wikipedia.org/wiki/Himmelblau%27s_function">Himmelblau</a> functions, which I found listed as examples of hard-to-minimize functions in the <a href="http://en.wikipedia.org/wiki/Nelder–Mead_method">Wikipedia description of the Nelder-Mead method</a>. You can see those two examples below this paragraph. In addition, I&#8217;ve used R to generate plots showing how the SA algorithm works under repeated application on the same optimization problem; in these plots, I&#8217;ve used a heatmap to show the cost function&#8217;s value at each <code>(x, y)</code> position, colored crosshairs to indicate the position of a true minimum of the function in question, and red dots to indicate the purported solutions found by my implementation of SA.</p>
<h3>Finding the Minimum of the Rosenbrock Function</h3>

<div class="wp_codebox"><table><tr id="p436615"><td class="code" id="p4366code15"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># Find a solution of the Rosenbrock function using SA.</span>
load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;simulated_annealing.jl&quot;</span><span style="color: black;">&#41;</span>
load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;../rng.jl&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
function rosenbrock<span style="color: black;">&#40;</span>x, y<span style="color: black;">&#41;</span>
  <span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span> - x<span style="color: black;">&#41;</span>^<span style="color: #ff4500;">2</span> + <span style="color: #ff4500;">100</span><span style="color: black;">&#40;</span>y - x^<span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>^<span style="color: #ff4500;">2</span>
end
&nbsp;
function neighbors<span style="color: black;">&#40;</span>z<span style="color: black;">&#41;</span>
  <span style="color: black;">&#91;</span>rand_uniform<span style="color: black;">&#40;</span>z<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> - <span style="color: #ff4500;">1</span>, z<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>, rand_uniform<span style="color: black;">&#40;</span>z<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span> - <span style="color: #ff4500;">1</span>, z<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
end
&nbsp;
srand<span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
&nbsp;
solution = simulated_annealing<span style="color: black;">&#40;</span>z -<span style="color: #66cc66;">&gt;</span> rosenbrock<span style="color: black;">&#40;</span>z<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>, z<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
                               <span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span>, <span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>,
                               neighbors,
                               log_temperature,
                               <span style="color: #ff4500;">10000</span>,
                               true,
                               false<span style="color: black;">&#41;</span></pre></td></tr></table></div>

<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/04/rosenbrock.png" alt="rosenbrock.png" border="0" width="600" height="600" /></div>
<h3>Finding the Minima of the Himmelblau Function</h3>

<div class="wp_codebox"><table><tr id="p436616"><td class="code" id="p4366code16"><pre class="python" style="font-family:monospace;"><span style="color: #808080; font-style: italic;"># Find a solution of the Himmelblau function using SA.</span>
load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;simulated_annealing.jl&quot;</span><span style="color: black;">&#41;</span>
load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;../rng.jl&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
function himmelbrau<span style="color: black;">&#40;</span>x, y<span style="color: black;">&#41;</span>
  <span style="color: black;">&#40;</span>x^<span style="color: #ff4500;">2</span> + y - <span style="color: #ff4500;">11</span><span style="color: black;">&#41;</span>^<span style="color: #ff4500;">2</span> + <span style="color: black;">&#40;</span>x + y^<span style="color: #ff4500;">2</span> - <span style="color: #ff4500;">7</span><span style="color: black;">&#41;</span>^<span style="color: #ff4500;">2</span>
end
&nbsp;
function neighbors<span style="color: black;">&#40;</span>z<span style="color: black;">&#41;</span>
  <span style="color: black;">&#91;</span>rand_uniform<span style="color: black;">&#40;</span>z<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> - <span style="color: #ff4500;">1</span>, z<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>, rand_uniform<span style="color: black;">&#40;</span>z<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span> - <span style="color: #ff4500;">1</span>, z<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span> + <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
end
&nbsp;
srand<span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
&nbsp;
solution = simulated_annealing<span style="color: black;">&#40;</span>z -<span style="color: #66cc66;">&gt;</span> himmelbrau<span style="color: black;">&#40;</span>z<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>, z<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>,
                               <span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span>, <span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>,
                               neighbors,
                               log_temperature,
                               <span style="color: #ff4500;">10000</span>,
                               true,
                               false<span style="color: black;">&#41;</span></pre></td></tr></table></div>

<div style="text-align:center;"><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/04/himmelbrau.png" alt="himmelbrau.png" border="0" width="600" height="600" /></div>
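<p>The true minima that the two test problems are aiming for can be checked by direct evaluation. The sketch below uses current Julia 1.x; the minimizer coordinates are the standard published values rounded to six decimals rather than anything taken from the plots, and the function is spelled <code>himmelblau</code> here although the listings use the identifier <code>himmelbrau</code>.</p>

```julia
# The two test functions, written with explicit multiplication.
rosenbrock(x, y) = (1 - x)^2 + 100 * (y - x^2)^2
himmelblau(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2

# Known minimizers (standard published values, rounded to six decimals).
rosenbrock_min = (1.0, 1.0)
himmelblau_mins = [(3.0, 2.0),
                   (-2.805118, 3.131312),
                   (-3.779310, -3.283186),
                   (3.584428, -1.848126)]

println(rosenbrock_min, " -> ", rosenbrock(rosenbrock_min...))
for m in himmelblau_mins
    println(m, " -> ", himmelblau(m...))
end
```

<p>Both functions are sums of squares, so their minimum value is zero. The Rosenbrock function has a single minimizer, whereas the Himmelblau function has four, which is why repeated runs of SA can legitimately end up in different places on that problem.</p>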
<h3>Moving Forward</h3>
<p>Now that I&#8217;ve got a form of SA working, I&#8217;m interested in coding up a suite of optimization functions for Julia so that I can start to do maximum likelihood estimation in pure Julia. Once that&#8217;s possible, I can use Julia to do real science, e.g. when I need to fit simple models for which finding the MLE is appropriate. (I will leave the development of cleaner statistical functions for special cases of maximum likelihood estimation to more capable people, like Douglas Bates, <a href="https://groups.google.com/d/topic/julia-dev/GeouH--RPZo/discussion">who has already produced some GLM code</a>.)</p>
<p>At present my code is meant simply to demonstrate how one could write an implementation of simulated annealing in Julia. I&#8217;m sure that the code can be more efficient and I suspect that I&#8217;ve violated some of the idioms of the language. In addition, I&#8217;d much prefer that this function use default values for many of the arguments, as there is no reason that an end-user needs to be concerned with finding the best cooling schedule if SA seems to work out of the box on their problem with the cooling schedule I&#8217;ve been using.</p>
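<p>For what it is worth, current Julia 1.x does support default values through keyword arguments, so the interface wished for above could look roughly like the sketch below. The argument names follow the listing, but the default values, the uniform <code>neighbor</code> proposal and the use of <code>Random.seed!</code> are assumptions made for the example, and the tracing code is omitted.</p>

```julia
using Random

# Default cooling schedule: T(i) = 1 / log(i).
log_temperature(i) = 1 / log(i)

# Same algorithm as the listing, with keyword arguments and defaults.
# The default values are assumptions, not settings taken from the post.
function simulated_annealing(cost, s0, neighbor;
                             temperature = log_temperature,
                             iterations = 10_000,
                             keep_best = true)
    s, y = s0, cost(s0)
    best_s, best_y = s, y
    for i in 1:iterations
        t = temperature(i)
        s_n = neighbor(s)
        y_n = cost(s_n)
        # Always accept downhill moves; accept uphill moves with
        # probability exp(-(y_n - y) / t).
        if y_n <= y || rand() <= exp(-(y_n - y) / t)
            s, y = s_n, y_n
        end
        # Remember the lowest-cost state visited so far.
        if y < best_y
            best_s, best_y = s, y
        end
    end
    keep_best ? best_s : s
end

rosenbrock(z) = (1 - z[1])^2 + 100 * (z[2] - z[1]^2)^2

# Uniform proposal within 1 unit of the current point in each coordinate.
neighbor(z) = z .+ 2 .* rand(2) .- 1

Random.seed!(1)
solution = simulated_annealing(rosenbrock, [0.0, 0.0], neighbor)
println(solution, " has cost ", rosenbrock(solution))
```

<p>A caller who is happy with the defaults only has to supply the cost function, the starting state and the proposal function; anything else can be overridden by name, for example <code>iterations = 50_000</code>.</p>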
<p>With those disclaimers about my code in place, I&#8217;d like to think that I haven&#8217;t made any mathematical errors and that this function can be used by others. So, I&#8217;d ask that those interested please tear apart my code so that I can make it usable as a general purpose function for optimization in Julia. Alternatively, please tell me that there&#8217;s no need for a pure Julia implementation of SA because, for example, there are nice C libraries that would be much easier to link to than to re-implement.</p>
<p>With an implementation of SA in place, I&#8217;ll probably start working on implementing L-BFGS-B soon, which is the other optimization algorithm I use often in R. (To be honest, I use L-BFGS-B almost exclusively, but SA was much easier to implement.)</p>
<p>Incidentally, this code base demonstrates how I view the relationship between R and Julia: Julia is a beautiful new language that is still missing many important pieces. We can all work together to build the best pieces of R that are missing from Julia. While we&#8217;re working on improving Julia, we&#8217;ll need to keep using R to handle things like visualization of our results. For this post, I turned back to ggplot2 for all of the graphics generation.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/04/04/simulated-annealing-in-julia/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Julia, I Love You</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/03/31/julia-i-love-you/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/03/31/julia-i-love-you/#comments</comments>
		<pubDate>Sat, 31 Mar 2012 21:47:05 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4354</guid>
		<description><![CDATA[Julia is a new language for scientific computing that is winning praise from a slew of very smart people, including Harlan Harris, Chris Fonnesbeck, Douglas Bates, Vince Buffalo and Shane Conway. As a language, it has lofty design goals, which, if attained, will make it noticeably superior to Matlab, R and Python for scientific programming. [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://julialang.org/">Julia</a> is a new language for scientific computing that is winning praise from a slew of very smart people, including <a href="https://en.twitter.com/#!/HarlanH/status/173478216781144065">Harlan Harris</a>, <a href="https://de.twitter.com/#!/fonnesbeck/status/185399199976787968">Chris Fonnesbeck</a>, <a href="http://dmbates.blogspot.com/2012/03/julia-version-of-multinomial-sampler_12.html">Douglas Bates</a>, <a href="http://vincebuffalo.org/2012/03/07/thoughts-on-julia.html">Vince Buffalo</a> and <a href="http://www.statalgo.com/2012/03/24/statistics-with-julia/">Shane Conway</a>. As a language, it has lofty design goals, which, if attained, will make it noticeably superior to Matlab, R and Python for scientific programming. In the core development team&#8217;s <a href="http://julialang.org/blog/2012/02/why-we-created-julia/">own words</a>:</p>
<blockquote><p>
We want a language that&#8217;s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that&#8217;s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.</p>
<p>(Did we mention it should be as fast as C?)
</p></blockquote>
<p>Remarkably, Julia seems to be on its way to meeting those goals. Last night, I decided to see for myself whether Julia would live up to the hype. So I taught myself just enough of the language to write an implementation of the slowest R code I&#8217;ve ever written: the Metropolis algorithm-style sampler Drew and I use in Chapter 7 of <a href="http://shop.oreilly.com/product/0636920018483.do">Machine Learning for Hackers</a> to show off randomized, iterative optimization algorithms. You can find both the original R code and my new Julia code on <a href="https://github.com/johnmyleswhite/JuliaVsR">GitHub</a> in two files named <code>cipher.R</code> and <code>cipher.jl</code>, respectively.</p>
<p>In my opinion, the new code in Julia is easier to read than the R code because Julia has fewer syntactic quirks than R. More importantly, the Julia code runs much faster than the R code without any real effort put into speed optimization. For the sample text I tried to decipher, the Julia code completes 50,000 iterations of the sampler in 51 seconds, while the R code completes the same 50,000 iterations in 67 minutes &#8212; making the R code more than 75 times slower than the Julia code.</p>
<p>Having seen that example alone, I would be convinced Julia is a real contender for the future of scientific computing. But this iterative sampling algorithm is not close to being the harshest comparison between Julia and R on my machine. For a more powerful example (lifted straight from the Julia docs), we can compare Julia and R code for computing the 25th Fibonacci number recursively.</p>
<p>First, the Julia code:</p>

<div class="wp_codebox"><table><tr id="p435419"><td class="code" id="p4354code19"><pre class="c" style="font-family:monospace;">fib<span style="color: #009900;">&#40;</span>n<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> n <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">2</span> <span style="color: #339933;">?</span> n <span style="color: #339933;">:</span> fib<span style="color: #009900;">&#40;</span>n <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> fib<span style="color: #009900;">&#40;</span>n <span style="color: #339933;">-</span> <span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span>
@elapsed fib<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">25</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Second, the R code:</p>

<div class="wp_codebox"><table><tr id="p435420"><td class="code" id="p4354code20"><pre class="c" style="font-family:monospace;">fib <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>n<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  ifelse<span style="color: #009900;">&#40;</span>n <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">2</span><span style="color: #339933;">,</span> n<span style="color: #339933;">,</span> fib<span style="color: #009900;">&#40;</span>n <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> fib<span style="color: #009900;">&#40;</span>n <span style="color: #339933;">-</span> <span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
start <span style="color: #339933;">&lt;-</span> Sys.<span style="color: #202020;">time</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
fib<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">25</span><span style="color: #009900;">&#41;</span>
end <span style="color: #339933;">&lt;-</span> Sys.<span style="color: #202020;">time</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
end <span style="color: #339933;">-</span> start</pre></td></tr></table></div>

<p>The Julia code takes around 8 milliseconds to complete, whereas the R code takes around 4000 milliseconds. In this case, R is 500 times slower than Julia. To me, that&#8217;s sufficient reason to want to start focusing my time on implementing the algorithms I care about in Julia. I hope others will consider doing the same.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/03/31/julia-i-love-you/feed/</wfw:commentRss>
		<slash:comments>36</slash:comments>
		</item>
		<item>
		<title>Back to Blogging</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/03/31/back-to-blogging/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/03/31/back-to-blogging/#comments</comments>
		<pubDate>Sat, 31 Mar 2012 19:47:36 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Site News]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4351</guid>
		<description><![CDATA[If you&#8217;re subscribed to this blog, you&#8217;ve surely noticed the very long hiatus I&#8217;ve taken from writing over the last six months. I wish I&#8217;d kept up with blogging more faithfully this year, but, in my defense, I&#8217;ve been busy doing a few big things: I wrote a book with Drew Conway called Machine Learning [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;re subscribed to this blog, you&#8217;ve surely noticed the very long hiatus I&#8217;ve taken from writing over the last six months. I wish I&#8217;d kept up with blogging more faithfully this year, but, in my defense, I&#8217;ve been busy doing a few big things:</p>
<ol>
<li>I wrote a book with <a href="http://www.drewconway.com/zia/">Drew Conway</a> called <a href="http://shop.oreilly.com/product/0636920018483.do">Machine Learning for Hackers</a>, which was published last month by O&#8217;Reilly Media. It&#8217;s an introduction to basic machine learning algorithms for programmers who&#8217;d like to skip the mathematical notation that ML algorithms are traditionally described in. You can pick up a copy from <a href="http://www.amazon.com/Machine-Learning-Hackers-Drew-Conway/dp/1449303714">Amazon</a> or your favorite bookseller.</li>
<li>I&#8217;ve been working on completing my research projects at Princeton so that I can submit my dissertation on time and finish my Ph.D. a year from now.</li>
<li>I&#8217;ve been preparing to spend my summer this year as an intern at Microsoft Research, where I&#8217;ll be working with <a href="http://research.microsoft.com/en-us/um/people/counts/">Scott Counts</a>.</li>
<li>I&#8217;ve been developing materials for teaching Bayesian methods to people with some knowledge of statistical theory, but no knowledge of Bayesian theory and no experience using advanced computational statistical methods like MCMC. You can find the materials I&#8217;ve prepared for a short seminar course I&#8217;m now offering on <a href="https://github.com/johnmyleswhite/JAGSExamples">GitHub</a>.</li>
<li>Amid all those academic pursuits, I&#8217;ve also been maintaining some semblance of a social life, which consumes those rare hours when I&#8217;m not working.</li>
</ol>
<p>That said, the book is published and the summer is fast approaching, so I&#8217;ve decided that it&#8217;s time to start blogging again. Expect to see a couple of posts about Julia, Bayesian methods and Null Hypothesis Significance Testing in the coming weeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/03/31/back-to-blogging/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Psychology of Music and the &#8216;tuneR&#8217; Package</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/10/25/the-psychology-of-music-and-the-tuner-package/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/10/25/the-psychology-of-music-and-the-tuner-package/#comments</comments>
		<pubDate>Wed, 26 Oct 2011 01:28:41 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Music]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4311</guid>
		<description><![CDATA[Introduction This semester I&#8217;m TA&#8217;ing a course on the Psychology of Music taught by Phil Johnson-Laird. It&#8217;s been a great course to teach because (i) so much of the material is new to me and (ii) because the study of the psychology of music brings together so many of the intellectual tools I enjoy, including [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>This semester I&#8217;m TA&#8217;ing a course on the <a href="http://psych.princeton.edu/psychology/research/johnson_laird/music.php">Psychology of Music</a> taught by <a href="http://psych.princeton.edu/psychology/research/johnson_laird/">Phil Johnson-Laird</a>. It&#8217;s been a great course to teach because (i) so much of the material is new to me and (ii) the study of the psychology of music brings together so many of the intellectual tools I enjoy, including music theory, psychophysics and Fourier analysis.</p>
<p>One topic this semester that was completely new to me was the theory of tuning: I had known about the invention of the <a href="http://en.wikipedia.org/wiki/Well_temperament">well-tempered system of tuning</a>, but had never heard of <a href="http://en.wikipedia.org/wiki/Pythagorean_tuning">Pythagorean tuning</a> or <a href="http://en.wikipedia.org/wiki/Just_intonation">just tuning</a> &#8212; and certainly was not aware that the well-tempered system <a href="http://en.wikipedia.org/wiki/The_Well-Tempered_Clavier">Bach celebrated</a> was not identical to our current equal-tempered system of tuning.</p>
<p>As a way of consolidating some of the knowledge I&#8217;ve gained, I decided I&#8217;d write a blog entry after several months of neglecting this blog. (For that neglect, I&#8217;ll blame a combination of grant writing, book writing, ongoing research projects and personal life developments.) In what follows, I&#8217;ll give a brief overview of the theory of tuning at a level that should be accessible to anyone who&#8217;s familiar with the names of intervals and feels comfortable thinking quantitatively.</p>
<p>After surveying the field, I&#8217;ll turn to a discussion of some code I&#8217;ve written in R that implements these ideas using the &#8216;tuneR&#8217; package, which is one of my favorite hidden gems from CRAN. Along the way, I&#8217;ll introduce some of the simplest tools from the &#8216;tuneR&#8217; package that can be used for generating computer music.</p>
<h3>Tuning Systems: Pythagorean, Just and 12-Tet</h3>
<p>It&#8217;s worth noting right at the start that tuning is a misleading name for the topic we&#8217;ll be discussing: we&#8217;re not talking about how one tunes a fixed instrument so that it sounds in tune, but rather we&#8217;re interested in how one defines the very notes that the instrument should be able to produce when it&#8217;s perfectly in tune.</p>
<p>To make that clear, let&#8217;s assume that we&#8217;ve accepted as a given that a frequency of 440 Hz will be called A. Our problem then becomes one of deciding which of the infinitely many frequencies we could produce actually deserves the label of A#, B, C, C#, and so on.</p>
<h4>Pythagorean Tuning</h4>
<p>The simplest solution to this problem I know of is the <a href="http://en.wikipedia.org/wiki/Pythagorean_tuning">Pythagorean tuning system</a>. It&#8217;s based on constructing all of the possible notes using a series of perfect fifths. If you remember the <a href="http://en.wikipedia.org/wiki/Circle_of_fifths">Circle of Fifths</a>, you&#8217;ll remember that you can reach every chromatic note by ascending fifths: if you start at A, you&#8217;ll proceed through E, B, F# and so on.</p>
<p>The Pythagorean system implements the Circle of Fifths directly using repeated multiplication of a base frequency. To do this, you first declare that a perfect fifth sits at 3/2 times your base frequency. For example, this definition implies that the perfect fifth above the A at 440 Hz has to be at a frequency of 3/2 * 440 = 660 Hz. Once you do this, you&#8217;ve defined the frequency we&#8217;ll call E.</p>
<p>Following the same logic, the perfect fifth above E gives you a B at 3/2 * 660 = 990 Hz. Of course, this B lies above the A at 880 Hz, outside the octave that starts at our base A of 440 Hz, so you transpose it down an octave to produce the B you&#8217;ll actually use. To do this, you need to assume that an octave is at a frequency 2 times the base frequency. Since we&#8217;ve accepted that 990 Hz is a B, we divide 990 by 2 and conclude that 495 Hz should be B.</p>
<p>With these three notes defined, we have the following table of frequency/note pairs:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio with 440 Hz</th>
</tr>
<tr>
<td>A</td>
<td>440 Hz</td>
<td>1</td>
</tr>
<tr>
<td>E</td>
<td>660 Hz</td>
<td>3/2</td>
</tr>
<tr>
<td>B</td>
<td>495 Hz</td>
<td>9/8</td>
</tr>
</table>
<p>If we continue on with this logic, multiplying or dividing by 3/2 to move up or down by fifths and multiplying or dividing by 2 to bring each result back into the octave above 440 Hz, we will eventually produce a complete table for all of the notes in the chromatic scale that looks like the following:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio</th>
</tr>
<tr>
<td>A</td>
<td>440</td>
<td>1</td>
</tr>
<tr>
<td>A#</td>
<td>463.5391</td>
<td>256/243</td>
</tr>
<tr>
<td>B</td>
<td>495</td>
<td>9/8</td>
</tr>
<tr>
<td>C</td>
<td>521.4815</td>
<td>32/27</td>
</tr>
<tr>
<td>C#</td>
<td>556.875</td>
<td>81/64</td>
</tr>
<tr>
<td>D</td>
<td>586.6667</td>
<td>4/3</td>
</tr>
<tr>
<td>D#</td>
<td>626.4844</td>
<td>729/512</td>
</tr>
<tr>
<td>E</td>
<td>660</td>
<td>3/2</td>
</tr>
<tr>
<td>F</td>
<td>695.3086</td>
<td>128/81</td>
</tr>
<tr>
<td>F#</td>
<td>742.5</td>
<td>27/16</td>
</tr>
<tr>
<td>G</td>
<td>782.2222</td>
<td>16/9</td>
</tr>
<tr>
<td>G#</td>
<td>835.3125</td>
<td>243/128</td>
</tr>
<tr>
<td>A</td>
<td>880</td>
<td>2</td>
</tr>
</table>
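<p>As a rough check on the table above, here is a minimal Python sketch (not the R code used for this post) that rebuilds the same ratios. It assumes the twelve notes come from moving between 5 fifths down and 6 fifths up from A, which is the range that happens to reproduce the fractions shown; the helper name <code>fold_into_octave</code> is made up for illustration.</p>

```python
from fractions import Fraction

BASE_HZ = 440  # the A that the text takes as given
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]


def fold_into_octave(ratio):
    """Multiply or divide by 2 until the ratio lies in [1, 2)."""
    while ratio >= 2:
        ratio /= 2
    while ratio < 1:
        ratio *= 2
    return ratio


# Assumption: fifth steps -5..+6 reproduce the ratios in the table.
fifth_steps = range(-5, 7)
ratios = sorted(fold_into_octave(Fraction(3, 2) ** k) for k in fifth_steps)

# Sorting by size lines the ratios up with the chromatic note names.
pythagorean = dict(zip(NOTE_NAMES, ratios))

for note, ratio in pythagorean.items():
    print(f"{note:2}  {float(BASE_HZ * ratio):9.4f}  {ratio}")
```

<p>Using exact fractions rather than floating point keeps the ratios readable and makes it easy to compare them with the table by eye.</p>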
<p>One thing about this table might strike you as odd if you&#8217;re mathematically savvy: the octave, which we&#8217;ve defined by fiat as a ratio of 2:1, could never have been produced by successive multiplication by 3/2, since no power of 3 will be evenly divisible by a power of 2. This is the one flub in the Pythagorean system: you can&#8217;t really produce the entire chromatic scale using only multiplication by 3/2. Here we&#8217;ve solved that problem by replacing the note we would have called A with a true octave generated using multiplication by 2. The octave you reach by stacking fifths is slightly out of tune with our preferred definition of an octave, and you may hear people refer to this discrepancy as <a href="http://en.wikipedia.org/wiki/Pythagorean_comma">the Pythagorean comma</a>.</p>
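<p>To put a number on that discrepancy, the following Python sketch compares twelve stacked fifths with seven true octaves. Expressing the result in cents (1200 per octave) is my addition here; the unit is standard, but it is not used elsewhere in this post.</p>

```python
from fractions import Fraction
from math import log2

# Twelve perfect fifths should land on the same note name as seven octaves.
twelve_fifths = Fraction(3, 2) ** 12
seven_octaves = Fraction(2) ** 7

# The ratio between the two is the Pythagorean comma.
comma = twelve_fifths / seven_octaves

# Convert the ratio to cents: 1200 cents per octave, on a log scale.
comma_cents = 1200 * log2(float(comma))

print(comma)         # exact fraction
print(float(comma))  # decimal value, slightly above 1
print(comma_cents)   # size of the comma in cents
```

<p>The ratio comes out slightly above 1, which is why the stacked-fifths &#8220;octave&#8221; sounds a little sharp against a true 2:1 octave.</p>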
<h4>Just Tuning</h4>
<p>Given that we had to cheat a bit to create a proper octave using the Pythagorean tuning system based on multiples of 3/2, it makes sense to ask why we shouldn&#8217;t just allow ourselves to use multipliers other than 3/2. Looking at the Pythagorean tuning table, we see some pretty ugly fractions like 729/512. What if we forced these fractions to be simpler by employing ratios like 4/3 and 5/4 to build up the whole system?</p>
<p>The result of allowing ourselves several fractions beyond just those derived from 3/2 is called the <a href="http://en.wikipedia.org/wiki/Just_intonation">just tuning system</a>. Here we assume that perfect fifths occur at a frequency ratio of 3/2 and that perfect fourths occur at a frequency ratio of 4/3. Continuing on with this process, we eventually end up with the following tuning table:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio</th>
</tr>
<tr>
<td>A</td>
<td>440</td>
<td>1</td>
</tr>
<tr>
<td>A#</td>
<td>469.3333</td>
<td>16/15</td>
</tr>
<tr>
<td>B</td>
<td>495</td>
<td>9/8</td>
</tr>
<tr>
<td>C</td>
<td>528</td>
<td>6/5</td>
</tr>
<tr>
<td>C#</td>
<td>550</td>
<td>5/4</td>
</tr>
<tr>
<td>D</td>
<td>586.6667</td>
<td>4/3</td>
</tr>
<tr>
<td>D#</td>
<td>625.7778</td>
<td>64/45</td>
</tr>
<tr>
<td>E</td>
<td>660</td>
<td>3/2</td>
</tr>
<tr>
<td>F</td>
<td>704</td>
<td>8/5</td>
</tr>
<tr>
<td>F#</td>
<td>733.3333</td>
<td>5/3</td>
</tr>
<tr>
<td>G</td>
<td>782.2222</td>
<td>16/9</td>
</tr>
<tr>
<td>G#</td>
<td>825</td>
<td>15/8</td>
</tr>
<tr>
<td>A</td>
<td>880</td>
<td>2</td>
</tr>
</table>
<p>This is the tuning that early Classical music was written in. Looking at the table you can immediately appreciate the theoretical assertion that the relative dissonance of an interval is determined by the simplicity of the ratio of frequencies between the two notes: perfect fifths are 3/2 and major thirds are 5/4, while minor seconds are 16/15 and major sevenths are 15/8. This is one of the things I most enjoy about the theory of harmony: there&#8217;s a match between the aesthetics of fractions and the aesthetics of sounds that, for me, helps to justify my sense that certain fractions are more beautiful than others.</p>
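<p>One way to make the &#8220;simplicity&#8221; of a ratio concrete is to rank the just ratios from the table by some measure of how small their numbers are. The Python sketch below uses numerator times denominator as a crude proxy; that particular measure is an assumption chosen for illustration, not something the theory of harmony prescribes.</p>

```python
from fractions import Fraction

BASE_HZ = 440

# Ratios copied from the just tuning table (relative to A at 440 Hz).
JUST_RATIOS = {
    "A#": Fraction(16, 15),
    "B": Fraction(9, 8),
    "C": Fraction(6, 5),
    "C#": Fraction(5, 4),
    "D": Fraction(4, 3),
    "D#": Fraction(64, 45),
    "E": Fraction(3, 2),
    "F": Fraction(8, 5),
    "F#": Fraction(5, 3),
    "G": Fraction(16, 9),
    "G#": Fraction(15, 8),
}


def complexity(ratio):
    """Hypothetical simplicity proxy: smaller products mean simpler ratios."""
    return ratio.numerator * ratio.denominator


# Notes ordered from the simplest ratio with A to the most complex.
ranked = sorted(JUST_RATIOS, key=lambda note: complexity(JUST_RATIOS[note]))

for note in ranked:
    ratio = JUST_RATIOS[note]
    print(f"{note:2}  {str(ratio):6}  {float(BASE_HZ * ratio):9.4f} Hz")
```

<p>Under this proxy the perfect fifth and perfect fourth come first, while the minor second and the tritone land at the bottom of the list, which is at least in the spirit of the consonance ordering described above.</p>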
<h4>12-Tet / Equal Temperament</h4>
<p>Now, if you know the history of Bach&#8217;s Well-Tempered Clavier, you know that there is a problem with the just tuning system: it sounds great in the key you used as the base (here A), but it sounds a bit out of tune in other keys. The modern <a href="http://en.wikipedia.org/wiki/Equal_temperament">12-tet system</a> is the most recent approach to solving this problem: you assume the gap between two notes a semitone apart (e.g. A to A# or A# to B) is always the exact same frequency ratio. Since you&#8217;ll repeat this multiplication 12 times before reaching an octave, you can conclude that two notes that are a semitone apart must be separated by a ratio equal to the 12th root of 2. Building a tuning system using that ratio alone gives us our modern system of tuning, which is shown in the table below using the decimal expansion of the ratios instead of their representation as powers of the 12th root of 2:</p>
<table>
<tr>
<th>Note</th>
<th>Frequency</th>
<th>Ratio</th>
</tr>
<tr>
<td>A</td>
<td>440</td>
<td>1.000000</td>
</tr>
<tr>
<td>A#</td>
<td>466.1638</td>
<td>1.059463</td>
</tr>
<tr>
<td>B</td>
<td>493.8833</td>
<td>1.122462</td>
</tr>
<tr>
<td>C</td>
<td>523.2511</td>
<td>1.189207</td>
</tr>
<tr>
<td>C#</td>
<td>554.3653</td>
<td>1.259921</td>
</tr>
<tr>
<td>D</td>
<td>587.3295</td>
<td>1.334840</td>
</tr>
<tr>
<td>D#</td>
<td>622.2540</td>
<td>1.414214</td>
</tr>
<tr>
<td>E</td>
<td>659.2551</td>
<td>1.498307</td>
</tr>
<tr>
<td>F</td>
<td>698.4565</td>
<td>1.587401</td>
</tr>
<tr>
<td>F#</td>
<td>739.9888</td>
<td>1.681793</td>
</tr>
<tr>
<td>G</td>
<td>783.9909</td>
<td>1.781797</td>
</tr>
<tr>
<td>G#</td>
<td>830.6094</td>
<td>1.887749</td>
</tr>
<tr>
<td>A</td>
<td>880</td>
<td>2.000000</td>
</tr>
</table>
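<p>The equal-tempered table can be rebuilt in a few lines, and the same sketch can illustrate why just tuning struggles outside its home key. The Python below is a sketch under my own assumptions: it uses the just ratios for B and F# from the earlier table and measures intervals in cents, a unit this post does not otherwise use.</p>

```python
from fractions import Fraction
from math import log2

BASE_HZ = 440
NOTE_NAMES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]
SEMITONE = 2 ** (1 / 12)  # the 12th root of 2

# Every note is the base frequency times a power of the semitone ratio.
equal_tempered = {
    note: BASE_HZ * SEMITONE ** k for k, note in enumerate(NOTE_NAMES)
}


def cents(ratio):
    """Size of a frequency ratio in cents (1200 per octave)."""
    return 1200 * log2(float(ratio))


# In just tuning relative to A, the fifth from B up to F# is not 3/2.
just_b, just_f_sharp = Fraction(9, 8), Fraction(5, 3)
just_fifth_from_b = just_f_sharp / just_b

# In 12-tet every fifth is the same seven semitones, whatever the start.
tet_fifth = SEMITONE ** 7

for note, freq in equal_tempered.items():
    print(f"{note:2}  {freq:9.4f}")
print("pure fifth 3/2:      ", cents(Fraction(3, 2)))
print("just fifth B to F#:  ", just_fifth_from_b, cents(just_fifth_from_b))
print("12-tet fifth:        ", cents(tet_fifth))
```

<p>The fifth built on B in the just table comes out noticeably narrower than 3/2, whereas the equal-tempered fifth is a small, constant distance from 3/2 in every key. That trade is the whole point of the 12-tet system.</p>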
<h3>Listening to the Results</h3>
<p>We&#8217;ve just described three ways to define the notes used in Western music. But how different do they sound? To answer that, I decided to produce a series of simple sine wave audio samples that were tuned using each of the three tuning systems. To produce those audio samples, I used the &#8216;tuneR&#8217; package, which I&#8217;ll describe now. Before you read on, you should install it from CRAN using the standard <code>install.packages('tuneR')</code> invocation.</p>
<h3>A tuneR Tutorial</h3>
<p>The <a href="http://cran.r-project.org/web/packages/tuneR/index.html">tuneR</a> package is an extremely convenient tool for generating audio files from R based on a numeric description of the audio stream. For the purposes of this discussion of tuning systems, we simply need to produce basic sine waves. Thankfully, that&#8217;s very easy to do with tuneR. Here&#8217;s an example:</p>

<div class="wp_codebox"><table><tr id="p431125"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p4311code25"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'440.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Here we&#8217;ve loaded the tuneR package, created a 1s snippet of sine wave audio at 16 bits resolution using the <code>sine</code> function, and then written out the audio to a WAV file using <code>writeWave</code>. If you look at your current directory and listen to this file, you&#8217;ll hear a sine wave at 440 Hz.</p>
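<p>For readers without R at hand, roughly the same thing can be done with the Python standard library. This is a sketch, not a port of tuneR: the helper names <code>sine_samples</code> and <code>write_wav</code> are made up for illustration, and the 44100 Hz sample rate and the temporary output directory are assumptions.</p>

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 44100  # samples per second (assumed)


def sine_samples(freq_hz, duration_s=1.0, sample_rate=SAMPLE_RATE):
    """Return a sine wave as a list of floats in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write floats in [-1, 1] as a mono, 16-bit PCM WAV file."""
    frames = b"".join(
        struct.pack("<h", int(round(s * 32767))) for s in samples
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16 bits
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


samples = sine_samples(440)
out_path = os.path.join(tempfile.gettempdir(), "440.wav")
write_wav(out_path, samples)
print("wrote", out_path)
```

<p>Changing the <code>duration_s</code> argument plays the same role here as changing the duration of the sound in the R example.</p>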
<p>If you want to explore the use of <code>sine</code>, you can easily play with the duration of the sound by changing the <code>duration</code> parameter. If you want to, you can also change the sample rate and the bit depth, but I don&#8217;t see any reason to do that while exploring ideas about tuning.</p>
<p>More important is knowing that you can superimpose two sine waves using the <code>`+`</code> operator and that you can concatenate them using the <code>bind</code> function. To show off producing octaves, for example, you might use the following code to hear an A at 440 Hz, then an A an octave above it, and finally the harmony they produce together:</p>

<div class="wp_codebox"><table><tr id="p431126"><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code" id="p4311code26"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> bind<span style="color: #009900;">&#40;</span>sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'octaves.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Unfortunately, this sample code produces an error because of the naive addition we&#8217;ve implemented using the <code>`+`</code> operator. Adding two sine waves directly together overflows the 16-bit range we&#8217;re using. To safely perform addition of two sine waves, we need to normalize the results of our summation using the <code>normalize</code> function. This gives us just one more line of code:</p>

<div class="wp_codebox"><table><tr id="p431127"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
</pre></td><td class="code" id="p4311code27"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> bind<span style="color: #009900;">&#40;</span>sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
              sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">440</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> sine<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">880</span><span style="color: #339933;">,</span> bit <span style="color: #339933;">=</span> <span style="color: #0000dd;">16</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> normalize<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> unit <span style="color: #339933;">=</span> <span style="color: #ff0000;">'16'</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'octaves.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>For reasons that are not clear to me, you have to specify the bit depth to <code>normalize</code> using the <code>unit</code> parameter rather than the <code>bit</code> parameter.</p>
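<p>The reason a sum of two sine waves needs rescaling can be seen without any audio library. In the Python sketch below, <code>mix</code> and <code>normalize</code> are hypothetical helpers written for this illustration; in particular, this <code>normalize</code> simply rescales by the loudest sample and is not claimed to match what tuneR does internally.</p>

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed)
INT16_MAX = 32767    # largest value a signed 16-bit sample can hold


def sine_samples(freq_hz, duration_s=1.0, sample_rate=SAMPLE_RATE):
    """Return a sine wave as a list of floats in [-1, 1]."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


def mix(a, b):
    """Superimpose two signals by adding them sample by sample."""
    return [x + y for x, y in zip(a, b)]


def normalize(samples, peak=1.0):
    """Rescale so the loudest sample has absolute value `peak`."""
    loudest = max(abs(s) for s in samples)
    if loudest == 0:
        return list(samples)
    return [s * peak / loudest for s in samples]


chord = mix(sine_samples(440), sine_samples(880))
raw_peak = max(abs(s) for s in chord)  # exceeds 1, so it would clip

safe = normalize(chord)
as_int16 = [int(round(s * INT16_MAX)) for s in safe]

print("peak before normalizing:", raw_peak)
print("peak after normalizing: ", max(abs(s) for s in safe))
```

<p>Each sine wave alone stays within [-1, 1], but their sum does not, so scaling the sum straight to 16-bit integers would run past the largest representable value.</p>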
<h3>Demoing Tuning Systems</h3>
<p>Our little octave demo is cute, but we really want to know what more interesting harmonies like major thirds and minor seconds sound like in the various tuning systems we described. To do that, I first wrote a function called <code>interval</code> that spits out the multiplier you need to use to produce a given interval for any of the three tuning systems. That function is in a <a href="https://github.com/johnmyleswhite/computer_music">GitHub repository</a> I&#8217;ve set up with code for making these demos. If you download that repository, you could load my <code>interval</code> function using a simple call to <code>source</code> like the one seen below. And using this <code>interval</code> function, we can generate demos of various intervals as follows:</p>

<div class="wp_codebox"><table><tr id="p431128"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code" id="p4311code28"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'tuneR'</span><span style="color: #009900;">&#41;</span>
source<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'interval.R'</span><span style="color: #009900;">&#41;</span>
&nbsp;
base <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">440</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> sine<span style="color: #009900;">&#40;</span>base<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> sine<span style="color: #009900;">&#40;</span>interval<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">'minor-second'</span><span style="color: #339933;">,</span>
                                    tuning <span style="color: #339933;">=</span> <span style="color: #ff0000;">'pythagorean'</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> base<span style="color: #009900;">&#41;</span>
&nbsp;
sound <span style="color: #339933;">&lt;-</span> normalize<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> unit <span style="color: #339933;">=</span> <span style="color: #ff0000;">'16'</span><span style="color: #009900;">&#41;</span>
&nbsp;
writeWave<span style="color: #009900;">&#40;</span>sound<span style="color: #339933;">,</span> <span style="color: #ff0000;">'minor_second_pythagorean.wav'</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>On GitHub there&#8217;s a file called <code>test_intervals.R</code> that will go through and generate all of the intervals in all three tuning systems. If you run that file, you&#8217;ll generate a lot of audio files you can listen to as demos of the three tuning systems we&#8217;ve described. For me, these tuning systems all produce intervals that sound surprisingly similar, though at high volumes I find it moderately easy to hear slight differences between the tuning systems. That said, I very much doubt I would pick up on them in a normal musical context.</p>
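<p>If you&#8217;d rather not download the repository, the idea behind a function like <code>interval</code> can be sketched as a table lookup. The Python below is a hypothetical analogue, not the R implementation from GitHub: only the names <code>'minor-second'</code> and <code>'pythagorean'</code> appear in this post, and every other interval or tuning name is an assumption made for illustration.</p>

```python
from fractions import Fraction as F

# Hypothetical interval names, indexed by size in semitones.
INTERVAL_SEMITONES = {
    "unison": 0, "minor-second": 1, "major-second": 2, "minor-third": 3,
    "major-third": 4, "perfect-fourth": 5, "tritone": 6, "perfect-fifth": 7,
    "minor-sixth": 8, "major-sixth": 9, "minor-seventh": 10,
    "major-seventh": 11, "octave": 12,
}

# Ratios copied from the Pythagorean and just tables in this post.
PYTHAGOREAN = [F(1), F(256, 243), F(9, 8), F(32, 27), F(81, 64), F(4, 3),
               F(729, 512), F(3, 2), F(128, 81), F(27, 16), F(16, 9),
               F(243, 128), F(2)]
JUST = [F(1), F(16, 15), F(9, 8), F(6, 5), F(5, 4), F(4, 3), F(64, 45),
        F(3, 2), F(8, 5), F(5, 3), F(16, 9), F(15, 8), F(2)]


def interval(name, tuning="12-tet"):
    """Return the frequency multiplier for an interval in a tuning system."""
    steps = INTERVAL_SEMITONES[name]
    if tuning == "pythagorean":
        return float(PYTHAGOREAN[steps])
    if tuning == "just":
        return float(JUST[steps])
    if tuning == "12-tet":
        return 2 ** (steps / 12)
    raise ValueError(f"unknown tuning: {tuning}")


base = 440
for tuning in ("pythagorean", "just", "12-tet"):
    upper = interval("minor-second", tuning) * base
    print(f"{tuning:12} minor second above {base} Hz: {upper:.4f} Hz")
```

<p>Looping over every interval name and every tuning in the same way would give the full set of frequency pairs to listen to.</p>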
<p>That&#8217;s the end of my little introduction to tuning systems and the use of the tuneR package to explore them. If you&#8217;re interested in thinking computationally about music, I highly recommend playing around with tuneR until you feel like you can produce interesting results. I&#8217;m already working on trying to build up some interesting timbres to work with.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/10/25/the-psychology-of-music-and-the-tuner-package/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Twitter Math Puzzle and Solution</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/07/07/twitter-math-puzzle-and-solution/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/07/07/twitter-math-puzzle-and-solution/#comments</comments>
		<pubDate>Thu, 07 Jul 2011 13:48:07 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Mathematics]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4302</guid>
		<description><![CDATA[Yesterday I posted a very simple math puzzle to Twitter that I found in Jonathan Baron&#8217;s book, Thinking and Deciding. The puzzle is the following: Show that every number of the form ABC,ABC is divisible by 13. The puzzle comes up in Baron&#8217;s book as an example of an &#8220;insight problem&#8221; in which one goes [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday I posted a very simple math puzzle to Twitter that I found in Jonathan Baron&#8217;s book, <a href="http://amzn.to/npM5Uk">Thinking and Deciding</a>. The puzzle is the following:</p>
<blockquote><p>
Show that every number of the form ABC,ABC is divisible by 13.
</p></blockquote>
<p>The puzzle comes up in Baron&#8217;s book as an example of an &#8220;insight problem&#8221; in which one goes from not knowing the answer at all to knowing the complete answer in a sudden moment of insight.</p>
<p>Several people replied to my tweet with solutions: I especially like <a href="https://twitter.com/#!/willtownes/status/88735472028876800">Will Townes&#8217;s</a> solution. In particular, if you&#8217;re familiar with <a href="http://en.wikipedia.org/wiki/Modular_arithmetic">modular arithmetic</a>, I like the logic of Will&#8217;s answer because it gives a simple generalization. First, represent ABC,ABC as ABC * 1000 + ABC * 1 rather than as ABC * 1001. Then notice that</p>
<ol>
<li>1 = 1 mod 13</li>
<li>1000 = -1 mod 13</li>
</ol>
<p>Thus ABC,ABC = ABC * -1 + ABC * 1 = 0 mod 13. This logic can be easily extended to show that (ABC,ABC,)*ABC,ABC = 0 mod 13 no matter how many times you repeat the ABC,ABC pattern.</p>
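<p>If you&#8217;d like to convince yourself by brute force, the claim is small enough to check exhaustively. The Python sketch below tests every three-digit block (including ones with leading zeros, such as 007) and also the longer patterns in which ABC,ABC is repeated; the helper <code>repeat_block</code> is made up for this check.</p>

```python
# The direct route: ABC,ABC = ABC * 1001, and 1001 = 7 * 11 * 13.
factors_ok = 7 * 11 * 13 == 1001

# The modular route: 1000 is congruent to -1 mod 13.
thousand_is_minus_one = 1000 % 13 == -1 % 13


def repeat_block(abc, times):
    """Build the integer whose digits are the 3-digit block repeated."""
    return int(f"{abc:03d}" * times)


# Every number of the form ABC,ABC is divisible by 13.
all_six_digit = all(repeat_block(abc, 2) % 13 == 0 for abc in range(1000))

# The same holds when the ABC,ABC pattern is repeated several times.
all_longer = all(
    repeat_block(abc, 2 * m) % 13 == 0
    for abc in range(1000)
    for m in range(1, 5)
)

print(factors_ok, thousand_is_minus_one, all_six_digit, all_longer)
```

<p>The longer patterns work because successive powers of 1000 alternate between 1 and -1 mod 13, so an even number of ABC blocks always cancels in pairs.</p>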
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/07/07/twitter-math-puzzle-and-solution/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Visualizing Periodic Data</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/06/28/visualizing-periodic-data/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/06/28/visualizing-periodic-data/#comments</comments>
		<pubDate>Tue, 28 Jun 2011 18:27:03 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4298</guid>
		<description><![CDATA[Yesterday the Princeton machine learning reading group went through a paper by Tukey on &#8220;Some graphic and semigraphic displays&#8221;. One issue we talked about at length was Tukey&#8217;s idiosyncratic approach to visualizing periodic data in a circular format to emphasize the connections between the &#8220;start&#8221; and the &#8220;end&#8221; of the data set. Allison Chaney pointed [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday the Princeton machine learning reading group went through a paper by Tukey on <a href="http://www.edwardtufte.com/tufte/tukey">&#8220;Some graphic and semigraphic displays&#8221;</a>. One issue we talked about at length was Tukey&#8217;s idiosyncratic approach to visualizing periodic data in a circular format to emphasize the connections between the &#8220;start&#8221; and the &#8220;end&#8221; of the data set.</p>
<p>Allison Chaney pointed out that many fields (for instance, environmental engineering) might want to consider using these circular displays to make periodic trends clear to the viewer. That inspired me to try plotting periodic weather data using both a standard x-y plane display and a polar coordinates display. The results are shown below in two videos that I&#8217;ve uploaded to Vimeo:</p>
<div style="text-align:center;"><iframe src="http://player.vimeo.com/video/25716170?title=0&amp;byline=0&amp;portrait=0" width="400" height="300" frameborder="0"></iframe>
<p><a href="http://vimeo.com/25716170">Visualizing Periodic Data: NYC Weather from 1995 to 2008</a> from <a href="http://vimeo.com/user698502">John Myles White</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
</div>
<div style="text-align:center;"><iframe src="http://player.vimeo.com/video/25717081?title=0&amp;byline=0&amp;portrait=0" width="400" height="300" frameborder="0"></iframe>
<p><a href="http://vimeo.com/25717081">Visualizing Periodic Data: NYC Weather from 1995 to 2008 (Take 2)</a> from <a href="http://vimeo.com/user698502">John Myles White</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
</div>
<p>There&#8217;s a clear tradeoff that&#8217;s being made when choosing between these two approaches: the polar coordinates plot, as promised, correctly connects the two &#8220;ends&#8221; of the data set. But it also makes it much harder to see the height of the graph at each point in time, so that the sinusoidal shape that can easily be seen in the x-y plane display is basically hidden in the polar coordinates display.</p>
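<p>The tradeoff can be made concrete with a small sketch. The data here is a synthetic sinusoid standing in for one year of daily temperatures, not the NYC weather series from the videos, and the names <code>theta</code>, <code>polar</code>, <code>gap.xy</code> and <code>gap.polar</code> are made up for illustration. The sketch maps day of year to an angle and temperature to a radius, which is roughly what a polar coordinates display does:</p>

```r
# Synthetic stand-in for one year of daily temperatures (not the NYC data).
days <- 1:365
temperature <- 55 + 25 * sin(2 * pi * (days - 105) / 365)

# Map each day to an angle so that a full year spans one full turn.
theta <- 2 * pi * (days - 1) / 365

# Polar display: temperature is the radius, day of year is the angle.
polar <- data.frame(x = temperature * cos(theta),
                    y = temperature * sin(theta))

# In the x-y display, the first and last days sit 364 units apart on the time axis.
gap.xy <- days[365] - days[1]

# In the polar display, their angles differ by only 1/365 of a full turn.
gap.polar <- (2 * pi - theta[365]) / (2 * pi)
```

<p>The two &#8220;ends&#8221; of the year become neighbors in the polar version, but the temperature is now encoded as distance from the origin, which is harder to read than height above an axis.</p>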
<p>Since making these videos, it has occurred to me that another potential visualization technique would be to project the data onto a cylinder, rather than a plane, and then progressively rotate the cylinder to reveal the time trend. This would allow heights to be seen properly, while emphasizing the periodicity. The problem with this cylindrical projection is that the entire data set is never visible at one time; it can only be seen by completing a full rotation of the cylinder.</p>
<p>In his paper, Tukey describes one other approach: draw the periodic data twice so that the period is clearly visible. It wasn&#8217;t clear to me how to do this without some numeric hacks in ggplot2, so I&#8217;ll leave it to the reader to search for Tukey&#8217;s example in <a href="http://www.edwardtufte.com/tufte/tukey">the original paper</a>.</p>
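<p>For what it is worth, here is a rough base R sketch of the &#8220;draw it twice&#8221; idea. It is only an approximation under stated assumptions, not Tukey&#8217;s own construction: the data is a synthetic sinusoid rather than the NYC series, and the names <code>one.year</code>, <code>second.copy</code> and <code>drawn.twice</code> are invented for this example. The trick is simply to append a second copy of the period, shifted one period to the right:</p>

```r
# Synthetic stand-in for one period of data (not the NYC weather series).
period <- 365
one.year <- data.frame(day = 1:period,
                       temperature = 55 + 25 * sin(2 * pi * (1:period - 105) / period))

# Append a second copy of the same period, shifted one period to the right.
second.copy <- transform(one.year, day = day + period)
drawn.twice <- rbind(one.year, second.copy)

# The seam between day 365 and day 366 now sits in the middle of the plot.
if (interactive()) {
  plot(drawn.twice$day, drawn.twice$temperature, type = "l",
       xlab = "Day", ylab = "Temperature")
  abline(v = period + 0.5, lty = 2)
}
```

<p>Because the join between the &#8220;end&#8221; and the &#8220;start&#8221; of the data falls in the middle of the display, the continuity across the seam is visible without giving up the height information.</p>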
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/06/28/visualizing-periodic-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ProjectTemplate News</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/06/25/projecttemplate-news/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/06/25/projecttemplate-news/#comments</comments>
		<pubDate>Sat, 25 Jun 2011 16:43:14 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4294</guid>
		<description><![CDATA[The news below was recently reported on the ProjectTemplate mailing list. For completeness, I&#8217;m also reporting it here. The first piece of ProjectTemplate news is that I won&#8217;t be the exclusive maintainer for ProjectTemplate anymore. Allen Goodman, who works at BankSimple, is now my co-maintainer and he has full commit privileges. In the next few [...]]]></description>
			<content:encoded><![CDATA[<p>The news below was recently reported on the ProjectTemplate mailing list. For completeness, I&#8217;m also reporting it here.</p>
<ul>
<li>The first piece of ProjectTemplate news is that I won&#8217;t be the exclusive maintainer for ProjectTemplate anymore. Allen Goodman, who works at BankSimple, is now my co-maintainer and he has full commit privileges. In the next few months, the emerging group with commit privileges is likely to grow beyond the two of us, but hopefully just having one more person in charge of ProjectTemplate&#8217;s development will help to keep things moving forward.</li>
<li>There&#8217;s a <a href="https://github.com/johnmyleswhite/ProjectTemplate">new draft of ProjectTemplate available on GitHub</a>. v0.3-1 fixes problems with the YAML configuration system not working on Windows 64 machines by switching over to the DCF format that R naturally supports. Editing your configuration scripts should be trivial, but be prepared for ProjectTemplate to break on your existing v0.2-1 projects until you&#8217;ve updated them to use DCF instead of YAML.</li>
<li>In addition to switching the configuration system over to DCF, ProjectTemplate v0.3-1 now uses namespaces and separate functions to implement all of the automatic data loading functions that were previously nested inside of <code>load.project()</code>. Hopefully this will make it easier for end users to override ProjectTemplate&#8217;s defaults, while allowing ProjectTemplate releases to automatically roll out bug fixes to less advanced users. On that note, the list of supported file formats for automatic data loading is growing and new patches on that front are always welcome.</li>
<li>A minimal project format: Some people have asked for the option to create projects without some of the clutter that the standard project format creates, such as the diagnostics and profiling directories. There&#8217;s now a minimal project format that you can use by invoking <code>create.project()</code> with the option <code>minimal = TRUE</code>.</li>
<li>Starting in two weeks, the version of ProjectTemplate available on CRAN will keep pace with the version on GitHub. If you&#8217;re still using v0.1-3, please consider upgrading or forking.</li>
<li>There is now an official ProjectTemplate website at <a href="http://projecttemplate.net/">http://projecttemplate.net/</a> that will hopefully be the start of a new era of better documentation for ProjectTemplate. While the material on the site is still in noticeably draft form, I expect the documentation to improve considerably in the near future. If anyone out there is a graphic designer and would like to make the new site look better, please let me know by e-mailing me at <a href="mailto:jmw@johnmyleswhite.com">jmw@johnmyleswhite.com</a>.</li>
</ul>
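<p>For anyone who hasn&#8217;t met DCF before, it is the plain &#8220;field: value&#8221; format that base R reads with <code>read.dcf()</code>. The sketch below is only meant to show the shape of such a file. The field names and the temporary file are placeholders picked for illustration, so check the ProjectTemplate documentation for the settings it actually expects:</p>

```r
# Write a small DCF file: one "field: value" pair per line.
# These field names are placeholders chosen for illustration.
config.path <- tempfile(fileext = ".dcf")
writeLines(c("data_loading: on",
             "munging: on",
             "logging: off"),
           config.path)

# read.dcf() returns a character matrix with one column per field.
config <- read.dcf(config.path)

# Convert the single record into a named list for easier lookup.
settings <- as.list(config[1, ])
```

<p>After this runs, a value can be looked up by name, for example <code>settings$logging</code>.</p>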
<p>For now that&#8217;s all, but there&#8217;s more ProjectTemplate news coming soon. Stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/06/25/projecttemplate-news/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Speeding Up MLE Code in R</title>
		<link>http://www.johnmyleswhite.com/notebook/2011/06/18/speeding-up-mle-code-in-r/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2011/06/18/speeding-up-mle-code-in-r/#comments</comments>
		<pubDate>Sun, 19 Jun 2011 00:02:29 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4264</guid>
		<description><![CDATA[Recently, I&#8217;ve been fitting some models from the behavioral economics literature to choice data. Most of these models amount to non-linear variants of logistic regression in which I want to infer the parameters of a utility function. Because several of these models aren&#8217;t widely used, I&#8217;ve had to write my own maximum likelihood code to [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I&#8217;ve been fitting some models from the behavioral economics literature to choice data. Most of these models amount to non-linear variants of logistic regression in which I want to infer the parameters of a utility function. Because several of these models aren&#8217;t widely used, I&#8217;ve had to write my own maximum likelihood code to estimate the parameters of these models.</p>
<p>In the process, I&#8217;ve started to learn something about how to write code that runs quickly in R. In this post, I&#8217;ll try to share some of that knowledge by describing three ways of performing maximum likelihood estimation in R whose runtimes differ by two orders of magnitude. The differences seem to depend upon two factors: (1) how I access the entries of a data frame and (2) whether I use loops or vectorized operations to perform basic arithmetic.</p>
<p>To simplify things, I&#8217;ll present a model that should be familiar to people with a background in economics: the exponentially discounted utility model. To implement it in R, we define the discounted value of <code>x</code> dollars at time <code>t</code> as:</p>

<div class="wp_codebox"><table><tr id="p426436"><td class="line_numbers"><pre>1
2
3
4
</pre></td><td class="code" id="p4264code36"><pre class="c" style="font-family:monospace;">discounted.<span style="color: #202020;">value</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>x<span style="color: #339933;">,</span> t<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">*</span> delta <span style="color: #339933;">^</span> t<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In addition to the discounted utility model, we assume that choices originate from a stochastic choice model with logistic noise. To invert this noise during inference, we&#8217;ll use the inverse logit transform:</p>

<div class="wp_codebox"><table><tr id="p426437"><td class="line_numbers"><pre>1
2
3
4
</pre></td><td class="code" id="p4264code37"><pre class="c" style="font-family:monospace;">invlogit <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>z<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span> <span style="color: #339933;">/</span> <span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span> <span style="color: #339933;">+</span> exp<span style="color: #009900;">&#40;</span><span style="color: #339933;">-</span>z<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>
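<p>As a quick sanity check, the hand-written transform can be compared against <code>plogis()</code>, the logistic distribution function that ships in R&#8217;s stats package. The test values in <code>z</code> are arbitrary:</p>

```r
invlogit <- function(z)
{
  return(1 / (1 + exp(-z)))
}

# Arbitrary test points on both sides of zero.
z <- c(-5, -1, 0, 1, 5)

# plogis() is the logistic distribution function from the stats package.
same <- all.equal(invlogit(z), plogis(z))
```

<p>If <code>same</code> is <code>TRUE</code>, the two agree up to floating point tolerance, so <code>plogis()</code> could be swapped in if you prefer a built-in.</p>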

<p>To test my inference routine, I need to generate &#8220;stochastic&#8221; data of the sort you would expect to see from an exponentially discounting agent that&#8217;s indifferent between having $1 at time t = 0 and $3 at time t = 1. I&#8217;ll refer to the first good as (X1, T1) and the second good as (X2, T2). If the agent chooses (X2, T2), I&#8217;ll write that as <code>C == 1</code>; if they choose (X1, T1), I&#8217;ll write that as <code>C == 0</code>. With those conventions, the sample data is generated as:</p>

<div class="wp_codebox"><table><tr id="p426438"><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code" id="p4264code38"><pre class="c" style="font-family:monospace;">n <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">100</span>
&nbsp;
choices <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>X1 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      T1 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      X2 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">3</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      T2 <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> each <span style="color: #339933;">=</span> n<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                      C <span style="color: #339933;">=</span> rep<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> times <span style="color: #339933;">=</span> n <span style="color: #339933;">/</span> <span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>
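<p>It may help to spell out what indifference implies here. Under exponential discounting, being indifferent between $1 at t = 0 and $3 at t = 1 means 1 = 3 * delta, so delta = 1/3. The sketch below redefines the two helper functions in compact form so that it runs on its own, and the three scale values passed to <code>invlogit</code> are arbitrary:</p>

```r
discounted.value <- function(x, t, delta) x * delta ^ t
invlogit <- function(z) 1 / (1 + exp(-z))

# Indifference between $1 now and $3 one period later means 1 = 3 * delta.
delta <- 1 / 3
u1 <- discounted.value(1, 0, delta)
u2 <- discounted.value(3, 1, delta)

# With equal utilities the choice probability is 0.5 for any scaling
# of the utility difference (the three scales here are arbitrary).
p <- invlogit(c(0.1, 1, 10) * (u2 - u1))
```

<p>This is why the sample data splits the choices evenly between <code>C == 0</code> and <code>C == 1</code>: an indifferent agent picks each option half of the time.</p>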

<p>To fit the exponential model to this data set, we&#8217;ll use the <code>optim</code> function to minimize the negative log likelihood of the data by setting two parameters: <code>a</code>, which scales the utility difference inside the logistic choice rule (larger values of <code>a</code> mean less noisy choices); and <code>delta</code>, the discount factor in the discounted utility model. The three implementations of this model that I&#8217;ll show only differ in the definition of the log likelihood function, so the final call to <code>optim</code> to perform maximum likelihood estimation is the same across all examples:</p>

<div class="wp_codebox"><table><tr id="p426439"><td class="line_numbers"><pre>1
2
3
4
5
6
</pre></td><td class="code" id="p4264code39"><pre class="c" style="font-family:monospace;">logit.<span style="color: #202020;">estimator</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span> 
  wrapper <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>x<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span><span style="color: #339933;">-</span>log.<span style="color: #202020;">likelihood</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> x<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> x<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span>
  optimization.<span style="color: #202020;">results</span> <span style="color: #339933;">&lt;-</span> optim<span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> wrapper<span style="color: #339933;">,</span> method <span style="color: #339933;">=</span> <span style="color: #ff0000;">'L-BFGS-B'</span><span style="color: #339933;">,</span> lower <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> upper <span style="color: #339933;">=</span> c<span style="color: #009900;">&#40;</span>Inf<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>optimization.<span style="color: #202020;">results</span>$par<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>Here, I had to specify bounds for the parameters, <code>a</code> and <code>delta</code>, because it&#8217;s assumed that <code>a</code> must be positive and that <code>delta</code> must lie in the interval [0, 1]. To deal with these bounds, one has to use the L-BFGS-B method in <code>optim</code>.</p>
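<p>To see what the bounds do in isolation, here is a toy sketch that uses the same <code>lower</code> and <code>upper</code> arguments on a deliberately simple objective. The function <code>objective</code> is made up for illustration and has nothing to do with the likelihood. Its unconstrained minimum sits at (2, 2), outside the box:</p>

```r
# Toy objective (not the likelihood above): squared distance from the point (2, 2).
objective <- function(x) (x[1] - 2) ^ 2 + (x[2] - 2) ^ 2

# Same bounds as in the text: first parameter in [0, Inf), second in [0, 1].
fit <- optim(c(1, 1), objective, method = "L-BFGS-B",
             lower = c(0, 0), upper = c(Inf, 1))

# The first coordinate is free to reach 2; the upper bound keeps the
# second coordinate from going above 1.
fit$par
```

<p>The optimizer moves the first coordinate to the unconstrained optimum and leaves the second pinned at its upper bound, which is the behavior we want for <code>delta</code>.</p>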
<p>The first implementation I&#8217;ll show is the one I find most natural to write, even though it turns out to be the least efficient by far:</p>

<div class="wp_codebox"><table><tr id="p426440"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
</pre></td><td class="code" id="p4264code40"><pre class="c" style="font-family:monospace;">log.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  ll <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">0</span>
&nbsp;
  <span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>i in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>nrow<span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #009900;">&#123;</span>
    u2 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'X2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'T2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
    u1 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'X1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'T1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
&nbsp;
    p <span style="color: #339933;">&lt;-</span> invlogit<span style="color: #009900;">&#40;</span>a <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>u2 <span style="color: #339933;">-</span> u1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span>choices<span style="color: #009900;">&#91;</span>i<span style="color: #339933;">,</span> <span style="color: #ff0000;">'C'</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">==</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#123;</span>
      ll <span style="color: #339933;">&lt;-</span> ll <span style="color: #339933;">+</span> log<span style="color: #009900;">&#40;</span>p<span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span>
    <span style="color: #b1b100;">else</span>
    <span style="color: #009900;">&#123;</span>
      ll <span style="color: #339933;">&lt;-</span> ll <span style="color: #339933;">+</span> log<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span> <span style="color: #339933;">-</span> p<span style="color: #009900;">&#41;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>ll<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In the second implementation, I define a row-level likelihood function and apply it to each row, so that the logarithmic transform and the summation are vectorized.</p>

<div class="wp_codebox"><table><tr id="p426441"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code" id="p4264code41"><pre class="c" style="font-family:monospace;">rowwise.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>row<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  u2 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'X2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'T2'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  u1 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'X1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'T1'</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  p <span style="color: #339933;">&lt;-</span> invlogit<span style="color: #009900;">&#40;</span>a <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>u2 <span style="color: #339933;">-</span> u1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>ifelse<span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#91;</span><span style="color: #ff0000;">'C'</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">==</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> p<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span> <span style="color: #339933;">-</span> p<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
log.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  likelihoods <span style="color: #339933;">&lt;-</span> apply<span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">function</span> <span style="color: #009900;">&#40;</span>row<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>rowwise.<span style="color: #202020;">likelihood</span><span style="color: #009900;">&#40;</span>row<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#125;</span><span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>sum<span style="color: #009900;">&#40;</span>log<span style="color: #009900;">&#40;</span>likelihoods<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>In the third implementation, I define a fully vectorized log likelihood function that avoids any explicit iteration and therefore removes most of the data frame indexing operations:</p>

<div class="wp_codebox"><table><tr id="p426442"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code" id="p4264code42"><pre class="c" style="font-family:monospace;">log.<span style="color: #202020;">likelihood</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>choices<span style="color: #339933;">,</span> a<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
  u2 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices$X2<span style="color: #339933;">,</span> choices$T2<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  u1 <span style="color: #339933;">&lt;-</span> discounted.<span style="color: #202020;">value</span><span style="color: #009900;">&#40;</span>choices$X1<span style="color: #339933;">,</span> choices$T1<span style="color: #339933;">,</span> delta<span style="color: #009900;">&#41;</span>
  p <span style="color: #339933;">&lt;-</span> invlogit<span style="color: #009900;">&#40;</span>a <span style="color: #339933;">*</span> <span style="color: #009900;">&#40;</span>u2 <span style="color: #339933;">-</span> u1<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
  likelihoods <span style="color: #339933;">&lt;-</span> ifelse<span style="color: #009900;">&#40;</span>choices$C <span style="color: #339933;">==</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> p<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span> <span style="color: #339933;">-</span> p<span style="color: #009900;">&#41;</span>
  <span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>sum<span style="color: #009900;">&#40;</span>log<span style="color: #009900;">&#40;</span>likelihoods<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>
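<p>Before comparing runtimes, it seems worth confirming that the different implementations compute the same number. The sketch below is self-contained and makes a few assumptions: it renames the loop and vectorized versions to <code>ll.loop</code> and <code>ll.vectorized</code> (names invented here so both can coexist), compresses the loop body slightly, evaluates at arbitrary parameter values, and times 50 repeated evaluations rather than a full <code>optim</code> run:</p>

```r
discounted.value <- function(x, t, delta) x * delta ^ t
invlogit <- function(z) 1 / (1 + exp(-z))

n <- 100
choices <- data.frame(X1 = rep(1, n), T1 = rep(0, n),
                      X2 = rep(3, n), T2 = rep(1, n),
                      C = rep(c(0, 1), times = n / 2))

# Loop version: one data frame lookup per cell, per row.
ll.loop <- function(choices, a, delta)
{
  ll <- 0
  for (i in 1:nrow(choices))
  {
    u2 <- discounted.value(choices[i, 'X2'], choices[i, 'T2'], delta)
    u1 <- discounted.value(choices[i, 'X1'], choices[i, 'T1'], delta)
    p <- invlogit(a * (u2 - u1))
    ll <- ll + ifelse(choices[i, 'C'] == 1, log(p), log(1 - p))
  }
  return(ll)
}

# Vectorized version: whole columns at once.
ll.vectorized <- function(choices, a, delta)
{
  u2 <- discounted.value(choices$X2, choices$T2, delta)
  u1 <- discounted.value(choices$X1, choices$T1, delta)
  p <- invlogit(a * (u2 - u1))
  return(sum(log(ifelse(choices$C == 1, p, 1 - p))))
}

# Both should agree up to floating point error at any parameter values.
agree <- all.equal(ll.loop(choices, 0.7, 0.5), ll.vectorized(choices, 0.7, 0.5))

# Rough timing over repeated evaluations; numbers will vary by machine.
time.loop <- system.time(for (k in 1:50) ll.loop(choices, 0.7, 0.5))
time.vectorized <- system.time(for (k in 1:50) ll.vectorized(choices, 0.7, 0.5))
```

<p>Printing <code>time.loop</code> and <code>time.vectorized</code> gives a rough sense of the gap on your own machine. The exact ratio will depend on hardware and R version.</p>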

<p>The code I used to call all of these implementations and compare them is up on <a href="https://github.com/johnmyleswhite/fastR">GitHub</a> for those interested. The results, which strike me as remarkable, are below:</p>
<ol>
<li>On my laptop, implementation 1 takes ~1.0 second to run.</li>
<li>On my laptop, implementation 2 takes ~0.25 seconds to run.</li>
<li>On my laptop, implementation 3 takes ~0.01 seconds to run.</li>
</ol>
<p>In short, the third implementation is 100x faster than the first implementation with only minor changes to the code I originally wrote. Hopefully this example will help inspire others who have R code they&#8217;d like to speed up, but aren&#8217;t sure where to start.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2011/06/18/speeding-up-mle-code-in-r/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>


