<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>John Myles White</title>
	<atom:link href="http://www.johnmyleswhite.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johnmyleswhite.com</link>
	<description>&#34;He who refuses to do arithmetic is doomed to talk nonsense.&#34;</description>
	<lastBuildDate>Thu, 09 May 2013 21:54:06 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>What&#8217;s Next</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/05/09/whats-next/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/05/09/whats-next/#comments</comments>
		<pubDate>Thu, 09 May 2013 21:54:06 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Autobiographical]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4901</guid>
		<description><![CDATA[The last two weeks have been full of changes for me. For those who&#8217;ve been asking about what&#8217;s next, I thought I&#8217;d write up a quick summary of all the news. (1) I successfully defended my thesis this past Monday. Completing a Ph.D. has been a massive undertaking for the past five years, and it&#8217;s [...]]]></description>
				<content:encoded><![CDATA[<p>The last two weeks have been full of changes for me. For those who&#8217;ve been asking about what&#8217;s next, I thought I&#8217;d write up a quick summary of all the news.</p>
<p>(1) I successfully defended my thesis this past Monday. Completing a Ph.D. has been a massive undertaking for the past five years, and it&#8217;s a major relief to be done. From now on I&#8217;ll be (perhaps undeservedly) making airline and restaurant reservations under the name Dr. White.</p>
<p>(2) As announced last week, I&#8217;ll be one of <a href="https://www.hackerschool.com/residents">the residents at Hacker School this summer</a>. The list of other residents is pretty amazing, and I&#8217;m really looking forward to meeting the students.</p>
<p>(3) In addition to my residency at Hacker School, I&#8217;ll be a temporary postdoc in the applied math department at MIT, where I&#8217;ll be working on <a href="http://julialang.org">Julia</a> full-time. Expect to see lots of work on building up the core data analysis infrastructure.</p>
<p>(4) As of today I&#8217;ve accepted an offer to join Facebook&#8217;s Data Science team in the fall. I&#8217;ll be moving out to the Bay Area in November.</p>
<p>That&#8217;s all so far.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/05/09/whats-next/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Using Norms to Understand Linear Regression</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/#comments</comments>
		<pubDate>Fri, 22 Mar 2013 19:39:15 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4869</guid>
		<description><![CDATA[Introduction In my last post, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, \((x_1, x_2, \ldots, x_n)\), using a single number, \(s\). In particular, we measured the quality of different potential summaries in three different ways, which led us to [...]]]></description>
				<content:encoded><![CDATA[<h3>Introduction</h3>
<p>In <a href="http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/">my last post</a>, I described how we can derive modes, medians and means as three natural solutions to the problem of summarizing a list of numbers, \((x_1, x_2, \ldots, x_n)\), using a single number, \(s\). In particular, we measured the quality of different potential summaries in three different ways, which led us to modes, medians and means respectively. Each of these quantities emerged from measuring the typical discrepancy between an element of the list, \(x_i\), and the summary, \(s\), using a formula of the form,<br />
$$<br />
\sum_i |x_i - s|^p,<br />
$$<br />
where \(p\) was either \(0\), \(1\) or \(2\).</p>
<h3>The \(L_p\) Norms</h3>
<p>In this post, I&#8217;d like to extend this approach to linear regression. The notion of discrepancies we used in the last post is very closely tied to the idea of measuring the size of a vector in \(\mathbb{R}^n\). Specifically, we were minimizing a measure of discrepancies that was almost identical to the \(L_p\) family of norms that can be used to measure the size of vectors. Understanding \(L_p\) norms makes it much easier to describe several modern generalizations of classical linear regression.</p>
<p>To extend our previous approach to the more standard notion of an \(L_p\) norm, we simply take the sum we used before and rescale things by taking a \(p^{th}\) root. This gives the formula for the \(L_p\) norm of any vector, \(v = (v_1, v_2, \ldots, v_n)\), as,<br />
$$<br />
|v|_p = \left(\sum_i |v_i|^p\right)^{\frac{1}{p}}.<br />
$$<br />
When \(p = 2\), this formula reduces to the familiar formula for the length of a vector:<br />
$$<br />
|v|_2 = \sqrt{\sum_i v_i^2}.<br />
$$</p>
<p>In the last post, the vector we cared about was the vector of elementwise discrepancies, \(v = (x_1 - s, x_2 - s, \ldots, x_n - s)\). We wanted to minimize the overall size of this vector in order to make \(s\) a good summary of \(x_1, \ldots, x_n\). Because we were interested only in the minimum size of this vector, it didn&#8217;t matter that we skipped taking the \(p^{th}\) root at the end: one vector, \(v_1\), has a smaller norm than another vector, \(v_2\), exactly when the \(p^{th}\) power of its norm is smaller than the \(p^{th}\) power of the other&#8217;s. What was essential wasn&#8217;t the scale of the norm, but rather the value of \(p\) that we chose. Here we&#8217;ll follow that approach again. Specifically, we&#8217;ll again be working consistently with the \(p^{th}\) power of an \(L_p\) norm:<br />
$$<br />
|v|_p^p = (\sum_i |v_i|^p).<br />
$$</p>
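<p>As a quick numerical check of the formulas above, here is a minimal R sketch. The helper name <code>lp.norm</code> and the example vectors are made up for illustration; they are not from the original post.</p>

```r
# L_p norm of a vector v for p >= 1 (hypothetical helper for illustration).
lp.norm <- function(v, p)
{
	sum(abs(v)^p)^(1 / p)
}

# Made-up example vectors.
v <- c(3, -4)
w <- c(1, 1)

norm1 <- lp.norm(v, 1)    # sum of absolute values
norm2 <- lp.norm(v, 2)    # ordinary Euclidean length

# Comparing norms and comparing their p-th powers give the same ordering.
same.order <- (lp.norm(v, 2) > lp.norm(w, 2)) == (lp.norm(v, 2)^2 > lp.norm(w, 2)^2)
```

<p>For \(v = (3, -4)\) the \(L_1\) norm is \(7\) and the \(L_2\) norm is \(5\). The last line illustrates why skipping the \(p^{th}\) root is harmless when all we care about is which vector is smaller.</p>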
<h3>The Regression Problem</h3>
<p>Using \(L_p\) norms to measure the overall size of a vector of discrepancies extends naturally to other problems in statistics. In the previous post, we were trying to summarize a list of numbers by producing a simple summary statistic. In this post, we&#8217;re instead going to summarize the relationship between two lists of numbers in a form that generalizes traditional regression models.</p>
<p>Instead of a single list, we&#8217;ll now work with two vectors: \((x_1, x_2, \ldots, x_n)\) and \((y_1, y_2, \ldots, y_n)\). Because we like simple models, we&#8217;ll make the very strong (and very convenient) assumption that the second vector is, approximately, a linear function of the first vector, which gives us the formula:<br />
$$<br />
y_i \approx \beta_0 + \beta_1 x_i.<br />
$$</p>
<p>In practice, this linear relationship is never perfect, but only an approximation. As such, for any specific values we choose for \(\beta_0\) and \(\beta_1\), we have to compute a vector of discrepancies: \(v = (y_1 - (\beta_0 + \beta_1 x_1), \ldots, y_n - (\beta_0 + \beta_1 x_n))\). The question then becomes: how do we measure the size of this vector of discrepancies? By choosing different norms to measure its size, we arrive at several different forms of linear regression models. In particular, we&#8217;ll work with three norms: the \(L_0\), \(L_1\) and \(L_2\) norms.</p>
<p>As we did with the single vector case, here we&#8217;ll define discrepancies as,<br />
$$<br />
d_i = |y_i - (\beta_0 + \beta_1 x_i)|^p,<br />
$$<br />
and the total error as,<br />
$$<br />
E_p = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^p,<br />
$$<br />
which is just the \(p^{th}\) power of the \(L_p\) norm.</p>
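<p>The total error can be written as a small R function. This is only a sketch: the name <code>total.error</code> and the data are invented for illustration, and the \(p = 0\) branch uses the convention \(|0|^0 = 0\) from the previous post (R itself evaluates <code>0^0</code> as <code>1</code>).</p>

```r
# Total error E_p for a candidate line y = beta0 + beta1 * x.
# Hypothetical helper; the name and the example data are assumptions.
total.error <- function(beta0, beta1, x, y, p)
{
	residuals <- y - (beta0 + beta1 * x)

	if (p == 0)
	{
		# Convention |0|^0 = 0: E_0 counts the points the line misses.
		return(sum(residuals != 0))
	}

	return(sum(abs(residuals)^p))
}

# Made-up data: the line y = 2x fits the first three points exactly
# and misses the last one by 3.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 11)

e0 <- total.error(0, 2, x, y, 0)
e1 <- total.error(0, 2, x, y, 1)
e2 <- total.error(0, 2, x, y, 2)
```

<p>For this candidate line the three errors are \(1\), \(3\) and \(9\): the same single miss is counted once, by its size, and by its squared size.</p>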
<h3>Several Forms of Regression</h3>
<p>In general, we want to estimate a set of regression coefficients that minimize this total error. Different forms of linear regression appear when we alter the values of \(p\). As before, let&#8217;s consider three settings:<br />
$$<br />
E_0 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^0<br />
$$<br />
$$<br />
E_1 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^1<br />
$$<br />
$$<br />
E_2 = \sum_i |y_i - (\beta_0 + \beta_1 x_i)|^2<br />
$$</p>
<p>What happens in these settings? In the first case, we select regression coefficients so that the line passes through as many points as possible. Clearly we can always select a line that passes through any pair of points. And we can show that there are data sets in which we cannot do better. So the \(L_0\) norm doesn&#8217;t seem to provide a very useful form of linear regression, but I&#8217;d be interested to see examples of its use.</p>
<p>In contrast, minimizing \(E_1\) and minimizing \(E_2\) define quite interesting and familiar forms of linear regression. We&#8217;ll start with \(E_2\) because it&#8217;s the most familiar: it defines Ordinary Least Squares (OLS) regression, which is the one we all know and love. In the \(L_2\) case, we select \(\beta_0\) and \(\beta_1\) to minimize,<br />
$$<br />
E_2 = \sum_i (y_i - (\beta_0 + \beta_1 x_i))^2,<br />
$$<br />
which is the summed squared error over all of the \((x_i, y_i)\) pairs. In other words, Ordinary Least Squares regression is just an attempt to find an approximating linear relationship between two vectors that minimizes the \(L_2\) norm of the vector of discrepancies.</p>
<p>Although OLS regression is clearly king, the coefficients we get from minimizing \(E_1\) are also quite widely used: using the \(L_1\) norm defines Least Absolute Deviations (LAD) regression, which is also sometimes called Robust Regression. This approach to regression is robust because large outliers that would produce errors greater than \(1\) are not unnecessarily augmented by the squaring operation that&#8217;s used in defining OLS regression, but instead only have their absolute values taken. This means that the resulting model will try to match the overall linear pattern in the data even when there are some very large outliers.</p>
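<p>The robustness described above can be seen on a tiny made-up data set. The sketch below fits OLS with <code>lm</code> and fits LAD by brute force: for simple regression there is always a minimizer of \(E_1\) that passes through two of the data points, so it is enough to try every pair. The helper <code>lad.fit</code> is hypothetical and assumes distinct \(x\) values; for real work a dedicated routine such as <code>rq</code> from the quantreg package is probably the better choice.</p>

```r
# Brute-force LAD fit for simple regression (illustrative only).
# Assumes distinct x values; relies on the fact that some line through
# two data points always minimizes the summed absolute error.
lad.fit <- function(x, y)
{
	pairs <- combn(length(x), 2)
	best <- c(intercept = NA, slope = NA)
	best.error <- Inf

	for (k in seq_len(ncol(pairs)))
	{
		i <- pairs[1, k]
		j <- pairs[2, k]
		slope <- (y[j] - y[i]) / (x[j] - x[i])
		intercept <- y[i] - slope * x[i]
		error <- sum(abs(y - (intercept + slope * x)))

		if (error < best.error)
		{
			best.error <- error
			best <- c(intercept = intercept, slope = slope)
		}
	}

	return(best)
}

# Made-up data: an exact line y = 1 + 2x with one large outlier.
x <- 1:10
y <- 1 + 2 * x
y[10] <- 100

ols <- coef(lm(y ~ x))
lad <- lad.fit(x, y)
```

<p>On this data the LAD fit recovers the line through the nine clean points, while the OLS slope is pulled well above \(2\) by the single outlier.</p>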
<p>We can also relate these two approaches to the strategy employed in the previous post. When we use OLS regression (which would be better called \(L_2\) regression), we predict the mean of \(y_i\) given the value of \(x_i\). And when we use LAD regression (which would be better called \(L_1\) regression), we predict the median of \(y_i\) given the value of \(x_i\). Just as I said in the previous post, the core theoretical tool that we need to understand is the \(L_p\) norm. For single number summaries, it naturally leads to modes, medians and means. For simple regression problems, it naturally leads to LAD regression and OLS regression. But there&#8217;s more: it also leads naturally to the two most popular forms of regularized regression.</p>
<h3>Regularization</h3>
<p>If you&#8217;re not familiar with regularization, the central idea is that  we don&#8217;t exclusively try to find the values of \(\beta_0\) and \(\beta_1\) that minimize the discrepancy between \(\beta_0 + \beta_1 x_i\) and \(y_i\), but also simultaneously try to satisfy a competing requirement that \(\beta_1\) not get too large. Note that we don&#8217;t try to control the size of \(\beta_0\) because it describes the overall scale of the data rather than the relationship between \(x\) and \(y\).</p>
<p>Because these objectives compete, we have to combine them into a single objective. We do that by working with a linear sum of the two objectives. And because both the discrepancy objective and the size of the coefficients can be described in terms of norms, we&#8217;ll assume that we want to minimize the \(L_p\) norm of the discrepancies and the \(L_q\) norm of the \(\beta\)&#8217;s. This means that we end up trying to minimize an expression of the form,<br />
$$<br />
(\sum_i |y_i - (\beta_0 + \beta_1 x_i)|^{p}) + \lambda (|\beta_1|^q).<br />
$$</p>
<p>In most regularized regression models that I&#8217;ve seen in the wild, people tend to use \(p = 2\) and \(q = 1\) or \(q = 2\). When \(q = 1\), this model is called the LASSO. When \(q = 2\), this model is called ridge regression. In another post, I&#8217;ll try to describe why the LASSO and ridge regression produce such different patterns of coefficients.</p>
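<p>For the one-predictor model used in this post, the penalized objective with \(p = 2\) can be minimized in closed form, which makes for a compact sketch. The function names, the data and the value of \(\lambda\) below are all assumptions made for illustration; with several predictors one would use a dedicated solver instead.</p>

```r
# Penalized objective from the text: sum |residual|^p + lambda * |beta1|^q.
penalized.error <- function(beta0, beta1, x, y, lambda, p = 2, q = 1)
{
	sum(abs(y - (beta0 + beta1 * x))^p) + lambda * abs(beta1)^q
}

# Closed-form minimizers for a single predictor with p = 2 and an
# unpenalized intercept (hypothetical helper for this sketch).
penalized.fit <- function(x, y, lambda, q)
{
	sxx <- sum((x - mean(x))^2)
	sxy <- sum((x - mean(x)) * (y - mean(y)))

	if (q == 2)
	{
		# Ridge: the slope shrinks smoothly toward zero.
		beta1 <- sxy / (sxx + lambda)
	}
	else
	{
		# LASSO: soft-thresholding, which can set the slope exactly to zero.
		beta1 <- sign(sxy) * max(abs(sxy) - lambda / 2, 0) / sxx
	}

	c(beta0 = mean(y) - beta1 * mean(x), beta1 = beta1)
}

# Made-up data and penalty weight.
x <- 1:5
y <- c(1.1, 1.9, 3.2, 3.8, 5.0)

ridge <- penalized.fit(x, y, lambda = 30, q = 2)
lasso <- penalized.fit(x, y, lambda = 30, q = 1)
```

<p>With this fairly heavy penalty, the ridge slope is shrunk but stays positive, whereas the LASSO slope is exactly zero.</p>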
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/03/22/using-norms-to-understand-linear-regression/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Modes, Medians and Means: A Unifying Perspective</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/#comments</comments>
		<pubDate>Fri, 22 Mar 2013 13:21:06 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4813</guid>
		<description><![CDATA[Introduction / Warning Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can&#8217;t assume that students have much mathematical maturity, the close relationship between these three summary statistics can&#8217;t be made clear. This post tries to remedy that situation by making it clear that all [...]]]></description>
				<content:encoded><![CDATA[<h3>Introduction / Warning</h3>
<p>Any traditional introductory statistics course will teach students the definitions of modes, medians and means. But, because introductory courses can&#8217;t assume that students have much mathematical maturity, the close relationship between these three summary statistics can&#8217;t be made clear. This post tries to remedy that situation by making it clear that all three concepts arise as specific parameterizations of a more general problem.</p>
<p>To do so, I&#8217;ll need to introduce one non-standard definition that may trouble some readers. In order to simplify my exposition, let&#8217;s all agree to assume that \(0^0 = 0\). In particular, we&#8217;ll want to assume that \(|0|^0 = 0\), even though \(|\epsilon|^0 = 1\) for all \(\epsilon > 0\). This definition is non-standard, but it greatly simplifies what follows and emphasizes the conceptual unity of modes, medians and means.</p>
<h3>Constructing a Summary Statistic</h3>
<p>To see how modes, medians and means arise, let&#8217;s assume that we have a list of numbers, \((x_1, x_2, \ldots, x_n)\), that we want to summarize. We want our summary to be a single number, which we&#8217;ll call \(s\). How should we select \(s\) so that it summarizes the numbers,  \((x_1, x_2, \ldots, x_n)\), effectively?</p>
<p>To answer that, we&#8217;ll assume that \(s\) is an effective summary of the entire list if the typical discrepancy between \(s\) and each of the \(x_i\) is small. With that assumption in place, we only need to do two things: (1) define the notion of discrepancy between two numbers, \(x_i\) and \(s\);  and (2) define the notion of a typical discrepancy. Because each number \(x_i\) produces its own discrepancy, we&#8217;ll need to introduce a method for aggregating the individual discrepancies in order to say something about the typical discrepancy.</p>
<h3>Defining a Discrepancy</h3>
<p>We could define the discrepancy between a number \(x_i\) and another number \(s\) in many ways. For now, we&#8217;ll consider only three possibilities. All three of these options satisfy a basic intuition we have about the notion of discrepancy: we expect that the discrepancy between \(x_i\) and \(s\) should be \(0\) if \(|x_i - s| = 0\) and that the discrepancy should be greater than \(0\) if \(|x_i - s| > 0\). That leaves us with one obvious question: how much greater should the discrepancy be when \(|x_i - s| > 0\)?</p>
<p>To answer that question, let&#8217;s consider three definitions of the discrepancy, \(d_i\):</p>
<ol>
<li>\(d_i = |x_i - s|^0\)</li>
<li>\(d_i = |x_i - s|^1\)</li>
<li>\(d_i = |x_i - s|^2\)</li>
</ol>
<p>How should we think about these three possible definitions?</p>
<p>The first definition, \(d_i = |x_i - s|^0\), says that the discrepancy is \(1\) if \(x_i \neq s\) and is \(0\) only when \(x_i = s\). This notion of discrepancy is typically called <b>zero-one loss</b> in machine learning. Note that this definition implies that anything other than exact equality produces a constant measure of discrepancy. Summarizing \(x_i = 2\) with \(s = 0\) is no better nor worse than using \(s = 1\). In other words, the discrepancy does not increase at all as \(s\) gets further and further from \(x_i\). You can see this reflected in the far-left column of the image below:</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2013/03/discrepancy1.png" alt="Discrepancy" title="discrepancy.png" border="0" width="800" height="466" /><br />
</center></p>
<p>The second definition, \(d_i = |x_i - s|^1\), says that the discrepancy is equal to the distance between \(x_i\) and \(s\). This is often called an <b>absolute deviation</b> in machine learning. Note that this definition implies that the discrepancy should increase linearly as \(s\) gets further and further from \(x_i\). This is reflected in the center column of the image above.</p>
<p>The third definition, \(d_i = |x_i - s|^2\), says that the discrepancy is the squared distance between \(x_i\) and \(s\). This is often called a <b>squared error</b> in machine learning. Note that this definition implies that the discrepancy should increase super-linearly as \(s\) gets further and further from \(x_i\). For example, if \(x_i = 1\) and \(s = 0\), then the discrepancy is \(1\). But if \(x_i = 2\) and \(s = 0\), then the discrepancy is \(4\). This is reflected in the far right column of the image above.</p>
<p>When we consider a list with a single element, \((x_1)\), these definitions all suggest that we should choose the same number: namely, \(s = x_1\).</p>
<h3>Aggregating Discrepancies</h3>
<p>Although these definitions do not differ for a list with a single element, they suggest using very different summaries of a list with more than one number in it. To see why, let&#8217;s first assume that we&#8217;ll aggregate the discrepancy between \(x_i\) and \(s\) for each of the \(x_i\) into a single summary of the quality of a proposed value of \(s\). To perform this aggregation, we&#8217;ll sum up the discrepancies over each of the \(x_i\) and call the result \(E\).</p>
<p>In that case, our three definitions give three interestingly different possible definitions of the typical discrepancy, which we&#8217;ll call \(E\) for error:<br />
$$<br />
E_0 = \sum_{i} |x_i - s|^0.<br />
$$</p>
<p>$$<br />
E_1 = \sum_{i} |x_i - s|^1.<br />
$$</p>
<p>$$<br />
E_2 = \sum_{i} |x_i - s|^2.<br />
$$</p>
<p>When we write down these expressions in isolation, they don&#8217;t look very different. But if we select \(s\) to minimize each of these three types of errors, we get very different numbers. And, surprisingly, each of these three numbers will be very familiar to us.</p>
<h3>Minimizing Aggregate Discrepancies</h3>
<p>For example, suppose that we try to find \(s_0\) that minimizes the zero-one loss definition of the error of a single number summary. In that case, we require that,<br />
$$<br />
s_0 = \arg \min_{s} \sum_{i} |x_i - s|^0.<br />
$$<br />
What value should \(s_0\) take on? If you give this some extended thought, you&#8217;ll discover two things: (1) there is not necessarily a single best value of \(s_0\), but potentially many different values; and (2) each of these best values is one of the <b>modes</b> of the \(x_i\).</p>
<p>In other words, the best single number summary of a set of numbers, when you use exact equality as your metric of error, is one of the modes of that set of numbers.</p>
<p>What happens if we consider some of the other definitions? Let&#8217;s start by considering \(s_1\):<br />
$$<br />
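s_1 = \arg \min_{s} \sum_{i} |x_i - s|^1.<br />
$$<br />
Unlike \(s_0\), \(s_1\) is essentially unique: it is the <b>median</b> of the \(x_i\). (When the list has an even number of elements, every value between the two middle elements minimizes \(E_1\), and the usual median is the midpoint of that range.) That is, the best summary of a set of numbers, when you use absolute differences as your metric of error, is the median of that set of numbers.</p>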
<p>Since we&#8217;ve just found that the mode and the median appear naturally, we might wonder if other familiar basic statistics will appear. Luckily, they will. If we look for,<br />
$$<br />
s_2 = \arg \min_{s} \sum_{i} |x_i - s|^2,<br />
$$<br />
we&#8217;ll find that \(s_2\) is always a unique number. Moreover, \(s_2\) is the <b>mean</b> of the \(x_i\). That is, the best summary of a set of numbers, when you use squared differences as your metric of error, is the mean of that set of numbers.</p>
<p>To sum up, we&#8217;ve just seen that the three most famous single number summaries of a data set are very closely related: they all minimize the average discrepancy between \(s\) and the numbers being summarized. They only differ in the type of discrepancy being considered:</p>
<ol>
<li>The mode minimizes the number of times that one of the numbers in our summarized list is not equal to the summary that we use.</li>
<li>The median minimizes the average distance between each number and our summary.</li>
<li>The mean minimizes the average squared distance between each number and our summary.</li>
</ol>
<p>In equations,</p>
<ol>
<li>\(\text{The mode of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^0\)</li>
<li>\(\text{The median of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^1\)</li>
<li>\(\text{The mean of } x_i = \arg \min_{s} \sum_{i} |x_i - s|^2\)</li>
</ol>
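<p>These three equations are easy to check numerically. The following R sketch searches a grid of candidate summaries and picks the one with the smallest aggregate error for each value of \(p\). The helper <code>aggregate.error</code>, the data and the grid are made up for illustration, and the \(p = 0\) branch applies the convention \(|0|^0 = 0\) explicitly because R evaluates <code>0^0</code> as <code>1</code>.</p>

```r
# Aggregate error E_p for a candidate summary s (hypothetical helper).
aggregate.error <- function(s, x, p)
{
	if (p == 0)
	{
		# Convention |0|^0 = 0: count the elements that differ from s.
		return(sum(x != s))
	}

	return(sum(abs(x - s)^p))
}

# Made-up data with a single mode (1), median 2 and mean 3.
x <- c(1, 1, 2, 4, 7)

# Search a grid of candidate summaries that contains all three statistics.
candidates <- seq(0, 8, by = 0.5)
s0 <- candidates[which.min(sapply(candidates, aggregate.error, x = x, p = 0))]
s1 <- candidates[which.min(sapply(candidates, aggregate.error, x = x, p = 1))]
s2 <- candidates[which.min(sapply(candidates, aggregate.error, x = x, p = 2))]
```

<p>On this list the grid search returns \(1\), \(2\) and \(3\), which are the mode, the median and the mean respectively. A grid search is used only because it is easy to read; it finds the exact minimizers here because the grid was chosen to contain them.</p>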
<h3>Summary</h3>
<p>We&#8217;ve just seen that the mode, median and mean all arise from a simple parametric process in which we try to minimize the average discrepancy between a single number \(s\) and a list of numbers, \(x_1, x_2, \ldots, x_n\) that we try to summarize using \(s\). In a future blog post, I&#8217;ll describe how the ideas we&#8217;ve just introduced relate to the concept of \(L_p\) norms. Thinking about minimizing \(L_p\) norms is a generalization of taking modes, medians and means that leads to almost every important linear method in statistics &#8212; ranging from linear regression to the SVD.</p>
<h3>Thanks</h3>
<p>Thanks to Sean Taylor for reading a draft of this post and commenting on it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medians-and-means-an-unifying-perspective/feed/</wfw:commentRss>
		<slash:comments>28</slash:comments>
		</item>
		<item>
		<title>Writing Better Statistical Programs in R</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/01/24/writing-better-statistical-programs-in-r/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/01/24/writing-better-statistical-programs-in-r/#comments</comments>
		<pubDate>Thu, 24 Jan 2013 17:40:43 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4806</guid>
		<description><![CDATA[A while back a friend asked me for advice about speeding up some R code that they&#8217;d written. Because they were running an extensive Monte Carlo simulation of a model they&#8217;d been developing, the poor performance of their code had become an impediment to their work. After I looked through their code, it was clear [...]]]></description>
				<content:encoded><![CDATA[<p>A while back a friend asked me for advice about speeding up some R code that they&#8217;d written. Because they were running an extensive Monte Carlo simulation of a model they&#8217;d been developing, the poor performance of their code had become an impediment to their work.</p>
<p>After I looked through their code, it was clear that the performance hurdles they were stumbling upon could be overcome by adopting a few best practices for statistical programming. This post tries to describe some of the simplest best practices for statistical programming in R. Following these principles should make it easier for you to write statistical programs that are both highly performant and correct.</p>
<h3>Write Out a DAG</h3>
<p>Whenever you&#8217;re running a simulation study, you should appreciate the fact that you are working with a probabilistic model. Even if you are primarily focused upon the deterministic components of this model, the presence of any randomness in the model means that all of the theory of probabilistic models applies to your situation.</p>
<p>Almost certainly the most important concept in probabilistic modeling when you want to write efficient code is the notion of conditional independence. Conditional independence is important because many probabilistic models can be decomposed into simple pieces that can be computed in isolation. Although your model contains many variables, any one of these variables may depend upon only a few other variables in your model. If you can organize all of the variables in your model based on their dependencies, it will be easier to exploit two computational tricks: vectorization and parallelization.</p>
<p>Let&#8217;s go through an example. Imagine that you have the model shown below:</p>
<p>$$<br />
X \sim \text{Normal}(0, 1)<br />
$$</p>
<p>$$<br />
Y1 \sim \text{Uniform}(X, X + 1)<br />
$$</p>
<p>$$<br />
Y2 \sim \text{Uniform}(X - 1, X)<br />
$$</p>
<p>$$<br />
Z \sim \text{Cauchy}(Y1 + Y2, 1)<br />
$$</p>
<p>In this model, the distributions of Y1 and Y2 depend only on the value of X. Similarly, the distribution of Z depends only on the values of Y1 and Y2. We can formalize this notion using a DAG, which is a directed acyclic graph that depicts which variables depend upon which other variables. It will help you appreciate the value of this format if you think of the arrows in the DAG below as indicating the flow of causality:</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2013/01/dag.png" alt="Dag" title="dag.png" border="0" width="470" height="600" /><br />
</center></p>
<p>Having this DAG drawn out for your model will make it easier to write efficient code, because you can generate all of the values of a variable V simultaneously once you&#8217;ve computed the values of the variables that V depends upon. In our example, you can generate the values of X for all of your different simulations at once and then generate all of the Y1&#8217;s and Y2&#8217;s based on the values of X that you generate. You can then exploit this stepwise generation procedure to vectorize and parallelize your code. I&#8217;ll discuss vectorization to give you a sense of how to exploit the DAG we&#8217;ve drawn to write faster code.</p>
<h3>Vectorize Your Simulations</h3>
<p>Sequential dependencies are a major bottleneck in languages like R and Matlab that cannot perform loops efficiently. Looking at the DAG for the model shown above, you might think that you can&#8217;t get around writing a &#8220;for&#8221; loop to generate samples of this model because some of the variables need to be generated before others.</p>
<p>But, in reality, each individual sample from this model is independent of all of the others. As such, you can draw all of the X&#8217;s for all of your different simulations using vectorized code. Below I show how this model could be implemented using loops and then show how this same model could be implemented using vectorized operations:</p>
<h4>Loop Code</h4>

<div class="wp_codebox"><table><tr id="p48065"><td class="code" id="p4806code5"><pre class="c" style="font-family:monospace;">run.<span style="color: #202020;">sims</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">sims</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	results <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
&nbsp;
	<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>sim in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>n.<span style="color: #202020;">sims</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#123;</span>
		x <span style="color: #339933;">&lt;-</span> rnorm<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
		y1 <span style="color: #339933;">&lt;-</span> runif<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> x<span style="color: #339933;">,</span> x <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
		y2 <span style="color: #339933;">&lt;-</span> runif<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> x <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> x<span style="color: #009900;">&#41;</span>
		z <span style="color: #339933;">&lt;-</span> rcauchy<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> y1 <span style="color: #339933;">+</span> y2<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
		results <span style="color: #339933;">&lt;-</span> rbind<span style="color: #009900;">&#40;</span>results<span style="color: #339933;">,</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>X <span style="color: #339933;">=</span> x<span style="color: #339933;">,</span> Y1 <span style="color: #339933;">=</span> y1<span style="color: #339933;">,</span> Y2 <span style="color: #339933;">=</span> y2<span style="color: #339933;">,</span> Z <span style="color: #339933;">=</span> z<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
	<span style="color: #009900;">&#125;</span>
&nbsp;
	<span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>results<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
b <span style="color: #339933;">&lt;-</span> Sys.<span style="color: #202020;">time</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
run.<span style="color: #202020;">sims</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">5000</span><span style="color: #009900;">&#41;</span>
e <span style="color: #339933;">&lt;-</span> Sys.<span style="color: #202020;">time</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
e <span style="color: #339933;">-</span> b</pre></td></tr></table></div>

<h4>Vectorized Code</h4>

<div class="wp_codebox"><table><tr id="p48066"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code" id="p4806code6"><pre class="c" style="font-family:monospace;">run.<span style="color: #202020;">sims</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">sims</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	x <span style="color: #339933;">&lt;-</span> rnorm<span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">sims</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">0</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
	y1 <span style="color: #339933;">&lt;-</span> runif<span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">sims</span><span style="color: #339933;">,</span> x<span style="color: #339933;">,</span> x <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
	y2 <span style="color: #339933;">&lt;-</span> runif<span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">sims</span><span style="color: #339933;">,</span> x <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> x<span style="color: #009900;">&#41;</span>
	z <span style="color: #339933;">&lt;-</span> rcauchy<span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">sims</span><span style="color: #339933;">,</span> y1 <span style="color: #339933;">+</span> y2<span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
	results <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>X <span style="color: #339933;">=</span> x<span style="color: #339933;">,</span> Y1 <span style="color: #339933;">=</span> y1<span style="color: #339933;">,</span> Y2 <span style="color: #339933;">=</span> y2<span style="color: #339933;">,</span> Z <span style="color: #339933;">=</span> z<span style="color: #009900;">&#41;</span>
&nbsp;
	<span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>results<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
b <span style="color: #339933;">&lt;-</span> Sys.<span style="color: #202020;">time</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
run.<span style="color: #202020;">sims</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">5000</span><span style="color: #009900;">&#41;</span>
e <span style="color: #339933;">&lt;-</span> Sys.<span style="color: #202020;">time</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
e <span style="color: #339933;">-</span> b</pre></td></tr></table></div>

<p>The performance gains for this example are substantial when you move from the naive loop code to the vectorized code. (NB: There are also some gains from avoiding the repeated calls to <code>rbind</code>, although they are less important than one might think in this case.)</p>
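<p>As a rough way to separate those two effects, here is a sketch of a third variant that keeps the loop but preallocates its output instead of calling <code>rbind</code>. The name <code>run.sims.prealloc</code> is hypothetical, and the sketch assumes the loop version draws <code>x</code> from <code>rnorm(1, 0, 1)</code> on each iteration, mirroring the vectorized code. Timings vary by machine, so none are quoted here.</p>

```r
# Hypothetical variant: same simulation as the loop version, but the
# output vectors are preallocated instead of grown with rbind().
run.sims.prealloc <- function(n.sims)
{
	x <- numeric(n.sims)
	y1 <- numeric(n.sims)
	y2 <- numeric(n.sims)
	z <- numeric(n.sims)

	for (sim in seq_len(n.sims))
	{
		# Assumed to match the per-iteration draw in the loop version
		x[sim] <- rnorm(1, 0, 1)
		y1[sim] <- runif(1, x[sim], x[sim] + 1)
		y2[sim] <- runif(1, x[sim] - 1, x[sim])
		z[sim] <- rcauchy(1, y1[sim] + y2[sim], 1)
	}

	return(data.frame(X = x, Y1 = y1, Y2 = y2, Z = z))
}

# system.time() reports user, system and elapsed time for the call
timing <- system.time(results <- run.sims.prealloc(5000))
print(timing["elapsed"])
```

<p>Comparing its elapsed time with the two versions above gives a rough sense of how much of the cost comes from <code>rbind</code> and how much from the loop itself.</p>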
<p>We could go further and parallelize the vectorized code, but this can be tedious to do in R.</p>
<h3>The Data Generation / Model Fitting Cycle</h3>
<p>Vectorization can make code in languages like R much more efficient. But speed is useless if you&#8217;re not generating correct output. For me, the essential test of correctness for a probabilistic model only becomes clear after I&#8217;ve written two complementary functions:</p>
<ol>
<li>A data generation function that produces samples from my model. We can call this function <code>generate</code>. The arguments to <code>generate</code> are the parameters of my model.</li>
<p> 
<li>A model fitting function that estimates the parameters of my model based on a sample of data. We can call this function <code>fit</code>. The arguments to <code>fit</code> are the data points we generated using <code>generate</code>.</li>
</ol>
<p>The value of these two functions is that they can be set up to feed back into one another in the cycle shown below:</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2013/01/cycle2.png" alt="Cycle2" title="cycle2.png" border="0" width="576" height="360" /><br />
</center></p>
<p>I feel confident in the quality of statistical code when these functions interact stably. If the parameters inferred in a single pass through this loop are close to the original inputs, then my code is likely to work correctly. This amounts to a specific instance of the following design pattern:</p>

<div class="wp_codebox"><table><tr id="p48067"><td class="line_numbers"><pre>1
2
3
</pre></td><td class="code" id="p4806code7"><pre class="c" style="font-family:monospace;">data <span style="color: #339933;">&lt;-</span> generate<span style="color: #009900;">&#40;</span>model<span style="color: #339933;">,</span> parameters<span style="color: #009900;">&#41;</span>
inferred.<span style="color: #202020;">parameters</span> <span style="color: #339933;">&lt;-</span> fit<span style="color: #009900;">&#40;</span>model<span style="color: #339933;">,</span> data<span style="color: #009900;">&#41;</span>
reliability <span style="color: #339933;">&lt;-</span> error<span style="color: #009900;">&#40;</span>model<span style="color: #339933;">,</span> parameters<span style="color: #339933;">,</span> inferred.<span style="color: #202020;">parameters</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>To see this pattern in action, let&#8217;s step through a process of generating data from a normal distribution and then fitting a normal to the data we generate. You can think of this as a form of &#8220;currying&#8221; in which we hard-code the value of the parameter <code>model</code>:</p>

<div class="wp_codebox"><table><tr id="p48068"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="code" id="p4806code8"><pre class="c" style="font-family:monospace;">n.<span style="color: #202020;">sims</span> <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">100</span>
n.<span style="color: #202020;">obs</span> <span style="color: #339933;">&lt;-</span> <span style="color: #0000dd;">100</span>
&nbsp;
generate.<span style="color: #202020;">normal</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>parameters<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	<span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>rnorm<span style="color: #009900;">&#40;</span>n.<span style="color: #202020;">obs</span><span style="color: #339933;">,</span> parameters<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> parameters<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
fit.<span style="color: #202020;">normal</span> <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	<span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span>c<span style="color: #009900;">&#40;</span>mean<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> sd<span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
distance <span style="color: #339933;">&lt;-</span> <span style="color: #000000; font-weight: bold;">function</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">true</span>.<span style="color: #202020;">parameters</span><span style="color: #339933;">,</span> inferred.<span style="color: #202020;">parameters</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	<span style="color: #b1b100;">return</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">true</span>.<span style="color: #202020;">parameters</span> <span style="color: #339933;">-</span> inferred.<span style="color: #202020;">parameters</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">^</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
reliability <span style="color: #339933;">&lt;-</span> data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
&nbsp;
<span style="color: #b1b100;">for</span> <span style="color: #009900;">&#40;</span>sim in <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span>n.<span style="color: #202020;">sims</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#123;</span>
	parameters <span style="color: #339933;">&lt;-</span> c<span style="color: #009900;">&#40;</span>runif<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> runif<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
	data <span style="color: #339933;">&lt;-</span> generate.<span style="color: #202020;">normal</span><span style="color: #009900;">&#40;</span>parameters<span style="color: #009900;">&#41;</span>
	inferred.<span style="color: #202020;">parameters</span> <span style="color: #339933;">&lt;-</span> fit.<span style="color: #202020;">normal</span><span style="color: #009900;">&#40;</span>data<span style="color: #009900;">&#41;</span>
	recovery.<span style="color: #202020;">error</span> <span style="color: #339933;">&lt;-</span> distance<span style="color: #009900;">&#40;</span>parameters<span style="color: #339933;">,</span> inferred.<span style="color: #202020;">parameters</span><span style="color: #009900;">&#41;</span>
	reliability <span style="color: #339933;">&lt;-</span> rbind<span style="color: #009900;">&#40;</span>reliability<span style="color: #339933;">,</span>
		                 data.<span style="color: #202020;">frame</span><span style="color: #009900;">&#40;</span>True1 <span style="color: #339933;">=</span> parameters<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
		                 	        True2 <span style="color: #339933;">=</span> parameters<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
		                 	        Inferred1 <span style="color: #339933;">=</span> inferred.<span style="color: #202020;">parameters</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
		                 	        Inferred2 <span style="color: #339933;">=</span> inferred.<span style="color: #202020;">parameters</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
							        Error1 <span style="color: #339933;">=</span> recovery.<span style="color: #202020;">error</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
							        Error2 <span style="color: #339933;">=</span> recovery.<span style="color: #202020;">error</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>If you generate data this way, you will see that our inference code is quite reliable. And you can see that it becomes better if we set <code>n.obs</code> to a larger value like 100,000.</p>
<p>I expect this kind of performance from all of my statistical code. I can&#8217;t trust the quality of either <code>generate</code> or <code>fit</code> until I see that they play well together. It is their mutual coherence that inspires faith.</p>
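<p>To check the claim about <code>n.obs</code> directly, one option is to wrap the same generate-and-fit steps in a function of the sample size and compare average squared errors. This is only a sketch: <code>average.error</code> is a hypothetical helper, the seed is arbitrary, and <code>replicate</code> stands in for the explicit loop.</p>

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical helper: average squared recovery error for the mean and
# standard deviation of a normal model at a given sample size.
average.error <- function(n.obs, n.sims = 100)
{
	errors <- replicate(n.sims, {
		parameters <- c(runif(1), runif(1))
		data <- rnorm(n.obs, parameters[1], parameters[2])
		inferred.parameters <- c(mean(data), sd(data))
		(parameters - inferred.parameters)^2
	})

	# errors is a 2 x n.sims matrix: row 1 is the mean, row 2 is the sd
	result <- rowMeans(errors)
	names(result) <- c("Error1", "Error2")
	return(result)
}

small <- average.error(100)
large <- average.error(100000)
print(rbind(small, large))
```

<p>If <code>generate</code> and <code>fit</code> are coherent, both entries of <code>large</code> should come out well below the corresponding entries of <code>small</code>.</p>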
<h3>General Lessons</h3>
<h4>Speed</h4>
<p>When writing code in R, you can improve performance by looking for every place where vectorization is possible. Vectorization essentially replaces R&#8217;s loops (which are not efficient) with C&#8217;s loops (which are efficient) because the computations in a vectorized call are almost always implemented in a language other than R.</p>
<h4>Correctness</h4>
<p>When writing code for model fitting in any language, you should always ensure that your code can infer the parameters of models when given simulated data with known parameter values.</p>
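<p>One way to make this habit routine is to turn the recovery check into an assertion that fails loudly. The sketch below does this for an exponential model, which is not one of the models discussed above. The function names, the sample size, and the 10% tolerance are all illustrative assumptions.</p>

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical generate/fit pair for an exponential model
generate.exponential <- function(rate, n.obs = 10000)
{
	return(rexp(n.obs, rate))
}

fit.exponential <- function(data)
{
	# Maximum likelihood estimate of the rate parameter
	return(1 / mean(data))
}

true.rate <- 2
inferred.rate <- fit.exponential(generate.exponential(true.rate))

# Fail if the inferred rate is more than 10% away from the true rate;
# the tolerance is an arbitrary choice for this sketch.
relative.error <- abs(inferred.rate - true.rate) / true.rate
stopifnot(relative.error < 0.1)
```

<p>A check like this can be rerun after every change to <code>generate</code> or <code>fit</code>.</p>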
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/01/24/writing-better-statistical-programs-in-r/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Americans Live Longer and Work Less</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/01/21/americans-live-longer-and-work-less/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/01/21/americans-live-longer-and-work-less/#comments</comments>
		<pubDate>Mon, 21 Jan 2013 15:30:07 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4801</guid>
		<description><![CDATA[Today I saw an article on Hacker News entitled, &#8220;America&#8217;s CEOs Want You to Work Until You&#8217;re 70&#8221;. I was particularly surprised by this article appearing out of the blue because I take it for granted that America will eventually have to raise the retirement age to avoid bankruptcy. After reading the article, I wasn&#8217;t [...]]]></description>
				<content:encoded><![CDATA[<p>Today I saw an article on Hacker News entitled, <a href="http://www.businessweek.com/articles/2013-01-18/americas-ceos-want-you-to-work-until-youre-70">&#8220;America&#8217;s CEOs Want You to Work Until You&#8217;re 70&#8221;</a>. I was particularly surprised by this article appearing out of the blue because I take it for granted that America will eventually have to raise the retirement age to avoid bankruptcy. After reading the article, I wasn&#8217;t able to figure out why the story had been run at all. So I decided to do some basic fact-checking.</p>
<p>I tracked down <a href="http://demog.berkeley.edu/~andrew/1918/figure2.html">some time series data about life expectancies in the U.S.</a> from Berkeley and then found <a href="http://www.oecd.org/els/employmentpoliciesanddata/ageingandemploymentpolicies-statisticsonaverageeffectiveageofretirement.htm">some time series data about the average age at retirement</a> from the OECD. Plotting just these two bits of information, as shown below, makes it clear that Americans are spending a larger proportion of their lives in retirement.</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2013/01/retirement.png" alt="Retirement" title="retirement.png" border="0" width="800" height="466" /><br />
</center></p>
<p>Perhaps I&#8217;m just naive, but it seems obvious to me that we can&#8217;t afford to take on several additional years of retirement pension liabilities for every living American. If Americans are living longer, we will need them to work longer in order to pay our bills.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/01/21/americans-live-longer-and-work-less/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
		</item>
		<item>
		<title>Symbolic Differentiation in Julia</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/01/07/symbolic-differentiation-in-julia/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/01/07/symbolic-differentiation-in-julia/#comments</comments>
		<pubDate>Mon, 07 Jan 2013 15:45:15 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4793</guid>
		<description><![CDATA[A Brief Introduction to Metaprogramming in Julia In contrast to my previous post, which described one way in which Julia allows (and expects) the programmer to write code that directly employs the atomic operations offered by computers, this post is meant to introduce newcomers to some of Julia&#8217;s higher level functions for metaprogramming. To make [...]]]></description>
				<content:encoded><![CDATA[<h3>A Brief Introduction to Metaprogramming in Julia</h3>
<p>In contrast to <a href="http://www.johnmyleswhite.com/notebook/2013/01/03/computers-are-machines/">my previous post</a>, which described one way in which Julia allows (and expects) the programmer to write code that directly employs the atomic operations offered by computers, this post is meant to introduce newcomers to some of Julia&#8217;s higher level functions for metaprogramming. To make metaprogramming more interesting, we&#8217;re going to build a system for symbolic differentiation in Julia.</p>
<p>Like Lisp, the Julia interpreter represents Julian expressions using normal data structures: every Julian expression is represented using an object of type <code>Expr</code>. You can see this by typing something like <code>:(x + 1)</code> into the Julia REPL:</p>

<div class="wp_codebox"><table><tr id="p479322"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p4793code22"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> <span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
<span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span>x<span style="color: #339933;">,</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> typeof<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span>x<span style="color: #339933;">+</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
Expr</pre></td></tr></table></div>

<p>Looking at the REPL output when we enter an expression quoted using the <code>:</code> operator, we can see that Julia has rewritten our input expression, originally written using infix notation, as an expression that uses prefix notation. This standardization to prefix notation makes it easier to work with arbitrary expressions because it removes a needless source of variation in the format of expressions.</p>
<p>To develop an intuition for what this kind of expression means to Julia, we can use the <code>dump</code> function to examine its contents:</p>

<div class="wp_codebox"><table><tr id="p479323"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code" id="p4793code23"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> dump<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
Expr 
  head<span style="color: #339933;">:</span> Symbol call
  args<span style="color: #339933;">:</span> Array<span style="color: #009900;">&#40;</span>Any<span style="color: #339933;">,</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">3</span><span style="color: #339933;">,</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
    <span style="color: #0000dd;">1</span><span style="color: #339933;">:</span> Symbol <span style="color: #339933;">+</span>
    <span style="color: #0000dd;">2</span><span style="color: #339933;">:</span> Symbol x
    <span style="color: #0000dd;">3</span><span style="color: #339933;">:</span> Int64 <span style="color: #0000dd;">1</span>
  typ<span style="color: #339933;">:</span> Any</pre></td></tr></table></div>

<p>Here you can see that a Julian expression consists of three parts:</p>
<ol>
<li>A <code>head</code> symbol, which describes the basic type of the expression. For this blog post, all of the expressions we&#8217;ll work with have <code>head</code> equal to <code>:call</code>.</li>
<li>An <code>Array{Any}</code> that contains the arguments of the <code>head</code>. In our example, the <code>head</code> is <code>:call</code>, which indicates a function call is being made in this expression. The arguments for the function call are:</li>
<ol>
<li><code>:+</code>, the symbol denoting the addition function that we are calling.</li>
<li><code>:x</code>, the symbol denoting the variable <code>x</code>.</li>
<li><code>1</code>, the number 1 represented as a 64-bit integer.</li>
</ol>
<li>A <code>typ</code> which stores type inference information. We&#8217;ll ignore this information as it&#8217;s not relevant to us right now.</li>
</ol>
<p>Because each expression is built out of normal components, we can construct one piecemeal:</p>

<div class="wp_codebox"><table><tr id="p479324"><td class="line_numbers"><pre>1
2
</pre></td><td class="code" id="p4793code24"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span> <span style="color: #009900;">&#123;</span><span style="color: #339933;">:+,</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> Any<span style="color: #009900;">&#41;</span>
<span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>Because this expression only depends upon constants, we can immediately evaluate it using the <code>eval</code> function:</p>

<div class="wp_codebox"><table><tr id="p479325"><td class="line_numbers"><pre>1
2
</pre></td><td class="code" id="p4793code25"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> eval<span style="color: #009900;">&#40;</span>Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span> <span style="color: #009900;">&#123;</span><span style="color: #339933;">:+,</span> <span style="color: #0000dd;">1</span><span style="color: #339933;">,</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> Any<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #0000dd;">2</span></pre></td></tr></table></div>

<h3>Symbolic Differentiation in Julia</h3>
<p>Now that we know how Julia expressions are built, we can design a very simple prototype system for doing symbolic differentiation in Julia. We&#8217;ll build up our system in pieces using some of the most basic rules of calculus:</p>
<ol>
<li><b>The Constant Rule</b>: <code>d/dx c = 0</code></li>
<li><b>The Symbol Rule</b>: <code>d/dx x = 1</code>, <code>d/dx y = 0</code></li>
<li><b>The Sum Rule</b>: <code>d/dx (f + g) = (d/dx f) + (d/dx g)</code></li>
<li><b>The Subtraction Rule</b>: <code>d/dx (f - g) = (d/dx f) - (d/dx g)</code></li>
<li><b>The Product Rule</b>: <code>d/dx (f * g) = (d/dx f) * g + f * (d/dx g)</code></li>
<li><b>The Quotient Rule</b>: <code>d/dx (f / g) = [(d/dx f) * g - f * (d/dx g)] / g^2</code></li>
</ol>
<p>Implementing these operations is quite easy once you understand the data structure Julia uses to represent expressions. And some of these operations would be trivial regardless.</p>
<p>For example, here&#8217;s the Constant Rule in Julia:</p>

<div class="wp_codebox"><table><tr id="p479326"><td class="line_numbers"><pre>1
</pre></td><td class="code" id="p4793code26"><pre class="c" style="font-family:monospace;">differentiate<span style="color: #009900;">&#40;</span>x<span style="color: #339933;">::</span><span style="color: #202020;">Number</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #0000dd;">0</span></pre></td></tr></table></div>

<p>And here&#8217;s the Symbol Rule:</p>

<div class="wp_codebox"><table><tr id="p479327"><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code" id="p4793code27"><pre class="c" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> differentiate<span style="color: #009900;">&#40;</span>s<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">if</span> s <span style="color: #339933;">==</span> target
        <span style="color: #b1b100;">return</span> <span style="color: #0000dd;">1</span>
    <span style="color: #b1b100;">else</span>
        <span style="color: #b1b100;">return</span> <span style="color: #0000dd;">0</span>
    end
end</pre></td></tr></table></div>

<p>The first two rules of calculus don&#8217;t actually require us to understand anything about Julian expressions. But the interesting parts of a symbolic differentiation system do. To see that, let&#8217;s look at the Sum Rule:</p>

<div class="wp_codebox"><table><tr id="p479328"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
</pre></td><td class="code" id="p4793code28"><pre class="c" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> differentiate_sum<span style="color: #009900;">&#40;</span>ex<span style="color: #339933;">::</span><span style="color: #202020;">Expr</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span>
    n <span style="color: #339933;">=</span> length<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#41;</span>
    new_args <span style="color: #339933;">=</span> Array<span style="color: #009900;">&#40;</span>Any<span style="color: #339933;">,</span> n<span style="color: #009900;">&#41;</span>
    new_args<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #339933;">:+</span>
    <span style="color: #b1b100;">for</span> i in <span style="color: #0000dd;">2</span><span style="color: #339933;">:</span>n
        new_args<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> differentiate<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> target<span style="color: #009900;">&#41;</span>
    end
    <span style="color: #b1b100;">return</span> Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span> new_args<span style="color: #339933;">,</span> Any<span style="color: #009900;">&#41;</span>
end</pre></td></tr></table></div>

<p>The Subtraction Rule can be defined almost identically:</p>

<div class="wp_codebox"><table><tr id="p479329"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
</pre></td><td class="code" id="p4793code29"><pre class="c" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> differentiate_subtraction<span style="color: #009900;">&#40;</span>ex<span style="color: #339933;">::</span><span style="color: #202020;">Expr</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span>
    n <span style="color: #339933;">=</span> length<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#41;</span>
    new_args <span style="color: #339933;">=</span> Array<span style="color: #009900;">&#40;</span>Any<span style="color: #339933;">,</span> n<span style="color: #009900;">&#41;</span>
    new_args<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #339933;">:-</span>
    <span style="color: #b1b100;">for</span> i in <span style="color: #0000dd;">2</span><span style="color: #339933;">:</span>n
        new_args<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> differentiate<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> target<span style="color: #009900;">&#41;</span>
    end
    <span style="color: #b1b100;">return</span> Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span> new_args<span style="color: #339933;">,</span> Any<span style="color: #009900;">&#41;</span>
end</pre></td></tr></table></div>

<p>The Product Rule is a little more interesting because we need to build up an expression whose components are themselves expressions:</p>

<div class="wp_codebox"><table><tr id="p479330"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code" id="p4793code30"><pre class="c" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> differentiate_product<span style="color: #009900;">&#40;</span>ex<span style="color: #339933;">::</span><span style="color: #202020;">Expr</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span>
    n <span style="color: #339933;">=</span> length<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#41;</span>
    res_args <span style="color: #339933;">=</span> Array<span style="color: #009900;">&#40;</span>Any<span style="color: #339933;">,</span> n<span style="color: #009900;">&#41;</span>
    res_args<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #339933;">:+</span>
    <span style="color: #b1b100;">for</span> i in <span style="color: #0000dd;">2</span><span style="color: #339933;">:</span>n
       new_args <span style="color: #339933;">=</span> Array<span style="color: #009900;">&#40;</span>Any<span style="color: #339933;">,</span> n<span style="color: #009900;">&#41;</span>
       new_args<span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #339933;">:*</span>
       <span style="color: #b1b100;">for</span> j in <span style="color: #0000dd;">2</span><span style="color: #339933;">:</span>n
           <span style="color: #b1b100;">if</span> j <span style="color: #339933;">==</span> i
               new_args<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> differentiate<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> target<span style="color: #009900;">&#41;</span>
           <span style="color: #b1b100;">else</span>
               new_args<span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span>j<span style="color: #009900;">&#93;</span>
           end
       end
       res_args<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span> new_args<span style="color: #339933;">,</span> Any<span style="color: #009900;">&#41;</span>
    end
    <span style="color: #b1b100;">return</span> Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span> res_args<span style="color: #339933;">,</span> Any<span style="color: #009900;">&#41;</span>
end</pre></td></tr></table></div>

<p>Last, but not least, here&#8217;s the Quotient Rule, which is a little more complex. We can code this rule up in a more explicit fashion that doesn&#8217;t use any loops so that we can directly see the steps we&#8217;re taking:</p>

<div class="wp_codebox"><table><tr id="p479331"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
</pre></td><td class="code" id="p4793code31"><pre class="c" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> differentiate_quotient<span style="color: #009900;">&#40;</span>ex<span style="color: #339933;">::</span><span style="color: #202020;">Expr</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">return</span> Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span>
                <span style="color: #009900;">&#123;</span>
                    <span style="color: #339933;">:/,</span>
                    Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span>
                         <span style="color: #009900;">&#123;</span>
                            <span style="color: #339933;">:-,</span>
                            Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span>
                                 <span style="color: #009900;">&#123;</span>
                                    <span style="color: #339933;">:*,</span>
                                    differentiate<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> target<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                                    ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #009900;">&#93;</span>
                                 <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
                                 Any<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                            Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span>
                                 <span style="color: #009900;">&#123;</span>
                                    <span style="color: #339933;">:*,</span>
                                    ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">2</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
                                    differentiate<span style="color: #009900;">&#40;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span> target<span style="color: #009900;">&#41;</span>
                                 <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
                                 Any<span style="color: #009900;">&#41;</span>
                         <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
                         Any<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span>
                    Expr<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span>call<span style="color: #339933;">,</span>
                         <span style="color: #009900;">&#123;</span>
                            <span style="color: #339933;">:^,</span>
                            ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">3</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">,</span>
                            <span style="color: #0000dd;">2</span>
                         <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
                         Any<span style="color: #009900;">&#41;</span>
                <span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span>
                Any<span style="color: #009900;">&#41;</span>
end</pre></td></tr></table></div>
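<p>For comparison, here is a rough sketch of the same Quotient Rule written with quoting and <code>$</code> interpolation. It assumes current Julia (1.x) syntax rather than the syntax used elsewhere in this post, and the names <code>deriv</code> and <code>deriv_quotient</code> are placeholders chosen for illustration, not functions from the Gist or from any package:</p>

```julia
# Minimal base cases so that this sketch runs on its own (illustrative names).
deriv(x::Number, target::Symbol) = 0
deriv(s::Symbol, target::Symbol) = s == target ? 1 : 0

# Quotient Rule for a two-argument call such as :(a / b):
# (a / b)' = (a' * b - a * b') / b^2
function deriv_quotient(ex::Expr, target::Symbol)
    a, b = ex.args[2], ex.args[3]
    da, db = deriv(a, target), deriv(b, target)
    # Interpolation splices the sub-expressions into a quoted template.
    return :(($da * $b - $a * $db) / $b ^ 2)
end

result = deriv_quotient(:(x / y), :x)
println(result)
```

<p>The quoted template builds the same expression tree as the nested <code>Expr</code> calls, but the shape of the rule stays visible in the code.</p>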

<p>Now that we have all of those basic rules of calculus implemented as functions, we&#8217;ll build up a lookup table that we can use to tell our final <code>differentiate</code> function where to send each expression based on the kind of function that&#8217;s being differentiated during each call to <code>differentiate</code>:</p>

<div class="wp_codebox"><table><tr id="p479332"><td class="line_numbers"><pre>1
2
3
4
5
6
</pre></td><td class="code" id="p4793code32"><pre class="c" style="font-family:monospace;">differentiate_lookup <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span>
                          <span style="color: #339933;">:+</span> <span style="color: #339933;">=&gt;</span> differentiate_sum<span style="color: #339933;">,</span>
                          <span style="color: #339933;">:-</span> <span style="color: #339933;">=&gt;</span> differentiate_subtraction<span style="color: #339933;">,</span>
                          <span style="color: #339933;">:*</span> <span style="color: #339933;">=&gt;</span> differentiate_product<span style="color: #339933;">,</span>
                          <span style="color: #339933;">:/</span> <span style="color: #339933;">=&gt;</span> differentiate_quotient
                       <span style="color: #009900;">&#125;</span></pre></td></tr></table></div>

<p>With all of the core machinery in place, the final definition of <code>differentiate</code> is very simple:</p>

<div class="wp_codebox"><table><tr id="p479333"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code" id="p4793code33"><pre class="c" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">function</span> differentiate<span style="color: #009900;">&#40;</span>ex<span style="color: #339933;">::</span><span style="color: #202020;">Expr</span><span style="color: #339933;">,</span> target<span style="color: #339933;">::</span><span style="color: #202020;">Symbol</span><span style="color: #009900;">&#41;</span>
    <span style="color: #b1b100;">if</span> ex.<span style="color: #202020;">head</span> <span style="color: #339933;">==</span> <span style="color: #339933;">:</span>call
        <span style="color: #b1b100;">if</span> has<span style="color: #009900;">&#40;</span>differentiate_lookup<span style="color: #339933;">,</span> ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#41;</span>
            <span style="color: #b1b100;">return</span> differentiate_lookup<span style="color: #009900;">&#91;</span>ex.<span style="color: #202020;">args</span><span style="color: #009900;">&#91;</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#93;</span><span style="color: #009900;">&#40;</span>ex<span style="color: #339933;">,</span> target<span style="color: #009900;">&#41;</span>
        <span style="color: #b1b100;">else</span>
            error<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;Don't know how to differentiate $(ex.args[1])&quot;</span><span style="color: #009900;">&#41;</span>
        end
    <span style="color: #b1b100;">else</span>
        error<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;Don't know how to differentiate expressions with head $(ex.head)&quot;</span><span style="color: #009900;">&#41;</span>
    end
end</pre></td></tr></table></div>
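<p>If you want to experiment with the lookup-table design on a current version of Julia, the following is a condensed sketch under a few assumptions: it uses Julia 1.x syntax (a <code>Dict</code> instead of the brace literal above), it covers only the Sum and Product Rules, and every name in it (<code>deriv</code>, <code>deriv_sum</code>, <code>deriv_product</code>, <code>RULES</code>) is made up for this example rather than taken from the Gist:</p>

```julia
# Base cases: constants differentiate to 0, the target symbol to 1.
deriv(x::Number, target::Symbol) = 0
deriv(s::Symbol, target::Symbol) = s == target ? 1 : 0

# Sum Rule: differentiate each term.
deriv_sum(ex, target) =
    Expr(:call, :+, [deriv(a, target) for a in ex.args[2:end]]...)

# Product Rule: one term per factor, with that factor replaced by its derivative.
function deriv_product(ex, target)
    factors = ex.args[2:end]
    terms = map(eachindex(factors)) do i
        parts = [j == i ? deriv(f, target) : f for (j, f) in enumerate(factors)]
        Expr(:call, :*, parts...)
    end
    return Expr(:call, :+, terms...)
end

# Lookup table from operator symbol to the rule that handles it.
const RULES = Dict{Symbol,Function}(:+ => deriv_sum, :* => deriv_product)

function deriv(ex::Expr, target::Symbol)
    ex.head == :call || error("Don't know how to differentiate $(ex.head) expressions")
    rule = get(RULES, ex.args[1], nothing)
    rule === nothing && error("Don't know how to differentiate $(ex.args[1])")
    return rule(ex, target)
end

println(deriv(:(x + x * x), :x))
```

<p>Supporting a new operator in this sketch means writing one more rule function and adding one more entry to <code>RULES</code>; the dispatching function itself does not change.</p>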

<p>I&#8217;ve put all of these snippets together in a single <a href="https://gist.github.com/4475902">GitHub</a> Gist. To try out this new differentiation function, let&#8217;s copy the contents of that GitHub Gist into a file called <code>differentiate.jl</code>. We can then load the contents of that file into Julia at the REPL using <code>include</code>, which will allow us to try out our differentiation tool:</p>

<div class="wp_codebox"><table><tr id="p479334"><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code" id="p4793code34"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> include<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;differentiate.jl&quot;</span><span style="color: #009900;">&#41;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> differentiate<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">+</span> x<span style="color: #339933;">*</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #339933;">:</span>x<span style="color: #009900;">&#41;</span>
<span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,+</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">,*</span><span style="color: #009900;">&#40;</span>x<span style="color: #339933;">,</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> differentiate<span style="color: #009900;">&#40;</span><span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span>x <span style="color: #339933;">+</span> a<span style="color: #339933;">*</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #339933;">:</span>x<span style="color: #009900;">&#41;</span>
<span style="color: #339933;">:</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">1</span><span style="color: #339933;">,+</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">*</span><span style="color: #009900;">&#40;</span><span style="color: #0000dd;">0</span><span style="color: #339933;">,</span>x<span style="color: #009900;">&#41;</span><span style="color: #339933;">,*</span><span style="color: #009900;">&#40;</span>a<span style="color: #339933;">,</span><span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<p>While the expressions that are constructed by our <code>differentiate</code> function are ugly, they are correct: they just need to be simplified so that things like <code>*(0, x)</code> are replaced with <code>0</code>. If you&#8217;d like to see how to write code to perform some basic simplifications, you can see the <a href="https://github.com/johnmyleswhite/Calculus.jl/blob/master/src/symbolic.jl"><code>simplify</code> function</a> I&#8217;ve been building for Julia&#8217;s new <a href="https://github.com/johnmyleswhite/Calculus.jl">Calculus package</a>. That codebase includes all of the functionality shown here for <code>differentiate</code>, along with several other rules that make the system more powerful.</p>
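<p>To give a flavor of what such a simplification pass might look like, here is a small sketch written for current Julia (1.x). It is not the <code>simplify</code> function from the Calculus package; <code>simplify_sketch</code> is a hypothetical helper that only knows how to remove zeros and ones from sums and products:</p>

```julia
# Numbers and symbols are already as simple as they can be.
simplify_sketch(x) = x

# Hypothetical helper: recursively remove 0s and 1s from sums and products.
# Rewriting 0 * x to 0 assumes x is finite (no Inf or NaN).
function simplify_sketch(ex::Expr)
    ex.head == :call || return ex
    op = ex.args[1]
    args = Any[simplify_sketch(a) for a in ex.args[2:end]]
    if op == :*
        any(a -> a == 0, args) && return 0     # anything times 0 is 0
        args = filter(a -> a != 1, args)       # drop factors equal to 1
        isempty(args) && return 1
    elseif op == :+
        args = filter(a -> a != 0, args)       # drop terms equal to 0
        isempty(args) && return 0
    else
        return Expr(:call, op, args...)        # leave other operators alone
    end
    return length(args) == 1 ? args[1] : Expr(:call, op, args...)
end

# The second derivative shown above, written in infix form.
println(simplify_sketch(:(1 + (0 * x + a * 1))))
```

<p>Applied to the second result above, this sketch collapses <code>+(1,+(*(0,x),*(a,1)))</code> down to the sum of <code>1</code> and <code>a</code>.</p>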
<p>What I love about Julia is the ease with which one can move from low-level bit operations like those described in my previous post to high-level operations that manipulate Julian expressions. By allowing the programmer to manipulate expressions programmatically, Julia has copied one of the most beautiful parts of Lisp.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/01/07/symbolic-differentiation-in-julia/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Computers are Machines</title>
		<link>http://www.johnmyleswhite.com/notebook/2013/01/03/computers-are-machines/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2013/01/03/computers-are-machines/#comments</comments>
		<pubDate>Thu, 03 Jan 2013 20:18:02 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4790</guid>
		<description><![CDATA[When people try out Julia for the first time, many of them are worried by the following example: 1 2 3 4 5 6 7 julia&#62; factorial&#40;n&#41; = n == 0 ? 1 : n * factorial&#40;n - 1&#41; &#160; julia&#62; factorial&#40;20&#41; 2432902008176640000 &#160; julia&#62; factorial&#40;21&#41; -4249290049419214848 If you&#8217;re not familiar with computer architecture, this [...]]]></description>
				<content:encoded><![CDATA[<p>When people try out Julia for the first time, many of them are worried by the following example:</p>

<div class="wp_codebox"><table><tr id="p479043"><td class="line_numbers"><pre>1
2
3
4
5
6
7
</pre></td><td class="code" id="p4790code43"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> factorial<span style="color: #009900;">&#40;</span>n<span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> n <span style="color: #339933;">==</span> <span style="color: #0000dd;">0</span> <span style="color: #339933;">?</span> <span style="color: #0000dd;">1</span> <span style="color: #339933;">:</span> n <span style="color: #339933;">*</span> factorial<span style="color: #009900;">&#40;</span>n <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> factorial<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">20</span><span style="color: #009900;">&#41;</span>
<span style="color: #0000dd;">2432902008176640000</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> factorial<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">21</span><span style="color: #009900;">&#41;</span>
<span style="color: #339933;">-</span><span style="color: #0000dd;">4249290049419214848</span></pre></td></tr></table></div>

<p>If you&#8217;re not familiar with computer architecture, this result is very troubling. <i>Why would Julia claim that the factorial of 21 is a negative number?</i></p>
<p>The answer is simple, but depends upon a set of concepts that are largely unfamiliar to programmers who, like me, grew up using modern languages like Python and Ruby. Julia thinks that the factorial of 21 is a negative number because <i>computers are machines</i>.</p>
<p>Because they are machines, computers represent numbers using many small groups of bits. Most modern machines work with groups of 64 bits at a time. If an operation has to work with more than 64 bits at a time, that operation will be slower than a similar operation that only works with 64 bits at a time.</p>
<p>As a result, if you want to write fast computer code, it helps to only execute operations that are easily expressible using groups of 64 bits.</p>
<p>Arithmetic involving small integers fits into the category of operations that only require 64 bits at a time. Every integer between <code>-9223372036854775808</code> and <code>9223372036854775807</code> can be expressed using just 64 bits. You can see this for yourself by using the <code>typemin</code> and <code>typemax</code> functions in Julia:</p>

<div class="wp_codebox"><table><tr id="p479044"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p4790code44"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> typemin<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span>
<span style="color: #339933;">-</span><span style="color: #0000dd;">9223372036854775808</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> typemax<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span>
<span style="color: #0000dd;">9223372036854775807</span></pre></td></tr></table></div>

<p>If you do things like the following, the computer will quickly produce correct results:</p>

<div class="wp_codebox"><table><tr id="p479045"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p4790code45"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> typemin<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span>
<span style="color: #339933;">-</span><span style="color: #0000dd;">9223372036854775807</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> typemax<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span>
<span style="color: #0000dd;">9223372036854775806</span></pre></td></tr></table></div>

<p>But things go badly if you try to break out of the range of numbers that can be represented using only 64 bits:</p>

<div class="wp_codebox"><table><tr id="p479046"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p4790code46"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> typemin<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> <span style="color: #0000dd;">1</span>
<span style="color: #0000dd;">9223372036854775807</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> typemax<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span>
<span style="color: #339933;">-</span><span style="color: #0000dd;">9223372036854775808</span></pre></td></tr></table></div>

<p>The reasons for this are not obvious at first, but make more sense if you examine the actual bits being operated upon:</p>

<div class="wp_codebox"><table><tr id="p479047"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
</pre></td><td class="code" id="p4790code47"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> bits<span style="color: #009900;">&#40;</span>typemax<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #ff0000;">&quot;0111111111111111111111111111111111111111111111111111111111111111&quot;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> bits<span style="color: #009900;">&#40;</span>typemax<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span><span style="color: #009900;">&#41;</span>
<span style="color: #ff0000;">&quot;1000000000000000000000000000000000000000000000000000000000000000&quot;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> bits<span style="color: #009900;">&#40;</span>typemin<span style="color: #009900;">&#40;</span>Int64<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #ff0000;">&quot;1000000000000000000000000000000000000000000000000000000000000000&quot;</span></pre></td></tr></table></div>

<p>When it adds 1 to a number, the computer blindly uses a simple arithmetic rule for individual bits that works just like the carry system you learned as a child. This carrying rule is very efficient, but works poorly if you end up flipping the very first bit in a group of 64 bits. The reason is that this first bit represents the sign of an integer. When this special first bit gets flipped by an operation that overflows the space provided by 64 bits, everything else breaks down.</p>
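<p>The same carrying behavior is easier to read with fewer bits. The sketch below repeats the experiment with 8-bit integers instead of 64-bit ones. It assumes current Julia (1.x), where the <code>bits</code> function is spelled <code>bitstring</code>, and <code>sign_bit</code> is a helper invented for this example:</p>

```julia
# Assumes Julia 1.x, where `bits` is spelled `bitstring`.
largest = typemax(Int8)            # 127
wrapped = largest + one(Int8)      # Int8 + Int8 stays Int8 and wraps around

println(bitstring(largest))        # 01111111
println(bitstring(wrapped))        # 10000000
println(wrapped)                   # -128

# In two's complement, the leading bit of the pattern carries the sign.
sign_bit(n::Integer) = bitstring(n)[1]
println(sign_bit(largest), " ", sign_bit(wrapped))
```

<p>The carry ripples through all seven low bits and lands in the leading bit, which is exactly what happens to the 64-bit values, just with a shorter bit pattern.</p>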
<p>The special interpretation given to certain bits in a group of 64 is the reason that the factorial of 21 is a negative number when Julia computes it. You can confirm this by looking at the exact bits involved:</p>

<div class="wp_codebox"><table><tr id="p479048"><td class="line_numbers"><pre>1
2
3
4
5
</pre></td><td class="code" id="p4790code48"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> bits<span style="color: #009900;">&#40;</span>factorial<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">20</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #ff0000;">&quot;0010000111000011011001110111110010000010101101000000000000000000&quot;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> bits<span style="color: #009900;">&#40;</span>factorial<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">21</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #ff0000;">&quot;1100010100000111011111010011011010111000110001000000000000000000&quot;</span></pre></td></tr></table></div>

<p>Here, as before, the computer has just executed the operations necessary to perform multiplication by 21. But the result has flipped the sign bit, which causes the result to appear to be a negative number.</p>
<p>There is a way around this: you can tell Julia to work with groups of more than 64 bits at a time when expressing integers using the <code>BigInt</code> type:</p>

<div class="wp_codebox"><table><tr id="p479049"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code" id="p4790code49"><pre class="c" style="font-family:monospace;">julia<span style="color: #339933;">&gt;</span> require<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;BigInt&quot;</span><span style="color: #009900;">&#41;</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> BigInt<span style="color: #009900;">&#40;</span>typemax<span style="color: #009900;">&#40;</span>Int<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>
<span style="color: #0000dd;">9223372036854775807</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> BigInt<span style="color: #009900;">&#40;</span>typemax<span style="color: #009900;">&#40;</span>Int<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span> <span style="color: #0000dd;">1</span>
<span style="color: #0000dd;">9223372036854775808</span>
&nbsp;
julia<span style="color: #339933;">&gt;</span> BigInt<span style="color: #009900;">&#40;</span>factorial<span style="color: #009900;">&#40;</span><span style="color: #0000dd;">20</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">*</span> <span style="color: #0000dd;">21</span>
<span style="color: #0000dd;">51090942171709440000</span></pre></td></tr></table></div>

<p>Now everything works smoothly. By working with <code>BigInt</code>-style integers automatically, languages like Python avoid these concerns:</p>

<div class="wp_codebox"><table><tr id="p479050"><td class="line_numbers"><pre>1
2
3
4
</pre></td><td class="code" id="p4790code50"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> factorial<span style="color: black;">&#40;</span><span style="color: #ff4500;">20</span><span style="color: black;">&#41;</span>
<span style="color: #ff4500;">2432902008176640000</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> factorial<span style="color: black;">&#40;</span><span style="color: #ff4500;">21</span><span style="color: black;">&#41;</span>
51090942171709440000L</pre></td></tr></table></div>

<p>The <code>L</code> at the end of the second number indicates that Python has automatically converted a normal integer into something like Julia&#8217;s <code>BigInt</code>. But this automatic conversion comes at a substantial cost: every operation that stays within the bounds of 64-bit arithmetic is slower in Python than in Julia because of the time required to check whether an operation might go beyond the 64-bit bound.</p>
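<p>To make that trade-off concrete, here is a rough Julia sketch of a check-then-promote strategy. It is only an illustration of the idea, not a description of how Python is implemented. It assumes current Julia (1.x), uses <code>Base.checked_mul</code> to detect overflow, and the function names <code>checked_factorial</code> and <code>safe_factorial</code> are hypothetical:</p>

```julia
# Hypothetical factorial that checks every multiplication for overflow.
function checked_factorial(n::Int64)
    result = Int64(1)
    for k in 2:n
        # Throws an OverflowError instead of silently wrapping around.
        result = Base.checked_mul(result, k)
    end
    return result
end

# Fall back to arbitrary precision only when the 64-bit computation overflows.
# Note that the return type now depends on the input value.
function safe_factorial(n::Int64)
    try
        return checked_factorial(n)
    catch err
        err isa OverflowError || rethrow()
        return factorial(big(n))
    end
end

println(safe_factorial(20))
println(safe_factorial(21))
```

<p>Every multiplication in <code>checked_factorial</code> pays for an extra overflow check, and <code>safe_factorial</code> can return either an <code>Int64</code> or a <code>BigInt</code>. Both are costs that plain 64-bit arithmetic avoids.</p>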
<p>Python&#8217;s automatic conversion approach is safer, but slower. Julia&#8217;s approach is faster, but requires that the programmer understand more about the computer&#8217;s architecture. Julia achieves its performance by confronting the fact that computers are machines head on. This is confusing at first and frustrating at times, but it&#8217;s a price that you have to pay for high performance computing. Everyone who grew up with C is used to these issues, but they&#8217;re largely unfamiliar to programmers who grew up with modern languages like Python. In many ways, Julia sets itself apart from other new languages by its attempt to recover some of the power that was lost in the transition from C to languages like Python. But the transition comes with a substantial learning curve.</p>
<p>And that&#8217;s why I wrote this post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2013/01/03/computers-are-machines/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>What is Correctness for Statistical Software?</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/12/14/what-is-correctness-for-statistical-software/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/12/14/what-is-correctness-for-statistical-software/#comments</comments>
		<pubDate>Fri, 14 Dec 2012 15:17:50 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4787</guid>
		<description><![CDATA[Introduction A few months ago, Drew Conway and I gave a webcast that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems. The simplest data set we used was the following: This data set has [...]]]></description>
				<content:encoded><![CDATA[<h3>Introduction</h3>
<p>A few months ago, <a href="http://www.drewconway.com/Drew_Conway/About.html">Drew Conway</a> and I gave <a href="http://oreillynet.com/pub/e/2353">a webcast</a> that tried to teach people about the basic principles behind linear and logistic regression. To illustrate logistic regression, we worked through a series of progressively more complex spam detection problems.</p>
<p>The simplest data set we used was the following:</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/12/spam2.jpg" alt="Spam2" title="spam2.jpg" border="0" width="576" height="360" /><br />
</center></p>
<p>This data set has one clear virtue: the correct classifier defines a decision boundary that implements a simple <code>OR</code> operation on the values of <code>MentionsViagra</code> and <code>MentionsNigeria</code>. Unfortunately, that very simplicity causes the logistic regression model to break down, because the MLE coefficients for <code>MentionsViagra</code> and <code>MentionsNigeria</code> should be infinite. In some ways, our elegantly simple example for logistic regression is actually the statistical equivalent of a SQL injection.</p>
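<p>One way to see why the coefficients want to be infinite is to evaluate the log-likelihood along a direction that separates the classes. The sketch below uses a made-up four-row data set that follows the same <code>OR</code> pattern; it is not the data from the webcast, and <code>sigmoid</code> and <code>loglik</code> are helper names chosen for this example (current Julia syntax assumed):</p>

```julia
# Hypothetical toy data: an email is spam exactly when it mentions either term.
# Columns: MentionsViagra, MentionsNigeria.
X = [0 0; 1 0; 0 1; 1 1]
y = [0, 1, 1, 1]

sigmoid(z) = 1 / (1 + exp(-z))

# Log-likelihood of a logistic regression with intercept b0 and slopes b1, b2.
function loglik(b0, b1, b2)
    total = 0.0
    for i in 1:size(X, 1)
        p = sigmoid(b0 + b1 * X[i, 1] + b2 * X[i, 2])
        total += y[i] == 1 ? log(p) : log(1 - p)
    end
    return total
end

# Scale the coefficients along one separating direction, (-1, 2, 2).
for c in (1, 5, 10, 20)
    println(c, " => ", loglik(-c, 2c, 2c))
end
```

<p>The log-likelihood keeps creeping toward zero as the scale <code>c</code> grows, so no finite set of coefficients attains the maximum.</p>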
<p>In our webcast, Drew and I decided to ignore that concern because R produces a useful model fit despite the theoretical MLE coefficients being infinite:</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/12/ToyClassificationResults.jpg" alt="ToyClassificationResults" title="ToyClassificationResults.jpg" border="0" width="600" height="342" /><br />
</center></p>
<p>Although R produces finite coefficients here despite theory telling us to expect something else, I should note that R does produce a somewhat cryptic warning during the model-fitting step that alerts the very well-informed user that something has gone awry:</p>

<div class="wp_codebox"><table><tr id="p478753"><td class="line_numbers"><pre>1
</pre></td><td class="code" id="p4787code53"><pre class="c" style="font-family:monospace;">glm.<span style="color: #202020;">fit</span><span style="color: #339933;">:</span> fitted probabilities numerically <span style="color: #0000dd;">0</span> or <span style="color: #0000dd;">1</span> occurred</pre></td></tr></table></div>

<p>It seems clear to me that R&#8217;s warning would be better off if it were substantially more verbose:</p>

<div class="wp_codebox"><table><tr id="p478754"><td class="line_numbers"><pre>1
2
3
4
5
6
</pre></td><td class="code" id="p4787code54"><pre class="c" style="font-family:monospace;">Warning from glm.<span style="color: #202020;">fit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">:</span>
&nbsp;
Fitted probabilities could not be distinguished from <span style="color: #0000dd;">0</span><span style="color: #ff0000;">'s or 1'</span>s 
under finite precision floating point arithmetic. <span style="color: #202020;">As</span> a result<span style="color: #339933;">,</span> the 
optimization algorithm <span style="color: #b1b100;">for</span> GLM fitting may have failed to converge.
<span style="color: #202020;">You</span> should check whether your data set is linearly separable.</pre></td></tr></table></div>
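<p>As a rough illustration of how such a check could be wired up, here is a Python sketch that flags fitted probabilities within a small tolerance of 0 or 1 and emits the longer message. This is not R's source code: the function name <code>check_fitted_probabilities</code> is hypothetical, and the default tolerance of ten machine epsilons is an assumption about the kind of threshold such a check might use.</p>

```python
import sys
import warnings

VERBOSE_MESSAGE = (
    "Fitted probabilities could not be distinguished from 0's or 1's "
    "under finite precision floating point arithmetic. As a result, the "
    "optimization algorithm for GLM fitting may have failed to converge. "
    "You should check whether your data set is linearly separable."
)

def check_fitted_probabilities(probs, eps=10 * sys.float_info.epsilon):
    """Warn and return True if any fitted probability is within eps of 0 or 1.

    The default eps is an assumed tolerance, not a documented R constant.
    """
    saturated = any(p > 1 - eps or eps > p for p in probs)
    if saturated:
        warnings.warn(VERBOSE_MESSAGE, RuntimeWarning)
    return saturated

# Example: the second probability is numerically 1, so the warning fires.
check_fitted_probabilities([0.2, 1.0, 0.7])
```

<p>Returning a flag as well as warning lets the caller decide whether to treat the situation as an error, which is the design question the rest of this post turns on.</p>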

<h3>Broader Questions</h3>
<p>Although I&#8217;ve started this piece with a very focused example of how R&#8217;s implementation of logistic regression differs from the purely mathematical definition of that model, I&#8217;m not really that interested in the details of how different pieces of software implement logistic regression. If you&#8217;re interested in learning more about that kind of thing, I&#8217;d suggest reading the excellent piece on R&#8217;s logistic regression function that can be found on the <a href="http://www.win-vector.com/blog/2012/08/how-robust-is-logistic-regression/">Win-Vector blog</a>.</p>
<p>Instead, what interests me right now is a set of broader questions about how statistical software should work. What is the standard for correctness for statistical software? And what is the standard for usefulness? And how closely related are those two criteria?</p>
<p>Let&#8217;s think about each of them separately:</p>
<ul>
<li><i>Usefulness</i>: If you want to simply make predictions based on your model, then you want R to produce a fitted model for this data set that makes reasonably good predictions on the training data. R achieves that goal: the fitted predictions for R&#8217;s logistic regression model are numerically almost indistinguishable from the 0/1 values that we would expect from a maximum likelihood algorithm. If you want useful algorithms, then R&#8217;s decision to produce some model fit is justified.</li>
<li><i>Correctness</i>: If you want software to either produce mathematically correct answers or to die trying, then R&#8217;s implementation of logistic regression is not for you. If you insist on theoretical purity, it seems clear that R should not merely emit a warning here, but should instead throw an inescapable error rather than return an imperfect model fit. You might even want R to go further and to teach the end-user about the virtues of SVMs or the general usefulness of parameter regularization. Whatever you&#8217;d like to see, one thing is sure: you definitely do not want R to produce model fits that are mathematically incorrect.</li>
</ul>
<p>It&#8217;s remarkable that such a simple example can bring the goals of predictive power and theoretical correctness into such direct opposition. In part, the conflict arises here because those two goals are linked by a third consideration: computer algorithms are not generally equivalent to their mathematical idealizations. Purely computational concerns involving floating-point imprecision and finite compute time mean that we cannot generally hope for computers to produce answers identical to those prescribed by theoretical mathematics.</p>
<p>What&#8217;s fascinating about this specific example is that there&#8217;s something strangely desirable about floating-point numbers having finite precision: no one with any practical interest in modeling is likely to be interested in fitting a model with infinite-valued parameters. R&#8217;s decision to blindly run an optimization algorithm here unwittingly achieves a form of regularization like that employed in early stopping algorithms for fitting neural networks. And that may be a good thing if you&#8217;re interested in using a fitted model to make predictions, even though it means that R produces quantities like standard errors that have no real coherent interpretation in terms of frequentist estimators.</p>
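<p>The early-stopping analogy can be sketched in a few lines of Python. The sketch below uses plain gradient ascent with a fixed iteration budget on a hypothetical four-row OR data set; it is a swapped-in technique chosen for simplicity, not the fitting algorithm R itself uses, and the names <code>fit</code> and <code>predict</code>, the step size and the iteration counts are all assumptions.</p>

```python
import math

# Hypothetical stand-in data (an assumption): IsSpam = MentionsViagra OR MentionsNigeria.
ROWS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, viagra, nigeria):
    """Fitted spam probability for coefficients w = [intercept, viagra, nigeria]."""
    return sigmoid(w[0] + w[1] * viagra + w[2] * nigeria)

def fit(n_steps, step_size=0.5):
    """Run gradient ascent on the log-likelihood for exactly n_steps iterations."""
    w = [0.0, 0.0, 0.0]
    for _ in range(n_steps):
        grad = [0.0, 0.0, 0.0]
        for (viagra, nigeria), is_spam in ROWS:
            x = (1.0, viagra, nigeria)
            err = is_spam - predict(w, viagra, nigeria)
            grad = [g + err * xi for g, xi in zip(grad, x)]
        w = [wi + step_size * g for wi, g in zip(w, grad)]
    return w

# Compare a small and a large iteration budget.
for n_steps in (100, 10000):
    w = fit(n_steps)
    probs = [predict(w, v, n) for (v, n), _ in ROWS]
    print(n_steps, [round(c, 2) for c in w], [round(p, 4) for p in probs])
```

<p>Under these assumptions the coefficients keep growing as the iteration budget grows, while the fitted probabilities drift toward 0 and 1. Stopping after finitely many steps is the only thing that keeps the coefficients finite, which is the sense in which a blindly run optimizer acts like a regularizer.</p>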
<p>Whatever your take is on the virtues or vices of R&#8217;s implementation of logistic regression, there&#8217;s a broad takeaway from this example that I&#8217;ve been dealing with constantly while working on Julia: <i>any programmer designing statistical software has to make decisions that involve personal judgment</i>. The requirement for striking a compromise between correctness and usefulness is so nearly omnipresent that one of the most popular pieces of statistical software on Earth implements logistic regression using an algorithm that a pure theorist could argue is basically broken. But it produces an answer that has practical value. And that might just be the more important thing for statistical software to do.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/12/14/what-is-correctness-for-statistical-software/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>What is Economics Studying?</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/12/10/what-is-economics-studying/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/12/10/what-is-economics-studying/#comments</comments>
		<pubDate>Tue, 11 Dec 2012 04:28:43 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Economics]]></category>
		<category><![CDATA[Psychology]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4782</guid>
		<description><![CDATA[Having spent all five of my years as a graduate student trying to get psychologists and economists to agree on basic ideas about decision-making, I think the following two pieces complement one another perfectly: Cosma Shalizi&#8217;s comments on rereading Blanchard and Fischer&#8217;s &#8220;Lectures on Macroeconomics&#8221;: Blanchard and Fischer is about &#8220;modern&#8221; macro, models based on [...]]]></description>
				<content:encoded><![CDATA[<p>Having spent all five of my years as a graduate student trying to get psychologists and economists to agree on basic ideas about decision-making, I think the following two pieces complement one another perfectly:</p>
<ul>
<li>Cosma Shalizi&#8217;s comments <a href="http://masi.cscs.lsa.umich.edu/~crshalizi/weblog/algae-2012-10.html">on rereading Blanchard and Fischer&#8217;s &#8220;Lectures on Macroeconomics&#8221;</a>:<br />
<blockquote><p>
Blanchard and Fischer is about &#8220;modern&#8221; macro, models based on agents who know what the economy is like optimizing over time, possible under some limits. This is the DSGE style of macro, which has lately come into so much discredit — thoroughly deserved discredit. Chaikin and Lubensky is about modern condensed matter physics, especially soft condensed matter, based on principles of symmetry-breaking and phase transitions. Both books are about building stylized theoretical models and solving them to see what they imply; implicitly they are also about the considerations which go into building models in their respective domains.</p>
<p>What is very striking, looking at them side by side, is that while these are both books about mathematical modeling, Chaikin and Lubensky presents empirical data, compares theoretical predictions to experimental results, and goes into some detail into the considerations which lead to this sort of model for nematic liquid crystals, or that model for magnetism. There is absolutely nothing like this in Blanchard and Fischer — no data at all, no comparison of models to reality, no evidence of any kind supporting any of the models. There is not even an attempt, that I can find, to assess different macroeconomic models, by comparing their qualitative predictions to each other and to historical reality. I presume that Blanchard and Fischer, as individual scholars, are not quite so indifferent to reality, but their pedagogy is.</p>
<p>I will leave readers to draw their own morals.
</p></blockquote>
</li>
<li>Itzhak Gilboa&#8217;s argument that <a href="http://www.paristechreview.com/2012/12/03/rhetoric-in-economics/">economic theory is a rhetoric apparatus</a> rather than a set of direct predictions about the world in which we live.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/12/10/what-is-economics-studying/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Cheap Criticism of p-Values</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/12/06/a-cheap-criticism-of-p-values/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/12/06/a-cheap-criticism-of-p-values/#comments</comments>
		<pubDate>Thu, 06 Dec 2012 16:39:00 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Psychology]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4779</guid>
		<description><![CDATA[One of these days I am going to finish my series on problems with how NHST is used in the social sciences. Until then, I came up with a cheap criticism of p-values today. To make sense of my complaint, you&#8217;ll want to head over to Andy Gelman&#8217;s blog and read the comments on his [...]]]></description>
				<content:encoded><![CDATA[<p>One of these days I am going to finish my series on problems with how NHST is used in the social sciences. Until then, I came up with a cheap criticism of p-values today.</p>
<p>To make sense of my complaint, you&#8217;ll want to head over to Andy Gelman&#8217;s blog and read the comments on <a href="http://andrewgelman.com/2012/12/the-p-value-is-not/">his recent blog post about p-values</a>. Reading them makes one thing clear: not even a large group of stats wonks can agree on how to think about p-values. How could we ever hope for understanding from the kind of people who are only reporting p-values because they&#8217;re forced to do so by their fields?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/12/06/a-cheap-criticism-of-p-values/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>The State of Statistics in Julia</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/12/02/the-state-of-statistics-in-julia/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/12/02/the-state-of-statistics-in-julia/#comments</comments>
		<pubDate>Sun, 02 Dec 2012 16:51:24 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Julia]]></category>
		<category><![CDATA[Programming]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4753</guid>
		<description><![CDATA[Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski. Introduction Over the last few weeks, the Julia core team has rolled out a demo version of Julia&#8217;s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to [...]]]></description>
				<content:encoded><![CDATA[<p><b>Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.</b></p>
<h3>Introduction</h3>
<p>Over the last few weeks, the Julia core team has rolled out a demo version of Julia&#8217;s <a href="https://github.com/JuliaLang/METADATA.jl">package management system</a>. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia&#8217;s growing community of developers is heading.</p>
<p>To celebrate some of the amazing work that&#8217;s already been done to make Julia usable for day-to-day data analysis, I&#8217;d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next-generation language for data analysis.</p>
<h3>Getting Julia Installed</h3>
<p>If you&#8217;d like to try out Julia for yourself, you&#8217;ll first need to clone the current Julia repo from <a href="https://github.com/JuliaLang/julia">GitHub</a> and then build Julia from source as described in the Julia <a href="https://github.com/JuliaLang/julia/blob/master/README.md">README</a>. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you&#8217;ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing <code>julia</code> at the command line.</p>
<h3>Installing Packages</h3>
<p>Once Julia&#8217;s REPL is running, you can use the following commands to start installing packages:</p>

<div class="wp_codebox"><table><tr id="p475364"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
</pre></td><td class="code" id="p4753code64"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> require<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;pkg&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> Pkg.<span style="color: black;">init</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
Initialized empty Git repository <span style="color: #ff7700;font-weight:bold;">in</span> /Users/johnmyleswhite/.<span style="color: black;">julia</span>/.<span style="color: black;">git</span>/
Cloning into <span style="color: #483d8b;">'METADATA'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">443</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">208</span>/<span style="color: #ff4500;">208</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">443</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">53</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">423</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">33</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">443</span>/<span style="color: #ff4500;">443</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">38.98</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">53</span>/<span style="color: #ff4500;">53</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">&#91;</span>master <span style="color: black;">&#40;</span>root-commit<span style="color: black;">&#41;</span> dbd486e<span style="color: black;">&#93;</span> empty package repo
 <span style="color: #ff4500;">2</span> files changed, <span style="color: #ff4500;">4</span> insertions<span style="color: black;">&#40;</span>+<span style="color: black;">&#41;</span>
 create mode <span style="color: #ff4500;">100644</span> .<span style="color: black;">gitmodules</span>
 create mode <span style="color: #ff4500;">160000</span> METADATA
 create mode <span style="color: #ff4500;">100644</span> REQUIRE
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> Pkg.<span style="color: black;">add</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;DataFrames&quot;</span>, <span style="color: #483d8b;">&quot;Distributions&quot;</span>, <span style="color: #483d8b;">&quot;MCMC&quot;</span>, <span style="color: #483d8b;">&quot;Optim&quot;</span>, <span style="color: #483d8b;">&quot;NHST&quot;</span>, <span style="color: #483d8b;">&quot;Clustering&quot;</span><span style="color: black;">&#41;</span>
Installing DataFrames: v0.0.0
Cloning into <span style="color: #483d8b;">'DataFrames'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">1340</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">562</span>/<span style="color: #ff4500;">562</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">1340</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">760</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">1229</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">655</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">1340</span>/<span style="color: #ff4500;">1340</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">494.79</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">760</span>/<span style="color: #ff4500;">760</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Installing</span> Distributions: v0.0.0
Cloning into <span style="color: #483d8b;">'Distributions'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">49</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">30</span>/<span style="color: #ff4500;">30</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">49</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">49</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">49</span>/<span style="color: #ff4500;">49</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">17.29</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">8</span>/<span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Installing</span> MCMC: v0.0.0
Cloning into <span style="color: #483d8b;">'MCMC'</span>...
<span style="color: black;">warning</span>: no common commits
remote: Counting objects: <span style="color: #ff4500;">155</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">97</span>/<span style="color: #ff4500;">97</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">155</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">66</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">140</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">51</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">155</span>/<span style="color: #ff4500;">155</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">256.68</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">66</span>/<span style="color: #ff4500;">66</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Installing</span> NHST: v0.0.0
Cloning into <span style="color: #483d8b;">'NHST'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">20</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">18</span>/<span style="color: #ff4500;">18</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">20</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">19</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">20</span>/<span style="color: #ff4500;">20</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">4.31</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span>/<span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Installing</span> Optim: v0.0.0
Cloning into <span style="color: #483d8b;">'Optim'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">497</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">191</span>/<span style="color: #ff4500;">191</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">497</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">318</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">476</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">297</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">497</span>/<span style="color: #ff4500;">497</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">79.68</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">318</span>/<span style="color: #ff4500;">318</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Installing</span> Options: v0.0.0
Cloning into <span style="color: #483d8b;">'Options'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">10</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">8</span>/<span style="color: #ff4500;">8</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">10</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">6</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">0</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">10</span>/<span style="color: #ff4500;">10</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span>/<span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">Installing</span> Clustering: v0.0.0
Cloning into <span style="color: #483d8b;">'Clustering'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">38</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">28</span>/<span style="color: #ff4500;">28</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">38</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">7</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">38</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">7</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">38</span>/<span style="color: #ff4500;">38</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">7.77</span> KiB, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">7</span>/<span style="color: #ff4500;">7</span><span style="color: black;">&#41;</span>, done.</pre></td></tr></table></div>

<p>That will get you started with some of the core tools for doing statistical programming in Julia. You&#8217;ll probably also want to install another package called &#8220;RDatasets&#8221;, which provides access to 570 of the classic data sets available in R. This package has a much larger file size than the others, which is why I recommend installing it after you&#8217;ve first installed the other packages:</p>

<div class="wp_codebox"><table><tr id="p475365"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code" id="p4753code65"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> require<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;pkg&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> Pkg.<span style="color: black;">add</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;RDatasets&quot;</span><span style="color: black;">&#41;</span>
Installing RDatasets: v0.0.0
Cloning into <span style="color: #483d8b;">'RDatasets'</span>...
<span style="color: black;">remote</span>: Counting objects: <span style="color: #ff4500;">609</span>, done.
<span style="color: black;">remote</span>: Compressing objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">588</span>/<span style="color: #ff4500;">588</span><span style="color: black;">&#41;</span>, done.
<span style="color: black;">remote</span>: Total <span style="color: #ff4500;">609</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">21</span><span style="color: black;">&#41;</span>, reused <span style="color: #ff4500;">605</span> <span style="color: black;">&#40;</span>delta <span style="color: #ff4500;">17</span><span style="color: black;">&#41;</span>
Receiving objects: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">609</span>/<span style="color: #ff4500;">609</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">10.56</span> MiB | <span style="color: #ff4500;">1.15</span> MiB/s, done.
<span style="color: black;">Resolving</span> deltas: <span style="color: #ff4500;">100</span><span style="color: #66cc66;">%</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">21</span>/<span style="color: #ff4500;">21</span><span style="color: black;">&#41;</span>, done.</pre></td></tr></table></div>

<p>Assuming that you&#8217;ve gotten everything working, you can then type the following to load Fisher&#8217;s classic Iris data set:</p>

<div class="wp_codebox"><table><tr id="p475366"><td class="line_numbers"><pre>1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
</pre></td><td class="code" id="p4753code66"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;RDatasets&quot;</span><span style="color: black;">&#41;</span>
<span style="color: #008000;">Warning</span>: redefinition of constant NARule ignored.
<span style="color: #008000;">Warning</span>: New definition ==<span style="color: black;">&#40;</span>NAtype,Any<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> ==<span style="color: black;">&#40;</span>Any,AbstractArray<span style="color: black;">&#123;</span>T,N<span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure ==<span style="color: black;">&#40;</span>NAtype,AbstractArray<span style="color: black;">&#123;</span>T,N<span style="color: black;">&#125;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
<span style="color: #008000;">Warning</span>: New definition ==<span style="color: black;">&#40;</span>Any,NAtype<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> ==<span style="color: black;">&#40;</span>AbstractArray<span style="color: black;">&#123;</span>T,N<span style="color: black;">&#125;</span>,Any<span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure ==<span style="color: black;">&#40;</span>AbstractArray<span style="color: black;">&#123;</span>T,N<span style="color: black;">&#125;</span>,NAtype<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
<span style="color: #008000;">Warning</span>: New definition replace<span style="color: #66cc66;">!</span><span style="color: black;">&#40;</span>PooledDataVec<span style="color: black;">&#123;</span>S<span style="color: black;">&#125;</span>,NAtype,T<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> replace<span style="color: #66cc66;">!</span><span style="color: black;">&#40;</span>PooledDataVec<span style="color: black;">&#123;</span>S<span style="color: black;">&#125;</span>,T,NAtype<span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure replace<span style="color: #66cc66;">!</span><span style="color: black;">&#40;</span>PooledDataVec<span style="color: black;">&#123;</span>S<span style="color: black;">&#125;</span>,NAtype,NAtype<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
<span style="color: #008000;">Warning</span>: New definition promote_rule<span style="color: black;">&#40;</span>Type<span style="color: black;">&#123;</span>AbstractDataVec<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span><span style="color: black;">&#125;</span>,Type<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> promote_rule<span style="color: black;">&#40;</span>Type<span style="color: black;">&#123;</span>AbstractDataVec<span style="color: black;">&#123;</span>S<span style="color: black;">&#125;</span><span style="color: black;">&#125;</span>,Type<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure promote_rule<span style="color: black;">&#40;</span>Type<span style="color: black;">&#123;</span>AbstractDataVec<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span><span style="color: black;">&#125;</span>,Type<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
<span style="color: #008000;">Warning</span>: New definition ^<span style="color: black;">&#40;</span>NAtype,T<span style="color: #66cc66;">&lt;</span>:Union<span style="color: black;">&#40;</span>String,Number<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> ^<span style="color: black;">&#40;</span>Any,Integer<span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure ^<span style="color: black;">&#40;</span>NAtype,_<span style="color: #66cc66;">&lt;</span>:Integer<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
<span style="color: #008000;">Warning</span>: New definition ^<span style="color: black;">&#40;</span>DataVec<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span>,Number<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> ^<span style="color: black;">&#40;</span>Any,Integer<span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure ^<span style="color: black;">&#40;</span>DataVec<span style="color: black;">&#123;</span>T<span style="color: black;">&#125;</span>,Integer<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
<span style="color: #008000;">Warning</span>: New definition ^<span style="color: black;">&#40;</span>DataFrame,Union<span style="color: black;">&#40;</span>NAtype,Number<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> ambiguous <span style="color: #ff7700;font-weight:bold;">with</span> ^<span style="color: black;">&#40;</span>Any,Integer<span style="color: black;">&#41;</span>.
         <span style="color: black;">Make</span> sure ^<span style="color: black;">&#40;</span>DataFrame,Integer<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">is</span> defined first.
&nbsp;
<span style="color: black;">julia</span><span style="color: #66cc66;">&gt;</span> using DataFrames
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> using RDatasets
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> iris = data<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;datasets&quot;</span>, <span style="color: #483d8b;">&quot;iris&quot;</span><span style="color: black;">&#41;</span>
DataFrame  <span style="color: black;">&#40;</span><span style="color: #ff4500;">150</span>,<span style="color: #ff4500;">6</span><span style="color: black;">&#41;</span>
              Sepal.<span style="color: black;">Length</span> Sepal.<span style="color: black;">Width</span> Petal.<span style="color: black;">Length</span> Petal.<span style="color: black;">Width</span>     Species
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">1</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">3.5</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">2</span>          <span style="color: #ff4500;">4.9</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">3</span>          <span style="color: #ff4500;">4.7</span>         <span style="color: #ff4500;">3.2</span>          <span style="color: #ff4500;">1.3</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">4</span>          <span style="color: #ff4500;">4.6</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">5</span>          <span style="color: #ff4500;">5.0</span>         <span style="color: #ff4500;">3.6</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">6</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">6</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">3.9</span>          <span style="color: #ff4500;">1.7</span>         <span style="color: #ff4500;">0.4</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">7</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">7</span>          <span style="color: #ff4500;">4.6</span>         <span style="color: #ff4500;">3.4</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.3</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">8</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">8</span>          <span style="color: #ff4500;">5.0</span>         <span style="color: #ff4500;">3.4</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">9</span>,<span style="color: black;">&#93;</span>        <span style="color: #ff4500;">9</span>          <span style="color: #ff4500;">4.4</span>         <span style="color: #ff4500;">2.9</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">10</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">10</span>          <span style="color: #ff4500;">4.9</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.1</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">11</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">11</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">3.7</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">12</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">12</span>          <span style="color: #ff4500;">4.8</span>         <span style="color: #ff4500;">3.4</span>          <span style="color: #ff4500;">1.6</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">13</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">13</span>          <span style="color: #ff4500;">4.8</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.1</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">14</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">14</span>          <span style="color: #ff4500;">4.3</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">1.1</span>         <span style="color: #ff4500;">0.1</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">15</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">15</span>          <span style="color: #ff4500;">5.8</span>         <span style="color: #ff4500;">4.0</span>          <span style="color: #ff4500;">1.2</span>         <span style="color: #ff4500;">0.2</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">16</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">16</span>          <span style="color: #ff4500;">5.7</span>         <span style="color: #ff4500;">4.4</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.4</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">17</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">17</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">3.9</span>          <span style="color: #ff4500;">1.3</span>         <span style="color: #ff4500;">0.4</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">18</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">18</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">3.5</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.3</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">19</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">19</span>          <span style="color: #ff4500;">5.7</span>         <span style="color: #ff4500;">3.8</span>          <span style="color: #ff4500;">1.7</span>         <span style="color: #ff4500;">0.3</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">20</span>,<span style="color: black;">&#93;</span>      <span style="color: #ff4500;">20</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">3.8</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.3</span>    <span style="color: #483d8b;">&quot;setosa&quot;</span>
  :
<span style="color: black;">&#91;</span><span style="color: #ff4500;">131</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">131</span>          <span style="color: #ff4500;">7.4</span>         <span style="color: #ff4500;">2.8</span>          <span style="color: #ff4500;">6.1</span>         <span style="color: #ff4500;">1.9</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">132</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">132</span>          <span style="color: #ff4500;">7.9</span>         <span style="color: #ff4500;">3.8</span>          <span style="color: #ff4500;">6.4</span>         <span style="color: #ff4500;">2.0</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">133</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">133</span>          <span style="color: #ff4500;">6.4</span>         <span style="color: #ff4500;">2.8</span>          <span style="color: #ff4500;">5.6</span>         <span style="color: #ff4500;">2.2</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">134</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">134</span>          <span style="color: #ff4500;">6.3</span>         <span style="color: #ff4500;">2.8</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">1.5</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">135</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">135</span>          <span style="color: #ff4500;">6.1</span>         <span style="color: #ff4500;">2.6</span>          <span style="color: #ff4500;">5.6</span>         <span style="color: #ff4500;">1.4</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">136</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">136</span>          <span style="color: #ff4500;">7.7</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">6.1</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">137</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">137</span>          <span style="color: #ff4500;">6.3</span>         <span style="color: #ff4500;">3.4</span>          <span style="color: #ff4500;">5.6</span>         <span style="color: #ff4500;">2.4</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">138</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">138</span>          <span style="color: #ff4500;">6.4</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">5.5</span>         <span style="color: #ff4500;">1.8</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">139</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">139</span>          <span style="color: #ff4500;">6.0</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">4.8</span>         <span style="color: #ff4500;">1.8</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">140</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">140</span>          <span style="color: #ff4500;">6.9</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">2.1</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">141</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">141</span>          <span style="color: #ff4500;">6.7</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">5.6</span>         <span style="color: #ff4500;">2.4</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">142</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">142</span>          <span style="color: #ff4500;">6.9</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">143</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">143</span>          <span style="color: #ff4500;">5.8</span>         <span style="color: #ff4500;">2.7</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">1.9</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">144</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">144</span>          <span style="color: #ff4500;">6.8</span>         <span style="color: #ff4500;">3.2</span>          <span style="color: #ff4500;">5.9</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">145</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">145</span>          <span style="color: #ff4500;">6.7</span>         <span style="color: #ff4500;">3.3</span>          <span style="color: #ff4500;">5.7</span>         <span style="color: #ff4500;">2.5</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">146</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">146</span>          <span style="color: #ff4500;">6.7</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">5.2</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">147</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">147</span>          <span style="color: #ff4500;">6.3</span>         <span style="color: #ff4500;">2.5</span>          <span style="color: #ff4500;">5.0</span>         <span style="color: #ff4500;">1.9</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">148</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">148</span>          <span style="color: #ff4500;">6.5</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">5.2</span>         <span style="color: #ff4500;">2.0</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">149</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">149</span>          <span style="color: #ff4500;">6.2</span>         <span style="color: #ff4500;">3.4</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">150</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">150</span>          <span style="color: #ff4500;">5.9</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">1.8</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> head<span style="color: black;">&#40;</span>iris<span style="color: black;">&#41;</span>
DataFrame  <span style="color: black;">&#40;</span><span style="color: #ff4500;">6</span>,<span style="color: #ff4500;">6</span><span style="color: black;">&#41;</span>
          Sepal.<span style="color: black;">Length</span> Sepal.<span style="color: black;">Width</span> Petal.<span style="color: black;">Length</span> Petal.<span style="color: black;">Width</span>  Species
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">1</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">3.5</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span> <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">2</span>          <span style="color: #ff4500;">4.9</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span> <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">3</span>          <span style="color: #ff4500;">4.7</span>         <span style="color: #ff4500;">3.2</span>          <span style="color: #ff4500;">1.3</span>         <span style="color: #ff4500;">0.2</span> <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">4</span>          <span style="color: #ff4500;">4.6</span>         <span style="color: #ff4500;">3.1</span>          <span style="color: #ff4500;">1.5</span>         <span style="color: #ff4500;">0.2</span> <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">5</span>          <span style="color: #ff4500;">5.0</span>         <span style="color: #ff4500;">3.6</span>          <span style="color: #ff4500;">1.4</span>         <span style="color: #ff4500;">0.2</span> <span style="color: #483d8b;">&quot;setosa&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">6</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">6</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">3.9</span>          <span style="color: #ff4500;">1.7</span>         <span style="color: #ff4500;">0.4</span> <span style="color: #483d8b;">&quot;setosa&quot;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> tail<span style="color: black;">&#40;</span>iris<span style="color: black;">&#41;</span>
DataFrame  <span style="color: black;">&#40;</span><span style="color: #ff4500;">6</span>,<span style="color: #ff4500;">6</span><span style="color: black;">&#41;</span>
            Sepal.<span style="color: black;">Length</span> Sepal.<span style="color: black;">Width</span> Petal.<span style="color: black;">Length</span> Petal.<span style="color: black;">Width</span>     Species
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">145</span>          <span style="color: #ff4500;">6.7</span>         <span style="color: #ff4500;">3.3</span>          <span style="color: #ff4500;">5.7</span>         <span style="color: #ff4500;">2.5</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">146</span>          <span style="color: #ff4500;">6.7</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">5.2</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">147</span>          <span style="color: #ff4500;">6.3</span>         <span style="color: #ff4500;">2.5</span>          <span style="color: #ff4500;">5.0</span>         <span style="color: #ff4500;">1.9</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">4</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">148</span>          <span style="color: #ff4500;">6.5</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">5.2</span>         <span style="color: #ff4500;">2.0</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">5</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">149</span>          <span style="color: #ff4500;">6.2</span>         <span style="color: #ff4500;">3.4</span>          <span style="color: #ff4500;">5.4</span>         <span style="color: #ff4500;">2.3</span> <span style="color: #483d8b;">&quot;virginica&quot;</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">6</span>,<span style="color: black;">&#93;</span>    <span style="color: #ff4500;">150</span>          <span style="color: #ff4500;">5.9</span>         <span style="color: #ff4500;">3.0</span>          <span style="color: #ff4500;">5.1</span>         <span style="color: #ff4500;">1.8</span> <span style="color: #483d8b;">&quot;virginica&quot;</span></pre></td></tr></table></div>

<p>Now that you&#8217;ve seen that Julia can handle complex data sets, let&#8217;s talk a little bit about the packages that make statistical analysis in Julia possible.</p>
<h3>The DataFrames Package</h3>
<p>The <a href="">DataFrames</a> package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns that may contain missing data, which are called <code>DataVec</code>s. A collection of <code>DataVec</code>s can be combined into a <code>DataFrame</code>, which provides a tabular data structure like R&#8217;s <code>data.frame</code> type.</p>
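
<p>To make the idea concrete, here is a minimal sketch of a column that tracks missing entries with a Boolean mask, plus a table built from named columns. It is an illustration only, not the DataFrames implementation: the names <code>SketchDataVec</code>, <code>isna</code> and <code>mean_skipna</code> are hypothetical, and the code is written in present-day Julia syntax using nothing beyond the base language.</p>

<pre class="julia" style="font-family:monospace;"># Hypothetical sketch, not the DataFrames implementation: a column that
# stores its values alongside a mask marking which entries are missing.
struct SketchDataVec{T}
    values::Vector{T}
    isna::Vector{Bool}
end

# Mean over the non-missing entries only.
function mean_skipna(v::SketchDataVec)
    kept = v.values[.!v.isna]
    return sum(kept) / length(kept)
end

# A table is then a set of named columns of equal length.
columns = Dict(
    "Value" =&gt; SketchDataVec([1.0, 2.0, 3.0], [false, true, false]),
    "Label" =&gt; SketchDataVec(["A", "B", "C"], [false, false, false]),
)

# The second entry of "Value" is masked out, so only 1.0 and 3.0 are averaged.
println(mean_skipna(columns["Value"]))  # prints 2.0</pre>

<p>Keeping the mask next to the values lets summary functions skip missing entries explicitly instead of relying on a sentinel value.</p>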

<div class="wp_codebox"><table><tr id="p475367"><td class="code" id="p4753code67"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;DataFrames&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> using DataFrames
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> data = <span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;Value&quot;</span> =<span style="color: #66cc66;">&gt;</span> <span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span>, <span style="color: #483d8b;">&quot;Label&quot;</span> =<span style="color: #66cc66;">&gt;</span> <span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;A&quot;</span>, <span style="color: #483d8b;">&quot;B&quot;</span>, <span style="color: #483d8b;">&quot;C&quot;</span><span style="color: black;">&#93;</span><span style="color: black;">&#125;</span>
<span style="color: #008000;">Warning</span>: imported binding <span style="color: #ff7700;font-weight:bold;">for</span> data overwritten <span style="color: #ff7700;font-weight:bold;">in</span> module Main
<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;Label&quot;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;A&quot;</span>, <span style="color: #483d8b;">&quot;B&quot;</span>, <span style="color: #483d8b;">&quot;C&quot;</span><span style="color: black;">&#93;</span>,<span style="color: #483d8b;">&quot;Value&quot;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span><span style="color: black;">&#125;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> df = DataFrame<span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>
DataFrame  <span style="color: black;">&#40;</span><span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>
        Label Value
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: black;">&#93;</span>      <span style="color: #483d8b;">&quot;A&quot;</span>     <span style="color: #ff4500;">1</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>,<span style="color: black;">&#93;</span>      <span style="color: #483d8b;">&quot;B&quot;</span>     <span style="color: #ff4500;">2</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: black;">&#93;</span>      <span style="color: #483d8b;">&quot;C&quot;</span>     <span style="color: #ff4500;">3</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> df<span style="color: black;">&#91;</span><span style="color: #483d8b;">&quot;Value&quot;</span><span style="color: black;">&#93;</span>
<span style="color: #ff4500;">3</span>-element DataVec<span style="color: black;">&#123;</span>Int64<span style="color: black;">&#125;</span>
&nbsp;
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: #ff4500;">2</span>,<span style="color: #ff4500;">3</span><span style="color: black;">&#93;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> df<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #483d8b;">&quot;Value&quot;</span><span style="color: black;">&#93;</span> = NA
NA
&nbsp;
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> head<span style="color: black;">&#40;</span>df<span style="color: black;">&#41;</span>
DataFrame  <span style="color: black;">&#40;</span><span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>
        Label Value
<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>,<span style="color: black;">&#93;</span>      <span style="color: #483d8b;">&quot;A&quot;</span>    NA
<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span>,<span style="color: black;">&#93;</span>      <span style="color: #483d8b;">&quot;B&quot;</span>     <span style="color: #ff4500;">2</span>
<span style="color: black;">&#91;</span><span style="color: #ff4500;">3</span>,<span style="color: black;">&#93;</span>      <span style="color: #483d8b;">&quot;C&quot;</span>     <span style="color: #ff4500;">3</span></pre></td></tr></table></div>

<h3>Distributions</h3>
<p>The <a href="https://github.com/JuliaStats/Distributions.jl">Distributions</a> package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia&#8217;s large type hierarchy, which means that quite generic names like <code>rand</code> can be used to sample from complex distributions:</p>

<div class="wp_codebox"><table><tr id="p475368"><td class="code" id="p4753code68"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Distributions&quot;</span><span style="color: black;">&#41;</span>
julia<span style="color: #66cc66;">&gt;</span> using Distributions
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> x = rand<span style="color: black;">&#40;</span>Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">11.0</span>, <span style="color: #ff4500;">3.0</span><span style="color: black;">&#41;</span>, <span style="color: #ff4500;">10</span>_000<span style="color: black;">&#41;</span>
<span style="color: #ff4500;">10000</span>-element Float64 Array:
  <span style="color: #ff4500;">6.87693</span>
 <span style="color: #ff4500;">13.3676</span> 
  <span style="color: #ff4500;">7.25008</span>
  <span style="color: #ff4500;">8.82833</span>
 <span style="color: #ff4500;">10.6911</span> 
  <span style="color: #ff4500;">7.1004</span> 
 <span style="color: #ff4500;">13.7449</span> 
  <span style="color: #ff4500;">5.96412</span>
  <span style="color: #ff4500;">8.57957</span>
 <span style="color: #ff4500;">15.2737</span> 
  ⋮      
  <span style="color: #ff4500;">4.89007</span>
 <span style="color: #ff4500;">15.1509</span> 
  <span style="color: #ff4500;">6.32376</span>
  <span style="color: #ff4500;">7.83847</span>
 <span style="color: #ff4500;">14.4476</span> 
 <span style="color: #ff4500;">14.2974</span> 
  <span style="color: #ff4500;">9.74783</span>
  <span style="color: #ff4500;">9.67398</span>
 <span style="color: #ff4500;">14.4992</span> 
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> mean<span style="color: black;">&#40;</span>x<span style="color: black;">&#41;</span>
<span style="color: #ff4500;">11.00366217730023</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> var<span style="color: black;">&#40;</span>x<span style="color: black;">&#41;</span>
<span style="color: #008000;">Warning</span>: Possible conflict <span style="color: #ff7700;font-weight:bold;">in</span> library <span style="color: #dc143c;">symbol</span> ddot_
<span style="color: #ff4500;">9.288938550823996</span></pre></td></tr></table></div>

<h3>Optim</h3>
<p>The <a href="https://github.com/johnmyleswhite/Optim.jl">Optim</a> package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, <code>optimize</code>, which works a bit like R&#8217;s <code>optim</code> function.</p>

<div class="wp_codebox"><table><tr id="p475369"><td class="code" id="p4753code69"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Optim&quot;</span><span style="color: black;">&#41;</span>
julia<span style="color: #66cc66;">&gt;</span> using Optim
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> f = v -<span style="color: #66cc66;">&gt;</span> <span style="color: black;">&#40;</span><span style="color: #ff4500;">10.9</span> - v<span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>^<span style="color: #ff4500;">2</span> + <span style="color: black;">&#40;</span><span style="color: #ff4500;">7.3</span> - v<span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>^<span style="color: #ff4500;">2</span>
<span style="color: #808080; font-style: italic;">#&lt;function&gt;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> initial_guess = <span style="color: black;">&#91;</span><span style="color: #ff4500;">0.0</span>, <span style="color: #ff4500;">0.0</span><span style="color: black;">&#93;</span>
<span style="color: #ff4500;">2</span>-element Float64 Array:
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> results = optimize<span style="color: black;">&#40;</span>f, initial_guess<span style="color: black;">&#41;</span>
<span style="color: #008000;">Warning</span>: Possible conflict <span style="color: #ff7700;font-weight:bold;">in</span> library <span style="color: #dc143c;">symbol</span> dcopy_
OptimizationResults<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Nelder-Mead&quot;</span>,<span style="color: black;">&#91;</span><span style="color: #ff4500;">0.333333</span>, <span style="color: #ff4500;">0.333333</span><span style="color: black;">&#93;</span>,<span style="color: black;">&#91;</span><span style="color: #ff4500;">10.9</span>, <span style="color: #ff4500;">7.29994</span><span style="color: black;">&#93;</span>,3.2848148720460163e-9,<span style="color: #ff4500;">38</span>,true<span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> results.<span style="color: black;">minimum</span>
<span style="color: #ff4500;">2</span>-element Float64 Array:
 <span style="color: #ff4500;">10.9</span>    
  <span style="color: #ff4500;">7.29994</span></pre></td></tr></table></div>

<h3>MCMC</h3>
<p>The <a href="https://github.com/doobwa/mcmc.jl">MCMC</a> package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like <code>slice_sampler</code>, which allows one to sample from a (potentially unnormalized) density function using Radford Neal&#8217;s slice sampling algorithm.</p>

<div class="wp_codebox"><table><tr id="p475370"><td class="code" id="p4753code70"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;MCMC&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> using MCMC
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> d = Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">17.29</span>, <span style="color: #ff4500;">1.0</span><span style="color: black;">&#41;</span>
Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">17.29</span>,<span style="color: #ff4500;">1.0</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> f = x -<span style="color: #66cc66;">&gt;</span> logpdf<span style="color: black;">&#40;</span>d, x<span style="color: black;">&#41;</span>
<span style="color: #808080; font-style: italic;">#&lt;function&gt;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> <span style="color: black;">&#91;</span>slice_sampler<span style="color: black;">&#40;</span><span style="color: #ff4500;">0.0</span>, f<span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> i <span style="color: #ff7700;font-weight:bold;">in</span> <span style="color: #ff4500;">1</span>:<span style="color: #ff4500;">100</span><span style="color: black;">&#93;</span>
<span style="color: #ff4500;">100</span>-element <span style="color: black;">&#40;</span>Float64,Float64<span style="color: black;">&#41;</span> Array:
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">2.7589100475626323</span>,-<span style="color: #ff4500;">106.49522613611775</span><span style="color: black;">&#41;</span> 
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">22.840595204318323</span>,-<span style="color: #ff4500;">16.323492094305458</span><span style="color: black;">&#41;</span> 
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">0.11800384424353683</span>,-<span style="color: #ff4500;">148.35766451986206</span><span style="color: black;">&#41;</span>
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">25.507580447082677</span>,-<span style="color: #ff4500;">34.68325273534245</span><span style="color: black;">&#41;</span>  
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">25.794565860846134</span>,-<span style="color: #ff4500;">37.08275877393945</span><span style="color: black;">&#41;</span>  
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">25.898128716394307</span>,-<span style="color: #ff4500;">37.96887853221083</span><span style="color: black;">&#41;</span>  
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">9.309878825853284</span>,-<span style="color: #ff4500;">32.76010551023705</span><span style="color: black;">&#41;</span>   
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">30.824102772255355</span>,-<span style="color: #ff4500;">92.50490745818972</span><span style="color: black;">&#41;</span>  
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">9.108789186504177</span>,-<span style="color: #ff4500;">34.38504372063516</span><span style="color: black;">&#41;</span>   
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">25.547686903330494</span>,-<span style="color: #ff4500;">35.01363502992266</span><span style="color: black;">&#41;</span>  
 ⋮                                        
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">5.795001414731885</span>,-<span style="color: #ff4500;">66.98643477086263</span><span style="color: black;">&#41;</span>   
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">15.50115292212293</span>,-<span style="color: #ff4500;">2.518925467219337</span><span style="color: black;">&#41;</span>   
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">12.046429369881345</span>,-<span style="color: #ff4500;">14.666455009726143</span><span style="color: black;">&#41;</span> 
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">17.25455052645699</span>,-<span style="color: #ff4500;">0.919566865791911</span><span style="color: black;">&#41;</span>   
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">25.494698549206657</span>,-<span style="color: #ff4500;">34.57747767488159</span><span style="color: black;">&#41;</span>  
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">1.8340810959111111</span>,-<span style="color: #ff4500;">120.36165311809079</span><span style="color: black;">&#41;</span> 
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">2.7112428736526177</span>,-<span style="color: #ff4500;">107.18901820771696</span><span style="color: black;">&#41;</span> 
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">9.21203292192012</span>,-<span style="color: #ff4500;">33.54571459047587</span><span style="color: black;">&#41;</span>    
 <span style="color: black;">&#40;</span><span style="color: #ff4500;">19.12274407701784</span>,-<span style="color: #ff4500;">2.5984139591266584</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<h3>NHST</h3>
<p>The <a href="https://github.com/johnmyleswhite/NHST.jl">NHST</a> package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.</p>

<div class="wp_codebox"><table><tr id="p475371"><td class="code" id="p4753code71"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Distributions&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> using Distributions
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;NHST&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> using NHST
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> d1 = Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">17.29</span>, <span style="color: #ff4500;">1.0</span><span style="color: black;">&#41;</span>
Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">17.29</span>,<span style="color: #ff4500;">1.0</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> d2 = Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">0.0</span>, <span style="color: #ff4500;">1.0</span><span style="color: black;">&#41;</span>
Normal<span style="color: black;">&#40;</span><span style="color: #ff4500;">0.0</span>,<span style="color: #ff4500;">1.0</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> x = rand<span style="color: black;">&#40;</span>d1, <span style="color: #ff4500;">1</span>_000<span style="color: black;">&#41;</span>
<span style="color: #ff4500;">1000</span>-element Float64 Array:
 <span style="color: #ff4500;">15.7085</span>
 <span style="color: #ff4500;">18.585</span> 
 <span style="color: #ff4500;">16.6036</span>
 <span style="color: #ff4500;">18.962</span> 
 <span style="color: #ff4500;">17.8715</span>
 <span style="color: #ff4500;">16.6814</span>
 <span style="color: #ff4500;">17.9676</span>
 <span style="color: #ff4500;">16.8924</span>
 <span style="color: #ff4500;">16.6022</span>
 <span style="color: #ff4500;">17.9813</span>
  ⋮     
 <span style="color: #ff4500;">17.1339</span>
 <span style="color: #ff4500;">17.3964</span>
 <span style="color: #ff4500;">18.6184</span>
 <span style="color: #ff4500;">16.7238</span>
 <span style="color: #ff4500;">18.5003</span>
 <span style="color: #ff4500;">16.1618</span>
 <span style="color: #ff4500;">17.9198</span>
 <span style="color: #ff4500;">17.4928</span>
 <span style="color: #ff4500;">18.715</span> 
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> y = rand<span style="color: black;">&#40;</span>d2, <span style="color: #ff4500;">1</span>_000<span style="color: black;">&#41;</span>
<span style="color: #ff4500;">1000</span>-element Float64 Array:
  <span style="color: #ff4500;">0.664885</span> 
  <span style="color: #ff4500;">0.147182</span> 
  <span style="color: #ff4500;">0.96265</span>  
  <span style="color: #ff4500;">0.24282</span>  
  <span style="color: #ff4500;">1.881</span>    
 -<span style="color: #ff4500;">0.632478</span> 
  <span style="color: #ff4500;">0.539297</span> 
  <span style="color: #ff4500;">0.996562</span> 
 -<span style="color: #ff4500;">0.483302</span> 
  <span style="color: #ff4500;">0.514629</span> 
  ⋮        
  <span style="color: #ff4500;">2.06249</span>  
 -<span style="color: #ff4500;">0.549444</span> 
  <span style="color: #ff4500;">0.857575</span> 
 -<span style="color: #ff4500;">1.47464</span>  
 -<span style="color: #ff4500;">2.33243</span>  
  <span style="color: #ff4500;">0.510751</span> 
 -<span style="color: #ff4500;">0.381069</span> 
 -<span style="color: #ff4500;">1.49165</span>  
  <span style="color: #ff4500;">0.0521203</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> t_test<span style="color: black;">&#40;</span>x, y<span style="color: black;">&#41;</span>
HypothesisTest<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;t-Test&quot;</span>,<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;t&quot;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: #ff4500;">392.2838409538002</span><span style="color: black;">&#125;</span>,<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;df&quot;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: #ff4500;">1989.732411290855</span><span style="color: black;">&#125;</span>,<span style="color: #ff4500;">0.0</span>,<span style="color: black;">&#91;</span><span style="color: #ff4500;">17.1535</span>, <span style="color: #ff4500;">17.3293</span><span style="color: black;">&#93;</span>,<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;mean of x&quot;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: #ff4500;">17.24357323225425</span>,<span style="color: #483d8b;">&quot;mean of y&quot;</span>=<span style="color: #66cc66;">&gt;</span><span style="color: #ff4500;">0.0021786523177457794</span><span style="color: black;">&#125;</span>,<span style="color: #ff4500;">0.0</span>,<span style="color: #483d8b;">&quot;two-sided&quot;</span>,<span style="color: #483d8b;">&quot;Welch Two Sample t-test&quot;</span>,<span style="color: #483d8b;">&quot;x and y&quot;</span>,<span style="color: #ff4500;">1989.732411290855</span><span style="color: black;">&#41;</span></pre></td></tr></table></div>

<h3>Clustering</h3>
<p>The <a href="https://github.com/johnmyleswhite/Clustering.jl">Clustering</a> package provides tools for doing simple k-means style clustering.</p>

<div class="wp_codebox"><table><tr id="p475372"><td class="code" id="p4753code72"><pre class="python" style="font-family:monospace;">julia<span style="color: #66cc66;">&gt;</span> load<span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;Clustering&quot;</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> using Clustering
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> srand<span style="color: black;">&#40;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> n = <span style="color: #ff4500;">100</span>
<span style="color: #ff4500;">100</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> x = vcat<span style="color: black;">&#40;</span>randn<span style="color: black;">&#40;</span>n, <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>, randn<span style="color: black;">&#40;</span>n, <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span> .+ <span style="color: #ff4500;">10</span><span style="color: black;">&#41;</span>
200x2 Float64 Array:
  <span style="color: #ff4500;">0.0575636</span>  -<span style="color: #ff4500;">0.112322</span> 
 -<span style="color: #ff4500;">1.8329</span>     -<span style="color: #ff4500;">0.101326</span> 
  <span style="color: #ff4500;">0.370699</span>   -<span style="color: #ff4500;">0.956183</span> 
  <span style="color: #ff4500;">1.31816</span>    -<span style="color: #ff4500;">1.44351</span>  
  <span style="color: #ff4500;">0.787598</span>    <span style="color: #ff4500;">0.148386</span> 
  <span style="color: #ff4500;">0.712214</span>   -<span style="color: #ff4500;">1.293</span>    
 -<span style="color: #ff4500;">1.8578</span>     -<span style="color: #ff4500;">1.06208</span>  
 -<span style="color: #ff4500;">0.746303</span>   -<span style="color: #ff4500;">0.0439182</span>
  <span style="color: #ff4500;">1.12082</span>    -<span style="color: #ff4500;">2.00616</span>  
  <span style="color: #ff4500;">0.364646</span>   -<span style="color: #ff4500;">1.09331</span>  
  ⋮                    
 <span style="color: #ff4500;">10.1974</span>     <span style="color: #ff4500;">10.5583</span>   
 <span style="color: #ff4500;">11.0832</span>      <span style="color: #ff4500;">8.92082</span>  
 <span style="color: #ff4500;">11.5414</span>     <span style="color: #ff4500;">11.6022</span>   
  <span style="color: #ff4500;">9.0453</span>     <span style="color: #ff4500;">11.5093</span>   
  <span style="color: #ff4500;">8.86714</span>    <span style="color: #ff4500;">10.4233</span>   
 <span style="color: #ff4500;">10.7336</span>     <span style="color: #ff4500;">10.7201</span>   
  <span style="color: #ff4500;">8.60415</span>     <span style="color: #ff4500;">9.13942</span>  
  <span style="color: #ff4500;">8.62482</span>     <span style="color: #ff4500;">8.51701</span>  
 <span style="color: #ff4500;">10.5044</span>     <span style="color: #ff4500;">10.3841</span>   
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> true_assignments = vcat<span style="color: black;">&#40;</span>zeros<span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span>, ones<span style="color: black;">&#40;</span>n<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #ff4500;">200</span>-element Float64 Array:
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 <span style="color: #ff4500;">0.0</span>
 ⋮  
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
 <span style="color: #ff4500;">1.0</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> results = k_means<span style="color: black;">&#40;</span>x, <span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>
<span style="color: #008000;">Warning</span>: Possible conflict <span style="color: #ff7700;font-weight:bold;">in</span> library <span style="color: #dc143c;">symbol</span> dgesdd_
<span style="color: #008000;">Warning</span>: Possible conflict <span style="color: #ff7700;font-weight:bold;">in</span> library <span style="color: #dc143c;">symbol</span> dsyrk_
<span style="color: #008000;">Warning</span>: Possible conflict <span style="color: #ff7700;font-weight:bold;">in</span> library <span style="color: #dc143c;">symbol</span> dgemm_
KMeansOutput<span style="color: black;">&#40;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>, <span style="color: #ff4500;">1</span>  ...  <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span>, <span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>,2x2 Float64 Array:
 -<span style="color: #ff4500;">0.0166203</span>  -<span style="color: #ff4500;">0.248904</span>
 <span style="color: #ff4500;">10.0418</span>     <span style="color: #ff4500;">10.0074</span>  ,<span style="color: #ff4500;">3</span>,<span style="color: #ff4500;">422.9820560670007</span>,true<span style="color: black;">&#41;</span>
&nbsp;
julia<span style="color: #66cc66;">&gt;</span> results.<span style="color: black;">assignments</span>
<span style="color: #ff4500;">200</span>-element Int64 Array:
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 <span style="color: #ff4500;">1</span>
 ⋮
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span>
 <span style="color: #ff4500;">2</span></pre></td></tr></table></div>

<p>While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package system has made me more excited than ever before about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like that provided by R&#8217;s ggplot2 package. Hopefully something will come into being within the next few months.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/12/02/the-state-of-statistics-in-julia/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>The Shape of Floating Point Random Numbers</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/10/15/the-shape-of-floating-point-random-numbers/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/10/15/the-shape-of-floating-point-random-numbers/#comments</comments>
		<pubDate>Mon, 15 Oct 2012 14:32:20 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4736</guid>
		<description><![CDATA[[Updated 10/18/2012: Fixed a typo in which mantissa was replaced with exponent.] Over the weekend, Viral Shah updated Julia&#8217;s implementation of randn() to give a 20% speed boost. Because we all wanted to test that this speed-up had not come at the expense of the validity of Julia&#8217;s RNG system, I spent some time this [...]]]></description>
				<content:encoded><![CDATA[<p>[<b>Updated 10/18/2012</b>: Fixed a typo in which mantissa was replaced with exponent.]</p>
<p>Over the weekend, <a href="http://www.allthingshpc.org/Welcome.html">Viral Shah</a> updated <a href="https://github.com/JuliaLang/julia/issues/1348">Julia&#8217;s implementation of <code>randn()</code> to give a 20% speed boost</a>. Because we all wanted to test that this speed-up had not come at the expense of the validity of Julia&#8217;s RNG system, I spent some time this weekend trying to get tests up and running. I didn&#8217;t get far, but thankfully others chimed in and got things done.</p>
<p><a href="http://www.johndcook.com/Beautiful_Testing_ch10.pdf">Testing an RNG is serious business</a>. In total, we&#8217;ve considered using four different test suites:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Diehard_tests">Diehard</a></li>
<li><a href="http://www.phy.duke.edu/~rgb/General/dieharder.php">Dieharder</a></li>
<li><a href="http://csrc.nist.gov/groups/ST/toolkit/rng/index.html">STS</a></li>
<li><a href="http://www.iro.umontreal.ca/%7Esimardr/testu01/tu01.html">TestU01</a></li>
</ul>
<p>All of these suites can be easily used to test uniform random numbers over unsigned integers. Some are also appropriate for testing uniform random numbers over floating-point values.</p>
<p>But we wanted to test a Gaussian RNG. To do that, we followed <a href="http://www.cse.cuhk.edu.hk/%7Ephwl/mt/public/archives/papers/grng_acmcs07.pdf">Thomas et al.&#8217;s lead</a> and mapped the Gaussian RNG&#8217;s output through a high-precision cumulative distribution function (CDF) to produce uniform random floating point values. As our high-precision CDF we ended up using the one described in <a href="http://www.jstatsoft.org/v11/a05/paper">Marsaglia&#8217;s 2004 JSS paper</a>.</p>
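<p>The mapping from Gaussian draws to uniform values is the probability integral transform: push each draw through the standard normal CDF. The sketch below is only an illustration in Python. It uses the double-precision <code>math.erfc</code> as a stand-in for the high-precision routine mentioned above, and the function names and seed are my own assumptions, not anything from Julia&#8217;s test code.</p>

```python
import math
import random


def normal_cdf(x):
    """Standard normal CDF via the complementary error function (double precision only)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_to_uniform(draws):
    """Map N(0, 1) draws to values that should be uniform on [0, 1]."""
    return [normal_cdf(x) for x in draws]


rng = random.Random(1)  # arbitrary seed, for reproducibility only
draws = [rng.gauss(0.0, 1.0) for _ in range(10000)]
uniforms = gaussian_to_uniform(draws)

# Smallest value, largest value and mean of the transformed draws.
print(min(uniforms), max(uniforms), sum(uniforms) / len(uniforms))
```

<p>If the Gaussian generator is sound, the transformed values can be handed to any test suite that expects uniform floating point input.</p>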
<p>With that in place, I started to try modifying my previous RNG testing code. When we previously tried to test Julia&#8217;s <code>rand()</code> function, I got STS working on my machine and deciphered its manual well enough to run a suite of tests on a bit stream from Julia.</p>
<p>Unfortunately I made a fairly serious error in how I attempted to test Julia&#8217;s RNG. Because STS expects a stream of random 0s and 1s, I converted random numbers into 0s and 1s by testing whether the floating point numbers being generated were greater than 0.5 or less than 0.5. While this test is not completely wrong, it is very, very weak. Its substantive value comes from two points:</p>
<ol>
<li>It confirms that the median of the RNG is correctly positioned at 0.5.</li>
<li>It confirms that the placement of successive entries relative to 0.5 is effectively random. In short, there is no trivial correlation between successive values.</li>
</ol>
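<p>To make the weakness concrete, here is a small Python sketch of the thresholding conversion and the two properties it can check. The helper names and the deliberately coarse generator are hypothetical illustrations, not part of STS or of the code I actually ran.</p>

```python
import random


def threshold_bits(values, cutoff=0.5):
    """Turn each float into one bit: 1 if it exceeds the cutoff, else 0."""
    return [1 if v > cutoff else 0 for v in values]


def lag1_agreement(bits):
    """Fraction of adjacent bit pairs that are equal (about 0.5 if independent)."""
    same = sum(1 for a, b in zip(bits, bits[1:]) if a == b)
    return same / (len(bits) - 1)


rng = random.Random(2)  # arbitrary seed
good = threshold_bits([rng.random() for _ in range(20000)])

# A deliberately bad generator that only ever returns 0.25 or 0.75.
coarse = threshold_bits([rng.choice([0.25, 0.75]) for _ in range(20000)])

for bits in (good, coarse):
    # Share of draws above 0.5, then share of adjacent pairs that agree.
    print(sum(bits) / len(bits), lag1_agreement(bits))
```

<p>Both generators look the same through this lens, even though the second one only ever produces two distinct values.</p>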
<p>Unfortunately that&#8217;s about all you learn from this method. We needed something more. So I started exploring how to convert a floating point into bits. Others had the good sense to avoid this and pushed us forward by using the TestU01 suite.</p>
<p>I instead got lost exploring the surprising complexity of trying to work with the individual bits of random floating point numbers. The topic is so subtle because <i>the distribution of bits in a randomly generated floating point number is extremely far from a random source of individual bits.</i></p>
<p>For example, a uniform variable&#8217;s representation in floating point has all the following non-random properties:</p>
<ol>
<li>The sign bit is never random because uniform variables are never negative.</li>
<li>The exponent is not random either because uniform variables are strictly contained in the interval [0, 1].</li>
<li>Even the mantissa isn&#8217;t random. Because floating point numbers aren&#8217;t evenly spaced in the reals, the mantissa has to have complex patterns in it to simulate the equal-spacing of uniform numbers.</li>
</ol>
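<p>The first two properties are easy to check directly. The following Python sketch splits a 64-bit IEEE 754 double into its sign, exponent and mantissa fields; it uses Python&#8217;s own <code>random.random()</code> rather than Julia&#8217;s <code>rand()</code>, and <code>float_fields</code> is a name I made up for the illustration.</p>

```python
import random
import struct


def float_fields(x):
    """Split a 64-bit IEEE 754 double into (sign, biased exponent, mantissa)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63                   # 1 sign bit
    exponent = (raw >> 52) % 2048      # 11 exponent bits
    mantissa = raw % (2 ** 52)         # 52 mantissa bits
    return sign, exponent, mantissa


rng = random.Random(3)  # arbitrary seed
fields = [float_fields(rng.random()) for _ in range(10000)]

# The sign bit is always 0 for draws from [0, 1).
print(set(sign for sign, _, _ in fields))
# The biased exponent never exceeds 1022, the exponent of values in [0.5, 1).
print(max(exponent for _, exponent, _ in fields))
```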
<p>Inspired by all of this, I decided to get a sense for the bit pattern signature of different RNGs. Below I&#8217;ve plotted the patterns for uniform, normal, gamma and Cauchy variables using lines that describe the mean value of the i-th bit in the bit string. A completely random bit stream would produce a flat horizontal line at 0.5, which many of the lines touch for a moment but never perfectly match.</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/signatures2.png" alt="Signatures" title="signatures.png" border="0" width="800" height="600" /><br />
</center></p>
<p>Some patterns:</p>
<ol>
<li>The first bit (shown on the far left) is the sign bit: you can clearly see which distributions are symmetric by looking for a mean value of 0.5 versus those that are strictly positive and have a mean value of 0.0.</li>
<li>The next eleven bits are the exponent and you can clearly see which distributions are largely concentrated in the interval [-1, 1] and which have substantial density outside of that region. These bits would clue you in to the variance of the distribution.</li>
<li>You can see that there is a lot of non-randomness in the last few bits of the mantissa for uniform variables. There&#8217;s also non-randomness in the first few bits for all variables. I don&#8217;t yet have any real intuition for those patterns.</li>
</ol>
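<p>A signature of this kind can be computed in a few lines. The Python sketch below is a rough reconstruction of the idea, not the code behind the plot: it covers only uniform and normal draws, uses Python&#8217;s generators instead of Julia&#8217;s, and the function names are assumptions.</p>

```python
import random
import struct


def bits_of(x):
    """Return the 64 bits of a double as a list, sign bit first."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(raw >> (63 - i)) % 2 for i in range(64)]


def bit_signature(samples):
    """Mean value of the i-th bit across all samples, for i = 0..63."""
    totals = [0] * 64
    for x in samples:
        for i, bit in enumerate(bits_of(x)):
            totals[i] += bit
    return [t / len(samples) for t in totals]


rng = random.Random(4)  # arbitrary seed
n = 5000
uniform_sig = bit_signature([rng.random() for _ in range(n)])
normal_sig = bit_signature([rng.gauss(0.0, 1.0) for _ in range(n)])

print(uniform_sig[0], normal_sig[0])  # sign bit means for each distribution
print(uniform_sig[1:12])              # means of the eleven exponent bits
```

<p>Plotting each list against the bit index gives one line per distribution.</p>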
<p>You can go beyond looking at the signatures of mean bit patterns by looking at covariance matrices as well. Below I show these covariance matrices in a white-blue coloring scheme in which white indicates negative values, light blue indicates zero and dark blue indicates positive values. Note that the matrices, generated using R&#8217;s <code>image()</code> function, are reflections of the more intuitive matrix ordering in which the [1,1] entry of the matrix occurs in the top-left instead of the bottom-left.</p>
<p><center></p>
<h4>Uniform Variables</h4>
<p><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/uniform_covariance.png" alt="Uniform covariance" title="uniform_covariance.png" border="0" width="480" height="480" /></p>
<h4>Normal Variables</h4>
<p><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/normal_covariance.png" alt="Normal covariance" title="normal_covariance.png" border="0" width="480" height="480" /></p>
<h4>Gamma Variables</h4>
<p><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/gamma_covariance.png" alt="Gamma covariance" title="gamma_covariance.png" border="0" width="480" height="480" /></p>
<h4>Cauchy Variables</h4>
<p><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/cauchy_covariance.png" alt="Cauchy covariance" title="cauchy_covariance.png" border="0" width="480" height="480" /><br />
</center></p>
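<p>For anyone who wants to reproduce matrices like these, here is a minimal Python sketch of the bit covariance calculation. It is an assumption-laden stand-in for my original code: it uses the population covariance (dividing by n), Python&#8217;s <code>random.gauss</code> for the normal case only, and invented helper names.</p>

```python
import random
import struct


def bits_of(x):
    """Return the 64 bits of a double as a list, sign bit first."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(raw >> (63 - i)) % 2 for i in range(64)]


def bit_covariance(samples):
    """64 x 64 population covariance of bit positions: E[b_i b_j] - E[b_i] E[b_j]."""
    n = len(samples)
    ones = [0] * 64
    both = [[0] * 64 for _ in range(64)]
    for x in samples:
        on = [i for i, bit in enumerate(bits_of(x)) if bit]
        for i in on:
            ones[i] += 1
            row = both[i]
            for j in on:
                row[j] += 1  # bits i and j are both set in this sample
    means = [count / n for count in ones]
    cov = []
    for i in range(64):
        cov.append([both[i][j] / n - means[i] * means[j] for j in range(64)])
    return cov


rng = random.Random(5)  # arbitrary seed
cov = bit_covariance([rng.gauss(0.0, 1.0) for _ in range(2000)])
print(cov[0][0])   # variance of the sign bit
print(cov[0][63])  # covariance of the sign bit with the last mantissa bit
```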
<p>I find these pictures really helpful for reminding me how strangely floating point numbers behave. The complexity of these images is so far removed from the simplicity of the bit non-patterns in randomly generated unsigned integers, which can be generated using IID random bits and concatenating them together.</p>
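<p>For contrast, the unsigned integer case really is that simple. This Python sketch builds 32-bit integers by concatenating independent fair bits and computes the same mean-bit signature; the 32-bit width and the helper name are arbitrary choices for the illustration.</p>

```python
import random


def random_uint(rng, width=32):
    """Build an unsigned integer by concatenating `width` independent fair bits."""
    value = 0
    for _ in range(width):
        value = value * 2 + rng.getrandbits(1)
    return value


rng = random.Random(6)  # arbitrary seed
width = 32
n = 5000
totals = [0] * width
for _ in range(n):
    value = random_uint(rng, width)
    for i in range(width):
        totals[i] += (value >> (width - 1 - i)) % 2

signature = [t / n for t in totals]
print(min(signature), max(signature))  # every position should sit close to 0.5
```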
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/10/15/the-shape-of-floating-point-random-numbers/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Overfitting</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/10/13/overfitting/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/10/13/overfitting/#comments</comments>
		<pubDate>Sat, 13 Oct 2012 11:59:23 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4726</guid>
		<description><![CDATA[What do you think when you see a model like the one below? Does this strike you as a good model? Or as a bad model? There&#8217;s no right or wrong answer to this question, but I&#8217;d like to argue that models that are able to match white noise are typically bad things, especially when [...]]]></description>
				<content:encoded><![CDATA[<p>What do you think when you see a model like the one below?</p>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/overfitting.png" alt="Overfitting" title="overfitting.png" border="0" width="800" height="600" /><br />
</center></p>
<p>Does this strike you as a good model? Or as a bad model?</p>
<p>There&#8217;s no right or wrong answer to this question, but I&#8217;d like to argue that models that are able to match white noise are typically bad things, especially when you don&#8217;t have a clear cross-validation paradigm that will allow you to demonstrate that your model&#8217;s ability to match complex data isn&#8217;t a form of overfitting.</p>
<p>There are many objective reasons to suspect complicated models, but I&#8217;d like to offer up a subjective one. A model that fits complex data as perfectly as the model above is likely to not be an interpretable model<sup><a href="http://www.johnmyleswhite.com/notebook/2012/10/13/overfitting/#footnote_0_4726" id="identifier_0_4726" class="footnote-link footnote-identifier-link" title="Although it might be a great predictive model if you can confirm that the fit above is the quality of the fit to held-out data!">1</a></sup> because it is essentially a noisy copy of the data. If the model looks so much like the data, why construct a model at all? Why not just use the raw data?</p>
<p>Unless the functional form of a model and its dependence on inputs is simple, I&#8217;m very suspicious of any statistical method that produces outputs like those shown above. If you want a model to do more than produce black-box predictions, it should probably provide predictions that are relatively smooth. At the least it should reveal comprehensible and memorable patterns that are non-smooth. While there are fields in which neither of these goals is possible (and others where it&#8217;s not desirable), I think the default reaction to a model fit like the one above should be: &#8220;why does the model make such complex predictions? Isn&#8217;t that a mistake? How many degrees of freedom does it have that it can so closely fit such noisy data?&#8221;</p>
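<p>A toy example makes the degrees-of-freedom point concrete. The Python sketch below fits pure white noise with a model that simply memorizes the training data (a one-nearest-neighbor lookup) and compares it with a constant prediction. The nearest-neighbor model is my own stand-in chosen for brevity, not the model in the figure, and all names and sample sizes are assumptions.</p>

```python
import random


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_noise(rng, n):
    """White noise: y is independent of x, so there is nothing real to learn."""
    xs = [rng.random() for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return xs, ys


rng = random.Random(7)  # arbitrary seed
train_x, train_y = make_noise(rng, 500)
test_x, test_y = make_noise(rng, 500)


def nearest_neighbor(x):
    """Flexible model: echo the y of the closest training point."""
    i = min(range(len(train_x)), key=lambda k: abs(train_x[k] - x))
    return train_y[i]


mean_y = sum(train_y) / len(train_y)


def constant(x):
    """Simple model: always predict the training mean."""
    return mean_y


# Each line prints training error, then held-out error.
print(mse(nearest_neighbor, train_x, train_y), mse(nearest_neighbor, test_x, test_y))
print(mse(constant, train_x, train_y), mse(constant, test_x, test_y))
```

<p>The memorizing model matches the training noise exactly, yet on held-out data it does worse than the constant, which is the pattern a cross-validation check is meant to expose.</p>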
<ol class="footnotes"><li id="footnote_0_4726" class="footnote">Although it might be a great predictive model if you can confirm that the fit above is the quality of the fit to held-out data!</li></ol>]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/10/13/overfitting/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>EDA Before CDA</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/10/06/eda-before-cda/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/10/06/eda-before-cda/#comments</comments>
		<pubDate>Sat, 06 Oct 2012 21:11:11 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4719</guid>
		<description><![CDATA[One Paragraph Summary Always explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you&#8217;ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of [...]]]></description>
				<content:encoded><![CDATA[<h3>One Paragraph Summary</h3>
<p><b>Always</b> explore your data visually. Whatever specific hypothesis you have when you go out to collect data is likely to be worse than any of the hypotheses you&#8217;ll form after looking at just a few simple visualizations of that data. The most effective hypothesis testing framework in existence is the test of intraocular trauma.</p>
<h3>Context</h3>
<p>This morning, I woke up to find that <a href="https://twitter.com/neilkod/status/254449853650837504">Neil Kodner</a> had discovered a very convenient CSV file that contains geospatial data about every valid US zip code. I&#8217;ve been interested in the relationship between places and zip codes recently, because I spent my summer living in the 98122 zip code after having spent my entire life living in places with zip codes below 20000. Because of the huge gulf between my Seattle zip code and my zip codes on the East Coast, I&#8217;ve on-and-off wondered if the zip codes were originally assigned in terms of the seniority of states. Specifically, the original thirteen colonies seem to have some of the lowest zip codes, while the newer states had some of the highest zip codes.</p>
<p>While I could presumably find this information through a few web searches or could gather the right data set to test my idea formally, I decided to blindly plot the zip code data instead. I think the results help to show why a few well-chosen visualizations can be so much more valuable than regression coefficients. Below I&#8217;ve posted the code I used to explore the zip code data in the exact order of the plots I produced. I&#8217;ll let the resulting pictures tell the rest of the story.</p>

<div class="wp_codebox"><table><tr id="p471974"><td class="code" id="p4719code74"><pre class="c" style="font-family:monospace;">library<span style="color: #009900;">&#40;</span>ggplot2<span style="color: #009900;">&#41;</span>
&nbsp;
zipcodes <span style="color: #339933;">&lt;-</span> read.<span style="color: #202020;">csv</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;zipcodes.csv&quot;</span><span style="color: #009900;">&#41;</span>
&nbsp;
ggplot<span style="color: #009900;">&#40;</span>zipcodes<span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> zip<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> latitude<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  geom_point<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
ggsave<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;latitude_vs_zip.png&quot;</span><span style="color: #339933;">,</span> height <span style="color: #339933;">=</span> <span style="color: #0000dd;">7</span><span style="color: #339933;">,</span> width <span style="color: #339933;">=</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span>
ggplot<span style="color: #009900;">&#40;</span>zipcodes<span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> zip<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> longitude<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  geom_point<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
ggsave<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;longitude_vs_zip.png&quot;</span><span style="color: #339933;">,</span> height <span style="color: #339933;">=</span> <span style="color: #0000dd;">7</span><span style="color: #339933;">,</span> width <span style="color: #339933;">=</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span>
ggplot<span style="color: #009900;">&#40;</span>zipcodes<span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> latitude<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> longitude<span style="color: #339933;">,</span> color <span style="color: #339933;">=</span> zip<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  geom_point<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
ggsave<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;latitude_vs_longitude_color.png&quot;</span><span style="color: #339933;">,</span> height <span style="color: #339933;">=</span> <span style="color: #0000dd;">7</span><span style="color: #339933;">,</span> width <span style="color: #339933;">=</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span>
ggplot<span style="color: #009900;">&#40;</span>zipcodes<span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> longitude<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> latitude<span style="color: #339933;">,</span> color <span style="color: #339933;">=</span> zip<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  geom_point<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
ggsave<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;longitude_vs_latitude_color.png&quot;</span><span style="color: #339933;">,</span> height <span style="color: #339933;">=</span> <span style="color: #0000dd;">7</span><span style="color: #339933;">,</span> width <span style="color: #339933;">=</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span>
ggplot<span style="color: #009900;">&#40;</span>subset<span style="color: #009900;">&#40;</span>zipcodes<span style="color: #339933;">,</span> longitude <span style="color: #339933;">&lt;</span> <span style="color: #0000dd;">0</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> aes<span style="color: #009900;">&#40;</span>x <span style="color: #339933;">=</span> longitude<span style="color: #339933;">,</span> y <span style="color: #339933;">=</span> latitude<span style="color: #339933;">,</span> color <span style="color: #339933;">=</span> zip<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">+</span>
  geom_point<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>
ggsave<span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;usa_color.png&quot;</span><span style="color: #339933;">,</span> height <span style="color: #339933;">=</span> <span style="color: #0000dd;">7</span><span style="color: #339933;">,</span> width <span style="color: #339933;">=</span> <span style="color: #0000dd;">10</span><span style="color: #009900;">&#41;</span></pre></td></tr></table></div>

<h3>Picture</h3>
<h4>(Latitude, Zipcode) Scatterplot</h4>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/latitude_vs_zip.png" alt="Latitude vs zip" title="latitude_vs_zip.png" border="0" width="600" height="420" /><br />
</center></p>
<h4>(Longitude, Zipcode) Scatterplot</h4>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/longitude_vs_zip.png" alt="Longitude vs zip" title="longitude_vs_zip.png" border="0" width="600" height="420" /><br />
</center></p>
<h4>(Latitude, Longitude) Heatmap</h4>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/latitude_vs_longitude_color.png" alt="Latitude vs longitude color" title="latitude_vs_longitude_color.png" border="0" width="600" height="420" /><br />
</center></p>
<h4>(Longitude, Latitude) Heatmap</h4>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/longitude_vs_latitude_color.png" alt="Longitude vs latitude color" title="longitude_vs_latitude_color.png" border="0" width="600" height="420" /><br />
</center></p>
<h4>(Longitude, Latitude) Heatmap without Non-States</h4>
<p><center><br />
<img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/usa_color.png" alt="Usa color" title="usa_color.png" border="0" width="600" height="420" /><br />
</center></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/10/06/eda-before-cda/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Finder Bug in OS X</title>
		<link>http://www.johnmyleswhite.com/notebook/2012/10/02/finder-bug-in-os-x/</link>
		<comments>http://www.johnmyleswhite.com/notebook/2012/10/02/finder-bug-in-os-x/#comments</comments>
		<pubDate>Tue, 02 Oct 2012 21:40:50 +0000</pubDate>
		<dc:creator>John Myles White</dc:creator>
				<category><![CDATA[Mac OS X]]></category>
		<category><![CDATA[Programming]]></category>

		<guid isPermaLink="false">http://www.johnmyleswhite.com/?p=4711</guid>
		<description><![CDATA[Four years after I first noticed it, Finder still has a bug in it that causes it to report a negative number of items waiting for deletion:]]></description>
				<content:encoded><![CDATA[<p>Four years after <a href="http://www.johnmyleswhite.com/notebook/2008/08/30/negative-items-left-to-delete-under-mac-os-x-leopard/">I first noticed it</a>, Finder still has a bug in it that causes it to report a negative number of items waiting for deletion:</p>
<p><img src="http://www.johnmyleswhite.com/notebook/wp-content/uploads/2012/10/FinderBug.png" alt="FinderBug" title="FinderBug.png" border="0" width="433" height="169" /></p>
]]></content:encoded>
			<wfw:commentRss>http://www.johnmyleswhite.com/notebook/2012/10/02/finder-bug-in-os-x/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

