Dec 16 2009

Quick Review of Matrix Algebra in R

Lately, I’ve been running a series of fMRI experiments on visual perception. In the interests of understanding the underlying properties of the images I’m using as stimuli, I’ve been trying to learn more about the matrix transformations commonly used for image compression and image manipulation. Thankfully, R provides simple-to-use implementations for all of the matrix operations I wanted to play around with, so it’s been quite easy to get started. For the next few posts, I thought that I’d review the standard matrix techniques for image compression and editing, giving full examples of their implementation in R and demonstrations of their real-world value.

Before I start, I should make sure that you’re familiar with basic matrix operations in R. I imagine almost every R user knows a little bit about matrix algebra and probably knows the basics of using R to perform matrix algebra, but here’s a quick review to make sure I don’t leave anyone in the dark:

Building Matrices

You can build a matrix in R using the matrix function:

1
2
3
4
5
6
m <- matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2, byrow = TRUE)
 
m
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1

You can test whether an item you’ve been given is a matrix using is.matrix and you can convert appropriate objects to matrices using as.matrix:

1
2
3
4
5
6
7
8
9
m <- matrix(c(1, 0, 0, 1), nrow = 2, ncol = 2, byrow = TRUE)
 
is.matrix(m)
# TRUE
 
as.matrix(data.frame(x = c(1, 0), y = c(0, 1)))
#      x y
# [1,] 1 0
# [2,] 0 1

Diagonal Matrices

The matrix I’ve been building in the examples above is a diagonal matrix, so you could also construct it using diag:

1
2
3
4
5
6
m <- diag(nrow = 2, ncol = 2)
 
m
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1

In general, you can generate the n by n identity matrix as:

1
2
3
n <- 10
 
diag(nrow = n, ncol = n)

I’ll be exploiting a more interesting use of diag in my next post, so let’s see how you can build matrices other than the identity with diag by specifying a vector of entries along the diagonal:

1
2
3
4
5
6
m <- diag(c(2, 1), nrow = 2, ncol = 2)
 
m
#      [,1] [,2]
# [1,]    2    0
# [2,]    0    1

This form of diag turns out to be extremely useful, as you’ll see once I cover the SVD’s syntax in R.

Matrix Algebra: Addition, Scalar Multiplication, Matrix Multiplication

The three core operations that can be performed on matrices are addition, scalar multiplication and matrix multiplication. These are easy to work with in R:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
m <- matrix(c(0, 2, 1, 0), nrow = 2, ncol = 2, byrow = TRUE)
 
m + m
#      [,1] [,2]
# [1,]    0    4
# [2,]    2    0
 
2 * m
#      [,1] [,2]
# [1,]    0    4
# [2,]    2    0
 
m %*% m
#      [,1] [,2]
# [1,]    2    0
# [2,]    0    2

Be careful with the * operator: it does not perform matrix multiplication, but rather an entry-wise multiplication:

1
2
3
4
5
m * m
 
#      [,1] [,2]
# [1,]    0    4
# [2,]    1    0

Matrix Transposes and Inverses

The next few matrix operations are a little more complex: transposition and inversion. Transposition is the easier of the two. To get the transpose of a matrix, you simply call the t function:

1
2
3
4
t(m)
#      [,1] [,2]
# [1,]    0    1
# [2,]    2    0

In contrast, inversion is a little more complex, partly because the function you’d want to use has a non-obvious name: solve.

1
2
3
4
5
solve(m)
 
#      [,1] [,2]
# [1,]  0.0    1
# [2,]  0.5    0

The reason that solve is called solve is that it’s a general purpose function you can use to solve matrix equations without wasting time computing the full inverse, which is often inefficient. If you want to know more about the computational efficiency issues, you should look into the ideas behind the even faster variant, qr.solve.

Now, you probably know this already, but the definition of a matrix’s inverse is that the product of the matrix and its inverse is the identity matrix, if the inverse exists. I always find this a good way to make sure that I’m correctly computing the inverse of a matrix:

1
2
3
4
5
solve(m) %*% m == diag(nrow = nrow(m), ncol = ncol(m))
 
#      [,1] [,2]
# [1,] TRUE TRUE
# [2,] TRUE TRUE

Further Matrix Operations

The car package defines an inv function, which is simply a new name for solve:

1
2
3
4
5
6
library('car')
inv
 
# function (x) 
# solve(x)
# <environment: namespace:car>

I think this is pretty clever, though I’m loathe to import a package just for this one snippet.

More interestingly, the MASS package defines a ginv function, which gives the matrix pseudoinverse, a generalization of matrix inversion that works for all matrices:

1
2
3
4
5
6
7
8
9
10
11
12
library('MASS')
 
m <- matrix(c(1, 1, 2, 2), nrow = 2, ncol = 2)
 
solve(m)
# Error in solve.default(m) : 
#  Lapack routine dgesv: system is exactly singular
 
ginv(m)
#      [,1] [,2]
# [1,]  0.1  0.1
# [2,]  0.2  0.2

In practice, you can usually get around using the pseudoinverse, but it’s nice to know that it’s at hand all the time.

Eigenvalues and Eigenvectors

Eigenvectors are surely the bane of every starting student of linear algebra, though their considerable power to simplify problems makes them the darling of every applied mathematician. Thankfully, R makes it easy to get these for every matrix:

1
2
3
4
5
6
7
8
9
10
m <- diag(nrow = 2, ncol = 2)
 
eigen(m)
# $values
# [1] 1 1
#
# $vectors
#      [,1] [,2]
# [1,]    0   -1
# [2,]    1    0

Matrix Metadata

Last, but not least, you can get metadata about the shape of a matrix using dim, nrow and ncol:

1
2
3
4
5
6
7
8
9
10
m <- diag(nrow = 2, ncol = 2)
 
dim(m)
# [1] 2 2
 
nrow(m)
# [1] 2
 
ncol(m)
# [1] 2

Hopefully those operations were already familiar to any readers out there, as I doubt that they’ll be clear from this short explanation without prior knowledge. I just felt compelled to review them before using any of them in my next set of posts. Tomorrow, I’ll start going through more interesting matrix algorithms in R, beginning with the SVD.


Dec 15 2009

Some Pitfalls of Object Oriented Programming in R

Yesterday, I wrote a post about implementing setter methods for objects in R. On my way to understanding how to implement these methods correctly, I made a bunch of mistakes. To keep people from wasting their time with these same mistakes, I’m reviewing them here.

Mistake One: Naming Conventions

Every generic method will search for a class method using a single convention: if the generic method name is GENERIC_METHOD and the class name is CLASS_NAME, then the class method is always GENERIC_METHOD.CLASS_NAME. This is true no matter how strange the generic method looks: the generic method id<- for class user has a class method called id<-.user. The strange <- in the method name doesn't change the fact that .user comes at the end of the class method name.

Mistake Two: Inspecting Existing Functions with Strange Names

If you want to get a sense how the function class works, you can simply type class at the R interpreter. Unfortunately, the parsing rules for R insure that this will never work for methods that end in "<-". For that reason, it's not at all clear how to get the definitions of these strangely named functions to come up in the interpreter. When defining these sort of methods, it's sufficient to simply enclose the method name, but that will not work for the purposes of introspection:

1
2
3
'class<-'
 
# [1] "class<-"

As you can see, 'class<-' evaluates to a string, which is neither surprising nor undesirable. Unfortunately, this leaves it unclear how to get at the definition of class<-. The secret for viewing the definitions of these strangely named functions is to enclose the function name in backticks:

1
2
3
`class<-`
 
# function (x, value)  .Primitive("class<-")

Mistake Three: Wasting Time on `<<-`

Before I learned that the trick to writing setter methods is to return an edited copy of the changed object, I was convinced that I needed to evaluate an assignment in the caller's scope. While pursuing this idea, I discovered the <<- operator. It's quite an interesting operator, but it's not the solution to the setter method definition problem.

You see, <<- is an assignment operator that will go searching through enclosing scopes (more precisely, R's environments) until it finds a variable that can be assigned to. Here's an example of how you could use <<- to do something surprising in R:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
i <- 1
 
"++" <- function()
{
  i <<- i + 1
}
 
"++"()
 
print(i)
 
# [1] 2
 
"++"()
 
print(i)
 
# [1] 3

This is pretty cool. In fact, it was cool enough that I wasted a good thirty minutes playing with it before realizing that it wasn't helping anything.

Mistake Four: UseMethod is Magic

Figuring out how to write code that calls UseMethod for functions with many arguments is not trivial: UseMethod's behavior is a little magical, and the R help docs are as spartan as ever with regard to calling UseMethod. If you naively try to write a function like this,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
i <- 2
 
class(i) <- 'myint'
 
raise.power <- function(x, power)
{
  UseMethod('raise.power', x, power)
}
 
raise.power.myint <- function(x, power)
{
  return(as.numeric(x ** power))
}
 
raise.power(i, 3)

you'll succeed at defining a generic method that generates warnings every single time it's run. This is because you are never allowed to pass more than two arguments to UseMethod. You see, UseMethod automatically discovers the arguments passed to its caller when it's called and otherwise completely ignores the arguments passed to it after the second argument. For that reason, what you need to write is

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
i <- 2
 
class(i) <- 'myint'
 
raise.power <- function(x, power)
{
  UseMethod('raise.power', x)
}
 
raise.power.myint <- function(x, power)
{
  return(as.numeric(x ** power))
}
 
raise.power(i, 3)

with the latter arguments dropped from the call to UseMethod. If you're like me, you'll probably feel that you're going to lose value along the way the first few times that you write this. I suggest you try this approach out, convince yourself that it works, and then just take advantage of the UseMethod voodoo to solve your problems.


Dec 14 2009

Object-Oriented Programming in R: The Setter Methods

With a little guidance from the indefatigable Hadley Wickham, I figured out today how to implement the setter methods that were missing from my example user class. To review, let’s rebuild the getter methods for my user object:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
user <- list(id = 1,
             password = '286755fad04869ca523320acce0dc6a4',
             email = 'jmw@johnmyleswhite.com')
 
class(user) <- 'user'
 
id <- function(x)
{
  UseMethod('id', x)
}
 
id.user <- function(user)
{
  return(user[['id']])
}
 
password <- function(x)
{
  UseMethod('password', x)
}
 
password.user <- function(user)
{
  return(user[['password']])
}
 
email <- function(x)
{
  UseMethod('email', x)
}
 
email.user <- function(user)
{
  return(user[['email']])
}

With these working, my goal was to write setter methods that would work like the ideal code below:

1
2
3
4
5
6
id(user) <- 2
 
if (id(user) == 2)
{
  print("Succeeded in editing the user's id attribute.")
}

Defining this sort of setter method turned out to require a little effort. I wasted quite a lot of time walking down dead ends, but, thankfully, Hadley gave me the exact piece of information I was missing early this morning. The dead ends were interesting in themselves, though, so I’ll review them in another post tomorrow.

For now I’ll just explain the correct implementation for my desired setter methods, which you can see below:

1
2
3
4
5
6
7
8
9
10
'id<-' <- function(x, value)
{
  UseMethod('id<-', x)
}
 
'id<-.user' <- function(x, value)
{
  x[['id']] <- value
  return(x)
}

There are three important ideas here:

  1. The generic setter method for id should be called 'id<-', which must always be quoted to prevent parsing errors.
  2. The class level setter method should be called 'id<-.user', which isn't surprising, though I had imagined it might be called 'id.user<-' at one point.
  3. The setter method should make a change to a copy of the original object and return the edited copy to the caller. R handles the assignment of this edited return value to the original object behind the scenes. In fact, not using this return value approach will make otherwise plausible looking code fail to edit your object correctly. I stumbled on that for quite a while before I was shown the light.

With this in mind, a final user class looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
user <- list(id = 1,
             password = '286755fad04869ca523320acce0dc6a4',
             email = 'jmw@johnmyleswhite.com')
 
class(user) <- 'user'
 
id <- function(x)
{
  UseMethod('id', x)
}
 
id.user <- function(user)
{
  return(user[['id']])
}
 
password <- function(x)
{
  UseMethod('password', x)
}
 
password.user <- function(user)
{
  return(user[['password']])
}
 
email <- function(x)
{
  UseMethod('email', x)
}
 
email.user <- function(user)
{
  return(user[['email']])
}
 
'id<-' <- function(x, value)
{
  UseMethod('id<-', x)
}
 
'id<-.user' <- function(x, value)
{
  x[['id']] <- value
  return(x)
}
 
'password<-' <- function(x, value)
{
  UseMethod('password<-', x)
}
 
'password<-.user' <- function(x, value)
{
  x[['password']] <- value
  return(x)
}
 
'email<-' <- function(x, value)
{
  UseMethod('email<-', x)
}
 
'email<-.user' <- function(x, value)
{
  x[['email']] <- value
  return(x)
}

Dec 13 2009

The Most Basic Elements of Object-Oriented Programming in R

Until recently, I’ve never had any reason to learn how to define my own classes in R. Having learned this week, I was surprised to find out how easy it is to start implementing classes in R. If you know nothing about creating classes and class methods in R, here’s a very quick overview of the three core ideas behind R’s object system. If you already know the basics of defining classes in R, I’d suggest skipping this post, as it’s not likely to be of much value to you.

The first thing you should know is that the S3 object-oriented system feels like a hack that was put on top of the original S language. If you’re familiar with Perl, you’ll feel right at home with the basic ideas behind this approach; you’ll probably also appreciate the conceptual simplicity of the results.

With that in mind, you should be aware that any normal piece of R code is filled with objects, because the base packages and most well-designed user packages use lots of custom classes. You can see this for yourself by starting to use class to perform introspection on some of the items in your programs:

1
2
3
4
5
6
7
8
9
10
11
12
13
class(1)
# [1] "numeric"
 
class('a')
# [1] "character"
 
x <- 1:10
y <- 5 * x + rnorm(length(x), 0, 1)
 
fit <- lm(y ~ x)
 
class(fit)
# [1] "lm"

More interesting than looking at the classes of existing objects is defining new classes of your own. Thankfully, class, like names and many other functions in R, can be assigned to directly, like so:

1
2
3
4
5
6
a <- 1
 
class(a) <- 'octonion'
 
class(a)
# [1] "octonion"

If you’re familiar with Perl, you can think of this assignment to class as analogous to using bless. If you’re not a Perl user, hopefully this approach to defining classes still makes sense to you, because it’s so simple: the only thing that decides the class of an object is the value you’ve set for the class attribute. To convince yourself that the class of an object is just an attribute, you can use attr:

1
attr(a, 'class')

This simple metadata driven approach to building objects is the core of the S3 object system, as far as I can tell.

Of course, setting class by itself is pretty useless: you want to be able to define class methods. That’s where the other two main ideas come in. The first trick is to use generic methods to get polymorphism out of the language; the second trick is to define methods on your classes using a simple naming convention. To see how this works, let’s use an example that should be familiar to anyone who’s ever built a database-backed website.

Instead of my would-be octonion class, we’ll consider a user object. To define an object of class ‘user’, we can do the following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
user <- list(id = 1,
             password = '41bfe136a536b7749104415bc364df2e',
             email = 'jmw@johnmyleswhite.com')
 
class(user) <- 'user'
 
user
 
#$id
#[1] 1
#
#$password
#[1] "41bfe136a536b7749104415bc364df2e"
#
#$email
#[1] "jmw@johnmyleswhite.com"
#
#attr(,"class")
#[1] "user"

The first thing that comes to mind after building this object is that we should define getter and setter methods for accessing and modifying the contents of this user object. For example, we want a way to get the id attribute for our object. We’d also like our approach to generalize to other objects with an id attribute. Ideally, we should be able to write something like this:

1
id(user)

Now, it is obviously possible to define an id function that operates only on user objects, but that’s not the right approach if we also want to have another sort of object that would have an id method as well. This polymorphism concern is the problem that generic methods and specialized naming conventions solve.

First, we are going to define a specialized id method that only operates on user objects. Because it will only work on ‘user’ objects, we’ll call it id.user:

1
2
3
4
5
6
7
id.user <- function(user.object)
{
	return(user.object[['id']])
}
 
id.user(user)
# [1] 1

Given this, we can then define a generic method that will operate on objects of many classes and reroute our general function calls to the correct class-level method:

1
2
3
4
5
6
7
id <- function(object)
{
	UseMethod('id', object)
}
 
id(user)
# [1] 1

Here UseMethod just searches for an id.user function, finds it and then calls it with user as its argument. If we had a ‘profile’ object, id would search for an id.profile function and call that. You can see this by trying id on a variable whose class we set to be profile using class.

1
2
3
4
5
6
7
8
profile <- list(id = 1)
 
class(profile) <- 'profile'
 
id(profile)
 
# Error in UseMethod("id", user) : 
#  no applicable method for 'id' applied to an object of class "profile"

With these ideas in mind, it’s easy to do something similar for the rest of our attributes as well:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
password.user <- function(user.object)
{
	return(user.object[['password']])
}
 
password <- function(user)
{
	UseMethod('password', user)
}
 
email.user <- function(user.object)
{
	return(user.object[['email']])
}
 
email <- function(user)
{
	UseMethod('email', user)
}

Defining setter methods is a little more tricky. We could use a trick like the eval hacks I used to implement push and pop recently, but Hadley Wickham wisely chastened me for doing that. I’m still trying to decide what’s the best approach.

As I see it, there are least three possible method call styles you might define:

1
2
3
4
5
user <- password(user, 'new_value')
 
password(user, 'new_value')
 
password(user) <- 'new_value'

I would really like to start implementing this last sort, but I haven’t figured out how to yet. The others can be implemented using the generic methods approach I just outlined above, along with some ugly calls to eval for the second call style. To implement the first, simply do something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
password.user <- function(user, new.value = NULL)
{
  if (is.null(new.value))
  {
    return(user[['password']])
  }
  else
  {
    user[['password']] <- new.value
    return(user)
  }
}
 
password <- function(user, new_value = NULL)
{
  UseMethod('password', user, new_value)
}

If you have suggestions on how to implement the third approach, please let me know. Also, I am not at all happy with the use of NULL in my current setter implementation, because that makes it impossible to set the value of an attribute to NULL. If people have suggestions, I’d very much appreciate them. Does R implement arity-specific function definitions?


Dec 11 2009

A Lot of Deaths are Partly Self-Induced

I’m a little surprised by Andrew Gelman’s post today, doubting the wisdom of a passage from Gary Becker’s work that reads:

According to the economic approach, therefore, most (if not all!) deaths are to some extent “suicides” in the sense that they could have been postponed if more resources had been invested in prolonging life.

I disagree with Gelman in finding this passage unimpressive or even slightly ridiculous. I think there is an important insight that Becker is trying to push on: we humans have far more control over our lives than we care to admit. I also think that not admitting the extent of our power to control our lives exculpates us for many of our failures, which is why it is one of the more popular approaches to dealing with the problems in our lives. As La Rochefoucauld always pointed out, nothing trumps self-love.

And I don’t think that emphasizing the unexpectedly large amount of control we have in our lives is just an idea that the economic approach has to offer us. I think you’d get the same idea out of reading Sartre’s discussion of a young man choosing whether or not to fight in the French Resistance in Existentialism is a Humanism.

But, to focus on the thrust of Gelman’s argument, my problems are really with the following passage:

My impression, though, is that people are dying all the time without wanting to do so, and Becker’s argument seems pretty silly to me. To spell it out in a little more detail: Suppose a person is standing on the sidewalk and is run over by an out-of-control car. I don’t see how you can call it suicidal of the pedestrian that the driver was not paying attention. Nor do I see it as suicidal if someone develops kidney cancer and dies, nor do I see it as suicidal if a kid is playing and falls out of a high window, or if someone in the Middle East is hit by a bomb while sitting in a school, etc. Beyond this, just as there used to be millions of people who smoked and died of cancer before people knew that smoking kills, I’m sure there are millions of people now doing something that they don’t realize is dangerous.

The first section is worrisome, because it seems to be catering to the availability bias: we humans are almost certain to overestimate the number of deaths caused by car accidents, so, by referring to it without explicit numbers, we’re likely to anchor ourselves on inaccurate quantities. In general, once any emotional topic comes up, I feel the need to find something more immutable than intuitions to trust in. Once a kid falls out of a window, rationality tends to follow them.

Now, I freely concede that all of the examples Gelman lists are not at all suicide-like, but what I worry about is the relative importance of those causes of death amid the frequent causes of death, at least in the U.S. I especially worry about this given the last two sentences of the above-cited passage: why focus on the “millions of people now doing something that they don’t realize is dangerous?” Why not focus instead on the millions of people killing themselves by smoking, fully cognizant of their self-destruction all the while? What possible reason would you have for focusing on one or the other?

I’d say that, in general, you can focus on those who smoke if you want to prove Becker right, and that you can focus on those who don’t if you want to prove him wrong. Either way, I am reasonably certain that you’ll implicitly distort your internal estimates of the relative numbers.

With that in mind, I think the right approach is to get some plausible estimate from real data. Here’s a simplistic start:

(1) Go to the CDC website and find a reasonably current list of the leading causes of death in the U.S. here.

(2) Mark each item as a possible point for Becker’s case. This is clearly debatable, but I’ll take as granted that “heart disease” (over-eating and under-exercising), “chronic lower respiratory diseases” (smoking), “diabetes” (over-eating and under-exercising), and “influenza and pneumonia” (refusing vaccination) are sometimes self-induced. My own bias is to assume that they’re usually self-induced, but that’s pure intuition, so please don’t trust that idea. In fact, what I’d really like is for you, dear reader, to prove me wrong with more elaborate statistics.

With these marked data points, I get a data set that looks like this:

Cause of Death Deaths Partly Self-Induced
Heart disease 631636 1
Cancer 559888 0
Stroke (cerebrovascular diseases) 137119 0
Chronic lower respiratory diseases 124583 1
Accidents (unintentional injuries) 121599 0
Diabetes 72449 1
Alzheimer’s disease 72432 0
Influenza and Pneumonia 56326 1
Nephritis, nephrotic syndrome, and nephrosis 45344 0
Septicemia 34234 0

Copying and pasting the resulting table and running two lines of R code lets me avoid the relevant mental arithmetic:

1
2
3
death.data <- read.csv('death_data.csv', header = TRUE, sep = '\t')
 
with(subset(death.data, Partly.Self.Induced == 1), sum(Deaths)) / with(death.data, sum(Deaths))

And I get 48% of all deaths as being self-induced. That’s definitely not “almost all”, but plainly a close call for “most.” Obviously this number is dubious at best: my point is only that so many cognitive biases operate in thinking about this sort of issue that you have to use numbers as a form of mental hygiene. Obviously pushing this issue into the realm of statistics only endangers my claims, since Gelman is so much more capable than I am as a statistician, but I think my general point is accurate: whether a given person thinks the numbers support Becker or disprove him says more about that person’s general outlook on life than about the actual question at hand.


Dec 10 2009

Times Series Methods versus Recurrence Relations

This term, I’ve been sitting in on Rene Carmona’s course on Modern Regression and Time Series Analysis. Much of the material on regression covered in the course was familiar to me already, but I’ve never felt that I had a real command of times series analysis methods.

When Carmona defined the AR(p) model in class a few weeks ago, it struck me that, though I’d seen the defining equation several times before, I’d never realized earlier that the AR(p) model subsumes all possible linear recurrence relations. Also, the AR(p) model has the nice property that, if you already know the correct value of p, fitting the AR(p) model can be done with an ordinary least squares regression.

With these observations in mind, I decided to see how well I could derive the formulas for simple recurrence relations from a small data set. The results I got on my 2.4 GHz Intel Core 2 Duo MacBook are a useful case study in the dangers of naively using the default methods for fitting AR(p) models, as well as a particularly clear example of the inevitable inaccuracies in floating point arithmetic.

I hope John D. Cook will forgive me for using the Fibonacci sequence as my example. While I totally agree with John that the Fibonacci sequence is not the ideal object to study if you’re interested in day-to-day programming tasks, its simplicity makes it perfect for understanding how recurrence relations work.

The workhouse for fitting an AR(p) model in R is, predictably, the ar function. To see how well it would work for my purposes, I stored the first 15 Fibonacci terms in a vector and ran ar using all of its defaults settings. Here’s the results:

1
2
3
4
5
6
7
8
9
10
11
12
fibs <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610)
 
ar(fibs)
 
#Call:
#ar(x = fibs)
#
#Coefficients:
#     1  
#0.5922  
#
#Order selected 1  sigma^2 estimated as  21590

These results are pretty terrible: the order for the model is chosen to be 1, which is clearly wrong. Given the wrong order, it’s no surprise that the estimated coefficient is off, though it’s strange that the result is so far off from the ideal coefficient for an order 1 model, which is (1 + sqrt(5)) / 2, or 1.618034. Thankfully, you can force ar to use the order you want by overriding some of the defaults:

1
2
3
4
5
6
7
8
9
10
ar(fibs, order.max = 2, aic = FALSE)
 
#Call:
#ar(x = fibs, aic = FALSE, order.max = 2)
#
#Coefficients:
#      1        2  
# 0.6108  -0.0315  
#
#Order selected 2  sigma^2 estimated as  23366

You choose your preferred order using order.max, but this will only be an upper bound if you allow the function to use AIC scores to determine the order of the AR(p) model.

To figure out what was going wrong, I decided to use lm instead of ar. To do that, I needed subsets of my input data:

1
2
3
fibs.1 <- fibs[1:(length(fibs) - 2)]
fibs.2 <- fibs[2:(length(fibs) - 1)]
fibs.3 <- fibs[3:length(fibs)]

Given these subset inputs, the call to lm is simple:

1
2
3
4
5
6
7
8
lm(fibs.3 ~ fibs.1 + fibs.2)
 
#Call:
#lm(formula = fibs.3 ~ fibs.1 + fibs.2)
#
#Coefficients:
#(Intercept)       fibs.1       fibs.2  
#  2.365e-14    1.000e+00    1.000e+00

As you can see, lm gets the right results, more or less. The non-zero intercept value is unfortunate, but suggests how easily floating point errors slip into these calculations.

Having gotten good results with lm, I decided to review ar a bit more: this led me to the conclusion that I should try using the method = 'ols' setting instead of method = 'yule-walker'.

1
2
3
4
5
6
7
8
9
10
11
12
ar(fibs, order.max = 2, method = 'ols')
 
#Call:
#ar(x = fibs, order.max = 2, method = "ols")
#
#Coefficients:
#1  2  
#1  1  
#
#Intercept: 106.4 (6.715e-07) 
#
#Order selected 2  sigma^2 estimated as  6.276e-17

This clearly works, though I find the output line about the intercept term confusing. I have to say that I’m a little surprised that the Yule-Walker method gives such bad results in this example: I’m not sure yet whether this is caused by the small sample size, a data set that can be fit without any error, intrinsic problems with the method, or something else I can’t even conceive of.

Knowing that ar could work if the OLS method was enforced, I decided to try letting the AIC have its way again after reintroducing this method level constraint:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
ar(fibs, method = 'ols')
 
#Call:
#ar(x = fibs, method = "ols")
#
#Coefficients:
#1  2  
#1  1  
#
#Intercept: 106.4 (6.715e-07) 
#
#Order selected 2  sigma^2 estimated as  6.276e-17 
#Warning message:
#In ar.ols(x, aic = aic, order.max = order.max, na.action = na.action,  :
#  model order: 3singularities in the computation of the projection
#  matrixresults are only valid up to model order2

As you can see, this also works, though there’s an error that I assume is a result of having input data that can be perfectly fit. In short, I think the take away lesson is that you can easily find the formula for recurrence relations using ar as long as you make sure you use ordinary least squares for fitting the various possible models.

Just to confirm that the OLS method would also find the ideal coefficient if forced to use an order 1 model, I ran ar one last time:

1
2
3
4
5
6
7
8
9
10
11
12
ar(fibs, order.max = 1, method = 'ols')
 
#Call:
#ar(x = fibs, order.max = 1, method = "ols")
#
#Coefficients:
#     1  
#1.6182  
#
#Intercept: 65.74 (0.05852) 
#
#Order selected 1  sigma^2 estimated as  0.04309

That result is satisfying and further confirms that the problems I had at the start are entirely attributable to using the Yule-Walker method with this data set.

[EDIT 12.11.2009: Replaced unfortunate term 'sparse' with non-overloaded word 'small'.]


Dec 9 2009

R Function Usage Frequencies, Addendum

Since people have asked, here is a GitHub page with all of the code used to generate my R function usage analyses: cran-function-usage-analysis.


Dec 9 2009

Abstract Data Type Operations in R

This morning, I got a chance to read enough of the R Language Definition to finish my implementations of push and pop. While I was at it, I also wrote implementations of unshift, shift, queue and dequeue. Here they are:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
push <- function(vector, item)
{
  vector.lvalue.symbol <- substitute(vector)
  new.expression <- paste(vector.lvalue.symbol,
                          ' <- c(', vector.lvalue.symbol, ', ', item, ')',
                          sep = '')
  eval(parse(text = new.expression),
       sys.frame(sys.parent()))
}
 
pop <- function(vector)
{
  vector.lvalue.symbol <- substitute(vector)
  temp.env <- new.env()
  last.element <- eval(parse(text = paste(vector.lvalue.symbol,
                                          '[length(', vector.lvalue.symbol, ')]',
                                          sep = '')),
                       sys.frame(sys.parent()))
  assign('tmp', last.element, envir = temp.env)
  eval(parse(text = paste(vector.lvalue.symbol,
                          ' <- ', vector.lvalue.symbol,
                          '[-length(', vector.lvalue.symbol, ')]',
                          sep = '')),
       sys.frame(sys.parent()))
  return(get('tmp', envir = temp.env))
}
 
unshift <- function(vector, item)
{
  vector.lvalue.symbol <- substitute(vector)
  new.expression <- paste(vector.lvalue.symbol,
                          ' <- c(', item, ', ', vector.lvalue.symbol, ')',
                          sep = '')
  eval(parse(text = new.expression), sys.frame(sys.parent()))
}
 
shift <- function(vector)
{
  vector.lvalue.symbol <- substitute(vector)
  temp.env <- new.env()
  last.element <- eval(parse(text = paste(vector.lvalue.symbol,
                                          '[1]',
                                          sep = '')),
                       sys.frame(sys.parent()))
  assign('tmp', last.element, envir = temp.env)
  eval(parse(text = paste(vector.lvalue.symbol,
                          ' <- ', vector.lvalue.symbol,
                          '[-1]',
                          sep = '')),
       sys.frame(sys.parent()))
  return(get('tmp', envir = temp.env))
}
 
queue <- function(vector, item)
{
  vector.lvalue.symbol <- substitute(vector)
  new.expression <- paste(vector.lvalue.symbol,
                          ' <- c(', vector.lvalue.symbol, ', ', item, ')',
                          sep = '')
  eval(parse(text = new.expression), sys.frame(sys.parent()))
}
 
dequeue <- function(vector)
{
  vector.lvalue.symbol <- substitute(vector)
  temp.env <- new.env()
  last.element <- eval(parse(text = paste(vector.lvalue.symbol,
                                          '[1]',
                                          sep = '')),
                       sys.frame(sys.parent()))
  assign('tmp', last.element, envir = temp.env)
  eval(parse(text = paste(vector.lvalue.symbol,
                          ' <- ', vector.lvalue.symbol,
                          '[-1]',
                          sep = '')),
       sys.frame(sys.parent()))
  return(get('tmp', envir = temp.env))
}

In general, the secret to writing these pseudo-macros is to use substitute. For the three functions that need to return a value as well as edit the passed parameter, you also need to use new.env, assign and get to edit the relevant symbol tables during function execution.

To check that these functions work, try the following examples after defining the functions above:

1
2
3
4
5
6
7
8
9
10
11
12
13
v <- c(1)
push(v, 2)
v
pop(v) == 2
v
unshift(v, 0)
v
shift(v) == 0
v
queue(v, 2)
v
dequeue(v) == 1
v

Dec 8 2009

R Function Usage Frequencies, Take 2

Yesterday, Hadley Wickham commented on my post on the frequency of calling various R functions that it would be helpful to have the number of packages that call a function in addition to the number of times that the function is called. I compiled the relevant data last night: you can grab it here This data set includes a row for every function I found, indexed by each of the packages and files in which it was used. At this higher level of resolution, I record the number of times each function was called.

To get a sense of the correspondence between these measures, below I’ve plotted the number of packages using each function in my data set against the log number of times each function is called:

package_function_frequencies.png

And here’s a new top 25 most called functions table:

Function Packages Using Function
function 1903
if 1846
c 1795
length 1791
list 1679
for 1656
return 1559
stop 1538
paste 1526
rep 1512
matrix 1419
is.null 1413
sum 1396
max 1368
cat 1308
names 1297
is.na 1241
min 1216
cbind 1175
nrow 1158
sqrt 1157
t 1134
print 1120
class 1120
seq 1098


Dec 7 2009

Implementing Push and Pop in R

Having grown up with Perl, there are two functions that I desperately miss while programming in R: push and pop. Continually writing

1
vector <- c(vector, new.entry)

tries my patience, while writing

1
2
vector <- rep(NA, inscrutable.constant)
vector[inscrutable.index] <- new.entry

makes me feel like I’m programming in C, rather than a higher-level programming language. That said, here’s a simplistic hack to provide something like an implementation of push and pop in R:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
push <- function(vector.name, item)
{
  eval.parent(parse(text = paste(vector.name, ' <- c(', vector.name, ', ', item, ')',
                                 sep = '')),
              n = 1)
}
 
pop <- function(vector.name)
{
  eval.parent(parse(text = paste(vector.name, ' <- ', 
                                 vector.name, '[-length(', vector.name, ')]',
                                 sep = '')),
              n = 1)
}

Both of these functions are more than a little ugly, because you have to pass in a string that names the vector you want to change, rather than providing its name as a bareword. Even worse, this version of pop doesn’t let you get the value of the item you pop off of the stack, because I’m not sure how to introduce a temporary variable into the parent’s environment without occasionally clobbering the value of an existing variable. If I knew more about scoping and lazy evaluation in R, I think I could implement these two functions as pseudo-macros and solve both concerns. If you know how to do this, please do let me know.