Feb 27 2009

The Fairness Doctrine Dies

Bless you, American Senate, for knocking down The Fairness Doctrine. You’ve done a good deed today.

Attempts to make society just based on naive numerical equality procedures strike me as uniformly insipid.


Feb 25 2009

Text Processing in R

On a regular basis, I have to process text in R. I invariably find that I need a function whose name or usage I can’t bring to mind. To help my future self, I’m writing this review of R’s built-in text processing functions. Hopefully, this review will also be of use to others.

Character Vectors == Arrays of Strings
The first source of confusion for me is the R type system. In R, a string is considered to be a character vector, but an R character vector would be an array of strings in any other programming language. Consider the following example:

str = 'string'
str[1] # This evaluates to 'string'.

To get access to the individual characters in an R string, you need to use the substr function:

str = 'string'
substr(str, 1, 1) # This evaluates to 's'.

For the same reason, you can’t use length to find the number of characters in a string. You have to use nchar instead.
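Here is a small sketch of the difference between the two; the variable names are only illustrative:

```r
word = 'string'
length(word)    # 1: a character vector holding a single string
nchar(word)     # 6: the number of characters in that string

words = c('a', 'bb', 'ccc')
length(words)   # 3: three strings in the vector
nchar(words)    # 1 2 3: nchar works on each element in turn
```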

But let’s go back to substr. The first argument to substr is a character vector, the second is the index of the first character you want, and the third is the index of the last character you want. So you can also use substr as follows:

str = 'string'
substr(str, 1, 2) == 'st'
substr(str, 5, 6) == 'ng'

As you can see, substr lets you access the individual characters of a string using an indexing/slicing strategy.

To break strings apart into vectors of characters, you can use the strsplit function, which works a lot like the split function in Perl. Here’s an example:

strsplit('0-0-1', '-') # Evaluates to list(c('0', '0', '1'))
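Because strsplit accepts a whole vector of strings, it returns a list with one element per input string. The sketch below (variable names are mine) shows one way to get at the pieces, and how an empty split pattern yields individual characters:

```r
parts = strsplit('0-0-1', '-')
parts[[1]]      # the character vector "0" "0" "1"
unlist(parts)   # the same pieces, flattened out of the list

# An empty split pattern breaks a string into its characters.
chars = strsplit('string', '')[[1]]
chars[1]        # "s"
```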

Putting Things Back Together Again
Now that you can pull strings apart, you need to be able to put the characters back together again into strings. You can do this using paste. paste is an idiosyncratic function: it is the main function for concatenating strings in R, but through its collapse argument it also handles the work of more sophisticated functions like Perl’s join. Try the following:

str1 = 'first'
str2 = 'second'
print(paste(str1, str2))

As you’ll see, there’s an odd space added to the output. That’s because paste has an optional sep argument, the separator placed between the strings being combined, which defaults to a single space. So,

paste('first', 'second') == paste('first', 'second', sep = ' ')

You can get rid of the space by specifying an empty string as the separator instead:

print(paste('first', 'second', sep = ''))
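The join-like behavior of paste comes from its collapse argument, which merges the elements of a single vector into one string. A short sketch follows; note that paste0 is only available in newer versions of R, where it acts as a shorthand for sep = '':

```r
chars = c('0', '0', '1')
paste(chars, collapse = '-')   # "0-0-1", much like Perl's join('-', @chars)
paste(chars, collapse = '')    # "001"

# sep joins corresponding elements; collapse then joins the results.
paste(c('a', 'b'), c('x', 'y'), sep = '.', collapse = '+')   # "a.x+b.y"

paste0('first', 'second')      # "firstsecond"
```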

Changing Case
To change the case of strings or individual characters, you need to use the tolower and toupper functions. You can use these with substr to make a function that turns most common words into their title case form:

pseudo.titlecase = function(str)
{
	substr(str, 1, 1) = toupper(substr(str, 1, 1))
	return(str)
}
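A few calls show how these pieces behave. The function is repeated so the snippet runs on its own, and the phrase used at the end is just an example input:

```r
pseudo.titlecase = function(str)
{
	substr(str, 1, 1) = toupper(substr(str, 1, 1))
	return(str)
}

toupper('hello')                         # "HELLO"
tolower('HeLLo')                         # "hello"
pseudo.titlecase('hello')                # "Hello"
pseudo.titlecase(c('first', 'second'))   # "First" "Second"

# Capitalize each word of a phrase by splitting it and rejoining the pieces.
words = strsplit('the frozen ocean', ' ')[[1]]
paste(pseudo.titlecase(words), collapse = ' ')   # "The Frozen Ocean"
```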

With a little more sophistication, you can make a full title case function à la John Gruber. The result of my attempt to do this is fairly long, so I’ve posted it to my GitHub account. I’ll probably see about adding it to the R repository at some point if I can incorporate enough features to make it worth using. If you’re interested in helping or using what I’ve written, you can check out the code here.

Finally, there is a chartr function that translates characters in the input into the corresponding characters you select. For instance, you might try this:

chartr('abc', 'XYZ', 'abcabc') # Evaluates to "XYZXYZ".

This will remind Perl users of tr, which I personally never use. Nevertheless, it’s nice knowing that it’s there in R, albeit with a slightly different name.

Substring Containment
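You might also want to know whether a string matches the beginning of another string or of any string in a set. You can do this using the charmatch function, which performs partial matching: it returns the index of the matching element, 0 if the match is ambiguous, and NA if nothing matches: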

charmatch("m",   c("mean", "mode")) # returns 0
charmatch("med", c("mean", "median")) # returns 2

I tend to use regular expressions by the time I would need substring matching, so I’m not sure if I would ever use charmatch in practice.
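For containment anywhere inside a string, one option is the grep family with fixed = TRUE, which treats the pattern as a literal string rather than a regular expression. This is a sketch that assumes a version of R recent enough to provide grepl:

```r
candidates = c('mean', 'median', 'mode')

grepl('ed', candidates, fixed = TRUE)      # FALSE TRUE FALSE
grep('ed', candidates, fixed = TRUE)       # 2: indices of the matching elements
regexpr('ed', 'median', fixed = TRUE)[1]   # 2: position of the first match

# charmatch only looks at the beginning of each candidate.
charmatch('ed', candidates)                # NA: nothing starts with "ed"
```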

Future Ideas
For more sophisticated text processing, you would want to use regular expressions and the grep family of functions. I’ll have to read about them and write up something about their use in the future. R also implements approximate matching based on Levenshtein edit distances (the agrep function), but I haven’t tried using that yet.
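As a first taste of the approximate matching mentioned above, here is a minimal sketch using agrep with its default tolerance. The adist call assumes a newer version of R, where that function is available in the utils package:

```r
# agrep returns the indices of the elements that approximately contain the pattern.
hits = agrep('lasy', '1 lazy 2')
hits   # 1: "lazy" is one substitution away from "lasy"

# adist reports the edit distance between two strings as a matrix.
distance = adist('kitten', 'sitting')[1, 1]
distance   # 3
```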


Feb 25 2009

Your Disdain for This Banal Existence

Today’s Rolcats image is certainly my favorite so far.


Feb 24 2009

The Frozen Ocean

I rarely write about music, but I thought that I should recommend The Frozen Ocean, the new band from Dave Swanson of Life in Your Way.


Feb 22 2009

Norvig and Partisanship

Peter Norvig has a great post on his website about models used to claim that the primary system induces greater partisanship in our elected officials.

I’d encourage everyone to read it, if for no other reason than to see for yourself that Norvig really does call someone his “homey” while writing about simulated election results.

In general, I think Norvig’s website is one of the treasure troves of knowledge on the Internet. Where else can you learn about the evils of PowerPoint and then learn about the virtues of Bayesian spelling correctors?


Feb 20 2009

Efficiency versus Readability

Every programming language, and every programmer, must trade off between writing code that is readable and producing code that executes the minimum number of instructions needed to perform a task. The trade-off comes up constantly, because it is almost always possible to make code more efficient at the cost of making it less understandable. For example, iterating over an entire list can be inefficient if you are only looking for three elements that may well be at the front of the list, but code that implements a generic search of the entire list (think grep) is usually much simpler to understand.
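The list example can be sketched in R as follows. The data and function names are hypothetical, and in R itself the vectorized version is frequently faster in practice because explicit loops are interpreted, so this only illustrates the shape of the trade-off:

```r
values = c(7, 3, 9, 3, 5, 3, 8, 3)

# Readable: scan the whole vector, then keep the first three matches.
first.three.simple = head(which(values == 3), 3)

# Fewer comparisons: stop as soon as three matches have been found.
first.three.early = function(xs, target, n = 3)
{
	found = integer(0)
	for (i in seq_along(xs))
	{
		if (xs[i] == target)
		{
			found = c(found, i)
			if (length(found) == n) break
		}
	}
	return(found)
}

first.three.early(values, 3)   # 2 4 6, the same as first.three.simple
```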


Feb 17 2009

Pearson vs. Spearman Correlation Coefficients

One of the misuses of statistical terminology that annoys me most is the use of the word “correlation” to describe any relationship in which one variable increases as another variable increases. This monotonic trend seems worth looking for, but it plainly is not what most people measure when they use standard correlation coefficients. This is because the Pearson product-moment correlation coefficient, which is usually the only correlation coefficient students learn to calculate, is strongly biased towards linear trends: those in which a variable y is a noisy linear function of a variable x. A rank-based coefficient like the Spearman correlation coefficient, which is usually not taught to students, is what actually detects a general monotonic trend. You can easily see this for yourself by computing the correlation coefficient between x and progressively higher-degree polynomials in x.

Pearson vs Spearman.png

If the Pearson correlation coefficient actually detected monotonic trends, it wouldn’t plunge towards zero as the degree of the polynomial in x increases. The Spearman correlation coefficient does not plunge in this way: it depends only on the ranks of the values, so it stays high for any monotonic relationship.
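A comparison along these lines can be reproduced with R’s built-in cor function. This is only a sketch: the range of x and the set of degrees are my own choices, not necessarily the ones behind the figure.

```r
x = 1:100
degrees = c(1, 2, 5, 10, 20)

# Correlation between x and x^k under each method.
pearson  = sapply(degrees, function(k) cor(x, x^k, method = 'pearson'))
spearman = sapply(degrees, function(k) cor(x, x^k, method = 'spearman'))

print(round(pearson, 3))    # falls steadily as the degree increases
print(round(spearman, 3))   # stays at 1, since x^k preserves the ordering of x
```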

I hope that we can reconcile our intuitive thinking and our statistical practice by ending the self-contradiction in which the word “correlation” is used in discourse to describe the behavior of an ideal Spearman correlation coefficient, while in practice correlations are computed using Pearson’s formula.


Feb 17 2009

The End of The Exclusionary Rule?

Dear Supreme Court,

Please do not end the Exclusionary Rule. We do not need to encourage the police to disregard the constitutional requirement for warrants.

Thank you.


Feb 16 2009

Yale Courses Online

Yale has updated its impressive set of videotaped lectures. For those interested in automating the download of the videos for any course, the script below should be useful. You’ll need to install the Perl module WWW::Mechanize before you can run the script. You’ll also want to update the list of course URLs to reflect the courses that you want to download.

#!/usr/local/bin/perl
 
use strict;
use warnings;
 
use WWW::Mechanize;
use File::Spec;
 
my $mech = WWW::Mechanize->new();
 
my @courses = (
    'http://oyc.yale.edu/astronomy/frontiers-and-controversies-in-astrophysics/content/downloads',
    'http://oyc.yale.edu/economics/financial-markets/content/downloads'
);
 
for my $course (@courses)
{
    $mech->get($course);
 
    for my $link ($mech->find_all_links)
    {
        # Match only .mov files; skip links with no text to avoid warnings.
        if ($link->url =~ m/\.mov$/ and defined $link->text and $link->text =~ m/high/i)
        {
            print $link->url;
            print $link->text;
            print "\n\n";
 
            my (undef, undef, $filename) = File::Spec->splitpath($link->url);
 
            print "$filename\n";
 
            $mech->get( $link->url, ':content_file' => $filename );
        }
    }
}

Feb 16 2009

Arnold Kling on Liberaltarianism

Arnold Kling has just written an interesting response to Ross Douthat’s piece in The Atlantic on Liberaltarianism. I agree with a great deal of what Arnold says, especially his claim that the role of libertarian thinkers in America is “to restrain the power-hungry elites in both parties.”