A few months ago I decided to apply word frequency analysis ideas to R code. My idea was simple: the functions that one should invest the most effort into learning are precisely those functions that are used most frequently in real world code. In fact, this simple idea can be applied to spoken languages as valuably as to programming languages: the words that will most improve your ability to understand day-to-day speech in a foreign language are precisely the words that occur most frequently in day-to-day speech. Our language classrooms tend to screw this up by emphasizing words like “spoon” over “dude”, even though people my age will use the word “dude” many more times a day than they’ll use “spoon”.1 In large part, this is because the usefulness of a word tends to be confounded with its respectability when you learn a language in an academic setting. This over-emphasis on respectability is why you’re never taught curse words, even though they’re as practically useful as the majority of other words.
But I digress. Returning to R, the attached CSV data set contains the results of my word frequency analysis. To produce this data set, I spidered all of the source code for every R package on CRAN, split the resulting corpus text into lexical tokens, and counted the occurrences of each token. To make things simpler, I decided to treat every token followed by a parenthesized expression as a function, even though this gives me some syntactical units like if in my list of functions.
I think the resulting raw data set is probably as useful as any summaries I can offer of it, but I’ll offer some obvious results for those interested.
Function call frequencies in R, unsurprisingly, seem to follow Zipf’s law or some variety thereof. You can see this fairly clearly in the following logarithmic plot:

And here are the top 25 most frequently used functions in R, including some syntactical units:
| Function Name | Occurrences in Corpus |
|---|---|
| if | 333421 |
| c | 143123 |
| function | 137490 |
| length | 109847 |
| paste | 62906 |
| cat | 56199 |
| return | 56001 |
| stop | 54161 |
| is.null | 53575 |
| list | 51862 |
| log | 49479 |
| for | 48950 |
| rep | 39090 |
| names | 34439 |
| sum | 31475 |
| as.integer | 29417 |
| matrix | 29344 |
| is.na | 20230 |
| dim | 20202 |
| max | 19995 |
| nrow | 19789 |
| as.double | 19705 |
| attr | 18802 |
| t | 17889 |
| 17461 |
Some fun extensions to this work would be to use this frequency data set as a standard for the abnormality of code style: you could compare an individual programmer’s code to the standard frequency data to see which functions such-and-such a programmer tends to overuse and underuse. I personally tend to underuse stop and I never use attr at all.
Another interesting project would be to use this data set as input to a text classifier that would attempt to predict the author of a piece of R code based on token frequency information, in line with Mosteller and Wallace’s famous analysis of the Federalist Papers or anti-spam programs>.
- To this day, I still mix up the words for “spoon” (“cuchara”) and “knife” (“cuchillo”) in Spanish, thought I never make any mistakes with slang words like “dude” (“tio”).↩
This is neat! It would also be useful to have the number of packages each function is used in.
Thanks, Hadley. I’ll produce a data set containing the function name, package name and occurrence count sometime tonight; then I can produce a list of functions ranked by the number of packages they occur in.
And that is why almost all of my scripts include
len <- function(x) length(x)
"gth" is just too much extra fluff for such a frequently used function.