The State of Statistics in Julia

Updated 12.2.2012: Added sample output based on a suggestion from Stefan Karpinski.

Introduction

Over the last few weeks, the Julia core team has rolled out a demo version of Julia’s package management system. While the Julia package system is still very much in beta, it nevertheless provides the first plausible way for non-expert users to see where Julia’s growing community of developers is heading.

To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.

Getting Julia Installed

If you’d like to try out Julia for yourself, you’ll first need to clone the current Julia repo from GitHub and then build Julia from source as described in the Julia README. Compiling Julia for the first time can take up to two hours, but updating Julia afterwards will be quite fast once you’ve gotten a working copy of the language and its dependencies installed on your system. After you have Julia built, you should add its main directory to your path and then open up the Julia REPL by typing julia at the command line.

Installing Packages

Once Julia’s REPL is running, you can use the following commands to start installing packages:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
julia> require("pkg")
 
julia> Pkg.init()
Initialized empty Git repository in /Users/johnmyleswhite/.julia/.git/
Cloning into 'METADATA'...
remote: Counting objects: 443, done.
remote: Compressing objects: 100% (208/208), done.
remote: Total 443 (delta 53), reused 423 (delta 33)
Receiving objects: 100% (443/443), 38.98 KiB, done.
Resolving deltas: 100% (53/53), done.
[master (root-commit) dbd486e] empty package repo
 2 files changed, 4 insertions(+)
 create mode 100644 .gitmodules
 create mode 160000 METADATA
 create mode 100644 REQUIRE
 
julia> Pkg.add("DataFrames", "Distributions", "MCMC", "Optim", "NHST", "Clustering")
Installing DataFrames: v0.0.0
Cloning into 'DataFrames'...
remote: Counting objects: 1340, done.
remote: Compressing objects: 100% (562/562), done.
remote: Total 1340 (delta 760), reused 1229 (delta 655)
Receiving objects: 100% (1340/1340), 494.79 KiB, done.
Resolving deltas: 100% (760/760), done.
Installing Distributions: v0.0.0
Cloning into 'Distributions'...
remote: Counting objects: 49, done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 49 (delta 8), reused 49 (delta 8)
Receiving objects: 100% (49/49), 17.29 KiB, done.
Resolving deltas: 100% (8/8), done.
Installing MCMC: v0.0.0
Cloning into 'MCMC'...
warning: no common commits
remote: Counting objects: 155, done.
remote: Compressing objects: 100% (97/97), done.
remote: Total 155 (delta 66), reused 140 (delta 51)
Receiving objects: 100% (155/155), 256.68 KiB, done.
Resolving deltas: 100% (66/66), done.
Installing NHST: v0.0.0
Cloning into 'NHST'...
remote: Counting objects: 20, done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 20 (delta 2), reused 19 (delta 1)
Receiving objects: 100% (20/20), 4.31 KiB, done.
Resolving deltas: 100% (2/2), done.
Installing Optim: v0.0.0
Cloning into 'Optim'...
remote: Counting objects: 497, done.
remote: Compressing objects: 100% (191/191), done.
remote: Total 497 (delta 318), reused 476 (delta 297)
Receiving objects: 100% (497/497), 79.68 KiB, done.
Resolving deltas: 100% (318/318), done.
Installing Options: v0.0.0
Cloning into 'Options'...
remote: Counting objects: 10, done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 10 (delta 1), reused 6 (delta 0)
Receiving objects: 100% (10/10), done.
Resolving deltas: 100% (1/1), done.
Installing Clustering: v0.0.0
Cloning into 'Clustering'...
remote: Counting objects: 38, done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 38 (delta 7), reused 38 (delta 7)
Receiving objects: 100% (38/38), 7.77 KiB, done.
Resolving deltas: 100% (7/7), done.

That will get you started with some of the core tools for doing statistical programming in Julia. You’ll probably also want to install another package called “RDatasets”, which provides access to 570 of the classic data sets available in R. This package has a much larger file size than the others, which is why I recommend installing it after you’ve first installed the other packages:

1
2
3
4
5
6
7
8
9
10
require("pkg")
 
julia> Pkg.add("RDatasets")
Installing RDatasets: v0.0.0
Cloning into 'RDatasets'...
remote: Counting objects: 609, done.
remote: Compressing objects: 100% (588/588), done.
remote: Total 609 (delta 21), reused 605 (delta 17)
Receiving objects: 100% (609/609), 10.56 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (21/21), done.

Assuming that you’ve gotten everything working, you can then type the following to load Fisher’s classic Iris data set:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
julia> load("RDatasets")
Warning: redefinition of constant NARule ignored.
Warning: New definition ==(NAtype,Any) is ambiguous with ==(Any,AbstractArray{T,N}).
         Make sure ==(NAtype,AbstractArray{T,N}) is defined first.
Warning: New definition ==(Any,NAtype) is ambiguous with ==(AbstractArray{T,N},Any).
         Make sure ==(AbstractArray{T,N},NAtype) is defined first.
Warning: New definition replace!(PooledDataVec{S},NAtype,T) is ambiguous with replace!(PooledDataVec{S},T,NAtype).
         Make sure replace!(PooledDataVec{S},NAtype,NAtype) is defined first.
Warning: New definition promote_rule(Type{AbstractDataVec{T}},Type{T}) is ambiguous with promote_rule(Type{AbstractDataVec{S}},Type{T}).
         Make sure promote_rule(Type{AbstractDataVec{T}},Type{T}) is defined first.
Warning: New definition ^(NAtype,T<:Union(String,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(NAtype,_<:Integer) is defined first.
Warning: New definition ^(DataVec{T},Number) is ambiguous with ^(Any,Integer).
         Make sure ^(DataVec{T},Integer) is defined first.
Warning: New definition ^(DataFrame,Union(NAtype,Number)) is ambiguous with ^(Any,Integer).
         Make sure ^(DataFrame,Integer) is defined first.
 
julia> using DataFrames
 
julia> using RDatasets
 
julia> iris = data("datasets", "iris")
DataFrame  (150,6)
              Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]        1          5.1         3.5          1.4         0.2    "setosa"
[2,]        2          4.9         3.0          1.4         0.2    "setosa"
[3,]        3          4.7         3.2          1.3         0.2    "setosa"
[4,]        4          4.6         3.1          1.5         0.2    "setosa"
[5,]        5          5.0         3.6          1.4         0.2    "setosa"
[6,]        6          5.4         3.9          1.7         0.4    "setosa"
[7,]        7          4.6         3.4          1.4         0.3    "setosa"
[8,]        8          5.0         3.4          1.5         0.2    "setosa"
[9,]        9          4.4         2.9          1.4         0.2    "setosa"
[10,]      10          4.9         3.1          1.5         0.1    "setosa"
[11,]      11          5.4         3.7          1.5         0.2    "setosa"
[12,]      12          4.8         3.4          1.6         0.2    "setosa"
[13,]      13          4.8         3.0          1.4         0.1    "setosa"
[14,]      14          4.3         3.0          1.1         0.1    "setosa"
[15,]      15          5.8         4.0          1.2         0.2    "setosa"
[16,]      16          5.7         4.4          1.5         0.4    "setosa"
[17,]      17          5.4         3.9          1.3         0.4    "setosa"
[18,]      18          5.1         3.5          1.4         0.3    "setosa"
[19,]      19          5.7         3.8          1.7         0.3    "setosa"
[20,]      20          5.1         3.8          1.5         0.3    "setosa"
  :
[131,]    131          7.4         2.8          6.1         1.9 "virginica"
[132,]    132          7.9         3.8          6.4         2.0 "virginica"
[133,]    133          6.4         2.8          5.6         2.2 "virginica"
[134,]    134          6.3         2.8          5.1         1.5 "virginica"
[135,]    135          6.1         2.6          5.6         1.4 "virginica"
[136,]    136          7.7         3.0          6.1         2.3 "virginica"
[137,]    137          6.3         3.4          5.6         2.4 "virginica"
[138,]    138          6.4         3.1          5.5         1.8 "virginica"
[139,]    139          6.0         3.0          4.8         1.8 "virginica"
[140,]    140          6.9         3.1          5.4         2.1 "virginica"
[141,]    141          6.7         3.1          5.6         2.4 "virginica"
[142,]    142          6.9         3.1          5.1         2.3 "virginica"
[143,]    143          5.8         2.7          5.1         1.9 "virginica"
[144,]    144          6.8         3.2          5.9         2.3 "virginica"
[145,]    145          6.7         3.3          5.7         2.5 "virginica"
[146,]    146          6.7         3.0          5.2         2.3 "virginica"
[147,]    147          6.3         2.5          5.0         1.9 "virginica"
[148,]    148          6.5         3.0          5.2         2.0 "virginica"
[149,]    149          6.2         3.4          5.4         2.3 "virginica"
[150,]    150          5.9         3.0          5.1         1.8 "virginica"
 
julia> head(iris)
DataFrame  (6,6)
          Sepal.Length Sepal.Width Petal.Length Petal.Width  Species
[1,]    1          5.1         3.5          1.4         0.2 "setosa"
[2,]    2          4.9         3.0          1.4         0.2 "setosa"
[3,]    3          4.7         3.2          1.3         0.2 "setosa"
[4,]    4          4.6         3.1          1.5         0.2 "setosa"
[5,]    5          5.0         3.6          1.4         0.2 "setosa"
[6,]    6          5.4         3.9          1.7         0.4 "setosa"
 
julia> tail(iris)
DataFrame  (6,6)
            Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
[1,]    145          6.7         3.3          5.7         2.5 "virginica"
[2,]    146          6.7         3.0          5.2         2.3 "virginica"
[3,]    147          6.3         2.5          5.0         1.9 "virginica"
[4,]    148          6.5         3.0          5.2         2.0 "virginica"
[5,]    149          6.2         3.4          5.4         2.3 "virginica"
[6,]    150          5.9         3.0          5.1         1.8 "virginica"

Now that you can see that Julia can handle complex data sets, let’s talk a little bit about the packages that make statistical analysis in Julia possible.

The DataFrames Package

The DataFrames package provides data structures for working with tabular data in Julia. At a minimum, this means that DataFrames provides tools for dealing with individual columns of missing data, which are called DataVec‘s. A collection of DataVec‘s allows one to build up a DataFrame, which provides a tabular data structure like that used by R’s data.frame type.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
julia> load("DataFrames")
 
julia> using DataFrames
 
julia> data = {"Value" => [1, 2, 3], "Label" => ["A", "B", "C"]}
Warning: imported binding for data overwritten in module Main
{"Label"=>["A", "B", "C"],"Value"=>[1, 2, 3]}
 
julia> df = DataFrame(data)
DataFrame  (3,2)
        Label Value
[1,]      "A"     1
[2,]      "B"     2
[3,]      "C"     3
 
julia> df["Value"]
3-element DataVec{Int64}
 
[1,2,3]
 
julia> df[1, "Value"] = NA
NA
 
 
julia> head(df)
DataFrame  (3,2)
        Label Value
[1,]      "A"    NA
[2,]      "B"     2
[3,]      "C"     3

Distributions

The Distributions package provides tools for working with probability distributions in Julia. It reifies distributions as types in Julia’s large type hierarchy, which means that quite generic names like rand can be used to sample from complex distributions:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
julia> load("Distributions")
julia> using Distributions
 
julia> x = rand(Normal(11.0, 3.0), 10_000)
10000-element Float64 Array:
  6.87693
 13.3676 
  7.25008
  8.82833
 10.6911 
  7.1004 
 13.7449 
  5.96412
  8.57957
 15.27374.89007
 15.1509 
  6.32376
  7.83847
 14.4476 
 14.2974 
  9.74783
  9.67398
 14.4992 
 
julia> mean(x)
11.00366217730023
 
julia> var(x)
Warning: Possible conflict in library symbol ddot_
9.288938550823996

Optim

The Optim package provides tools for numerical optimization of arbitrary functions in Julia. It provides a function, optimize, which works a bit like R’s optim function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
julia> load("Optim")
julia> using Optim
 
julia> f = v -> (10.9 - v[1])^2 + (7.3 - v[2])^2
#<function>
 
julia> initial_guess = [0.0, 0.0]
2-element Float64 Array:
 0.0
 0.0
 
julia> results = optimize(f, initial_guess)
Warning: Possible conflict in library symbol dcopy_
OptimizationResults("Nelder-Mead",[0.333333, 0.333333],[10.9, 7.29994],3.2848148720460163e-9,38,true)
 
julia> results.minimum
2-element Float64 Array:
 10.9    
  7.29994

MCMC

The MCMC package provides tools for sampling from arbitrary probability distributions using Markov Chain Monte Carlo. It provides functions like slice_sampler, which allows one to sample from a (potentially unnormalized) density function using Radford Neal’s slice sampling algorithm.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
julia> load("MCMC")
 
julia> using MCMC
 
julia> d = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> f = x -> logpdf(d, x)
#<function>
 
julia> [slice_sampler(0.0, f) for i in 1:100]
100-element (Float64,Float64) Array:
 (2.7589100475626323,-106.49522613611775) 
 (22.840595204318323,-16.323492094305458) 
 (0.11800384424353683,-148.35766451986206)
 (25.507580447082677,-34.68325273534245)  
 (25.794565860846134,-37.08275877393945)  
 (25.898128716394307,-37.96887853221083)  
 (9.309878825853284,-32.76010551023705)   
 (30.824102772255355,-92.50490745818972)  
 (9.108789186504177,-34.38504372063516)   
 (25.547686903330494,-35.01363502992266)(5.795001414731885,-66.98643477086263)   
 (15.50115292212293,-2.518925467219337)   
 (12.046429369881345,-14.666455009726143) 
 (17.25455052645699,-0.919566865791911)   
 (25.494698549206657,-34.57747767488159)  
 (1.8340810959111111,-120.36165311809079) 
 (2.7112428736526177,-107.18901820771696) 
 (9.21203292192012,-33.54571459047587)    
 (19.12274407701784,-2.5984139591266584)

NHST

The NHST package provides tools for testing standard statistical hypotheses using null hypothesis significance testing tools like the t-test and the chi-squared test.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
julia> load("Distributions")
 
julia> using Distributions
 
julia> load("NHST")
 
julia> using NHST
 
julia> d1 = Normal(17.29, 1.0)
Normal(17.29,1.0)
 
julia> d2 = Normal(0.0, 1.0)
Normal(0.0,1.0)
 
julia> x = rand(d1, 1_000)
1000-element Float64 Array:
 15.7085
 18.585 
 16.6036
 18.962 
 17.8715
 16.6814
 17.9676
 16.8924
 16.6022
 17.981317.1339
 17.3964
 18.6184
 16.7238
 18.5003
 16.1618
 17.9198
 17.4928
 18.715 
 
julia> y = rand(d2, 1_000)
1000-element Float64 Array:
  0.664885 
  0.147182 
  0.96265  
  0.24282  
  1.881    
 -0.632478 
  0.539297 
  0.996562 
 -0.483302 
  0.5146292.06249  
 -0.549444 
  0.857575 
 -1.47464  
 -2.33243  
  0.510751 
 -0.381069 
 -1.49165  
  0.0521203
 
julia> t_test(x, y)
HypothesisTest("t-Test",{"t"=>392.2838409538002},{"df"=>1989.732411290855},0.0,[17.1535, 17.3293],{"mean of x"=>17.24357323225425,"mean of y"=>0.0021786523177457794},0.0,"two-sided","Welch Two Sample t-test","x and y",1989.732411290855)

Clustering

The Clustering package provides tools for doing simple k-means style clustering.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
julia> load("Clustering")
 
julia> using Clustering
 
julia> srand(1)
 
julia> n = 100
100
 
julia> x = vcat(randn(n, 2), randn(n, 2) .+ 10)
200x2 Float64 Array:
  0.0575636  -0.112322 
 -1.8329     -0.101326 
  0.370699   -0.956183 
  1.31816    -1.44351  
  0.787598    0.148386 
  0.712214   -1.293    
 -1.8578     -1.06208  
 -0.746303   -0.0439182
  1.12082    -2.00616  
  0.364646   -1.0933110.1974     10.5583   
 11.0832      8.92082  
 11.5414     11.6022   
  9.0453     11.5093   
  8.86714    10.4233   
 10.7336     10.7201   
  8.60415     9.13942  
  8.62482     8.51701  
 10.5044     10.3841   
 
julia> true_assignments = vcat(zeros(n), ones(n))
200-element Float64 Array:
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.0
 0.01.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0
 
julia> results = k_means(x, 2)
Warning: Possible conflict in library symbol dgesdd_
Warning: Possible conflict in library symbol dsyrk_
Warning: Possible conflict in library symbol dgemm_
KMeansOutput([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1  ...  2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],2x2 Float64 Array:
 -0.0166203  -0.248904
 10.0418     10.0074  ,3,422.9820560670007,true)
 
julia> results.assignments
200-element Int64 Array:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 12
 2
 2
 2
 2
 2
 2
 2
 2

While all of this software is still quite new and often still buggy, being able to work with these tools through a simple package systems had made me more excited than ever before about the future of Julia as a language for data analysis. There is, of course, one thing conspicuously lacking right now: a really powerful visualization toolkit for interactive graphics like that provided by R’s ggplot2 package. Hopefully something will come into being within the next few months.

18 responses to “The State of Statistics in Julia”

  1. Tim

    What might be really nice would be to effectively leapfrog the state of much R graphics and do something like ‘tabplotd3′, which pipes everything to d3 via Rook. Julia seems ideally suited to this step as a default and it would provide capabilities currently not found elsewhere. I hope it’s doable. I’ve written a few packages that spit out (semi-static) d3 visualizations from R and they were always a big hit.

  2. Tim

    By the way, thanks much for this post and others like it regarding Julia for stats. Much obliged.

  3. bootz

    Julia seems interesting, but I am missing a “Julia vs. X” or “Why Julia?” blog posts (with something more then benchmark results and REPL access…)

  4. Stefan Karpinski

    @bootz: Does this blog post from when we originally announced Julia back in February fit the bill?

    http://julialang.org/blog/2012/02/why-we-created-julia/

    Or is that too “visiony”? There’s also a technical report which has a fair amount of discussion of the motivations behind creating the language:

    http://julialang.org/images/julia-dynamic-2012-tr.pdf

  5. Alex

    I was very impressed by pkg package, unfortunately didn’t manage to follow post after RDataset load:

    julia
    _
    _ _ _(_)_ | A fresh approach to technical computing
    (_) | (_) (_) | Documentation: http://docs.julialang.org
    _ _ _| |_ __ _ | Type “help()” to list help topics
    | | | | | | |/ _` | |
    | | |_| | | | (_| | | Version 0.0.0+103529310.r2458.dirty
    _/ |\__’_|_|_|\__’_| | Commit 2458fa9da4 (2012-12-03 09:47:36)*
    |__/ |

    julia> load(“RDatasets”)
    Warning: redefinition of constant NARule ignored.
    Assertion failed: (CastInst::castIsValid(opc, C, Ty) && “Invalid constantexpr cast!”), function getCast, file Constants.cpp, line 1366.
    Illegal instruction: 4

    not sure where to dig – on mac mountain lion with homebrew (from https://github.com/staticfloat/homebrew-julia).

  6. Stefan Karpinski

    That seems like a recompilation error where you have mismatched pieces of the binary being linked together for illegal operation hilarity. I’m not sure how to force a clean recompile on homebrew. Do you mind opening an issue on GitHub? Either under the main julia repo [https://github.com/JuliaLang/julia] or the homebrew one.

  7. Alex

    Done in homebrew-julia: https://github.com/staticfloat/homebrew-julia/issues/18
    Stefan are you maintaining it?

  8. Stefan Karpinski

    Thanks. I don’t, but Elliot Saba does and he’s very active. I’m also watching that repo. We’ll get you sorted out.

  9. John Thompson

    Tim, which packages have you written for d3 visualizations in R?

  10. Tim

    None are released to CRAN as of yet, for a number of reasons. I could update and release d3.chord, as the paper for which that was originally written has evolved, and the focus shifted a little bit. Take a couple hundred people with a particular malignancy, genotype them all (exome or whole genome) and do several other assays while you’re at it, and think about how to visualize co-occurring mutations to get a grip on what might be driving the disease process in various cases. Early on, it seemed like regular old chord plots were a good way to do this. Later on, as mutations piled up, we thought, there has to be a better way to present this than a standard Circos plot (and of course, there is).

    d3.chord is a simple library built to take cross-tabs of binary covariates and turn them into a chord plot. It’s old enough (over a year) now that it needs to be updated to work with recent versions of d3; the only use case I’ve tried for it is my own. But it could be pushed out if there’s interest.

    r2d3 (Hadley’s? package) seems to be idling; I had abandoned most efforts to push anything in anticipation of him coming up with a unified means of cranking out d3 plots. But special-purpose libraries for specific plots might not be the worst thing, while waiting for that to come together.

    Unlike tabplotd3, the output from d3.chord is static — the matrix is written into a JavaScript object and all of the manipulations happen upon it. On the other hand, this means it can be published easily.

    I’ll stick an old version of d3.chord onto GitHub (minus some unpublished data) and see what happens.

  11. Eivind Eklund

    I’m going to give a quick critique of aspects of your packaging system that I hope can save your from a situation that can cost your language a lot of momentum over the long run.

    Let me first give some quick credentials: I’ve been involved in operating systems packaging since 1996, though not actively for a while. I’ve also been heavily involved in the design of a packaging system for Ruby (which ultimately failed for a combination of effort and political reasons, though it enjoyed popularity for a short while.)

    First, your ultimate goal for writing a packaging system should be

    “Make it easy for end users of your language to install libraries for their development needs”.

    It is easy to get distracted in various seemingly useful goals like

    “Provide tools to programmers for packaging their programs and libraries” (programmers are not necessarily the best people to do packaging, as their interests do not fully align with the end users’ interests, and packaging occasionally requires expert knowledge that goes outside the requirements for a normal programmer – including discussion between expert packagers to find the right solution to a particular case.)

    “Make it easy for end users of programs in your language to install a program (including dependencies).” (Bad on two levels: First, an end user should be able to install many programs, without concern for whether they conflict. Second, an end user usually do not care about the programming language that something is written in, so he or she should often not need to know about your packaging system at all.)

    “Provide the most powerful tools possible.” (You want to provide the right tools for supporting the goal of users having a consistent experience; extra powerful tools in the hands on non-expert packagers can easily result in inconsistent experiences.)

    This leads up to what I call the “standard mistake” in programming language packaging: Any programmer can package and upload to the repository, with maximum flexibility in what can be done with dependencies. You’ve included the ability to downgrade versions of other packages – this is neat, and unfortunately likely to be very problematic in practice.

    In contrast, the most functional model I’ve seen for packaging is what we may call the operating systems model of packaging (used by FreeBSD, Debian, etc):

    – There exists a repository of working programs/libraries. This is kept internally functioning at all times.
    – When a developer release a new version of a library/program, a packager will look at it in context and see how it can be incorporated in the repository.
    – Any inclusion of an update includes updating the dependent packages so they keep working.
    – Packager can do patches to the software to make sure it is reasonably packaging compliant.
    – Version number handling is taken care of by the packager in addition to the developer; if a package revision contains breaking changes, additional formal version numbers are updated by the packager.
    – If a package update contains enough breaking changes that it is not feasible to update the dependencies, the packager may do a package split – where e.g. you end up with foo1 and foo2, even though the original developer intended foo2 to completely replace foo1.
    – There’s often run some kind of branch system that only does backward compatible updates along a branch, so a user can keep updating along that branch and have little risk of breakage, while still getting some new features and security fixes.
    – Packagers give feedback to developers, leading to expert guidance of developers in how to do things correctly so it works well.

    This model has large advantages for end user use of the packages. It means they have the ability to install packages that work together, and that they have a central group that takes care of keeping the invariant, and doing basic quality control on the packages (ie, that they build and install and follow some basic conventions.)

    It has some disadvantages too, of course:
    – It creates more centralized work in packaging, which means you need somebody to do that
    – It leads to some minor conflicts with developers who are given extra requirements
    – It leads to slightly slower updates in general
    – It creates a minor barrier to entry for developers

    One trade off that may be the optimal one for a new language is to set up a repository of this form (dependencies, one explicit edge, dependencies verified to work before a new version is allow out), but allow developers to get write access very, very easily for their own packages, with guidance from somebody that acts as packaging experts. This can transition to a full packager-oriented model (less developer write access) as the system matures and more complicated cases come up.

    I hope this is useful; feel free to contact me at eeklund@gmail.com if you’d like to discuss anything more.

  12. Alex

    Eivind, apart from system specific packaging these days there significant number of “programmers” tools:

    *pip/setuptools/distutils for python
    * gem – ruby
    * rebar/agner for erlang
    * CPAN for perl
    etc.

    So there is nothing wrong with source level packages as long as it feels developers needs.
    My understanding that guys made “rebar on steroids”, which is quite handy. It’s ease of installation of libraries/development tools which allow language to thrive – on this one I agree with you, but I failed to see how distribution specific packaging helps with it.

  13. Eivind Eklund

    Alex, I’m not saying distribution specific packaging (assuming you by “distribution” mean “operating system”). There is clearly room for packaging done at the language level – waiting for all operating systems to package every library for every language is futile.

    What I’m trying to say is that to get a coherent user experience, it is important to treat libraries as something that work together as a set, and maintain this cohesion so users can get hold of working libraries.

    I don’t know how pip/setuptools are doing – I’ve not ended up using them (my Python programming is mostly in an employer shielded environment) – and I’ve never programmed in Erlang.

    rubygems regularly fails; and in Hackage it seems like it’s pervasive. CPAN seems to only occasionally fail. I suspect CPAN is mainly working due to the active community of librarians that attempt to enforce hygiene for the modules, leading to developers learning what to do, though a formal bugtracker for any module uploaded and a formal way of transferring modules between maintainers even if the present maintainer goes away certainly helps a lot too.

    Even with the best of the programmer based ones I’ve come across, there’s regular problems – while with a system where there are librarians/packagers that gate the updates, it is much easier to get to a point where you have the libraries Just Work for the end users.

    I think this should be the goal – to have libraries that work, not to have tools to put together a situation where one library can work.

  14. Eivind Eklund

    John, feel free to forward.