Cross Validation

This package implements several cross validation schemes: Kfold, StratifiedKfold, LOOCV, RandomSub, and StratifiedRandomSub. Each scheme is an iterable object whose elements are vectors of indices (the indices of the samples selected for training).
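Because each element is a vector of *training* indices, the corresponding test indices are simply the complement with respect to 1:n. A minimal sketch of that pattern (the fold shown below is a hypothetical sample, since the schemes partition at random):

```julia
# Each scheme yields training indices; the test set is the complement.
# With MLBase loaded, this pattern applies to any scheme, e.g.:
#
#     for train_inds in Kfold(n, 3)
#         test_inds = setdiff(1:n, train_inds)
#         # fit on train_inds, evaluate on test_inds
#     end
#
# The complement step itself, with a hypothetical fold and stdlib only:
n = 10
train_inds = [1, 3, 4, 6, 7, 9, 10]   # one possible fold from Kfold(10, 3)
test_inds = setdiff(1:n, train_inds)  # -> [2, 5, 8]
```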

Cross Validation Schemes

Kfold(n, k)

k-fold cross validation over a set of n samples, which are randomly partitioned into k disjoint validation sets of nearly equal size. Iterating yields k training subsets, each of length about n * (1 - 1/k).

julia> collect(Kfold(10, 3))
3-element Array{Any,1}:

StratifiedKfold(strata, k)

Like Kfold, but the indices in each stratum (defined by the unique values of an iterable strata) are distributed approximately equally across the k folds. Each stratum should have at least k members.

julia> collect(StratifiedKfold([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 3))
3-element Array{Any,1}:

LOOCV(n)

Leave-one-out cross validation over a set of n samples.

julia> collect(LOOCV(4))
4-element Array{Any,1}:
 [2,3,4]
 [1,3,4]
 [1,2,4]
 [1,2,3]

RandomSub(n, sn, k)

Repeated random subsampling. This generates k subsets of length sn, each drawn at random from a data set with n samples.

julia> collect(RandomSub(10, 5, 3))
3-element Array{Any,1}:

StratifiedRandomSub(strata, sn, k)

Like RandomSub, but the indices in each stratum (defined by the unique values of an iterable strata) are distributed approximately equally across the k subsets. sn should be greater than the number of strata, so that each stratum can be represented in each subset.

julia> collect(StratifiedRandomSub([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 7, 5))
5-element Array{Any,1}:

Cross Validation Function

The package also provides a function, cross_validate, to run a cross validation procedure.

cross_validate(estfun, evalfun, n, gen)

Run a cross validation procedure.

  • estfun

    The estimation function, which takes a vector of training indices as input and returns a learned model, as:

    model = estfun(train_inds)
  • evalfun

    The evaluation function, which takes a model and a vector of testing indices as input and returns a score that indicates the goodness of the model, as:

    score = evalfun(model, test_inds)
  • n – The total number of samples.
  • gen – An iterable object that provides training indices, e.g., one of the cross validation schemes listed above.

The function returns a vector of the scores obtained in the multiple runs.


# A simple example to demonstrate the use of cross validation
# Here, we consider a simple model: using a mean vector to represent
# a set of samples. The goodness of the model is assessed in terms
# of the RMSE (root-mean-square-error) evaluated on the testing set

using MLBase
using Statistics   # for mean on Julia 1.x

# functions
compute_center(X::Matrix{Float64}) = vec(mean(X, dims=2))

compute_rmse(c::Vector{Float64}, X::Matrix{Float64}) =
    sqrt(mean(sum(abs2.(X .- c), dims=1)))

# data
const n = 200
const data = [2., 3.] .+ randn(2, n)

# cross validation
scores = cross_validate(
    inds -> compute_center(data[:, inds]),        # training function
    (c, inds) -> compute_rmse(c, data[:, inds]),  # evaluation function
    n,              # total number of samples
    Kfold(n, 5))    # cross validation plan: 5-fold

# get the mean and std of the scores
(m, s) = mean_and_std(scores)

Please refer to examples/crossval.jl for the entire script.