Cross Validation
This package implements several cross validation schemes: Kfold, StratifiedKfold, LOOCV, RandomSub, and StratifiedRandomSub. Each scheme is an iterable object whose elements are vectors of indices (the indices of the samples selected for training).
Cross Validation Schemes
- Kfold(n, k)
k-fold cross validation over a set of n samples, which are randomly partitioned into k disjoint validation sets of nearly equal size. This generates k training subsets, each of length about n*(1-1/k).
    julia> collect(Kfold(10, 3))
    3-element Array{Any,1}:
     [1,3,4,6,7,8,10]
     [2,5,7,8,9,10]
     [1,2,3,4,5,6,9]
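Because each element is a vector of training indices, the matching validation indices are the samples left out of it. The following is a minimal sketch, assuming Julia 1.x with MLBase installed; the variable names are illustrative only.

```julia
using MLBase

n, k = 10, 3
train_sets = collect(Kfold(n, k))

# The validation indices of a fold are the samples missing from its training vector.
val_sets = [setdiff(1:n, train) for train in train_sets]

for (fold, val) in enumerate(val_sets)
    println("fold ", fold, ": train on ", train_sets[fold], ", validate on ", val)
end
```

Since the validation sets are disjoint, every sample should be held out exactly once across the k folds.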
- StratifiedKfold(strata, k)
Like Kfold, but the indices in each stratum (defined by the unique values of an iterator strata) are distributed approximately equally across the k folds. Each stratum should have at least k members.
    julia> collect(StratifiedKfold([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 3))
    3-element Array{Any,1}:
     [1,2,4,6,8,9,10]
     [3,4,5,7,8,10]
     [1,2,3,5,6,7,9]
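One way to see the effect of stratification is to tally the labels of the held-out samples in each fold. This is a sketch under the same assumptions (Julia 1.x, MLBase installed); the tallying code is illustrative and not part of the package.

```julia
using MLBase

strata = [:a, :a, :a, :b, :b, :c, :c, :a, :b, :c]
n, k = length(strata), 3

train_sets = collect(StratifiedKfold(strata, k))
held_out_sets = [setdiff(1:n, train) for train in train_sets]

for (fold, held_out) in enumerate(held_out_sets)
    # Count how many held-out samples fall in each stratum.
    counts = Dict(s => count(isequal(s), strata[held_out]) for s in unique(strata))
    println("fold ", fold, ": held out ", held_out, ", per-stratum counts ", counts)
end
```

The printed counts vary from run to run because the partition is random, but each stratum should be spread roughly evenly over the folds.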
- LOOCV(n)
Leave-one-out cross validation over a set of n samples.
    julia> collect(LOOCV(4))
    4-element Array{Any,1}:
     [2,3,4]
     [1,3,4]
     [1,2,4]
     [1,2,3]
- RandomSub(n, sn, k)
Repeated random subsampling. Specifically, this generates k subsets of length sn from a data set with n samples.
    julia> collect(RandomSub(10, 5, 3))
    3-element Array{Any,1}:
     [1,2,5,8,9]
     [2,5,7,8,10]
     [1,3,5,6,7]
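As the sample output suggests, the subsets produced by RandomSub can share indices, so the held-out sets can overlap between rounds, in contrast to Kfold. Below is a small sketch, assuming Julia 1.x with MLBase installed; the variable names are illustrative.

```julia
using MLBase

n, sn, k = 10, 5, 3
subsets = collect(RandomSub(n, sn, k))

# Treat everything outside a subset as held out for that round.
held_out = [setdiff(1:n, s) for s in subsets]

for (r, s) in enumerate(subsets)
    println("round ", r, ": train on ", s, ", held out ", held_out[r])
end
```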
- StratifiedRandomSub(strata, sn, k)
Like RandomSub, but the indices in each stratum (defined by the unique values of an iterator strata) are distributed approximately equally across the k subsets. sn should be greater than the number of strata, so that each stratum can be represented in each subset.
    julia> collect(StratifiedRandomSub([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 7, 5))
    5-element Array{Any,1}:
     [1,2,3,4,6,7,9]
     [1,3,4,6,8,9,10]
     [1,3,5,7,8,9,10]
     [1,2,4,7,8,9,10]
     [1,2,3,4,5,6,10]
Cross Validation Function
The package also provides a function, cross_validate, to run a cross validation procedure.
- cross_validate(estfun, evalfun, n, gen)
Run a cross validation procedure.
Parameters:
- estfun – The estimation function, which takes a vector of training indices as input and returns a learned model, as:
    model = estfun(train_inds)
- evalfun – The evaluation function, which takes a model and a vector of testing indices as input and returns a score that indicates the goodness of the model, as:
    score = evalfun(model, test_inds)
- n – The total number of samples.
- gen – An iterable object that provides training indices, e.g., one of the cross validation schemes listed above.
Returns: a vector of scores obtained in the multiple runs.
Example:
    # A simple example to demonstrate the use of cross validation
    #
    # Here, we consider a simple model: using a mean vector to represent
    # a set of samples. The goodness of the model is assessed in terms
    # of the RMSE (root-mean-square-error) evaluated on the testing set
    #

    using MLBase

    # functions
    compute_center(X::Matrix{Float64}) = vec(mean(X, 2))

    compute_rmse(c::Vector{Float64}, X::Matrix{Float64}) =
        sqrt(mean(sum(abs2(X .- c),1)))

    # data
    const n = 200
    const data = [2., 3.] .+ randn(2, n)

    # cross validation
    scores = cross_validate(
        inds -> compute_center(data[:, inds]),        # training function
        (c, inds) -> compute_rmse(c, data[:, inds]),  # evaluation function
        n,              # total number of samples
        Kfold(n, 5))    # cross validation plan: 5-fold

    # get the mean and std of the scores
    (m, s) = mean_and_std(scores)
Please refer to examples/crossval.jl for the entire script.
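To make the roles of estfun, evalfun, n, and gen concrete, the procedure can be sketched as a plain loop. This is not the package's source: cross_validate_sketch is a hypothetical helper, written for Julia 1.x, and it assumes that the testing indices for each run are the samples in 1:n that were not selected for training.

```julia
# Hypothetical helper (not part of MLBase): a minimal loop with the same
# inputs and output as described for cross_validate.
function cross_validate_sketch(estfun, evalfun, n::Integer, gen)
    scores = Float64[]
    for train_inds in gen
        # Assumption: test on the samples in 1:n that were not used for training.
        test_inds = setdiff(1:n, train_inds)
        model = estfun(train_inds)
        push!(scores, evalfun(model, test_inds))
    end
    return scores
end

# Toy data: the "model" is the mean of the training values, and the
# score is the mean absolute error on the held-out values.
x = [1.0, 2.0, 3.0, 4.0]
estfun = inds -> sum(x[inds]) / length(inds)
evalfun = (m, inds) -> sum(abs.(x[inds] .- m)) / length(inds)

# Leave-one-out training sets for 4 samples, written out by hand.
gen = [[2, 3, 4], [1, 3, 4], [1, 2, 4], [1, 2, 3]]

scores = cross_validate_sketch(estfun, evalfun, length(x), gen)
println(scores)
```

Any iterable of training index vectors can stand in for gen here, which is why the schemes above can be passed to cross_validate directly.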