Welcome to MLBase’s documentation!
MLBase.jl is a Julia package that provides useful tools for machine learning applications. Think of it as a Swiss Army knife for writing machine learning code.
Dependencies:
- Reexport: to support name reexport
- StatsBase: all names in StatsBase are reexported
- ArrayViews: view is reexported
- Iterators: to support grid search
Data Preprocessing Utilities
The package provides a variety of functions for data preprocessing.
Data Repetition
- repeach(a, n)
Repeat each element of vector a n times. Here n can be either a scalar or a vector with the same length as a.
using MLBase
repeach(1:3, 2)         # --> [1, 1, 2, 2, 3, 3]
repeach(1:3, [3,2,1])   # --> [1, 1, 1, 2, 2, 3]
- repeachcol(a, n)
Repeat each column of matrix a n times. Here n can be either a scalar or a vector with length(n) == size(a,2).
- repeachrow(a, n)
Repeat each row of matrix a n times. Here n can be either a scalar or a vector with length(n) == size(a,1).
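The repetition rule described above can be sketched in plain Julia. The names repeach_sketch and repeachcol_sketch are hypothetical and are not part of MLBase; the block only mirrors the documented behaviour and assumes Julia 1.x.

```julia
# Hypothetical helper (not part of MLBase): repeat each element of `a`.
# `n` is either one count shared by all elements or one count per element.
function repeach_sketch(a::AbstractVector, n::Union{Integer,AbstractVector{<:Integer}})
    counts = n isa Integer ? fill(n, length(a)) : n
    length(counts) == length(a) || throw(DimensionMismatch("n must match length(a)"))
    out = eltype(a)[]
    for (x, c) in zip(a, counts)
        append!(out, fill(x, c))   # emit x exactly c times
    end
    return out
end

# Repeating columns reduces to repeating the column indices.
repeachcol_sketch(A::AbstractMatrix, n) = A[:, repeach_sketch(1:size(A, 2), n)]

repeach_sketch(1:3, 2)              # [1, 1, 2, 2, 3, 3]
repeach_sketch(1:3, [3, 2, 1])      # [1, 1, 1, 2, 2, 3]
repeachcol_sketch([1 2; 3 4], 2)    # [1 1 2 2; 3 3 4 4]
```

Expressing the column case through index repetition keeps a single code path; the row case would index the first dimension in the same way.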
Label Processing
In machine learning, we often need to first assign an integer label to each class. This package provides a type LabelMap that captures the association between discrete values (e.g., a finite set of strings) and integer labels.
Together with LabelMap, the package also provides a function labelmap to construct the map from a sequence of discrete values, and a function labelencode to map discrete values to integer labels.
julia> lm = labelmap(["a", "a", "b", "b", "c"])
LabelMap (with 3 labels):
[1] a
[2] b
[3] c
julia> labelencode(lm, "b")
2
julia> labelencode(lm, ["a", "c", "b"])
3-element Array{Int64,1}:
1
3
2
Note that labelencode can be applied to either a single value or an array.
The package also provides a function groupindices to group indices based on associated labels.
julia> groupindices(3, [1, 1, 1, 2, 2, 3, 2])
3-element Array{Array{Int64,1},1}:
[1,2,3]
[4,5,7]
[6]
# using lm as constructed above
julia> groupindices(lm, ["a", "a", "c", "b", "b"])
3-element Array{Array{Int64,1},1}:
[1,2]
[4,5]
[3]
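The idea behind the label map and index grouping can be sketched with a Dict. All function names below are hypothetical stand-ins rather than MLBase's implementation, and the sketch assumes labels are numbered in first-seen order, as in the examples above.

```julia
# Hypothetical stand-in for labelmap: number the distinct values 1..k.
function labelmap_sketch(xs)
    vals = unique(xs)                                    # first-seen order
    index = Dict(v => i for (i, v) in enumerate(vals))   # value => integer label
    return (vals = vals, index = index)
end

# Encode a single value, or every element of an array.
encode_sketch(lm, x) = lm.index[x]
encode_sketch(lm, xs::AbstractArray) = [lm.index[x] for x in xs]

# Collect, for each label 1..k, the positions at which it occurs.
function groupindices_sketch(k::Integer, labels::AbstractVector{<:Integer})
    groups = [Int[] for _ in 1:k]
    for (i, l) in enumerate(labels)
        push!(groups[l], i)
    end
    return groups
end

lm = labelmap_sketch(["a", "a", "b", "b", "c"])
encode_sketch(lm, "b")                          # 2
encode_sketch(lm, ["a", "c", "b"])              # [1, 3, 2]
groupindices_sketch(3, [1, 1, 1, 2, 2, 3, 2])   # [[1, 2, 3], [4, 5, 7], [6]]
```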
Classification
A classification procedure, no matter how sophisticated it is, generally consists of two steps: (1) assign a score/distance to each class, and (2) choose the class that yields the highest score/lowest distance.
This package provides a function classify and its friends to accomplish the second step, that is, to predict labels based on scores.
- classify(x[, ord])
Classify based on scores given in x and the order of scores specified in ord.
Generally, ord can be any instance of type Ordering. However, it is usually enough to use either Forward or Reverse:
- ord = Forward: higher value indicates better match (e.g., similarity)
- ord = Reverse: lower value indicates better match (e.g., distances)
When ord is omitted, it defaults to Forward.
When x is a vector, it produces an integer label. When x is a matrix, it produces a vector of integers, one for each column of x.
classify([0.2, 0.5, 0.3])            # --> 2
classify([0.2, 0.5, 0.3], Forward)   # --> 2
classify([0.2, 0.5, 0.3], Reverse)   # --> 1
classify([0.2 0.5 0.3; 0.7 0.6 0.2]')            # --> [2, 1]
classify([0.2 0.5 0.3; 0.7 0.6 0.2]', Forward)   # --> [2, 1]
classify([0.2 0.5 0.3; 0.7 0.6 0.2]', Reverse)   # --> [1, 3]
- classify!(r, x[, ord])
Write predicted labels to r.
- classify_withscore(x[, ord])
Return a pair as (label, score), where score is the input score corresponding to the predicted label.
- classify_withscores(x[, ord])
This function applies to a matrix x composed of multiple samples (each being a column). It returns a pair (labels, scores).
- classify_withscores!(r, s, x[, ord])
Write predicted labels to r and corresponding scores to s.
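The second step of classification, picking the best-scoring class, amounts to an argmax or argmin. The following is a minimal sketch in plain Julia 1.x (eachcol needs Julia 1.1 or later); the function names and the higher_is_better keyword are assumptions of this example, standing in for classify and ord.

```julia
# Hypothetical helpers mirroring the described behaviour; `higher_is_better`
# plays the role of ord = Forward (true) or ord = Reverse (false).
classify_sketch(x::AbstractVector; higher_is_better::Bool = true) =
    higher_is_better ? argmax(x) : argmin(x)

# For a matrix, each column holds the scores of one sample.
classify_sketch(X::AbstractMatrix; higher_is_better::Bool = true) =
    [classify_sketch(col; higher_is_better = higher_is_better) for col in eachcol(X)]

# Label together with the score that won.
function classify_withscore_sketch(x::AbstractVector; higher_is_better::Bool = true)
    k = classify_sketch(x; higher_is_better = higher_is_better)
    return (k, x[k])
end

S = [0.2 0.7; 0.5 0.6; 0.3 0.2]                 # 3 classes x 2 samples
classify_sketch(S)                              # [2, 1]
classify_sketch(S; higher_is_better = false)    # [1, 3]
classify_withscore_sketch([0.2, 0.5, 0.3])      # (2, 0.5)
```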
Performance Evaluation
This package provides tools to assess the performance of a machine learning algorithm.
Classification Performance
- correctrate(gt, pred)
Compute the correct rate of the predictions given by pred w.r.t. the ground truths given in gt.
- errorrate(gt, pred)
Compute the error rate of the predictions given by pred w.r.t. the ground truths given in gt.
- confusmat(k, gt, pred)
Compute the confusion matrix of the predictions given by pred w.r.t. the ground truths given in gt. Here, k is the number of classes.
It returns an integer matrix R of size (k, k), such that R[i, j] == countnz((gt .== i) & (pred .== j)).
Examples:
julia> gt = [1, 1, 1, 2, 2, 2, 3, 3];
julia> pred = [1, 1, 2, 2, 2, 3, 3, 3];
julia> C = confusmat(3, gt, pred)   # compute confusion matrix
3x3 Array{Int64,2}:
2  1  0
0  2  1
0  0  2
julia> C ./ sum(C, 2)   # normalize per class
3x3 Array{Float64,2}:
0.666667  0.333333  0.0
0.0       0.666667  0.333333
0.0       0.0       1.0
julia> trace(C) / length(gt)   # compute correct rate from confusion matrix
0.75
julia> correctrate(gt, pred)
0.75
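For readers on Julia 1.x, the same computation can be sketched without MLBase. The function confusmat_sketch is a hypothetical re-implementation of the counting rule for R[i, j], not the package's code. In current Julia, sum(C, 2) is written sum(C, dims = 2) and trace is LinearAlgebra.tr.

```julia
using LinearAlgebra: tr   # Julia 1.x name for the matrix trace

# Hypothetical re-implementation of the counting rule for R[i, j].
function confusmat_sketch(k::Integer, gt::AbstractVector{<:Integer},
                          pred::AbstractVector{<:Integer})
    length(gt) == length(pred) || throw(DimensionMismatch("gt and pred differ in length"))
    R = zeros(Int, k, k)
    for (g, p) in zip(gt, pred)
        R[g, p] += 1        # row = ground truth, column = prediction
    end
    return R
end

gt   = [1, 1, 1, 2, 2, 2, 3, 3]
pred = [1, 1, 2, 2, 2, 3, 3, 3]
C = confusmat_sketch(3, gt, pred)     # [2 1 0; 0 2 1; 0 0 2]
rates = C ./ sum(C, dims = 2)         # normalize per class; each row sums to 1
accuracy = tr(C) / length(gt)         # 0.75
```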
Hit rate (for retrieval tasks)
- hitrate(gt, ranklist, k)
Compute the hit rate of rank k for a ranked list of predictions given by ranklist w.r.t. the ground truths given in gt.
Specifically, if gt[i] is contained in ranklist[1:k, i], then the prediction for the i-th sample is said to hit within rank k. The hit rate of rank k is the fraction of predictions that hit within rank k.
- hitrates(gt, ranklist, ks)
Compute hit rates of multiple ranks (as given by a vector ks). It returns a vector of hit rates r, where r[i] corresponds to the rank ks[i].
Note that computing hit-rates for multiple ranks jointly is more efficient than computing them separately.
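The hit-rate definition, and one way to evaluate several ranks in a single pass, can be sketched as follows. Both functions are hypothetical illustrations in plain Julia 1.x and make no claim about how MLBase implements the joint computation.

```julia
# Hypothetical helper: fraction of samples whose true label appears among the
# top-k entries of the corresponding column of `ranklist`.
function hitrate_sketch(gt::AbstractVector, ranklist::AbstractMatrix, k::Integer)
    hits = count(i -> gt[i] in view(ranklist, 1:k, i), eachindex(gt))
    return hits / length(gt)
end

# One pass per sample: find the rank of the true label once, then compare it
# against every requested k.
function hitrates_sketch(gt::AbstractVector, ranklist::AbstractMatrix,
                         ks::AbstractVector{<:Integer})
    hits = zeros(Int, length(ks))
    for i in eachindex(gt)
        pos = findfirst(==(gt[i]), view(ranklist, :, i))   # nothing if absent
        pos === nothing && continue
        for (j, k) in enumerate(ks)
            pos <= k && (hits[j] += 1)
        end
    end
    return hits ./ length(gt)
end

gt = [1, 2, 3]
ranklist = [1 3 2;
            2 2 1;
            3 1 3]      # column i ranks the candidates for sample i
hitrate_sketch(gt, ranklist, 1)             # 1/3
hitrates_sketch(gt, ranklist, [1, 2, 3])    # [1/3, 2/3, 1.0]
```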
Receiver Operating Characteristics (ROC)
Receiver Operating Characteristics (ROC) analysis is often used to measure the performance of a detector, a thresholded classifier, or a verification algorithm.
The ROC Type
This package uses an immutable type ROCNums defined below to capture the ROC of an experiment:
immutable ROCNums{T<:Real}
    p::T    # positive in ground-truth
    n::T    # negative in ground-truth
    tp::T   # correct positive prediction
    tn::T   # correct negative prediction
    fp::T   # (incorrect) positive prediction when ground-truth is negative
    fn::T   # (incorrect) negative prediction when ground-truth is positive
end
One can compute a variety of performance measurements from an instance of ROCNums (say r):
- true_positive(r)
The number of true positives (r.tp).
- true_negative(r)
The number of true negatives (r.tn).
- false_positive(r)
The number of false positives (r.fp).
- false_negative(r)
The number of false negatives (r.fn).
- true_positive_rate(r)
The fraction of positive samples correctly predicted as positive, defined as r.tp / r.p.
- true_negative_rate(r)
The fraction of negative samples correctly predicted as negative, defined as r.tn / r.n.
- false_positive_rate(r)
The fraction of negative samples incorrectly predicted as positive, defined as r.fp / r.n.
- false_negative_rate(r)
The fraction of positive samples incorrectly predicted as negative, defined as r.fn / r.p.
- recall(r)
Equivalent to true_positive_rate(r).
- precision(r)
The fraction of positive predictions that are correct, defined as r.tp / (r.tp + r.fp).
- f1score(r)
The harmonic mean of recall(r) and precision(r).
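The arithmetic behind these measures can be checked with a small stand-alone sketch. The struct ROCCountsSketch and the *_sketch functions are hypothetical names chosen for this example; the struct uses Julia 1.x syntax and only copies the field layout of ROCNums.

```julia
# Hypothetical stand-in for ROCNums, written with Julia 1.x `struct` syntax.
struct ROCCountsSketch
    p::Int; n::Int; tp::Int; tn::Int; fp::Int; fn::Int
end

recall_sketch(r::ROCCountsSketch)    = r.tp / r.p
precision_sketch(r::ROCCountsSketch) = r.tp / (r.tp + r.fp)
fpr_sketch(r::ROCCountsSketch)       = r.fp / r.n

# Harmonic mean of precision and recall.
function f1_sketch(r::ROCCountsSketch)
    pr, rc = precision_sketch(r), recall_sketch(r)
    return 2 * pr * rc / (pr + rc)
end

# 10 positives (8 found, 2 missed) and 20 negatives (15 rejected, 5 false alarms)
r = ROCCountsSketch(10, 20, 8, 15, 5, 2)
recall_sketch(r)      # 0.8
precision_sketch(r)   # 8/13 ≈ 0.615
fpr_sketch(r)         # 0.25
f1_sketch(r)          # 16/23 ≈ 0.696
```

Note that p == tp + fn and n == tn + fp, which is why the true positive rate and false negative rate always sum to one.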
Computing ROC Curves
The package provides a function roc to compute an instance of ROCNums or a sequence of such instances from predictions.
- roc(gt, pred)
Compute an ROC instance based on ground-truths given in gt and predictions given in pred.
- roc(gt, scores, thres[, ord])
Compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and a threshold thres.
Prediction will be made as follows:
- When ord = Forward: predicts 1 when scores[i] >= thres, otherwise 0.
- When ord = Reverse: predicts 1 when scores[i] <= thres, otherwise 0.
When ord is omitted, it defaults to Forward.
Returns:
- When thres is a single number, it produces a single ROCNums instance;
- When thres is a vector, it produces a vector of ROCNums instances.
Note: Jointly evaluating an ROC curve for multiple thresholds is generally much faster than evaluating for them individually.
- roc(gt, (preds, scores), thres[, ord])
Compute an ROC instance or an ROC curve (a vector of ROC instances) for multi-class classification, based on given predictions, scores and a threshold thres.
Prediction is made as follows:
- When ord = Forward: predicts preds[i] when scores[i] >= thres, otherwise 0.
- When ord = Reverse: predicts preds[i] when scores[i] <= thres, otherwise 0.
When ord is omitted, it defaults to Forward.
Returns:
- When thres is a single number, it produces a single ROCNums instance.
- When thres is a vector, it produces an ROC curve (a vector of ROCNums instances).
Note: Jointly evaluating an ROC curve for multiple thresholds is generally much faster than evaluating for them individually.
- roc(gt, scores, n[, ord])
Compute an ROC curve (a vector of ROC instances), with respect to n evenly spaced thresholds between minimum(scores) and maximum(scores). (See above for details)
- roc(gt, (preds, scores), n[, ord])
Compute an ROC curve (a vector of ROC instances) for multi-class classification, with respect to n evenly spaced thresholds between minimum(scores) and maximum(scores). (See above for details)
- roc(gt, scores, ord)
Equivalent to roc(gt, scores, 100, ord).
- roc(gt, (preds, scores), ord)
Equivalent to roc(gt, (preds, scores), 100, ord).
- roc(gt, scores)
Equivalent to roc(gt, scores, 100, Forward).
- roc(gt, (preds, scores))
Equivalent to roc(gt, (preds, scores), 100, Forward).
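The thresholding rule for the binary case, and the construction of n evenly spaced thresholds, can be sketched in plain Julia 1.x. Everything here is an assumption of the example rather than MLBase's code: the name roc_counts_sketch, the coding of gt as 1 for positive and 0 for negative, the higher_is_better keyword standing in for ord, and the inclusion of both endpoints in the threshold grid. The sketch also evaluates each threshold independently, so it does not reproduce the faster joint evaluation mentioned above.

```julia
# Hypothetical sketch: count outcomes for one threshold.
function roc_counts_sketch(gt::AbstractVector{<:Integer}, scores::AbstractVector{<:Real},
                           thres::Real; higher_is_better::Bool = true)
    tp = tn = fp = fn = 0
    for (g, s) in zip(gt, scores)
        predicted_pos = higher_is_better ? s >= thres : s <= thres
        if g == 1
            predicted_pos ? (tp += 1) : (fn += 1)
        else
            predicted_pos ? (fp += 1) : (tn += 1)
        end
    end
    return (p = tp + fn, n = tn + fp, tp = tp, tn = tn, fp = fp, fn = fn)
end

# A vector of thresholds yields one set of counts per threshold.
roc_counts_sketch(gt, scores, thres::AbstractVector; kwargs...) =
    [roc_counts_sketch(gt, scores, t; kwargs...) for t in thres]

gt     = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.6, 0.1, 0.7]
r = roc_counts_sketch(gt, scores, 0.5)     # tp = 2, fn = 1, fp = 1, tn = 1

# n evenly spaced thresholds covering the observed score range
thresholds = LinRange(minimum(scores), maximum(scores), 5)
curve = roc_counts_sketch(gt, scores, thresholds)   # 5 sets of counts
```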
Cross Validation
This package implements several cross validation schemes: Kfold, LOOCV, and RandomSub. Each scheme is an iterable object, of which each element is a vector of indices (indices of samples selected for training).
Cross Validation Schemes
- Kfold(n, k)
k-fold cross validation over a set of n samples, which are randomly partitioned into k disjoint validation sets of nearly equal size. This generates k training subsets of length about n*(1-1/k).
julia> collect(Kfold(10, 3))
3-element Array{Any,1}:
[1,3,4,6,7,8,10]
[2,5,7,8,9,10]
[1,2,3,4,5,6,9]
- StratifiedKfold(strata, k)
Like Kfold, but the indices in each stratum (defined by the unique values of an iterator strata) are distributed approximately equally across the k folds. Each stratum should have at least k members.
julia> collect(StratifiedKfold([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 3))
3-element Array{Any,1}:
[1,2,4,6,8,9,10]
[3,4,5,7,8,10]
[1,2,3,5,6,7,9]
- LOOCV(n)
Leave-one-out cross validation over a set of n samples.
julia> collect(LOOCV(4))
4-element Array{Any,1}:
[2,3,4]
[1,3,4]
[1,2,4]
[1,2,3]
- RandomSub(n, sn, k)
Repeated random subsampling. Specifically, this generates k subsets of length sn from a data set with n samples.
julia> collect(RandomSub(10, 5, 3))
3-element Array{Any,1}:
[1,2,5,8,9]
[2,5,7,8,10]
[1,3,5,6,7]
- StratifiedRandomSub(strata, sn, k)
Like RandomSub, but the indices in each stratum (defined by the unique values of an iterator strata) are distributed approximately equally across the k subsets. sn should be greater than the number of strata, so that each stratum can be represented in each subset.
julia> collect(StratifiedRandomSub([:a, :a, :a, :b, :b, :c, :c, :a, :b, :c], 7, 5))
5-element Array{Any,1}:
[1,2,3,4,6,7,9]
[1,3,4,6,8,9,10]
[1,3,5,7,8,9,10]
[1,2,4,7,8,9,10]
[1,2,3,4,5,6,10]
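How the unstratified schemes produce training index sets can be sketched in plain Julia 1.x. The three functions below are hypothetical and return plain vectors instead of iterable scheme objects; the way folds are cut from a random permutation is an assumption of this sketch, not necessarily how MLBase partitions the samples.

```julia
using Random: randperm

# Hypothetical sketches; each returns a vector of training index sets.

# k-fold: validation fold j takes every k-th entry of a random permutation,
# and the training set is its complement within 1:n.
function kfold_sketch(n::Integer, k::Integer)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    perm = randperm(n)
    return [sort(setdiff(1:n, perm[j:k:n])) for j in 1:k]
end

# Leave-one-out: round i trains on everything except sample i.
loocv_sketch(n::Integer) = [[j for j in 1:n if j != i] for i in 1:n]

# Random subsampling: k independent draws of sn distinct indices.
randomsub_sketch(n::Integer, sn::Integer, k::Integer) =
    [sort(randperm(n)[1:sn]) for _ in 1:k]

loocv_sketch(4)                 # [[2, 3, 4], [1, 3, 4], [1, 2, 4], [1, 2, 3]]
length.(kfold_sketch(10, 3))    # [6, 7, 7]
```

With k-fold, every sample is held out exactly once, so each index appears in exactly k - 1 of the training sets.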
Cross Validation Function
The package also provides a function cross_validate to run a cross validation procedure.
- cross_validate(estfun, evalfun, n, gen)
Run a cross validation procedure.
Parameters:
- estfun – The estimation function, which takes a vector of training indices as input and returns a learned model, as:
model = estfun(train_inds)
- evalfun – The evaluation function, which takes a model and a vector of testing indices as input and returns a score that indicates the goodness of the model, as:
score = evalfun(model, test_inds)
- n – The total number of samples.
- gen – An iterable object that provides training indices, e.g., one of the cross validation schemes listed above.
Returns: a vector of scores obtained in the multiple runs.
Example:
# A simple example to demonstrate the use of cross validation
#
# Here, we consider a simple model: using a mean vector to represent
# a set of samples. The goodness of the model is assessed in terms
# of the RMSE (root-mean-square-error) evaluated on the testing set
#
using MLBase

# functions
compute_center(X::Matrix{Float64}) = vec(mean(X, 2))
compute_rmse(c::Vector{Float64}, X::Matrix{Float64}) = sqrt(mean(sum(abs2(X .- c),1)))

# data
const n = 200
const data = [2., 3.] .+ randn(2, n)

# cross validation
scores = cross_validate(
    inds -> compute_center(data[:, inds]),        # training function
    (c, inds) -> compute_rmse(c, data[:, inds]),  # evaluation function
    n,                                            # total number of samples
    Kfold(n, 5))                                  # cross validation plan: 5-fold

# get the mean and std of the scores
(m, s) = mean_and_std(scores)
Please refer to examples/crossval.jl for the entire script.
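The control flow of such a procedure can be sketched in a few lines of plain Julia 1.x. The function cross_validate_sketch is hypothetical; it assumes that the testing indices of each round are the complement of the training indices within 1:n, and it uses a hand-written plan in place of one of the scheme objects.

```julia
using Statistics: mean

# Hypothetical sketch of the procedure: train on each index set provided by
# `gen`, evaluate on the held-out samples, and collect the scores.
function cross_validate_sketch(estfun, evalfun, n::Integer, gen)
    scores = Float64[]
    for train_inds in gen
        test_inds = setdiff(1:n, train_inds)   # samples held out this round
        model = estfun(train_inds)
        push!(scores, evalfun(model, test_inds))
    end
    return scores
end

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
plan = [[3, 4, 5, 6], [1, 2, 5, 6], [1, 2, 3, 4]]     # hand-written 3-fold plan
scores = cross_validate_sketch(
    inds -> mean(data[inds]),                          # "model" = training mean
    (m, inds) -> sqrt(mean(abs2, data[inds] .- m)),    # RMSE on held-out samples
    length(data), plan)
# scores ≈ [sqrt(9.25), 0.5, sqrt(9.25)]
```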
Model Tuning
Many machine learning algorithms and models come with design parameters that need to be set in advance. A widely adopted practice is to search (usually through brute-force loops) for the parameters that yield the best performance over a validation set. The package provides functions to facilitate this.
- gridtune(estfun, evalfun, params...; ...)
Search for the best setting of parameters over a Cartesian grid (i.e., all combinations of parameters).
Parameters:
- estfun – The model estimation function that takes design parameters as input and produces the model.
- evalfun – The function that evaluates the model, producing a score value.
- params – A series of parameters, given in the form of (param_name, param_values).
Returns: a 3-tuple, as (best_model, best_cfg, best_score). Here, best_cfg is a tuple comprised of the parameters in the best setting (the one that yields the best score).
Keyword arguments:
ord: It may take either of Forward or Reverse:
- ord=Forward: higher score value indicates better model (default)
- ord=Reverse: lower score value indicates better model.
verbose: boolean, whether to show progress information. (default = false).
Note: For some learning algorithms, there may be constraints on the parameters (e.g., one parameter must be smaller than another). If a certain combination of parameters is not valid, estfun may return nothing, in which case the function ignores that particular setting.
Example:
using MLBase
using MultivariateStats

## prepare data
n_tr = 20   # number of training samples
n_te = 10   # number of testing samples
d = 5       # dimension of observations
theta = randn(d)
X_tr = randn(n_tr, d)
y_tr = X_tr * theta + 0.1 * randn(n_tr)
X_te = randn(n_te, d)
y_te = X_te * theta + 0.1 * randn(n_te)

## tune the model
function estfun(regcoef, bias)
    s = ridge(X_tr, y_tr, regcoef; bias=bias)
    return bias ? (s[1:end-1], s[end]) : (s, 0.0)
end

evalfun(m) = msd(X_te * m[1] .+ m[2], y_te)

r = gridtune(estfun, evalfun,
             ("regcoef", [1.0e-3, 1.0e-2, 1.0e-1, 1.0]),
             ("bias", (true, false));
             ord=Reverse,    # smaller msd value indicates better model
             verbose=true)   # show progress information
best_model, best_cfg, best_score = r

## print results
a, b = best_model
println("Best model:")
println("  a = $(a')")
println("  b = $b")
println("Best config: regcoef = $(best_cfg[1]), bias = $(best_cfg[2])")
println("Best score: $(best_score)")
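The search itself is an exhaustive loop over all parameter combinations, which can be sketched in plain Julia 1.x with Iterators.product. The function gridtune_sketch is hypothetical: it takes bare value collections instead of (param_name, param_values) pairs, and its lower_is_better keyword stands in for ord=Reverse. Settings for which estfun returns nothing are skipped, as described in the note above.

```julia
# Hypothetical sketch of an exhaustive grid search.
function gridtune_sketch(estfun, evalfun, param_values...; lower_is_better::Bool = false)
    best_model, best_cfg, best_score = nothing, nothing, nothing
    for cfg in Iterators.product(param_values...)   # every combination
        model = estfun(cfg...)
        model === nothing && continue               # invalid setting: skip
        score = evalfun(model)
        better = best_score === nothing ||
                 (lower_is_better ? score < best_score : score > best_score)
        if better
            best_model, best_cfg, best_score = model, cfg, score
        end
    end
    return best_model, best_cfg, best_score
end

# Toy problem: pick (a, b) minimising (a - 2)^2 + (b + 1)^2, requiring a != b.
est(a, b) = a == b ? nothing : (a, b)
ev(m) = (m[1] - 2)^2 + (m[2] + 1)^2
model, cfg, score = gridtune_sketch(est, ev, [0, 1, 2, 3], [-1, 0, 1];
                                    lower_is_better = true)
# cfg == (2, -1), score == 0
```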