Data Preprocessing Utilities

The package provides a variety of functions for data preprocessing.

Data Repetition

repeach(a, n)

Repeat each element of vector a n times. Here n can be either a scalar or a vector with the same length as a.

using MLBase

repeach(1:3, 2) # --> [1, 1, 2, 2, 3, 3]
repeach(1:3, [3,2,1]) # --> [1, 1, 1, 2, 2, 3]

repeachcol(a, n)

Repeat each column of matrix a n times. Here n can be either a scalar or a vector with length(n) == size(a,2).
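
To make the column semantics concrete, here is a minimal sketch that reproduces the behaviour described above using only Base Julia. The helper name repeat_each_col is hypothetical and is not part of MLBase; it only illustrates what "repeat each column n times" means for a scalar and for a vector n, and is not the package's implementation.

```julia
# Hypothetical helper (not part of MLBase): repeat each column of `a`,
# either a fixed number of times or a per-column number of times.
function repeat_each_col(a::AbstractMatrix, n)
    # Normalise a scalar count into one count per column.
    counts = n isa Integer ? fill(n, size(a, 2)) : n
    length(counts) == size(a, 2) ||
        throw(DimensionMismatch("need one count per column"))

    # Build the expanded list of column indices, e.g. [1, 1, 2, 2].
    idx = Int[]
    for j in 1:size(a, 2)
        append!(idx, fill(j, counts[j]))
    end
    return a[:, idx]
end

a = [1 2; 3 4]
repeat_each_col(a, 2)        # each column appears twice
repeat_each_col(a, [1, 3])   # first column once, second column three times
```

Indexing with an expanded index vector keeps the sketch short; it copies the data once when building the result.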

repeachrow(a, n)

Repeat each row of matrix a n times. Here n can be either a scalar or a vector with length(n) == size(a,1).
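
The row case can be sketched in the same spirit with Base Julia alone. The name repeat_each_row below is hypothetical (it is not an MLBase function); the scalar method relies on Base's repeat with the inner keyword, and the vector method expands row indices. This is an illustration of the documented semantics, not the package's own code.

```julia
# Hypothetical helpers (not part of MLBase).

# Scalar n: Base.repeat with `inner` repeats each row n times in place.
repeat_each_row(a::AbstractMatrix, n::Integer) = repeat(a, inner=(n, 1))

# Vector n: one repetition count per row.
function repeat_each_row(a::AbstractMatrix, n::AbstractVector{<:Integer})
    length(n) == size(a, 1) ||
        throw(DimensionMismatch("need one count per row"))
    # Expand row indices, e.g. counts [2, 1] give [1, 1, 2].
    idx = reduce(vcat, [fill(i, c) for (i, c) in enumerate(n)]; init=Int[])
    return a[idx, :]
end

a = [1 2; 3 4]
repeat_each_row(a, 2)        # rows: [1 2], [1 2], [3 4], [3 4]
repeat_each_row(a, [2, 1])   # rows: [1 2], [1 2], [3 4]
```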

Label Processing

In machine learning, we often need to first assign an integer label to each class. This package provides a type LabelMap that captures the association between discrete values (e.g., a finite set of strings) and integer labels.

Together with LabelMap, the package also provides a function labelmap to construct the map from a sequence of discrete values, and a function labelencode to map discrete values to integer labels.

julia> lm = labelmap(["a", "a", "b", "b", "c"])
LabelMap (with 3 labels):
[1] a
[2] b
[3] c

julia> labelencode(lm, "b")
2

julia> labelencode(lm, ["a", "c", "b"])
3-element Array{Int64,1}:
 1
 3
 2

Note that labelencode can be applied to either a single value or an array.
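
For intuition about what such a map stores, the following sketch builds a value-to-integer association with a plain Dict, numbering values in order of first appearance. The names build_label_index and encode_labels are hypothetical and only mimic the behaviour shown in the examples above; they make no claim about how LabelMap is implemented internally.

```julia
# Hypothetical helpers (not part of MLBase).

# Assign 1, 2, 3, ... to distinct values in order of first appearance.
function build_label_index(xs)
    index = Dict{eltype(xs),Int}()
    for x in xs
        # Only inserts when `x` has not been seen yet.
        get!(index, x, length(index) + 1)
    end
    return index
end

# Encode a single value or an array of values.
encode_labels(index::AbstractDict, x) = index[x]
encode_labels(index::AbstractDict, xs::AbstractArray) = [index[x] for x in xs]

index = build_label_index(["a", "a", "b", "b", "c"])
encode_labels(index, "b")
encode_labels(index, ["a", "c", "b"])
```

In this sketch, looking up a value that was never seen raises a KeyError, which surfaces unexpected classes early.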

The package also provides a function groupindices to group indices based on associated labels.

julia> groupindices(3, [1, 1, 1, 2, 2, 3, 2])
3-element Array{Array{Int64,1},1}:
 [1,2,3]
 [4,5,7]
 [6]

# using lm as constructed above
julia> groupindices(lm, ["a", "a", "c", "b", "b"])
3-element Array{Array{Int64,1},1}:
 [1,2]
 [4,5]
 [3]
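
The grouping step itself is simple enough to sketch in Base Julia. The function group_by_label below is a hypothetical stand-in, not MLBase's groupindices: it collects, for each of k integer labels, the positions at which that label occurs, and a second method first translates arbitrary values through a Dict of integer labels (an assumed substitute for a LabelMap).

```julia
# Hypothetical helpers (not part of MLBase).

# Group positions 1:length(labels) by their integer label in 1:k.
function group_by_label(k::Integer, labels::AbstractVector{<:Integer})
    groups = [Int[] for _ in 1:k]
    for (i, l) in enumerate(labels)
        1 <= l <= k || throw(ArgumentError("label $l is outside 1:$k"))
        push!(groups[l], i)
    end
    return groups
end

# Variant for arbitrary values: map them to integers first.
group_by_label(index::AbstractDict, xs::AbstractVector) =
    group_by_label(length(index), [index[x] for x in xs])

group_by_label(3, [1, 1, 1, 2, 2, 3, 2])

index = Dict("a" => 1, "b" => 2, "c" => 3)
group_by_label(index, ["a", "a", "c", "b", "b"])
```

This sketch rejects out-of-range labels with an error; that choice is an assumption made here for illustration.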