Package 'FSelectorRcpp'

Title: 'Rcpp' Implementation of 'FSelector' Entropy-Based Feature Selection Algorithms with Sparse Matrix Support
Description: 'Rcpp' (free of 'Java'/'Weka') implementation of 'FSelector' entropy-based feature selection algorithms based on an MDL discretization (Fayyad U. M., Irani K. B.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In 13th International Joint Conference on Artificial Intelligence (IJCAI93), pages 1022-1029, Chambery, France, 1993.) <https://www.ijcai.org/Proceedings/93-2/Papers/022.pdf> with sparse matrix support.
Authors: Zygmunt Zawadzki [aut, cre], Marcin Kosinski [aut], Krzysztof Slomczynski [ctb], Damian Skrzypiec [ctb], Patrick Schratz [ctb]
Maintainer: Zygmunt Zawadzki
License: GPL-2
Version: 0.3.13
Built: 2024-11-17 06:15:27 UTC
Source: https://github.com/mi2-warsaw/fselectorrcpp

Direct Interface to Information Gain.

Description

Direct Interface to Information Gain.

Usage

.information_gain(
  x,
  y,
  type = c("infogain", "gainratio", "symuncert"),
  equal = FALSE,
  discIntegers = TRUE,
  nbins = 5,
  threads = 1
)

Arguments

x

A data.frame, sparse matrix or formula with attributes.

y

A vector with the response variable, or a data.frame if x is a formula.

type

Method name.

equal

A logical. Whether to discretize the dependent variable using equal-frequency binning.

discIntegers

A logical. If TRUE (default), integers are treated as numeric vectors and are discretized. If FALSE, integers are treated as factors and are left as is.

nbins

Number of bins used for discretization. Only used if 'equal = TRUE' and the response is numeric.

threads

Defunct. Number of threads for the parallel backend; now turned off for safety reasons.

Details

In principle, using information_gain is safer.

Value

data.frame with the following columns:

  • attributes - variables names.

  • importance - worth of the attributes.


Select Attributes by Score Depending on the Cutoff

Description

Select attributes by their score/rank/weights, depending on the cutoff that may be specified by the percentage of the highest ranked attributes or by the number of the highest ranked attributes.

Usage

cut_attrs(attrs, k = 0.5)

Arguments

attrs

A data.frame with attributes' importance.

k

A numeric. For k >= 1, floor(k) is taken as the number of highest-ranked attributes to select (the k best attributes). For 0 < k < 1, k is the fraction of top attributes to select (the best k * 100% of attributes).
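
The cutoff rule for k can be sketched in base R. This is only an illustration, not the package's implementation: the helpers n_selected and top_attrs and the scores are made up, and rounding the fraction up with ceiling() as well as capping at the number of attributes are assumptions.

```r
# Hypothetical helper: how many attributes the documented rule would keep.
n_selected <- function(n_attrs, k) {
  stopifnot(k > 0)
  if (k >= 1) {
    min(floor(k), n_attrs)   # k >= 1: a count, after floor(); cap is an assumption
  } else {
    ceiling(k * n_attrs)     # 0 < k < 1: a fraction; rounding up is an assumption
  }
}

# Hypothetical helper: names of the top-ranked attributes for a named score vector.
top_attrs <- function(scores, k) {
  ranked <- names(sort(scores, decreasing = TRUE))
  head(ranked, n_selected(length(scores), k))
}

# Made-up importance scores, for illustration only.
scores <- c(a = 0.10, b = 0.30, c = 0.80, d = 0.90)

half  <- top_attrs(scores, 0.5)  # best 50% of four attributes: "d" "c"
two   <- top_attrs(scores, 2.9)  # floor(2.9) = 2 attributes:   "d" "c"
first <- top_attrs(scores, 1)    # single best attribute:       "d"
```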

Author(s)

Damian Skrzypiec and Zygmunt Zawadzki

Examples

x <- information_gain(Species ~ ., iris)
cut_attrs(attrs = x)
to_formula(cut_attrs(attrs = x), "Species")
cut_attrs(attrs = x, k = 1)

Discretization

Description

Discretize a range of numeric attributes in the dataset into nominal attributes. The Minimum Description Length (MDL) method is the default control. An equalsizeControl method is also available.
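
For orientation, the MDL stopping rule from the Fayyad and Irani paper cited in this manual accepts a candidate cut point only if its information gain exceeds a description-length threshold. The base-R sketch below checks that rule for one given cut point. The helper names entropy_bits and mdl_accepts_cut are hypothetical, and the package's compiled code may differ in details such as how candidate cuts are enumerated.

```r
# Shannon entropy in bits of a class vector (empty classes are ignored).
entropy_bits <- function(y) {
  p <- as.vector(table(y)) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# TRUE if splitting x at `cut` passes the Fayyad-Irani MDL criterion.
mdl_accepts_cut <- function(x, y, cut) {
  left  <- y[x <= cut]
  right <- y[x > cut]
  n <- length(y)

  ent  <- entropy_bits(y)
  ent1 <- entropy_bits(left)
  ent2 <- entropy_bits(right)
  gain <- ent - length(left) / n * ent1 - length(right) / n * ent2

  # Number of distinct classes overall and on each side of the cut.
  k  <- length(unique(y))
  k1 <- length(unique(left))
  k2 <- length(unique(right))

  delta <- log2(3^k - 2) - (k * ent - k1 * ent1 - k2 * ent2)
  gain > (log2(n - 1) + delta) / n
}

# A cut at 2.45 separates setosa from the other two species.
accepted <- mdl_accepts_cut(iris$Petal.Length, iris$Species, 2.45)

# Alternating classes carry no information about x, so the cut is rejected.
rejected <- mdl_accepts_cut(1:8, rep(c("a", "b"), 4), 4.5)
```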

Usage

discretize(
  x,
  y,
  control = list(mdlControl(), equalsizeControl()),
  all = TRUE,
  discIntegers = TRUE,
  call = NULL
)

mdlControl()

equalsizeControl(k = 10)

customBreaksControl(breaks)

Arguments

x

Explanatory continuous variables to be discretized or a formula.

y

Dependent variable for supervised discretization or a data.frame when x is a formula.

control

A discretizationControl object containing the parameters for the discretization algorithm. Possible inputs so far are mdlControl, equalsizeControl and customBreaksControl. If passed as a list, the first element is used.

all

Logical indicating if a returned data.frame should contain other features that were not discretized. (Example: should Sepal.Width be returned, when you pass iris and discretize Sepal.Length, Petal.Length, Petal.Width.)

discIntegers

A logical. If TRUE (default), integers are treated as numeric vectors and are discretized. If FALSE, integers are treated as factors and are left as is.

call

Keep as NULL. Inner method parameter for consistency.

k

Number of partitions.

breaks

Custom breaks used for partitioning.
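
The two unsupervised controls can be pictured with base R alone. The sketch below is an illustration under assumptions, not the package's code: it reads "equal size" as k partitions with the same number of observations, breaks ties by order of appearance (so equal values may land in different partitions), and uses the default right-closed intervals of cut() for the custom breaks.

```r
x <- iris$Sepal.Length

# Equal-size partitions: rank the values and split the ranks into k groups.
k <- 3
equal_size <- ceiling(rank(x, ties.method = "first") * k / length(x))
equal_size_counts <- table(equal_size)   # 50 observations in each of the 3 groups

# Custom breaks: the same break points as in the customBreaksControl example.
custom <- cut(x, breaks = c(0, 2, 5, 7.5, 10))
custom_counts <- table(custom)           # one count per interval
```

The first interval, (0,2], stays empty because no iris sepal is that short.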

Author(s)

Zygmunt Zawadzki

References

U. M. Fayyad and K. B. Irani. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In 13th International Joint Conference on Artificial Intelligence (IJCAI93), pages 1022-1029, 1993.

Examples

# vectors
discretize(x = iris[[1]], y = iris[[5]])

# list and vector
head(discretize(x = list(iris[[1]], iris$Sepal.Width), y = iris$Species))

# formula input
head(discretize(x = Species ~ ., y = iris))
head(discretize(Species ~ ., iris))

# use different methods for specific columns
ir1 <- discretize(Species ~ Sepal.Length, iris)
ir2 <- discretize(Species ~ Sepal.Width, ir1, control = equalsizeControl(3))
ir3 <- discretize(Species ~ Petal.Length, ir2, control = equalsizeControl(5))
head(ir3)

# custom breaks
ir <- discretize(Species ~ Sepal.Length, iris,
  control = customBreaksControl(breaks = c(0, 2, 5, 7.5, 10)))
head(ir)

## Not run: 
# Same results
library(RWeka)
Rweka_disc_out <- RWeka::Discretize(Species ~ Sepal.Length, iris)[, 1]
FSelectorRcpp_disc_out <- FSelectorRcpp::discretize(Species ~ Sepal.Length,
                                                    iris)[, 1]
table(Rweka_disc_out, FSelectorRcpp_disc_out)
# But faster method
library(microbenchmark)
microbenchmark(FSelectorRcpp::discretize(Species ~ Sepal.Length, iris),
               RWeka::Discretize(Species ~ Sepal.Length, iris))


## End(Not run)

Transform a data.frame using split points returned by discretize function.

Description

Transform a data.frame using split points returned by discretize function.

Usage

discretize_transform(disc, data, dropColumns = NA)

extract_discretize_transformer(disc)

Arguments

disc

a result of the discretize function.

data

a data.frame to transform using cutpoints from disc.

dropColumns

determine

Value

A new data.frame with discretized columns using cutpoints from the result of discretize function.
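
The idea of reusing cut points on new data can be sketched in base R. This is a hypothetical illustration rather than the package's method: the cut points here are plain terciles of the training column (not MDL cuts), the odd/even row split is arbitrary, and the outer breaks are opened to -Inf and Inf so that unseen values never fall outside the intervals.

```r
# Deterministic split into "training" and "new" rows (illustrative choice).
train <- iris[seq(1, nrow(iris), by = 2), ]
new   <- iris[seq(2, nrow(iris), by = 2), ]

# Learn cut points on the training data only.
cuts   <- quantile(train$Petal.Length, probs = c(1 / 3, 2 / 3), names = FALSE)
breaks <- c(-Inf, cuts, Inf)

# Apply the same breaks to both sets, so the factor levels line up.
train_disc <- cut(train$Petal.Length, breaks = breaks)
new_disc   <- cut(new$Petal.Length, breaks = breaks)

same_levels <- identical(levels(train_disc), levels(new_disc))
```

Because both factors come from the same break vector, a model fitted on the discretized training column sees identical levels at prediction time.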

Examples

set.seed(123)
idx <- sort(sample.int(150, 100))
iris1 <- iris[idx, ]
iris2 <- iris[-idx, ]
disc <- discretize(Species ~ ., iris)
head(discretize_transform(disc, iris2))

# Chain discretization:
ir1 <- discretize(Species ~ Sepal.Length, iris1)
ir2 <- discretize(Species ~ Sepal.Width, ir1, control = equalsizeControl(3))
ir3 <- discretize(Species ~ Petal.Length, ir2, control = equalsizeControl(5))

## note that Petal.Width is untouched:
head(discretize_transform(ir3, iris2))

## extract_discretize_transformer
discObj <- extract_discretize_transformer(ir3)
head(discretize_transform(discObj, iris2))

Entropy-based Filters

Description

Algorithms that rank the importance of discrete attributes, based on their entropy with a continuous class attribute. This function is a reimplementation of FSelector's information.gain, gain.ratio and symmetrical.uncertainty.

Usage

information_gain(
  formula,
  data,
  x,
  y,
  type = c("infogain", "gainratio", "symuncert"),
  equal = FALSE,
  discIntegers = TRUE,
  nbins = 5,
  threads = 1
)

Arguments

formula

An object of class formula with model description.

data

A data.frame accompanying formula.

x

A data.frame or sparse matrix with attributes.

y

A vector with response variable.

type

Method name.

equal

A logical. Whether to discretize the dependent variable using equal-frequency binning.

discIntegers

A logical. If TRUE (default), integers are treated as numeric vectors and are discretized. If FALSE, integers are treated as factors and are left as is.

nbins

Number of bins used for discretization. Only used if 'equal = TRUE' and the response is numeric.

threads

Defunct. Number of threads for the parallel backend; now turned off for safety reasons.

Details

type = "infogain" is

H(Class) + H(Attribute) - H(Class, Attribute)

type = "gainratio" is

\frac{H(Class) + H(Attribute) - H(Class, Attribute)}{H(Attribute)}

type = "symuncert" is

2\frac{H(Class) + H(Attribute) - H(Class, Attribute)}{H(Attribute) + H(Class)}

where H(X) is the Shannon entropy of a variable X and H(X, Y) is the joint Shannon entropy of the variables X and Y.
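
The three measures can be reproduced for already-discrete vectors with a few lines of base R. This is a sketch for checking the formulas by hand, not the package's implementation: the helpers entropy and importance are hypothetical, and the natural logarithm is an assumption (the two ratio measures do not depend on the base).

```r
# Shannon entropy of one discrete vector, or joint entropy of several.
entropy <- function(...) {
  counts <- table(...)
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Importance of a discrete attribute for a discrete class.
importance <- function(attribute, class,
                       type = c("infogain", "gainratio", "symuncert")) {
  type <- match.arg(type)
  h_class <- entropy(class)
  h_attr  <- entropy(attribute)
  gain    <- h_class + h_attr - entropy(class, attribute)
  switch(type,
    infogain  = gain,
    gainratio = gain / h_attr,
    symuncert = 2 * gain / (h_attr + h_class)
  )
}

class_var   <- c("a", "a", "b", "b")
informative <- c("u", "u", "v", "v")  # determines the class exactly
useless     <- c("u", "v", "u", "v")  # independent of the class

ig_full <- importance(informative, class_var, "infogain")   # equals H(Class) = log(2)
gr_full <- importance(informative, class_var, "gainratio")  # 1
su_full <- importance(informative, class_var, "symuncert")  # 1
ig_none <- importance(useless, class_var, "infogain")       # 0, up to rounding
```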

Value

data.frame with the following columns:

  • attributes - variables names.

  • importance - worth of the attributes.

Author(s)

Zygmunt Zawadzki

Examples

irisX <- iris[-5]
y <- iris$Species

## data.frame interface
information_gain(x = irisX, y = y)

# formula interface
information_gain(formula = Species ~ ., data = iris)
information_gain(formula = Species ~ ., data = iris, type = "gainratio")
information_gain(formula = Species ~ ., data = iris, type = "symuncert")

# sparse matrix interface
library(Matrix)
i <- c(1, 3:8); j <- c(2, 9, 6:10); x <- 7 * (1:7)
x <- sparseMatrix(i, j, x = x)
y <- c(1, 1, 1, 1, 2, 2, 2, 2)

information_gain(x = x, y = y)
information_gain(x = x, y = y, type = "gainratio")
information_gain(x = x, y = y, type = "symuncert")

RReliefF filter

Description

The algorithm finds weights of continuous and discrete attributes based on the distance between instances.

Usage

relief(formula, data, x, y, neighboursCount = 5, sampleSize = 10)

Arguments

formula

An object of class formula with model description.

data

A data.frame accompanying formula.

x

A data.frame with attributes.

y

A vector with response variable.

neighboursCount

Number of neighbours to find for every sampled instance.

sampleSize

Number of instances to sample.

Details

The function and its manual page are taken directly from FSelector: Piotr Romanski and Lars Kotthoff (2018). FSelector: Selecting Attributes. R package version 0.31. https://CRAN.R-project.org/package=FSelector
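
To give a feel for distance-based weighting, here is a much simplified base-R sketch of the original two-class Relief idea. It is not the RReliefF variant that relief implements: it uses every instance instead of a sample, a single nearest hit and nearest miss instead of neighboursCount neighbours, numeric attributes only, and Manhattan distance on range-scaled columns. The function name relief_sketch and the toy data are hypothetical.

```r
relief_sketch <- function(x, y) {
  x <- as.matrix(x)
  # Scale every column to [0, 1] so per-attribute differences are comparable.
  mins <- apply(x, 2, min)
  rng  <- apply(x, 2, function(col) diff(range(col)))
  xs   <- sweep(sweep(x, 2, mins, "-"), 2, rng, "/")

  n <- nrow(xs)
  w <- setNames(numeric(ncol(xs)), colnames(xs))
  for (i in seq_len(n)) {
    d <- rowSums(abs(sweep(xs, 2, xs[i, ], "-")))  # Manhattan distance to row i
    d[i] <- Inf                                    # never pick the instance itself
    same <- y == y[i]
    hit  <- which.min(ifelse(same, d, Inf))        # nearest instance of the same class
    miss <- which.min(ifelse(!same, d, Inf))       # nearest instance of another class
    # Reward attributes that differ on the miss, penalise those that differ on the hit.
    w <- w + (abs(xs[i, ] - xs[miss, ]) - abs(xs[i, ] - xs[hit, ])) / n
  }
  w
}

# Toy data: x1 separates the classes, x2 does not.
toy <- data.frame(
  x1 = c(0, 0.1, 0.2, 1, 1.1, 1.2),
  x2 = c(0, 1, 0.5, 0.5, 0, 1)
)
toy_class <- c("a", "a", "a", "b", "b", "b")

w <- relief_sketch(toy, toy_class)  # x1 gets a positive weight, x2 a negative one
```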

Value

A data.frame containing the worth of attributes in the first column and their names as row names.

References

Igor Kononenko: Estimating Attributes: Analysis and Extensions of RELIEF. In: European Conference on Machine Learning, 171-182, 1994.

Marko Robnik-Sikonja, Igor Kononenko: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997.

Examples

data(iris)

weights <- relief(Species~., iris, neighboursCount = 5, sampleSize = 20)
print(weights)
subset <- cut_attrs(weights, 2)
f <- to_formula(subset, "Species")
print(f)

Create a formula Object

Description

Utility function to create a formula object. Note that it may be very useful when you use pipes.

Usage

to_formula(attrs, class)

Arguments

attrs

Character vector with names of independent variables.

class

Single string with a dependent variable's name.
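
For comparison, the same kind of object can be built with base R's reformulate(). This sketch only shows what such a formula looks like; it makes no claim about how to_formula is implemented internally.

```r
attrs <- c("Petal.Width", "Petal.Length")

# Build `Species ~ Petal.Width + Petal.Length` from character inputs.
f <- reformulate(termlabels = attrs, response = "Species")

# The result is an ordinary formula and can be passed to modelling functions.
vars <- all.vars(f)
```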

Examples

# evaluator from FSelector package
evaluator <- function(subset, data, dependent = names(iris)[5]) {
  library(rpart)
  k <- 5
  splits <- runif(nrow(data))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- data[test.idx, , drop = FALSE]
    train <- data[train.idx, , drop = FALSE]
    tree <- rpart(to_formula(subset, dependent), train)
    error.rate <- sum(test[[dependent]] != predict(tree, test, type = "c")) /
      nrow(test)
    return(1 - error.rate)
  })
  return(mean(results))
}

set.seed(123)
fit <- feature_search(attributes = names(iris)[-5], fun = evaluator, data = iris,
                mode = "exhaustive", parallel = FALSE)
fit$best
names(fit$best)[fit$best == 1]
# with to_formula
to_formula(names(fit$best)[fit$best == 1], "Species")