Motivation, Installation and Quick Workflow

FSelectorRcpp is an Rcpp (free of Java/Weka) implementation of FSelector entropy-based feature selection algorithms with a sparse matrix support. It is also equipped with a parallel backend.

Installation

install.packages('FSelectorRcpp') # stable release version on CRAN
devtools::install_github('mi2-warsaw/FSelectorRcpp') # dev version
# windows users should have Rtools for devtools installation
# https://cran.r-project.org/bin/windows/Rtools/

Motivation

In the modern statistical learning the biggest bottlenecks are computation times of model training procedures and the overfitting. Both are caused by the same issue - the high dimension of explanatory variables space. Researchers have encountered problems with too big sets of features used in machine learning algorithms also in terms of model interpretaion. This motivates applying feature selection algorithms before performing statisical modeling, so that on smaller set of attributes the training time will be shorter, the interpretation might be clearer and the noise from non important features can be avoided. More motivation can be found in (John, Kohavi, and Pfleger 1994).

Many methods were developed to reduce the curse of dimensionality like Principal Component Analysis (Pearson 1901) or Singular Value Decomposition (Eckart and Young 1936) which approximates the variables by smaller number of combinations of original variables, but this approach is hard to interpret in the final model.

Sophisticated methods of attribute selection as Boruta algoritm (Kursa and Rudnicki 2010), genetic algorithms Aziz et al. (2013) or simulated annealing techniques (Khachaturyan, Semenovsovskaya, and Vainshtein 1981) are known and broadly used but in some cases for those algorithms computations can take even days, not to mention that datasets are growing every hour.

Few classification and regression models can reduce redundand variables during the training phase of statistical learning process, e.g. Decision Trees Breiman et al. (1984), LASSO Regularized Generalized Linear Models (with cross-validation) (Friedman, Hastie, and Tibshirani 2010) or Regularized Support Vector Machine (Xu, Caramanis, and Mannor 2009), but still computations starting with full set of explanatory variables are time consuming and the understaning of the feature selection procedure in this case is not simple and those methods are sometimes used without the understanding.

In business applications there appear a need to provide a fast feature selection that is extremely easy to understand. For such demands easy methods are prefered. This motivates using simple techniques like Entropy Based Feature Selection (Largeron, Moulin, and Géry 2011), where every feature can be checked independently so that computations can be performed in a parallel to shorter the procedure’s time. For this approach we provide an R interface to Rcpp reimplementation (Dirk Eddelbuettel 2011) of methods included in FSelector package which we also extended with parallel background and sparse matrix support. This has significant impact on computations time and can be used on greater datasets, comparing to FSelector. Additionally we avoided the Weka (Hall et al. 2009) dependency and we provided faster discretization implementations than those from entropy package, used originally in FSelector.

Quick Workflow

library(magrittr)
library(FSelectorRcpp)

A simple entropy based feature selection workflow. Information gain is an easy, linear algorithm that computes the entropy of a dependent and explanatory variables, and the conditional entropy of a dependent variable with a respect to each explanatory variable separately. This simple statistic enables to calculate the belief of the distribution of a dependent variable when we only know the distribution of a explanatory variable.

information_gain(               # Calculate the score for each attribute
    formula = Species ~ .,      # that is on the right side of the formula.
    data = iris,                # Attributes must exist in the passed data.
    type  = "infogain"          # Choose the type of a score to be calculated.
  ) %>%                          
  cut_attrs(                    # Then take attributes with the highest rank.
    k = 2                       # For example: 2 attrs with the higehst rank.
  ) %>%                         
  to_formula(                   # Create a new formula object with 
    attrs = .,                  # the most influencial attrs.
    class = "Species"           
  ) %>%
  glm(
    formula = .,                # Use that formula in any classification algorithm.
    data = iris,                
    family = "binomial"         
  )

Call:  glm(formula = ., family = "binomial", data = iris)

Coefficients:
 (Intercept)   Petal.Width  Petal.Length  
      -69.45         33.89         17.60  

Degrees of Freedom: 149 Total (i.e. Null);  147 Residual
Null Deviance:      191 
Residual Deviance: 5.17e-09     AIC: 6

Apply a complicated feature selection search engine, that checks combinations of subsets of the specified attributes’ set, to score the currently considered subset, depending on the criterion passed with the fun parameter.

evaluator_R2_lm <-    # Create a scorer function.
  function(
    attributes,       # That takes the currently considered subset of attributes
    data,             # from a specified dataset.
    dependent = 
      names(data)[1]  # To find features that best describe the dependent variable.
  ) {
    summary(          # In this situation we take the r.squared statistic
      lm(             # from the summary of a linear model object.
        to_formula(   # This is the score to use to choose between considered 
          attributes, # subsets of attributes.
          dependent
        ),
        data = data)
    )$r.squared
  }

feature_search(          # feature_search work in 2 modes - 'exhaustive' and 'greedy'
  attributes = 
    names(iris)[-1],     # It takes attribues and creates combinations of it's subsets.
  fun = evaluator_R2_lm, # And it calculates the score of a subset that depends on the 
  data = iris,           # evaluator function passed in the `fun` parameter.
  mode = "exhaustive",   # exhaustive - means to check all possible 
  sizes =                # attributes' subset combinations 
    1:length(attributes) # of sizes passed in sizes.
)$all
  Sepal.Width Petal.Length Petal.Width Species     values
1           1            0           0       0 0.01382265
2           0            1           0       0  0.7599546
3           0            0           1       0  0.6690277
4           0            0           0       1  0.6187057

References

Aziz, Amira Sayed A., Ahmad Taher Azar, Mostafa A. Salama, Aboul Ella Hassanien, and Sanaa El Ola Hanfy. 2013. “Genetic Algorithms with Different Feature Selection Techniques for Anomaly Detectors Generation.” In Proceedings of the 2013 Federated Conference on Computer Science and Information Systems, edited by M. Paprzycki M. Ganzha L. Maciaszek, pages 769–774. IEEE.
Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth; Brooks.
Dirk Eddelbuettel, Romain Franccois. 2011. Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. https://www.jstatsoft.org/v40/i08/.
Eckart, C., and G. Young. 1936. “The Approximation of One Matrix by Another of Lower Rank.” Psychometrika 1 (3): 211–18. https://doi.org/10.1007/BF02288367.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://www.jstatsoft.org/v33/i01/.
Hall, Mark, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. “The WEKA Data Mining Software: An Update.” SIGKDD Explor. Newsl. 11 (1): 10–18. https://doi.org/10.1145/1656274.1656278.
John, George H., Ron Kohavi, and Karl Pfleger. 1994. “Irrelevant Features and the Subset Selection Problem.” In MACHINE LEARNING: PROCEEDINGS OF THE ELEVENTH INTERNATIONAL, 121–29. Morgan Kaufmann.
Khachaturyan, A., S. Semenovsovskaya, and B. Vainshtein. 1981. The thermodynamic approach to the structure analysis of crystals.” Acta Crystallographica Section A 37 (5): 742–54.
Kuhn, Max, and Kjell Johnson. 2013. “Applied Predictive Modeling.” New York, NY: Springer. 2013.
Kursa, Miron B., and Witold R. Rudnicki. 2010. “Feature Selection with the Boruta Package.” Journal of Statistical Software 36 (11): 1–13. https://www.jstatsoft.org/v36/i11/.
Largeron, Christine, Christophe Moulin, and Mathias Géry. 2011. “Entropy Based Feature Selection for Text Categorization.” In Proceedings of the 2011 ACM Symposium on Applied Computing, 924–28. SAC ’11. New York, NY, USA: ACM. https://doi.org/10.1145/1982185.1982389.
Pearson, Karl. 1901. “LIII. On Lines and Planes of Closest Fit to Systems of Points in Space.” Philosophical Magazine Series 6 2 (11): 559–72.
Rokach, Lior, and Oded Maimon. 2008. Data Mining with Decision Trees: Theroy and Applications. River Edge, NJ, USA: World Scientific Publishing Co., Inc.
Xu, Huan, Constantine Caramanis, and Shie Mannor. 2009. “Robustness and Regularization of Support Vector Machines.” J. Mach. Learn. Res. 10 (December): 1485–1510. https://dl.acm.org/doi/10.5555/1577069.1755834.