The information_gain and discretize functions are the two most important
functions in the FSelectorRcpp package. Usually, they are easy to use, and
the user does not need to worry about anything. However, there is one case
where user intervention may be required.
By default, information_gain and discretize discretize numeric and integer
columns and leave the other ones as is. In the example below, the x column
is discretized:
library(FSelectorRcpp)
data <- data.frame(
y = factor(c("A", "A", "A", "B", "B", "A")),
x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
z = c("x", "x", "y", "y", "y", "x"),
stringsAsFactors = FALSE)
discretize(data, y ~ .)
#> y x z
#> 1 A (-Inf,5.7] x
#> 2 A (-Inf,5.7] x
#> 3 A (-Inf,5.7] y
#> 4 B (5.7, Inf] y
#> 5 B (5.7, Inf] y
#> 6 A (-Inf,5.7] x
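To check in advance which columns fall under this default rule, it can help to look at the column classes. The snippet below is a minimal base R sketch on the same data. The helper variable col_classes is only illustrative and is not part of FSelectorRcpp.

```r
data <- data.frame(
  y = factor(c("A", "A", "A", "B", "B", "A")),
  x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
  z = c("x", "x", "y", "y", "y", "x"),
  stringsAsFactors = FALSE)

# First class of every column (illustrative helper, not a package function)
col_classes <- vapply(data, function(col) class(col)[1], character(1))
col_classes

# Numeric or integer columns, i.e. the ones discretized by default.
# is.numeric() returns FALSE for factors, so y is not selected.
names(data)[vapply(data, is.numeric, logical(1))]
#> [1] "x"
```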
So far, so good. However, there is a problem when a column is of type
integer. Integers might be used to encode the levels of a factor, e.g., the
number of a day in a week (0-6, 1-7), an id from some mapping (e.g.,
1 - Poland, 2 - USA, 3 - Germany), and probably many others (gender and so
on), possibly any variable that is going to be one-hot encoded in the final
model. In such cases, discretizing those values doesn't make any sense,
because they're already discrete. On the other hand, some variables might
be stored as integers and still be reasonable candidates for
discretization, e.g., age, income, height, weight, and so on.
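One way to make the categorical meaning of such codes explicit is to store them as a factor, since only numeric and integer columns are discretized by default. The snippet below is a small base R sketch. The day vector is hypothetical data invented for illustration.

```r
# Hypothetical integer codes for a day of the week (1-7)
day <- as.integer(c(1, 3, 7, 3, 5, 1))

# A factor states explicitly that the values are category labels
day_f <- factor(day, levels = 1:7)

is.integer(day)
#> [1] TRUE
is.factor(day_f)
#> [1] TRUE
nlevels(day_f)
#> [1] 7
```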
It would be tough for FSelectorRcpp to guess whether an integer-encoded
variable should be discretized, because that depends heavily on the
context. So we (the authors of FSelectorRcpp) decided that, by default,
integer columns are treated like numerics and discretized. However, the
user can control this behavior with the discIntegers parameter.
The example below shows how the discretize function behaves when integer
columns are present:
library(FSelectorRcpp)
data <- data.frame(
y = factor(c("A", "A", "A", "B", "B", "A")),
x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
z = c("x", "x", "y", "y", "y", "x"),
int = as.integer(c(1, 1, 1, 2, 2, 1)),
uniqueInt = as.integer(c(10, 20, 11, 22, 25, 11)),
stringsAsFactors = FALSE)
# default (integers are discretized)
discretize(data, y ~ .)
#> y x z int uniqueInt
#> 1 A (-Inf,5.7] x (-Inf,1.5] (-Inf,21]
#> 2 A (-Inf,5.7] x (-Inf,1.5] (-Inf,21]
#> 3 A (-Inf,5.7] y (-Inf,1.5] (-Inf,21]
#> 4 B (5.7, Inf] y (1.5, Inf] (21, Inf]
#> 5 B (5.7, Inf] y (1.5, Inf] (21, Inf]
#> 6 A (-Inf,5.7] x (-Inf,1.5] (-Inf,21]
# discIntegers is set to FALSE - integers are left as is.
discretize(data, y ~ ., discIntegers = FALSE)
#> y x z int uniqueInt
#> 1 A (-Inf,5.7] x 1 10
#> 2 A (-Inf,5.7] x 1 20
#> 3 A (-Inf,5.7] y 1 11
#> 4 B (5.7, Inf] y 2 22
#> 5 B (5.7, Inf] y 2 25
#> 6 A (-Inf,5.7] x 1 11
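The same switch can be passed to information_gain. The sketch below assumes that information_gain accepts discIntegers in the same way as discretize. The z column is left out to keep the example small, and no output is shown because the scores depend on the installed package version.

```r
library(FSelectorRcpp)

data <- data.frame(
  y = factor(c("A", "A", "A", "B", "B", "A")),
  x = c(1.2, 2.1, 4.1, 7.3, 8.2, 3.2),
  int = as.integer(c(1, 1, 1, 2, 2, 1)),
  uniqueInt = as.integer(c(10, 20, 11, 22, 25, 11)))

# default: integer columns are discretized before scoring
ig_default <- information_gain(y ~ ., data)

# integers are treated as already discrete values
ig_as_is <- information_gain(y ~ ., data, discIntegers = FALSE)

ig_default
ig_as_is
```

Comparing the two results shows how much the score of a column such as uniqueInt depends on whether it is binned first.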
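You might have both types of integer columns in your data. In that case
discIntegers alone is not enough, because it applies to all integer columns
at once. You need to manually convert to numeric the columns which should
be discretized, and then set discIntegers to FALSE so that the remaining
integer columns are left as is.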
The code below shows a simple approach to convert the columns to numeric if they contain a lot of distinct values:
# TRUE for integer vectors where the share of distinct values exceeds the threshold
can_discretize <- function(x, threshold = 0.9) {
  is.integer(x) &&
    (length(unique(x)) / length(x) > threshold)
}
can_discretize(1:10)
#> [1] TRUE
can_discretize(rnorm(10))
#> [1] FALSE
can_discretize(as.integer(c(rep(1,10), rep(2, 10))))
#> [1] FALSE
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(123)
dt <- tibble(
x = 1:20,
y = rnorm(20),
z = as.integer(c(rep(1,10), rep(2, 10)))
)
glimpse(dt)
#> Rows: 20
#> Columns: 3
#> $ x <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
#> $ y <dbl> -0.56047565, -0.23017749, 1.55870831, 0.07050839, 0.12928774, 1.71506499, 0.46091621, -1.26506123, -0.68685285, -0.44566197, 1.22408180, 0…
#> $ z <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
glimpse(dt %>% mutate_if(can_discretize, as.numeric))
#> Rows: 20
#> Columns: 3
#> $ x <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
#> $ y <dbl> -0.56047565, -0.23017749, 1.55870831, 0.07050839, 0.12928774, 1.71506499, 0.46091621, -1.26506123, -0.68685285, -0.44566197, 1.22408180, 0…
#> $ z <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
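To complete the picture, the converted data can be passed to discretize with discIntegers = FALSE, so that only the converted columns are binned. The snippet below is a sketch of that workflow in base R on the small data set from the earlier example. The threshold of 0.8 is an assumption chosen only because uniqueInt has 5 distinct values in 6 rows, and to_convert is an illustrative helper name.

```r
library(FSelectorRcpp)

can_discretize <- function(x, threshold = 0.9) {
  is.integer(x) &&
    (length(unique(x)) / length(x) > threshold)
}

data <- data.frame(
  y = factor(c("A", "A", "A", "B", "B", "A")),
  int = as.integer(c(1, 1, 1, 2, 2, 1)),
  uniqueInt = as.integer(c(10, 20, 11, 22, 25, 11)))

# Flag integer columns with many distinct values (threshold is an assumption)
to_convert <- vapply(data, can_discretize, logical(1), threshold = 0.8)

# Convert only the flagged columns to numeric
data[to_convert] <- lapply(data[to_convert], as.numeric)

# uniqueInt is now numeric, int is still integer
vapply(data, function(col) class(col)[1], character(1))

# Numeric columns are discretized, integer columns are left as is
discretize(data, y ~ ., discIntegers = FALSE)
```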