Factors often cause confusion for new (and not-so-new) data analysts. In particular, the print() method for factors displays the level names. So, factors with textual level names (e.g., {“male”, “female”}) look much like character vectors when printed, and factors with numeric level names (e.g., {“0”, “1”}) look like numeric vectors. However, factors are not character vectors, numeric vectors, or any other type of plain atomic vector: they have their own class, factor, which tells R functions to treat factors as encoded categorical variables.
f1 <- factor(sample(0:1, 10, TRUE))
f2 <- factor(sample(c("foo", "bar"), 10, TRUE))
f1 # looks a lot like an integer vector
[1] 1 1 1 1 0 0 0 1 1 0
Levels: 0 1
f2 # looks like a character vector
[1] foo foo foo bar foo foo foo foo foo foo
Levels: bar foo
# Not numeric or character
is.numeric(f1)
[1] FALSE
is.integer(f1)
[1] FALSE
is.character(f1)
[1] FALSE
is.numeric(f2)
[1] FALSE
is.integer(f2)
[1] FALSE
is.character(f2)
[1] FALSE
# Not any type of vector, at all, actually
is.vector(f1)
[1] FALSE
is.vector(f2)
[1] FALSE
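To see why these checks fail, it can help to look at how a factor is stored. The sketch below uses a small, fixed factor named f (a hypothetical example, not one of the randomly generated factors above) so that the results are reproducible. It suggests that a factor is a set of integer codes carrying a levels attribute and a class attribute; is.vector() returns FALSE for any object that has attributes other than names, which is why factors fail that test.

```r
# A small, fixed factor (hypothetical example data)
f <- factor(c("foo", "bar", "foo"))

class(f)      # "factor": the class attribute that R functions dispatch on
typeof(f)     # "integer": the underlying storage type
levels(f)     # "bar" "foo": level names, sorted alphabetically by default
unclass(f)    # the integer codes 2 1 2, plus the levels attribute
is.factor(f)  # TRUE: the appropriate test for a factor
```

When checking whether an object is a factor, is.factor() is the direct test; the is.numeric() and is.character() checks only tell us what the object is not.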
Similarly, even though factors are built from integer vectors, we can’t do mathematical operations on factors. Since factors represent discrete categories, not continuous quantities, mathematical operations on factors are generally meaningless.
f1 + 1
[1] NA NA NA NA NA NA NA NA NA NA
mean(f1)
[1] NA
In the examples above, the requested operations return missing values, NA, because we’re asking R to do a nonsensical calculation whose result cannot be meaningfully defined. By design, treating a factor as if it were numeric produces errors, warnings, or missing values rather than a numeric answer. This conservative behavior ensures that categorical data are not accidentally interpreted as numeric values, which would usually produce misleading results.
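As a rough illustration of why this caution matters, the sketch below uses a hypothetical factor f whose level names happen to look like numbers. It compares the internal integer codes with the values named by the levels, and shows one common way to recover those values when a factor really does encode numeric data. The object names (f, res, codes, vals) are illustrative assumptions, not part of the examples above.

```r
# A fixed factor with numeric-looking level names (hypothetical example data)
f <- factor(c("10", "20", "10", "30"))

# Arithmetic on a factor yields NA (with a warning, suppressed here)
res <- suppressWarnings(f + 1)
res  # NA NA NA NA

# as.integer() returns the internal codes, not the values named by the levels
codes <- as.integer(f)
codes  # 1 2 1 3

# Converting via the level names recovers the intended numbers
vals <- as.numeric(as.character(f))
vals        # 10 20 10 30
mean(vals)  # 17.5
```

Note that the codes (1, 2, 1, 3) bear no numeric relationship to the labels (10, 20, 10, 30); silently computing with the codes is exactly the kind of misleading result that the NA behavior guards against.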