Because data frames are just lists, we can access a data frame’s columns using the same methods we would use for lists. To access a single column by name, the most efficient method is typically the $ operator.
We can also use the single, [], or double, [[]], square bracket operators. As with lists, these operators differ in two respects: how many columns they can select and how they format the resulting selection.
[]: Can select multiple elements and always returns a data frame.
[[]]: Can select only one element and returns the column contents as a vector (or whatever type of object the column contained).
# Return a one-column data frame comprising the 'b' column from d1
d1["b"]
b
1 bar
2 bar
3 foo
4 bar
5 bar
6 foo
7 foo
8 bar
9 foo
10 foo
# The same as above, but using the column index instead of the column name
d1[2]
b
1 bar
2 bar
3 foo
4 bar
5 bar
6 foo
7 foo
8 bar
9 foo
10 foo
# Return a two-column data frame comprising the 'a' and 'b' columns from d1
d1[c("a", "b")]
a b
1 FALSE bar
2 TRUE bar
3 TRUE foo
4 TRUE bar
5 TRUE bar
6 FALSE foo
7 TRUE foo
8 TRUE bar
9 FALSE foo
10 FALSE foo
# The same as above, but using the column indices instead of the column names
d1[1:2]
a b
1 FALSE bar
2 TRUE bar
3 TRUE foo
4 TRUE bar
5 TRUE bar
6 FALSE foo
7 TRUE foo
8 TRUE bar
9 FALSE foo
10 FALSE foo
# Return the 'b' column from d1 as a character vector
d1[["b"]]
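To make the difference between the three operators concrete, the following sketch checks what each one returns. It uses a small stand-in data frame (the name `df` and its contents are assumptions for illustration, not the `d1` object used above).

```r
# Hypothetical stand-in data frame; not the d1 object from the text
df <- data.frame(
  a = c(TRUE, FALSE, TRUE),
  b = c("bar", "foo", "bar"),
  c = c(0.1, 0.2, 0.3),
  stringsAsFactors = FALSE  # keep 'b' as character in every R version
)

# `$` and `[[]]` both return the column contents
by_dollar <- df$b
by_double <- df[["b"]]
identical(by_dollar, by_double)  # TRUE

# `[]` keeps the data frame structure, even for a single column
by_single <- df["b"]
is.data.frame(by_single)  # TRUE
is.data.frame(by_double)  # FALSE
is.character(by_double)   # TRUE
```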
Data frames also support matrix-style subsetting, where we define the selection by specifying both the row and column indices.
d1[1:2, 2:3]
b c
1 bar 0.4957513
2 bar 0.3761720
d1[ , 1:2]
a b
1 FALSE bar
2 TRUE bar
3 TRUE foo
4 TRUE bar
5 TRUE bar
6 FALSE foo
7 TRUE foo
8 TRUE bar
9 FALSE foo
10 FALSE foo
d1[2:3, ]
a b c
2 TRUE bar 0.3761720
3 TRUE foo 0.9735387
In most cases, matrix-style subsetting behaves the same way as the [] list-style operator (you can select any number of elements, and the selection is returned as a data frame), but there is one exception: if you select a single column using matrix-style subsetting, the selection will be converted to a vector.
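If we want to keep the data frame structure when selecting a single column with matrix-style subsetting, one option is the `drop = FALSE` argument of the `[` operator. The sketch below illustrates this with a hypothetical data frame named `df` (an assumption for illustration, not the `d1` object from the text).

```r
# Hypothetical stand-in data frame
df <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)

# Selecting one column matrix-style drops the data frame structure
as_vector <- df[, "b"]
is.data.frame(as_vector)  # FALSE

# drop = FALSE keeps the selection as a one-column data frame
as_frame <- df[, "b", drop = FALSE]
is.data.frame(as_frame)   # TRUE
```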
Naturally, we can overwrite the columns of a data frame using the same procedures that we use to modify list slots. When modifying one column at a time, we directly apply the intuitive operations.
# View the original data frame
d1
a b c
1 FALSE bar 0.49575135
2 TRUE bar 0.37617197
3 TRUE foo 0.97353867
4 TRUE bar 0.30581726
5 TRUE bar 0.58252890
6 FALSE foo 0.75880503
7 TRUE foo 0.04691056
8 TRUE bar 0.27594473
9 FALSE foo 0.45335390
10 FALSE foo 0.04816235
## Modify some list elements
d1$a <- LETTERS[1:10]
d1[[2]] <- rnorm(10)
d1["c"] <- rep(c(TRUE, FALSE), each = 5)

# View the modified data frame
d1
a b c
1 A -0.5906956 TRUE
2 B -0.5742008 TRUE
3 C -1.7411826 TRUE
4 D 0.0294189 TRUE
5 E 1.1534412 TRUE
6 F 0.6410818 FALSE
7 G -1.3527182 FALSE
8 H -1.7820197 FALSE
9 I 0.7982336 FALSE
10 J 1.5504525 FALSE
When modifying multiple columns with the [] operator, it’s best to supply the replacement values as a data frame or list with the same size as the selected columns.
# Replace the first two columns of d1 with an equivalently sized data frame
# extracted from the 'iris' dataset
d1[1:2] <- iris[1:10, 1:2]
d1
When the replacement size doesn’t match the selection size, R will use recycling to resolve the discrepancy, but it’s not always easy to predict how the replacement will behave.
# Replace the first two columns of d1 by recycling the vector `1:5`
d1[1:2] <- 1:5
d1
# Replace the 'a' and 'c' columns in d1 with a list containing vectors that
# will need to be recycled
d1[c("a", "c")] <- list(c("yes", "no"), 7:8)
d1
a b c
1 yes 1 7
2 no 2 8
3 yes 3 7
4 no 4 8
5 yes 5 7
6 no 1 8
7 yes 2 7
8 no 3 8
9 yes 4 7
10 no 5 8
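One way to keep recycling predictable is to check how a plain vector is spread across the selected columns and, when in doubt, to build full-length replacement columns explicitly with `rep()`. The following is a minimal sketch under that assumption; the data frame `base` and its column names are hypothetical and are not part of the examples above.

```r
# Hypothetical four-row data frame used only for this illustration
base <- data.frame(x = rep(0, 4), y = rep(0, 4), z = 1:4)

# A vector as long as the selection fills the columns one after another
filled <- base
filled[c("x", "y")] <- 1:8
filled$x  # 1 2 3 4
filled$y  # 5 6 7 8

# A shorter vector is recycled across the whole selection
recycled <- base
recycled[c("x", "y")] <- 1:2
recycled$x  # 1 2 1 2
recycled$y  # 1 2 1 2

# Explicit alternative: one full-length list element per selected column
explicit <- base
explicit[c("x", "y")] <- list(
  rep(1:2, each = 2),
  rep(c("u", "v"), length.out = nrow(explicit))
)
explicit$x  # 1 1 2 2
explicit$y  # "u" "v" "u" "v"
```

Spelling out the replacement with `rep()` is more verbose, but the code then states exactly which value lands in which row.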
As with matrices, R is oddly specific (in a slightly different way) about the kinds of size discrepancies it will automatically resolve when modifying data frames.
OK:
Replacement length cleanly divides the selection length
Replacement length exceeds the selection length
Replacement list contains more slots than columns selected
Replacement data frame contains more slots than columns selected
Not OK:
Replacement length is shorter than the selection length and does not cleanly divide it
# Works: Replace the first two columns of d1 by using the first 20 elements
# from `100:500`
d1[1:2] <- 100:500
d1
# Works: Replace the 'a' and 'c' columns in d1 with the first two columns
# from a three-column data frame
d1[c("a", "c")] <- data.frame("foo", "bar", "baz")
d1
a b c
1 foo 110 bar
2 foo 111 bar
3 foo 112 bar
4 foo 113 bar
5 foo 114 bar
6 foo 115 bar
7 foo 116 bar
8 foo 117 bar
9 foo 118 bar
10 foo 119 bar
# Fails: Replace the first two columns of d1 by using the non-conformable vector 1:3
d1[1:2] <- 1:3
Error in `[<-.data.frame`(`*tmp*`, 1:2, value = 1:3): replacement has 3 items, need 20
# Fails: Replace the 'a' and 'c' columns in d1 with a list containing
# non-conformable vectors
d1[c("a", "c")] <- list(letters[4], 1:8)
Error in `[<-.data.frame`(`*tmp*`, c("a", "c"), value = list("d", 1:8)): replacement element 2 has 8 rows, need 10
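When we are unsure whether a replacement is conformable, we can try the assignment on a copy and catch the error. The sketch below is one possible way to do that; the helper `replacement_works()` and the data frame `df` are hypothetical names introduced for illustration, and the checks only cover replacing the first two columns of a six-row data frame.

```r
# Hypothetical six-row data frame; the selection df[1:2] has 12 cells
df <- data.frame(a = 1:6, b = 6:1, c = letters[1:6])

# Hypothetical helper: TRUE if assigning `value` to the first two
# columns succeeds, FALSE if R throws an error
replacement_works <- function(value) {
  tmp <- df  # work on a copy so that df itself is never modified
  tryCatch({
    # suppressWarnings(): R may warn when surplus values are discarded
    suppressWarnings(tmp[1:2] <- value)
    TRUE
  }, error = function(e) FALSE)
}

checks <- c(
  divides      = replacement_works(1:3),             # 3 divides 12
  same_size    = replacement_works(1:12),            # exact fit
  too_long     = replacement_works(1:24),            # surplus values unused
  no_divide    = replacement_works(1:5),             # 5 does not divide 12
  list_divides = replacement_works(list("d", 1:3)),  # 1 and 3 divide 6 rows
  list_no_fit  = replacement_works(list("d", 1:4))   # 4 does not divide 6 rows
)
checks
```

In this sketch the first three checks and `list_divides` come back `TRUE`, while `no_divide` and `list_no_fit` come back `FALSE`, which mirrors the working and failing examples above.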
Adding Columns
As with lists, we can add new columns to an existing data frame using the $ or [[]] operators.
# Create a two-column data frame
(d2 <- data.frame(a = 1:10, b = "bob"))
a b
1 1 bob
2 2 bob
3 3 bob
4 4 bob
5 5 bob
6 6 bob
7 7 bob
8 8 bob
9 9 bob
10 10 bob
## Various ways of adding new single columns
d2$c <- letters[1:10]
d2[["d"]] <- runif(10)
d2[[5]] <- rnorm(10)
d2["alice"] <- TRUE
d2
a b c d V5 alice
1 1 bob a 0.39551254 0.9116125 TRUE
2 2 bob b 0.02311933 0.3341865 TRUE
3 3 bob c 0.16358210 1.3088335 TRUE
4 4 bob d 0.96102468 -0.5141460 TRUE
5 5 bob e 0.34147122 -0.3722525 TRUE
6 6 bob f 0.59231193 0.3462801 TRUE
7 7 bob g 0.40779073 0.7340008 TRUE
8 8 bob h 0.94040284 1.4507504 TRUE
9 9 bob i 0.28812749 0.4430496 TRUE
10 10 bob j 0.60931561 0.1972789 TRUE
We can add multiple columns using the [] operator.
a b c d V5 alice V7 V8 foo bar
1 1 bob a 0.39551254 0.9116125 TRUE -2.4591839 -0.71035093 TRUE FALSE
2 2 bob b 0.02311933 0.3341865 TRUE -0.4603825 -0.04109860 TRUE FALSE
3 3 bob c 0.16358210 1.3088335 TRUE 0.1271004 0.85363963 TRUE FALSE
4 4 bob d 0.96102468 -0.5141460 TRUE 0.2922169 -0.83978767 TRUE FALSE
5 5 bob e 0.34147122 -0.3722525 TRUE 0.3838345 -0.04331845 TRUE FALSE
6 6 bob f 0.59231193 0.3462801 TRUE 0.9490207 -0.44625041 TRUE FALSE
7 7 bob g 0.40779073 0.7340008 TRUE 0.5023876 -0.03732082 TRUE FALSE
8 8 bob h 0.94040284 1.4507504 TRUE 0.2338697 0.93048072 TRUE FALSE
9 9 bob i 0.28812749 0.4430496 TRUE -1.4386591 -0.07930940 TRUE FALSE
10 10 bob j 0.60931561 0.1972789 TRUE 0.8374561 2.30531232 TRUE FALSE
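As a minimal sketch of adding several columns in one step with the [] operator, the example below assigns a list to column names and column indices that do not exist yet. The data frame `people` and its column names are hypothetical and are not the `d2` object shown above.

```r
# Hypothetical three-row data frame
people <- data.frame(id = 1:3, name = "bob")

# Add two new named columns at once; the length-one value is recycled
people[c("flag", "score")] <- list(TRUE, c(0.5, 1.5, 2.5))

# Add two more columns by index, directly after the last existing column;
# R names unnamed new columns V<index>
people[5:6] <- list(letters[1:3], 3:1)

names(people)  # "id" "name" "flag" "score" "V5" "V6"
```

Supplying the new columns as a list keeps one list element per column, so each column can have its own type.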
Practice
Run the following code to create an empty data frame containing 10 observations of the 3 variables: a, b, c. Then populate the data frame as described below.
Fill column a with the integer sequence from -9 to 0.
Use the column name to assign the new values.
Fill column b with the even integers between 1 and 20 (inclusive).
Use the numeric column index to assign the new values.
Replace the odd rows in column c with the odd integers between 11 and 20 (inclusive).
Do not overwrite the missing values in the even rows.