Manipulating Data Frames

Accessing Data Frame Elements

Because data frames are just lists, we can access a data frame’s columns using the same methods we would use for lists. To access a single column by name, the most efficient method is typically the $ operator.

d1 <- data.frame(a = sample(c(TRUE, FALSE), 10, replace = TRUE),
                 b = sample(c("foo", "bar"), 10, replace = TRUE),
                 c = runif(10)
                 )
d1$b

 [1] "bar" "bar" "foo" "bar" "bar" "foo" "foo" "bar" "foo" "foo"

data(iris)
iris$Petal.Length

  [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
 [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
 [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
 [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
 [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
 [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
[109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
[127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
[145] 5.7 5.2 5.0 5.2 5.4 5.1

We can also use the single, [], or double, [[]], square bracket operators. As with lists, these operators differ in two respects how many columns they can select and how they format resulting selection.

[]: Can select multiple elements and always returns a data frame.
[[]]: Can select only one element and returns the column contents as a vector (or whatever type of object the column contained).

# Return a one-column data frame comprising the 'b' column from d1
d1["b"]

     b
1  bar
2  bar
3  foo
4  bar
5  bar
6  foo
7  foo
8  bar
9  foo
10 foo

# The same as above, but using the column index instead of the column name
d1[2]

     b
1  bar
2  bar
3  foo
4  bar
5  bar
6  foo
7  foo
8  bar
9  foo
10 foo

# Return a two-column data frame comprising the 'a' and 'b' columns from d1
d1[c("a", "b")]

       a   b
1  FALSE bar
2   TRUE bar
3   TRUE foo
4   TRUE bar
5   TRUE bar
6  FALSE foo
7   TRUE foo
8   TRUE bar
9  FALSE foo
10 FALSE foo

# The same as above, but using the column indices instead of the column names
d1[1:2]

       a   b
1  FALSE bar
2   TRUE bar
3   TRUE foo
4   TRUE bar
5   TRUE bar
6  FALSE foo
7   TRUE foo
8   TRUE bar
9  FALSE foo
10 FALSE foo

# Return the 'b' column from d1 as a character vector
d1[["b"]]

 [1] "bar" "bar" "foo" "bar" "bar" "foo" "foo" "bar" "foo" "foo"

# The same as above, but using the column index instead of the column name
d1[[2]]

 [1] "bar" "bar" "foo" "bar" "bar" "foo" "foo" "bar" "foo" "foo"

Matrix-Style Selection

Data frames also support matrix-style subsetting, where we define the selection by specifying both the row and column indices.

d1[1:2, 2:3]

    b         c
1 bar 0.4957513
2 bar 0.3761720

d1[ , 1:2]

       a   b
1  FALSE bar
2   TRUE bar
3   TRUE foo
4   TRUE bar
5   TRUE bar
6  FALSE foo
7   TRUE foo
8   TRUE bar
9  FALSE foo
10 FALSE foo

d1[2:3, ]

     a   b         c
2 TRUE bar 0.3761720
3 TRUE foo 0.9735387

In most cases, matrix style subsetting behaves the same was as the [] list-style operator—you can select any number of elements, and the selection is returned as a data frame—but there is one exception If you select a single column using matrix-style subsetting, the selection will be converted to a vector.

d1[ , 1]

 [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE

Modifying Data Frame Elements

Naturally, we can overwrite the columns of a data frame using the same procedures that we use to modify list slots. When modifying one column at a time, we directly apply the intuitive operations.

# View the original data frame
d1

       a   b          c
1  FALSE bar 0.49575135
2   TRUE bar 0.37617197
3   TRUE foo 0.97353867
4   TRUE bar 0.30581726
5   TRUE bar 0.58252890
6  FALSE foo 0.75880503
7   TRUE foo 0.04691056
8   TRUE bar 0.27594473
9  FALSE foo 0.45335390
10 FALSE foo 0.04816235

## Modify some list elements
d1$a <- LETTERS[1:10]
d1[[2]] <- rnorm(10)
d1["c"] <- rep(c(TRUE, FALSE), each = 5)

# View the modified data frame
d1

   a          b     c
1  A -0.5906956  TRUE
2  B -0.5742008  TRUE
3  C -1.7411826  TRUE
4  D  0.0294189  TRUE
5  E  1.1534412  TRUE
6  F  0.6410818 FALSE
7  G -1.3527182 FALSE
8  H -1.7820197 FALSE
9  I  0.7982336 FALSE
10 J  1.5504525 FALSE

When modifying multiple columns with the [], operator, it’s best to supply the replacement values as a data frame or list with the same size as the selected columns.

# Replace the first two columns of d1 with an equivalently sized data frame
# extracted from the 'iris' dataset
d1[1:2] <- iris[1:10, 1:2]
d1

     a   b     c
1  5.1 3.5  TRUE
2  4.9 3.0  TRUE
3  4.7 3.2  TRUE
4  4.6 3.1  TRUE
5  5.0 3.6  TRUE
6  5.4 3.9 FALSE
7  4.6 3.4 FALSE
8  5.0 3.4 FALSE
9  4.4 2.9 FALSE
10 4.9 3.1 FALSE

# Replace the 'a' and 'c' columns in d1 with an equivalently sized list
d1[c("a", "c")] <- list(rnorm(10), runif(10))
d1

            a   b         c
1  -0.5783726 3.5 0.3187378
2   0.2407284 3.0 0.6187480
3   0.2428955 3.2 0.2440667
4   1.1796540 3.1 0.4660315
5  -0.1874361 3.6 0.2406703
6  -0.9529284 3.9 0.6078704
7   0.5847080 3.4 0.3974096
8   2.1370098 3.4 0.7703554
9  -0.1480493 2.9 0.1016348
10  1.7395392 3.1 0.8043306

Matrix-Style Selection

If we only want to replace part of a column, we can use matrix-style selection to choose the target cells.

d1[1:5, 2] <- 41:45
d1

            a    b         c
1  -0.5783726 41.0 0.3187378
2   0.2407284 42.0 0.6187480
3   0.2428955 43.0 0.2440667
4   1.1796540 44.0 0.4660315
5  -0.1874361 45.0 0.2406703
6  -0.9529284  3.9 0.6078704
7   0.5847080  3.4 0.3974096
8   2.1370098  3.4 0.7703554
9  -0.1480493  2.9 0.1016348
10  1.7395392  3.1 0.8043306

d1[3:6, c("a", "c")] <- list(-99, 888)
d1

             a    b           c
1   -0.5783726 41.0   0.3187378
2    0.2407284 42.0   0.6187480
3  -99.0000000 43.0 888.0000000
4  -99.0000000 44.0 888.0000000
5  -99.0000000 45.0 888.0000000
6  -99.0000000  3.9 888.0000000
7    0.5847080  3.4   0.3974096
8    2.1370098  3.4   0.7703554
9   -0.1480493  2.9   0.1016348
10   1.7395392  3.1   0.8043306

Recycling

When the replacement size doesn’t match the selection size, R will use recycling to resolve the discrepancy, but it’s not always easy to predict how the replacement will behave.

# Replace the first two columns of d1 by recycling the vector `1:5`
d1[1:2] <- 1:5
d1

   a b           c
1  1 1   0.3187378
2  2 2   0.6187480
3  3 3 888.0000000
4  4 4 888.0000000
5  5 5 888.0000000
6  1 1 888.0000000
7  2 2   0.3974096
8  3 3   0.7703554
9  4 4   0.1016348
10 5 5   0.8043306

# Replace the 'a' and 'c' columns in d1 with a list containing vectors that
# will need to be recycled
d1[c("a", "c")] <- list(c("yes", "no"), 7:8)
d1

     a b c
1  yes 1 7
2   no 2 8
3  yes 3 7
4   no 4 8
5  yes 5 7
6   no 1 8
7  yes 2 7
8   no 3 8
9  yes 4 7
10  no 5 8

As with matrices, R is oddly specific (in a slightly different way) about the kinds of size discrepancies it will automatically resolved when modifying data frames.

OK
- Replacement length > Selection length
- Replacement length cleanly divides the selection length
- Replacement length exceeds selection length
- Replacement list contains more slots than columns selected
- Replacement data frame contains more slots than columns selected
Not OK
- Replacement length does not cleanly divide selection length

# Works: Replace the first two columns of d1 by using the first 20 elements
# from the `100:500`
d1[1:2] <- 100:500
d1

     a   b c
1  100 110 7
2  101 111 8
3  102 112 7
4  103 113 8
5  104 114 7
6  105 115 8
7  106 116 7
8  107 117 8
9  108 118 7
10 109 119 8

# Works: Replace the 'a' and 'c' columns in d1 with a list containing vectors
# that are too long
d1[c("a", "c")] <- list(rnorm(100), runif(100))
d1

             a   b         c
1   2.93028916 110 0.2347217
2  -0.28528570 111 0.1280478
3  -0.07273495 112 0.8675576
4  -1.39133153 113 0.7250560
5  -0.54652858 114 0.5995073
6   1.09164117 115 0.1685879
7   0.71862569 116 0.7697158
8  -1.88902780 117 0.3991020
9   2.85287584 118 0.8479088
10  0.65070784 119 0.1337530

# Works: Replace the 'a' and 'c' columns in d1 with a the first two slots in a
# length-3 list
d1[c("a", "c")] <- list(1, 2, 3)
d1

# Works: Replace the 'a' and 'c' columns in d1 with a the first two columns
# from a three-column data frame
d1[c("a", "c")] <- data.frame("foo", "bar", "baz")
d1

     a   b   c
1  foo 110 bar
2  foo 111 bar
3  foo 112 bar
4  foo 113 bar
5  foo 114 bar
6  foo 115 bar
7  foo 116 bar
8  foo 117 bar
9  foo 118 bar
10 foo 119 bar

# Fails: Replace the first two columns of d1 by using non-conformable vector 1:3
d1[1:2] <- 1:3

Error in `[<-.data.frame`(`*tmp*`, 1:2, value = 1:3): replacement has 3 items, need 20

# Fails: Replace the 'a' and 'c' columns in d1 with a list containing
# non-conformable vectors
d1[c("a", "c")] <- list(letters[4], 1:8)

Error in `[<-.data.frame`(`*tmp*`, c("a", "c"), value = list("d", 1:8)): replacement element 2 has 8 rows, need 10

Adding Columns

As with lists, we can add new columns to an existing data frame using the $ or [[]] operators.

# Create an empty list
(d2 <- data.frame(a = 1:10, b = "bob"))

    a   b
1   1 bob
2   2 bob
3   3 bob
4   4 bob
5   5 bob
6   6 bob
7   7 bob
8   8 bob
9   9 bob
10 10 bob

## Various ways of adding new single columns
d2$c <- letters[1:10]
d2[["d"]] <- runif(10)
d2[[5]] <- rnorm(10)
d2["alice"] <- TRUE
d2

    a   b c          d         V5 alice
1   1 bob a 0.39551254  0.9116125  TRUE
2   2 bob b 0.02311933  0.3341865  TRUE
3   3 bob c 0.16358210  1.3088335  TRUE
4   4 bob d 0.96102468 -0.5141460  TRUE
5   5 bob e 0.34147122 -0.3722525  TRUE
6   6 bob f 0.59231193  0.3462801  TRUE
7   7 bob g 0.40779073  0.7340008  TRUE
8   8 bob h 0.94040284  1.4507504  TRUE
9   9 bob i 0.28812749  0.4430496  TRUE
10 10 bob j 0.60931561  0.1972789  TRUE

We can add multiple columns using the [] operator.

d2[7:8] <- rnorm(20)
d2[c("foo", "bar")] <- list(TRUE, FALSE)
d2

    a   b c          d         V5 alice         V7          V8  foo   bar
1   1 bob a 0.39551254  0.9116125  TRUE -2.4591839 -0.71035093 TRUE FALSE
2   2 bob b 0.02311933  0.3341865  TRUE -0.4603825 -0.04109860 TRUE FALSE
3   3 bob c 0.16358210  1.3088335  TRUE  0.1271004  0.85363963 TRUE FALSE
4   4 bob d 0.96102468 -0.5141460  TRUE  0.2922169 -0.83978767 TRUE FALSE
5   5 bob e 0.34147122 -0.3722525  TRUE  0.3838345 -0.04331845 TRUE FALSE
6   6 bob f 0.59231193  0.3462801  TRUE  0.9490207 -0.44625041 TRUE FALSE
7   7 bob g 0.40779073  0.7340008  TRUE  0.5023876 -0.03732082 TRUE FALSE
8   8 bob h 0.94040284  1.4507504  TRUE  0.2338697  0.93048072 TRUE FALSE
9   9 bob i 0.28812749  0.4430496  TRUE -1.4386591 -0.07930940 TRUE FALSE
10 10 bob j 0.60931561  0.1972789  TRUE  0.8374561  2.30531232 TRUE FALSE

Practice

Run the following code to create an empty data frame containing 10 observations of the 3 variables: a, b, c. Then populate the data frame as described below.

Fill column a with the integer sequence from -9 to 0.
- Use the column name to assign the new values.
Fill column b with the even integers between 1 and 20 (inclusive).
- Use the numeric column index to assign the new values.
Replace the odd rows in column c with the odd integers between 11 and 20 (inclusive).
- Do not overwrite the missing values in the even rows.

Interactive Editor
Solution

df <- data.frame(a = rep(NA, 10),
                 b = rep(NA, 10),
                 c = rep(NA, 10)
                 )
df

    a  b  c
1  NA NA NA
2  NA NA NA
3  NA NA NA
4  NA NA NA
5  NA NA NA
6  NA NA NA
7  NA NA NA
8  NA NA NA
9  NA NA NA
10 NA NA NA

df$a <- -9:0
df[2] <- seq(2,20,2)
df[seq(1, 9, 2), "c"] <- seq(11, 19, 2)
df

    a  b  c
1  -9  2 11
2  -8  4 NA
3  -7  6 13
4  -6  8 NA
5  -5 10 15
6  -4 12 NA
7  -3 14 17
8  -2 16 NA
9  -1 18 19
10  0 20 NA