Creating Data Frames

We create data frames using the data.frame() function. This function takes vectors as arguments, and each vector becomes a column in the resulting data frame.

In the following code, we create a data frame called d1 that contains 10 rows and 3 columns.

The first column is the integer sequence from 1 to 10.
The second column alternates between -1 and 1.
The third column contains the sequence \(\{0.1, 0.2, \ldots, 1.0\}\).

(d1 <- data.frame(1:10, c(-1, 1), seq(0.1, 1, 0.1)))

   X1.10 c..1..1. seq.0.1..1..0.1.
1      1       -1              0.1
2      2        1              0.2
3      3       -1              0.3
4      4        1              0.4
5      5       -1              0.5
6      6        1              0.6
7      7       -1              0.7
8      8        1              0.8
9      9       -1              0.9
10    10        1              1.0
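
The second column fills all 10 rows because data.frame() recycles shorter vectors to match the longest one. The following is a minimal sketch of that behavior; the object names recycled and bad are made up for this illustration. Recycling only succeeds when the number of rows is a whole multiple of the shorter vector's length.

```r
# c(-1, 1) has length 2, so it is repeated 5 times to fill 10 rows
recycled <- data.frame(1:10, c(-1, 1))
nrow(recycled)

# A length-3 vector cannot be recycled evenly into 10 rows,
# so data.frame() raises an error; tryCatch() captures the message
bad <- tryCatch(data.frame(1:10, c(-1, 0, 1)),
                error = function(e) conditionMessage(e))
bad
```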

Since we did not name the arguments to data.frame() when creating d1, R generated default (ugly) column names from the text of each argument expression.

colnames(d1)

[1] "X1.10"            "c..1..1."         "seq.0.1..1..0.1."
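
These names appear to come from two steps: data.frame() takes the text of each unnamed argument, then makes that text syntactically valid with make.names() (this is governed by the check.names argument). The sketch below reproduces the second step by hand; raw_names and d_raw are illustrative names only.

```r
# The text of the three unnamed arguments used to create d1
raw_names <- c("1:10", "c(-1, 1)", "seq(0.1, 1, 0.1)")

# make.names() prefixes an X where needed and replaces
# invalid characters with "."
make.names(raw_names)

# With check.names = FALSE, data.frame() skips that cleanup step
d_raw <- data.frame(1:10, c(-1, 1), seq(0.1, 1, 0.1), check.names = FALSE)
names(d_raw)
```

Names that are not syntactically valid must be wrapped in backticks when used with $, so the default cleanup is usually worth keeping.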

We can assign our own column names by naming the arguments to data.frame().

# Generate a version of d1 with better column names
(d2 <- data.frame(x = 1:10, y = c(-1, 1), z = seq(0.1, 1, 0.1)))

    x  y   z
1   1 -1 0.1
2   2  1 0.2
3   3 -1 0.3
4   4  1 0.4
5   5 -1 0.5
6   6  1 0.6
7   7 -1 0.7
8   8  1 0.8
9   9 -1 0.9
10 10  1 1.0

We can also assign column names to an existing data frame using the names() or colnames() functions.

names(d1) <- c("foo", "bar", "baz")
d1

   foo bar baz
1    1  -1 0.1
2    2   1 0.2
3    3  -1 0.3
4    4   1 0.4
5    5  -1 0.5
6    6   1 0.6
7    7  -1 0.7
8    8   1 0.8
9    9  -1 0.9
10  10   1 1.0

colnames(d1) <- c("alice", "bob", "suzy")
d1

   alice bob suzy
1      1  -1  0.1
2      2   1  0.2
3      3  -1  0.3
4      4   1  0.4
5      5  -1  0.5
6      6   1  0.6
7      7  -1  0.7
8      8   1  0.8
9      9  -1  0.9
10    10   1  1.0
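
Because names() returns an ordinary character vector, we can also rename a single column by indexing into it. A small sketch, using a made-up data frame d:

```r
d <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(0.5, 1.5, 2.5))

# Rename only the second column, selected by position
names(d)[2] <- "label"

# Rename a column selected by its current name
names(d)[names(d) == "z"] <- "value"

names(d)
```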

One of the greatest strengths of data frames is their ability to store columns of different types. For example, in the following code, we create a data frame called d3 that comprises three columns, each of a different type.

A logical vector containing 10 randomly sampled boolean values.
A character vector containing 10 random selections from the set {“foo”, “bar”}.
A numeric vector containing 10 values randomly sampled from the interval [0, 1].

(d3 <- data.frame(a = sample(c(TRUE, FALSE), 10, replace = TRUE),
                  b = sample(c("foo", "bar"), 10, replace = TRUE),
                  c = runif(10)))

       a   b          c
1  FALSE foo 0.09898253
2   TRUE bar 0.36168087
3  FALSE foo 0.20896820
4   TRUE foo 0.24526171
5   TRUE foo 0.96918501
6  FALSE foo 0.84193358
7  FALSE foo 0.12520002
8   TRUE bar 0.98115814
9  FALSE foo 0.05207236
10 FALSE bar 0.35570100
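
To confirm the type of each column, we can apply class() to every column or print a compact summary with str(). The sketch below uses a small made-up data frame called mixed. It sets stringsAsFactors = FALSE explicitly because older versions of R (before 4.0.0) converted character columns to factors by default.

```r
mixed <- data.frame(flag = c(TRUE, FALSE, TRUE),
                    label = c("foo", "bar", "foo"),
                    value = c(0.25, 0.5, 0.75),
                    stringsAsFactors = FALSE)

# Class of each column
sapply(mixed, class)

# Compact summary of dimensions, column types, and leading values
str(mixed)
```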

If we want to quickly generate a placeholder data frame filled with missing values, we can supply a matrix of NA as the sole argument to data.frame().

(d4 <- data.frame(matrix(NA, 10, 3)))

   X1 X2 X3
1  NA NA NA
2  NA NA NA
3  NA NA NA
4  NA NA NA
5  NA NA NA
6  NA NA NA
7  NA NA NA
8  NA NA NA
9  NA NA NA
10 NA NA NA
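
Note that a plain NA is a logical value, so every column created this way is logical. If the intended column types are known in advance, one alternative is to fill each column with a typed missing value. The following sketch assumes three hypothetical columns named id, name, and score.

```r
# Columns built from a matrix of NA are all logical
from_matrix <- data.frame(matrix(NA, 10, 3))
sapply(from_matrix, class)

# Typed missing values give each column its intended type from the start
typed <- data.frame(id = rep(NA_integer_, 10),
                    name = rep(NA_character_, 10),
                    score = rep(NA_real_, 10),
                    stringsAsFactors = FALSE)
sapply(typed, class)
```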

Practice

Create a data frame containing 7 observations of 3 variables.

Use a different type for each of the variables.

Solution

There are infinitely many correct solutions to this problem. The following is one possibility.

df <- data.frame(text = rep("text", 7),
                 integer = sample(1:10, 7, replace = TRUE),
                 complex = complex(7, 1:5, 1:3)
                 )
df

  text integer complex
1 text       8    1+1i
2 text       1    2+2i
3 text       3    3+3i
4 text       2    4+1i
5 text       5    5+2i
6 text       9    1+3i
7 text      10    2+1i

Structure of Data Frames

Although data frames look like two-dimensional matrices, they are actually lists of equal-length vectors. Each column in a data frame corresponds to a slot in the underlying list, and each of these list slots holds some sort of vector that contains the data from the corresponding column. So, all data frames are lists.

d1 <- data.frame(a = sample(c(TRUE, FALSE), 10, replace = TRUE),
                 b = sample(c("foo", "bar"), 10, replace = TRUE),
                 c = runif(10)
                 )
d2 <- data.frame(a = 1:10, b = c(-1, 1), c = seq(0.1, 1, 0.1))
d3 <- data.frame(x = -5:4, y = c(1, 10), z = seq(1, 1.5, length.out = 10))

# d1, d2, and d3 are all data frames.
is.data.frame(d1)

[1] TRUE

is.data.frame(d2)

[1] TRUE

is.data.frame(d3)

[1] TRUE

# But they're also lists
is.list(d1)

[1] TRUE

is.list(d2)

[1] TRUE

is.list(d3)

[1] TRUE
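
The list structure also shows up in how a data frame behaves with ordinary list tools. A short sketch, using a made-up data frame called scores:

```r
scores <- data.frame(id = 1:3, score = c(0.5, 0.75, 1))

# The underlying storage type is a list
typeof(scores)

# The list has one element per column
length(scores) == ncol(scores)

# List-style extraction returns the column as a plain vector
scores[["score"]]

# List tools such as sapply() iterate over the columns
sapply(scores, mean)
```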

All data frames are lists, but not all lists need be data frames.

l1 <- list()
l2 <- list(a = sample(c(TRUE, FALSE), 10, replace = TRUE),
           b = sample(c("foo", "bar"), 10, replace = TRUE),
           c = runif(10)
           )
l3 <- list(d2, d3)

# l1, l2, and l3 are all lists
is.list(l1)

[1] TRUE

is.list(l2)

[1] TRUE

is.list(l3)

[1] TRUE

# But they're not data frames
is.data.frame(l1)

[1] FALSE

is.data.frame(l2)

[1] FALSE

is.data.frame(l3)

[1] FALSE
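
A list can be converted into a data frame with as.data.frame(), provided its elements can form rectangular rows. The sketch below uses two made-up lists, equal_list and ragged_list, to show one conversion that succeeds and one that fails.

```r
# A named list of equal-length vectors converts cleanly
equal_list <- list(a = c(TRUE, FALSE, TRUE),
                   b = c("foo", "bar", "foo"),
                   c = c(0.1, 0.2, 0.3))
converted <- as.data.frame(equal_list)
is.data.frame(converted)
dim(converted)

# Elements of lengths 3 and 2 cannot be lined up into rows,
# so the conversion raises an error that tryCatch() intercepts
ragged_list <- list(a = 1:3, b = 1:2)
result <- tryCatch(as.data.frame(ragged_list),
                   error = function(e) "conversion failed")
result
```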

As you have probably surmised, data frames are not matrices. Although data frames and matrices share many superficial similarities, they’re almost entirely unrelated types of objects.

is.matrix(d1)

[1] FALSE

is.matrix(d2)

[1] FALSE

is.matrix(d3)

[1] FALSE

Consequently, you cannot apply matrix operations (such as matrix multiplication) to a data frame.

d2 %*% t(d3)

Error in d2 %*% t(d3): requires numeric/complex matrix/vector arguments

If you want to analyze a data frame like a matrix, you can convert the data frame into a matrix using the as.matrix() function.

as.matrix(d2) %*% t(as.matrix(d3))

       [,1]       [,2]       [,3]       [,4]       [,5]      [,6]       [,7]
 [1,]  -5.9 -13.894444  -3.888889 -11.883333 -1.8777778 -9.872222  0.1333333
 [2,]  -8.8   2.211111  -4.777778   6.233333 -0.7555556 10.255556  3.2666667
 [3,] -15.7 -21.683333  -9.666667 -15.650000 -3.6333333 -9.616667  2.4000000
 [4,] -18.6  -5.577778 -10.555556   2.466667 -2.5111111 10.511111  5.5333333
 [5,] -25.5 -29.472222 -15.444444 -19.416667 -5.3888889 -9.361111  4.6666667
 [6,] -28.4 -13.366667 -16.333333  -1.300000 -4.2666667 10.766667  7.8000000
 [7,] -35.3 -37.261111 -21.222222 -23.183333 -7.1444444 -9.105556  6.9333333
 [8,] -38.2 -21.155556 -22.111111  -5.066667 -6.0222222 11.022222 10.0666667
 [9,] -45.1 -45.050000 -27.000000 -26.950000 -8.9000000 -8.850000  9.2000000
[10,] -48.0 -28.944444 -27.888889  -8.833333 -7.7777778 11.277778 12.3333333
            [,8]      [,9] [,10]
 [1,] -7.8611111  2.144444 -5.85
 [2,] 14.2777778  7.288889 18.30
 [3,] -3.5833333  8.433333  2.45
 [4,] 18.5555556 13.577778 26.60
 [5,]  0.6944444 14.722222 10.75
 [6,] 22.8333333 19.866667 34.90
 [7,]  4.9722222 21.011111 19.05
 [8,] 27.1111111 26.155556 43.20
 [9,]  9.2500000 27.300000 27.35
[10,] 31.3888889 32.444444 51.50

You should be careful with any type conversions. A matrix can hold only one type of data, so if your data frame contains mixed column types, as.matrix() will coerce every column into a single compatible type.

(m1 <- as.matrix(d1))

      a       b     c           
 [1,] "FALSE" "bar" "0.13183423"
 [2,] "FALSE" "bar" "0.05030883"
 [3,] "FALSE" "bar" "0.50325438"
 [4,] "FALSE" "bar" "0.35182148"
 [5,] "FALSE" "bar" "0.43251110"
 [6,] "TRUE"  "bar" "0.47063528"
 [7,] "FALSE" "foo" "0.62802475"
 [8,] "TRUE"  "bar" "0.73321050"
 [9,] "TRUE"  "foo" "0.07805386"
[10,] "FALSE" "foo" "0.27256752"

typeof(m1)

[1] "character"

(d4 <- data.frame(x = rep(c(TRUE, FALSE), 3), y = runif(6)))

      x         y
1  TRUE 0.3753425
2 FALSE 0.8523554
3  TRUE 0.3087074
4 FALSE 0.9611211
5  TRUE 0.2122324
6 FALSE 0.2282841

(m4 <- as.matrix(d4))

     x         y
[1,] 1 0.3753425
[2,] 0 0.8523554
[3,] 1 0.3087074
[4,] 0 0.9611211
[5,] 1 0.2122324
[6,] 0 0.2282841

typeof(m4)

[1] "double"
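
One way to avoid unintended coercion to character is to keep only the numeric columns before converting. The following is a sketch under that assumption, using a made-up data frame called survey; whether dropping the other columns is acceptable depends on the analysis.

```r
survey <- data.frame(flag = c(TRUE, FALSE, TRUE),
                     label = c("foo", "bar", "foo"),
                     value = c(0.25, 0.5, 0.75),
                     count = c(3L, 1L, 2L),
                     stringsAsFactors = FALSE)

# Keep only the numeric (integer or double) columns
numeric_cols <- survey[sapply(survey, is.numeric)]

# The resulting matrix stays numeric instead of becoming character
m <- as.matrix(numeric_cols)
typeof(m)
colnames(m)
```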