Loading Data into R

In the following text, we have tried to write functions in a mono-spaced font like this example() and we show code snippets in boxes like this:

> c(1,2,3,4,5)
[1] 1 2 3 4 5

In the example above, the greater than symbol > represents the R command prompt, where we type in the command c(1,2,3,4,5), causing R to print out the result, a five-element numeric vector. In the actual R GUI, what you type is usually shown in red and the output in blue (see the screenshots of the R command line).

Variables and Objects

The basic “container” for information in R is called an object. It can contain several different types of information. An object may hold a single value, a collection of numbers or character strings, or a mixture of these. It may also hold a list of commands, which make up a function.

To assign a value to a variable name in R, the ‘gets’ (<-) operator is conventionally used (rather than the equals sign). For example, to assign the value 0.05 to the variable ‘p’, we type:

> p <- 0.05

This creates a scalar object named ‘p’ with the value 0.05, so that whenever ‘p’ is used, it takes the value 0.05:

> p <- 0.05
> 100*p
[1] 5

R is case sensitive, so p and P refer to different objects:

> p <- 0.05
> P <- 10
> 100*p
[1] 5
> 100*P
[1] 1000

If we look at the "Misc" menu in the R Console we see it gives us the opportunity to list or remove the objects that are in memory. Alternatively, we can list using ls() or remove using rm() commands:

> ls()
[1] "p" "P"
> rm(P)
> ls()
[1] "p"
> p
[1] 0.05
> P
Error: object "P" not found
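The same housekeeping can be done from a script. The sketch below repeats the session above and uses exists(), which is our addition rather than part of the original example, to check whether an object is still in memory without triggering an error:

```r
p <- 0.05
P <- 10

had_P <- exists("P")   # TRUE: P has just been created
rm(P)                  # remove P from memory
has_P <- exists("P")   # FALSE: P is gone
has_p <- exists("p")   # TRUE: p is a different object and is untouched

print(c(had_P = had_P, has_P = has_P, has_p = has_p))

# Note: rm(list = ls()) removes every object in the workspace, so use it with care.
```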

NB: Everything in R is case sensitive, even command names. In other words, typing in LS() will not result in the same outcome as ls() would:

> LS()
Error: could not find function "LS"

Creating Strings

To create a character object, quotes (or speech marks) must be used. For example, to assign the character string "hello" to the object h, we type:

> h <- "hello"

R will actually let you use either double quote characters (shift and two on a British keyboard) or single quotes:

> h <- 'hello'

Creating Vectors

To create a vector containing many values, the concatenate function c() can be used:

> x <- c(1,4,9,16)
> y <- c("hello","world")

If the numbers to be entered form a regular sequence, then this can be achieved by using the colon operator ‘:’. For instance, to create a vector of the numbers from 1 to 10, we type:

> a <- 1:10

You can also create vectors of strings, as shown in the previous page about the R command line:

> fruit <- c("apple", "pear", "banana", "mandarin", "kiwi fruit", "lemon", "orange", "lime", "pineapple")

Other useful functions for creating vectors are the sequence function, seq(), and the replicate function rep(), as shown here:

> seq(0,0.5,0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5

> rep("A",10)
[1] "A" "A" "A" "A" "A" "A" "A" "A" "A" "A"
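As a rough sketch of some common variations on these functions: the colon operator already returns a vector, so 1:10 and c(1:10) are identical; seq() accepts either a step size or a target length; and rep() can repeat a whole vector or each element in turn. The argument names used below (by, length.out, times, each) are the standard ones, while the variable names are ours:

```r
# The colon operator returns an integer vector; wrapping it in c() changes nothing
same <- identical(1:10, c(1:10))              # TRUE
countdown <- 10:1                             # the colon operator can also count down

# seq() with a step size (by) or with a target length (length.out)
s1 <- seq(from = 0, to = 0.5, by = 0.1)       # six values from 0.0 to 0.5
s2 <- seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat the whole vector (times) or each element in turn (each)
r1 <- rep(c("A", "B"), times = 3)             # "A" "B" "A" "B" "A" "B"
r2 <- rep(c("A", "B"), each = 3)              # "A" "A" "A" "B" "B" "B"

print(s2)
print(r1)
print(r2)
```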

Creating Matrices

Matrices can be created using the matrix() function. The numbers of rows and columns are specified using the nrow and ncol arguments. The values to be entered into the matrix can be specified using the data argument. If a vector of values is supplied, the byrow argument allows you to specify whether to assign values to the matrix by rows (byrow=TRUE) or by columns (byrow=FALSE, the default). If no data argument is supplied, a matrix of the specified size filled with NA (missing) values will be created.

> matrix(c(1,2,3,11,12,13), nrow=2, ncol=3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
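To see what byrow changes, the sketch below builds the same six values into a matrix both ways, and also creates a matrix with no data at all. The variable names are ours:

```r
vals <- c(1, 2, 3, 11, 12, 13)

# Default behaviour: values fill the matrix column by column
by_col <- matrix(vals, nrow = 2, ncol = 3)
# byrow = TRUE: values fill the matrix row by row
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)

print(by_col)   # first row is 1 3 12
print(by_row)   # first row is 1 2 3

# With no data supplied, every cell holds NA
empty <- matrix(nrow = 2, ncol = 3)
print(all(is.na(empty)))   # TRUE
```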

Row and column names can be supplied by using the dimnames argument:

> matrix(c(1,2,3,11,12,13), nrow=2, ncol=3, byrow=TRUE, 
+ dimnames=list(c("row1", "row2"), c("C.1", "C.2", "C.3")))

     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13

In the example above, the second line started with a plus sign - you don't actually type this in. R will display this character automatically to indicate you are continuing the previous command.

Alternatively, you can use the rownames() and colnames() functions to set them:

> A <- matrix(c(1,2,3,11,12,13), nrow=2, ncol=3, byrow=TRUE)
> rownames(A) <- c("row1","row2")
> colnames(A) <- c("C.1", "C.2", "C.3")
> A
     C.1 C.2 C.3
row1   1   2   3
row2  11  12  13

These two functions rownames() and colnames() can also be used to find out the row and column names of a matrix:

> rownames(A)
[1] "row1" "row2"
> colnames(A)
[1] "C.1" "C.2" "C.3"
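Once a matrix has row and column names, they can also be used to pick out values. This is a minimal sketch that rebuilds the matrix A from above; dimnames() returns both sets of names at once:

```r
A <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)
rownames(A) <- c("row1", "row2")
colnames(A) <- c("C.1", "C.2", "C.3")

all_names <- dimnames(A)      # a list: row names first, then column names
one_value <- A["row2", "C.3"] # a single cell selected by name: 13
first_row <- A["row1", ]      # a whole row, returned as a named vector

print(all_names)
print(one_value)
print(first_row)
```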

Additional rows or columns can be attached to the matrix by using the column bind cbind(), or row bind rbind() functions. These can also be used to create a matrix by binding together a number of vectors. Be careful to ensure that the dimensions are compatible:

> x <- c(1,2,3)
> y <- c(4,5,6)
> z <- c(7,8,9)
> cbind(x,y,z)

     x y z
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9

> rbind(x,y,z)

  [,1] [,2] [,3]
x    1    2    3
y    4    5    6
z    7    8    9
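The sketch below shows what compatible dimensions means in practice when extending an existing matrix, and what happens when two matrices do not match. The matrix B and the use of tryCatch() to capture the error are our own illustration:

```r
A <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE)

# A has 2 rows, so a new column needs 2 values
A_wide <- cbind(A, c(4, 14))        # now 2 rows by 4 columns
# A has 3 columns, so a new row needs 3 values
A_tall <- rbind(A, c(21, 22, 23))   # now 3 rows by 3 columns

# Binding matrices whose row counts disagree stops with an error,
# captured here with tryCatch() so that the script keeps running
B <- matrix(1:6, nrow = 3, ncol = 2)
msg <- tryCatch(
  { cbind(A, B); "no error" },
  error = function(e) conditionMessage(e)
)

print(dim(A_wide))
print(dim(A_tall))
print(msg)
```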

Creating Data Frames

Data frames are objects within R that are made up of rows and columns of data (similar to a two-dimensional matrix) and are one of the most useful formats in which to store your data for statistical analysis within R. The columns contain the different variables within the experiment (e.g. age, gender, smoker, etc.), and the rows contain the observations for each sample in the study, that is, the values those variables take (e.g. 20-30/30-40/40-50, male/female, true/false, etc.).

The easiest way to create a data frame is to create a spreadsheet of the data using a program such as Microsoft Excel, then load this into R using the read.table() function. An example Excel spreadsheet has been created for you:

First, you must save the spreadsheet as a text file. Excel allows you to save spreadsheets in a tab delimited text format. Open the file and go to File > Save As to bring up the save screen. Choose the folder to save the file, then select "Text (Tab delimited) (*.txt)" from the Save as type drop down menu, and then click Save. The text file is now ready to be loaded in R.

Anyone who wants to try this and doesn't have Excel can download this file instead:

Load R as normal, and use the read.table() command in the following way (where you should edit "c:\\temp\\" to match where you saved the text file):

> Santa <- read.table("c:\\temp\\IsThereASantaClaus.txt", sep="\t", header=TRUE, stringsAsFactors=TRUE)

Typing in the full path is tedious, so you can use the setwd() command, or the "File", "Change dir..." menu, to change to the directory where your files are. Now you can just do this:

> Santa <- read.table("IsThereASantaClaus.txt", sep="\t", header=TRUE, stringsAsFactors=TRUE)

The filename must be quoted (surrounded by speech marks). On Windows a double backslash (‘\\’) is used in the file path because a single backslash (‘\’) starts an escape sequence in an R character string; forward slashes (‘/’) also work and need no doubling. The sep argument shows such an escape sequence in use: it specifies the separator of the data, in this case a tab, denoted by a ‘t’ preceded by a backslash. The header argument specifies that the first row of the table contains the column names. The stringsAsFactors argument asks R to store text columns as factors (described below); this was the default before R 4.0.0, but later versions read text columns as plain character vectors unless it is set to TRUE.

This is what you get:

> Santa
   Believe Age Gender Presents Behaviour
1    FALSE   9   male       25   naughty
2     TRUE   5   male       20      nice
...
49    TRUE  10   male       57      nice
50   FALSE   4   male        3   naughty
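If you do not have the example file to hand, the same read.table() call can be tried on a small stand-in. The sketch below writes three made-up rows (not taken from the real spreadsheet) to a temporary tab-delimited file and reads them back; the names tmp and toy are ours:

```r
# Write a small, made-up tab-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Believe\tAge\tGender\tPresents\tBehaviour",
  "FALSE\t9\tmale\t25\tnaughty",
  "TRUE\t5\tfemale\t20\tnice",
  "TRUE\t7\tmale\t12\tnice"
), tmp)

# stringsAsFactors = TRUE stores the text columns as factors
# (R 4.0.0 and later would otherwise keep them as character vectors)
toy <- read.table(tmp, sep = "\t", header = TRUE, stringsAsFactors = TRUE)

print(toy)
str(toy)      # shows the type R chose for each column

unlink(tmp)   # delete the temporary file
```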

You can check that the column names have been assigned correctly by using the names() or colnames() function, which will give the names of the columns in the data frame of interest. Spaces in column names are awkward to work with in R, and by default read.table() replaces them with dots, so ensure that all of the column names are space free. If necessary, use a ‘_’ or ‘.’ to separate words (e.g. "Avg.Age" or "Avg_Height").

> names(Santa)
[1] "Believe"   "Age"       "Gender"    "Presents"  "Behaviour"
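To illustrate what happens to column names that contain spaces, the sketch below reads a tiny made-up table from a character string using the text argument of read.table(). By default the names are converted to syntactically valid ones; setting check.names=FALSE keeps them as written, at the cost of having to quote them every time:

```r
# A made-up two-column table whose header contains spaces
raw <- "Avg Age\tAvg Height\n9\t120\n5\t105"

# Default (check.names = TRUE): spaces are replaced with dots
fixed <- read.table(text = raw, sep = "\t", header = TRUE)
print(names(fixed))   # "Avg.Age" "Avg.Height"

# check.names = FALSE keeps the spaces
kept <- read.table(text = raw, sep = "\t", header = TRUE, check.names = FALSE)
print(names(kept))    # "Avg Age" "Avg Height"

# Such columns must then be referred to by a quoted name
ages <- kept[, "Avg Age"]
print(ages)
```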

Data frames are more useful than matrices for storing data, as they can hold a different data type in each column. For instance, three types of data are stored in the data frame created above:

> class(Santa[,"Behaviour"])
[1] "factor"

Checking each column gives:

  • integer – Age and Presents
  • logical – Believe
  • factor – Gender and Behaviour
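Rather than checking the columns one at a time, sapply() can apply class() to every column in a single call. This is a minimal sketch on a small made-up data frame (toy is a hypothetical stand-in, not the real Santa data):

```r
# A small stand-in data frame with the same kinds of column
toy <- data.frame(
  Believe   = c(FALSE, TRUE, TRUE),
  Age       = c(9L, 5L, 7L),
  Gender    = c("male", "female", "male"),
  Behaviour = c("naughty", "nice", "nice"),
  stringsAsFactors = TRUE   # store the text columns as factors
)

# Apply class() to each column in turn; the result is a named character vector
col_classes <- sapply(toy, class)
print(col_classes)
```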

Factors are variables that take one of a number of values (or levels). They are categorical variables. So the variable Gender is a factor with two levels: male and female. The names of the levels are arbitrary, so changing the names of the levels of Gender to, say, 0 and 1, would not affect the outcome of any statistical tests.

The logical variable Believe behaves much like a 2-level factor whose levels are TRUE and FALSE, although R stores it as a logical vector rather than as a factor.

A non-factor variable can be converted to a factor by using the factor() command. If no levels are specified in the arguments, the levels are taken automatically from the sorted unique values of the variable. You can check whether a variable is a factor by using the is.factor() command, which returns TRUE or FALSE. To view the levels of the factor, the command levels() can be used:

> levels(Santa[,"Behaviour"])
[1] "naughty" "nice" 
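The factor() and is.factor() commands can be sketched on a small made-up character vector (sizes is a hypothetical example, not part of the Santa data). Without a levels argument the levels come out in alphabetical order; supplying one fixes the order explicitly:

```r
sizes <- c("small", "large", "medium", "small")
print(is.factor(sizes))    # FALSE: this is still a plain character vector

# Levels worked out automatically, in sorted (alphabetical) order
f_auto <- factor(sizes)
print(levels(f_auto))      # "large" "medium" "small"

# Levels given explicitly, in the order we want
f_set <- factor(sizes, levels = c("small", "medium", "large"))
print(levels(f_set))       # "small" "medium" "large"
print(is.factor(f_set))    # TRUE

# Internally each value is stored as the position of its level
codes <- as.integer(f_set)
print(codes)               # 1 3 2 1
```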

This command can also be used to change the names of the levels of a factor, for example from naughty and nice to bad and good:

> levels(Santa[,"Behaviour"]) <- c("bad","good")
> Santa
   Believe Age Gender Presents Behaviour
1    FALSE   9   male       25       bad
2     TRUE   5   male       20      good
...
49    TRUE  10   male       57      good
50   FALSE   4   male        3       bad
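Assigning a plain vector of new names relies on knowing the order of the existing levels. As a more explicit alternative, levels() also accepts a named list that maps each new name to an old one. The sketch below uses a small made-up factor rather than the real Santa data:

```r
behaviour <- factor(c("naughty", "nice", "nice", "naughty"))
print(table(behaviour))   # counts before renaming

# Map each new level name to the old level it replaces
renamed <- behaviour
levels(renamed) <- list(bad = "naughty", good = "nice")

print(levels(renamed))    # "bad" "good"
print(table(renamed))     # same counts, new labels
```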