> x <- c(1, 3, 2, 5)
> x
[1] 1 3 2 5
Note that the > is not part of the command; rather, it is printed by R to
indicate that it is ready for another command to be entered. We can also
save things using = rather than <- :
> x = c(1, 6, 2)
> x
[1] 1 6 2
> y = c(1, 4, 3)
Hitting the up arrow multiple times will display the previous commands,
which can then be edited. This is useful since one often wishes to repeat
a similar command. In addition, typing ?funcname will always cause R to
open a new help file window with additional information about the function
funcname.
We can tell R to add two sets of numbers together. It will then add the
first number from x to the first number from y , and so on. However, x and
y should be the same length. We can check their length using the length()
function.
> length(x)
[1] 3
> length(y)
[1] 3
> x + y
[1]  2 10  5
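The same-length advice matters because R does not stop when the lengths differ; it recycles the shorter vector instead. The following is a minimal sketch of that behavior, using made-up vectors a and b that are not part of the lab:

```r
# Illustrative vectors (not from the lab): lengths 4 and 2.
a = c(1, 2, 3, 4)
b = c(10, 20)

# b is recycled to c(10, 20, 10, 20) before the element-wise addition.
recycled_sum = a + b
print(recycled_sum)

# When the longer length is not a multiple of the shorter one, R still
# recycles but issues a warning (suppressed here to keep the output clean).
uneven_sum = suppressWarnings(c(1, 2, 3) + c(10, 20))
print(uneven_sum)
```

Because recycling can happen silently, checking length() first is a cheap safeguard.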
The ls() function allows us to look at a list of all of the objects, such
as data and functions, that we have saved so far. The rm() function can be
used to delete any that we don't want.
> ls()
[1] "x" "y"
> rm(x, y)
> ls()
character(0)
It's also possible to remove all objects at once:
> rm(list = ls())
2. Statistical Learning
The matrix() function can be used to create a matrix of numbers. Before
we use the matrix() function, we can learn more about it:
> ?matrix
The help file reveals that the matrix() function takes a number of inputs,
but for now we focus on the first three: the data (the entries in the matrix),
the number of rows, and the number of columns. First, we create a simple
matrix.
> x = matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4
Note that we could just as well omit typing data= , nrow= , and ncol= in the
matrix() command above: that is, we could just type
> x = matrix(c(1, 2, 3, 4), 2, 2)
and this would have the same effect. However, it can sometimes be useful to
specify the names of the arguments passed in, since otherwise R will assume
that the function arguments are passed into the function in the same order
that is given in the function's help file. As this example illustrates, by
default R creates matrices by successively filling in columns. Alternatively,
the byrow=TRUE option can be used to populate the matrix in order of the
rows.
> matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
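The difference between the two fill orders is easier to see with a non-square matrix. The sketch below uses the values 1 to 6 (an illustrative choice, not from the lab) and also checks that filling by row gives the transpose of the column-filled matrix with swapped dimensions:

```r
# Fill the same six values by column (the default) and by row.
by_col = matrix(1:6, nrow = 2, ncol = 3)
by_row = matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(by_col)
print(by_row)

# Filling a 2 x 3 matrix by row matches transposing the 3 x 2 matrix
# that was filled by column.
same = identical(by_row, t(matrix(1:6, nrow = 3, ncol = 2)))
print(same)
```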
Notice that in the above command we did not assign the matrix to a value
such as x. In this case the matrix is printed to the screen but is not saved
for future calculations. The sqrt() function returns the square root of each
element of a vector or matrix. The command x^2 raises each element of x
to the power 2; any powers are possible, including fractional or negative
powers.
> sqrt(x)
     [,1] [,2]
[1,] 1.00 1.73
[2,] 1.41 2.00
> x^2
     [,1] [,2]
[1,]    1    9
[2,]    4   16
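As a brief illustration of fractional and negative powers, the sketch below reuses the 2 x 2 matrix from above. The names half_power and reciprocal are illustrative. Note that ^ always acts element by element, so a power of -1 gives reciprocals rather than the matrix inverse:

```r
x = matrix(c(1, 2, 3, 4), 2, 2)

# A fractional power: x^0.5 agrees with sqrt(x) element by element.
half_power = x^0.5
print(all.equal(half_power, sqrt(x)))

# A negative power: element-wise reciprocals, not the matrix inverse.
reciprocal = x^(-1)
print(reciprocal)
```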
The rnorm() function generates a vector of random normal variables,
with first argument n the sample size. Each time we call this function, we
will get a different answer. Here we create two correlated sets of numbers,
x and y, and use the cor() function to compute the correlation between
them.
2.3 Lab: Introduction to R
> x = rnorm(50)
> y = x + rnorm(50, mean = 50, sd = .1)
> cor(x, y)
[1] 0.995
By default, rnorm() creates standard normal random variables with a mean
of 0 and a standard deviation of 1. However, the mean and standard
deviation can be altered using the mean and sd arguments, as illustrated above.
Sometimes we want our code to reproduce the exact same set of random
numbers; we can use the set.seed() function to do this. The set.seed()
function takes an (arbitrary) integer argument.
> set.seed(1303)
> rnorm(50)
[1] -1.1440  1.3421  2.1854  0.5364  0.0632  0.5022 -0.0004
. . .
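One way to confirm that set.seed() makes results repeatable is to draw twice with the same seed and compare. This is a small sketch; the variable names and the sample size of 5 are illustrative choices:

```r
# Same seed, same draws.
set.seed(1303)
first_draw = rnorm(5)
set.seed(1303)
second_draw = rnorm(5)
print(identical(first_draw, second_draw))

# Without resetting the seed, the next call continues the random stream
# and therefore returns different numbers.
third_draw = rnorm(5)
print(identical(first_draw, third_draw))
```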
We use set.seed() throughout the labs whenever we perform calculations
involving random quantities. In general this should allow the user to
reproduce our results. However, it should be noted that as new versions of
R become available it is possible that some small discrepancies may form
between the book and the output from R.
The mean() and var() functions can be used to compute the mean and
variance of a vector of numbers. Applying sqrt() to the output of var()
will give the standard deviation. Or we can simply use the sd() function.
> set.seed(3)
> y = rnorm(100)
> mean(y)
[1] 0.0110
> var(y)
[1] 0.7329
> sqrt(var(y))
[1] 0.8561
> sd(y)
[1] 0.8561
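To make the relationship between these functions concrete, the sketch below recomputes the variance by hand. var() uses the sample variance, which divides by n - 1 rather than n; the name manual_var is illustrative:

```r
set.seed(3)
y = rnorm(100)
n = length(y)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
manual_var = sum((y - mean(y))^2) / (n - 1)

print(all.equal(manual_var, var(y)))
print(all.equal(sqrt(manual_var), sd(y)))
```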
2.3.2 Graphics
The plot() function is the primary way to plot data in R. For instance,
plot(x,y) produces a scatterplot of the numbers in x versus the numbers
in y. There are many additional options that can be passed in to the plot()
function. For example, passing in the argument xlab will result in a label
on the x-axis. To find out more information about the plot() function,
type ?plot.
> x = rnorm(100)
> y = rnorm(100)
> plot(x, y)
> plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis",
       main = "Plot of X vs Y")
We will often want to save the output of an R plot. The command that we
use to do this will depend on the file type that we would like to create. For
instance, to create a pdf, we use the pdf() function, and to create a jpeg,
we use the jpeg() function.
> pdf("Figure.pdf")
> plot(x, y, col = "green")
> dev.off()
null device
          1
The function dev.off() indicates to R that we are done creating the plot.
Alternatively, we can simply copy the plot window and paste it into an
appropriate file type, such as a Word document.
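The open, draw, close sequence can be sketched end to end as follows. The output path is an assumption for illustration: a temporary file from tempfile(), so that nothing is written to the working directory. The jpeg() device is used in the same way, with jpeg() in place of pdf().

```r
# Hypothetical output location: a temporary file.
out_file = tempfile(fileext = ".pdf")

set.seed(1)
x = rnorm(100)
y = rnorm(100)

pdf(out_file)                 # open the PDF graphics device
plot(x, y, col = "green")     # drawn into the file, not on screen
invisible(dev.off())          # close the device so the file is written

print(file.exists(out_file))
```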
The function seq() can be used to create a sequence of numbers. For
instance, seq(a,b) makes a vector of integers between a and b. There are
many other options: for instance, seq(0,1,length=10) makes a sequence of
10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a
shorthand for seq(3,11) for integer arguments.
> x = seq(1, 10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x = 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x = seq(-pi, pi, length = 50)
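The options mentioned above can be tried on a small scale. In this sketch, length.out is the full name of the argument that the lab abbreviates as length, and the by argument (an alternative not used in the lab) fixes the step size instead of the number of points:

```r
# Ten equally spaced values from 0 to 1 (step 1/9).
ten_points = seq(0, 1, length.out = 10)
print(ten_points)

# Alternative: fix the step size rather than the number of points.
by_step = seq(0, 1, by = 0.25)
print(by_step)

# The colon shorthand gives the same values as seq() for integer endpoints.
print(all(3:11 == seq(3, 11)))
```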
We will now create some more sophisticated plots. The contour() function
produces a contour plot in order to represent three-dimensional data;
it is like a topographical map. It takes three arguments:
1. A vector of the x values (the first dimension),
2. A vector of the y values (the second dimension), and
3. A matrix whose elements correspond to the z value (the third
dimension) for each pair of (x, y) coordinates.
As with the plot() function, there are many other inputs that can be used
to fine-tune the output of the contour() function. To learn more about
these, take a look at the help file by typing ?contour.
> y = x
> f = outer(x, y, function(x, y) cos(y) / (1 + x^2))
> contour(x, y, f)
> contour(x, y, f, nlevels = 45, add = T)
> fa = (f - t(f)) / 2
> contour(x, y, fa, nlevels = 15)
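The outer() call is what builds the z matrix that contour() expects. The sketch below uses deliberately tiny grids (xs and ys are illustrative names, not from the lab) to show that entry [i, j] is the function evaluated at the i-th x value and the j-th y value, and that (f - t(f)) / 2 yields an antisymmetric matrix:

```r
# Small illustrative grids (the lab uses 50 points on [-pi, pi]).
xs = c(0, 1, 2)
ys = c(0, pi)
f_small = outer(xs, ys, function(x, y) cos(y) / (1 + x^2))

# Rows follow xs and columns follow ys.
print(dim(f_small))
# Entry [2, 1] is cos(0) / (1 + 1^2).
print(f_small[2, 1])

# On a square grid, (f - t(f)) / 2 is antisymmetric: it equals minus its transpose.
sq = outer(xs, xs, function(x, y) cos(y) / (1 + x^2))
fa_small = (sq - t(sq)) / 2
print(all.equal(fa_small, -t(fa_small)))
```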
The image() function works the same way as contour(), except that it
produces a color-coded plot whose colors depend on the z value. This is
known as a heatmap, and is sometimes used to plot temperature in weather
forecasts. Alternatively, persp() can be used to produce a three-dimensional
plot. The arguments theta and phi control the angles at which the plot is
viewed.
> image(x, y, fa)
> persp(x, y, fa)
> persp(x, y, fa, theta = 30)
> persp(x, y, fa, theta = 30, phi = 20)
> persp(x, y, fa, theta = 30, phi = 70)
> persp(x, y, fa, theta = 30, phi = 40)
2.3.3 Indexing Data
We often wish to examine part of a set of data. Suppose that our data is
stored in the matrix A.
> A = matrix(1:16, 4, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
Then, typing
> A[2, 3]
[1] 10
will select the element corresponding to the second row and the third
column. The first number after the open-bracket symbol [ always refers to
the row, and the second number always refers to the column. We can also
select multiple rows and columns at a time, by providing vectors as the
indices.
> A[c(1, 3), c(2, 4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
> A[1:3, 2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15
> A[1:2, ]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
> A[, 1:2]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
The last two examples include either no index for the columns or no index
for the rows. These indicate that R should include all columns or all rows,
respectively. R treats a single row or column of a matrix as a vector.
> A[1, ]
[1]  1  5  9 13
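Because a single row or column is returned as a plain vector, its matrix dimensions are lost. If a one-row matrix is wanted instead, the drop = FALSE argument of the [ operator keeps them. This argument is not covered in the lab, and the names below are illustrative:

```r
A = matrix(1:16, 4, 4)

row_vec = A[1, ]                  # dimensions dropped: a plain vector
row_mat = A[1, , drop = FALSE]    # stays a 1 x 4 matrix

print(dim(row_vec))               # a vector has no dim attribute
print(dim(row_mat))
```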
The use of a negative sign - in the index tells R to keep all rows or columns
except those indicated in the index.
> A[-c(1, 3), ]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16
> A[-c(1, 3), -c(1, 3, 4)]
[1] 6 8
The dim() function outputs the number of rows followed by the number of
columns of a given matrix.
> dim(A)
[1] 4 4
2.3.4 Loading Data
For most analyses, the first step involves importing a data set into R. The
read.table() function is one of the primary ways to do this. The help file
contains details about how to use this function. We can use the function
write.table() to export data.
Before attempting to load a data set, we must make sure that R knows
to search for the data in the proper directory. For example, on a Windows
system one could select the directory using the Change dir... option under
the File menu. However, the details of how to do this depend on the
operating system (e.g. Windows, Mac, Unix) that is being used, and so we
do not give further details here. We begin by loading in the Auto data set.
This data is part of the ISLR library (we discuss libraries in Chapter 3) but
to illustrate the read.table() function we load it now from a text file. The
following command will load the Auto.data file into R and store it as an
object called Auto, in a format referred to as a data frame. (The text file
can be obtained from this book's website.) Once the data has been loaded,
the fix() function can be used to view it in a spreadsheet-like window.
However, the window must be closed before further R commands can be
entered.
> Auto = read.table("Auto.data")
> fix(Auto)
Note that Auto.data is simply a text file, which you could alternatively
open on your computer using a standard text editor. It is often a good idea
to view a data set using a text editor or other software such as Excel before
loading it into R.
This particular data set has not been loaded correctly, because R has
assumed that the variable names are part of the data and so has included
them in the first row. The data set also includes a number of missing
observations, indicated by a question mark ? . Missing values are a common
occurrence in real data sets. Using the option header=T (or header=TRUE ) in
the read.table() function tells R that the first line of the file contains the
variable names, and using the option na.strings tells R that any time it
sees a particular character or set of characters (such as a question mark),
it should be treated as a missing element of the data matrix.
> Auto = read.table("Auto.data", header = T, na.strings = "?")
> fix(Auto)
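The effect of header and na.strings can be checked without the Auto.data file. The sketch below writes a made-up three-row file to a temporary location (the column names and values are invented for illustration) and reads it both ways:

```r
# Hypothetical file contents; "?" marks a missing value.
tmp = tempfile(fileext = ".data")
writeLines(c("mpg horsepower name",
             "18.0 130 car_a",
             "25.0 ? car_b",
             "15.0 165 car_c"), tmp)

# Default call: the header line is read as an ordinary data row.
raw = read.table(tmp)

# With header and na.strings: names are used and "?" becomes NA.
clean = read.table(tmp, header = TRUE, na.strings = "?")

print(dim(raw))
print(dim(clean))
print(is.na(clean$horsepower))
```

In the default call every column is read as text, because the variable names sit in the first data row; in the second call horsepower is numeric with one NA.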
Excel is a common-format data storage program. An easy way to load such
data into R is to save it as a csv (comma separated value) file and then use
the read.csv() function to load it in.
> Auto = read.csv("Auto.csv", header = T, na.strings = "?")
> fix(Auto)
> dim(Auto)
[1] 397   9
> Auto[1:4, ]
The dim() function tells us that the data has 397 observations, or rows, and
nine variables, or columns. There are various ways to deal with the missing
data. In this case, only five of the rows contain missing observations, and
so we choose to use the na.omit() function to simply remove these rows.
> Auto = na.omit(Auto)
> dim(Auto)
[1] 392   9
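Before dropping rows it can be useful to count how many are affected. The sketch below does this on a made-up data frame, using complete.cases(), which is not used in the lab:

```r
# Made-up data frame with a missing value in two different rows.
df = data.frame(mpg = c(18, NA, 15, 30),
                horsepower = c(130, 95, NA, 70))

# complete.cases() is TRUE for rows with no missing values.
rows_with_na = sum(!complete.cases(df))
print(rows_with_na)

# na.omit() removes every row that contains at least one NA.
df_complete = na.omit(df)
print(dim(df_complete))
```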
Once the data are loaded correctly, we can use names() to check the
variable names.
> names(Auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"
[5] "weight"       "acceleration" "year"         "origin"
[9] "name"
2.3.5 Additional Graphical and Numerical Summaries
We can use the plot() function to produce scatterplots of the quantitative
variables. However, simply typing the variable names will produce an error
message, because R does not know to look in the Auto data set for those
variables.
> plot(cylinders, mpg)
Error in plot(cylinders, mpg) : object 'cylinders' not found
To refer to a variable, we must type the data set and the variable name
joined with a $ symbol. Alternatively, we can use the attach() function in
order to tell R to make the variables in this data frame available by name.
> plot(Auto$cylinders, Auto$mpg)
> attach(Auto)
> plot(cylinders, mpg)
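A third option, not used in the lab, is with(), which evaluates a single expression with the data frame's columns visible by name and changes nothing afterwards. A minimal sketch on a made-up data frame (cars_df and its values are illustrative):

```r
# Made-up stand-in for a data frame such as Auto.
cars_df = data.frame(cylinders = c(4, 6, 8, 4),
                     mpg = c(30, 22, 15, 28))

# Explicit reference with $ ...
mean_with_dollar = mean(cars_df$mpg)

# ... and the same computation with column names resolved by with().
mean_with_with = with(cars_df, mean(mpg))

print(mean_with_dollar == mean_with_with)
```

The same pattern works for plotting, for example with(cars_df, plot(cylinders, mpg)).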
The cylinders variable is stored as a numeric vector, so R has treated it
as quantitative. However, since there are only a small number of possible
values for cylinders, one may prefer to treat it as a qualitative variable.
The as.factor() function converts quantitative variables into qualitative
variables.
> cylinders = as.factor(cylinders)
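What the conversion does can be seen on a small made-up vector (the values below are illustrative, not taken from Auto). Note also that, after attach(), an assignment like the one above creates a new object in the workspace rather than changing the column inside the data frame.

```r
cyl_num = c(4, 8, 6, 4, 4, 8)     # made-up numeric values
cyl_fac = as.factor(cyl_num)

# The distinct values become category labels, stored as text.
print(levels(cyl_fac))

# table() counts the observations in each category.
print(table(cyl_fac))

# A factor is no longer treated as a numeric vector.
print(is.numeric(cyl_fac))
```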
If the variable plotted on the x-axis is categorical, then boxplots will
automatically be produced by the plot() function. As usual, a number
of options can be specified in order to customize the plots.
> plot(cylinders, mpg)
> plot(cylinders, mpg, col = "red")
> plot(cylinders, mpg, col = "red", varwidth = T)
> plot(cylinders, mpg, col = "red", varwidth = T, horizontal = T)
> plot(cylinders, mpg, col = "red", varwidth = T, xlab = "cylinders",
       ylab = "MPG")
The hist() function can be used to plot a histogram. Note that col=2
has the same effect as col="red".
> hist(mpg)
> hist(mpg, col = 2)
> hist(mpg, col = 2, breaks = 15)
The pairs() function creates a scatterplot matrix, i.e. a scatterplot for every
pair of variables for any given data set. We can also produce scatterplots
for just a subset of the variables.
> pairs(Auto)
> pairs(~ mpg + displacement + horsepower + weight +
        acceleration, Auto)
In conjunction with the plot() function, identify() provides a useful
interactive method for identifying the value for a particular variable for
points on a plot. We pass in three arguments to identify(): the x-axis
variable, the y-axis variable, and the variable whose values we would like
to see printed for each point. Then clicking on a given point in the plot
will cause R to print the value of the variable of interest. Right-clicking on
the plot will exit the identify() function (control-click on a Mac). The
numbers printed under the identify() function correspond to the rows for
the selected points.
> plot(horsepower, mpg)
> identify(horsepower, mpg, name)
The summary() function produces a numerical summary of each variable in
a particular data set.
> summary(Auto)
      mpg          cylinders      displacement
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0
 Median :22.75   Median :4.000   Median :151.0
 Mean   :23.45   Mean   :5.472   Mean   :194.4
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8
 Max.   :46.60   Max.   :8.000   Max.   :455.0

   horsepower        weight      acceleration
 Min.   : 46.0   Min.   :1613   Min.   : 8.00
 1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78
 Median : 93.5   Median :2804   Median :15.50
 Mean   :104.5   Mean   :2978   Mean   :15.54
 3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02
 Max.   :230.0   Max.   :5140   Max.   :24.80

      year           origin                      name
 Min.   :70.00   Min.   :1.000   amc matador       :  5
 1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5
 Median :76.00   Median :1.000   toyota corolla    :  5
 Mean   :75.98   Mean   :1.577   amc gremlin       :  4
 3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4
 Max.   :82.00   Max.   :3.000   chevrolet chevette:  4
                                 (Other)           :365
For qualitative variables such as name, R will list the number of observations
that fall in each category. We can also produce a summary of just a single
variable.
> summary(mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   17.00   22.75   23.45   29.00   46.60
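The two kinds of summary can be compared directly on small made-up vectors (the values and names below are illustrative). One caveat, stated as background rather than as part of the lab: since R 4.0.0, read.table() and read.csv() keep text columns as character vectors by default, so a column such as name only gets per-category counts if it is first converted with as.factor().

```r
v_num = c(9, 17, 22.75, 29, 46.6)               # made-up numeric values
v_fac = as.factor(c("a", "b", "a", "c", "a"))   # made-up categories

# Numeric input: minimum, quartiles, median, mean, and maximum.
num_summary = summary(v_num)
print(num_summary)

# Factor input: number of observations in each category.
fac_summary = summary(v_fac)
print(fac_summary)
```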
Once we have finished using R, we type q() in order to shut it down, or
quit. When exiting R, we have the option to save the current workspace so
that all objects (such as data sets) that we have created in this R session
will be available next time. Before exiting R, we may want to save a record
of all of the commands that we typed in the most recent session; this can
be accomplished using the savehistory() function. Next time