mirror of
https://asciireactor.com/otho/cs-5821.git
synced 2024-12-18 07:05:05 +00:00
611 lines
16 KiB
Plaintext
> x <- c(1, 3, 2, 5)
> x
[1] 1 3 2 5

Note that the > is not part of the command; rather, it is printed by R to
indicate that it is ready for another command to be entered. We can also
save things using = rather than <-:

> x = c(1, 6, 2)
> x
[1] 1 6 2
> y = c(1, 4, 3)

Hitting the up arrow multiple times will display the previous commands,
which can then be edited. This is useful since one often wishes to repeat
a similar command. In addition, typing ?funcname will always cause R to
open a new help file window with additional information about the function
funcname.

We can tell R to add two sets of numbers together. It will then add the
first number from x to the first number from y, and so on. However, x and
y should be the same length. We can check their length using the length()
function.

> length(x)
[1] 3
> length(y)
[1] 3
> x + y
[1]  2 10  5

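The element-wise addition described above can be sketched as follows. The length guard and the recycling example are illustrative additions, not part of the lab; they show why equal lengths matter, since R reuses the shorter vector instead of stopping with an error.

```r
# Element-wise addition of two equal-length vectors.
x <- c(1, 6, 2)
y <- c(1, 4, 3)

# Guard against accidental recycling before adding.
stopifnot(length(x) == length(y))
z <- x + y
print(z)

# With unequal lengths R recycles the shorter vector:
# c(10, 20) is reused as c(10, 20, 10, 20).
w <- c(1, 2, 3, 4) + c(10, 20)
print(w)
```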
The ls() function allows us to look at a list of all of the objects, such
as data and functions, that we have saved so far. The rm() function can be
used to delete any that we don’t want.

> ls()
[1] "x" "y"
> rm(x, y)
> ls()
character(0)

It’s also possible to remove all objects at once:

> rm(list = ls())

2. Statistical Learning

The matrix() function can be used to create a matrix of numbers. Before
we use the matrix() function, we can learn more about it:

> ?matrix

The help file reveals that the matrix() function takes a number of inputs,
but for now we focus on the first three: the data (the entries in the matrix),
the number of rows, and the number of columns. First, we create a simple
matrix.

> x = matrix(data = c(1, 2, 3, 4), nrow = 2, ncol = 2)
> x
     [,1] [,2]
[1,]    1    3
[2,]    2    4

Note that we could just as well omit typing data=, nrow=, and ncol= in the
matrix() command above: that is, we could just type

> x = matrix(c(1, 2, 3, 4), 2, 2)

and this would have the same effect. However, it can sometimes be useful to
specify the names of the arguments passed in, since otherwise R will assume
that the function arguments are passed into the function in the same order
that is given in the function’s help file. As this example illustrates, by
default R creates matrices by successively filling in columns. Alternatively,
the byrow=TRUE option can be used to populate the matrix in order of the
rows.

> matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

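As a small sketch of the two fill orders just described, the block below builds the same four numbers both ways. The names by_col and by_row are chosen here for illustration only.

```r
# Column-major fill (the default) versus row-wise fill.
by_col <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
by_row <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2, byrow = TRUE)

print(by_col[1, 2])   # second value placed goes down column 1, so this is 3
print(by_row[1, 2])   # second value placed goes along row 1, so this is 2

# For this square example, filling by row gives the transpose of
# filling by column.
print(identical(by_row, t(by_col)))
```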
Notice that in the above command we did not assign the matrix to a variable
such as x. In this case the matrix is printed to the screen but is not saved
for future calculations. The sqrt() function returns the square root of each
element of a vector or matrix. The command x^2 raises each element of x
to the power 2; any powers are possible, including fractional or negative
powers.

> sqrt(x)
     [,1] [,2]
[1,] 1.00 1.73
[2,] 1.41 2.00
> x^2
     [,1] [,2]
[1,]    1    9
[2,]    4   16

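To illustrate the remark about fractional and negative powers, here is a minimal sketch using the same 2 x 2 matrix. The variable names are illustrative. Note that ^ works element by element, so x^-1 is not the matrix inverse.

```r
x <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)

half_power <- x^0.5   # fractional power: same as sqrt(x), element by element
reciprocal <- x^-1    # negative power: element-wise 1/x, not a matrix inverse

print(all.equal(half_power, sqrt(x)))
print(reciprocal)
```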
The rnorm() function generates a vector of random normal variables,
with first argument n the sample size. Each time we call this function, we
will get a different answer. Here we create two correlated sets of numbers,
x and y, and use the cor() function to compute the correlation between
them.

2.3 Lab: Introduction to R

> x = rnorm(50)
> y = x + rnorm(50, mean = 50, sd = .1)
> cor(x, y)
[1] 0.995

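One way to see why x and y above are so strongly correlated is to separate the noise into a constant shift and a small random part. The sketch below assumes an arbitrary seed and the illustrative name noise; it checks that adding a constant such as 50 leaves the correlation unchanged, so only the small standard deviation of the noise matters.

```r
set.seed(1)                    # arbitrary seed, chosen for this sketch
x <- rnorm(50)
noise <- rnorm(50, sd = 0.1)   # small noise around zero
y <- x + 50 + noise            # same construction as mean = 50, sd = .1

# Correlation ignores constant shifts, so the +50 has no effect.
print(all.equal(cor(x, y), cor(x, x + noise)))
print(cor(x, y))
```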
By default, rnorm() creates standard normal random variables with a mean
of 0 and a standard deviation of 1. However, the mean and standard
deviation can be altered using the mean and sd arguments, as illustrated
above. Sometimes we want our code to reproduce the exact same set of random
numbers; we can use the set.seed() function to do this. The set.seed()
function takes an (arbitrary) integer argument.

> set.seed(1303)
> rnorm(50)
[1] -1.1440  1.3421  2.1854  0.5364  0.0632  0.5022 -0.0004
. . .

We use set.seed() throughout the labs whenever we perform calculations
involving random quantities. In general this should allow the user to
reproduce our results. However, it should be noted that as new versions of
R become available it is possible that some small discrepancies may arise
between the book and the output from R.

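A minimal sketch of what reproducibility means here: resetting the same seed replays the same random draws within one R installation. The variable names are illustrative.

```r
set.seed(1303)
first_draw <- rnorm(5)

set.seed(1303)             # reset to the same seed
second_draw <- rnorm(5)

# The two draws are the same numbers, in the same order.
print(identical(first_draw, second_draw))   # TRUE

# Without resetting the seed, the generator simply continues its stream.
third_draw <- rnorm(5)
print(third_draw)
```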
The mean() and var() functions can be used to compute the mean and
variance of a vector of numbers. Applying sqrt() to the output of var()
will give the standard deviation. Or we can simply use the sd() function.

> set.seed(3)
> y = rnorm(100)
> mean(y)
[1] 0.0110
> var(y)
[1] 0.7329
> sqrt(var(y))
[1] 0.8561
> sd(y)
[1] 0.8561

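The relationship between var() and sd() can be checked by hand. The sketch below computes the sample variance with its n - 1 denominator; manual_var is a name introduced here for illustration.

```r
set.seed(3)
y <- rnorm(100)
n <- length(y)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((y - mean(y))^2) / (n - 1)

print(all.equal(manual_var, var(y)))
print(all.equal(sqrt(manual_var), sd(y)))
```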
2.3.2 Graphics

The plot() function is the primary way to plot data in R. For instance,
plot(x,y) produces a scatterplot of the numbers in x versus the numbers
in y. There are many additional options that can be passed in to the plot()
function. For example, passing in the argument xlab will result in a label
on the x-axis. To find out more information about the plot() function,
type ?plot.

> x = rnorm(100)
> y = rnorm(100)
> plot(x, y)
> plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis",
    main = "Plot of X vs Y")

We will often want to save the output of an R plot. The command that we
use to do this will depend on the file type that we would like to create. For
instance, to create a pdf, we use the pdf() function, and to create a jpeg,
we use the jpeg() function.

> pdf("Figure.pdf")
> plot(x, y, col = "green")
> dev.off()
null device
          1

The function dev.off() indicates to R that we are done creating the plot.
Alternatively, we can simply copy the plot window and paste it into an
appropriate file type, such as a Word document.

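The open-draw-close pattern for saving a plot can be sketched as below. Writing to a temporary file via tempfile() is an assumption made so the example leaves no file behind; the lab itself writes to Figure.pdf in the working directory.

```r
out_file <- tempfile(fileext = ".pdf")   # hypothetical output path

x <- rnorm(100)
y <- rnorm(100)

pdf(out_file)                # open the PDF graphics device
plot(x, y, col = "green")    # drawing goes to the file, not the screen
invisible(dev.off())         # close the device so the file is written out

print(file.exists(out_file))
```

Until dev.off() is called the file may be incomplete, which is why the device is closed before checking for it.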
The function seq() can be used to create a sequence of numbers. For
instance, seq(a,b) makes a vector of integers between a and b. There are
many other options: for instance, seq(0,1,length=10) makes a sequence of
10 numbers that are equally spaced between 0 and 1. Typing 3:11 is a
shorthand for seq(3,11) for integer arguments.

> x = seq(1, 10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x = 1:10
> x
 [1]  1  2  3  4  5  6  7  8  9 10
> x = seq(-pi, pi, length = 50)

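The following sketch checks the claims about seq() directly. One detail worth stating: the full argument name is length.out, and length works because R matches argument names partially. The variable names are illustrative.

```r
ints <- seq(3, 11)
print(all(ints == 3:11))      # same values as the 3:11 shorthand

# length = 10 is matched to the full argument name length.out.
grid_partial <- seq(0, 1, length = 10)
grid_full <- seq(0, 1, length.out = 10)
print(identical(grid_partial, grid_full))

# Ten equally spaced points on [0, 1] are 1/9 apart.
print(length(grid_full))
print(diff(grid_full))
```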
We will now create some more sophisticated plots. The contour() function
produces a contour plot in order to represent three-dimensional data;
it is like a topographical map. It takes three arguments:

1. A vector of the x values (the first dimension),
2. A vector of the y values (the second dimension), and
3. A matrix whose elements correspond to the z value (the third
   dimension) for each pair of (x, y) coordinates.

As with the plot() function, there are many other inputs that can be used
to fine-tune the output of the contour() function. To learn more about
these, take a look at the help file by typing ?contour.

> y = x
> f = outer(x, y, function(x, y) cos(y) / (1 + x^2))
> contour(x, y, f)
> contour(x, y, f, nlevels = 45, add = T)
> fa = (f - t(f)) / 2
> contour(x, y, fa, nlevels = 15)

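The matrix passed to contour() is built with outer(), which evaluates a function at every (x, y) pair. The sketch below, with arbitrarily chosen indices 3 and 7, confirms how the entries are laid out and what the fa construction does.

```r
x <- seq(-pi, pi, length.out = 50)
y <- x
f <- outer(x, y, function(x, y) cos(y) / (1 + x^2))

# One row per x value and one column per y value.
print(dim(f))

# f[i, j] is the function evaluated at (x[i], y[j]).
print(all.equal(f[3, 7], cos(y[7]) / (1 + x[3]^2)))

# fa is the antisymmetric part of f: transposing it flips the sign.
fa <- (f - t(f)) / 2
print(all.equal(t(fa), -fa))
```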
The image() function works the same way as contour(), except that it
produces a color-coded plot whose colors depend on the z value. This is
known as a heatmap, and is sometimes used to plot temperature in weather
forecasts. Alternatively, persp() can be used to produce a three-dimensional
plot. The arguments theta and phi control the angles at which the plot is
viewed.

> image(x, y, fa)
> persp(x, y, fa)
> persp(x, y, fa, theta = 30)
> persp(x, y, fa, theta = 30, phi = 20)
> persp(x, y, fa, theta = 30, phi = 70)
> persp(x, y, fa, theta = 30, phi = 40)

2.3.3 Indexing Data

We often wish to examine part of a set of data. Suppose that our data is
stored in the matrix A.

> A = matrix(1:16, 4, 4)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

Then, typing

> A[2, 3]
[1] 10

will select the element corresponding to the second row and the third
column. The first number after the open-bracket symbol [ always refers to
the row, and the second number always refers to the column. We can also
select multiple rows and columns at a time, by providing vectors as the
indices.

> A[c(1, 3), c(2, 4)]
     [,1] [,2]
[1,]    5   13
[2,]    7   15
> A[1:3, 2:4]
     [,1] [,2] [,3]
[1,]    5    9   13
[2,]    6   10   14
[3,]    7   11   15
> A[1:2, ]
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
> A[, 1:2]
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8

The last two examples include either no index for the columns or no index
for the rows. These indicate that R should include all columns or all rows,
respectively. R treats a single row or column of a matrix as a vector.

> A[1, ]
[1]  1  5  9 13

The use of a negative sign - in the index tells R to keep all rows or columns
except those indicated in the index.

> A[-c(1, 3), ]
     [,1] [,2] [,3] [,4]
[1,]    2    6   10   14
[2,]    4    8   12   16
> A[-c(1, 3), -c(1, 3, 4)]
[1] 6 8

The dim() function outputs the number of rows followed by the number of
columns of a given matrix.

> dim(A)
[1] 4 4

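The indexing rules above can be checked with a short sketch. The drop = FALSE argument is not mentioned in the lab; it is included here as a standard way to keep a single row as a matrix when the vector behavior is unwanted.

```r
A <- matrix(1:16, nrow = 4, ncol = 4)

row_vec <- A[1, ]                 # a plain vector: dimensions are dropped
row_mat <- A[1, , drop = FALSE]   # stays a 1 x 4 matrix

print(dim(row_vec))   # NULL, because a vector has no dim attribute
print(dim(row_mat))

# Excluding rows 1 and 3 is the same as keeping rows 2 and 4.
kept <- A[-c(1, 3), ]
print(identical(kept, A[c(2, 4), ]))
```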
2.3.4 Loading Data

For most analyses, the first step involves importing a data set into R. The
read.table() function is one of the primary ways to do this. The help file
contains details about how to use this function. We can use the function
write.table() to export data.

Before attempting to load a data set, we must make sure that R knows
to search for the data in the proper directory. For example, on a Windows
system one could select the directory using the Change dir... option under
the File menu. However, the details of how to do this depend on the
operating system (e.g. Windows, Mac, Unix) that is being used, and so we
do not give further details here. We begin by loading in the Auto data set.
This data is part of the ISLR library (we discuss libraries in Chapter 3) but
to illustrate the read.table() function we load it now from a text file. The
following command will load the Auto.data file into R and store it as an
object called Auto, in a format referred to as a data frame. (The text file
can be obtained from this book’s website.) Once the data has been loaded,
the fix() function can be used to view it in a spreadsheet-like window.
However, the window must be closed before further R commands can be
entered.

> Auto = read.table("Auto.data")
> fix(Auto)

Note that Auto.data is simply a text file, which you could alternatively
open on your computer using a standard text editor. It is often a good idea
to view a data set using a text editor or other software such as Excel before
loading it into R.

This particular data set has not been loaded correctly, because R has
assumed that the variable names are part of the data and so has included
them in the first row. The data set also includes a number of missing
observations, indicated by a question mark ?. Missing values are a common
occurrence in real data sets. Using the option header=T (or header=TRUE) in
the read.table() function tells R that the first line of the file contains the
variable names, and using the option na.strings tells R that any time it
sees a particular character or set of characters (such as a question mark),
it should be treated as a missing element of the data matrix.

> Auto = read.table("Auto.data", header = T, na.strings = "?")
> fix(Auto)

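The effect of header and na.strings can be seen on a tiny stand-in file. The file contents below are made up for illustration and are not the real Auto.data; the temporary path is likewise an assumption of this sketch.

```r
tmp <- tempfile(fileext = ".data")   # hypothetical stand-in for Auto.data
writeLines(c("mpg cylinders horsepower",
             "18 8 130",
             "25 4 ?",
             "15 8 165"), tmp)

# Without header = TRUE the variable names are read as a row of data.
raw <- read.table(tmp)

# With header = TRUE and na.strings = "?" the names become column names
# and the question mark becomes NA.
clean <- read.table(tmp, header = TRUE, na.strings = "?")

print(dim(raw))
print(dim(clean))
print(is.na(clean$horsepower))
```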
Excel is a common-format data storage program. An easy way to load such
data into R is to save it as a csv (comma separated value) file and then use
the read.csv() function to load it in.

> Auto = read.csv("Auto.csv", header = T, na.strings = "?")
> fix(Auto)
> dim(Auto)
[1] 397   9
> Auto[1:4, ]

The dim() function tells us that the data has 397 observations, or rows, and
nine variables, or columns. There are various ways to deal with the missing
data. In this case, only five of the rows contain missing observations, and
so we choose to use the na.omit() function to simply remove these rows.

> Auto = na.omit(Auto)
> dim(Auto)
[1] 392   9

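Before dropping rows it can help to count how many would be lost. The sketch below uses a made-up three-row data frame and the base function complete.cases(), which the lab does not mention, alongside na.omit().

```r
# Hypothetical data with one missing horsepower value.
df <- data.frame(mpg = c(18, 25, 15), horsepower = c(130, NA, 165))

# Count rows that contain at least one missing value.
n_incomplete <- sum(!complete.cases(df))
print(n_incomplete)

# na.omit() removes exactly those rows.
df_complete <- na.omit(df)
print(dim(df_complete))
```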
Once the data are loaded correctly, we can use names() to check the
variable names.

> names(Auto)
[1] "mpg"          "cylinders"    "displacement" "horsepower"
[5] "weight"       "acceleration" "year"         "origin"
[9] "name"

2.3.5 Additional Graphical and Numerical Summaries

We can use the plot() function to produce scatterplots of the quantitative
variables. However, simply typing the variable names will produce an error
message, because R does not know to look in the Auto data set for those
variables.

> plot(cylinders, mpg)
Error in plot(cylinders, mpg) : object 'cylinders' not found

To refer to a variable, we must type the data set and the variable name
joined with a $ symbol. Alternatively, we can use the attach() function in
order to tell R to make the variables in this data frame available by name.

> plot(Auto$cylinders, Auto$mpg)
> attach(Auto)
> plot(cylinders, mpg)

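The two ways of reaching a variable can be compared on a small made-up data frame. The with() function shown here is not used in the lab; it is a base R alternative that evaluates one expression inside a data frame without attaching it, which avoids leaving attached names behind.

```r
# Hypothetical stand-in for the Auto data frame.
toy_auto <- data.frame(cylinders = c(8, 4, 8), mpg = c(18, 25, 15))

# Explicit reference with the $ symbol.
mean_mpg_dollar <- mean(toy_auto$mpg)

# Scoped lookup: mpg is found inside toy_auto for this one expression.
mean_mpg_with <- with(toy_auto, mean(mpg))

print(mean_mpg_dollar)
print(all.equal(mean_mpg_dollar, mean_mpg_with))
```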
The cylinders variable is stored as a numeric vector, so R has treated it
as quantitative. However, since there are only a small number of possible
values for cylinders, one may prefer to treat it as a qualitative variable.
The as.factor() function converts quantitative variables into qualitative
variables.

> cylinders = as.factor(cylinders)

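What the conversion does can be inspected without plotting. The values below are made up for illustration; the point is that a factor stores a fixed set of levels rather than numbers.

```r
cyl_num <- c(8, 4, 6, 4, 8, 4)   # hypothetical cylinder counts
cyl_fac <- as.factor(cyl_num)

print(is.numeric(cyl_num))
print(is.factor(cyl_fac))

# The distinct values become the levels, stored as text labels.
print(levels(cyl_fac))

# table() counts the observations in each level.
print(table(cyl_fac))
```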
If the variable plotted on the x-axis is categorical, then boxplots will
automatically be produced by the plot() function. As usual, a number
of options can be specified in order to customize the plots.

> plot(cylinders, mpg)
> plot(cylinders, mpg, col = "red")
> plot(cylinders, mpg, col = "red", varwidth = T)
> plot(cylinders, mpg, col = "red", varwidth = T, horizontal = T)
> plot(cylinders, mpg, col = "red", varwidth = T, xlab = "cylinders",
    ylab = "MPG")

The hist() function can be used to plot a histogram. Note that col=2
has the same effect as col="red".

> hist(mpg)
> hist(mpg, col = 2)
> hist(mpg, col = 2, breaks = 15)

The pairs() function creates a scatterplot matrix, i.e. a scatterplot for
every pair of variables for any given data set. We can also produce
scatterplots for just a subset of the variables.

> pairs(Auto)
> pairs(~ mpg + displacement + horsepower + weight +
    acceleration, Auto)

In conjunction with the plot() function, identify() provides a useful
interactive method for identifying the value for a particular variable for
points on a plot. We pass in three arguments to identify(): the x-axis
variable, the y-axis variable, and the variable whose values we would like
to see printed for each point. Then clicking on a given point in the plot
will cause R to print the value of the variable of interest. Right-clicking on
the plot will exit the identify() function (control-click on a Mac). The
numbers printed under the identify() function correspond to the rows for
the selected points.

> plot(horsepower, mpg)
> identify(horsepower, mpg, name)

The summary() function produces a numerical summary of each variable in
a particular data set.

> summary(Auto)
      mpg          cylinders      displacement
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0
 1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0
 Median :22.75   Median :4.000   Median :151.0
 Mean   :23.45   Mean   :5.472   Mean   :194.4
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8
 Max.   :46.60   Max.   :8.000   Max.   :455.0

   horsepower        weight      acceleration
 Min.   : 46.0   Min.   :1613   Min.   : 8.00
 1st Qu.: 75.0   1st Qu.:2225   1st Qu.:13.78
 Median : 93.5   Median :2804   Median :15.50
 Mean   :104.5   Mean   :2978   Mean   :15.54
 3rd Qu.:126.0   3rd Qu.:3615   3rd Qu.:17.02
 Max.   :230.0   Max.   :5140   Max.   :24.80

      year           origin                      name
 Min.   :70.00   Min.   :1.000   amc matador       :  5
 1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5
 Median :76.00   Median :1.000   toyota corolla    :  5
 Mean   :75.98   Mean   :1.577   amc gremlin       :  4
 3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4
 Max.   :82.00   Max.   :3.000   chevrolet chevette:  4
                                 (Other)           :365

For qualitative variables such as name, R will list the number of observations
that fall in each category. We can also produce a summary of just a single
variable.

> summary(mpg)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   9.00   17.00   22.75   23.45   29.00   46.60

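The different treatment of quantitative and qualitative variables can be seen on a small made-up data frame. The column values and the name toy are assumptions of this sketch.

```r
# Hypothetical data: one numeric column and one factor column.
toy <- data.frame(mpg = c(18, 25, 15, 30),
                  name = factor(c("a", "b", "a", "a")))

# Numeric variable: six-number summary with named entries.
s <- summary(toy$mpg)
print(s)
print(names(s))

# Factor variable: number of observations per category.
counts <- summary(toy$name)
print(counts)
```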
Once we have finished using R, we type q() in order to shut it down, or
quit. When exiting R, we have the option to save the current workspace so
that all objects (such as data sets) that we have created in this R session
will be available next time. Before exiting R, we may want to save a record
of all of the commands that we typed in the most recent session; this can
be accomplished using the savehistory() function. Next time