Sunday, December 8, 2013

Intro to R: Working directory, vectors, matrices, rbind, cbind, and writing data tables

Welcome to the opening edition of R coding tutorials! A quick heads-up: I am going to use the terms "code" and "script" a lot; they are generally interchangeable terms. I usually try to refer to "script" as a finished working product, while "code" (again, for me) refers to something I am still working on. But this is just my way of differentiating the status of things I am working on, and does not reflect any kind of standard usage.

Setting a Working Directory + Script Packages

First order of business is to set up a working directory, and any script packages you may need.
The working directory is where R will look for any files you reference in your own written code/scripts/programs, and where it will save any objects you create, such as graphics or data files. Packages are premade sets of functions you can download off of remote servers. Functions are manipulatable R tools built out of script, with preconfigured data entry options and adjustable settings. You can type the name of a function preceded by a question mark in R to get more information about any available function.
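
As a quick illustration of the question-mark trick, here is a minimal sketch using sum and cbind (both come with R) as example functions:

```r
# Open the help page for the sum() function
?sum

# help() does the same thing; the function name goes in quotes
help("sum")

# args() prints just the list of settings (arguments) a function accepts
args(cbind)
```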

To load a package into R, first go to Packages and click Set CRAN Mirror. This will establish which remote site (aka CRAN Mirror) you want to load your script package from. I usually go for USA(CA1), which is UC Berkeley. Why? Because they seem to have just about everything I've ever needed available. Sometimes a CRAN Mirror won’t have a package you are looking for. In that case you may have to do some hunting online to figure out which Mirror it is found in. You can also email the (lead) author of the script package and ask them what Mirror(s) they had it loaded to. I've had luck with both of these methods.

To load a package you have decided to work with, you can use the drop-down menu or script. Using the drop-down menus, click Packages, and then click Load packages. Doing this will bring up a new window with all of the packages you currently have in your library. Click on the package you want to load, and then click OK. Sometimes you will have packages that are dependent on other packages being loaded. In these cases, R will load those other packages for you, as long as they are installed in your library. The same goes if you use script, which is very simple to do:

library(name_of_package)

The big advantage of scripting your package loadings is that you won’t have to use the menu method for every package you need to run a particular analysis for your project. Just copy and paste library(name_of_package_you_want_to_load) into your R session with the rest of the script, and you're good to go. Remember, you need to load the package for any R tool you want to use before you can use it. Typing in adonis(things_and_stuff_you_are_studying)*, for example, without first loading the vegan package that adonis() comes in will just make R yell at you in red letters about how it could not find the function "adonis". Load the package first, and then you can use the tools.

*adonis is the permutational multivariate analysis of variance function found in the vegan package. I'll get back to this one in a future post.
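
Here is a minimal sketch of scripting the whole "install if needed, then load" routine. I'm using the "stats" package as a stand-in because it ships with every copy of R; swap in the name of the package you actually need. The variable name pkg is just something I made up for the example.

```r
# Name of the package to load ("stats" is only a stand-in for this sketch)
pkg <- "stats"

# Install the package only if it is not already in your library
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Load the package; character.only = TRUE tells library() that pkg
# holds the package name as text rather than being the name itself
library(pkg, character.only = TRUE)
```

Once a package is installed, library() alone is enough in later sessions.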

To set the working directory, you can use the drop down menu, or you can script it. If you are starting a big project and will be repeatedly setting the same working directory, you can script the directory address, then copy and paste it from your script file back in to R whenever you need to.

For example:

setwd("C:/Users/David/Documents/OSU Stillwater/Masters/R stuff/project_data&code/Working_directory")


You will notice some underscores in this directory name. R does not like spaces in names. A folder address wrapped in quotes can get away with spaces (this one has a couple), but the names you give to objects in R cannot. Put a space in one of those and R gets very, very angry, causing it to throw bright red error messages at you much as Thor hurls his hammer Mjölnir twixt the eyes of his enemies. Instead of putting spaces in anything you'll ever use in conjunction with R, use underscores instead. Make it a habit. 

To acquire the address of the folder you will be using as your working directory, right click on a file in that folder (making a new blank text file will work for this). Click on Properties in the menu, then right-click on the text next to Location, and click Select All in the menu that appears. Right-click again to copy, then paste your working directory into setwd("working_directory_address"). On Windows the copied address will use backslashes (\); change them to forward slashes (/) as in the example above, since R treats a single backslash inside quotes as a special character. That should do the trick!
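
To check that setwd() did what you wanted, you can ask R where it is currently pointed using getwd(). The sketch below uses a throwaway folder inside R's temporary directory so it runs on any computer; the folder name and the object names (old_wd, project_dir, current_wd) are made up for the example.

```r
# Remember where R is pointed right now so we can go back afterwards
old_wd <- getwd()

# Build a folder address; file.path() joins the pieces with forward slashes
project_dir <- file.path(tempdir(), "Working_directory")
dir.create(project_dir, showWarnings = FALSE)

# Point R at the new folder, then ask R to report where it is pointed
setwd(project_dir)
current_wd <- getwd()
print(current_wd)

# Go back to the original working directory
setwd(old_wd)
```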

Now, you can save scripting sessions in R so that you don't have to set up your working directory and packages every time you return to work on your project. As convenient as this is, I have had issues in the past with a session simply containing too much code/script and loaded data files, and errors beginning to propagate. These mysterious errors can make functional code or script appear to be flawed, not functioning the way it actually should, which is a giant pain. Instead of saving whole sessions** while I'm writing code, I back up my code and scripts within a text file, usually in the freeware program Notepad++. Don't use Word; the auto-formatting functions can do all kinds of weird unwanted stuff to your code (auto-capitalizing for example). Trust me, just use text file-based programs. I keep two windows open: One for finished, working script, and the other for code I am still working on. It's very very important to keep everything well labeled during this process, from files to individual lines of code and completed scripts. I'm super serious here. Plus if you ever plan to publish your script(s), it will be to your benefit to have detailed notes on what all your script does when you start writing. Failing to stay on top of labeling can leave you wondering why you structured a piece of code one way or another, and create needless headaches. Use # before and after these notes, so your notes will be excluded from the functional script. I've included examples of this below.

**Saving sessions can be very useful once you have finished putting together a single analysis, since you can keep all of your package load commands, working directory designation script, and analysis script in one place, nice and neat.

Something important to keep in mind when you are coding: if you get an error message, the first thing you should do is check your code for punctuation, spelling, or spacing errors. Any of these mistakes will throw an automatic wrench into your code, and will likely give you some angry, red error message. Take a deep breath and carefully read back through all of the relevant code you have entered. The error messages will contain information that can be useful in figuring out what went wrong. If you can't figure out what the error message is telling you, try copying and pasting it into a web search engine. With a little diligence, you should be able to work out what the problem is. If things seem to be a complete mess, make sure your work thus far is backed up, turn R off, restart, load your packages again, and review your relevant code and/or data files for errors. This has happened to me a couple of times, where I've been working for 5+ hours straight and code that used to work is suddenly giving me error messages. I restart my R session, reset everything and re-enter all the relevant code/script leading up to the error message, and things go back to working right. Damn computer-gremlins! This is why you want to back up your code and scripts in text files. I'll get more into debugging strategies in future posts as needed.


Introducing Vectors

When working on a new analysis, create two text files. One could be projectA_FinishedScript, where you put completed, functional scripts that are debugged and working 100 percent properly with full labeling. Descriptive file names are, again, a must. The other file could be called projectA_CodeInProgress, where you keep code for analyses you are still putting together. Here, keeping your labels current as you work on coding your analyses, graphics, etcetera is really, really critical. Having to go back through each section of some code you were working on because you forgot what part of it does is frustrating, and an avoidable waste of time. Label your code. Do it. Do it well. Your blood pressure and deadlines will thank you.

So in the imaginary projectA_CodeInProgress file, we have a script for taking the sum of a vector of numbers. A vector is a single row or column of integers, real numbers, complex numbers, logical values, characters, or raw data. The commas separate each value in the vector.

odds1<-c(1,3,5,7,9)  #This sequence of integers is a vector. It has a length greater than 1.#
#"<-" assigns "c(1,3,5,7,9)" to "odds1". "odds1" is now a vector of values. "<-" is formally called a#
#"gets arrow", as in "odds1" gets "c(1,3,5,7,9)".#
#"c(_)" is the concatenation function, and sets all values inside of it into a single row of values, aka a vector.#
#If only one integer was assigned, as in "x<-1", then "x" would be a scalar, which is a vector with only 1 value.#

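(sum(odds1))^(1/17)

#Returns the sum of the values in the vector "odds1", then takes the 17th root of the value of the sum#
#using "^(1/17)". The parentheses around 1/17 are needed: "^1/17" would raise the sum to the power 1#
#and then divide it by 17.#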
#The 17th root is used because (blah blah blah things).#

#sum(_) is a function that takes the sum of whatever values, vectors, or matrices you place within#
#the parentheses.#
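
A note on the order in which R does its math: the power sign "^" is worked out before division "/", so whether or not 1/17 is wrapped in parentheses changes the answer. Here is a small sketch showing the difference; the object names total, without_parens, and with_parens are made up for the example.

```r
odds1 <- c(1, 3, 5, 7, 9)
total <- sum(odds1)             # 1 + 3 + 5 + 7 + 9 = 25

# Without parentheses: R raises 25 to the power 1, then divides by 17
without_parens <- total^1/17

# With parentheses: R raises 25 to the power (1/17), i.e. the 17th root
with_parens <- total^(1/17)

print(without_parens)
print(with_parens)

# Raising the 17th root back to the 17th power recovers the sum
print(with_parens^17)
```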

But since this isn't the whole analysis, it stays in the CodeInProgress text file. If for some reason this was a complete piece of script that you were totally finished writing, debugging, and labeling, then you could move it to the FinishedScript text file.

cbind, rbind, and Making Data Matrices
Now let's take a look at cbind and rbind, two super useful data editing functions found in the "base" package that is automatically loaded when you start an R session. These two functions can turn multiple sets of values - multiple vectors - into a single data matrix. A data matrix will have multiple columns and rows, as opposed to the one-dimensional vectors.
For example:

odds1<-c(1,3,5,7,9)
#"<-" assigns "c(1,3,5,7,9)" into "odds1". "odds1" is now a vector of values.#
evens1<-c(2,4,6,8,10)
#"<-" assigns "c(2,4,6,8,10)" into "evens1". "evens1" is now a vector of values.#
I should note that after you get enough coding experience, writing down notes this simple won’t be needed. However, the function of assigning values or code to a title (such as "odds1" or "evens1", as I am using here) is something you will absolutely want to keep straight. I'll get more into this as I move in to more complex scripts in later posts.
So we want to take our two vectors of values - "odds1" and "evens1" - and make them into a single matrix of values.
Let's use cbind first:

columns<-cbind(odds1,evens1)
#combines vectors "odds1" and "evens1" as columns into a matrix of values using the command#
#cbind, which binds vectors and matrices as columns.#
#vector titles become column headers.#

To see what you've made, type "columns" in to R, then hit enter. You should see the following:

     odds1 evens1
[1,]     1      2
[2,]     3      4
[3,]     5      6
[4,]     7      8
[5,]     9     10

Note that the titles for each vector have become column names, while the rows are automatically numbered and labeled.

Next, rbind:

rows<-rbind(odds1,evens1)

#combines vectors "odds1" and "evens1" as rows into a matrix of values using the command rbind,#
#which binds vectors and matrices as rows.#
#vector titles become row names.#

Type "rows", and hit enter.

       [,1] [,2] [,3] [,4] [,5]
odds1     1    3    5    7    9
evens1    2    4    6    8   10

So, same concept as cbind, but rows instead of columns.
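
If you ever lose track of which way a matrix is facing, you can ask R for its dimensions and labels. A minimal sketch, rebuilding the same two matrices so it runs on its own:

```r
odds1 <- c(1, 3, 5, 7, 9)
evens1 <- c(2, 4, 6, 8, 10)

columns <- cbind(odds1, evens1)
rows <- rbind(odds1, evens1)

# dim() reports the number of rows first, then the number of columns
dim(columns)   # 5 rows, 2 columns
dim(rows)      # 2 rows, 5 columns

# The vector titles end up as column names in one and row names in the other
colnames(columns)
rownames(rows)
```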

Now, instead of just mashing together some vectors to create a matrix, let's actually make a data frame (aka a data table). A data matrix holds a single data type, uses less memory than a data frame, and is a prerequisite for doing linear algebra operations. Data frames are better in situations where you will be referring to individual rows and columns in your code/script. Using the attach() function allows the columns of the data frame to be referenced directly by their titles.

matrix1<-data.frame(odds1,evens1)

attach(matrix1)

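#After attach(), typing a column title such as odds1 by itself returns that column of matrix1.#
#Columns can also be referenced by title, in quotes, inside square brackets (this works with or without attach()).#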

matrix1[1:4,"odds1"]


[1] 1 3 5 7


#This script returns rows 1 through 4 of the “odds1” column in matrix1.#


Now type "matrix1" into R, and hit enter. You should see this:
  odds1 evens1
1     1      2
2     3      4
3     5      6
4     7      8
5     9     10
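
If you would rather not use attach(), there are a couple of other ways to get at columns by title. This is just a sketch of alternatives, not a replacement for the approach above; the object names first_four, same_rows, and total are made up for the example.

```r
odds1 <- c(1, 3, 5, 7, 9)
evens1 <- c(2, 4, 6, 8, 10)

columns <- cbind(odds1, evens1)        # a matrix
matrix1 <- data.frame(odds1, evens1)   # a data frame

# Check which kind of object you are holding
is.matrix(columns)
is.data.frame(matrix1)

# The dollar sign pulls a single column out of a data frame by title
first_four <- matrix1$odds1[1:4]

# Square brackets with a quoted title give the same values
same_rows <- matrix1[1:4, "odds1"]

# with() lets you use column titles directly for one calculation
total <- with(matrix1, sum(odds1 + evens1))
```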


We can now attach a new column to our "matrix1" data frame, using cbind. I've provided a new vector - "odds2" - for this purpose.
odds2<-c(11,13,15,17,19)


cbind(matrix1,odds2)

You may be wondering what happens if the new vector has more or fewer values than the data frame it is being attached to.

Let's try it to see:


odds3<-c(11,13,15,17,19,21)

cbind(matrix1,odds3)
>>Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 5, 6
 
So that didn't work too well. The problem is that the number of values in the "odds3" vector (6) is different from the number of rows in the "matrix1" data frame (5). R does not like this, and is now yelling at us. In cases like this, you'll need some kind of place-holder - such as 0 - to keep everything lined up correctly. There is a solution however! Using rbind, we can tack a row of zeros on to the bottom of matrix1!
zeros2<-c(0,0)

matrix1b<-rbind(matrix1,zeros2)

matrix1b


Which gives us:
  odds1 evens1
1     1      2
2     3      4
3     5      6
4     7      8
5     9     10
6     0      0

Ta da!
Meow let's try again with the "odds3" vector.

matrix2<-cbind(matrix1b,odds3)

matrix2


  odds1 evens1 odds3
1     1      2    11
2     3      4    13
3     5      6    15
4     7      8    17
5     9     10    19
6     0      0    21
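
One thing to watch with zeros as place-holders: R cannot tell them apart from real zeros in your data, so they will be counted in sums and averages. R has a built-in "no value here" marker, NA, that can be used the same way. Here is a minimal sketch of the same padding done with NA; the names na_row, matrix1_na, and matrix2_na are made up for the example.

```r
odds1 <- c(1, 3, 5, 7, 9)
evens1 <- c(2, 4, 6, 8, 10)
odds3 <- c(11, 13, 15, 17, 19, 21)
matrix1 <- data.frame(odds1, evens1)

# One NA for each column of matrix1
na_row <- rep(NA, ncol(matrix1))

# Tack the NA row on to the bottom, then attach the longer vector
matrix1_na <- rbind(matrix1, na_row)
matrix2_na <- cbind(matrix1_na, odds3)
print(matrix2_na)

# na.rm = TRUE tells mean() to skip the place-holder
mean(matrix2_na$odds1, na.rm = TRUE)
```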

So what happens if we use cbind on a matrix instead of just single vectors? Let's try it!
evens2<-c(12,14,16,18,20)

matrix3<-data.frame(odds2,evens2)

matrix3

  odds2 evens2
1    11     12
2    13     14
3    15     16
4    17     18
5    19     20


combo_matrix<-cbind(matrix1,matrix3)

combo_matrix

  odds1 evens1 odds2 evens2
1     1      2    11     12
2     3      4    13     14
3     5      6    15     16
4     7      8    17     18
5     9     10    19     20


Now we’ll try rbind:

combomatrix2<-rbind(matrix1,matrix3)

>>Error in match.names(clabs, names(xi)) :
  names do not match previous names
The column names don’t match, so R can’t attach the two matrices one on top of the other with rbind.
To make rbind work, we’ll need to transpose (flip on their side) the data matrices, turning columns into rows and rows into columns.

To do this, we’ll use the transpose function t(data_frame_you_want_to_transpose)

matrix1t<-t(matrix1)

matrix1t

       [,1] [,2] [,3] [,4] [,5]
odds1     1    3    5    7    9
evens1    2    4    6    8   10

matrix3t<-t(matrix3)

matrix3t

       [,1] [,2] [,3] [,4] [,5]
odds2    11   13   15   17   19
evens2   12   14   16   18   20

Now, let’s try rbind again.

combomatrix2<-rbind(matrix1t,matrix3t)

combomatrix2

       [,1] [,2] [,3] [,4] [,5]
odds1     1    3    5    7    9
evens1    2    4    6    8   10
odds2    11   13   15   17   19
evens2   12   14   16   18   20

And there you have it! I love it when a plan comes together.
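
Transposing is not the only way around the "names do not match" error. If what you actually want is one data frame stacked on top of the other, another option is to give the second data frame the same column names as the first before using rbind. This is a sketch of that alternative; the names matrix3_renamed and stacked are made up for the example.

```r
matrix1 <- data.frame(odds1 = c(1, 3, 5, 7, 9), evens1 = c(2, 4, 6, 8, 10))
matrix3 <- data.frame(odds2 = c(11, 13, 15, 17, 19), evens2 = c(12, 14, 16, 18, 20))

# Copy matrix3, but with the column names taken from matrix1
matrix3_renamed <- setNames(matrix3, names(matrix1))

# Now the names match, so rbind can stack the rows
stacked <- rbind(matrix1, matrix3_renamed)
print(stacked)

# Side note: t() hands back a matrix, even when you give it a data frame
is.matrix(t(matrix1))
```

The result has 10 rows and 2 columns, with the values from matrix3 sitting underneath the values from matrix1.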


Making Data Files

Now we are going to test your working directory by making a brand new data file. Exciting, I know!
Note: If you don't have your working directory set up, you should do that now.

write.table(combo_matrix, file="combo_matrix.csv",sep=",",row.names=T)

combo_matrix is the name of the R table you are exporting to your working directory using the write.table function. The table was assigned to this name using the gets-arrow "<-".
file="combo_matrix.csv" tells the R write.table function to create a new data file named combo_matrix.csv in your working directory.
sep="," tells write.table to put a comma between the values. Without it, write.table separates the values with spaces, and the file would not actually be in comma-separated-value (.csv) format, whatever its name says.
row.names=T will result in the numerical row names being included in the .csv file you are creating. To exclude these row names from the file you are creating, use F (as in False) instead of T (as in True).

A .csv file can be opened in Microsoft Excel.


write.table(combo_matrix, file="combo_matrix.txt",row.names=T)

file="combo_matrix.txt" tells the R write.table function to create a new text format (.txt) data file in your working directory.
This script will write the combo_matrix data frame as a text file (.txt). A .txt file can be opened with programs such as Wordpad, Notepad, and my preferred Notepad++, among many others.
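
R also has a shortcut function, write.csv, which is write.table with the comma separator already set. Here is a minimal sketch that writes a .csv and then reads it straight back in as a check that the file came out the way you expected. It writes to R's temporary folder instead of the working directory so it can run anywhere; the names csv_path and read_back are made up for the example.

```r
combo_matrix <- data.frame(
  odds1  = c(1, 3, 5, 7, 9),
  evens1 = c(2, 4, 6, 8, 10),
  odds2  = c(11, 13, 15, 17, 19),
  evens2 = c(12, 14, 16, 18, 20)
)

# Address of the file to create (a temporary folder, for this sketch only)
csv_path <- file.path(tempdir(), "combo_matrix.csv")

# Write the table as comma-separated values, leaving out the row names
write.csv(combo_matrix, file = csv_path, row.names = FALSE)

# Read the file back in and look at it
read_back <- read.csv(csv_path)
print(read_back)
```

To write to your working directory instead, use a plain file name such as file="combo_matrix.csv".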

My next R post will be a walk-through on loading external data sets in text (.txt) and Excel comma-separated-value files (.csv), as well as some actual math! Hooray maths and datums!

If there is something you find in this or future R tools posts that you have a question or comment about, please let me know. I will get back to you directly, or edit this post accordingly (and then let you know).

Cheers!


Note: For even more beginner (and intermediate) level R stuff, I strongly suggest checking out The R Book by Michael J. Crawley. It's a giant book for sure, but if you are about to start doing a lot of coding in R, it is absolutely an invaluable resource.