from OTU table to HEATMAP!

In this tutorial you will learn:

  • what a heatmap is
  • how to create a clean, highly customizable heatmap using heatmap.2 in the gplots package in R
  • how to remove samples with poor output (not very many sequences)
  • how to rearrange your samples by a metadata category
  • how to make a color coded bar above the heatmap for the metadata category

You will not learn:

  • how to create an OTU table
  • how to choose a good color palette for your color coded bar
  • why heatmaps are called heatmaps

What you will need:

  1. metadata table
  2. OTU table
  3. and in case it wasn’t obvious, R

Introduction

So figuring out code that goes from OTU table to heatmap has been my dream since we saw a cool looking heatmap in one of Dr. Lamendella’s presentations on the human gut microbiome (from the most awesome gut microbiome paper ever of 2012). A heatmap is a neat way to display a matrix of information in a color coded grid and is not in any way related to Fahrenheit or Celsius. One of the most common applications of heatmaps is displaying gene expression levels in DNA microarrays.

Photo from Wikipedia: Heatmap of a DNA microarray. Unfortunately for people with red-green color blindness, these two colors are the most commonly used for creating heatmaps.

In our case, the heatmap will function particularly well in displaying information about OTU abundances at various taxonomical levels. Heatmaps will also serve really well once people in the lab start getting metaproteomic data. To see more examples of heatmaps used in seminal research, look at the supplementary figures of the coolest paper of 2012. This is a particularly long tutorial with code adapted from several sources. I will try my best to explain each step, but some of the steps in certain functions are still a bit foggy and others would require unnecessarily long explanations. My original bare basics tutorial on heatmaps can be found here and it was mainly based upon this tutorial done by Nathan Yu. If you are keen on just making a quick heatmap, I suggest looking at those two links first.

Prepping the Files

You know how it goes, you need to make things in the right format before you can make the magic happen.

  1. Metadata File-Your sample names should be listed down the first column and the variables describing your samples should be in the top row. You can have as many columns/variables as you like.
  2. OTU Table-Contrary to the metadata file, your sample names should be on the top row whereas the OTU IDs should be down the first column. When you scroll all the way to the right on your file there should be a column with the taxonomical information in each row such as
    • “k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Coprococcus; s__Coprococcus eutactus” or
    • “Root;Bacteria;Firmicutes;"Clostridia";Clostridiales;"Lachnospiraceae";"Lachnospiraceae Incertae Sedis"”
    • Unlike the other columns, which have a title (the sample name), this one is probably blank. Type in “taxonomy”. All lowercase and no extra frills.
  3. Make sure the ID numbers/names for samples MATCH between your metadata file and OTU file.
    • In reality, it will probably make things easier for you if your samples are something like “Site2Samp3Vers1” rather than just “231”. This is one of those critical steps that absolutely must be done or else R won’t be able to match up the OTU data of each sample with the metadata of each respective sample. When creating your sample names, the concatenation will probably help you out as explained here.
  4. Make sure both are saved as .csv (Comma-Separated Values) rather than Excel formats (.xls or .xlsx).
    • This compacts the file and makes the importation process into R easier.

Hopefully this didn’t take you too long! Now the fun begins.

Importing the files into R

OPTIONAL: I tend to have stuff saved accidentally in my R sessions that I no longer need, so my first step of every large script is removing anything that is currently in the work session.

ls()
rm(list=ls())
ls()

OPTIONAL: I also usually change my working directory at the beginning. This makes writing pathways (or the directions for R to find files) much easier and leaves less of a chance for things to get messed up. Write in the pathway for the folder where your metadata and OTU tables are located. The one written below follows the pathway conventions of Windows PC (for Mac view http://youtu.be/8xT3hmJQskU).

getwd() 
setwd("c:/your_folder/project_1") 
getwd()

OPTIONAL: The second getwd() call should repeat whatever working directory you just specified. To double check that your files are in the pathway you just specified:

dir()

Now we actually import the files into R. Change the names to whatever you named your file.

dat1 <- read.csv("OTU.csv", header=TRUE,row.names=1, sep=",") 
meta <- read.csv("metadata.csv", header=TRUE,row.names=1, sep=",")
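Before going further, it is worth a quick sanity check that the import worked and that the sample IDs really do match (step 3 of the prepping section). This is my own addition using the object names from above:

```r
dim(dat1)   # rows = OTUs, columns = samples plus the taxonomy column
dim(meta)   # rows = samples, columns = metadata variables
# Any OTU-table column name missing from the metadata should be "taxonomy" only;
# anything else means the sample names do not match between the two files.
setdiff(colnames(dat1), rownames(meta))
```

One gotcha: read.csv prepends an X to any column name that starts with a digit (so “231” becomes “X231”), which is yet another reason to avoid purely numeric sample IDs.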

Also take the time to make a folder called “output” in your working directory folder. We will be writing lots of scripts that spit out tables here and there, so it is nice to separate them out. Looking for a command to do this? Just do it manually (you know, right click, create new folder, etc.); it’s easier.

OTU Table “Manipulation” or Cleaning

As you already know, data usually don’t arrive in a ready-to-analyze format. From missing values to messed up samples, now we take the time to clean our data (since the phrase “data manipulation” is not PC at all). In this process, we will split the OTU table into two segments/objects. The first object (taxa.names) will contain just the taxonomical information and the second object is a matrix (dat2) that will contain the rest of the table.

taxa.names <- dat1$taxonomy              # just the taxonomy column
dat2 <- as.matrix(dat1[,-dim(dat1)[2]])  # everything except the last (taxonomy) column

Optional: Dropping Samples with Low Observations

This section is adapted from Dr. Schaefer’s tutorial “Analyses of Diversity, Diversity and Similar Indices” class notes. The whole class syllabus is rather fascinating and provides a lot of R equivalents of processes (UniFrac analysis, nMDS, Indicator Species analysis, etc.) that we would normally do in QIIME or PC-ORD.

Although the subtitle says optional, it’s probably a good idea to drop samples with low observations. There is a very low probability your sample only has 10 or even just 100 microbes in it.

First, create an object that contains the number of observations for each sample (the sum of each column).

s_abundances<-apply(dat2,2,sum)   # equivalently: colSums(dat2)

Next, split the OTU table into two matrices: one with low abundances and one with high abundances. Currently the threshold is set at 1000 but you can change the number in both lines.

bads<-dat2[,s_abundances<1000]    # samples below the threshold
goods<-dat2[,s_abundances>=1000]  # samples at or above the threshold

To see how many poor samples will be removed from the OTU table:

ncol(dat2)-ncol(goods)
ncol(bads)

Now replace the old OTU table with the new one that contains just the good samples.

dat2<-goods

Run these if you would like output files of the good and bad samples.

write.table(bads, file="output/bads.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)
write.table(goods, file="output/goods.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)


The next parts on changing to relative abundance and extraction by different taxonomic levels are based on a PowerPoint in the lab module “Introduction to Metagenomic Data Analysis” by Alexander V. Alekseyenko and Susan P. Holmes (as part of the Summer Institute in Statistics and Modeling of Infectious Diseases).

Recommended: Change to Relative Abundance

Right now, observations of each OTU are probably being reported as absolute abundances. Each OTU count is the original number generated by OTU picking and the like in QIIME. Sometimes there are specific reasons why you might want to leave the abundances as absolute abundances, but the majority of the time you will probably need to make samples comparable to each other. This requires standardization: the absolute number of observations for an OTU becomes a fraction of the total observations in that sample.

dat2 <- scale(dat2, center=F, scale=colSums(dat2))
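To confirm the conversion worked, every sample (column) should now sum to 1. A quick check (my own addition):

```r
# each column of relative abundances must total 1
summary(colSums(dat2))
stopifnot(isTRUE(all.equal(unname(colSums(dat2)), rep(1, ncol(dat2)))))
```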

Next, you will need to transpose the OTU table so that the OTU ID numbers become the top row rather than the first column.

dat2 <-t(dat2)

Next, take a closer look at the way taxonomical designations are written in your OTU table. Are there six or seven levels?  If it looks like

  • “Root;Bacteria;Firmicutes;"Clostridia";Clostridiales;"Lachnospiraceae";"Lachnospiraceae Incertae Sedis"”
  • “k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Coprococcus; s__Coprococcus eutactus”

there are seven levels, designated by Root/Kingdom, Phylum, Class, Order, Family, Genus and Species. You will need to start numbering the levels at level 2 rather than 1 in the extraction functions. However, if your taxonomical designations begin with phyla, then start numbering at level 1; or if phyla are not listed until the third spot, start numbering the levels at 3.

d.phylum = otu2taxonomy(dat2,level=2,taxa=taxa.names)
d.class = otu2taxonomy(dat2,level=3,taxa=taxa.names)
d.order = otu2taxonomy(dat2,level=4,taxa=taxa.names)
d.family = otu2taxonomy(dat2,level=5,taxa=taxa.names)
d.genus = otu2taxonomy(dat2,level=6,taxa=taxa.names)
d.species = otu2taxonomy(dat2,level=7,taxa=taxa.names)
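Note that otu2taxonomy() comes from the Alekseyenko and Holmes lab module mentioned above, so it is not part of base R. If it is not defined in your session, here is a minimal sketch of what such a function could look like (my own reconstruction, not the original): it truncates each taxonomy string to its first level ranks and sums the abundances of all OTUs that share that truncated name.

```r
otu2taxonomy <- function(x, level, taxa) {
  # x: matrix with samples in rows and OTUs in columns (i.e., after the transpose)
  # taxa: character vector of taxonomy strings, one per OTU column
  lvl <- sapply(strsplit(as.character(taxa), ";"),
                function(parts) paste(trimws(parts[1:level]), collapse = ";"))
  # sum the abundances of all OTUs that share the same truncated taxonomy
  t(rowsum(t(x), group = lvl))
}
```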

To look at the output, transpose and export.

phylum2 <-t(d.phylum)
class2 <-t(d.class)
order2 <-t(d.order)
family2 <-t(d.family)
genus2 <-t(d.genus)
species2 <-t(d.species)
write.table(phylum2, file="output/phyla.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE) 
write.table(class2, file="output/classes.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)
write.table(order2, file="output/orders.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)
write.table(family2, file="output/families.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)
write.table(genus2, file="output/genera.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)
write.table(species2, file="output/species.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)

These steps allow you to analyze your OTU data at various taxonomical levels within and outside of R.

Merging metadata and OTU tables

Next we will merge the metadata and OTU table together. Originally, I tried to figure out a way to avoid this step but R was not being so cooperative with reordering. We are merging the two tables so that when we rearrange the samples according to a metadata variable, the OTU table rearranges as well.

First choose which taxonomical level you want to look at. For simplicity’s sake, let’s look at phyla first. This first script merges the two tables by matching up the Sample ID/Names. The “all.x=FALSE” portion makes sure that metadata information is not included for samples that were dropped earlier from the data processing (if you skipped that step, this should not affect the output).

mergephylum<-merge(meta, d.phylum, by="row.names",all.x=FALSE)

There are multiple ways to double check the merge worked. The easiest is just by looking at the dimensions of each object involved.

dim(meta)
dim(d.phylum)
dim(mergephylum)

The first number reflects the number of rows while the second refers to the number of columns. The number of rows of the final merge should match the number of rows in the phyla file (or whichever level you are looking at). The number of columns of the final merge should be the sum of the columns in d.phylum and meta plus 1 (the extra column is Row.names, which the merge creates). Again, here is the code for output.

write.table(mergephylum, file="output/mergephylum.csv", col.names=NA,row.names=TRUE, sep=",", quote=FALSE)
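If you prefer, the dimension check can be automated so that R stops with an error when the merge lost samples or columns. A sketch (my own addition) using the objects above:

```r
# rows must match the taxa-level table; columns = Row.names + metadata + taxa
stopifnot(nrow(mergephylum) == nrow(d.phylum),
          ncol(mergephylum) == ncol(meta) + ncol(d.phylum) + 1)
```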

Reordering by metadata variable

If you would like your heatmap in order by sample or order does not matter, skip to the next step (making the actual matrix for the heatmap).

You can reorder your samples based on a variable in your metadata. This can be anything from diet type to sample site. The variable that will direct the way your table will be reordered is not limited to just being discrete (as in group A, group B, etc.); the variable can also be continuous (blood pressure, height, pH).

For this rearrangement code, you will need the exact name of the column of the variable. Since names might have changed since importation into R, it’s best to double check. This lists the names of all the columns of your metadata.

colnames(meta)

Now use the name of the variable in question and replace “VARIABLE” in the following code.

reordermerge <- mergephylum[order(mergephylum$VARIABLE, mergephylum$Row.names),]

This will cause the table to be rearranged by increasing value of the VARIABLE (or alphabetical order) and then by the Row.names or Sample ID/Names.
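If you want the groups in a custom order rather than alphabetical, you can convert the column to a factor with explicit levels before ordering. A sketch (my own addition; the VARIABLE name and the dry/muddy/sandy levels are placeholders):

```r
# put the groups in a custom order; replace VARIABLE and the levels with your own
mergephylum$VARIABLE <- factor(mergephylum$VARIABLE,
                               levels=c("sandy", "muddy", "dry"))
reordermerge <- mergephylum[order(mergephylum$VARIABLE, mergephylum$Row.names),]
```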

Making the actual matrix for the heatmap

This next script splits the OTU table and metadata apart again so that the heatmap will display only OTU data rather than metadata as well. It involves removing columns, renaming the rows and changing the format of the object to a matrix.

(OTUcol1<-ncol(meta)+2)        # first OTU column (skips Row.names + metadata columns)
(OTUcol2<-ncol(reordermerge))  # last OTU column
justOTU<-reordermerge[,OTUcol1:OTUcol2]
justOTU[1:5,1:5]               # peek at the corner of the table
rownames(justOTU[1:10,])       # row names are still just numbers
rownames(justOTU)<-reordermerge$Row.names
rownames(justOTU[1:10,])       # now they are the sample names
justOTU2<-as.matrix(t(justOTU))
justOTU2[1:5,1:5]

Heatmap!!!

Hurray! After all those steps, we have finally arrived at making the actual heatmap.

First install the package that contains the codes to make the heatmap. (Refresh your memory on packages here.)

install.packages("gplots")
library(gplots)

Now print out your heatmap!

heatmap.2(justOTU2,Rowv=TRUE, Colv=FALSE, scale="column", trace="none", col=redgreen, xlab="sample", ylab="phylum", margins=c(10,15))

If you are happy with this image, skip to “Export the Heatmap”.

This tutorial will be covering some of the arguments within this function for customization. In case you want to know about all the possibilities, read this documentation on heatmap.2.

Looking at some of the arguments already present:

  • justOTU2-This is where you put the object name of the matrix you would like to visualize.
  • Colv and Rowv-Hold your horses and read about these in the next subsection.
  • scale-For the scale argument you can input “none”, “column” or “row”. R normalizes the rows or columns using the Z-score or standard score (not to be confused with Z-factor), which indicates how many standard deviations a value is from the mean. In other words, scaling highlights which values stand out within their row or column and are possibly significant.
  • trace-You should probably set this to “none” because if you put “column”, “row” or “both”, you get some funky lines separating the boxes in the heatmap. If your data set is pretty small, though, you might benefit from adding trace lines with a different color:
    • tracecol="cyan"
  • col-This controls the color scheme. Read about changing the color scheme in the second following subsection.
  • xlab and ylab-This is where you put labels for your X and Y axis.
  • margins-Changing the numbers in this argument changes the amount of white space surrounding your heatmap. I highly recommend fiddling with these numbers to make sure the visualization space fits the heatmap of your data.

Optional: Dendrograms

A dendrogram is a tree placed on the right and/or top side of the heatmap. Instead of clustering phylogenetically similar samples or phyla, these trees cluster columns or rows that have similar values. This clustering causes the rows and columns to be reordered when you input “TRUE” for Rowv or Colv. If you input Colv=TRUE, then you will lose any customized ordering you did earlier when you reordered by the metadata.
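If you cluster the rows but keep your custom column order, it helps to also tell heatmap.2 which dendrogram to draw so the two settings agree (otherwise it may warn about the mismatch). A sketch using the objects from earlier:

```r
# row dendrogram only; columns stay in the order set during the reorder step
heatmap.2(justOTU2, Rowv=TRUE, Colv=FALSE, dendrogram="row",
          scale="column", trace="none",
          xlab="sample", ylab="phylum", margins=c(10,15))
```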

Optional: Change the color scale

Although many default heatmaps use the red-green color scheme, I would suggest choosing a different palette to A) better match the color scheme of your paper and more seriously, B) allow color blind people the chance to understand your graphic.

I will not go much into the specifics of choosing good color palettes, but the most important aspect of designing graphics is that people can actually read and understand them. The color scheme must assist in this goal. So yellow text on a white background is a huge no-no (regardless of whether the audience is color blind) and remember, there are also people who are blue-green color blind. In case you want to learn more about colors in R in general, I stumbled across this presentation that has excellent information but is displayed in the most horrendous manner graphically.

There are already a couple color schemes present in R. You can find them by entering

?cm.colors

The easiest way to change the color scheme to a customized scheme is by using the package “RColorBrewer”.

install.packages("RColorBrewer")
library(RColorBrewer)

The information in your heatmap would probably benefit most from a monochromatic color scheme (white, light blue, dark blue, etc.) or a diverging color scheme (dark red, pink, white, light blue, dark blue, etc.). Creating either uses the same function from RColorBrewer: colorRampPalette. (For more information about RColorBrewer, look at an explanation here and another one by Dr. Stewart MacArthur).

For a monochromatic scale, change the color names within the quotation marks:

scaleblue<-colorRampPalette(colors=c("white","blue"))(100)

For a diverging scale, input three colors.

scaleyellowblue<-colorRampPalette(colors=c("yellow","white","blue"))(100)

The number at the end corresponds to the number of colors generated in the scale. Thus if you input

scaleyellowblue<-colorRampPalette(colors=c("yellow","white","blue"))(3)

you would get only the three original colors, with no interpolated tones in between.

To apply the color scale, input the name of your custom scale in the argument for col.

heatmap.2(justOTU2,Rowv=TRUE, Colv=FALSE, scale="column", trace="none", col=COLORSCALE, xlab="sample", ylab="phylum", margins=c(10,15))

For example, the diverging yellow-white-blue color scheme:

heatmap.2(justOTU2,Rowv=TRUE, Colv=FALSE, scale="column", trace="none", col=scaleyellowblue, xlab="sample", ylab="phylum", margins=c(10,15))

If you want to get fancy, you can choose from any of R’s built-in named colors (run colors() to list them). For example, the Science paper I mentioned earlier had a really cool protein heatmap with a customized color scheme.

Optional: Colored Side Bar

In several papers, you can find that authors put a customized colored side bar along the top or left hand side of the heatmap to denote different groupings of the samples. You would most likely want to create a colored side bar based on the variable you chose for reordering your data. The following function is based upon code created by Dr. Peter Cock found here.

First, you will need to create a function that assigns a color to each category. This function uses a chain of “if” statements. Only the first category is preceded by “if”; the following categories use “else if”. Add a line for each category. This example has 5 categories.

color.map<-function(diet.type){
 if(diet.type=="0") "red" 
 else if(diet.type=="1") "green" 
 else if(diet.type=="2") "yellow"
 else if(diet.type=="3") "blue"
 else if(diet.type=="4") "purple"
 }

The text between quotation marks after the == should be the exact text of the category designation within the column. So if your categories are “dry”, “muddy” and “sandy”, you should replace the numbers.

color.map<-function(CATEGORY){
 if(CATEGORY=="dry") "red" 
 else if(CATEGORY=="muddy") "green" 
 else if(CATEGORY=="sandy") "yellow"
 }
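As an aside, the same mapping can be written without the if/else chain by using a named lookup vector (a base-R alternative, not from the original code; the dry/muddy/sandy categories here are the same hypothetical ones as above):

```r
# one named color per category
palette.map <- c(dry = "red", muddy = "green", sandy = "yellow")
# stands in for reordermerge$VARIABLE in the real data
categories <- c("muddy", "dry", "sandy", "muddy")
sidebarcolors <- unname(palette.map[categories])
sidebarcolors   # "green" "red" "yellow" "green"
```

This scales better when you have many categories, since adding one is just another entry in the vector.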

If the variable is continuous (such as pH), the color scheme would probably be better suited by a monochromatic scale (look at the previous section for more information). Install RColorBrewer and create a custom palette. You can change the colors and the number of tones the function will produce.

install.packages("RColorBrewer")
library(RColorBrewer)
colorRampPalette(colors=c("yellow","red"))(5)

Use the hex codes output by the last command in your color map function.

color.map<-function(CATEGORY){
 if(CATEGORY=="0") "#FFFF00" 
 else if(CATEGORY=="1") "#FFBF00" 
 else if(CATEGORY=="2") "#FF7F00"
 else if(CATEGORY=="3") "#FF3F00"
 else if(CATEGORY=="4") "#FF0000"
 }

Now that you have a function linking category and color, apply the color map to the whole column of data.

sidebarcolors <- unlist(lapply(reordermerge$VARIABLE, color.map))

Now add the argument ColSideColors=sidebarcolors to your heatmap code.

heatmap.2(justOTU2,Rowv=TRUE, Colv=FALSE, scale="column", trace="none", col=redgreen, xlab="sample", ylab="phylum", margins=c(10,15), ColSideColors=sidebarcolors)

There is actually a way to put in multiple side bars, but I have not looked much into the code. You can find an explanation here.

Here is an example I made with a dendrogram for rows, a customized color scale and customized sidebar colors. The color scale was created using a 9 tone palette from RColorBrewer.

heatmap example

Export the Heatmap

Once you have found the perfect way to graphically display your information as a heatmap, you will probably want to save the beauty. The export process for images is different from tables. It is similar to turning a camera on, placing an object in front of the camera, and then taking the picture. When you turn on your camera, you need to choose the size of the photo, its name, and, by extension, the folder the photo will be written to. The “placing the object” part is rewriting whatever final code you have come up with for your heatmap.

png(filename="output/phylum.png", width=1200, height=800)
heatmap.2(justOTU2,Rowv=TRUE, Colv=FALSE, scale="column", trace="none", 
 col=redgreen, xlab="diet type", ylab="phylum", 
 ColSideColors=sidebarcolors, margins=c(15,15)) 
dev.off()

This is it! In case your curiosity about heatmaps has not been satisfied, there are many other aspects of heatmaps that can still be twiddled. For example, this tutorial shows how to overlay numbers on the colored boxes. If you have any suggestions to make this tutorial better, please comment below!


General Navigation in R

So you’ve finally managed to install the pesky environment but have no idea what you are looking at when you open the program. This tutorial is for you. (Again, here is a version with screenshot pictures).

When you open R, it might look different than the screenshots in the picture version of the tutorial. This depends on what type of computer you are using and how you installed R (look at step 8 of the installation post).


When you first open R, there should be a window with this information in it. The version of your copy of R can be found at the beginning (starred). Every time you open R, this window is called the workspace. A workspace can be likened to a box. You can put multiple things in the box and rearrange and analyze the objects in the box. Depending on how you installed R, you can have multiple workspaces/boxes open to put in and remove things. When you open R, you can have a new box or old box (saved workspace) opened depending on how you closed R. More information on that later.

Now we are going to try entering in the command “help.start()” which will open a help guide in Chrome, Firefox, Safari etc. In the previous screenshot, there is a box around the greater-than sign (“>”). This is where you write directions to R. Try typing “help.start()” exactly (without the quotation marks) and press the Enter/Return key.


*From now on, commands will be written like so

help.start()

When the command is finished running (R has heard and followed your instructions), another greater-than sign should appear. You would then be able to write another command and run it. (When you run scripts that require lots of time and computation, this sign will become a symbol of relief, happiness and many other positive emotions).  Depending on the command, R will spit out some output about what it is doing or results from the instructions before the greater-than sign appears. (Marked by an arrow in the following picture).

Here is that happy symbol of a successfully run code and then a help screen pops up. If you click on “An Introduction to R” there are a myriad of tutorials already written for R that were installed on your computer when you installed R.


Alternatively, you could get a PDF version of the guide by clicking on the top menu Help>Manuals (in PDF)>An Introduction to R. Although it is easy to look for answers using the all powerful tool Google, this PDF is a solid, 101-page introduction to R. A downloadable version from the internet can also be found here.

Although R is good at following instructions, it is pickier than a toddler choosing what to eat when it comes to written commands. You have to write commands exactly for them to be performed. So you will not misspell, not change uppercase and lowercase letters, and definitely not change the location of a period or underscore. If you do, R will whine and spit out an error message in response to your command.


Picky R doesn’t like anything new

Let’s try some basic arithmetic. Remember to press enter between each line.

4+7
10/3
4-7+23
3*(20+1)
sqrt(144)

remember PEMDAS!

You can also create objects (now let’s remember algebRa) by assigning numbers or even other objects to the new object.

x<-3
y<-7
z<-x*y


If you type the name of these objects, you can see what each object contains.

x
y
z


To see all the objects you have in your workspace (or box, to continue the earlier metaphor):

ls()


You can reassign new contents to an object.

a<-1
a
a<-x
a
a<-200
a


An object can contain more than just one number. Here are some different ways to create sequences.

x<-c(5,9)
y<-seq(1,21, by=2)
z<-seq(0,100, length=9)
w<-4:32
x
y
z
w


If you want to repeat a command you wrote earlier, press the up or down arrow keys on your keyboard.

You can also erase all the objects in your workspace.

ls()
rm(list=ls())
ls()


To exit R, you can either close out of it or type

q()

How courteous of R

Either way, R will ask if you would like to save your workspace. If you choose to save, R will remember objects in the workspace even if you turn off your computer.

Here is a link to an excellent and shorter introduction to R (compared to the earlier one we opened up through R) if you would like to get more comfortable navigating in R.

Scatterplot Matrices

Scatterplot matrices are a great way to roughly determine if you have a linear correlation between multiple variables. This is particularly helpful in pinpointing specific variables that might have similar correlations to your genomic or proteomic data. If you already have data with multiple variables, load it up as described here.

If not, no worries, because R comes with various presaved datasets for practice (some are more interesting than others). To view these datasets, input the following.

data()

For this tutorial, we will be looking at the datasets trees and ChickWeight.  First, load or open these datasets.

data(trees)
data(ChickWeight)

To see the actual data contained by these datasets, just write the title of the dataset.

trees
ChickWeight
  • The trees dataset seems to contain three columns of measurements: Girth, Height and Volume.
  • The ChickWeight dataset seems to involve little chicklets getting fed different diets and being weighed at various time points.

To find out more information about the datasets and to confirm our observations, put a question mark before the title of the dataset.

?trees
?ChickWeight

Now, are you ready for the scatterplot?

pairs(trees)

Dataset Trees Scatterplot Matrix

This is an example of a scatterplot matrix. The variables are written in a diagonal line from top left to bottom right. Then each variable is plotted against each other. For example, the middle square in the first column is an individual scatterplot of Girth and Height, with Girth as the X-axis and Height as the Y-axis. This same plot is replicated in the middle of the top row. In essence, the boxes on the upper right hand side of the whole scatterplot are mirror images of the plots on the lower left hand.

In this scatterplot, it is probably safe to say that there is a correlation between Girth and Volume (Go data! Confirming the obvious) because the plot looks like a line. There is probably less of a correlation between Height and Girth in addition to Height and Volume. More statistical analyses would be needed to confirm or deny this.
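To put a number behind these eyeball judgments, cor() gives the pairwise correlation matrix for the same variables:

```r
# pairwise Pearson correlations for the built-in trees dataset
cor(trees)
```

Girth and Volume correlate at roughly 0.97, while the pairs involving Height sit around 0.5-0.6, matching what the scatterplots suggest.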

Now for ChickWeight.

pairs(ChickWeight)

Dataset ChickWeight Scatterplot

This scatterplot matrix is unfortunately not as clean as the last plot because it contains discrete data points for Time, Chick and Diet. However, much can still be extracted from this scatterplot matrix (think about BS exercises you might have done for English or Art) about experimental design and possible outcomes.

  • Scatterplots related to Time are evenly distributed into columns or rows, suggesting that data was actually collected in a regimented fashion. (As in, data was collected at the times it should have been for all the Chick samples).
  • There were about 50 chicks. The first 20 were on diet 1 and then the next three groups of 10 were given diet 2, 3 or 4.
  • Looking at Row 4, Column 1, there is a possibility that chicks on diet 3 gained more weight than chicks on diets 1, 2 or 4.
  • Looking at Row 2, Column 1, it seems that chicks weighed about the same amount at the beginning of the experiment but variation increased as time passed on. In general, there is an increase in weight.

There you have it!

In conclusion,

  • Scatterplot matrices are good for determining rough linear correlations of metadata that contain continuous variables.
  • Scatterplot matrices are not so good for looking at discrete variables.

Soup up your R environment: how to install packages

Today we are going to make additions to our R environment in a common process called installing packages. The transition won’t be as long, drastic, or emotional as an episode of Extreme Makeover: Home Edition, but it does add more capabilities to your R environment.


Extreme Makeover: Home Edition in Houston as published by Houston’s Real Estate Landscape Swamplot

A package is a bundle of code combined and distributed through the different R mirrors/servers. Packages usually serve a specific function such as analyzing certain types of data or assisting in multi-server computing. Today, we are going to install ggplot2 and gplots, both of which are commonly used in creating different figures.

I will briefly enumerate the installation steps here which fall roughly into two parts. If you would like screenshots of the process in addition to the instructions, click here.

Method 1 (less typing)

Part 1-Getting the Package onto Your Computer

  1. Open R via  your preferred method (icon on desktop, Start Menu, dock, etc.)
  2. Click “Packages” in the top menu then click “Install package(s)”. 
  3. Choose a mirror that is closest to your geographical location.
  4. Now you get to choose which packages you want to install. If you would like to install multiple packages, click on each one while holding the CTRL key or the cloverleaf-looking key. For now just highlight “ggplot2”.
  5. You will know when the package has been downloaded onto your computer when another greater-than symbol (“>”) appears.

Part 2-Loading the Package into R

  1. Type “library(ggplot2)” and then press the Enter/Return key.
  2. All done.

You will only need to do Part 1 one time on your computer. From now on, you only need to do Part 2 each time you close and restart R.

Method 2 (Quicker)

Use this method once you get more acquainted with the whole copy and paste business. It is much quicker than Method 1.

Part 1-Getting the Package onto Your Computer

  1. Type “install.packages("gplots")” and then press the Enter/Return key.
  2. If you have already installed a package from a server in this R session, R will install the package automatically. If not, R will prompt you to choose a mirror. Again, choose one close to you unless you want to watch a loading bar slowly inch its way to fulfillment.

Part 2-Loading the Package into R

  1. Type “library(gplots)” and then press the Enter/Return key.
  2. All done. R will print a lot more output this time because it also needs to load the other packages that gplots depends on.

Again, you will only need to do Part 1 one time on your computer. From now on, you only need to do Part 2 each time you close and restart R.
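Once you are comfortable with Method 2, you can fold Parts 1 and 2 into a single helper so your script works on any computer. This is just a sketch; load_or_install is a made-up name, not part of any package:

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # Part 1: runs only the first time
  }
  library(pkg, character.only = TRUE)   # Part 2: runs every session
}

# load_or_install("gplots")
# load_or_install("ggplot2")
```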

I thought R was a letter…intro/installation

I will make a confession. This past summer, I didn’t spend my spare time watching relentlessly addicting TV shows or clubbing in San Francisco. Instead, I checked out figures. No, not the sort of figures you’re probably thinking about. The ones that are included in research papers and have the potential to be beautiful works of art (such as the ones found on the Figured Foundation). And you guessed it, R can be used to generate awesome figures for your last-minute, thrown-together presentation or end-of-term paper.

Oh yeah. That figure!

Some examples of R graphs found on the R website.

So what the heck is R?

R is a free “environment” that can be used to run statistical analyses and create amazing graphics (a more comprehensive list of R’s superpowers can be found here). It can be installed on Unix, Windows, and Apple operating systems. It is considered an “environment” because there are tons of add-ons that you can load or write to expand its capabilities. In essence, R is a free sports car to which you can add rocket power and flying capabilities while still looking cool.

If this is your first time using any form of code, don’t be afraid. Once you get over the hump that R (or any programming language) is relentless about spelling, entering code is no different from choosing the components of a sundae (do we want strawberries or blueberries as the fruit topping?). And once you get a little more involved, you can start improvising your own crazy creations by writing your own functions. This past summer I met a guy who used R to scour Travelocity, Priceline, and other plane-ticket websites in search of his ideal fares.
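To see the sundae idea in code, here is a two-minute taste of base R: a built-in function, then a homemade one. The sundae function is a toy name made up for illustration:

```r
# Built-in functions work like a calculator:
mean(c(1, 5, 9))   # 5

# And you can write your own "crazy creations":
sundae <- function(fruit, scoops = 2) {
  paste(scoops, "scoops with", fruit, "on top")
}
sundae("strawberries")   # "2 scoops with strawberries on top"
```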

So how do I install this?

My apologies ahead of time, but this will be a Windows-centric tutorial. If you would like step by step instructions with screenshots, click here.

  1. Go to the R website and click “Download R” under “Getting Started”
  2. Choose a place to download R from. Even though we’re on the limitless and borderless interweb, choosing a location close to you helps speed things up.
  3. Choose which R download matches your operating system in the first box. If you are a Unix or Mac user, I apologize, but this is where we go our separate ways.
  4. Click on “install R for the first time”, then download the file with the biggest font at the top. Then open that puppy up.
  5. Windows might be paranoid and say the publisher could not be verified. Click “Run”, then choose your language.
  6. Click “Next” to start the installation, agree to all the legal writings, and select an installation location.
  7. Select “Core Files” and then either 32-bit or 64-bit files depending on your computer system. (To check, hit Start, right click Computer and select Properties. Look at System Type).
  8. Now you have a choice for Startup Options. I prefer to view the program in multiple separate windows so that I can arrange them on my screen while also having an internet browser or a notepad-type program open.
    If you like what you see in the photo above, click “Yes (customized setup)”. If you prefer to have one window with all the components of the program viewed inside it, click “No (accept defaults)” and skip to Step 11.
  9. If you said yes in Step 8, click “SDI (separate windows)”. Next, you can choose plain-text or HTML help. I suggest HTML help because it is easier to read than plain text, which appears in the console window.
  10. If you are at an institution that utilizes Internet2.dll, select “Internet 2.” If not or if you are unsure, select “Standard”.
  11. Go ahead and create a program shortcut by clicking “Next“.
  12. Choose if you want to have another icon clutter your desktop and/or Quick Launch toolbar. I suggest leaving the two options under “Registry Entries” selected.
  13. Let it do its thing. Go on Facebook, write a Tweet or run to the bathroom really quick.
  14. Things should be all done! Go update your status or Tweet how excited you are to have installed R.