RESISPART

Workshop Tutorial

1. Introduction

The goal of this workshop is to gain hands-on experience in analyzing 16S rRNA gene sequences generated by next generation sequencing (NGS) platforms, such as the Illumina sequencers. The 16S rRNA gene is commonly used by the research community as a marker to decipher the diversity of the microbial community in a sample. There are two reasons for this. First, the 16S rRNA gene is universally present in all prokaryotes (including Bacteria and Archaea). Second, although 16S rRNA gene sequences are highly conserved in both Bacteria and Archaea, there is enough variability within the gene (i.e., 9 hypervariable regions) to distinguish different species. Hence, by sequencing the 16S rRNA genes in a microbial community and matching the sequences to a set of 16S rRNA gene sequences with known taxonomy information, one can determine how many species are in the samples and what they are.

Since NGS is limited in how long a DNA sequence it can read, only a portion of the 16S rRNA gene can be targeted. Usually the first step is to PCR-amplify a region of the 16S rRNA gene from the DNA extracted from the samples. The amplicons are then sequenced by the NGS technology to produce millions of DNA reads. These sequence reads are then filtered and denoised to reduce errors and artifacts. The filtered reads are then taxonomically assigned to determine their genus or species. The outcome of these processes is a read count table with columns representing samples and rows representing organisms (either genus, species, or OTUs - operational taxonomic units, if the taxonomy has yet to be determined). This table is usually called an OTU table.
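To make the idea of an OTU table concrete, here is a minimal sketch in R (the software introduced in the next section). The taxon names, sample names, and read counts are all invented for illustration; a real table would come out of the processing steps described above.

```r
# Hypothetical read counts: 3 organisms (rows) x 2 samples (columns).
# Every name and number here is made up for illustration only.
otu_table <- matrix(
  c(120, 30,
     45, 80,
      0, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Escherichia", "Bacteroides", "OTU_001"),
                  c("SampleA", "SampleB"))
)
otu_table

# Total number of reads in each sample (column sums)
reads_per_sample <- colSums(otu_table)
reads_per_sample

# Number of organisms detected (count above zero) in each sample
taxa_per_sample <- colSums(otu_table > 0)
taxa_per_sample
```

Reading down a column gives the composition of one sample; reading across a row shows how one organism is distributed over the samples.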

In this workshop we will learn how to use several commonly used software tools to analyze a set of demo sequences. After this workshop, we will have practical hands-on experience with the bioinformatics tasks required to process this type of data in order to understand the microbial diversity of the samples.

2. Installation of the R software and packages for analyzing microbiome sequence data

R, from “The R Project for Statistical Computing”, is a free software environment for statistical analysis and data visualization. R is a very popular development platform for many bioinformatics tools, including those that are useful for analyzing NGS sequence reads and for microbiome diversity analysis.

In this workshop we will use several software packages developed on the R platform. Hence the first thing to do is to download and install R on your computer.

For this workshop we will practice the analysis on a personal computer. Click one of the links below to download either the Windows 10 or the Mac OS X version of R:

Windows 10 version:

https://vps.fmvz.usp.br/CRAN/bin/windows/base/R-3.6.1-win.exe

Mac OS X version:

https://brieger.esalq.usp.br/CRAN/bin/macosx/R-3.6.1.pkg

Click the link above and save the file to the default download location on your computer. Find the downloaded file and double-click it to install R. During the installation, just accept the default options.

Alternative:
These are some other mirrored web sites that provide this download:

Brazil
Computational Biology Center at Universidade Estadual de Santa Cruz
Universidade Federal do Parana
Oswaldo Cruz Foundation, Rio de Janeiro
University of Sao Paulo, Sao Paulo
University of Sao Paulo, Piracicaba

For a complete list of all the mirrored sites in different countries, visit this link:
https://cran.r-project.org/mirrors.html

You can also download another version of R, Microsoft R Open, here:
https://mran.microsoft.com/ope
It is an enhanced version of the original R. The most important feature of R Open is that it can use multiple CPU cores, which can speed up many processes during the analysis. The standard R for Windows uses only a single core for most operations, so it can be very slow, especially with a larger (sequence) data set.

After R is installed on your computer, find the shortcut and launch the program. You should choose the “x64” version, which is faster. You will be presented with an R graphical interface:

There are many online R tutorials for beginners, including many YouTube videos, such as:

https://www.youtube.com/watch?v=iijWlXX2LRk

Next we’ll get our hands dirty by testing out R in Windows.

3. Basic R operations

The R command-line interface

The R command-line interface accepts the following kinds of input:

Comments - any line that starts with a # sign is treated as a comment, and the R program will ignore the entire line.
Commands - if a line does not begin with a # sign, it is treated as a command, and whatever you type in must obey the R language and grammar. If you type something that R doesn’t recognize, it will give you an error output.
Multi-line command input - R has no special line-continuation character. If a line ends while the command is still incomplete (for example, an opening parenthesis has not been closed, or the line ends with an operator such as + or with a comma), R changes its prompt from > to + and waits for the rest of the command on the following lines. Once the command is complete, R collects all the lines together and executes them. This comes in handy because a command often has many parameters, or a long parameter (such as a long file name).
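In practice, R continues a command onto the next line whenever the expression typed so far is still incomplete, and the prompt changes from > to + while it waits. A small sketch of two common cases follows; the variable names are only illustrative.

```r
# The opening parenthesis is not closed on the first line,
# so R keeps reading until the matching ")" arrives.
total <- sum(1, 2, 3,
             4, 5, 6)
total

# A line that ends with an operator is also incomplete,
# so R reads the next line as part of the same command.
long_sum <- 10 +
  20 +
  30
long_sum
```

If the + prompt appears by accident (for example after a forgotten closing parenthesis), pressing Esc in the R console cancels the unfinished command.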

Now copy and paste the example R code below into your R command line interface:

# Below is working R code that you can copy and paste into R to execute some R commands
message("Hello! Welcome to the RESISTPART Bioinformatics Workshop")
2+2
3*3

You should get an output like this:

Hello! Welcome to the RESISTPART Bioinformatics Workshop
[1] 4
[1] 9

Using R as a calculator

These are some basic mathematical operators:

# BASIC ARITHMETIC OPERATORS
2-5                # subtraction
6/3                # division
3+2*5              # note order of operations exists
(3+2)*5            # use parentheses to force
                   # a different order of operations
4^3                # raise to a power is ^
exp(4)             # e^4 = 54.598 (give or take)
log(2.742)         # natural log of 2.742
log10(1000)        # common log of 1000
pi                 # 3.14159... 

Using R for logical tests

These are some basic logical operators:

# BASIC LOGICAL OPERATORS
1 & 0              # AND
1 | 0              # OR
2 > 1              # comparison
2 < 1              # comparison
3>=3               # comparison
3<=4               # comparison

Store result (or anything) in a variable (object)

# Assign a number to a variable
x <- 5
# This works exactly the same
x = 5
# Now just type the variable name to print out its content
x
#
# Let's get more complicated
longvariablename = "This is the content of a variable with a long name"
longvariablename
# Store the result of a calculation in the variable
y = 1+2+3+4+5+6+7+8
y
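Once a value is stored, the variable can be reused in later calculations. The sketch below uses made-up variable names to show this, along with two points that often trip up beginners: names are case sensitive, and a variable can be removed with rm().

```r
# Store two values and combine them in a new calculation
width <- 4
height <- 2.5
area <- width * height
area

# Changing a variable does not update results computed earlier
width <- 10
area            # still the value computed from the old width

# R is case sensitive: "area" and "AREA" are different names
AREA <- 99
area_unchanged <- area != AREA
area_unchanged

# List the variables in the workspace, then remove one
ls()
rm(AREA)
```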

R data types

R has 5 commonly used basic data types:

character
numeric (real or decimal)
integer
logical (true or false)
complex

#### character data type
x="abcdefg"
x
sentence="This is a sentence"
sentence

#### numeric data type
x=123
class(x)
x=1.234
class(x)
y=3
class(y)

### integer
y=3L
class(y)
y=as.integer(3)
class(y)

#### logical
x= 1 > 2        # x has the logical value FALSE
x= 3 < 4        # x has the logical value TRUE
x= 1 & 1        # x is TRUE
x= 1 & 0        # x is FALSE
x= TRUE | FALSE # x is TRUE
y= !x           # y is FALSE, ! is the negation operator

#### complex
#A complex value in R is defined via the pure imaginary value i. 
z = 1 + 2i     # create a complex number 
z              # print the value of z 
class(z)        # print the class name of z 
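Values can also be tested for their type and converted from one type to another. The following is a small sketch using base R conversion functions; the variable names are illustrative only.

```r
# Convert text to a number
num_from_text <- as.numeric("3.14")
class(num_from_text)

# Convert a number to text
text_from_num <- as.character(42)
is.character(text_from_num)

# Converting to integer drops the decimal part (it does not round)
whole <- as.integer(7.9)
whole

# 0 converts to FALSE, any other number to TRUE
flag <- as.logical(0)
flag

# Text that is not a number becomes NA (missing);
# suppressWarnings() hides the warning R would normally print
not_a_number <- suppressWarnings(as.numeric("abc"))
is.na(not_a_number)
```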

R data structures

The data (of different types) can be combined to form data structures

vector - an ordered sequence of values of the same type
matrix - a 2D table of the same type of data
array - a multi-dimensional table of the same type of data
list - a sequence of values of various types
data frame - a 2D table of the same or different types of data
factor - a special type of data for statistical modelling

1. Vector

We use the c() function to create a vector:

#### vectors examples
# numeric vector
X <- c(1,-2,5.3,6,-20,4)
print(X)

# character vector 
Y <- c("one","two","three") 
print(Y)

#logical vector
Z <- c(FALSE,TRUE,FALSE,FALSE,TRUE,FALSE)
print(Z)

# Accessing vector elements using position.
x <- c("Jan","Feb","Mar","April","May","June","July")
y <- x[c(2,3,6)]
print(y)
 
# Accessing vector elements using logical indexing.
v <- x[c(TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE)]
print(v)
 
# Accessing vector elements using negative indexing.
t <- x[c(-2,-5)]
 
print(t)
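Most R operations work on a whole vector at once, which is one reason vectors are used so heavily. A minimal sketch with invented numbers:

```r
# Arithmetic is applied to every element
weights <- c(2.5, 3.0, 4.5)
doubled <- weights * 2
doubled

# Summary functions take the whole vector
n_values <- length(weights)
total <- sum(weights)
average <- mean(weights)

# A comparison returns a logical vector, which can be used for indexing
heavy <- weights[weights > 2.8]
heavy

# The colon operator builds a sequence of whole numbers
one_to_five <- 1:5
one_to_five
```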

2. Matrix

A matrix in R consists of a 2D table of data arranged in rows and columns. We use the matrix() function to create a matrix:

# create a matrix with 3 rows and 5 columns (3x5), filling the data by column, not by row
mymatrix <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15), nrow=3, ncol=5, byrow=FALSE)
mymatrix

# Accessing elements of a matrix by row and column coordinates
mymatrix[3,5]     # row 3, column 5
mymatrix[2:3,3:5] # rows 2-3, columns 3-5

# calculations can be performed on all elements of a matrix at once
mymatrix + 1
mymatrix - 1
mymatrix * 2
mymatrix / 4
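The byrow argument decides whether the values fill the matrix column by column or row by row. The sketch below builds the same six numbers both ways so the difference is visible; the object names are illustrative.

```r
# Same data, two fill orders
by_col <- matrix(1:6, nrow = 2, ncol = 3)                # default: fill by column
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fill by row
by_col
by_row

# Dimensions: number of rows, then number of columns
dim(by_col)

# Leaving an index empty selects a whole row or column
first_row_col_fill <- by_col[1, ]
first_row_row_fill <- by_row[1, ]

# Row totals, and the transpose (rows and columns swapped)
row_totals <- rowSums(by_col)
flipped <- t(by_col)
dim(flipped)
```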

3. Array

An array in R is similar to a matrix, except that a matrix has only two dimensions (2D) while an array can have multiple dimensions:

# create an array with a dimension of 2x3x3:
myarray <- array(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18), dim=c(2,3,3))
myarray

# array elements are accessed in a similar manner, with one index per dimension:
myarray[2,1,3]

4. List

A list is like a vector, but it can hold different types of data, even another list.

list1=list(1,2,3,"A","B","C")
list2=list("A","B","C",1,2,3,T,F)
list3=list(list1,list2)
list3
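List elements can be given names, and there are two ways to pull them out: double brackets (or $) return the element itself, while single brackets return a smaller list. A short sketch with a made-up sample record:

```r
# A named list describing one hypothetical sample
sample_info <- list(name = "S1", reads = 15000, passed_qc = TRUE)

# Double brackets or $ return the element itself
sample_name <- sample_info[["name"]]
read_count <- sample_info$reads
class(read_count)

# Single brackets return a list containing the element
sub_list <- sample_info["name"]
class(sub_list)

# The extracted value can be used directly in a calculation
double_reads <- sample_info$reads * 2
double_reads
```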

5. Data frame

A data frame is a 2D data table that can hold different types of data (a matrix or an array can only hold a single type of data).

# build a small example data frame
Number = c(2, 3, 5) 
Character = c("aa", "bb", "cc") 
Logical = c(TRUE, FALSE, TRUE) 
df = data.frame(Number, Character, Logical)
df

# R comes with a built-in data frame called "mtcars"
mtcars
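Columns and rows of a data frame can be picked out by name, by position, or by a condition. The sketch below uses an invented table of students. The stringsAsFactors = FALSE argument is set explicitly because in R 3.6 text columns are otherwise converted to factors.

```r
# A small made-up data frame
students <- data.frame(
  name = c("Ana", "Ben", "Carla"),
  age = c(17, 18, 19),
  passed = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Show the structure: column names, types, and first values
str(students)

# One column, selected by name
ages <- students$age
mean_age <- mean(ages)

# One row, selected by position
second_row <- students[2, ]
second_row

# Rows that satisfy a condition
adults <- students[students$age >= 18, ]
adults
n_adults <- nrow(adults)
```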

6. Factor

A factor in R is a special data type used to store categorical variables, whose values can be either numeric or string. The most important advantage of converting an integer or character variable to a factor is that it can then be used in statistical modeling, where it will be handled correctly as a categorical variable. factor() is the function used to convert numeric or character variables to factors:

# create a factor using the factor() function, passing the values as a vector:
myfactor=factor(c("Brazil","USA","USA","Brazil","Norway","Norway","Brazil"))
# create another factor where the order of the levels is set explicitly
# (a level such as "Sweden" may be listed even if it does not occur in the data)
myfactor=factor(c("Brazil","USA","USA","Brazil","Norway","Norway","Brazil"),levels=c("USA","Brazil","Norway","Sweden"))
myfactor
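A few base R functions help to inspect a factor. The sketch below builds its own small factor (the values are illustrative) and shows how the levels, the counts per level, and the underlying integer codes can be looked at.

```r
# A factor with an explicit level order
countries <- factor(c("Brazil", "USA", "USA", "Brazil", "Norway"),
                    levels = c("USA", "Brazil", "Norway"))

# The categories, in their stored order
country_levels <- levels(countries)
n_levels <- nlevels(countries)

# How many times each category occurs
counts <- table(countries)
counts

# Internally each value is stored as the position of its level
codes <- as.integer(countries)
codes
```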

R Graphic Output

One of the strengths of R is its capability to generate graphical results such as plots, charts, and heatmaps. For example:

Scatter plot

# We will use the built-in mtcars data set
plot(x = mtcars$wt, y = mtcars$mpg)

# passing multiple variables to plot
plot(mtcars[, 4:6])

Line plot

# base graphic
plot(x = pressure$temperature, y = pressure$pressure, type = "l")

# add points
points(x = pressure$temperature, y = pressure$pressure)

# add second line in red color
lines(x = pressure$temperature, y = pressure$pressure/2, col = "red")

# add points to second line
points(x = pressure$temperature, y = pressure$pressure/2, col = "red")

Bar Chart

Some examples:

# Plot the built-in BOD data set
barplot(height = BOD$demand, names.arg = BOD$Time)

# Plotting Categorical data
age <- c(17,18,18,17,18,19,18,16,18,18)

# use table() function to categorize the count data by age
table(age)

# now make a slightly fancier bar plot:
barplot(table(age),
main="Age Count of 10 Students",
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10
)

Histogram

# A very basic histogram showing a set of normally distributed data
# make a set of 500 normally distributed values
n=rnorm(500)
hist(n)

# plot mpg frequency data
# ask for about 10 bins (R treats this number as a suggestion)
hist(mtcars$mpg, breaks = 10)

Box plot

# When x is a factor (as opposed to a numeric vector), plot() automatically creates a box plot:
plot(as.factor(mtcars$cyl), mtcars$mpg)

# boxplot of mpg based on interaction of two variables
boxplot(mpg ~ cyl + am, data = mtcars)

# The type argument of plot() controls how a series is drawn:
# "l" = lines, "p" = points, "b" = both
plot(LakeHuron, type="l", main='type="l"')
plot(LakeHuron, type="p", main='type="p"')
plot(LakeHuron, type="b", main='type="b"')
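Plots shown on screen can also be written to a file, which is handy for reports. The following is a minimal sketch that saves a scatter plot as a PDF in R's temporary folder; the file name and plot labels are arbitrary choices made for this example, and png() can be used in the same way for an image file.

```r
# Choose where to save the plot (a temporary folder is used here)
out_file <- file.path(tempdir(), "mpg_vs_weight.pdf")

# Open a PDF graphics device; plots drawn from now on go to the file
pdf(file = out_file, width = 7, height = 5)

plot(x = mtcars$wt, y = mtcars$mpg,
     main = "Fuel economy vs. weight",
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon")

# Close the device so the file is written to disk
invisible(dev.off())

# Confirm that the file was created
file.exists(out_file)
```

Forgetting dev.off() is a common mistake: until the device is closed, the file stays incomplete and later plots keep going into it.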

Note: R comes with about 100 built-in data sets right after you install the software (the exact number depends on the R version). To view a list of these data sets, use the function data().

Beyond basic R - install R packages

Copy and paste the R package installation code below to install the packages that we will be using in this workshop:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2")
BiocManager::install("ggplot2")  
BiocManager::install("ggpubr")
BiocManager::install("phyloseq")
BiocManager::install("Biostrings")
BiocManager::install("writexl")

After installing these packages, we need to load them before we can use them. Let’s load them all in a single shot:

library("dada2")
library("ggpubr")
library("ggplot2")
library("phyloseq")
library("Biostrings")
library("writexl")
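If one of the library() calls fails, it usually means that package did not install correctly. A quick way to see which packages are available is sketched below; it is only a convenience check, and the variable names are made up for this example.

```r
# The packages used in this workshop
needed <- c("dada2", "ggpubr", "ggplot2", "phyloseq", "Biostrings", "writexl")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
installed

# Report anything that still needs to be installed
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All workshop packages are installed.")
}
```

Any package reported as missing can be installed again with BiocManager::install() before continuing.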