%load_ext rpy2.ipython
Melting your phyloseq object with dplyr/tidyr
If you’re a microbial ecologist and you haven’t used phyloseq, put away your vintage scarf and thick-rimmed glasses because you are not hip. phyloseq is the nexus of all that is good in the universe. In this post we’ll look at one of the greatest phyloseq functions, psmelt
. psmelt
combines your sample info, OTU counts, and taxonomic annotations into a single dataframe where every row represents an OTU-sample combination. Hhhhwhat!?! Amazing, I know. From this omnipotent dataframe you can do anything. So, how do we use psmelt
? And can we re-create psmelt
with dplyr and tidyr?
Let’s start by bringing in some R packages and loading the rmagic
extension for IPython.
%%R
library(phyloseq); library(tidyr)
library(dplyr)
library(magrittr) library(ggplot2)
I’m going to use data from a paper we recently submitted. See the pre-print here if you’re interested. A typical microbiome data set consists of three tables:
- A table of sample information.
- A table of OTU counts in the samples.
- A table of taxonomic info for each OTU.
In this example the OTU counts and taxonomic info are combined in a json biom formatted table. We have the sample information in a QIIME formatted sample metadata table. Don’t get bogged down in the file formats, however. They’re just tables and you can get tables into R easily. Focus on the concepts and what information you need to do an analysis.
We’ll write a quick function to handle the QIIME formatted sample data table (note how we’re just using the base R read.table
function) and we’ll use the import_biom
function from phyloseq to import our OTU counts and OTU taxonomic annotations. We’ll combine all the info into a phyloseq object
.
%%R
= function(fn) {
read.qiime = "\t", comment = "", header = TRUE, stringsAsFactors = FALSE) %>%
read.table(fn, sep = X.SampleID) %>%
rename(SampleID = .$SampleID; .} %>%
{rownames(.) %>% sample_data
data.frame
}
= import_biom("data/otu_table_wtax.biom")
physeq = read.qiime("data/sample_data_combined_qiime_format.tsv")
sample_data(physeq) physeq
phyloseq-class experiment-level object
otu_table() OTU Table: [ 4340 taxa and 158 samples ]
sample_data() Sample Data: [ 158 samples by 46 sample variables ]
tax_table() Taxonomy Table: [ 4340 taxa by 8 taxonomic ranks ]
Great! Now let’s melt all that information into an all-powerful, all-knowing, benevolent dataframe. (Evil laugh)
%%R
= psmelt(physeq)
mdf %>% select(OTU, SampleID, Abundance) %>% head mdf
OTU SampleID Abundance
20025 OTU.1106 4502024.3 2233
288 DQ531697.1 site16 2147
19962 OTU.1106 site19 1505
272 DQ531697.1 site15 1426
310 DQ531697.1 site17 1278
456836 OTU.36 site19 1268
To keep this post short, I won’t demonstrate where you can go from a melted phyloseq object
. Suffice to say that you can go anywhere with your melted phyloseq object
/dataframe. Think of it like a magic carpet.
With really big OTU tables, melting can take a long time…
%%R
system.time(psmelt(physeq))
user system elapsed
22.660 0.056 22.728
It takes about 30-ish seconds with our example phyloseq object with the psmelt
function. I wonder how fast dplyr + tidyr can melt our phyloseq object?
Let’s re-create the psmelt
function with dplyr + tidyr.
We need to,
- get our tables (sample data, OTU counts, taxonomy),
- combine our taxonomic info with our OTU counts,
- gather this combined table into a long form version, and
- add in the sample data.
%%R
= function(physeq) {
psmelt.dplyr = data.frame(sample_data(physeq))
sd = data.frame(tax_table(physeq)) %>% add_rownames("OTU")
TT = data.frame(otu_table(physeq), check.names = FALSE) %>% add_rownames("OTU")
otu.table %>%
otu.table %>%
left_join(TT) "SampleID", "Abundance", setdiff(colnames(otu.table), "OTU")) %>%
gather_(
left_join(sd) }
%%R
= psmelt.dplyr(physeq)
mdf.dplyr %>% select(OTU, SampleID, Abundance) %>% head mdf.dplyr
Joining by: "OTU"
Joining by: "SampleID"
OTU SampleID Abundance
1 OTU.758 G4F20 2
2 OTU.2942 G4F20 0
3 OTU.753 G4F20 0
4 OTU.752 G4F20 0
5 OTU.751 G4F20 0
6 OTU.750 G4F20 0
…and let’s see how fast it is.
%%R
system.time(psmelt.dplyr(physeq))
Joining by: "OTU"
Joining by: "SampleID"
user system elapsed
1.04 0.00 1.04
Wow. Roughly an order of magnitude faster. dplyr and tidyr are so great. And look at that syntax!
I shold note that I’m using phyloseq version 1.9.15 in this demo.
The psmelt
function is so useful (and much more rigorous/fool-proof than our quick dplyr/tidyr version) but if you have a huge phyloseq object that takes forever to melt, try dplyr + tidyr.
Let’s do one quick example from the melted phyloseq object
to give you a taste of how useful it is. Let’s say we want to make phylum level stacked bar charts for all the studies in the example data. Here’s how you might do it from the melted phyloseq object
.
%%R -w 1200
= mdf.dplyr %>%
d %>%
group_by(SampleID, Rank2) filter(Abundance > 0) %>%
= sum(Abundance), study = first(study)) %>%
summarize(rank2.abund = rank2.abund / sum(rank2.abund))
mutate(relative.abundance
= ggplot(d, aes(x = SampleID, y = relative.abundance, fill = Rank2)) +
p ~study, drop = TRUE, space = "free", scales = "free") +
facet_grid(.= "identity", width = 0.85) +
geom_bar(stat + theme(axis.text.x = element_blank(), legend.position = "none", strip.background = element_blank())
theme_bw() p
Snap.
%%R
= lapply(names(sessionInfo()$otherPkgs), packageVersion)
pckgs = names(sessionInfo()$otherPkgs)
names(pckgs) data.frame(pckgs)
ggplot2 magrittr tidyr dplyr phyloseq
1 1.0.0 1.5 0.2.0 0.4.1 1.9.15