%matplotlib inline
%load_ext rpy2.ipython
import matplotlib.pyplot as plt
import re
from align import aligner
from skbio import DNA, parse_fasta, SequenceCollection
import pandas as pd
Exploring 515F primer targets in archaeal and bacterial SSU rRNA genes
Microbial ecologists have long considered the 515F SSU rRNA gene PCR primer to be “universal.” That is, the 515F primer is thought to amplify Eukarya, Archaea, and Bacteria. Indeed the target of the 515F primer is well conserved across each domain, however, the there are some differences in the primer target between domains that should be considered when evaluating 515F for use in a microbial diversity studies.
To make matters more complicated the 515F primer has changed from its inception to account for new sequence information and it’s not clear from the literature which 515F version is most effective. In this post, we’ll take a look at tbe 515F primer target across a broad diversity of SSU rRNA genes. In subsequent posts we’ll evaluate different 515F versions and we’ll also evaluate reverse primers that can be paired with 515F. While there are excellent tools avaliable for SSU rRNA gene PCR primer analysis (see PrimerProspector, TestPrime), most tools focus on nucleotide frequecies in aligments. In this post, we’ll evaluate 515F by looking at the distribution of 515F target oligos.
We’ll use the 99% OTU seed sequences for the GreenGenes database as reference. You can download the GreenGenes files from here. This analysis will focus on Archaea and Bacteria.
First we need to import some python functions and modules as well as load some extensions for the IPython notebook.
Eventually we’ll want to explore target sites by their taxonomic affiliation so we need to get the taxonomic information from the GreenGenes files we downloaded (above). The following code is a quick and dirty way to populate a pandas DataFrame with the taxonomic annotations for the gene sequences we’ll be working with.
def parse_tax(line):
id, tax_str = line.split("\t")
= tax_str.rstrip().split(";")
k, p, c, o, f, g, s = {"kingdom" : k.split("__")[1],
tax_dict "phylum" : p.split("__")[1],
"class" : c.split("__")[1],
"family" : f.split("__")[1],
"order" : o.split("__")[1],
"genus" : g.split("__")[1],
"species" : s.split("__")[1],
"id" : id}
return tax_dict
= pd.DataFrame.from_records([parse_tax(line) for line in
df open("data/gg_13_5_otus/taxonomy/99_otu_taxonomy.txt")])
"id", inplace=True) df.set_index(
Now that we have the taxonomy for each sequence, let’s get the 515F target site. To do this we just align one combination of the degenerate 515F primer to each SSU rRNA gene sequence. We then pull out the aligned region from the SSU rRNA gene sequence – this is the 515F target! I’m using the glocal
method for the Python
aligner package for the alignment and I’m using scikit-bio to parse and manage the SSU rRNA gene sequenes.
= "data/gg_13_5_otus/rep_set/99_otus.fasta"
fn = DNA.iupac_degenerate_characters()
degens = SequenceCollection.from_fasta_records([(n, s) for n, s in parse_fasta(fn)
seqs if not any(i in s for i in degens)],
DNA)
= [seq[0] for seq in seqs.iteritems()] ids
%%time
= DNA("GTGCCAGCCGCCGCGGTAA") #GTGCCAGCMGCCGCGGTAA M = [CA]
F515
= [aligner(F515.sequence,
primer_targets_515 1].sequence,
seq[= "glocal")[1]
method for seq in seqs.iteritems()]
CPU times: user 20min 29s, sys: 112 ms, total: 20min 30s
Wall time: 20min 31s
Now we can join our target and taxonomy information into a single DataFrame. Each row represents one target from one gene. Specifically, each row has the target sequence and corresponding taxonomic information.
= df.join(pd.Series(primer_targets_515, index=ids, name = "515F"), how = "left")
df_targets =0, subset=["515F"], inplace=True) df_targets.dropna(axis
I love pandas but I think for this application it will be best to use the (mind-blowingly amazing) R package dplyR. So, we need to push our pandas DataFrame into an R session. I’m using the rmagic
functions from IPython/rpy2 to move from Python to R.
%R -i df_targets
%%R
library(dplyr)
library(ggplot2)
library(ggthemes) library(magrittr)
Ok, now we can make some figures! Let’s first look at the ten most abundant archaeal and bacterial 515F targets. You can see below that the bacterial 515F targets are dominated by a single sequence, and, that although the nineteen targets are similar in sequence there is little overlap in target representation between Archaea and Bacteria.
Code
%%R -w 600
= df_targets %>%
d %>%
group_by(kingdom, X515F) = n()) %>%
summarize(count filter(rank(desc(count), ties.method = "random") <= 10) %>%
%>%
arrange(desc(count), kingdom) = factor(X515F, levels = unique(.$X515F)))
mutate(X515F
= ggplot(d, aes(x = X515F, y = count))
p
= p + facet_wrap(~kingdom, ncol = 1, scales = "free_y")
p
= p + geom_bar(stat = "identity", fill = "#14979B")
p
= p + labs(x = "")
p
= p + scale_y_log10()
p
= p + theme_bw()
p
= p + theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 12),
p = element_text(size = 12),
axis.text.y = element_text(size = 16),
axis.title = element_blank(),
strip.background = element_text(size = 16))
strip.text
p
If we dig a little deeper into to archaeal 515F targets, we see some interesting trends among phyla. There are very few Nanoarchaea sequences in the reference database but they all have the same 515F target sequence which appears to be unique to Nanoarchaea. Also, there are a couple relatively abundant 515F target sequences with insertions. Eight of the ten most abundant archaeal 515F targets are identical over the last six nucleotides at the 3’ end of the 515F primer.
Code
%%R -h 550 -w 400
= df_targets %>%
d filter(kingdom == "Archaea", phylum != "") %>%
{= group_by(., X515F) %>%
top10 = n()) %>%
summarize(count filter(rank(desc(count), ties.method = "random") <= 10) %>%
"X515F") %>%
extract2(%>%
unique as.character
filter(., X515F %in% top10)
%>%
} %>%
group_by(phylum, X515F) = n()) %>%
summarize(count
{= group_by(., X515F) %>%
psort = sum(count)) %>%
summarize(S %>%
arrange(desc(S)) "X515F") %>% as.character
extract2(= factor(X515F, levels = psort))
mutate(., X515F
}
= ggplot(d, aes(x = X515F, y = count))
p
= p + facet_wrap(~phylum, ncol = 1, scales = "free_y")
p
= p + geom_bar(stat = "identity", fill = "#14979B")
p
= p + labs(x = "")
p
= p + scale_y_log10()
p
= p + theme_bw()
p
= p + theme(axis.text.x = element_text(angle = 60, hjust = 1, size = 12),
p = element_text(size = 12),
axis.text.y = element_text(size = 16),
axis.title = element_blank(),
strip.background = element_text(size = 16))
strip.text
p
The Bacteria appear to be dominated by a single target sequence (above). Let’s get that sequence explicitly.
%%R
max.target = df_targets %>%
filter(kingdom == "Bacteria") %>%
%>%
group_by(X515F) = n()) %>%
summarize(count %>%
arrange(desc(count)) filter(rank(desc(count)) == 1) %>%
"X515F") %>% as.character
extract2(
"The most abundant target sequence among bacteria is", max.target)
paste(# <!-- collapse=True -->
[1] "The most abundant target sequence among bacteria is GTGCCAGCAGCCGCGGTAA"
There are some phyla, however, that are not as dominated by the most abundant bacterial 515F target. Here is a histogram of representation (percentages of total targets) within phylum for the most abundant bacterial 515F target. You can see that there are nine phyla for which the most abundant 515F bacterial target represents less than 80% of total targets.
%%R -h 350
%>%
df_targets filter(kingdom == "Bacteria", phylum != "" ) %>%
%>%
group_by(phylum, X515F) = n()) %>%
summarize(count %>%
group_by(phylum) = count / sum(count)) %>%
mutate(count.relative filter(X515F == max.target) %>%
"count.relative") %>% hist(breaks = 50,
extract2(= "Histogram of most abundant target representation in phlya")
main # <!-- collapse=True -->
Let’s see what the 515F targets look like for those nine phyla (above). The plot below shows the five most abundant 515F targets ranked by max representation within a phylum for the nine selected phyla. You can see that FCPU426, WS5, and GN01 515F targets are still dominated by the most abundant 515F target across all bacteria (GTGCCAGCAGCCGCGGTAA). The TM7 phylum 515F targets differ from the most abundant target at the 3’ end of the 515F primer.
Code
%%R -w 600 -h 500
= df_targets %>%
d filter(kingdom == "Bacteria", phylum != "" ) %>%
%>%
group_by(phylum, X515F) = n()) %>%
summarize(count %>%
group_by(phylum) = count / sum(count)) %>%
mutate(count.relative
{= filter(., X515F == max.target, count.relative <= 0.80) %>%
keep "phylum") %>%
extract2(%>%
unique as.character
filter(., phylum %in% keep)
%>%
}
{= group_by(., X515F) %>%
top = max(count.relative)) %>%
summarize(m filter(rank(desc(m), ties.method = "random") <= 5) %>%
%>%
arrange(desc(m)) "X515F") %>%
extract2(%>%
unique as.character
filter(., X515F %in% top) %>% mutate(X515F = factor(X515F, levels = top))
}
= ggplot(d, aes(x = X515F, y = count.relative))
p
= p + facet_wrap(~phylum, ncol = 3, scales = "free_y")
p
= p + geom_bar(stat = "identity", fill = "#14979B")
p
= p + labs(x = "")
p
= p + theme_bw()
p
= p + theme(axis.text.x = element_text(angle = 65, hjust = 1, size = 12),
p = element_text(size = 12),
axis.text.y = element_text(size = 16),
axis.title = element_blank(),
strip.background = element_text(size = 16))
strip.text
p# <!-- collapse=True -->
I would guess that these nine phyla are the most likely to be missed by commonly used variants of 515F primers.
In the next post we’ll test that hypothesis by evaluating how well the 515F primer matches the targets we’ve identified…