16S rRNA Sequencing From Gut Microbiome (Stool)

Do I extract RNA or DNA from gut microbiome (stool) samples if I want to do 16S rRNA sequencing?

In general, you extract DNA, then PCR-amplify the 16S rRNA gene regions, and finally sequence them. Here are some links:

16S rRNA Gene Sequencing vs. Shotgun Metagenomic Sequencing

Are you a company, lab or researcher planning a new microbiome study? If so, you are probably considering whether to conduct 16S rRNA gene sequencing or shotgun metagenomic sequencing. Although 16S rRNA gene sequencing has been more commonly used for microbiome studies to date, shotgun metagenomics is becoming more accessible and popular in microbiome research. However, each method has its pros and cons, which should be considered before you decide which sequencing method to use. Here is your one-stop guide to 16S rRNA gene sequencing vs shotgun sequencing to help you generate the best data for your research.


While the microbiota that live on and in the human body have long been recognized as critical to understanding a variety of human diseases, we are only beginning to understand their equally critical role in maintaining human health. To facilitate this understanding, the National Institutes of Health launched the Human Microbiome Project (HMP) in 2008 [1] to sequence the microbiome of healthy human subjects. One of the primary goals of the HMP is to characterize the human microbiome of healthy individuals, and to describe, if possible, a core microbiome. The NIH enrolled over 200 healthy subjects, both male and female, and collected microbial DNA samples from 18 different body sites [2]. Researchers from many different academic institutions are part of the HMP Data Analysis Working Group analyzing the HMP sequence data to answer a number of fundamental questions for a basic understanding of a healthy human microbiome. The HMP is using both 16S rRNA tag sequencing to elucidate the types of microbes and their relative abundances and shotgun metagenomic sequencing to find out what functions these microbes may be performing. These analyses, being published as an overview manuscript [3] and a series of companion papers, lay the groundwork for further research in the human microbiome: the similarities and differences between individuals and body sites, and through time the numbers and types of microbes and what role they play in human health.

The 16S rRNA gene is considered the gold standard for phylogenetic studies of microbial communities and for assigning taxonomic names to bacteria. The explosion of sequence data brought about by Next-Generation Sequencing (NGS) is highlighting a richness of microbes not previously anticipated. NGS comes with a clear trade-off. The number of reads sequenced is greater by orders of magnitude than previous methods (e.g., Sanger sequencing), but the reads are much shorter. The read length using the Roche GS-FLX (‘454’) technology has been increasing rapidly from 100 nt in 2006 to over 400 nt at present. Unfortunately, taxonomists cannot provide taxonomic names for all of the novel organisms discovered by this unprecedented depth of sampling. Even in sections of the bacterial tree that are well described, existing tools are generally not sufficient to provide species names or phylogenetic information for the millions of short reads. For instance, the most commonly used tool for assigning taxonomy to 16S tags, the Ribosomal Database Project (RDP) Classifier [4], at best classifies 16S sequences only as far as the genus level, although many sequences that are distant from the commonly used reference sequences or that are taxonomically ambiguous can only be described to class, order or family levels. To complement analyses relying on limited taxonomic names, 16S rRNA sequences can be clustered together into operational taxonomic units (OTUs) at the 97% similarity level (3% difference). This level of sequence-based clustering is generally recognized as providing differentiation of bacterial organisms below the genus level, although it would be inaccurate to assume that this level of clustering consistently defines either microbial species or strains.
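The idea of clustering at 97% similarity can be illustrated with a small sketch. This is not the method used by the HMP or by any particular tool: it assumes equal-length, already-aligned toy sequences, measures identity position by position, and uses a simple greedy centroid scheme. The function names `identity` and `greedy_otus` are made up for this example.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy example needs equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Greedy centroid clustering.

    Each sequence joins the first existing centroid it matches at or above
    `threshold`; otherwise it becomes the centroid of a new OTU.
    """
    centroids = []  # one representative sequence per OTU
    members = []    # member lists, parallel to `centroids`
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                members[i].append(seq)
                break
        else:
            # No centroid was close enough: start a new OTU.
            centroids.append(seq)
            members.append([seq])
    return centroids, members


if __name__ == "__main__":
    reads = ["A" * 100, "A" * 98 + "CC", "A" * 90 + "C" * 10]
    otus, groups = greedy_otus(reads, threshold=0.97)
    print(len(otus), [len(g) for g in groups])
```

Because the result of greedy clustering depends on input order, production tools typically sort sequences (for example by abundance) before clustering.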

Previous studies have demonstrated a great deal of variation in gut and nasal microbiota between individuals [5], [6], [7], [8], [9], and in the microbiota at different body sites within a single individual [10]. This study uses the largest number of healthy subjects to date to look for the possibility of a set of core OTUs common across individuals and body sites within the larger context of variation. Using the OTU approach, we analyze the HMP 16S tag-sequencing data to look for organisms that occur in most or all healthy subjects. The depth of sequencing per sample in this project is not adequate to understand the nature or extent of rare organisms that often play an important role in health and disease; instead, we focus on the more abundant organisms that are common across individuals.

Bioinformatics Analysis

Figure 1. Illumina MiSeq 16S rRNA data from saliva samples preserved for over 6 years using Norgen’s Saliva DNA Collection and Preservation Device (Cat. RU49000). The saliva DNA was isolated using Norgen’s Saliva DNA Isolation Kit (Cat. RU45400) from saliva that had been preserved for various periods of time up to 6 years at room temperature. Relative abundance (%) is shown for each sample for the 10 most abundant genera.

Figure 2. The alpha rarefaction curve is used to assess species richness. The curve is generated by calculating the number of distinct OTUs (operational taxonomic units) in relation to sample size.

Figure 3. 16S microbiome data generated by Illumina MiSeq. Stool DNA was isolated using Norgen’s Stool Isolation Kit (Magnetic Bead System) (Cat. 55700) from 200 mg of stool from 9 donors. Shannon diversity is a common alpha diversity metric that measures the richness and evenness of a population.

Figure 4. High-quality soil DNA was successfully isolated from soil samples using (A) Norgen’s Soil DNA Isolation Kit (Magnetic Bead System) (Cat. 58100) and (B) Norgen’s Soil DNA Isolation Plus Kit (Cat. 64000), respectively. The purified DNA was then successfully used in 16S rRNA microbiome sequencing. Once sequenced, samples are assembled into OTUs (operational taxonomic units) based on 97% identity, and the phylogenetic tree represents the evolutionary relationship between these OTUs.

Figure 5. Principal Coordinate Analysis (PCoA) of 26 samples, showing differences in the distribution of taxonomic classifications between samples at the genus level, using the unweighted UniFrac metric. The unweighted UniFrac metric is sensitive to differences in low-abundance features.


For the WEHI data set, library sizes after quality filtering, clustering, and combining PCR replicates ranged from 30,000 to 250,000 sequences per sample, with a median of 67,000 (Figure S1A); sequences clustered into 12,652 OTUs of minimum size 20. For the BCM data set, library sizes ranged from 5,000 to 56,000, with a median of 27,000 (Figure S1B); sequences clustered into 3,675 OTUs of minimum size 20. The final number of sequences reflects differing sequencing and filtering protocols, including the use of multiple PCR replicates at WEHI.

Taxonomic overview

Samples were dominated at the phylum level by Bacteroidetes and Firmicutes, as expected. The mean summed proportion of these two phyla was 94%, varying from 71.9% to 99.7% between individual samples. Likewise, a single order from each of these two phyla, Bacteroidales and Clostridiales, was dominant, with three orders from the phylum Proteobacteria contributing another 1–2% overall (Fig. 2A,D; see also Figure S2).

Overview of the fecal bacterial microbiome from sequencing at WEHI (A,B,C) or BCM (D,E,F). (A,D) Dominant bacterial genera in fecal samples, or higher taxa where genus was not available. Bars are colour-coded by phyla: red Bacteroidetes, blue Firmicutes, green Proteobacteria, brown Actinobacteria, yellow Verrucomicrobia. (B,E) Alpha diversity within samples. Two measures are shown: observed number of OTUs per sample, an estimate of richness, and Inverse-Simpson index indicating the evenness of the sample. Samples were sub-sampled to the smallest sample size, and values are the mean of 10 random sub-samples. Boxes show the inter-quartile range for the four methods on three days. (C,F) Beta diversity. NMDS ordination of the UniFrac distance between samples, a representation of phylogenetic similarity.

Alpha (α) diversity is used to characterise the richness of the microbiome and its evenness (heterogeneity) or distribution of proportions. Samples showed a considerable spread of α diversity (Fig. 2B,E). Samples from individual 66 had the lowest observed richness (number of OTUs per sample) and the lowest Inverse-Simpson diversity index, the latter indicating dominance by a smaller number of OTUs. This is reflected in the genera plots (Fig. 2A,D). In contrast, samples from individual 11 had a high observed richness but a comparatively low Inverse-Simpson index, consistent with the presence of a few high-abundance and multiple low-abundance genera.
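The two α diversity measures discussed above can be sketched in a few lines. This is an illustrative example only: the function names are invented here, the counts are toy values, and the sub-sampling helper simply draws reads without replacement, which is one common way to rarefy but not necessarily the procedure the authors used.

```python
import random


def observed_richness(counts):
    """Number of OTUs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse-Simpson index: 1 / sum(p_i^2) over OTU proportions p_i."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


def rarefy(counts, depth, rng):
    """Sub-sample `depth` reads without replacement; return per-OTU counts."""
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    out = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        out[otu] += 1
    return out


if __name__ == "__main__":
    even = [25, 25, 25, 25]    # four equally abundant OTUs
    skewed = [97, 1, 1, 1]     # same richness, one dominant OTU
    print(observed_richness(even), inverse_simpson(even))
    print(observed_richness(skewed), inverse_simpson(skewed))
    print(rarefy(skewed, 50, random.Random(0)))
```

The skewed sample shows the pattern described for individual 11: richness stays the same while the Inverse-Simpson index drops because a few OTUs dominate.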

Analysis of β diversity by non-metric multidimensional scaling (NMDS) ordination of the UniFrac distance showed that samples cluster strongly by individual, with marked separation between individuals (Fig. 2C,F).

Differences between WEHI and BCM data sets

For the WEHI and BCM data sets the most abundant phyla were similar, but the proportions of less abundant phyla and higher taxonomic resolution differed. For example, the mean proportion of genus Akkermansia in the order Verrucomicrobiales was greater in BCM (0.7%) than WEHI (0.02%). The proportion of Bacteroides was lower in some individual samples for BCM than WEHI (Fig. 2B; also Figure S2).

The BCM data set yielded fewer OTUs and therefore had lower values for observed richness (Fig. 2B,E). The number of OTUs observed per sample was dependent on sampling depth (Figure S3); values shown are based on the smallest sample sizes for each of the two data sets. Richness was similar between the WEHI and BCM data sets, with samples from individual 66 showing the lowest alpha diversity and those from individual 44 the highest. For the Inverse-Simpson diversity index, which is not dependent on library size at this depth of sequencing, the BCM data set had a greater range of values, and a greater range for samples from some individuals. Both data sets had similar patterns of beta diversity between individuals (Fig. 2C,F), although the BCM data set had several outliers.

Initial analysis was performed separately on the WEHI and BCM datasets. For better comparison of the taxonomies, the bioinformatic pipeline was re-applied to a data set comprising the BCM sequences and one of the three WEHI technical replicates (Fig. 3). The ordination plot shows ‘batch’ effects between the two sequencing centers and greater between-sample differences in the BCM data set.

Beta diversity between samples from two sequencing centers. Ordination plot of Bray-Curtis distances between samples, using Detrended Correspondence Analysis. Points represent samples from BCM and a single technical replicate from WEHI.

DESeq2 was used to make generalized linear models for the counts at phylum, order and OTU levels (Table 1). The model included individual ID, day and collection-processing method as factors. At the phylum level, the largest change was in the Verrucomicrobia. At the OTU level, 3% of OTUs were significantly different (Figure S4, Additional data S1). Most of the differentially abundant OTUs belonged to the orders Clostridiales (63%) and Bacteroidales (31%). The direction of change in OTUs was not consistent, and there were no significant differences in counts for Clostridiales and Bacteroidales between WEHI and BCM data sets.

Effect of collection-processing method on taxonomic analysis

Testing of the WEHI data set for differential abundances between collection-processing methods, using DESeq2 with a design controlling for the effect of person and day, revealed no significant differences in counts by phylum, order or family (Table 2, Fig. 4A). Five OTUs (0.04% of OTUs comprising 0.2% of sequences) were different under collection-processing Method A. With the BCM data set, collection-processing Methods A and B were taxonomically different, with a decrease in Actinobacteria in Method A (Fig. 4D) and an increase in Lentisphaerae, although counts were very low (p < 0.001, Additional Table S1). Lentisphaerae were also increased in Method A compared with Methods C and D (p < 0.05).

Effect of collection-processing method from sequencing at WEHI (A,B,C) or BCM (D,E,F). (A,D) Log of standardised counts (scaled by library size) of the four most abundant phyla. Points show mean and bars standard deviation (sd) for each individual and collection-processing method. Method A has the smallest average sd for Bacteroidetes and Actinobacteria. (B,E) The Inverse-Simpson α diversity index for each sample (compare with Fig. 2). (C,F) Mean log (standardised count) plotted against the mean over the collection-processing methods, and a linear regression applied. Method A has the greatest average deviation from the linear model for the WEHI data set.

Diversity varied within a sample depending on collection-processing method (Fig. 4B,E), but the effect was small and inconsistent. After fitting a linear model with inputs for method and individual, 20–30% of variation was unaccounted for, while collection-processing method accounted for only 2%. Overall, alpha diversity was slightly lower with Method A in the WEHI data set, and higher with Method B in the BCM data set (Table 3).

Different methods of collection-processing might also increase the variance between samples, reducing the reproducibility of a result. Two approaches were used to test for this. Greater variance between samples is equivalent to greater distance between samples by some measure. The Bray-Curtis dissimilarity between OTU counts was calculated for pairs of samples from each individual and method, and a Tukey Honest Significant Difference test applied to a linear model of the dissimilarity. There was no evidence that the dissimilarity between samples was different for collection-processing methods (smallest p = 0.1) in the WEHI data set. There were significant differences in Bray-Curtis distances between samples in the BCM data set (p < 0.001), with collection-processing Method A associated with smaller differences between samples from the same individual than Methods B, C and D (Additional Table S2).
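For readers unfamiliar with the measure, the Bray-Curtis dissimilarity between two samples can be written out directly. The sketch below uses the standard definition (sum of absolute count differences divided by the total count across both samples). It is not the authors' code, and whether their counts were normalized before this step is not stated in the text, so the example uses raw toy counts.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two OTU count vectors.

    Returns 0.0 for identical samples and 1.0 for samples sharing no OTUs.
    """
    if len(a) != len(b):
        raise ValueError("count vectors must cover the same OTUs")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


if __name__ == "__main__":
    sample_1 = [6, 4, 0]
    sample_2 = [4, 6, 0]
    sample_3 = [0, 0, 10]
    print(bray_curtis(sample_1, sample_2))  # similar composition
    print(bray_curtis(sample_1, sample_3))  # no shared OTUs
```

Computing this for every pair of samples from one individual and one method gives the within-group distances that the text then compares across methods.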

In addition, we looked for differences in the variance of the four most abundant phyla. The log-transformed standardised counts for Bacteroidetes, Firmicutes, Proteobacteria and Actinobacteria per sample were compared with the mean across collection-processing methods for each individual (Fig. 4C,F). In the WEHI data set Methods B, C and D gave similar results, while Method A had lower variance within samples from the same individual but greater deviation from the mean compared with the other methods.

Effect of collection-processing method on library size

Collection-processing methods were compared after quality filtering, barcode extraction and clustering. In the WEHI data set, the number of DNA sequences extracted per sample did not differ by collection-processing method; in the BCM data set, collection-processing Method D resulted in fewer sequences than other collection-processing methods, but the difference was small compared with total variation (Figure S5). Batch effects (sequencing run) were more significant (p < 10⁻⁵) than collection-processing method, but batch and method together contributed less than 5% of the variation in library size.

Taxonomical Classification of Bacterial Sequences

Precise taxonomy assignments based on sequence alignments remain a computational challenge for both 16S and shotgun libraries, because of the short NGS read lengths. Prior to taxonomic classification, gene marker amplicon sequences, like regions of the bacterial 16S rRNA gene, are clustered by two main approaches (Sun et al., 2012; Chen et al., 2013). First, sequences can be clustered into phylotypes according to their similarity to previously annotated sequences in a reference database (Liu et al., 2008). Second, operational taxonomic units (OTUs) can be constructed by clustering sequences de novo, purely based on their similarity (Schloss and Westcott, 2011; Sun et al., 2012), which is computationally much more intensive. A hybrid method that combines both approaches is therefore recommended. In all cases, an arbitrary similarity threshold is used to differentiate clusters. The 99% similarity threshold is generally accepted as a good proxy for species (Stackebrandt and Ebers, 2006). However, this threshold is often insufficient to discriminate between closely related species, such as different members of the Enterobacteriaceae, Clostridiaceae, and Peptostreptococcaceae families. Importantly, higher resolution analytical tools have been published that overcome some of the limitations associated with clustering algorithms (Eren et al., 2013, 2014; Tikhonov et al., 2015).

Comprehensive reference databases have been compiled for annotation of sequenced bacterial metagenomes. For 16S rRNA genes, these include the Greengenes database (DeSantis et al., 2006), the Ribosomal Database Project (RDP) (Cole et al., 2014), and SILVA (Quast et al., 2013). In addition to their extensive catalogs of curated 16S rRNA sequences, available for downloading, each of those portals also offers a series of bioinformatics tools for analysis of NGS sequences. Comprehensive analysis servers like MG-RAST, which already contain updated databases for annotation purposes, are also publicly available (Meyer et al., 2008). More specifically, the Human Microbiome Project (HMP) keeps a curated collection of sequences of microorganisms associated with the human body, including eukaryotes, bacteria, archaea and viruses, from both shotgun and 16S sequencing projects (Human Microbiome Project Consortium, 2012a,b). One of the approaches to increasing the resolution of taxonomical classification of sequences is to compile databases containing only the sequences likely to exist in the environment under study. For example, specialized databases comprising only members of the human intestinal microbiota (Ritari et al., 2015; Forster et al., 2016) have been created.

Robust bioinformatics approaches have also been developed for analysis of shotgun data (Riesenfeld et al., 2004; Schloss and Handelsman, 2008; Wu and Eisen, 2008; Huson et al., 2011; Boisvert et al., 2012; Gevers et al., 2012; Kultima et al., 2012; Namiki et al., 2012; Segata et al., 2012). Unique clade-specific marker genes (Mende et al., 2013) and lowest common ancestor (LCA) positioning approaches are among the most popular. For the former, a gene marker catalog is pre-computed from previously sequenced bacterial genomes and sequences are taxonomically classified by querying the catalog. For the LCA approach, pre-aligned sequences are hierarchically classified on a taxonomy tree using a placement algorithm (Aho et al., 1973; Huson et al., 2011). Sequences that surpass a dissimilarity threshold (bit-score) are progressively placed on higher taxonomy levels.
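The LCA idea can be sketched as follows. This is a simplified illustration rather than the algorithm of any cited tool: each database hit carries a root-to-leaf lineage, hits scoring close to the best bit-score are kept, and the read is placed at the deepest rank those lineages share. The `top_fraction` cutoff and both function names are assumptions made for the example.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineages."""
    lineages = list(lineages)
    if not lineages:
        return ()
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) == 1:
            shared.append(ranks[0])
        else:
            break  # lineages diverge at this rank
    return tuple(shared)


def assign_read(hits, top_fraction=0.9):
    """Place a read at the LCA of all hits scoring near the best bit-score.

    `hits` is a list of (bit_score, lineage) pairs. `top_fraction` is an
    illustrative cutoff, not a value taken from a specific tool.
    """
    if not hits:
        return ()
    best = max(score for score, _ in hits)
    kept = [lineage for score, lineage in hits if score >= top_fraction * best]
    return lowest_common_ancestor(kept)


if __name__ == "__main__":
    base = ("Bacteria", "Bacteroidetes", "Bacteroidia", "Bacteroidales",
            "Bacteroidaceae", "Bacteroides")
    hits = [
        (200.0, base + ("Bacteroides vulgatus",)),
        (195.0, base + ("Bacteroides fragilis",)),
        (90.0, ("Bacteria", "Firmicutes", "Clostridia")),
    ]
    print(assign_read(hits))
```

Loosening the cutoff admits more distant hits, which pushes the read toward higher taxonomic levels, as the paragraph above describes.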


In-silico comparison of full vs. partial 16S gene sequencing

The in-silico analysis was carried out separately on two non-redundant public databases: Greengenes v13.8.99 [29] and the Human Oral Microbiome Database (HOMD) v13 [30]. Only the results for the Greengenes database are reported in the main text. For the HOMD, a single sequence was randomly selected to represent each species present in the database. As Greengenes does not consistently provide species-level taxonomic classification, all sequences with genus-level classification were selected and sequences representative of 99% sequence-similarity clusters were used to represent distinct species. Supplementary Fig. 2a (and Source Data) indicates the relative extent to which different bacterial taxa were represented within this Greengenes-derived database.

In-silico amplicons demarcating different sub-regions of the 16S gene were generated by trimming regions defined by established primer sets (Supplementary Table 1) using Cutadapt v1.4.2 [31], allowing up to three mismatches within the primer alignment. Sequences were discarded if one or more variable regions (including V1–V9) could not be identified by the trimming tool, if they contained N’s, or if the resulting amplicon was >2 SDs away from the observed mean length for the respective region. These curation steps retained 15% and 75% of the sequences in the Greengenes and HOMD databases, respectively (Supplementary Table 2). Full-length (V1–V9) amplicons were aligned using MUSCLE [32] and Shannon entropy was calculated at each base position along a single E. coli str. K-12 substr. MG1655 16S gene sequence (NCBI Gene ID 947777; Fig. 1a). Accordingly, deletions within other 16S sequences are represented in entropy plots, whereas deletions within the reference sequence are not.
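A per-position Shannon entropy calculation over a multiple alignment can be sketched as below. This is an illustration under stated assumptions, not the authors' script: the alignment is a list of equal-length strings with `-` for gaps, gaps are counted as a symbol, entropy uses log base 2, and columns where the chosen reference has a gap are skipped so that positions follow the reference coordinates. All function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the symbols in one alignment column."""
    n = len(column)
    counts = Counter(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_along_reference(aligned, ref_index=0):
    """Entropy at each reference position of a multiple alignment.

    Columns where the reference row has a gap are skipped, so deletions in
    other sequences contribute to entropy but reference gaps do not.
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    ref = aligned[ref_index]
    return [
        column_entropy(col)
        for ref_base, col in zip(ref, zip(*aligned))
        if ref_base != "-"
    ]


if __name__ == "__main__":
    alignment = ["AAGA", "AACC", "ATT-", "ATAG"]
    print(entropy_along_reference(alignment))
```

Conserved positions score 0 bits, and a column split evenly among four symbols scores 2 bits.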

To determine the taxonomic resolution afforded by different variable regions, each in-silico amplicon was classified against the filtered reference database from which it was generated using the mothur command classify.seqs [33] with a range of minimum confidence thresholds (-cutoff 30–…). To create OTUs, in-silico amplicon datasets generated for each sub-region were filtered to remove non-unique sequences and re-ordered to correspond with the sequence order in the V1–V9 dataset. Each amplicon was assigned a unitary abundance and OTUs were generated at a variety of similarity thresholds (97%, 98%, and 99%) using the USEARCH command cluster_otus [34], with chimera detection disabled using the option -uparse_break -999.

Construction of a bacterial mock community

Based on data available from the Human Microbiome Project and Human Oral Microbiome Database, 36 bacterial strains were selected to represent microbes prevalent in human body sites including the airways, gut, oral cavity, skin, and vaginal tract (Supplementary Table 3). DNA from ten strains was obtained directly from ATCC. The other 26 strains were cultured in appropriate media and environmental conditions until cultures reached late logarithmic phase (Supplementary Table 3) [35–38]. Unless otherwise indicated, anaerobes were grown under an atmosphere of 90% N2, 5% H2, and 5% CO2. DNA was isolated by suspending cultures in TE buffer containing 20 mg ml⁻¹ lysozyme and incubating at 37 °C for 30 min. Subsequently, AL buffer (Qiagen, Valencia, CA) containing 1.23 mg ml⁻¹ Proteinase K was added and samples were incubated at 56 °C overnight. Samples were then incubated at 95 °C for 5 min and DNA was isolated using a DNeasy Blood and Tissue kit (Qiagen). DNA was eluted in MD5 solution (MoBio Laboratories, Carlsbad, CA). Isolated DNA was pooled in a manner that accounted for different numbers of 16S rRNA gene copies per species. Briefly, the genome size (n) in bp was estimated for each organism and was used to calculate the mass of DNA (m) per genome using the formula m = n × (1.096 × 10⁻²¹ g bp⁻¹). Genome mass was then normalized based on the predicted copy number of the 16S rRNA gene (Supplementary Table 3) and the appropriate mass of DNA containing the required 16S copy number for each species was calculated.
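The pooling arithmetic can be made concrete with a short sketch. It applies the stated conversion of 1.096 × 10⁻²¹ g per bp, then divides by the 16S copy number to find how much DNA carries a target number of gene copies. The function names, the target copy number, and the example genome values are illustrative assumptions, not figures from Supplementary Table 3.

```python
GRAMS_PER_BP = 1.096e-21  # mass of one base pair of double-stranded DNA, in grams


def genome_mass_g(genome_size_bp):
    """Mass of a single genome copy in grams: m = n * 1.096e-21."""
    return genome_size_bp * GRAMS_PER_BP


def dna_mass_for_copies(genome_size_bp, copies_per_genome, target_copies):
    """Mass of genomic DNA (g) that contains `target_copies` 16S gene copies."""
    if copies_per_genome <= 0:
        raise ValueError("copies_per_genome must be positive")
    genomes_needed = target_copies / copies_per_genome
    return genomes_needed * genome_mass_g(genome_size_bp)


if __name__ == "__main__":
    # Hypothetical strains: (genome size in bp, 16S copies per genome).
    strains = {"strain_a": (4_600_000, 7), "strain_b": (2_000_000, 2)}
    for name, (size_bp, copies) in strains.items():
        mass = dna_mass_for_copies(size_bp, copies, target_copies=1_000_000)
        print(f"{name}: {mass:.3e} g for 1e6 copies of the 16S gene")
```

A strain with few 16S copies per genome needs proportionally more DNA to contribute the same number of gene copies to the pool.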

Illumina library preparation, shotgun sequencing, and assembly

WGS sequencing was performed for 19 members of the mock community that did not have WGS sequence data publicly available. Libraries were made using the Illumina TruSeq Nano DNA HT kit according to the manufacturer’s instructions, and were sequenced on either the Illumina MiSeq or HiSeq platform. Genomes for sequenced organisms were assembled individually using SPAdes v3.5.0 [39] with post-processing enabled (--careful).

PacBio library preparation and sequencing

Sequencing libraries were prepared by amplifying the V1–V9 region of the 16S rRNA gene using primers 27F and 1492R (Supplementary Table 1), and Accuprime Taq polymerase (Thermo Fisher Scientific, Waltham, MA). Amplicons were purified using PCR purification kits (Qiagen, Hilden, Germany) and 1 μg of DNA was used for the SMRTbell 1.0 Template Prep Kit (Pacific Biosciences, Menlo Park, CA). SMRTbell-adapted sequences were run on the Pacific Biosciences (PacBio) RS II platform using P6C4v2 chemistry. Output files were processed and assembled into CCS reads using CCS2 v3.0.1, setting the minimum passes to 3, minimum signal-to-noise ratio (SNR) to 4, minimum length to 1200, minimum predicted accuracy to 0.9, and the minimum Z-score to -5. Consensus sequences longer than 1600 bp were discarded.

Analysis of the bacterial mock community

Reference 16S rRNA gene sequences matching strains in the mock community were initially downloaded from the RDP database [40]. Several reference gene sequences contained ambiguous base calls. Each sequence was therefore aligned to its respective WGS assembly and the aligned assembly region extracted to create an improved reference gene set containing a single representative 16S rRNA gene sequence for each member of the mock community.

To determine sequence variation in PacBio CCS data, reads generated from the mock community were aligned to the mock reference gene set using Cross_match [41] with the minimum alignment score (-minscore) set to 750, the substitution penalty (-penalty) set to -9, and only the best alignment for each read reported (-masklevel 0). Output alignments were parsed to determine the number and location of insertions, deletions, and substitutions in reads aligning to each reference 16S rRNA gene sequence.

To determine the frequency and position of expected sequence variation (attributable to the presence of multiple, divergent copies of the 16S rRNA gene within a single genome), the seven gene copy variants known to exist in the E. coli K-12 MG1655 sub-strain (NC_000913.3) were downloaded from RefSeq and aligned using MUSCLE. To provide a second estimate of expected intra-genome sequence variation, Illumina WGS sequence reads were aligned to the single E. coli reference sequence present in the mock community reference database and the locations of insertions, deletions, and substitutions were inferred using the SAMtools pileup command [42].

Sampling and sequencing of the human stool microbiome

Stool samples were collected from four healthy, competitive cyclists enrolled in the study described by Petersen et al. [20]. Informed consent was obtained from all human participants and work was carried out with the oversight of the Jackson Laboratory Internal Review Board (IRB numbers 1503000013 and 16-JGM-07). Fecal material was self-collected using polyethylene sample collection containers (Fisher Scientific) and was placed on freezer packs before shipping to the Jackson Laboratory for Genomic Medicine. Once received, samples were stored at −80 °C prior to extraction. DNA was extracted using the PowerSoil DNA Isolation Kit (MO BIO Laboratories, Inc.). mWGS sequence libraries were prepared as described for the bacterial mock community and 150-base paired-end reads were generated on the Illumina NextSeq platform. Exact duplicate sequences were discarded on the assumption that they were PCR artifacts and the remaining reads were screened against the human reference genome (GRCh38) using BMTagger [43]. Adapters and low-quality bases were trimmed using Flexbar [44].

Amplicon libraries were prepared and sequenced for the V1–V9 region (PacBio RS II) and V1–V3 region (Illumina MiSeq) as described for the bacterial mock community.

Quantifying Bacteroides in the human stool microbiome

Taxonomic abundance estimates were generated from mWGS data by aligning sequenced reads to the Real Time Genomics™ (RTG) reference database of bacterial genome assemblies (v2.0), using the map and species commands within the RTG-core bioinformatics package.

Amplicon sequence data for the V1–V3 and V1–V9 regions of the 16S rRNA gene were pooled and de-replicated using USEARCH (v8.0.1517), before being clustered into OTUs at either 97% or 99% similarity thresholds using the -cluster_otus command [34]. Amplicon sequences from each sample were then reassigned to each OTU at the same similarity threshold used for clustering in order to obtain OTU relative abundance estimates. The genus of each OTU was determined using the RDP classifier v2.2 [11] in conjunction with the Greengenes database, v13.5, at a confidence threshold of 0.8.

V1–V3 and V1–V9 amplicons belonging to the genus Bacteroides were selected by directly classifying individual amplicon sequences using the RDP classifier. Sequences were then clustered into OTUs at either 97% or 99% identity thresholds using USEARCH. Representative sequences of Bacteroides OTUs generated for each variable region/identity threshold combination were assigned a putative species classification by aligning each sequence to the RTG reference database (v2.0) using the USEARCH local alignment algorithm [45], allowing up to 50 top hits for each aligned sequence.

The suitability of the RTG database as a reference for discriminating different Bacteroides species was assessed by extracting the 16S rRNA gene sequences for each Bacteroides genome contained therein. Extracted sequences were globally aligned using MUSCLE, and a maximum-likelihood tree was constructed using FastTree v2 [46] and visualized using the R package ape [47]. The resulting tree (Supplementary Fig. 11) indicated that sequence variation within the 16S gene was sufficient to resolve most major Bacteroides species contained within this database.

The suitability of either 97% or 99% identity thresholds for clustering V1–V3 and V1–V9 amplicons at the species level was assessed by determining the frequency with which OTUs for each variable region/identity threshold aligned optimally to a single species in the RTG reference database (Supplementary Fig. 12).

V1–V9 amplicon sequences assigned to the single OTU identified as B. vulgatus (OTU_1 Supplementary Data  1 ) were detected at high relative abundance in two human stool microbiome samples (Scott and IronHorse). Sequences from each sample were therefore extracted and aligned to the single 16S rRNA gene reference sequence used in the mock community analysis. Sequence alignment was performed using Cross_match and alignment errors were calculated as described above.

Isolation and sequencing of bacteria from human stool

Stool samples were again contributed by competitive cyclists enrolled in the study described by Petersen et al. 20 . Ethical oversight and sample collection were as described above. Bacteria were cultured on a variety of media and under anaerobic conditions, unless otherwise stated (Supplementary Data  2 ). Individual colonies were picked and DNA extracted using the MasterPure™ Gram Positive DNA Purification Kit (Lucigen). Samples were multiplexed and sequenced on a PacBio RS II. A subset of multiplexed libraries were sequenced on multiple SMRT cells at varying loading concentrations (Supplementary Data  2 ) resulting in different numbers of total reads. Each repeated run was therefore treated as a technical replicate to determine (i) the measurement error for the estimation of intragenomic 16S gene SNP frequencies attributable to the sequencing platform and (ii) the relationship between measurement error and sequencing depth.

Computational analysis of individual isolates

Sequence data for each isolate were quality filtered and adapters removed as described above. Filtered sequences were reoriented using the mothur command align.seqs, with the Silva gold database as a reference and the arguments flip = t, threshold = 0.5. Gaps in alignments were subsequently removed with the mothur command degap.seqs. Filtered, reoriented fasta files were then de-replicated using the USEARCH command -derep_fulllength and then sorted with -sortbysize, with the argument -minsize 1. The most abundant unique sequence for each isolate was then extracted (on the assumption it was the least likely to contain sequencing errors) and was used as a reference against which to align all reads for that isolate. Sequence alignment was performed using Cross_match with the arguments -minscore 1200, -masklevel 0, and alignment errors (substitutions, insertions, and deletions) calculated as described above.
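The idea behind de-replication and choosing the most abundant unique sequence as a reference can be sketched in a few lines of plain Python. This is only an illustration of the concept, not a substitute for the USEARCH commands named above; the function names and the alphabetical tie-break are assumptions.

```python
from collections import Counter

def dereplicate(seqs):
    """Return (sequence, count) pairs sorted by decreasing abundance.

    Sequences are compared over their full length, case-insensitively.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(s.upper() for s in seqs)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def most_abundant(seqs):
    """Pick the most abundant unique sequence to act as the isolate reference."""
    return dereplicate(seqs)[0][0]

# Toy reads for one hypothetical isolate.
reads = ["ACGT", "ACGT", "ACGA", "acgt", "ACGA", "TCGT"]
uniques = dereplicate(reads)
reference = most_abundant(reads)
print(uniques, reference)
```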

Due to the prevalence of sequencing errors in processed reads (e.g., Supplementary Fig.  10 ), insertion and deletion errors were ignored when generating nucleotide substitution profiles. Substitution errors in alignments were filtered in a multi-step process to separate true intragenomic SNPs from background error. First, samples with fewer than 200 aligned reads were discarded, because preliminary investigation indicated they had insufficient signal-to-noise ratio for the detection of true SNPs. Second, the distribution of the frequency of substitution errors was calculated across the entire aligned region of the 16S gene. Base positions where the substitution error frequency was well outside instrument error (nine interquartile ranges above the upper quartile) were identified as true SNPs. Finally, samples with SNPs at >3% of base positions were discarded, as this threshold was empirically determined to exclude impure isolates.
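The outlier rule in the second step (upper quartile plus nine interquartile ranges) can be written down as a short sketch. The function name, the toy frequencies and the quartile interpolation method are assumptions; the text does not say how quartiles were computed.

```python
import statistics

def call_snps(sub_freq, n_iqr=9.0):
    """Return 0-based positions whose substitution frequency exceeds Q3 + n_iqr * IQR.

    sub_freq holds one substitution-error frequency per aligned base position.
    Quartiles use the 'inclusive' method of statistics.quantiles (an assumption).
    """
    q1, _, q3 = statistics.quantiles(sub_freq, n=4, method="inclusive")
    cutoff = q3 + n_iqr * (q3 - q1)
    return [i for i, f in enumerate(sub_freq) if f > cutoff]

# Toy profile: background error near 1 %, with two positions standing out.
freqs = [0.010, 0.012, 0.008, 0.011, 0.009, 0.010, 0.500, 0.013, 0.007, 0.250]
snp_positions = call_snps(freqs)
print(snp_positions)
```

Positions flagged this way would still need the read-count and purity filters described in the paragraph before being treated as true SNPs.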

We assessed SNP measurement error ( ζ w ) 48 for a subset of cultured isolates where replicate sequencing was performed on multiple SMRT cells using varying input library concentrations (Supplementary Data  2 ). We also took advantage of variation in sequencing depth between replicates to determine whether the measurement error was affected by the number of reads available for SNP phasing. Across 271 samples, the median ζ w was 1.8% (Supplementary Fig.  13a ). There was no obvious relationship between measurement error and sequencing depth for samples with >� reads (Supplementary Fig.  13b ).

Taxonomic identification of sequenced isolates

Isolates were assigned a putative taxonomy using BLAST 49 . The most abundant unique sequence for each isolate was searched against the NCBI 16S Microbial database using blastn, with the argument -max_target_seqs 20. Resulting hits were sorted first by e-value, then bitscore and the taxonomy of the highest scoring sequence was reported. In addition, sequences were clustered into OTUs at 99% sequence identity using USEARCH command -cluster_otus with the arguments -otu_radius_pct 1.0, -uparse_break �. The phylogenetic relationship between isolates was determined by aligning the most abundant unique sequence for each isolate, then constructing a maximum-likelihood tree using FastTree v2.
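The hit-ranking step can be sketched as a two-key sort. The sketch assumes that a lower e-value is better and that ties are broken by higher bit score; the text gives the sort keys but not their direction. The `Hit` type and the scores are made up for illustration and are not BLAST output.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    subject: str     # taxonomy of the database sequence
    evalue: float
    bitscore: float

def best_hit(hits):
    """Sort by ascending e-value, then descending bit score, and return the top hit."""
    ranked = sorted(hits, key=lambda h: (h.evalue, -h.bitscore))
    return ranked[0]

# Hypothetical hits for one isolate; values are invented.
hits = [
    Hit("Bacteroides vulgatus", 0.0, 2650.0),
    Hit("Bacteroides dorei", 0.0, 2700.0),
    Hit("Bacteroides fragilis", 1e-180, 2400.0),
]
top = best_hit(hits)
print(top.subject)
```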

To determine the total number of unique nucleotide substitution profiles generated from sequenced isolates, all isolates identified as belonging to the same OTU were compared with one another. Two isolates were considered different if the substitution frequency at one or more SNP loci differed more than 3 SDs above the mean measurement error (i.e., 6.58%, Supplementary Fig.  13 ).
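A minimal sketch of this comparison follows, using the 6.58% threshold quoted above. The profile layout (SNP position mapped to substitution frequency in percent), the treatment of missing positions as 0%, and the greedy counting of distinct profiles are all assumptions made for illustration.

```python
def profiles_differ(profile_a, profile_b, threshold=6.58):
    """True if the substitution frequency differs by more than threshold at any locus.

    Profiles map SNP position -> substitution frequency in percent.
    A position missing from one profile is treated as 0 % (an assumption).
    """
    loci = set(profile_a) | set(profile_b)
    return any(
        abs(profile_a.get(p, 0.0) - profile_b.get(p, 0.0)) > threshold
        for p in loci
    )

def count_unique_profiles(profiles, threshold=6.58):
    """Greedy count: a profile is new if it differs from every representative kept so far."""
    representatives = []
    for prof in profiles:
        if all(profiles_differ(prof, rep, threshold) for rep in representatives):
            representatives.append(prof)
    return len(representatives)

# Three hypothetical isolates from the same OTU.
a = {100: 50.0, 200: 20.0}
b = {100: 52.0, 200: 21.0}   # within measurement error of a
c = {100: 50.0, 200: 35.0}   # differs from a at position 200
print(count_unique_profiles([a, b, c]))
```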

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this article.

5. Viromic Sequencing

Viruses are key constituents of microbial communities and contribute to their evolution and homeostasis. Viromic sequencing has been used to study the intestinal viruses in different diseases, including type 1 diabetes [8], inflammatory bowel disease [10,125], alcohol-associated liver disease [126], non-alcoholic fatty liver disease [127], colorectal cancer [128,129], human immunodeficiency virus [130], and autoimmune diseases [11]. Because viruses are highly diverse and lack universal marker genes, it is difficult to amplify them with an amplicon-based approach. Instead, shotgun metagenomic sequencing approaches can be used to characterize viruses and identify novel viruses.

Although viruses outnumber microbial cells 10:1 in most environments, viral DNA represents only 0.1% of the total DNA in a microbial community. Isolation of viral particles, followed by their purification, is therefore the initial step in viromic sequencing and is necessary to obtain deep sequence coverage of viruses in the human gut microbiome. Large particles in fecal samples, such as undigested or partially digested food fragments and microbial cells, are generally removed by serial filtration steps with osmotically neutral buffer or by ultracentrifugation in a cesium chloride density gradient. The next step is nucleic acid extraction, during which the viral nucleic acid must be isolated so that all fractions of non-viral origin are removed. DNase and RNase are usually used to remove non-encapsulated nucleic acids. The library preparation protocol also varies with the type of virus being studied. For example, because bacteriophages are parasitic, special steps are required when isolating their DNA. For RNA viruses, because RNA is unstable, reverse transcription to cDNA is required. In addition, the virome contains active and silent fractions. To study both fractions, total nucleic acid isolation is needed [131]. For the active fraction alone, a filter, chemical precipitation or centrifugation is often required to isolate the viral DNA.

The initial analysis of the sequences obtained after DNA sequencing is quality control, which includes filtering of low-quality reads and removal of 16S rRNA, 18S rRNA and human sequence reads. Because viral sequences share homology with prokaryotic and eukaryotic genes, filtering of low-quality sequences is a key step in viromic analysis. The resulting sequences are analyzed by either an alignment-based approach or an assembly approach. With the alignment-based approach, different mapping algorithms are used to compare the sequence reads against viral genomes and viral databases. Although the databases have expanded recently, the number of genomes deposited in them is far smaller than the number of sequenced virotypes, and most sequence reads lack similarity to the database sequences, which are poorly annotated. The lack of sequence identity typically results in 60%�% sequences in the viral metagenomes [132]. Due to the highly diverse nature of viruses and the lack of similarity to existing databases, de novo assembly approaches are often used in viromic analysis [131,133,134]. Different assemblers are used for viral metagenomic data, such as VICUNA [135]. Popular shotgun metagenome assemblers such as MetaVelvet have also been applied to viral metagenome assembly. Some virome-specific computational pipelines are available, such as Metavir [136,137] and the Viral MetaGenome Annotation Pipeline (VMGAP) [138], which generally include open reading frame (ORF)-finding algorithms to predict coding sequences, followed by comparison with different protein databases.


Sanger sequencing

Sanger sequencing resulted in 1,242 reads of 16S rRNA gene sequences ('Sanger'-dataset). After aligning the reads against the SILVA database, using BLASTN, we imported the results into MEGAN, where 1,228 reads could be assigned. Surprisingly, we found a high abundance of Cyanobacteria in the Sanger dataset.

454 sequencing

454 sequencing resulted in 72,571 reads of 16S rRNA gene sequences ('16S-454'-dataset). After aligning the reads against the SILVA database, using BLASTN, we imported the results into MEGAN, where 72,350 reads could be assigned. The abundance of Cyanobacteria was much lower in the 454 sequences than in the Sanger sequences. Furthermore, we detected slightly more Bacteroidetes than Firmicutes in this dataset, as well as phyla that are less abundant than Bacteroidetes and Firmicutes, such as Verrucomicrobia and Actinobacteria, which are easily overlooked when using Sanger sequencing. Proteobacteria and Clostridiaceae were detectable only at a low level by this approach.

SOLiD sequencing

16S sample: After filtering low quality sequences (during conversion from 'csfasta' to 'fasta', as mentioned above) we obtained 3,767,260 reads (2,155,456 forward and 1,611,804 reverse) for 16S samples ('16S-SOLiD' dataset). All sequences were blasted against the SILVA database and then imported into MEGAN, leading to assignments for 2,530,912 reads.

Shotgun sample: The above-mentioned conversion from 'csfasta' to 'fasta' format with quality filtering resulted in 10,764,512 forward and 9,997,372 reverse reads for the 'Shotgun-SOLiD' dataset. Of these, 3,168,307 forward and 4,577,127 reverse reads had a length of 40 bp or more. There were 791,321 mate pairs in which both reads had a length of 40 bp or more. Further, there were 861,344 mate pairs in which only the forward read had a length of 40 bp or more and 1,798,245 mate pairs in which only the reverse read had a length of 40 bp or more. In total, we considered 3,450,910 mate pairs, or a total of 6,901,820 sequences, for which at least one of the mates was at least 40 bp long (for details see Table 1).

Table 1

Details of sequence reads of 'Shotgun-SOLiD' dataset.

Data type (shotgun sample) | File consisting of forward reads | File consisting of reverse reads
Fasta file after quality filter | 10,764,512 | 9,997,372
Reads of length 40+ bp | 3,168,307 | 4,577,127
Reads where both mates are 40+ bp | 791,321 | 791,321
Mates where one read is 40+ bp and the other is <40 bp | 861,344 forward (40+ bp) reads have <40 bp reverse mates | 1,798,245 reverse (40+ bp) reads have <40 bp forward mates
Total number of reads processed for BLAST | 3,450,910 | 3,450,910
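The totals in Table 1 follow directly from the three mate-pair categories, which a few lines of arithmetic confirm. The variable names are chosen for illustration only.

```python
# Mate-pair counts reported for the 'Shotgun-SOLiD' dataset.
both_long = 791_321        # both reads are 40 bp or longer
forward_only = 861_344     # only the forward read is 40 bp or longer
reverse_only = 1_798_245   # only the reverse read is 40 bp or longer

pairs_kept = both_long + forward_only + reverse_only
sequences_kept = 2 * pairs_kept               # each pair contributes two reads
share_both_long = both_long / pairs_kept      # fraction of kept pairs with two long reads

print(pairs_kept, sequences_kept, round(100 * share_both_long, 1))
```

About 23% of the retained pairs have two long reads, so restricting the analysis to those pairs removes roughly three quarters of the input.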

After adapter removal, all of these sequences were aligned against the NCBI-NR database using BLASTX and imported into MEGAN. Using the above-mentioned thresholds 1,100,372 reads could be assigned to some node in the NCBI taxonomy.

A comparison of the main abundances of bacterial groups on four taxonomic levels derived from the different sequencing technologies is shown in Figure 1. Additional file 1 shows the tree view of the normalized comparison of the data obtained from these four methods. We have highlighted the nodes (showing sum and assigned read numbers) that were used to create Figure 1. As the overview in Figure 1 shows, 16S-Sanger and 16S-SOLiD generally look similar to each other except at the 'species' level; this is because 16S-SOLiD provides many more reads than Sanger, which helped us to achieve greater species richness.

Comparison of abundances of bacterial groups on different taxonomic levels obtained by 'Sanger', '16S-454', '16S-SOLiD' and 'Shotgun-SOLiD' sequencing. (A) Phylum level, (B) class level, (C) genus level, and (D) species level. Columns are organized according to clustering results based on normalized Euclidean distance analysis of the phylogenetic tree on each taxonomic level, as displayed on the left.

Comparison of 16S and shotgun samples obtained using SOLiD technology

Figure 2 shows a normalized comparative tree view of the assignments at the 'family' level of the NCBI taxonomy. Besides information about the composition of the microbiome (as is the case with 16S rRNA sequences), the shotgun DNA includes information about the encoded proteins. While a higher percentage of the 16S rRNA sequences could be taxonomically assigned, the composition of the microbiota inferred by both approaches was comparable. However, some microbial groups were more strongly represented in one approach than in the other. In shotgun sequencing, more Actinobacteria, Bacteroidetes, Bacillales, Lactobacillales, Clostridiaceae, Eubacteriaceae, Gammaproteobacteria, Selenomonadales and Fusobacteriaceae were detectable. In 16S rRNA gene sequencing, on the other hand, we found confirmation for the high abundance of Cyanobacteria, whereas only a few reads were assigned to Cyanobacteria in shotgun sequencing. This over-representation could be caused by preferential amplification of the 16S rRNA genes of Cyanobacteria, as argued in the Sanger sequencing results section. Furthermore, we found more reads that map to Verrucomicrobiaceae, Clostridiales and Proteobacteria in 16S rRNA gene sequencing than in shotgun sequencing. The two major phyla in the intestinal microbiome, the Firmicutes and Bacteroidetes, are represented differently by the two approaches: 16S rRNA sequencing revealed more Firmicutes, whereas shotgun sequencing resulted in more Bacteroidetes. This difference could be due to artifacts of the amplification of 16S rRNA genes.

Normalized comparison result obtained using MEGAN for the '16S-SOLiD' dataset (magenta) and the 'Shotgun-SOLiD' dataset (yellow). The '16S-SOLiD' dataset was blasted against the SILVA database and the 'Shotgun-SOLiD' dataset against the NCBI-NR database. The tree is collapsed at the 'family' level of the NCBI taxonomy. Circles are scaled logarithmically to indicate the number of assigned reads.

The results reported here are based on using all mate pairs for which at least one of the two reads has a length of 40 bp or more. If one considered only those mate pairs for which both reads have a length of at least 40 bp, the number of reads considered would drop by 75%, resulting in a large reduction in computational requirements, but one would lose 33% of the assigned reads (see Additional file 2), which contribute 21 more species. Hence, in some studies it may be sufficient to consider only mate pairs in which both reads are at least 40 bp long, if there are plenty of such reads.

Comparison of 16S samples from three technologies (Sanger, 454 and SOLiD)

As SOLiD sequencing is substantially more cost-efficient than Sanger sequencing, it is possible to produce many more SOLiD reads at a very small fraction of the cost of a Sanger run. SOLiD sequencing produces very short sequences, many of which cannot be assigned; these are shown as the 'No hits' node in the above figures. Sanger sequencing does not have this limitation, and 454 data are also less affected in this respect. Hence, we ignored the 'No hits' node in the comparison. Figure 3 depicts a normalized comparison tree view of all the 16S samples obtained from the three technologies at the 'family' level of the NCBI taxonomy. To facilitate visual comparison, nodes are scaled by 'summarized reads', that is, the number of reads assigned to or below a given node. Using SOLiD sequencing, we were able to find many phyla, such as Actinobacteria, and the domain Archaea that were not detected by Sanger sequencing and appeared with only a few reads in the 454 dataset. Furthermore, important bacterial groups such as Verrucomicrobia, Lactobacilli, Fusobacteria and special members of the Clostridiales were not found by Sanger sequencing at all. In the 454 sample we detected Verrucomicrobia, but not the other two. We found comparable amounts between Sanger and 16S rRNA SOLiD sequencing for one of the two major phyla of the intestinal microbiome, the Bacteroidetes (Figure 3, Figure 1).

Normalized comparison between 16S samples obtained using three technologies: 'Sanger', '16S-454' and '16S-SOLiD' datasets. Normalized comparison result obtained using MEGAN for 'Sanger'-dataset (blue), '16S-454' dataset (cyan) and '16S-SOLiD' dataset (magenta) without considering 'No hits' node. The tree is collapsed at 'family' level of NCBI taxonomy. Circles are scaled logarithmically to indicate the number of summarized reads.

A detailed absolute comparison between the 1,242 16S-Sanger reads, the 72,571 16S-454 reads and 300,000 reads from the '16S-SOLiD' dataset is depicted in Additional file 3. Here we can see that 300,000 reads of the '16S-SOLiD' dataset already provide much more resolution in the analysis than the 16S sequences from Sanger or 454 technologies. Furthermore, according to the Sanger sequencing reads, assignments to phyla such as the Proteobacteria and the Firmicutes are dominant, possibly because of easier cloning and particular amplification procedures. This amplification process could be the cause of the differences seen when comparing the amounts of Bacteroides, Gammaproteobacteria, Alphaproteobacteria and Bacilli in 16S sequencing; Figure 2 already showed that these groups are highly represented in the shotgun dataset. Furthermore, the SOLiD datasets give information about the abundance of potentially pathogenic microorganisms such as Campylobacter, Listeria and Neisseria. In the 'Sanger' dataset, these organisms were not detected due to their low abundance. The overrepresentation of the Cyanobacteria in the Sanger dataset was much less pronounced in the '16S-SOLiD' dataset. In the 'Sanger' dataset, the Cyanobacteria were the dominant group and had more reads than all other bacteria. In the '16S-SOLiD' dataset, they were still a group with a high abundance, but the other bacterial groups were well represented, too. The low abundance of Cyanobacteria in the 'Shotgun-SOLiD' dataset could be explained by the absence of a 16S amplification step for this dataset. The advantage of SOLiD sequencing over Sanger sequencing is visible here: due to the large number of reads, the overrepresentation of a bacterial group was less pronounced. Furthermore, the shotgun approach has the advantage of avoiding amplification preferences for some bacterial groups. Figure 2 illustrates that the bacterial groups of Actinobacteria, Bacteroidetes, Bacilli, Alpha- and Gammaproteobacteria and Clostridiaceae are underrepresented when amplification processes are used.

Furthermore, paired reads using SOLiD technology achieved much more resolution than 454 single reads at a lower cost (see Additional file 4).

In total, these data suggest that SOLiD sequencing is a viable and cost efficient option for the analysis of the intestinal microbiome in spite of the short read length.

Functional analyses using SEED and KEGG

In the SEED classification, genes are assigned to functional roles, and different functional roles are grouped into subsystems. The SEED classification can be represented as a rooted tree in which internal nodes represent different subsystems and leaves represent functional roles. MEGAN's functional analysis using the SEED classification is shown in Additional file 5.

To perform a functional analysis, MEGAN assigns each read to the functional role of the highest scoring gene in a BLAST or similar comparison against a protein database. For pathway analysis using KEGG, MEGAN attempts to match each read to a KEGG orthology (KO) accession number, using the best hit to a reference sequence for which a KO accession number is known, and reports the number of hits to each KEGG pathway. Additional file 6 depicts the result of such an analysis at the highest level of the KEGG hierarchy. The functional analyses thus indicate which metabolic pathways may be active. This KEGG analysis is preliminary, however, and only a detailed examination of individual pathways will allow one to decide which pathways are actually active.
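The KO matching rule can be sketched as follows: for each read, take the highest-scoring hit whose reference has a known KO number, then count one hit for every pathway that KO belongs to. This is a simplified reading of the description above, not MEGAN's implementation; all identifiers and data structures are hypothetical placeholders.

```python
from collections import Counter

def count_pathway_hits(read_hits, ko_of_ref, pathways_of_ko):
    """Count reads per pathway using each read's best KO-annotated hit.

    read_hits:      {read_id: [(ref_id, bitscore), ...]}
    ko_of_ref:      {ref_id: ko_id} for references with a known KO
    pathways_of_ko: {ko_id: [pathway_id, ...]}
    """
    counts = Counter()
    for hits in read_hits.values():
        # Keep only hits to references that carry a KO annotation.
        known = [(score, ref) for ref, score in hits if ref in ko_of_ref]
        if not known:
            continue  # read cannot be mapped to any KO
        _, best_ref = max(known)
        for pathway in pathways_of_ko.get(ko_of_ref[best_ref], ()):
            counts[pathway] += 1
    return counts

# Placeholder data: refC has no KO annotation.
read_hits = {
    "r1": [("refA", 50.0), ("refB", 80.0)],
    "r2": [("refC", 90.0), ("refA", 40.0)],
    "r3": [("refC", 70.0)],
}
ko_of_ref = {"refA": "KO1", "refB": "KO2"}
pathways_of_ko = {"KO1": ["pathway_X"], "KO2": ["pathway_X", "pathway_Y"]}
pathway_counts = count_pathway_hits(read_hits, ko_of_ref, pathways_of_ko)
print(dict(pathway_counts))
```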

Comparison with other approaches

To evaluate the performance of the MEGAN4 analysis based on a BLASTN comparison of the reads against the SILVA database, we ran the data through the RDP classifier [22] (using a 'Confidence threshold' of 80%) (see Additional file 7). For RDP, we did not specify a minimum alignment length, in order to allow all assignments at the previous threshold. The MEGAN analysis resulted in annotations very similar to those of RDP. We also analyzed the data using the MOTHUR software [23]. However, MOTHUR uses a simple best-hit assignment strategy that assigns all reads to the leaves of the NCBI taxonomy, regardless of the presence of other, equally similar reference sequences. Hence, a direct comparison against analyses performed using the LCA approach is hardly possible.
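The difference between best-hit and LCA assignment can be seen in a small sketch. When a read matches two species equally well, a best-hit strategy reports one of them, whereas an LCA strategy backs off to the deepest rank they share. The function below is a bare-bones illustration; the score and percentage filters that real LCA tools apply are not modelled here, and the lineages are typed in by hand as an example.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxonomic prefix shared by all lineages.

    Each lineage is a list of taxon names ordered from root to leaf.
    """
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared

prefix = ["Bacteria", "Bacteroidetes", "Bacteroidia",
          "Bacteroidales", "Bacteroidaceae", "Bacteroides"]
# Two equally good hits for one read (example lineages).
hits = [prefix + ["Bacteroides vulgatus"], prefix + ["Bacteroides fragilis"]]

best_hit_leaf = hits[0][-1]             # best-hit strategy: a species-level leaf
lca = lowest_common_ancestor(hits)      # LCA strategy: stops at the shared genus
print(best_hit_leaf, "|", lca[-1])
```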

Besides these analyses, the overall diversity of the 16S-SOLiD and 16S-454 data was compared at the genus level using the Shannon-Weaver index and the Simpson Reciprocal index, measures that combine richness (the number of different nodes at a certain level) and evenness (the relative abundance of each node). Considering all the nodes at the 'genus' level, we obtained Shannon and Simpson index values of 2.212 and 2.879, respectively, for the 16S-SOLiD data. For the 16S-454 data, these two indices attain much lower values of 1.220 and 1.845, respectively.
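Both indices can be computed from the read counts per genus, as in the sketch below. It uses the standard definitions, H = -sum(p_i ln p_i) and 1/D = 1 / sum(p_i^2); the natural logarithm is an assumption, since the text does not state which base was used, and the counts are toy values rather than the study's data.

```python
import math

def shannon_index(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_reciprocal(counts):
    """Reciprocal Simpson index 1 / sum(p^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)

even = [10, 10, 10, 10]      # four equally abundant genera
skewed = [97, 1, 1, 1]       # one dominant genus

print(shannon_index(even), simpson_reciprocal(even))
print(shannon_index(skewed), simpson_reciprocal(skewed))
```

For a perfectly even community of four genera the indices reach their maxima, ln 4 and 4; a community dominated by one genus scores lower on both.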

Code availability

Software versions used are listed in Table  8 .

Table 7

16S alignment validation. Region(s) covered by 16S reads with exact matches to the SILVA database. The first column represents the region(s) called by our pipeline, while the third and fourth show the exact matching positions in the SILVA database. This shows consistency between the variable region called by our pipeline and the expected position it occupies along the 16S gene. SILVA IDs: B. fragilis: FQ312004.3243020.3244552 B. vulgatus: CP000139.2183533.2185042 F. nucleatum: AE009951.530422.531923 R. gnavus: AZJF01000012.178214.179732.

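Region | Species | Start | End
v2 | F. nucleatum | 134 | 389
v2 | R. gnavus | 108 | 362
v2 | B. vulgatus | 110 | 364
v2 | B. fragilis | 108 | 361
v3 | B. vulgatus | 330 | 540
v3 | B. fragilis | 327 | 537
v4 | F. nucleatum | 531 | 818
v4 | R. gnavus | 500 | 788
v4 | B. vulgatus | 522 | 810
v6v7 | F. nucleatum | 944 | 1207
v6v7 | R. gnavus | 917 | 1177
v6v7 | B. vulgatus | 936 | 1194
v6v7 | B. fragilis | 933 | 1193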

Code for sequence quality control and trimming, shotgun and 16S metagenomics profiling and generation of figures in this paper is freely available and thoroughly documented at This repository includes instructions for the analysis and reproduction of the figures on this paper from the publicly available samples, as well as pipelines used for the analysis. This repository is arranged in folders, each containing a README:

• qc: Scripts for quality control and preprocessing of samples

• analysis_shotgun: Scripts to run software for metagenomics analysis

• regions_16s: In-house scripts for splitting IonTorrent reads into new FASTQ files

• analysis_16s: DADA2 pipeline adapted to this dataset

• assembly: Scripts to run the assembly, binning and quality control software

• figures: Scripts used to generate the figures in this manuscript

• shannon_index_subsamples: Scripts used to compute alpha diversity in subsampled FASTQs
