How to select genes before log2 ratio on a RNASeq gene expression matrix, based on signal median

I want to transform a TCGA mRNA expression matrix (in linear data format) to log2-ratios and then run a feature (gene) selection, selecting the 1000 most variant genes (genes with higher standard deviation across samples). The workflow is the following:

  1. Select "good" genes before log2 ratio (genes each with median signal at least t in p% of samples);
  2. On selected genes, run log2 ratio, dividing each gene by its median signal and then log2-transforming the result matrix;
  3. Select the 1000 most variant genes along all samples.

How do I select t and p?

There is no rule for fixing t and p; it depends on the level of stringency you expect. The value of t depends on what is considered an active expression level, and this need not be the same for all genes.

This is RNA-seq data; I don't understand what "median" signal you are talking about. For each sample, a gene would have a normalized expression value, typically RPKM (Reads Per Kilobase per Million mapped reads). If you have replicates for each sample, then take the mean, not the median.

Regarding the calculation of log-ratios: always be careful with this, especially in the case of zeros. Instead of log-ratios you may use some sort of gain metric:

if ratio = x/y then gain = (x-y)/y
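As a quick illustration of the gain metric next to the plain log2 ratio (the values here are toy numbers, not from the dataset under discussion):

```python
# Sketch: log2 ratio vs. "gain" for a single gene value x against a
# reference y. The gain equals ratio - 1 and avoids taking a log,
# though it is still undefined when y is zero.
import math

def log2_ratio(x, y):
    return math.log2(x / y)

def gain(x, y):
    return (x - y) / y  # equals (x / y) - 1

x, y = 8.0, 2.0
print(log2_ratio(x, y))  # 2.0
print(gain(x, y))        # 3.0
```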

You can also do a principal component analysis on the data and select the first n principal components.
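The PCA suggestion can be sketched with plain numpy; the function name and the random toy matrix below are invented for illustration, not part of the original discussion:

```python
# Sketch: project a samples x genes matrix onto its first n principal
# components using an SVD of the column-centered data.
import numpy as np

def first_n_components(X, n):
    Xc = X - X.mean(axis=0)          # center each gene (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n].T             # sample scores on the top n PCs

X = np.random.default_rng(0).normal(size=(10, 50))  # 10 samples, 50 genes
scores = first_n_components(X, 3)
print(scores.shape)  # (10, 3)
```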

There is not a general solution to select "t" and "p". Such choices are largely arbitrary. Furthermore, for an array platform, if one assumes that "t" has something to do with "expressed", the value for "t" will differ for each probe on the array.

Since you are ultimately going to filter based on variance, I'd suggest starting with your median-centered, log-transformed data and simply choose the top 1000 most variable genes.

The data the OP is referring to is RNA-seq, so there are no probes. Sequencing bias correction can be done for such data.

Thanks Sean. Regarding your last suggestion, I thought that it could introduce problems, since genes with very low median signal can show a high variance when logratio transformed. What do you think?

Well, log transformation will bring down the variance. Imagine four samples: [0.5, 2, 8, 32]. Without log transformation the variance is 213.5625, but after log2-transforming the data the variance drops to about 6.67.
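These numbers are easy to reproduce (sample variance, as in the comment above):

```python
# Reproducing the figures above: log2 transformation compresses variance.
import math
from statistics import variance  # sample variance (n - 1 denominator)

vals = [0.5, 2, 8, 32]
print(variance(vals))                                    # 213.5625
print(round(variance([math.log2(v) for v in vals]), 2))  # 6.67
```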

In any case if the expression is consistently low then the variance will be low. You should be careful about log transformations especially when doing differential expression studies. I would suggest that you do the log transformation after selecting for median and variance.

I followed your first advice using custom thresholds.

Hoping this is useful, the code of the pipeline is available at

I filtered out the genes that were below the overall 5th percentile in more than 5% of the samples. I think it is a reasonable threshold; let me know if not.
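That filter can be sketched as follows; the log-normal toy matrix is invented for illustration and stands in for the poster's actual expression table:

```python
# Sketch of the filter described above: drop genes whose signal falls
# below the overall 5th percentile in more than 5% of samples.
import numpy as np

rng = np.random.default_rng(1)
expr = rng.lognormal(mean=2, sigma=1, size=(2000, 40))  # genes x samples (toy)

t = np.percentile(expr, 5)              # overall 5th percentile of all values
frac_low = (expr < t).mean(axis=1)      # per-gene fraction of low samples
kept = expr[frac_low <= 0.05]           # keep genes rarely below threshold
print(kept.shape)                       # (number of kept genes, 40)
```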

GSEA Data Files

How do I create an expression dataset file? What types of expression data can I analyze?

GSEA requires that expression data be in a RES, GCT, PCL, or TXT file. All four file formats are tab-delimited text files. For details of each file format, see Data Formats.

GenePattern provides several modules for converting expression data into gct and/or res files:

ExpressionFileCreator converts raw expression data from Affymetrix CEL files.

GEOImporter and caArrayImportViewer create a GCT file based on expression data extracted from the GEO or caArray microarray expression data repository, respectively.

MAGEImportViewer module converts MAGE-ML format data. MAGE-ML is the standard format for storing both Affymetrix and cDNA microarray data at the ArrayExpress repository.

To use expression data stored in any other format (such as cDNA microarray data), first convert the data into a tab-delimited text file that contains expression measurements with genes as rows and samples as columns and then modify that text file to comply with the gct file format requirements as described in Expression Datasets in the GSEA User Guide.

If you are using two-color ratio data, see also cDNA Microarray Data.

Parsing Errors: If you see the following parsing error when you load your data file, check the file extension:
There were errors: ERRORS #:1Parsing trouble…

The file extension of the expression dataset file identifies the format of the file. If a gct, res, or pcl file has a .txt file extension, you will see the parsing error when you load the file into GSEA. Check that the file extension matches the file format. Note that some operating systems (such as Windows) can be configured to hide known file extensions. If your operating system is configured to hide known extensions, a file named test.gct.txt will be listed as test.gct. Look at the file type of the file: it should be GCT (or RES or PCL), not Text Document.

How do I filter or pre-process my dataset for GSEA?

How you filter or pre-process your data depends on your study. Here are a few guidelines to consider:

  • Probe identifiers versus gene identifiers. Typically, your dataset contains the probe identifiers native to your microarray platform DNA chip. GSEA can analyze the probe identifiers or collapse each probe set to a gene vector, where the gene is identified by gene symbol. Collapsing the probe sets prevents multiple probes per gene from inflating the enrichment scores and facilitates the biological interpretation of analysis results.
  • AP call filters. You can run GSEA on filtered or unfiltered data. Typically, the GSEA team runs the analysis on unfiltered data. One suggested approach is to run GSEA on the unfiltered data; if the results seem dominated by gene sets with poorly expressed genes, you might gain insight into what thresholds to use for the call filters.
  • Expression values. The GSEA algorithm examines the differences in expression values rather than the values themselves. For example, you might have natural scale data or logged expression levels; you might have Affymetrix data or two-color ratio data. As in most data analysis methodologies, the same expression data represented in different formats may generate different analysis results. The differences are expected. GSEA cannot determine which results are "correct."

For more information, see Preparing Data Files in the GSEA User Guide.

Should I use natural or log scale data for GSEA?

We recommend using natural scale data. We used it when we calibrated the GSEA method and it seems to work well in general cases.

Traditional modeling techniques, such as clustering, often benefit from data preprocessing. For example, one might filter expression data to remove genes that have low variance across the dataset and/or log transform the data to make the distribution more symmetric. The GSEA algorithm does not benefit from such preprocessing of the data.

How many samples do I need for GSEA?

This depends on your specific problem and data characteristics; however, as a rule of thumb, you typically want to analyze at least ten samples.

If you have technical replicates, you generally want to remove them by averaging or some other data reduction technique. For example, assume you have five tumor samples and five control samples each run three times (three replicate columns) for a total of 30 data columns. You would average the three replicate columns for each sample and create a dataset containing 10 data columns (five tumor and five control).
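The replicate-averaging step above can be sketched like this, assuming purely for illustration that the three replicate columns of each sample sit next to each other in the matrix:

```python
# Sketch: collapse three technical replicate columns per sample by
# averaging, turning 30 data columns into 10 (toy values).
import numpy as np

rng = np.random.default_rng(2)
data = rng.random((100, 30))    # 100 genes; 10 samples x 3 adjacent replicates
collapsed = data.reshape(100, 10, 3).mean(axis=2)  # average each triplet
print(collapsed.shape)          # (100, 10)
```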

How do I create a phenotype label file? What types of experiments can I analyze?

GSEA can be used to analyze experiments of any type (including time-series, three or more classes, and so on). The phenotype labels (cls) ASCII file defines the experimental phenotypes and associates each sample in your dataset with one of those phenotypes. The cls file is an ASCII tab-delimited file, which you can easily create using a text editor. For more information, see Preparing Data Files in the GSEA User Guide.

What gene sets are available? Can I create my own gene sets?

You can use the gene sets in the Molecular Signature Database (MSigDB) or create your own. For more information about the MSigDB gene sets, see the MSigDB page of this web site. For more information about creating gene sets or using gene sets with GSEA, see Preparing Data Files in the GSEA User Guide.

How many genes should there be in a gene set?

GSEA automatically adjusts the enrichment statistics to account for different gene set sizes, as described in the Supplemental Information for the GSEA 2005 PNAS paper.

Can GSEA analyze a gene set that contains duplicate genes? duplicate gene sets?

Duplicate genes in a gene set and duplicate gene sets both affect GSEA results. GSEA automatically removes duplicate genes from each gene set, but does not check for duplicate gene sets. For more information, see Gene Sets in the GSEA User Guide.

Can GSEA analyze a gene set that contains genes that are not in my expression dataset?

The gene set enrichment analysis automatically restricts the gene sets to the genes in the expression dataset. The analysis report lists the gene sets and the number of genes that were included and excluded from the analysis.

What array platforms and organism species does GSEA support?

GSEA works on any data, as long as the gene identifiers in your expression data match those in the gene sets file.

Typically, GSEA uses gene sets from MSigDB. All gene sets in MSigDB consist of human gene symbols. GSEA has built-in tools for converting a variety of other gene identifiers to human gene symbols by means of specially formatted CHIP files. The CHIP files provide the mapping between gene identifiers in your expression data and gene identifiers in the gene sets. Specifically, our CHIP files provide the mappings from all kinds of different platforms (e.g., mouse Affymetrix probe set IDs, human Affymetrix probe set IDs, etc.) to human gene symbols.

If your data was generated from non-human samples, then you need to decide whether using MSigDB meets your needs. The options are:

  1. The non-human species serves as a model to study conditions relevant for human biology. In this case, you want gene sets that are conserved between humans and your model organism. MSigDB is then the right choice and you will only need to provide the appropriate CHIP file for the analysis.
  2. The non-human species is the subject of your research, and you have no plans to compare it to human gene sets. In this case, you can still use MSigDB if your organism is among the sources of some of the MSigDB gene sets (e.g., mouse or rat), and you will only need to provide the appropriate CHIP file for the analysis.
  3. The non-human species is the subject of your research and you don't want to use MSigDB gene sets for other reasons. In this case, you have to provide your own database of gene sets as a GMT or GMX file. The file formats are described here. Of course, you still have to make sure that the gene identifiers in your data match those in your gene sets database. If the identifiers don't match each other, then you have to also provide a CHIP file with the appropriate mappings. The CHIP file format is described here.

To see what CHIP files are available in our distribution (note: our CHIP files provide mappings to human gene symbols only): start the GSEA desktop application and click the [...] button at "Chip platform(s)" on the "Run GSEA" page.

If your platform is not in this list, you have the following options:

  1. Create your own CHIP file to map your platform-specific gene identifiers to human gene symbols, and then use your CHIP file to collapse the dataset in GSEA. The CHIP file format is described here.
  2. Convert your platform identifiers to human gene symbols outside GSEA, then run GSEA with 'Collapse dataset' = FALSE.

Make sure that gene symbols in the collapsed dataset appear only once. Simply replacing the identifiers with human gene symbols usually is not sufficient because some of the identifiers can correspond to the same human gene symbols, resulting in duplicate rows with different expression values. In this case, GSEA will arbitrarily pick one of the rows with the same gene symbols for the analysis, which we do not recommend.
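As a sketch of the deduplication step described above (the symbols and values are invented; keeping the row with the highest maximum mirrors a common "max probe"-style collapse, which may or may not match the collapse mode you choose in GSEA):

```python
# Sketch: after mapping identifiers to gene symbols, collapse duplicate
# rows so each symbol appears only once, keeping the row whose maximum
# expression value is highest.
rows = [
    ("TP53", [5.0, 6.0, 7.0]),
    ("TP53", [1.0, 2.0, 3.0]),   # duplicate symbol from a second identifier
    ("EGFR", [4.0, 4.0, 4.0]),
]

collapsed = {}
for symbol, values in rows:
    if symbol not in collapsed or max(values) > max(collapsed[symbol]):
        collapsed[symbol] = values

print(sorted(collapsed))     # ['EGFR', 'TP53']
print(collapsed["TP53"])     # [5.0, 6.0, 7.0]
```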

Can GSEA analyze miRNA expression data?

The only way for GSEA to analyze expression data with miRNA identifiers is to provide gene sets made of matching miRNA identifiers. This is not possible with MSigDB gene sets, which consist predominantly of protein coding genes in the form of human gene symbols.

RNA-Seq Analysis of Spatiotemporal Gene Expression Patterns During Fruit Development Revealed Reference Genes for Transcript Normalization in Plums

Transcriptional analysis that uncovers fruit ripening-related gene regulatory networks is increasingly important to maximize quality and minimize losses of economically important fruits such as plums. RNA sequencing (RNA-Seq) and quantitative real-time reverse transcription polymerase chain reaction (qRT-PCR) are important tools to perform high-throughput transcriptomics. The success of transcriptomics depends on high-quality transcripts from polyphenolic- and polysaccharide-enriched plum fruits, whereas the reliability of quantification data relies on accurate normalization using suitable reference gene(s). We optimized a procedure for high-quality RNA isolation from vegetative and reproductive tissues of climacteric and non-climacteric plum cultivars and conducted high-throughput transcriptomics. We identified 20 candidate reference genes from significantly non-differentially expressed transcripts of RNA-Seq data and verified their expression stability using qRT-PCR on a total of 141 plum samples, which included flesh, peel, and leaf tissues of several cultivars collected from three locations over a 3-year period. Stability analyses of threshold cycle (CT) values using BestKeeper, delta (Δ) CT, NormFinder, geNorm, and RefFinder software revealed SAND protein-related trafficking protein (MON), elongation factor 1 alpha (EF1α), and initiation factor 5A (IF5A) as the best reference genes for precise transcript normalization across different tissue samples. We monitored spatiotemporal expression patterns of differentially expressed transcripts during the developmental process after accurate normalization of qRT-PCR data using a combination of the two best reference genes. This study also offers a guideline for selecting the best reference genes for future gene expression studies in other plum cultivars.

Genes that have very low counts across all the libraries should be removed prior to downstream analysis. This is justified on both biological and statistical grounds. From a biological point of view, a gene must be expressed at some minimal level before it is likely to be translated into a protein or to be considered biologically important. From a statistical point of view, genes with consistently low counts are very unlikely to be assessed as significantly DE because low counts do not provide enough statistical evidence for a reliable judgement to be made. Such genes can therefore be removed from the analysis without any loss of information.

As a rule of thumb, we require that a gene have a count of at least 10–15 in at least some libraries before it is considered to be expressed in the study. We could explicitly select for genes that have at least a couple of counts of 10 or more, but it is slightly better to base the filtering on count-per-million (CPM) values so as to avoid favoring genes that are expressed in larger libraries over those expressed in smaller libraries. For the current analysis, we keep genes that have CPM values above 0.5 in at least two libraries:

Here the cutoff of 0.5 for the CPM has been chosen because it is roughly equal to 10/L, where L is the minimum library size in millions. The library sizes here are 20–25 million. We used a round value of 0.5 just for simplicity; the exact value is not important because the downstream differential expression analysis is not sensitive to small changes in this parameter. The requirement of ≥ 2 libraries is because each group contains two replicates. This ensures that a gene will be retained if it is expressed in both libraries belonging to any of the six groups.
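The pipeline itself is in R (in edgeR this filter is typically written along the lines of keep <- rowSums(cpm(y) > 0.5) >= 2). The same logic can be sketched in Python on toy counts:

```python
# Sketch of the CPM filter described above: keep genes with CPM > 0.5
# in at least 2 libraries (counts and library sizes are toy values).
import numpy as np

counts = np.array([
    [100, 120,  90,  80],   # well-expressed gene
    [  0,   1,   0,   2],   # consistently low gene
    [ 50,   0,  60,   0],   # expressed in two libraries only
])
lib_sizes = np.array([20e6, 22e6, 25e6, 21e6])  # ~20-25 million reads

cpm = counts / lib_sizes * 1e6          # counts per million per library
keep = (cpm > 0.5).sum(axis=1) >= 2     # expressed in >= 2 libraries
print(keep)  # [ True False  True]
```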

The above filtering rule attempts to keep the maximum number of interesting genes in the analysis, but other sensible filtering criteria are also possible. For example, keep <- rowSums(y$counts) > 50 is a very simple criterion that would keep genes with a total read count of more than 50. This would give similar downstream results for this dataset to the filtering actually used. Whatever the filtering rule, it should be independent of the information in the targets file; it should not make any reference to which RNA libraries belong to which group, because doing so would bias the subsequent differential expression analysis.

The DGEList object is subsetted to retain only the non-filtered genes:

The option keep.lib.sizes=FALSE causes the library sizes to be recomputed after the filtering. This is generally recommended, although the effect on the downstream analysis is usually small.


In simulation and real data studies, limma with the l, l2, and r2 transformations performed better than limma with the voom transformation for data with small (nCases = nControls = 3) or large sample sizes (nCases = nControls = 100). For moderate sample sizes (nCases = nControls = 30 or 50), limma with the rv and rv2 transformations performed better than limma with the voom transformation. We hope these novel data transformations will provide investigators with more powerful differential expression analyses of RNA-seq data.

Inspection of the mapping results

The BAM file contains information about where the reads are mapped on the reference genome. But as it is a binary file containing information for many reads (several million for these samples), it is difficult to inspect and explore the file.

A powerful tool to visualize the content of BAM files is the Integrative Genomics Viewer (IGV).

Hands-on: Inspection of mapping results

  1. Install IGV (if not already installed)
  2. Start IGV locally
  3. Expand the mapped.bam file (output of the RNA STAR tool) for GSM461177
  4. Click on display with IGV local D. melanogaster (dm6) to load the reads into the IGV browser

Comments

In order for this step to work, you will need to have either IGV or Java web start installed on your machine. However, the questions in this section can also be answered by inspecting the IGV screenshots below.

Check the IGV documentation for more information.

IGV: Zoom to chr4:540,000-560,000 (chromosome 4 between 540 kb and 560 kb)

Question

  1. What information appears at the top as grey peaks?
  2. What do the connecting lines between some of the aligned reads indicate?

Solution

  1. The coverage plot: the sum of mapped reads at each position
  2. They indicate junction events (or splice sites), i.e. reads that are mapped across an intron

IGV: Inspect the splice junctions using a Sashimi plot

Comment: Creation of a Sashimi plot

Question

  1. What does the vertical red bar graph represent? And the arcs with numbers?
  2. What do the numbers on the arcs mean?
  3. Why do we observe different stacked groups of blue linked boxes at the bottom?

Solution

  1. The coverage for each alignment track is plotted as a red bar graph. Arcs represent observed splice junctions, i.e., reads spanning introns
  2. The numbers refer to the number of observed junction reads.
  3. The different groups of linked boxes at the bottom represent the different transcripts from the genes at this location that are present in the GTF file.

Comment

After the mapping, we have the information on where the reads are located on the reference genome, and we also know how well they were mapped. The next step in RNA-Seq data analysis is quantification of the number of reads mapped to genomic features (genes, transcripts, exons, …).

Comment

The quantification depends on both the reference genome (the FASTA file) and its associated annotations (the GTF file). It is extremely important to use an annotation file that corresponds to the same version of the reference genome you used for the mapping (e.g. dm6 here), as the chromosomal coordinates of genes are usually different amongst different reference genome versions.

In order to identify exons that are regulated by the Pasilla gene, we need to identify genes and exons which are differentially expressed between samples with PS gene depletion (treated) and control (untreated) samples. We will then analyze the differential gene expression and also the differential exon usage.

Computational Genomics with R

With the advent of the second-generation (a.k.a next-generation or high-throughput) sequencing technologies, the number of genes that can be profiled for expression levels with a single experiment has increased to the order of tens of thousands of genes. Therefore, the bottleneck in this process has become the data analysis rather than the data generation. Many statistical methods and computational tools are required for getting meaningful results from the data, which comes with a lot of valuable information along with a lot of sources of noise. Fortunately, most of the steps of RNA-seq analysis have become quite mature over the years. Below we will first describe how to reach a read count table from raw fastq reads obtained from an Illumina sequencing run. We will then demonstrate in R how to process the count table, make a case-control differential expression analysis, and do some downstream functional enrichment analysis.

8.3.1 Processing raw data

Quality check and read processing

The first step in any experiment that involves high-throughput short-read sequencing should be to check the sequencing quality of the reads before starting to do any downstream analysis. The quality of the input sequences holds fundamental importance in the confidence for the biological conclusions drawn from the experiment. We have introduced quality check and processing in Chapter 7, and those tools and workflows also apply in RNA-seq analysis.

Improving the quality

The second step in the RNA-seq analysis workflow is to improve the quality of the input reads. This step could be regarded as optional when the sequencing quality is very good. However, even with the highest-quality sequencing datasets, this step may still improve the quality of the input sequences. The most common technical artifacts that can be filtered out are the adapter sequences that contaminate the sequenced reads, and the low-quality bases that are usually found at the ends of the sequences. Commonly used tools in the field (trimmomatic (Bolger, Lohse, and Usadel 2014), trimGalore (Andrews 2010)) are again not written in R; however, there are alternative R libraries for carrying out the same functionality, for instance, QuasR (Gaidatzis, Lerch, Hahne, et al. 2015) (see the QuasR::preprocessReads function) and ShortRead (Morgan, Anders, Lawrence, et al. 2009) (see the ShortRead::filterFastq function). Some of these approaches are introduced in Chapter 7.

The sequencing quality control and read pre-processing steps can be visited multiple times until achieving a satisfactory level of quality in the sequence data before moving on to the downstream analysis steps.

8.3.2 Alignment

Once a decent level of quality in the sequences is reached, the expression level of the genes can be quantified by first mapping the sequences to a reference genome, and secondly matching the aligned reads to the gene annotations, in order to count the number of reads mapping to each gene. If the species under study has a well-annotated transcriptome, the reads can be aligned to the transcript sequences instead of the reference genome. In cases where there is no good quality reference genome or transcriptome, it is possible to de novo assemble the transcriptome from the sequences and then quantify the expression levels of genes/transcripts.

For RNA-seq read alignments, apart from the availability of reference genomes and annotations, probably the most important factor to consider when choosing an alignment tool is whether the alignment method considers the absence of intronic regions in the sequenced reads, while the target genome may contain introns. Therefore, it is important to choose alignment tools that take into account alternative splicing. In the basic setting, a read that originates from a cDNA sequence corresponding to an exon-exon junction needs to be split into two parts when aligned against the genome. There are various tools that consider this factor, such as STAR (Dobin, Davis, Schlesinger, et al. 2013), Tophat2 (Kim, Pertea, Trapnell, et al. 2013), Hisat2 (Kim, Langmead, and Salzberg 2015), and GSNAP (Wu, Reeder, Lawrence, et al. 2016). Most alignment tools are written in C/C++ because of performance concerns. There are also R libraries that can do short read alignments; these are discussed in Chapter 7.

8.3.3 Quantification

After the reads are aligned to the target, a SAM/BAM file sorted by coordinates should have been obtained. The BAM file contains all alignment-related information of all the reads that have been attempted to be aligned to the target sequence. This information consists of, most basically, the genomic coordinates (chromosome, start, end, strand) of where a sequence was matched (if at all) in the target, and the specific insertions/deletions/mismatches that describe the differences between the input and target sequences. These pieces of information are used along with the genomic coordinates of genome annotations such as gene/transcript models in order to count how many reads have been sequenced from a gene/transcript. As simple as it may sound, it is not a trivial task to assign reads to a gene/transcript just by comparing the genomic coordinates of the annotations and the sequences, because of confounding factors such as overlapping gene annotations, overlapping exon annotations from different transcript isoforms of a gene, and overlapping annotations from opposite DNA strands in the absence of a strand-specific sequencing protocol. Therefore, for read counting, it is important to consider:

  1. Strand specificity of the sequencing protocol: Are the reads expected to originate from the forward strand, reverse strand, or unspecific?
  2. Counting mode:
     • When counting at the gene level: when there are overlapping annotations, which feature should the read be assigned to? Tools usually have a parameter that lets the user select a counting mode.
     • When counting at the transcript level: when there are multiple isoforms of a gene, which isoform should the read be assigned to? This is usually an algorithmic consideration that is not modifiable by the end-user.
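To make the gene-level considerations concrete, here is a minimal sketch of strand-aware, "union"-style counting, where a read overlapping more than one gene is discarded as ambiguous; the gene intervals, strands, and reads below are invented for illustration:

```python
# Sketch: gene-level read counting with strand checking; reads that
# overlap zero genes or more than one gene are not counted.
genes = {"geneA": ("+", 100, 500), "geneB": ("+", 450, 900)}
reads = [("+", 150, 200),   # unique hit on geneA
         ("+", 460, 480),   # ambiguous: overlaps geneA and geneB
         ("-", 160, 210)]   # wrong strand: no hit

counts = {g: 0 for g in genes}
for strand, start, end in reads:
    hits = [g for g, (gs, s, e) in genes.items()
            if gs == strand and start < e and end > s]
    if len(hits) == 1:          # unique assignment: count it
        counts[hits[0]] += 1    # ambiguous / no hit: skipped
print(counts)  # {'geneA': 1, 'geneB': 0}
```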

Some tools can couple alignment to quantification (e.g. STAR), while others assume the alignments are already calculated and require BAM files as input. On the other hand, in the presence of good transcriptome annotations, alignment-free methods (Salmon (Patro, Duggal, Love, et al. 2017), Kallisto (Bray, Pimentel, Melsted, et al. 2016), Sailfish (Patro, Mount, and Kingsford 2014)) can also be used to estimate the expression levels of transcripts/genes. There are also reference-free quantification methods that can first de novo assemble the transcriptome and estimate the expression levels based on this assembly. Such a strategy can be useful in discovering novel transcripts or may be required in cases when a good reference does not exist. If a reference transcriptome exists but is of low quality, a reference-based transcriptome assembler such as Cufflinks (Trapnell, Williams, Pertea, et al. 2010) can be used to improve the transcriptome. In case there is no available transcriptome annotation, a de novo assembler such as Trinity (Haas, Papanicolaou, Yassour, et al. 2013) or Trans-ABySS (Robertson, Schein, Chiu, et al. 2010) can be used to assemble the transcriptome from scratch.

Within R, quantification can be done using:

  • Rsubread::featureCounts
  • QuasR::qCount
  • GenomicAlignments::summarizeOverlaps

8.3.4 Within sample normalization of the read counts

The most common application after a gene’s expression is quantified (as the number of reads aligned to the gene) is to compare the gene’s expression in different conditions, for instance, in a case-control setting (e.g. disease versus normal) or in a time-series (e.g. along different developmental stages). Making such comparisons helps identify the genes that might be responsible for a disease or an impaired developmental trajectory. However, there are multiple caveats that need to be addressed before making a comparison between the read counts of a gene in different conditions (Maza, Frasse, Senin, et al. 2013).

  • Library size (i.e. sequencing depth) varies between samples coming from different lanes of the flow cell of the sequencing machine.
  • Longer genes will have a higher number of reads.
  • Library composition (i.e. relative size of the studied transcriptome) can be different in two different biological conditions.
  • GC content biases across different samples may lead to a biased sampling of genes (Risso, Schwartz, Sherlock, et al. 2011) .
  • Read coverage of a transcript can be biased and non-uniformly distributed along the transcript (Mortazavi, Williams, McCue, et al. 2008) .

Therefore these factors need to be taken into account before making comparisons.

The most basic normalization approaches address the sequencing depth bias. Such procedures normalize the read counts per gene by dividing each gene’s read count by a certain value and multiplying it by 10^6. These normalized values are usually referred to as CPM (counts per million reads):

  • Total Counts Normalization (divide counts by the sum of all counts)
  • Upper Quartile Normalization (divide counts by the upper quartile value of the counts)
  • Median Normalization (divide counts by the median of all counts)

Popular metrics that improve upon CPM are RPKM/FPKM (reads/fragments per kilobase per million mapped reads) and TPM (transcripts per million). RPKM is obtained by dividing the CPM value by another factor: the gene length in kilobases. FPKM is the same as RPKM but is used for paired-end reads. Thus, the RPKM/FPKM methods account for, firstly, the library size, and secondly, the gene length.

TPM also controls for both the library size and the gene length; however, with the TPM method, the read counts are first normalized by the gene length (per kilobase), and then the gene-length-normalized values are divided by the sum of the gene-length-normalized values and multiplied by 10^6. Thus, the sum of normalized values for TPM will always be equal to 10^6 for each library, while the RPKM/FPKM values of a library do not necessarily sum to 10^6. Therefore, it is easier to interpret TPM values than RPKM/FPKM values.
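The order-of-operations difference between RPKM and TPM can be sketched with toy numbers (the counts and gene lengths below are invented):

```python
# Sketch: RPKM vs. TPM for one toy library. RPKM scales by library size
# first, then gene length; TPM length-normalizes first, then rescales so
# the values sum to one million.
import numpy as np

counts = np.array([100.0, 200.0, 300.0])
lengths_kb = np.array([1.0, 2.0, 0.5])       # gene lengths in kilobases
lib_size_millions = counts.sum() / 1e6

rpkm = counts / lib_size_millions / lengths_kb
rpk = counts / lengths_kb                    # length-normalize first...
tpm = rpk / rpk.sum() * 1e6                  # ...then scale to a million
print(round(tpm.sum()))  # 1000000
```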

8.3.5 Computing different normalization schemes in R

Here we will assume that there is an RNA-seq count table comprising raw counts, meaning that the number of reads counted for each gene has not been exposed to any kind of normalization and consists of integers. The rows of the count table correspond to the genes and the columns represent different samples. Here we will use a subset of the RNA-seq count table from a colorectal cancer study. We have filtered the original count table for only protein-coding genes (to improve the speed of calculation) and also selected only five metastasized colorectal cancer samples along with five normal colon samples. There is an additional column, width, that contains the length of the corresponding gene in base pairs. The gene lengths are needed to compute RPKM and TPM values. The original count tables can be found in the recount2 database using the SRA project code SRP029880, and the experimental setup along with other accessory information can be found in the NCBI Trace archive using the same project code.

Computing CPM

Let’s do a summary of the counts table. Due to space limitations, the summary for only the first three columns is displayed.

To compute the CPM values for each sample (excluding the width column):

Check that the sum of each column after normalization equals 10^6 (except the width column).
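Since the original code listings are not reproduced here, the steps above can be sketched on a small hypothetical count table (the gene names, sample names, and numbers below are illustrative, not the actual SRP029880 data):

```r
# Toy raw count table: rows = genes, columns = samples, plus a gene-length
# column named "width" (in base pairs), mirroring the structure described above.
counts_df <- data.frame(
  width  = c(2000, 1000, 4000),        # gene lengths in bp
  CASE_1 = c(100L, 50L, 350L),
  CASE_2 = c(80L, 40L, 380L),
  row.names = c("geneA", "geneB", "geneC")
)

summary(counts_df[, 1:3])              # summary of the first three columns

# CPM: scale each sample's counts to a library size of one million,
# excluding the width column.
counts <- as.matrix(counts_df[, colnames(counts_df) != "width"])
cpm <- apply(counts, 2, function(x) x / sum(x) * 10^6)

colSums(cpm)                           # each column should sum to 10^6
```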


The remarkable diversity of flower colors, especially in wild plants, has fascinated botanists, ecologists, and horticulturists for centuries [1,2,3]. The coloring of floral organs, a remarkable character of flowering plants, is a striking feature of the angiosperm radiation [4, 5]. Flower color diversity is recognized as one of the key adaptive traits, correlated predominantly with pollinators (e.g. insects, birds) and with seed-dispersing animals [6, 7]. Moreover, the flower color phenotype is an important feature used by taxonomists for plant classification. However, flower color appears to be one of the most evolutionarily labile traits, varying even among populations of the same species [7, 8].

The cellular compounds of flowers that contribute to the color profile and are visually perceived by humans are generally referred to as “pigments”. A group of secondary metabolites belonging to the flavonoids are the main determinants of pigmentation in plants; among them, anthocyanins are responsible for the orange to red and purple to violet pigments found in flowers, leaves, fruits, seeds, and other tissues [9, 10]. Anthocyanins are the predominant compounds of floral coloration, occurring in over 90% of angiosperms [11]. The flavonoid biosynthetic pathway leading to the accumulation of anthocyanins is highly conserved and well characterized, and has been extensively studied in many species, most of them model plants or agriculturally and horticulturally important plants [12,13,14,15]. Few studies have examined the molecular basis underlying the formation and accumulation of anthocyanins in wild species [16, 17]. Based on these studies, three major factors have been proposed to be involved in anthocyanin accumulation: transcription regulatory genes (the MYB-bHLH-WD40 complex) acting in the nucleus, structural genes (CHS, FLS, DFR, ANS) acting in the biosynthetic pathway, and transporter genes (GST) transferring anthocyanins from the cytosol into the vacuole [10, 18, 19]. The expression of these genes can also be affected by natural sequence variation and cis-regulatory elements, as well as by epigenetic modifications (such as DNA methylation) in the promoter regions [18, 20]. Moreover, flower color can be stabilized and enhanced by co-pigmentation of anthocyanins with flavonols, observed as a hyperchromic effect in which the color intensity conferred by anthocyanins is intensified [21]. For instance, the DFR and FLS genes can compete for a common substrate, leading to the production of different anthocyanins and flavonols through two primary branches [22, 23] and thus to co-pigmentation.
In contrast to the biosynthesis pathways, knowledge of anthocyanin catabolism in plants is limited. Some catabolic genes, such as BGLU and PER, have been shown to be responsible for anthocyanin degradation [24, 25]. Nevertheless, the molecular mechanism regulating anthocyanin synthesis has been shown to vary among plant species, resulting in the structural diversity of anthocyanins, because the biosynthesis pathway is regulated by multiple factors through regulatory networks [26].

Color is the perception of electromagnetic radiation in the range of the visible spectrum; the wavelengths reflected by pigments determine the color of a flower [27]. Color can be defined and classified in terms of brightness (the intensity of the signal, B), saturation (the purity of the color, S), and hue (the spectral descriptor of the color, H), and these features are commonly used to distinguish colors [27, 28]. Brightness refers to the color intensity, which is determined by the amount of anthocyanin [29, 30], and different color component combinations such as B/H and S/H were also found to be significantly correlated with anthocyanin content [31]. Liu et al. [32] proposed that color brightness decreases as total anthocyanin content increases. A correlation was also demonstrated between the saturation/hue ratio (S/H) and anthocyanin content [31]. With these parameters, anthocyanin content can be determined rapidly and non-destructively.

In evergreen azaleas (Rhododendron), anthocyanins and flavonols are the main flower pigments; in particular, the composition of the anthocyanidin constituents (i.e. cyanidin, delphinidin, malvidin, pelargonidin, peonidin, and petunidin) and their quantities determine flower colors that range from light pink to violet [11, 33]. Some studies have reported that the purple-flowered R. kiusianum contains derivatives of both the anthocyanidins cyanidin and delphinidin, whereas the red-flowered R. kaempferi contains only cyanidin derivatives [34]. Le Maitre et al. [35, 36], studying Erica species (which belong to the same family, Ericaceae, as Rhododendron), used qRT-PCR and UPLC-MS to unravel the anthocyanin genetic network underlying floral color shifts between red or pink and white or yellow flowered species. They found that loss of expression of single pathway genes, abrogation of the entire pathway through loss of expression of a transcription factor, or loss-of-function mutations in pathway genes all resulted in striking floral color shifts.

Here, we investigated the genetic basis of flower coloration using the highly color-polymorphic Rhododendron sanguineum complex. The complex (R. subgen. Hymenanthes) includes plants with yellow to pink or crimson to blackish crimson flowers that are classified into six varieties mainly on the basis of their flower color differences [37]. Members of this complex mostly occur at high elevations (> 3000 m) associated with snow cover [37]. They are endemic to northwest Yunnan and southeast Tibet, one of the global biodiversity hotspots [38]. This region is also recognized as one of the centers of diversification and differentiation of Rhododendron [37, 39]. The flower color polymorphisms of this genus have traditionally been viewed as an ecologically adaptive trait essential in attracting specific pollinators [40,41,42], and may also be a response to environmental variation, such as UV radiation at different elevations, temperature, and soil conditions [32]. Although studies have been published on the anthocyanin components and contents of Rhododendron flowers, most were dedicated solely to identifying the pigment constituents in the petals of some wild and cultivated azaleas using thin-layer chromatography (TLC) and high-performance liquid chromatography (HPLC) [11, 33]. No study so far has focused on the molecular mechanisms underlying infraspecific color polymorphisms in Rhododendron. Studying closely related entities such as a species complex has the advantage of a fairly homogeneous genetic background in which flower color genes vary and cases of homoplasy are limited. Previous studies mainly focused on color shifts at different developmental stages of a single species [14, 18], or covered a number of related species [26, 35].

In the present study, we combined transcriptome sequencing (RNA-seq) and genome resequencing with reflectance spectra analyses to elucidate molecular and anthocyanin content differences among three differently colored, naturally occurring varieties of the R. sanguineum complex, with flowers ranging from yellow flushed pink to deep blackish crimson. We aimed to study the correlation between infraspecific flower color variation and the expression of candidate genes of the anthocyanin/flavonoid biosynthesis pathway. Our findings allow us to propose a hypothesis for the genetic mechanism underlying flower color variation, possibly a representative case of pollinator-mediated incipient sympatric speciation in the R. sanguineum complex. In addition, this is the first study to compare transcriptome profiles in a natural system of a non-model Rhododendron species.


3.1 Especially progression genes show elevated expression at the onset of tooth patterning

For a robust readout of gene expression profiles, we first obtained gene expression levels using both microarray and RNAseq techniques from E13 (bud stage) and E14 (cap stage) mouse molars (Section 2). From dissected tooth germs we obtained five microarray and seven RNAseq replicates for both developmental stages. The results show that especially the progression category genes (genes required for the progression of tooth development) are highly expressed at E13 compared to the control gene sets (tissue, dispensable, and developmental-process categories; p values range from .0003 to .0426 for the RNAseq and microarray experiments, tested using random resampling; for details and all the tests, see Section 2, Figure 2, and Tables S2 and S3). Comparable differences are observed in E14 molars (p values range from .0000 to .0466; Figure 2, Tables S2 and S3).
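The random-resampling test referred to above can be sketched roughly as follows: draw many random gene sets of the same size as the category of interest and ask how often a random set matches or exceeds the category's median expression. The expression values and category membership below are simulated, purely illustrative stand-ins for the actual data:

```r
# Hedged sketch of a randomization (resampling) test on median expression.
set.seed(1)
expr <- setNames(rexp(1000, rate = 0.1), paste0("gene", 1:1000))  # toy expression levels
progression <- sample(names(expr), 15)     # hypothetical category members

observed <- median(expr[progression])      # observed category median

# Null distribution: medians of 10,000 random gene sets of the same size
null_medians <- replicate(10000,
  median(expr[sample(names(expr), length(progression))]))

p_value <- mean(null_medians >= observed)  # one-sided empirical p value
```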

In general, the expression differences between the progression and tissue categories appear greater than those between the progression and dispensable categories (p values range from .0028 to .0379 and from .0059 to .0466, respectively; Table S3), suggesting that some of the genes in the dispensable category may still play a functional role in tooth development. In our data we have 11 genes that cause a developmental arrest of the tooth when double mutated (Appendix S1). The expression level of this double-mutant category shows incipient upregulation compared to that of the developmental-process category (p values range from .0322 to .1637; Table S3), but not when compared to the tissue or dispensable categories (p values range from .0978 to .5010; Table S3). Therefore, based on the comparable expression levels between the double-mutant and some of the dispensable category genes, it is plausible that several of the genes in the dispensable category may cause phenotypic effects when mutated in pairs.

Even though the expression levels of the shape category genes (genes required for normal shape development) are lower than those of the progression category (Figure 2), at least the E14 microarray data suggest elevated expression levels relative to all the other control categories (p values range from .0001 to .0901; Table S3). The moderately elevated expression levels of the shape category genes could indicate that they are required slightly later in development, or that the most robust upregulation happens only for genes that are essential for the progression of development. The latter option seems to be supported by an RNAseq analysis of the E16 molar, showing only slight upregulation of shape category genes in bell stage molars (Table S3).

3.2 Transcriptomes of developing rat molars show elevated expression of the progression genes

Because our gene categories were based on experimental evidence from the mouse, we also tested whether comparable expression levels can be detected for the same genes in the rat. The evolutionary divergence of Mus and Rattus dates back to the Middle Miocene (Kimura et al., 2015), allowing a modest approximation of conservation in the expression levels. Examination of bud (E15) and cap (E17) stage RNAseq of rat molars shows upregulation of progression and shape category genes comparable to that in the mouse (Figure 3, Tables S2 and S3). Considering also that many of the null mutations in keystone genes in the mouse are known to have comparable phenotypic effects in humans (Nieminen, 2009), our keystone gene categories and analyses are likely to apply to mammalian teeth in general.

One complication of our expression level analyses is that they have been done at the whole-organ level. Because many of the genes regulating tooth development are known to have spatially complex expression patterns within the tooth (Nieminen et al., 1998), cell-level examinations are required to decompose the patterns within the tissue.

3.3 Single-cell RNAseq reveals cell-level patterns of keystone genes

Tooth development is punctuated by iteratively forming epithelial signaling centers, the enamel knots. The first, the primary enamel knot, is active in the E14 mouse molar, and at this stage many genes are known to have complex expression patterns. Some progression category genes have been reported to be expressed in the enamel knot, whereas others have mesenchymal or complex combinatorial expression patterns (Jernvall & Thesleff, 2012; Nieminen et al., 1998). To quantify these expression levels at the cell level, we performed single-cell RNAseq (scRNAseq) on E14 mouse molars (Section 2). We focused on capturing a representative sample of cells by dissociating each tooth germ without cell sorting (n = 4). After data filtering, 7000–8811 cells per tooth were retained for the analyses, providing 30,930 aggregated cells as a relatively good proxy of the E14 mouse molar (Section 2).

First we examined whether the scRNAseq produces expression levels comparable to our previous analyses. For the comparisons, the gene count values from the cells were summed up and treated as bulk RNAseq data (Figure 4a and Section 2). We analyzed the expression levels of the different gene categories as in the mouse bulk data (Figure 2), and the results show a general agreement between the experiments (Figures 2 and 4b). As in the previous analyses (Table S3), the progression category shows the highest expression levels compared to the control gene sets (p values range from .0071 to .0310; Table S3). Although the mean expression of the shape category is intermediate between the progression category and the control gene sets, the shape category is not significantly upregulated in the scRNAseq randomization tests (p values range from .7788 to .9968). This pattern reflects the bulk RNAseq analyses (for both mouse and rat), while the microarray analysis showed slightly stronger upregulation, suggesting subtle differences between the methodologies (the mouse strain used in the microarray experiment was also different).

Unlike the bulk transcriptome data, the scRNAseq data can be used to quantify the effect of expression domain size on the overall expression level of a gene. The importance of expression domain size is well evident in the scRNAseq data when we calculated the number of cells that express each gene (Section 2). The data shows that the overall tissue level gene expression is highly correlated with the cell population size (Figure 5a). In other words, the size of the expression domain is the key driver of expression levels measured at the whole tissue level.

To examine the cell-level patterns further, we calculated the mean transcript abundance of each gene across the cells that express that gene (see Section 2). This metric approximates the cell-level upregulation of a particular gene, and is thus independent of the size of the expression domain. We calculated the transcript abundance values for progression, shape, tissue, double, and dispensable category genes in each cell that expresses any of those genes. The resulting mean transcript abundances were contrasted to those of the dispensable category (Section 2). The results show that the average transcript abundance is high in the progression category, whereas the other categories show roughly comparable transcript abundances (Figure 5b). Considering that the progression category genes have highly heterogeneous expression patterns (e.g., Nieminen et al., 1998; Figure 5c), their high cell-level transcript abundance (Figure 5b) is suggestive of their critical role at the cell level. That is, progression category genes are highly expressed at the tissue level not only because they have broad expression domains, but because they are upregulated in individual cells irrespective of domain identity or size. These results suggest that high cell-level transcript abundance is a system-level feature of genes essential for the progression of tooth development, a pattern that seems to be shared with essential genes of single-cell organisms (Dong et al., 2018). We note that although the dispensable category has several genes showing expression levels comparable to those of the progression category genes at the tissue level (Figure 2), their cell-level transcript abundances are predominantly low (Figure 5b).
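The two cell-level metrics described in this section (the number of cells expressing each gene, i.e. expression domain size, and the mean transcript abundance among only the expressing cells) can be sketched on a toy genes-by-cells count matrix. The matrix here is simulated, not the actual scRNAseq data:

```r
# Hedged sketch: genes-by-cells count matrix with simulated Poisson counts.
set.seed(2)
counts <- matrix(rpois(5 * 100, lambda = 0.5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:100)))

# Metric 1: expression domain size = number of cells with nonzero counts
n_expressing <- rowSums(counts > 0)

# Metric 2: mean transcript abundance over only the expressing cells
# (pmax guards against division by zero for genes expressed in no cell)
mean_abundance <- rowSums(counts) / pmax(n_expressing, 1)
```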

Next we examined more closely the differences between progression and shape category genes, and to what extent the upregulation of the keystone genes reflects the overall expression of the corresponding pathways.

3.4 Keystone gene upregulation in the context of their pathways

In our data the developmental-process genes appear to have slightly elevated expression levels compared to the other protein-coding genes (Figures 2, 3, and 4b), suggesting an expected and general recruitment of the pathways required for organogenesis. To place the progression and shape category genes into the specific context of their corresponding pathways, we investigated whether the pathways implicated in tooth development show elevated expression levels in the E14 mouse bulk RNAseq data. Six pathways, Fgf, Wnt, Tgfβ, Hedgehog (Hh), Notch, and Ectodysplasin (Eda), contain the majority of the progression and shape genes (Section 2). We manually identified 272 genes belonging to these six pathways (Section 2 and Table S4). Comparison of the median expression levels of the six-pathway genes with the developmental-process genes shows that the pathway genes are a highly upregulated set of genes (Figure 6a; p < .0001, random resampling). This difference suggests that the experimentally identified progression and shape genes might be highly expressed partly because they belong to developmentally upregulated pathways. To test this possibility specifically, we contrasted the expression levels of the progression and shape genes to those of the genes of their corresponding signaling families.

The 15 progression category genes belong to four signaling families (Wnt, Tgfβ, Fgf, Hh) comprising 221 genes in our tabulations. Even though these pathways are generally upregulated in the E14 tooth, the median expression level of the progression category is still further elevated (Figure 6b; p < .0001). In contrast, the analyses for the 28 shape category genes and their corresponding pathways (272 genes from Wnt, Tgfβ, Fgf, Hh, Eda, Notch) show comparable expression levels (Figure 6c; p = .5919). Whereas this contrasting pattern between progression and shape genes within their pathways may explain the subtle upregulation of the shape category (Figure 2), the difference warrants a closer look. Examination of the two gene categories reveals that, compared to the progression category genes, a relatively large proportion of the shape category genes are ligands (36% of shape genes compared to 20% of progression genes, Appendix S1). In our E14 scRNAseq data, ligands generally show smaller expression domains than other genes (roughly half the size, Figure 6d,e), and the low expression of the shape category genes seems to be at least in part driven by the ligands (Figure 6c and Table S5).

Overall, the upregulation of the keystone genes within their pathways appears to be influenced by the kind of proteins they encode. In this context it is noteworthy that patterning of tooth shape requires spatial regulation of secondary enamel knots and cusps, providing a plausible explanation for the high proportion of genes encoding diffusing ligands in the shape category.

