Diversity of the sequencing sample with Shannon entropy in R

In this article, I would like to talk about the visualization of the sample diversity and the Shannon entropy usage for it.

My background is bioinformatics, so I will use a normalized RNAseq dataset in this article. As an alternative, you can use any other dataset, containing multiple features per sample and numerical values for each feature. You may always generate a random dataset or generate a synthetic dataset.

Each time I have my raw sequencing data preprocessed and normalized I am looking forward to my favorite part — visualization. For me this is the most interesting and inspiring part, I just love visualizations. Apart from my personal feeling towards it, visualization is usually a fast way to find some interesting patterns, uncover insights or point to the problems from the beginning.

Visualized data are usually easier to understand and sometimes it lets us show complex patterns without explaining them with words. There are many visualization techniques used in bioinformatics and in this article I would like to show a way of exploring the diversity of the sequencing samples using the Shannon entropy.

Shannon entropy could be used as a numerical way of measuring the sample diversity. The entropy, measure of disorder, reflects the diversity of each sample and represents if there are any overrepresented sequences or transcripts, responsible for the highest amount of reads.

The interpretation of the entropy value in this application is the following: the higher is the entropy value, the more diverse our sample is and the less overrepresented sequences there are in the sample. By the overrepresented sequences, I mean sequences, representing a high percentage of the total amount of reads in the sample.

The mathematical equation for the Shannon entropy is following, where pᵢ is a frequency of the element i, in our case sequence, appearing in the dataset:

This will give a numerical output for each sample, which we may use to compare the samples between each other. But this number does not answer the question: is there a sequence taking the biggest part of the sample diversity?

This is where the visualization comes into play and it could help us to see the representation percentage of each individual sequence in our dataset.

Now let’s move to the technical implementation. Step one is to get data. In this article, we do not start from the raw data import, rather from the normalized counts corrected for sequencing depth and transcriptome composition bias(CPM).

I use openxlsx package to import data from an excel file.

library(openxlsx, quietly = TRUE)
path = file.path(dir,"counts.xlsx")
df <- read.xlsx(path, sep="\\t", rowNames = TRUE)

The next step is to get the frequency of each transcript in each sample. To do so, we may simply divide the transcript count by the total amount of transcripts in the sample.

df_freq <- apply(df,2,function(x){x/sum(x)})

Next, we will calculate the entropy for each sample according to the previously mentioned formula.

Here we will need the plyr package.

total_entropy <- -colSums(apply(df_freq,2,function(x){x*log(x)}), na.rm = TRUE)df_total_entropy <- setNames(ldply(total_entropy, data.frame), c("sample", "df_total_entropy"))df_total_entropy

Now we already have a data frame with the entropy value per sample. But we also want to see what is responsible for this entropy level: is there one sequence representing the majority of the transcripts or are there several sequences representing a big part of the sequences.

To visualize the reasons for the entropy levels let us also prepare a data frame with the frequencies of the transcripts in each sample in a long format using the reshape2 package.

After converting to the long format we will also assign a data type to each column.

# create long dataframe
df_ent_melt <- setNames(melt(as.matrix(df_freq), na.rm = FALSE), c("Transcript", "Sample", "Value"))
# assign column data types
df_ent_melt <- transform(df_ent_melt,
Transcript = as.factor(Transcript),
Sample = as.factor(Sample),
Value = as.numeric(Value))

Now we are ready to visualize!

We will make two plots under each other.

The top plot represents the entropy values per sample.

The bottom plot shows the transcript representations in each sample. The transcripts keep the color across all samples.

We will relevel the factors in the first plot to get the samples in the correct order. The strings are ordered differently, then numerical values in R, so if we do not relevel the factor, a fancy way of saying reorder the levels of a factor, we will get the column in alphanumeric order: Sample_1, Sample_10, Sample_11, Sample_2 instead of the expected: Sample_1, Sample_2, etc.

For the bottom plot, we will create a color palette from the rainbow palette by repeating 32 colors from the rainbow palette 400 times. These will be enough color values for the given amount of transcripts. There is no need to create unique colors since the human eye will not be able to distinguish them. The colors are uniquely assigned to each transcript and remain the same across samples.

In the bottom plot, the transcripts are reordered by their percentage. We are reordering the long data frame, so the transcripts will go from the one with the highest values among all transcripts among all samples to the lowest values among all transcripts among all samples.

No legend since this is a visualization to see the patterns and dive deeper. We do not aim to find which transcripts are overrepresented in this visualization, just to find whether there are any we should pay attention to.

par(mfrow = c(1,2))# Top entropy plot
# relevel factor to display in the correct order
aes(x=forcats::fct_relevel(sample,c(paste0("Sample_", seq(1, ncol(df_vis), 1)))), y=df_total_entropy))+
geom_bar(stat="identity", fill = "#7570B3") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(y="Entropy", x = "") +
theme(legend.position = "none")
# Create color pallete
# Bottom entropy plot
# no legend
# transcripts ordered from highest to lowest values
aes(fill=reorder(Transcript, Value), y = Value, x = Sample))+
geom_bar(position = "fill", stat = "identity") +
theme(legend.position = "none") +
scale_fill_manual(values=values) +
theme(axis.text.x = element_text(angle = 90)) +
labs(y="Percentage of transcript", x = "")

As a result, we got an impressively bright plot, which also appears to be informative.

These plots point us to to the samples, which have a highly overrepresented transcript. In the next steps of the analysis, we will most probably want to look closer to these samples and identify which transcript is overrepresented, why this happened and whether we can proceed with these samples.

I am sure this visualization could be utilized in a different area and might give you some preliminary results about your data. Feel free to use it and share your results!

My fav drink is gene and tonic. Adding informatics to my bio.