What is a volcano plot
Visualization of RNA-Seq results with Volcano Plot
Overview
How to generate a volcano plot from RNA-seq data?
Create a volcano plot of RNA-seq data to visualize significant genes
Introduction
Agenda
In this tutorial, we will deal with:
Preparing the inputs
We will use two files for this analysis:
Import data
Click the new-history icon at the top of the history panel.
If the new-history is missing:
Import the differential expression results table.
To import the file, there are two options:
Paste the link into the text field
Press Start
Close the window
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
You can paste the links below into the Paste/Fetch box:
Select “Type”: tabular
Create volcano plot highlighting significant genes
First we will create a volcano plot highlighting all significant genes. We will call genes significant here if they pass both the FDR threshold (0.01 in this tutorial) and the log fold change threshold.
Hands-on: Create a Volcano plot
In the plot above the genes are coloured if they pass the thresholds for FDR and Log Fold Change, red if they are upregulated and blue if they are downregulated. You can see in this plot that there are many (hundreds) of significant genes in this dataset.
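The colouring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by the volcano plot tool: the function name `classify_gene`, the example rows, the grey colour for non-significant genes, and the log fold change cutoff of 1.0 are all assumptions. Only the FDR cutoff of 0.01 and the red/blue convention come from the tutorial text.

```python
def classify_gene(log_fc, fdr, fdr_cutoff=0.01, log_fc_cutoff=1.0):
    """Return 'up', 'down', or 'ns' (not significant) for one gene.

    fdr_cutoff follows the tutorial (0.01); log_fc_cutoff is an assumed value.
    """
    if fdr < fdr_cutoff and abs(log_fc) >= log_fc_cutoff:
        return "up" if log_fc > 0 else "down"
    return "ns"


# Hypothetical rows (gene, logFC, FDR); the values are made up for illustration.
rows = [
    ("geneA", 2.3, 0.0001),   # passes both thresholds, positive change
    ("geneB", -3.1, 0.002),   # passes both thresholds, negative change
    ("geneC", 0.2, 0.0005),   # FDR passes, fold change too small
    ("geneD", 1.8, 0.2),      # fold change passes, FDR too high
]

# Red and blue follow the tutorial; grey for the rest is an assumption.
colours = {"up": "red", "down": "blue", "ns": "grey"}

for gene, log_fc, fdr in rows:
    status = classify_gene(log_fc, fdr)
    print(gene, status, colours[status])
```

A gene must pass both thresholds to be coloured, which is why `geneC` and `geneD` in this made-up example stay uncoloured.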
Why does the y axis use a negative P value scale?
The negative log of the P values is used for the y axis so that the smallest P values (most significant) are at the top of the plot.
Create volcano plot labelling top significant genes
You can also choose to show the labels (e.g. Gene Symbols) for the significant genes with this volcano plot tool. You can select to label all significant or just the top genes. The top genes are those that pass the FDR and logFC thresholds that have the smallest P values. As there are hundreds of significant genes here, too many to sensibly label, let’s label the top 10 genes.
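The selection of top genes described above (filter by the FDR and logFC thresholds, then take the smallest P values) might look like the following sketch. The function `top_genes`, the dictionary keys, the logFC cutoff of 1.0, and the example values are hypothetical and are not taken from the tool itself.

```python
def top_genes(rows, n=10, fdr_cutoff=0.01, log_fc_cutoff=1.0):
    """Return the names of the n passing genes with the smallest P values.

    rows is a list of dicts with the assumed keys
    'gene', 'log_fc', 'fdr' and 'p_value'.
    """
    # Keep only genes that pass both thresholds.
    passing = [
        r for r in rows
        if r["fdr"] < fdr_cutoff and abs(r["log_fc"]) >= log_fc_cutoff
    ]
    # Smallest P value first.
    passing.sort(key=lambda r: r["p_value"])
    return [r["gene"] for r in passing[:n]]


# Made-up example data.
example_rows = [
    {"gene": "geneA", "log_fc": 2.3, "fdr": 0.0001, "p_value": 1e-8},
    {"gene": "geneB", "log_fc": -3.1, "fdr": 0.002, "p_value": 1e-12},
    {"gene": "geneC", "log_fc": 0.2, "fdr": 0.0005, "p_value": 1e-15},
    {"gene": "geneD", "log_fc": 1.5, "fdr": 0.004, "p_value": 1e-5},
]

print(top_genes(example_rows, n=2))
```

Note that `geneC` has the smallest P value in this example but is not labelled, because its fold change does not pass the threshold.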
As in the previous plot, genes are coloured if they pass the thresholds for FDR and Log Fold Change (red for upregulated and blue for downregulated), and the top genes by P value are labelled. In the plot above we can now easily see which genes are the top genes by P value and which of them have larger fold changes.
Which gene is the most statistically significant with large fold change?
Csn1s2b, as it is the gene nearest the top of the plot and it is also far to the left. This gene is a calcium-sensitive casein that is important in milk production. As this dataset compares lactating and pregnant mice, it makes sense that it is a gene that is very differentially expressed.
Create volcano plot labelling genes of interest
We can also label one or more genes of interest in a volcano plot. This enables us to visualize where these genes are in terms of significance and in comparison to the other genes. In the original paper using this dataset, there is a heatmap of 31 genes in Figure 6b (see the tutorial here if you would like to see how to generate the heatmap). These genes are a set of 30 cytokines/growth factors identified as differentially expressed, plus the authors’ main gene of interest, Mcl1. These genes are provided in the volcano_genes file and shown below. We will label these genes in the volcano plot. We’ll add boxes around the labels to highlight the gene names.
Which gene of interest is the most statistically significant?
As in the previous plots, genes are coloured if they pass the thresholds for FDR and Log Fold Change. Here all the genes of interest are significant (red or blue) except for two genes, Mcl1 and Gmfg. Gmfg has an FDR of 0.0105, just outside the significance threshold of 0.01 that we used. Mcl1 is the authors’ gene of interest, and they showed that while its expression did increase at the protein level, it did not increase at the transcription level, as we can see here, suggesting it is regulated post-transcriptionally.
Conclusion
A volcano plot can be used to quickly visualize significant genes in RNA-seq results
Frequently Asked Questions
Useful literature
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
References
Volcano plot (statistics)
From Wikipedia, the free encyclopedia
In statistics, a volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets composed of replicate data. [1] It plots significance versus fold-change on the y and x axes, respectively. These plots are increasingly common in omic experiments such as genomics, proteomics, and metabolomics where one often has a list of many thousands of replicate data points between two conditions and one wishes to quickly identify the most meaningful changes. A volcano plot combines a measure of statistical significance from a statistical test (e.g., a p value from an ANOVA model) with the magnitude of the change, enabling quick visual identification of those data-points (genes, etc.) that display large magnitude changes that are also statistically significant.
A volcano plot is constructed by plotting the negative logarithm of the p value on the y axis (usually base 10). This results in data points with low p values (highly significant) appearing toward the top of the plot. The x axis is the logarithm of the fold change between the two conditions. The logarithm of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left- or right-hand sides. These represent values that display large magnitude fold changes (hence being left or right of center) as well as high statistical significance (hence being toward the top).
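The construction above can be illustrated with a small Python sketch. The helper name `volcano_coordinates` is hypothetical, and base 2 for the fold-change logarithm is an assumption (the paragraph only fixes base 10 as the usual choice for the p value). The example shows why the logarithm is used: a doubling and a halving land at the same distance from the center.

```python
import math


def volcano_coordinates(fold_change, p_value):
    """Map a (fold change, p value) pair to volcano plot (x, y) coordinates.

    x: log2 of the fold change (base 2 is an assumed, common choice)
    y: negative log10 of the p value
    """
    return math.log2(fold_change), -math.log10(p_value)


# A doubling and a halving with the same p value (illustrative numbers).
x_up, y_up = volcano_coordinates(2.0, 0.001)
x_down, y_down = volcano_coordinates(0.5, 0.001)

# Equidistant from the center: x_up == -x_down
print(x_up, x_down)

# A smaller p value gives a larger y, so it sits higher on the plot.
_, y_more_significant = volcano_coordinates(2.0, 0.00001)
print(y_up, y_more_significant)
```

On an untransformed axis the halving (0.5) would sit much closer to "no change" (1.0) than the doubling (2.0), even though both represent a change of the same magnitude.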
Additional information can be added by coloring the points according to a third dimension of data (such as signal intensity), but this is not uniformly employed. Volcano plots are also used to graphically display a significance analysis of microarrays (SAM) gene selection criterion, an example of regularization. [2]
Volcano plots show a characteristic upward two-armed shape because the x-axis values, i.e. the underlying log2 fold changes, are generally normally distributed, whereas the y-axis values, the negative log10 p values, tend toward greater significance for fold changes that deviate more strongly from zero. The density of the standard normal distribution takes the form
$$y = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}x^2},$$
so its negative logarithm is
$$-\log(y) = \frac{1}{2}x^2 + \frac{1}{2}\log(2\pi),$$
which is a parabola whose arms reach upwards on the left and right sides. The upper bound of the data is one parabola and the lower bound is another parabola.
Volcano Plot: Everything you need to know
What is a volcano plot?
A volcano plot is a type of scatter-plot that can be used to quickly identify meaningful changes from within a very large data set. Volcano plots do this by plotting a measure of the statistical significance of a change (e.g., p-value) on the y-axis, versus the magnitude of the change (fold-change) on the x-axis.
When are volcano plots used?
Volcano plots are increasingly popular in ‘omics’ type experiments (e.g., genomics, proteomics, and metabolomics) that typically compare two conditions (e.g., wild-type vs. mutant or healthy vs. disease) and involve many thousands of replicate data points. By separating these data by the magnitude of the difference between the two conditions (on the x-axis) and the statistical significance of that difference (on the y-axis), it’s possible to quickly pick out those data points (e.g., genes or proteins) that display a large magnitude change but are also statistically significant.
How are volcano plots made?
A volcano plot is constructed by plotting the negative log of the p-value on the y-axis (usually base 10). This results in data points with low p-values (highly significant) appearing toward the top of the plot. The x-axis displays the fold-change between the two conditions; this is plotted as the log of the fold-change so that changes in both directions appear equidistant from the centre. Data sets plotted in this way often resemble an erupting volcano, which accounts for the name. Those data points in the top-right and top-left sectors are those of most interest because they are the most different between the two conditions and with high statistical confidence about that difference.
Let’s consider an example
Here we’ll use data from a proteomics experiment comparing wild-type plants versus mutant plants, with the aim of quickly identifying those proteins that have a very different abundance under these conditions. But this example is applicable for any situation where you are comparing two conditions, have replicate data, and many data points.
In this example we have two conditions (wild-type and mutant), replicate data (×3 replicates for wild-type and ×3 replicates for the mutant), and many data points (for around 1300 proteins).
Let’s consider the data for just one of those proteins, called Q9M0A7. In the wild-type, the abundance values for Q9M0A7 were 258, 310, and 297 in our three replicates. In the mutant condition, by contrast, the abundance values for Q9M0A7 were 18, 8, and 30.
This is equivalent to a fold-change of around 15 (a big change: there is around 15 times more Q9M0A7 in the wild-type than in the mutant). Taking the log2 of the fold-change then gives a value of 3.9, which can be plotted on the x axis of our volcano plot.
When calculating the significance of this difference using a t-test, we get a p-value of 0.000086 (highly significant). Then, after calculating the negative log10 of the p-value, we can plot 4.06 on the y axis of our volcano plot.
Because Q9M0A7 had a large magnitude change when comparing our two samples and this change was highly significant, it falls into the top-right sector of our volcano plot, where it can be easily picked out as a protein of interest.
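The arithmetic for Q9M0A7 can be retraced with a short Python sketch. It assumes the fold-change is the ratio of the replicate means (wild-type over mutant) and that the x axis uses a base 2 logarithm. The p-value is copied from the text rather than recomputed, because the exact t-test variant used is not stated.

```python
import math
from statistics import mean

# Abundance values for Q9M0A7 quoted in the text.
wild_type = [258, 310, 297]
mutant = [18, 8, 30]

# Assumption: fold-change = mean(wild-type) / mean(mutant).
fold_change = mean(wild_type) / mean(mutant)
log2_fc = math.log2(fold_change)

# p-value as reported in the text (not recomputed here).
p_value = 0.000086
neg_log10_p = -math.log10(p_value)

print(f"fold-change      = {fold_change:.1f}")
print(f"log2 fold-change = {log2_fc:.2f}")
print(f"-log10(p-value)  = {neg_log10_p:.2f}")
```

Under these assumptions the fold-change comes out at about 15.4 and its log2 at about 3.95, in line with the values of around 15 and 3.9 quoted above. The y value comes out at about 4.07 rather than 4.06, presumably because the p-value quoted in the text is itself rounded.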
Volcano Plot
Plots significance versus fold-change for gene expression rates.
Signals
Inputs:
Data
Outputs:
Selected data
Description
Volcano plot is a graphical method for visualizing changes in replicate data. The widget plots a binary logarithm of fold-change on the x-axis versus statistical significance (negative base 10 logarithm of p-value) on the y-axis.
Example
Below you can see a simple workflow for Volcano Plot. We use Caffeine effect: time course and dose response data from GEO Data Sets widget and visualize them in a Data Table. We have 6378 genes in the input, so it is essential to prune the data and analyse only those genes that are statistically significant. Volcano Plot helps us do exactly that. Once the desired area is selected in the plot, we output the data and observe them in another Data Table. Now we get only 80 instances, which were those genes that had a high normalized fold change under high dose of caffeine and had a low p-value at the same time.
Volcano plot in Python
What is Volcano plot?
Applications
How to create Volcano plot in Python?
Note: If you have your own dataset, you should import it as pandas dataframe. Learn how to import data using pandas
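The article's own code listing is not included in this text, so the following is only a generic stand-in written with matplotlib, not the author's code. The function names `prepare_volcano` and `plot_volcano`, the thresholds (log2 fold change of 1 and p value of 0.05), the grey colour for non-significant genes, the output file name, and the example rows are all assumptions. The green/red convention follows the caption below.

```python
import math


def prepare_volcano(rows, lfc_thr=1.0, pv_thr=0.05):
    """Turn (gene, log2FC, p_value) rows into (gene, x, y, colour) points.

    Thresholds are assumed defaults; adjust them for your own data.
    """
    points = []
    for gene, log2_fc, p_value in rows:
        if p_value < pv_thr and log2_fc >= lfc_thr:
            colour = "green"   # upregulated
        elif p_value < pv_thr and log2_fc <= -lfc_thr:
            colour = "red"     # downregulated
        else:
            colour = "grey"    # not significant (assumed colour)
        points.append((gene, log2_fc, -math.log10(p_value), colour))
    return points


def plot_volcano(points, path="volcano.png"):
    """Draw the scatter plot; requires matplotlib to be installed."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(
        [p[1] for p in points],
        [p[2] for p in points],
        c=[p[3] for p in points],
        s=12,
    )
    ax.set_xlabel("log2 fold change")
    ax.set_ylabel("-log10(p value)")
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Made-up example rows: (gene, log2FC, p value).
rows = [
    ("gene1", 2.5, 0.0001),
    ("gene2", -1.8, 0.003),
    ("gene3", 0.3, 0.6),
    ("gene4", 1.6, 0.2),
]
points = prepare_volcano(rows)
# plot_volcano(points)  # uncomment if matplotlib is available
```

Keeping the data preparation separate from the drawing step makes it easy to change thresholds or colours without touching the plotting code.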
Generated volcano plot by above code (green: upregulated and red: downregulated genes),
Change background theme to dark,
Add legend to the plot and adjust the legend position,
Change color of volcano plot
Change log fold change and p value threshold,
Change transparency of volcano plot
Change the shape of the points
Change the shape and size of the points
Add gene labels (text style) to the points,
Add gene labels (box style) to the points,
Add gene names instead of gene Ids,
Add threshold lines,
Change X and Y range ticks, font size and name for tick labels
To create an inverted volcano plot,
Generated inverted volcano plot by adding above code,
Change color of inverted volcano plot
Add gene names instead of gene Ids,
Updated: November 20, 2021