Plant Secondary Metabolite Analysis
home download help about contact
Server status: unknown
Running jobs: unknown
Queue length: unknown
Oldest job in queue submitted
Long runtime queue: unknown
Oldest job in queue submitted
Jobs processed: unknown

How to use plantiSMASH

News

Version 1.0.beta of plantiSMASH is live now!
plantiSMASH is an extension of the widely used tool antiSMASH, now optimized for and made to work on plant genomes. To top it off, these are the added features of plantiSMASH ver 1.0.beta:
  • Specific library of pHMM models and completely novel cluster calling logic designed to work on plant genomes
  • Specific plant clusterblast database from precalculated plant biosynthetic gene clusters
  • Specific training model for gene finding using glimmerHMM
  • Homology based metabolic modeling
  • Coexpression analysis and html-based visualization on detected biosynthetic gene clusters with the CoExpress module.
Besides these, all the usual great features found in antiSMASH are also available in plantiSMASH.

Introduction

Plants around the globe produce a wide variety of specialized metabolites that play key roles in communication and defense. Recently, evidence has been accumulating that (like in microbes) the genes encoding the biosynthetic pathways towards these metabolites are often densely clustered in specific genomic loci: biosynthetic gene clusters (BGCs). This offers great potential for genome-based discovery of plant natural products.

Here, we introduce plantiSMASH, a versatile online analysis platform that automates the identification of candidate plant BGCs, as well as their comparative genomic and transcriptomic analysis. The cluster detection logic, validated on a set of all plant BGCs that have been experimentally characterized thus far, is able to pinpoint many complex metabolic loci across the Plant Kingdom. Additionally, interactively visualized coexpression analysis and comparative cluster-cluster alignment allow users to judge multiple sources of evidence for a candidate BGC to encode a group of enzymes that truly functions jointly in a biosynthetic pathway. Furthermore, plantiSMASH finds coexpression correlations between candidate BGCs and genes elsewhere in the genome.

Altogether, this new software provides a comprehensive toolkit for plant geneticists to further explore the nature of gene clustering in plant metabolism. Moreover, spurred by the continuing decrease in costs of plant genome sequencing and assembly, it will soon allow natural product chemists to apply genome mining technologies to the discovery of novel medicinal compounds from a wide range of plant species.

plantiSMASH input parameters

The ideal input for plantiSMASH is an annotated nucleotide file in Genbank format or EMBL format. You can either upload a GenBank/EMBL file manually, or simply enter the GenBank/RefSeq accession number of your sequence for plantiSMASH to upload it. Alternatively, you can provide a FASTA file containing one or more sequences. If possible, you should also supply a separate GFF3-formatted file containing the annotation information for all said sequences. If no annotation is available, we recommend running your sequence through an annotation pipeline like MAKER (preferably with transcriptomic evidence) to obtain GBK/EMBL files with high-quality annotations. Otherwise, plantiSMASH will generate a preliminary annotation using GlimmerHMM, and use that to run the rest of the analysis.

Input files should be properly formatted. If you are creating your GBK/EMBL/FASTA file manually, be sure to do so in a plain text editor like Notepad or Emacs, and saving your files as "All files (*.*)", ending with the correct extension (for example ".fasta", ".gbk", or ".embl").

There are several optional analyses that may or may not be run on your sequence. Highly recommended is the ClusterBlast Comparative Analysis, which runs Diamond using each amino acid sequence from a detected gene cluster as a query on a large database of predicted protein sequences from secondary metabolite biosynthetic gene clusters, and pools the results to identify the gene clusters that are most homologous to the gene cluster that was detected in your query nucleotide sequence.

Also available is the analysis of gene coexpression from transcriptomics data. You can supply one or more additional file(s) either in GEO compliant *.soft format or in a simple CSV format (see CoExpress CSV file format); plantiSMASH will then generate several powerful visualizations to help you see how genes interacts in terms of expression in a set conditions, both within and between clusters.

plantiSMASH output

The output of the plantiSMASH analysis pipeline is organized in an interactive HTML page with SVG graphics, and different parts of the analysis are displayed in different panels for every gene cluster (see example output file on the Arabidopsis thaliana, or other species in our precalculated results page).

Initially, a list of identified clusters is displayed in the results page. A gene cluster can be selected for viewing by clicking its number (gene clusters are numbered in the order in which they appear on the input nucleotide sequence) in the "Select Gene Cluster" panel just below the top banner or by clicking on the colored "Cluster XX" boxes. A click on "Overview" brings you back to the overview list.
Gene cluster buttons are color-coded by predicted secondary metabolite type.

In the upper panel, "Gene cluster description", information is given about each gene cluster that was detected. In the upper line, the biosynthetic type and location of the gene cluster are displayed. Underneath this title line, all genes present in a detected gene cluster are outlined.

Genes are color-coded by their assigned enzymatic protein domain profile, which will be listed in the legend. Additionally, genes will also be categorized by its predicted function, which will be displayed as different color borders.

Hovering over a gene with the mouse will prompt the gene name to be displayed above the gene. Clicking the gene will provide more information on the gene: its annotation, its sequence similarity with other genes in the cluster, its location, and cross-links specific to that gene.

If coexpression data was provided (through either a .soft or .csv file), this panel will show expression information through both a hierarchically clustered heatmap and a coexpression network (see below).

You can choose to show either expression fluctuation (the rate of which expression level of a gene changes between samples), color-coded from white to black; or expression intensity (expression level of a gene related to the sample value distribution), color coded from yellow to red.

In the correlation network graph, you can see how genes within the cluster (box-shaped nodes) interact with each other, and with other genes in other clusters (ellipse-shaped nodes with solid edges and the corresponding cluster number inside) or anywhere else on the genome (ellipse-shaped nodes with dashed edges).

Additionally, by enabling the coexpression analysis, you will also get a Hiveplot overview of significant cluster-cluster interactions detected in the selected transcriptomics dataset. This can be accessed in the cluster overview screen.

If you enabled Gene Cluster Comparative Analysis (ClusterBlast), this panel will display the top ten plant gene clusters from precomputed data that are most similar to a detected gene cluster, visually aligned to it. The drop-down selection menu can be used to browse through the gene clusters. Genes with the same colour are putative homologs based on significant Blast (Diamond) hits between them.

In the upper right, a small list of buttons offers further functionality. The house-shaped button will get you back to the plantiSMASH start page. The question-mark button will get you to this help page. The exclamation-mark button leads to a page explaining about plantiSMASH. The downward-pointing arrow will open a menu offering to download the complete set of results from the antiSMASH run, a summary Excel file and to the summary EMBL/GenBank output file. The EMBL/GenBank file can be viewed in a genome browser such as Artemis.

CoExpress CSV file format

This is an example of CSV format accepted by plantiSMASH-CoExpress:

The first two lines containing #title and #desc should be self-explanatory. These are then followed by the comma separated rows. The first row would be the header, containing each sample / condition name. The next rows are for the actual expression data, with the first column containing the locus tag directly matching to the supplied genome annotation, and the next columns containing expression values of the gene for each sample /condition.

The number of columns in each row should be identical. In case of no expression data for a specific gene in a specific sample, simply put 0 or an empty character in the space between the commas. Also, one important thing to note is that the expression values should be comparable across all samples, i.e. they are measured and normalized in the same way (log values, signal count data, etc).

Frequently Asked Questions

  1. Can I run plantiSMASH locally as a stand-alone program?
    Yes. A stand-alone version of plantiSMASH is available from the download section of this website. PlantiSMASH natively run on python 2.7 under GNU/Linux, but will also run on other environment as well (OSX / Windows), given the dependency requirements are fulfilled.
  2. Why doesn't plantiSMASH detect my gene cluster?
    Some gene clusters, such as fatty acid or cofactor biosynthesis gene clusters are not identified by antiSMASH, as they are categorized as primary metabolism instead of secondary metabolism. If you are aware of a true secondary metabolite biosynthesis gene cluster that escapes detection by antiSMASH, please contact us, and we will add the models necessary to detect it.
  3. Why are several genes upstream and downstream of gene clusters included, even though they do not seem to be part of the gene cluster?
    In designing plantiSMASH, we tried to be very conservative in cutting gene cluster borders, leaving this to the expert eye of the user, as it is better to show some extra genes than to leave out key genes.
  4. Can I also submit an unannotated genome sequence in FASTA format?
    Yes. We offer integrated preliminary gene prediction by GlimmerHMM based on a FASTA input. If you want the highest possible quality of results, we recommend you to use an annotation pipeline like MAKER (with transcriptomic evidence) first to obtain high-quality gene predictions in GBK or GFF format.
  5. What is the privacy policy of antiSMASH concerning my sequence data?
    We try to keep this site and the data that it analyzes as safe and secure as possible. Your output files will be deleted from our server within one month. However, sending your data to the web site is at your own risk. If you are concerned about the sensitivity of your data, please use the stand-alone version of our tool.

Secondary metabolite types detected

alkaloid
Putative alkaloid biosynthetic cluster. Characterized by the presence of Bet_v_1/Cu_amine_oxid/Str_synth enzymes.
lignan
Putative lignan biosynthetic cluster. Characterized by the presence of Dirigent enzyme.
alkaloid
Putative polyketide biosynthetic cluster. Characterized by the presence of Dirigent enzymes.
saccharide
Putative saccharide biosynthetic cluster. Characterized by the presence of Chal_sti_synt enzymes.
terpene
Putative terpene biosynthetic cluster. Characterized by the presence of Terpene_synth/Prenyltrans/SQHop_cyclase/PRISE enzymes.
putative
Putative / uncategorized biosynthetic cluster
{A} – {B}
Biosynthetic cluster containing signature traces from both A and B secondary metabolite type.