Computational Molecular Biology, 2014, Vol. 4, No. 9 doi: 10.5376/cmb.2014.04.0009
Received: 03 Sep., 2014 Accepted: 25 Sep., 2014 Published: 23 Oct., 2014
Patel et al., 2014, Comparative study of five Legume species based on De Novo Sequence Assembly and Annotation, Computational Molecular Biology, Vol.4, No.9, 1-6 (doi: 10.5376/cmb.2014.04.0009)
Legume species include important oilseed crops of tropical and subtropical regions of the world. Next-generation sequencing of transcripts, termed RNA-seq, has recently provided a powerful approach for analysing the transcriptome. This study focuses on RNA-seq data from the NCBI database for five legume species: Arachis hypogaea L. (peanut; SRR1212866), Cicer arietinum L. (SRR627764), Phaseolus vulgaris L. (SRR1283084), Trigonella foenum-graecum L. (SRR066197) and Vicia sativa L. (SRR403901). The comparison covers the reads generated, the contigs produced by sequence assembly and their N50, the contigs that matched known proteins and genes, the number of genes annotated with Gene Ontology (GO) functional categories, and the sequences mapped to pathways by searching against the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. These data will be useful for gene discovery and functional studies, and the large number of transcripts reported in the current study will serve as a valuable genetic resource for these five legume species.
Next-generation sequencing for high-throughput RNA sequencing (transcriptome sequencing) is increasingly the technology of choice for detecting and quantifying known and novel transcripts in plants. This method of transcriptome analysis is fast and simple because it does not require cloning of the cDNAs. Direct sequencing of the cDNAs generates short reads at extraordinary depth, and after sequencing the reads can be assembled into a genome-scale transcription profile. It is a more comprehensive and efficient way to measure transcriptome composition, obtain RNA expression patterns, and discover new exons and genes (Mortazavi et al., 2008; Wang et al., 2009). In this study, the transcriptome sequencing data were assembled using assembly tools, and functional annotation of genes and pathway analysis were carried out with various bioinformatics tools. The large number of transcripts reported in the current study will serve as a valuable genetic resource for the five legume species described.
1.1 Sequence Retrieval
This study focuses on the de novo assembly and sequence annotation of transcriptome data from the NCBI database for five legume species: Arachis hypogaea L. (peanut; SRR1212866), Cicer arietinum L. (SRR627764), Phaseolus vulgaris L. (SRR1283084), Trigonella foenum-graecum L. (SRR066197) and Vicia sativa L. (SRR403901). The raw data were downloaded from the NCBI SRA (http://trace.ncbi.nlm.nih.gov/Traces/sra/) and were generated on the Illumina HiSeq 2000 platform and the LS454 platform (454 GS FLX). The raw sequences were converted into FASTQ format for further analysis using the SRA Toolkit from NCBI (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software).
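The retrieval step can be sketched in Python as follows. This is a minimal illustration and not the authors' script: it only builds `fastq-dump` command lines (the SRA Toolkit converter) for the five accessions listed above and does not execute them. The helper name `build_fastq_dump_commands` and the output directory `fastq` are assumptions made for the example.

```python
# Species -> SRA run accession, as listed in the text.
ACCESSIONS = {
    "Arachis hypogaea L.": "SRR1212866",
    "Cicer arietinum L.": "SRR627764",
    "Phaseolus vulgaris L.": "SRR1283084",
    "Trigonella foenum-graecum L.": "SRR066197",
    "Vicia sativa L.": "SRR403901",
}


def build_fastq_dump_commands(accessions, outdir="fastq"):
    """Return one fastq-dump argument list per accession (nothing is executed).

    `outdir` is a hypothetical directory name chosen for this sketch.
    """
    return [["fastq-dump", "--outdir", outdir, run] for run in accessions.values()]


if __name__ == "__main__":
    for command in build_fastq_dump_commands(ACCESSIONS):
        print(" ".join(command))
```

On a machine with the SRA Toolkit installed, each argument list could be passed to `subprocess.run` to perform the actual conversion.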
NGS QC Toolkit is an application for quality checking and for filtering high-quality data. It is a standalone, open-source application, freely available at http://www.nipgr.res.in/ngsqctoolkit.html. The toolkit comprises user-friendly tools for QC of sequencing data generated on Roche 454 and Illumina platforms, together with additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools). A variety of options are provided so that QC can be run with user-defined parameters. The toolkit is expected to be very useful for the QC of NGS data to facilitate better downstream analysis (Patel RK, et al).
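The kind of read filtering described here can be illustrated with a short Python sketch. It is not the NGS QC Toolkit itself: it keeps a read when a chosen fraction of its bases reaches a chosen Phred score. The thresholds (Phred 20, 70% of bases), the Phred+33 encoding and all function names are assumptions for illustration, since the source does not state the parameters that were used.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    lines = [line for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, _plus, quality = lines[i:i + 4]
        yield header[1:], sequence, quality


def is_high_quality(quality, min_phred=20, min_fraction=0.70, offset=33):
    """True if at least `min_fraction` of bases have Phred score >= `min_phred`.

    The default thresholds and the Phred+33 offset are assumptions of this sketch.
    """
    if not quality:
        return False
    good = sum(1 for char in quality if ord(char) - offset >= min_phred)
    return good / len(quality) >= min_fraction


def filter_reads(text, **thresholds):
    """Return (read_id, sequence) for every read that passes the quality check."""
    return [
        (read_id, sequence)
        for read_id, sequence, quality in parse_fastq(text)
        if is_high_quality(quality, **thresholds)
    ]


# Two made-up reads: the first has 8 of 10 high-scoring bases, the second 3 of 10.
EXAMPLE_FASTQ = (
    "@read1\nACGTACGTAC\n+\nIIIIIIII##\n"
    "@read2\nACGTACGTAC\n+\nIII#######\n"
)

if __name__ == "__main__":
    print(filter_reads(EXAMPLE_FASTQ))
```

With the assumed thresholds only `read1` is retained; raising `min_fraction` above 0.8 would reject both reads.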
CLC Genomics Workbench is a comprehensive and user-friendly package for analyzing, comparing, and visualizing next-generation sequencing data. It was used for de novo sequence assembly with the default parameters of its de novo assembly tool (http://www.clcbio.com/products/clc-genomics-workbench/).
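The assemblies in this study are compared by N50, among other measures. As a reminder of what that statistic means, the following Python sketch computes N50 from a list of contig lengths: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembly length. The function name `n50` and the example lengths are illustrative and are not taken from the paper's data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half of the total.
    """
    ordered = sorted(contig_lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Made-up lengths: total 54, half 27, reached after 10 + 9 + 8.
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends on the total assembled length, it is best read alongside the number of contigs and the total bases when comparing species.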
The assembled contigs were then annotated. The first step was to identify translated protein sequences from the contigs. BLASTX at NCBI (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?PROGRAM=blastx&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) was run with a few parameters changed: the non-redundant protein database (nr) was selected as the database, Eudicots was selected in the organism option, and in the algorithm parameters Max target sequences was set to 10 and Expect threshold was set to 6.
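The effect of the two algorithm parameters can be sketched in Python on a table of hits. The sketch assumes the hits are available in the 12-column tabular layout (query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), as produced for example by BLAST+ `-outfmt 6`; the study itself used the NCBI web interface. The function name `top_hits` and the example rows are hypothetical, and the e-value cutoff is passed explicitly because the text gives the expect threshold only as "6".

```python
from collections import defaultdict


def top_hits(tabular_text, max_evalue, max_targets=10):
    """Keep at most `max_targets` hits per query with e-value <= `max_evalue`.

    Hits are ranked by ascending e-value, then by descending bit score.
    """
    per_query = defaultdict(list)
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            per_query[query].append((evalue, -bitscore, subject))
    return {
        query: [subject for _, _, subject in sorted(hits)[:max_targets]]
        for query, hits in per_query.items()
    }


# Made-up hit table for two contigs (not data from the study).
EXAMPLE_HITS = "\n".join([
    "contig1\tprotA\t90.0\t100\t10\t0\t1\t300\t1\t100\t1e-50\t200",
    "contig1\tprotB\t60.0\t100\t40\t0\t1\t300\t1\t100\t1e-10\t80",
    "contig1\tprotC\t30.0\t50\t35\t0\t1\t150\t1\t50\t0.5\t25",
    "contig2\tprotD\t85.0\t120\t18\t0\t1\t360\t1\t120\t1e-40\t180",
])

if __name__ == "__main__":
    # The cutoff 1e-6 is an illustrative value chosen for this example.
    print(top_hits(EXAMPLE_HITS, max_evalue=1e-6))
```

Contigs that keep at least one hit after this filtering are the ones that can be carried forward to GO and pathway annotation.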
Blast2GO is an all-in-one tool for functional annotation of (novel) sequences and the analysis of annotation data (http://www.blast2go.com/b2ghome). Based on the results of the protein database annotation, Blast2GO was employed to obtain the functional classification of the unigenes based on GO terms. The transcript contigs were classified under the three GO categories: molecular function, cellular component and biological process (Ness et al., 2011; Shi et al., 2011; Wang et al., 2010). The WEGO tool (http://www.wego.genomics.org.cn) was used to perform the GO functional classification for all of the unigenes and to understand the distribution of gene functions in each species at the macro level. The KEGG database (http://www.genome.jp/kegg/pathway.html) was used to annotate the pathways of these unigenes.
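A WEGO-style summary amounts to counting, for each of the three GO categories, how many unigenes carry each term. The Python sketch below shows that tally on a handful of made-up annotations; the input layout `(contig_id, category, term)`, the function name `count_by_category` and the example records are assumptions for illustration and do not reproduce Blast2GO or WEGO output formats.

```python
from collections import Counter

GO_CATEGORIES = ("biological_process", "cellular_component", "molecular_function")


def count_by_category(annotations):
    """Count distinct contigs per GO term within each of the three categories.

    `annotations` is an iterable of (contig_id, category, term_name) tuples.
    Returns {category: Counter({term_name: number_of_contigs})}.
    """
    counts = {category: Counter() for category in GO_CATEGORIES}
    seen = set()
    for contig_id, category, term in annotations:
        if category not in counts:
            raise ValueError(f"unknown GO category: {category}")
        key = (contig_id, category, term)
        if key in seen:
            continue  # the same term assigned twice to one contig counts once
        seen.add(key)
        counts[category][term] += 1
    return counts


# Hypothetical annotations, not results from the study.
EXAMPLE_ANNOTATIONS = [
    ("c1", "molecular_function", "binding"),
    ("c2", "molecular_function", "binding"),
    ("c2", "molecular_function", "catalytic activity"),
    ("c1", "biological_process", "metabolic process"),
    ("c1", "molecular_function", "binding"),  # duplicate, ignored
    ("c3", "cellular_component", "cell part"),
]

if __name__ == "__main__":
    for category, terms in count_by_category(EXAMPLE_ANNOTATIONS).items():
        print(category, terms.most_common())
```

Calling `most_common(1)` on each category's counter gives the most represented term, which is the kind of statement made about the WEGO figure in the results.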
We employed MIcroSAtellite (MISA) (http://pgrc.ipk-gatersleben.de/misa/) for microsatellite mining; it produces various statistical outputs describing the microsatellites found in the transcripts.
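The idea behind microsatellite mining can be shown with a small regular-expression search in Python. This is a simplified sketch and not MISA: it reports perfect repeats only and ignores compound SSRs. The minimum repeat counts per unit size (10 for mono-, 6 for di-, 5 for tri- to hexanucleotide units) are assumptions, because the text does not state the thresholds that were used, and the function names are hypothetical.

```python
import re

# Assumed thresholds: repeat-unit size -> minimum number of repeats.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_repeat_of_shorter_unit(motif):
    """True if the motif is itself built from a shorter unit (e.g. 'AA', 'ATAT')."""
    size = len(motif)
    return any(
        size % k == 0 and motif == motif[:k] * (size // k) for k in range(1, size)
    )


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start_position, motif, repeat_count) for perfect SSRs.

    Positions are 1-based. Motifs that are repeats of a shorter unit are
    skipped so that, for example, a run of A is not also reported as 'AA'.
    """
    sequence = sequence.upper()
    hits = []
    for unit_size, minimum in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_size, minimum - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_repeat_of_shorter_unit(motif):
                continue
            hits.append((match.start() + 1, motif, len(match.group(0)) // unit_size))
    return sorted(hits)


# Made-up sequence with one mono-, one di- and one trinucleotide repeat.
EXAMPLE_SEQUENCE = "GG" + "A" * 12 + "C" + "AG" * 7 + "G" + "CTT" * 5 + "GCGC"

if __name__ == "__main__":
    for start, motif, count in find_ssrs(EXAMPLE_SEQUENCE):
        print(start, motif, count)
```

Counting the returned tuples by motif length gives the kind of per-class SSR statistics that are tabulated for each species.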
PlantTFcat, an online plant transcription factor and transcriptional regulator categorization and analysis tool, was used to identify plant transcription factors in the sequences (http://plantgrn.noble.org/PlantTFcat/).
Table 1 Species comparison based on sequence
Table 2 NGS QC Toolkit Result
2.3 De novo Sequence Assembly
Table 3 Contig measurement in Length
2.4 Functional annotation with BLASTX and Blast2GO
2.4.3 Gene Ontology (GO) Classification
Table 6 Gene Ontology (GO) Classification
Figure 1, the output of the WEGO tool, shows that within the Molecular Function category, genes encoding binding proteins and proteins related to catalytic activity were the most enriched. Proteins related to metabolic processes and cellular processes were enriched in the Biological Process category. In the Cellular Component category, cell and cell part were the most highly represented. The same pattern was found in all the other legume species, so only this one figure is shown to illustrate the WEGO output.
Figure 1 WEGO Tool Result of Arachis hypogaea L.
Many genes were annotated with different pathways in the KEGG database (http://www.genome.jp/kegg/pathway.html). The comparative results are shown in Table 7. The transcripts mapped to various pathways, including metabolic pathways, plant-pathogen interaction pathways, the fatty acid metabolism pathway and fatty acid biosynthesis.
Table 7 KEGG Result
2.5 SSR mining
Table 8 Statistics of SSRs identified in transcripts
2.6 Plant Transcription Factor
Table 9 Plant Transcription Factor Result
Figure 2 Plant Transcription Factor Result of Trigonella foenum-graecum L.
3 Conclusion
http://dx.doi.org/10.1142/9781848163324_0001
http://dx.doi.org/10.1186/1471-2164-13-90
http://dx.doi.org/10.1104/pp.109.144105
http://dx.doi.org/10.1038/nmeth.1226
http://dx.doi.org/10.1186/1471-2164-12-298
http://dx.doi.org/10.1371/journal.pone.0030619
http://dx.doi.org/10.1093/dnares/dsq028
http://dx.doi.org/10.1186/1471-2164-12-131
http://dx.doi.org/10.3835/plantgenome2012.08.0021
http://dx.doi.org/10.1186/1471-2164-11-400
http://dx.doi.org/10.1038/nrg2484
Authors: Sagar S. Patel, Dipti B. Shah, Hetalkumar J. Panchal