Sequence Tag Alignment and Consensus Knowledgebase Database

STACKdb is a knowledgebase generated by processing EST and mRNA sequences obtained from GenBank through a pipeline of masking, clustering, alignment and variation analysis steps. The STACK project aims to generate a comprehensive representation of the sequence of each expressed gene in the human genome by extensively processing gene fragments to make accurate alignments, highlight diversity and provide a carefully joined set of consensus sequences for each gene. The project comprises the STACKdb human gene index, a database of virtual human transcripts, and stackPACK, the tools used to create the database. STACKdb is organized into 15 tissue-based categories and one disease category. STACK is a tool for detecting and visualizing expressed transcript variation in the context of developmental and pathological states. The data system organizes and reconstructs human transcripts from available public data in the context of expression state, which can include developmental state, pathological association, site of expression and isoform of the expressed transcript. STACK consensus transcripts are reconstructed from clusters that capture and reflect the growing evidence of transcript diversity. Transcript variants are captured comprehensively by a novel clustering approach that tolerates sub-sequence diversity and does not rely on pairwise alignment, in contrast with other gene indexing projects. STACK is generated at least four times a year and represents the exhaustive processing of all publicly available human EST data extracted from GenBank.
This processed information can be explored through 15 tissue-specific categories, a disease-related category and a whole-body index.

The stackPACK transcript reconstruction and variation analysis system allows rapid and accurate processing of EST and mRNA data through a pipeline of masking, loose clustering, assembly and alignment, analysis of the alignments for transcript variation, and linking of non-overlapping clusters by clone ID. The system is distinguished by its visualization tools and by efficient data management using a relational database. stackPACK can be accessed either through the command line or through a web-based interface.

The STACK_PACK clustering system has been applied to dbEST release 121598. 64% of 1,313,103 Homo sapiens ESTs are condensed into 143,885 tissue-level multiple sequence clusters; linking through clone-ID annotations produces 68,701 total assemblies, such that 81% of the original input set is captured in a STACK multiple sequence or linked cluster. Indexing of alignments by substituent EST accession allows browsing of the data structure and its cross-links to UniGene. STACK meta-clusters consolidate a greater number of ESTs, by a factor of 1.86, than the corresponding UniGene build. Fidelity comparison with genome reference sequence AC004106 demonstrates consensus expression clusters that reflect significantly lower spurious repeat sequence content and capture alternate splicing within a whole-body index cluster and three STACK v2.3 tissue-level clusters.

Sponsors: This work was originally funded under U.S. Department of Energy grant DE-FC03-95ER62062 (W.A.H.) and S.A. Foundation for Research grant GUN 2039524 (W.A.H.).
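
The clone-ID linking step named in the description (joining non-overlapping clusters whose ESTs were read from the same clone) can be illustrated with a small union-find sketch. This is not stackPACK code: the function name `link_clusters_by_clone`, the dictionary layout, and the EST, clone and cluster identifiers are all hypothetical, and the sketch assumes only that each EST may carry a clone-ID annotation.

```python
from collections import defaultdict


def link_clusters_by_clone(cluster_members, est_clone):
    """Group clusters that share at least one clone ID (illustrative only).

    cluster_members: dict mapping cluster ID -> iterable of EST accessions
    est_clone: dict mapping EST accession -> clone ID; ESTs without an
               annotated clone are simply absent from this dict
    Returns a sorted list of sorted lists of cluster IDs, one per linked group.
    """
    parent = {cid: cid for cid in cluster_members}

    def find(cid):
        # Walk up to the root, shortening the path along the way.
        while parent[cid] != cid:
            parent[cid] = parent[parent[cid]]
            cid = parent[cid]
        return cid

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Remember the first cluster seen for each clone; later clusters that
    # contain an EST from the same clone are merged into that group.
    first_cluster_for_clone = {}
    for cid, ests in cluster_members.items():
        for est in ests:
            clone = est_clone.get(est)
            if clone is None:
                continue
            if clone in first_cluster_for_clone:
                union(first_cluster_for_clone[clone], cid)
            else:
                first_cluster_for_clone[clone] = cid

    groups = defaultdict(list)
    for cid in cluster_members:
        groups[find(cid)].append(cid)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy input: c1 and c2 share clone_1, c3 and c4 share clone_3.
clusters = {
    "c1": ["EST_A", "EST_B"],
    "c2": ["EST_C"],
    "c3": ["EST_D", "EST_E"],
    "c4": ["EST_F"],
}
clones = {
    "EST_A": "clone_1",
    "EST_C": "clone_1",
    "EST_D": "clone_2",
    "EST_E": "clone_3",
    "EST_F": "clone_3",
}
linked = link_clusters_by_clone(clusters, clones)
print(linked)
```

In this toy input, four sequence-level clusters collapse into two linked groups, which mirrors in miniature how clone-ID annotations reduce the number of assemblies without requiring the clusters to overlap in sequence.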

URL: http://ww2.sanbi.ac.za/Dbases.html

Resource ID: nif-0000-20946     Resource Type: Resource     Version: Latest Version


database, software resource, data visualization software

