

Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure.

Gough J, Karplus K, Hughey R, Chothia C
Journal of molecular biology


Of the sequence comparison methods, profile-based methods perform with greater selectivity than those that use pairwise comparisons. Of the profile methods, hidden Markov models (HMMs) are apparently the best. The first part of this paper describes calculations that (i) improve the performance of HMMs and (ii) determine a good procedure for creating HMMs for sequences of proteins of known structure. For a family of related proteins, more homologues are detected using multiple models built from diverse single seed sequences than from one model built from a good alignment of those sequences. A new procedure is described for detecting and correcting errors that arise at the model-building stage. These two improvements greatly increase selectivity and coverage. The second part of the paper describes the construction of a library of HMMs, called SUPERFAMILY, that represents essentially all proteins of known structure. The sequences of the domains in proteins of known structure that have identities of less than 95% are used as seeds to build the models. Using the current data, this gives a library with 4894 models. The third part of the paper describes the use of the SUPERFAMILY model library to annotate the sequences of over 50 genomes. The models match twice as many target sequences as are matched by pairwise sequence comparison methods. For each genome, close to half of the sequences are matched in all or in part and, overall, the matches cover 35% of eukaryotic genomes and 45% of bacterial genomes. On average, roughly 15% of genome sequences are labelled as being hypothetical yet homologous to proteins of known structure. The annotations derived from these matches are available from a public web server at: http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY. This server also enables users to match their own sequences against the SUPERFAMILY model library.
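The abstract states that domain sequences with identities below 95% are used as seeds, one model per seed. The sketch below illustrates one way such a seed set could be chosen. It is not the authors' procedure: the greedy selection order, the function names `percent_identity` and `select_seeds`, and the assumption that sequences are already aligned to equal length (with `-` for gaps) are all hypothetical choices made for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.

    Assumption: the inputs are already aligned; '-' marks a gap.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Only count columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def select_seeds(domains: dict, threshold: float = 95.0) -> list:
    """Greedily keep domains that are below `threshold` percent identity
    to every seed already kept (hypothetical selection strategy)."""
    seeds = []
    for name, seq in domains.items():
        if all(percent_identity(seq, domains[s]) < threshold for s in seeds):
            seeds.append(name)
    return seeds


# Toy data, invented for illustration only.
domains = {
    "d1": "ACDEFGHIKLMNPQRSTVWY",
    "d2": "ACDEFGHIKLMNPQRSTVWA",  # 19/20 identical to d1 (95%), so dropped
    "d3": "GCDEFGHIKLMNPQRSTVWA",  # 18/20 identical to d1 (90%), so kept
}
seeds = select_seeds(domains)
```

Each name in `seeds` would then serve as the single seed sequence for its own model, in line with the abstract's finding that multiple models built from diverse single seeds detect more homologues than one model built from an alignment.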
