ExPASy logo ExPASy Home page Site Map Search ExPASy Contact us HAMAP Swiss-Prot

What is HAMAP?


[General] [HAMAP family content] [Technical aspects] [Accessing HAMAP] [FAQ]

General

HAMAP stands for High-quality Automated and Manual Annotation of microbial Proteomes.

Bacterial and archaeal genomes are now sequenced so rapidly that it is no longer possible to manually annotate even a small portion of them, despite the considerable demand for corrected and annotated complete proteome sets. To increase their representation in UniProtKB/Swiss-Prot we started HAMAP, whose goal is to semi-automatically annotate a significant percentage of the proteins originating from bacterial and archaeal genome sequencing projects without decreasing the annotation quality of UniProtKB/Swiss-Prot. HAMAP is also used to annotate proteins encoded by plastid genomes (i.e. chloroplasts, cyanelles, apicoplasts, non-photosynthetic plastids), and could be extended to mitochondrial genomes.

Our automatic annotation methods, which use a rule-based system, are applied only where they can produce the same quality as manual annotation, that is, for proteins that belong to well-defined families or subfamilies. By this we mean protein families that have a well-defined function and are well conserved at the sequence level. We also create families for uncharacterized protein families (UPFs), i.e. conserved protein families of unknown function.

To create each family, the available literature is consulted and all proteins for which there is experimental characterization are manually annotated. These proteins are called "templates". Decisions are then made regarding what annotation can be safely propagated to orthologs. The use of "cases" (for example: restricting the propagation of an annotation to a taxonomic group, or making it dependent on the detection of a certain conserved active-site amino acid residue; see examples below) limits how far annotation is propagated when further characterization is lacking and it is not safe to assume that the same function, subunit, cofactor, etc. apply to all members of a protein family.
The criteria for assigning initial membership to a family are sequence similarity and what is known in the literature about the protein in question. The "seed members" are manually chosen and aligned. This "seed alignment" is used to automatically generate a profile that detects possible members by scanning the UniProt Knowledgebase (Swiss-Prot and TrEMBL). For more details see "Automatic annotation of microbial proteomes in Swiss-Prot", Comput. Biol. Chem. 27:49-58 (2003).
We use a somewhat different approach for the annotation of ABC transporters, which are large and complex paralogous families. For these, stringent profiles are required to distinguish between functional subfamilies regarding the transported substrate. For ABC transporters, manually built PROSITE profiles are used to assign family membership and there are no seed alignments.

The HAMAP automatic pipeline is then used to annotate additional members, in the following way: protein sequences from the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases are scanned against the HAMAP profile collection on a daily basis. True matches are annotated using the corresponding family rule. Many checks are performed in order to prevent the propagation of wrong annotation and to spot problematic cases, which are channeled to manual curation. The results of this annotation are integrated in UniProtKB/Swiss-Prot.
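
The interplay between profile matches, family rules and "cases" can be illustrated with a minimal Python sketch. It is not the HAMAP implementation: the class names, the two score cutoffs, the outcome labels and the family identifier "MF_EXAMPLE" are all assumptions made for illustration, and the residue check uses a position in the matched sequence directly rather than a position mapped through an alignment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Case:
    """Hypothetical condition guarding the propagation of one annotation item."""
    annotation: str
    taxon: Optional[str] = None                # lineage name that must be present
    residue: Optional[Tuple[int, str]] = None  # (1-based position, amino acid)

    def holds(self, lineage: List[str], sequence: str) -> bool:
        if self.taxon is not None and self.taxon not in lineage:
            return False
        if self.residue is not None:
            position, amino_acid = self.residue
            if position < 1 or position > len(sequence):
                return False
            if sequence[position - 1] != amino_acid:
                return False
        return True


@dataclass
class FamilyRule:
    """Hypothetical family rule; both cutoffs are assumptions of this sketch."""
    family_id: str
    trusted_cutoff: float   # scores at or above this count as true matches
    review_cutoff: float    # scores between the cutoffs go to a curator
    annotations: List[str] = field(default_factory=list)  # always propagated
    cases: List[Case] = field(default_factory=list)       # conditionally propagated


def apply_rule(rule: FamilyRule, score: float, lineage: List[str], sequence: str):
    """Return (outcome, propagated annotations) for one profile hit."""
    if score < rule.review_cutoff:
        return "no_match", []
    if score < rule.trusted_cutoff:
        # Borderline hits are not annotated automatically.
        return "manual_curation", []
    propagated = list(rule.annotations)
    propagated += [c.annotation for c in rule.cases if c.holds(lineage, sequence)]
    return "annotated", propagated
```

In this sketch a true match always receives the unconditional annotations, while each case adds its annotation only when its taxonomic or residue condition is met.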

A relational database that supports incremental updates has been developed to store family rules, profiles, sequences and hits.

HAMAP family content

An example: MF_00497.

The view of each family rule contains:
A user manual (UniRule User Manual for the Web View) is available upon "mousing-over" the headers of each section.

Technical aspects

HAMAP and how it is updated

See the HAMAP families page for up-to-date statistics about the number and taxonomic coverage of HAMAP families. From this page, one can browse and perform searches in HAMAP families, or scan user-entered sequences. See the HAMAP status report page for up-to-date statistics about the number of complete microbial proteomes currently available. The HAMAP release is concurrent with every release of UniProtKB/Swiss-Prot, which takes place every 3 weeks. New families are added in each release, and existing families are periodically updated.

How to tell if a UniProtKB/Swiss-Prot entry has been automatically annotated

Every UniProtKB/Swiss-Prot entry incorporates annotation extracted from a variety of information sources, and it is not currently possible to mark the origin(s) of each annotated item in the database. The objective of the database lies more in providing a homogeneous view of the data. Automatically annotated entries present these general features: The extent of the annotation that is propagated automatically can be found in each family rule.

Cross-references from UniProtKB/Swiss-Prot to HAMAP

Cross-references are present in all UniProtKB/Swiss-Prot entries that are members of a HAMAP family (or several). These cross-references are found in the Cross-references/Family and domain databases section of the entry, and have the following format:
HAMAP; family-identifier; status; count.

The fields are:
Family Identifier: HAMAP unique identifier for a microbial protein family
Status: One of '-', 'fused', 'atypical' or 'atypical/fused'. The value '-' is a placeholder for an empty field; 'fused' indicates that the family rule does not cover the entire protein; 'atypical' indicates that the protein is divergent in sequence or has mutated functional sites, and should not be included in family datasets; 'atypical/fused' indicates that both of these conditions apply.
Count: Number of domains found in the protein, generally '1', occasionally '2' for the fusion of 2 identical domains.

Example: HAMAP; MF_00012; -; 1.
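
As an illustration of this format, the following Python sketch parses a cross-reference written exactly as in the example above. The function and class names are hypothetical, and the sketch assumes the line contains only the four semicolon-separated fields followed by a final period.

```python
from dataclasses import dataclass

# The four status values listed in the field description.
VALID_STATUSES = {"-", "fused", "atypical", "atypical/fused"}


@dataclass(frozen=True)
class HamapXref:
    family_id: str
    status: str
    count: int

    @property
    def fused(self) -> bool:
        # True for 'fused' and 'atypical/fused'.
        return "fused" in self.status

    @property
    def atypical(self) -> bool:
        # True for 'atypical' and 'atypical/fused'.
        return "atypical" in self.status


def parse_hamap_xref(line: str) -> HamapXref:
    """Parse 'HAMAP; family-identifier; status; count.' into a HamapXref."""
    fields = [part.strip() for part in line.strip().rstrip(".").split(";")]
    if len(fields) != 4 or fields[0] != "HAMAP":
        raise ValueError(f"not a HAMAP cross-reference: {line!r}")
    _, family_id, status, count = fields
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return HamapXref(family_id, status, int(count))
```

For the example line, parse_hamap_xref("HAMAP; MF_00012; -; 1.") yields the family identifier MF_00012, an empty status and a count of 1.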

Feature propagation

Protein features (and associated comments and keywords) are propagated automatically using two different approaches.

Propagated features
are transferred on the basis of their conservation throughout the family. The alignment(s) of representative entries present in the family rule are used to transfer features from the family rule to new members, provided that the conserved residues indicated in the family are observed.

Computed features
are predicted using the following ad hoc methods:
Computed feature                       Method used
Inteins (protein splicing)             PROSITE profiles (PDOC00687)
Signal sequence type 1                 SignalP (Nielsen et al., 1997)
Signal type 2 (lipoprotein)            PROSITE profile (PS51257)
Signal type 4 (pilin)                  PROSITE pattern (PS00409)
Transmembrane regions                  TMHMM (Krogh et al., 2001)
Coiled coils                           Modified COILS (Lupas et al., 1991)
ATP/GTP binding sites                  Walker A profile (not yet done)
LPXTG cell-wall anchor                 PROSITE profile (PS50847)
Repeats: ANK, Kelch, LRR, TPR, WD      REP (Andrade et al., 2000)
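
The alignment-based transfer described under "Propagated features" can be sketched in Python as follows. This is a simplified illustration, not the HAMAP code: it assumes a single pairwise alignment between one template and one new member, written as two equal-length strings with '-' for gaps, and the function name and parameters are hypothetical.

```python
from typing import Optional, Set


def transfer_feature(template_aln: str, target_aln: str,
                     template_pos: int, expected_residues: Set[str]) -> Optional[int]:
    """Map a 1-based template position onto the target sequence.

    Returns the 1-based position in the ungapped target, or None when the
    target has a gap at that column or does not carry an expected residue.
    """
    if len(template_aln) != len(target_aln):
        raise ValueError("aligned sequences must have the same length")
    template_index = 0  # residues of the template seen so far
    target_index = 0    # residues of the target seen so far
    for template_char, target_char in zip(template_aln, target_aln):
        if template_char != "-":
            template_index += 1
        if target_char != "-":
            target_index += 1
        if template_char != "-" and template_index == template_pos:
            if target_char == "-" or target_char not in expected_residues:
                return None  # residue not conserved: do not propagate
            return target_index
    return None  # template_pos lies outside the template sequence
```

A feature is propagated only when the function returns a position; a gap or a non-conserved residue at the aligned column blocks the transfer.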

HAMAP profiles

The HAMAP profiles that are used to identify potential new members are generated using an automatic procedure based on the method used to generate PROSITE profiles (see Sigrist et al., Brief. Bioinform. 3(3):265-274 (2002)).

Accessing HAMAP

The most efficient and user-friendly way to access HAMAP data is to browse interactively on one of the mirror sites of the ExPASy server, at http://www.expasy.org/sprot/hamap/.

Downloading HAMAP data

Linking to HAMAP

See How to create HTML links to services on ExPASy to find out how to create links to HAMAP web pages.

Frequently asked questions (FAQ)

Will HAMAP be extended to eukaryotes?

HAMAP is already used to annotate the genomes of plastids. The HAMAP annotation procedure relies on the very high quality of gene prediction in genome sequences. While this assumption holds for most submitted prokaryotic genomes, the complex structure of eukaryotic genes makes high-quality automatic annotation more difficult.

What is the coverage of HAMAP in a genome?

Since family rules have been built with a bias toward well-studied phyla and housekeeping genes, the coverage is dependent on the organism type and the genome size. HAMAP families cover 58% of the genome in Buchnera aphidicola (subsp. Acyrthosiphon pisum), 21% in Escherichia coli K12, and only 6% in Streptomyces coelicolor.

Is it possible to annotate all the proteins of a new complete genome using HAMAP?

In certain new genomes, it is possible to annotate just over half of the proteins automatically with the current set of families. This coverage is constantly expanded with the addition of new families. However, the current approach is intrinsically limited to 'well-behaved' orthologous families, and new methods are being developed for the annotation of complex protein families.

Why and when do we merge genomes?

Historically, when a new genome arrived we merged it with all preexisting entries of that bacterium or archaeon, regardless of the strain of the new genome. As more and more strains of the same organism were sequenced, it became evident that this is no longer appropriate or desirable. We now usually assign a new species code to each new strain of a particular organism, and then merge the new entries with any entries of the same strain that already exist in UniProtKB/Swiss-Prot. At present we merge complete microbial genomes only when they are from the same strain.

Hosted by SIB Switzerland. Mirror sites: Australia, Brazil, Canada, China, Korea