Description

Overview

    Prokaryotic mobile genetic elements (MGEs) are central players in mobilizing genes, whether within a given genome (intracellular mobility) or between bacterial cells (intercellular mobility). Traditionally, MGEs have been classified as bacteriophages, plasmids or transposons. This classification is becoming increasingly obsolete as more chimeric elements are identified that display strong similarities with elements of different families.

    The ACLAME database is dedicated to the collection and classification of MGEs from various sources (phage genomes, plasmids, transposons and other genomic islands). All publicly available data related to MGEs are collected, organized, corrected where necessary and stored in the ACLAME database. This information is then made accessible to the scientific community through the ACLAME web interface.

    From the collected MGE data, we aim to build a comprehensive classification of the functional modules of MGEs at the protein, gene and higher levels. The classification is generated semi-automatically. At present, proteins annotated on complete phage and plasmid DNA sequences are automatically classified into families using the graph-theory-based Markov clustering algorithm MCL (van Dongen, 2000) (see below). An ACLAME family is defined as a set of similar sequences sharing one or more functions. Families of phage proteins, plasmid proteins and phage+plasmid proteins have been built and are also accessible through the ACLAME web interface.

    Manual functional annotation of the protein families is ongoing and relies on information available in public databases. This information is collected by similarity searches using Blast, PSI-Blast (Altschul, 1997) and Hidden Markov Models (see below). The annotation process is open to expert volunteers willing to participate in the curation of the database.

    The function annotations rely on multiple sources of function definitions. The Gene Ontology (GO) is used whenever its terms are suitable for the annotation. However, GO is at present more focused on functions found in eukaryotic organisms. We are therefore developing in our lab an ontology dedicated to mobile genetic elements: MeGO. New terms required for the annotation of MGEs in ACLAME are regularly added to MeGO. Whenever required terms are not suitable for MeGO, they are added to a dedicated section of the ACLAME database for immediate availability. On a regular basis, terms in MeGO and ACLAME will be submitted to the Gene Ontology.

    The next step in the ACLAME development will be the definition of functional modules based on the ACLAME protein families.

Protein families

    As described in the schema below, all the proteins encoded by a group of MGEs (phages, plasmids, ...) are compared with each other using the Ssearch program. Blastp was previously used for this step but was replaced in order to produce higher-quality alignments and therefore better-defined families. The resulting list of sequence pairs is transformed into a scoring matrix with the E-value as distance. The matrix is given as input to the MCL algorithm to produce a list of clustered proteins.

Classification diagram
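
    The clustering step can be illustrated with a toy sketch of Markov clustering on a list of sequence pairs. This is not the MCL program used by ACLAME. It is a minimal pure-Python version of the expansion/inflation loop, and the conversion of E-values into edge weights (a capped -log10), the self-loop weights and the function names are all assumptions made for illustration.

```python
import math

def evalue_to_weight(evalue, cap=200.0):
    """Assumed transform: smaller E-value -> larger edge weight (-log10, capped)."""
    if evalue <= 0.0:
        return cap
    return max(0.0, min(cap, -math.log10(evalue)))

def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                matrix[i][j] /= total
    return matrix

def mcl(pairs, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster proteins from (protein_a, protein_b, evalue) triples.

    Returns a sorted list of clusters, each a sorted list of protein names.
    """
    pairs = list(pairs)
    names = sorted({p for a, b, _ in pairs for p in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    n = len(names)

    # Symmetric weight matrix built from the pairwise comparisons.
    m = [[0.0] * n for _ in range(n)]
    for a, b, evalue in pairs:
        w = evalue_to_weight(evalue)
        i, j = index[a], index[b]
        m[i][j] = max(m[i][j], w)
        m[j][i] = max(m[j][i], w)

    # Self-loops (assumed weight: the strongest edge of the node, at least 1).
    for i in range(n):
        m[i][i] = max(max(m[i]), 1.0)
    normalize_columns(m)

    for _ in range(max_iter):
        # Expansion: one step of random-walk flow (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: element-wise power, then renormalise the columns.
        inflated = normalize_columns([[x ** inflation for x in row]
                                      for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break

    # Every row that still carries flow lists the members of one cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(names[j] for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

if __name__ == "__main__":
    toy_pairs = [("A", "B", 1e-50), ("A", "C", 1e-50), ("B", "C", 1e-50),
                 ("D", "E", 1e-40)]
    print(mcl(toy_pairs))
```

    A higher inflation value sharpens the flow more strongly and tends to give smaller clusters. Dense lists of lists are only practical for toy inputs; a real run would rely on the sparse-matrix MCL implementation.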

    To generate clusters of proteins that can be considered as families, two parameters were optimized against the SCOP classification (Andreeva, 2004). We first ran the procedure using as input the sequences of protein domains with at most 90% sequence identity (pdb90d) provided by ASTRAL (Chandonia, 2002). Different E-value thresholds and different inflation values (an MCL parameter influencing cluster granularity) were tested, and the combination giving the clusters closest to the SCOP Family-level classification was retained.
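
    The parameter selection can be sketched as a small grid search. The text does not say how closeness to the SCOP families was measured, so the pair-counting Jaccard index used here is an assumption, as are the function names and the toy domain identifiers.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """Return the set of unordered item pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def pair_jaccard(predicted, reference):
    """Jaccard index between the co-clustered pairs of two clusterings."""
    p = co_clustered_pairs(predicted)
    r = co_clustered_pairs(reference)
    union = p | r
    if not union:
        return 1.0  # both clusterings contain only singletons
    return len(p & r) / len(union)

def best_parameters(runs, reference):
    """Pick the parameter combination whose clusters best match the reference.

    runs maps (evalue_threshold, inflation) to the clusters obtained
    with those settings.
    """
    return max(runs, key=lambda params: pair_jaccard(runs[params], reference))

if __name__ == "__main__":
    # Hypothetical reference families and clustering results.
    reference = [["d1", "d2", "d3"], ["d4", "d5"]]
    runs = {
        (1e-5, 1.2): [["d1", "d2", "d3", "d4", "d5"]],
        (1e-5, 2.0): [["d1", "d2", "d3"], ["d4", "d5"]],
        (1e-5, 5.0): [["d1", "d2"], ["d3"], ["d4"], ["d5"]],
    }
    print(best_parameters(runs, reference))
```

    Pair counting penalises both over-merged clusters (low inflation) and over-split clusters (high inflation), which is the trade-off the two tuned parameters control.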

    To check that there was no bias toward the SCOP classification, we applied the same procedure with the optimized parameters to proteins encoded by IS elements, kindly provided by M. Chandler (http://www-is.biotoul.fr). This test showed that we could reproduce the established IS families and even identify some inconsistencies that were known to exist in these families.

    This set of parameters is used in our automated rounds of protein classification. Results can be accessed for phage proteins, plasmid proteins and phage+plasmid proteins on the ACLAME web site. Each family has an identifier of the form family:<category>:<integer>. The <category> identifies the type of MGE the family refers to. The <integer> is the family index, starting from 1 since ACLAME version 0.3 (in previous versions the index started from 0). Entering a family identifier in the ACLAME search engine takes you directly to the page describing the family composition and annotation.
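
    As an illustration of the identifier format, the sketch below splits an identifier into its category and index. The category value in the example ("plasmid") is hypothetical, since the list of valid categories is not given here, and the parser itself is not part of ACLAME.

```python
import re
from typing import NamedTuple

class FamilyId(NamedTuple):
    category: str
    index: int

# family:<category>:<integer>, with no spaces or colons inside the category.
_FAMILY_ID = re.compile(r"^family:([^:\s]+):(\d+)$")

def parse_family_id(text, min_index=1):
    """Parse 'family:<category>:<integer>' into a FamilyId.

    min_index=1 matches ACLAME 0.3 and later; pass min_index=0 for
    identifiers from earlier versions, where indices started at 0.
    """
    match = _FAMILY_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a family identifier: {text!r}")
    index = int(match.group(2))
    if index < min_index:
        raise ValueError(f"family index {index} is below {min_index}")
    return FamilyId(match.group(1), index)

if __name__ == "__main__":
    # "plasmid" is a hypothetical category value used for illustration.
    print(parse_family_id("family:plasmid:12"))
```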

Modules definition

    Well-characterized features found in MGEs can be viewed as functional modules, independently of the 'generic identification' of the elements. Defining such modules would allow known MGEs to be reconstructed as combinations of modules, which accounts for their highly mosaic nature (Figure 1) and better reflects the functional roles and evolutionary history of the individual modules. A better understanding of the modularity of MGEs should help to build up a rational ontology for MGEs and a taxonomy for viruses of bacteria and archaea. To achieve this goal, we will regularly update ACLAME with information from newly sequenced MGE genomes and with the help of expert knowledge from the scientific community.

Classification diagram
Figure 1: classification diagram in which each circle represents a functional module. Relationships between MGEs and modules are shown by the coloured arrows.

    Module definition will be the next development step. It will be based on the analysis of protein families in order to define groups of proteins that are common or specific to MGEs.

Database searches

    The functional annotation process relies in part on sequence similarity information collected from various public sequence databases. Sequence similarity detection is performed through two approaches: a single-sequence approach and a sequence-family approach.

    The first approach consists in detecting sequence similarities in the NCBI-NRDB (Benson, 2003), SCOP (Andreeva, 2004) and SwissProt (Boeckmann, 2003) databases for each individual sequence. A first search is performed using Blastp (Altschul, 1997) with an E-value threshold of 10E-10. For proteins with Blastp hits, these are the results displayed by default. A second round of searches is made using PSI-Blast, in several steps:

  1. Construction of a version of the NCBI-NRDB with at most 90% sequence identity using the cd-hit program (Li, 2006). This database is called NRDB90.
  2. Construction of PSSMs for each sequence using 3 iterations against NRDB90.
  3. Use of the PSSMs for one iteration against the NCBI-NRDB, SCOP and SwissProt with an E-value threshold of 0.001.

    For proteins that lack Blastp results but have hits with this procedure, the PSI-Blast hits are reported on the ACLAME web site.
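
    The display rule for the single-sequence approach (Blastp hits first, PSI-Blast hits only as a fallback) can be sketched as follows. The hit representation, the function name and the use of "less than or equal" at the threshold are assumptions; the Blastp default simply takes the "10E-10" notation of the text literally.

```python
# Thresholds as written in the text; "10E-10" is read literally here.
BLASTP_MAX_EVALUE = 10e-10
PSIBLAST_MAX_EVALUE = 0.001

def hits_to_report(blastp_hits, psiblast_hits,
                   blastp_max=BLASTP_MAX_EVALUE,
                   psiblast_max=PSIBLAST_MAX_EVALUE):
    """Choose which hits to show for one protein.

    Each hit is a (subject_id, evalue) tuple. Returns (source, hits), where
    source is "blastp", "psi-blast" or None and hits are sorted by E-value.
    """
    kept = [hit for hit in blastp_hits if hit[1] <= blastp_max]
    if kept:
        return "blastp", sorted(kept, key=lambda hit: hit[1])
    # No significant Blastp hit: fall back to the PSI-Blast results.
    kept = [hit for hit in psiblast_hits if hit[1] <= psiblast_max]
    if kept:
        return "psi-blast", sorted(kept, key=lambda hit: hit[1])
    return None, []

if __name__ == "__main__":
    # Hypothetical subject identifiers and E-values.
    print(hits_to_report([("subjectA", 1e-30), ("subjectB", 1e-3)],
                         [("subjectC", 1e-5)]))
```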

    The second approach searches for sequences similar to a given ACLAME family. The procedure is:

  1. Build a multiple sequence alignment with the MUSCLE program (Edgar, 2004) for each ACLAME family.
  2. Use the multiple alignments as input to the hmmbuild and hmmcalibrate programs from the HMMer package (Eddy, 1998; see also http://hmmer.wustl.edu) to build Hidden Markov Models.
  3. Use the HMMs to search for homologs in the SCOP (Andreeva, 2004), NCBI-NRDB (Benson, 2003) and SwissProt (Boeckmann, 2003) sequence databases with the HMMer hmmsearch program.
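
    The three steps above can be chained as a sequence of command lines. The sketch below only builds the commands. It assumes the MUSCLE 3.x options -in and -out and the HMMER 2.x argument order (hmmbuild <hmmfile> <alignment>, hmmcalibrate <hmmfile>, hmmsearch <hmmfile> <database>), which should be checked against the installed versions. The file names and the function name are hypothetical.

```python
from pathlib import Path

def family_search_commands(family_fasta, database_fasta, workdir="."):
    """Return the command lines for one family, in execution order.

    Assumes MUSCLE 3.x and HMMER 2.x command-line conventions.
    """
    family_fasta = Path(family_fasta)
    workdir = Path(workdir)
    alignment = workdir / f"{family_fasta.stem}.afa"  # aligned FASTA output
    hmm = workdir / f"{family_fasta.stem}.hmm"
    return [
        # 1. Multiple sequence alignment of the family members.
        ["muscle", "-in", str(family_fasta), "-out", str(alignment)],
        # 2. Build the HMM from the alignment, then calibrate it.
        ["hmmbuild", str(hmm), str(alignment)],
        ["hmmcalibrate", str(hmm)],
        # 3. Search a sequence database with the calibrated HMM.
        ["hmmsearch", str(hmm), str(database_fasta)],
    ]

if __name__ == "__main__":
    # Hypothetical file names.
    for command in family_search_commands("family_1.fasta", "swissprot.fasta"):
        print(" ".join(command))
```

    Each command list can be passed to subprocess.run(command, check=True) once the programs are installed. hmmsearch writes its report to standard output, so a real run would capture or redirect it.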

References

  • Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Nucleic Acids Res, 25, 3389-3402.
  • Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J. and Wheeler, D.L. (2003) Nucleic Acids Res, 31, 23-27.
  • Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I. et al. (2003) Nucleic Acids Res, 31, 365-370.
  • Chandonia, J.M., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M. and Brenner, S.E. (2002) Nucleic Acids Res, 30, 260-263.
  • Eddy, S.R. (1998) Bioinformatics, 14, 755-763.
  • van Dongen, S. (2000) Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000.
  • Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992) Comput Appl Biosci, 8, 189-191.
  • Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J.P., Chothia, C. and Murzin, A.G. (2004) Nucleic Acids Res, 32, D226-D229.
  • Li, W. and Godzik, A. (2006) Bioinformatics, 22, 1658-1659.
  • Edgar, R.C. (2004) Nucleic Acids Res, 32, 1792-1797.