Sequence Alignment Help

The alignment files produced use the following nomenclature and numbering conventions. These conventions follow the recommendations published for human gene mutations, which were prepared by a nomenclature working group that considered how to name and store sequences for human allelic variants. The recommendations can be found in Human Mutation 11:1-3, 1998 (1).

  • Only alleles officially recognised by the WHO HLA Nomenclature Committee for Factors of the HLA System are included in the sequence alignments.
  • As recommended for all human gene mutations, a standard reference sequence should be used for all alignments. A complete list of reference sequences for each allele can be seen below.
  • The reference sequence will always be associated with the same (original) accession number, unless this sequence is shown to be in error.
  • All alleles are aligned to the reference sequences.
  • Naming of the sequence is based upon the published naming conventions (2).

Using the Sequence Alignment Tool

This documentation covers how to use both the HLA Alignment Tool and the KIR Alignment Tool.

The sequence alignment form contains the following options:

  • Select Locus - this option allows the user to choose which of the HLA or KIR genes to align from the drop-down menu. The drop-down menu also includes a number of special choices, like multiple sequence alignments for all the DRB1, 3, 4 & 5 alleles or all the DRB pseudogene alleles.

  • Select the feature to align - this option provides a list of alignments available for the locus selected. The alignments available include CDS alignments, individual exon alignments or alignments of combined regions. If an option listed in the drop-down menu for one locus is not listed for another locus, then it is either not possible or is currently unavailable.

  • Enter any specific sequences required - this allows the user to view alignments of specific sequences by entering either common nomenclature or a list of allele names. For example, to align DRB1*01:01, DRB1*01:02:01 and DRB1*01:02:02, the user could enter 01 or 01:0 into this box as the common nomenclature, and this will match the desired alleles. Alternatively, the user could enter "01:01, 01:02:01, 01:02:02" in the box, separating each allele name with either a comma or a new line, and this will also match the desired alleles. The wildcard character (*) may also be used in the allele name.

  • Enter the reference sequence - the alignment tool allows the user to select an alternative reference sequence. This is optional and, if not selected or altered, the tool uses the default sequence as listed:

    To use an alternative reference sequence, simply enter the full numerical code in the box provided. Please note that incorrect codes will cause errors in the alignment. For example, 01:01 is not a valid code for specifying A*01:01:01:01 as the reference sequence; the full numerical code must be entered. A consensus sequence, derived from the alleles specified for the alignment, can be used by typing "CONSENSUS" into the reference box.
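
The matching described for the "Enter any specific sequences required" box can be illustrated with a short sketch. It is not the tool's implementation. It assumes that each entry is matched as a prefix of the fields following the locus name, and that a wildcard (*) inside an entry matches any run of characters. The function names parse_entries and select_alleles are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def parse_entries(text):
    """Split the free-text box on commas or new lines and drop empty items."""
    return [item.strip() for item in re.split(r"[,\n]", text) if item.strip()]


def select_alleles(alleles, text):
    """Return the alleles whose fields start with any of the entries.

    Assumption: an entry is treated as a prefix, and '*' inside an entry
    matches any run of characters.
    """
    patterns = [entry + "*" for entry in parse_entries(text)]
    selected = []
    for allele in alleles:
        fields = allele.split("*", 1)[1]  # 'DRB1*01:02:01' -> '01:02:01'
        if any(fnmatchcase(fields, pattern) for pattern in patterns):
            selected.append(allele)
    return selected


alleles = ["DRB1*01:01", "DRB1*01:02:01", "DRB1*01:02:02", "DRB1*03:01"]
print(select_alleles(alleles, "01:0"))
print(select_alleles(alleles, "01:01, 01:02:01\n01:02:02"))
```

Under these assumptions, both calls select the three DRB1*01 alleles and leave DRB1*03:01 out.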


Note - in the DQB1 alignments, the DQB1*05 alleles are displayed first.

Alignment Display Options

  • Mismatches - this option selects whether to display the full sequence for all alleles in the alignment, or to only display those bases that mismatch the reference, e.g.:

    - Show mismatches between sequences:

    A*01:01:01:01 CGGGGGCCCT GGCCCTGACC
    A*01:02       ---------- -------C--

    - Show all bases:

    A*01:01:01:01 CGGGGGCCCT GGCCCTGACC
    A*01:02       CGGGGGCCCT GGCCCTGCCC

  • Numbering - depending on the alignment type, different numbering formats can be selected. For nucleotide sequences the alignments can be displayed in blocks of 10 nucleotides or in blocks of 3 nucleotides to represent the amino acid codons. Genomic alignments are only displayed in blocks of 10 nucleotides and protein alignments are always displayed in blocks of 10 amino acids. For either format, it may be necessary to increase the width of your browser window (or zoom out) to fully view the alignment. Full details of how sequences are numbered are given under "Numbering of the Sequence Alignment".

  • Alleles unsequenced in selected region - the user can omit the alleles that are not sequenced over the region of interest from their alignment. This will reduce the time taken to perform the alignment. For some loci, genomic alignments can contain over 1.5 million bases if all sequences are selected. When non-coding regions are selected, all alleles which contain unsequenced regions are removed from the alignment by default. Where possible, select only the sequences needed as this will reduce the loading time and make the alignments easier to view.

  • Output - to aid printing of the alignments, the user can select a text only version of the output. This removes all interactive tags and is easier to cut and paste into applications like Microsoft Word.
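
The two Mismatches display modes and the block layout can be reproduced with a minimal sketch. This is an illustration only, not the tool's code, and the function names blocks and mask_identity are hypothetical. The sequences are the A*01:01:01:01 and A*01:02 fragments used in the example above.

```python
def blocks(seq, size=10):
    """Split a sequence into space-separated blocks of `size` characters."""
    return " ".join(seq[i:i + size] for i in range(0, len(seq), size))


def mask_identity(reference, allele):
    """Replace every base identical to the reference with a hyphen.

    Periods (indels) and asterisks (unknown sequence) are kept as they are.
    """
    return "".join(
        "-" if a == r and a not in ".*" else a
        for r, a in zip(reference, allele)
    )


reference = "CGGGGGCCCTGGCCCTGACC"
allele = "CGGGGGCCCTGGCCCTGCCC"

print("A*01:01:01:01", blocks(reference))
print("A*01:02      ", blocks(mask_identity(reference, allele)))
```

Calling blocks with size=3 gives the codon layout described under Numbering.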

Constructing the Virtual Sequence

The procedure for inclusion of an allele into the sequence alignments is described below.

  • The sequence of the allele is derived from all sequence entries submitted to the IPD Sequence Database. These entries are from the generalist databanks like EMBL/GenBank/DDBJ.
  • A "virtual sequence" is constructed for each allele. This is produced using all the individual sequence entries in the IPD Sequence Database.

[Figure: Alignment of component sequences to form the "virtual sequence".]

  • The virtual sequence is then aligned against the reference sequence for that locus.
  • Where needed, periods (.) are inserted into the virtual sequence to keep it aligned to the reference sequence.
  • If the new allele has an insertion that causes the reference sequence to be amended, then the same change is propagated to all other sequences. This ensures that the reference sequence remains standardised.
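
The propagation step can be sketched as follows. This is a simplified illustration, assuming the alignment is held as a mapping from names to equal-length strings; the function open_gap and the toy sequences are hypothetical and do not come from the IPD pipeline.

```python
def open_gap(alignment, column, length=1):
    """Insert `length` periods at the 0-based `column` of every aligned sequence."""
    gap = "." * length
    return {name: seq[:column] + gap + seq[column:] for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences).
alignment = {
    "reference": "ATGGCC",
    "allele_1": "ATGGCT",
}

# A hypothetical new allele carries one extra base after the fourth position,
# so a period is opened at that column in the reference and in every other allele.
alignment = open_gap(alignment, column=4)
alignment["allele_2"] = "ATGGTCC"

for name, seq in alignment.items():
    print(f"{name:<10}{seq}")
```

After this step every row has the same length, and the reference still contains the same bases in the same order.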

The finalised sequence alignments are provided at a number of web sites. These alignments follow a number of conventions for displaying identity and evolutionary events, as well as for numbering the alignments. These conventions are explained below.

Numbering of the Sequence Alignment

In order to provide standardised sequences for any locus, the following numbering system has been established, which accurately represents the sequence at both the nucleotide and protein level. We have looked at the HUGO Gene Nomenclature Committee (1) recommendations proposed for the numbering of genomic sequences, and use a similar model for the HLA and KIR sequences held in the IPD Sequence Database. Many of their proposals already match our current strategy. HUGO recommends that for all nomenclature systems a standard reference sequence should be used for each locus. In the case of HLA and KIR sequences a standard reference sequence is already established for each gene. The remaining recommendations for nucleotide sequences are as follows:

Nucleotide Sequence Numbering

  • The numbering of the nucleotides in the reference sequence should remain constant.
  • For both gDNA and cDNA the A of the ATG initiator Methionine codon has been denoted nucleotide +1. In some non-expressed genes this codon is not present and in these cases the first base of the reference sequence has been denoted as nucleotide +1.
  • The nucleotide immediately preceding the A of the ATG initiator Methionine codon has been denoted nucleotide -1. Note that there is no nucleotide 0.
  • cDNA sequences are numbered consecutively from the A of the ATG initiator Methionine codon.
  • Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
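
Because there is no nucleotide 0, converting between a position in a stored sequence and its nucleotide number needs a small adjustment. The sketch below assumes an ungapped reference sequence held as a string and a known 0-based index for the A of the ATG; both function names are hypothetical.

```python
def index_to_number(index, atg_index):
    """Convert a 0-based string index to a nucleotide number.

    The A of the ATG is +1 and the base before it is -1; 0 is never returned.
    """
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset


def number_to_index(number, atg_index):
    """Convert a nucleotide number back to a 0-based string index."""
    if number == 0:
        raise ValueError("there is no nucleotide 0")
    return atg_index + number - 1 if number > 0 else atg_index + number


sequence = "CCATGG"               # toy sequence, two bases upstream of the ATG
atg_index = sequence.find("ATG")  # 2

for i, base in enumerate(sequence):
    print(base, index_to_number(i, atg_index))
```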

The following recommendations are used for describing mutations in nucleotide sequences:

  • Nucleotide substitutions are designated using the nucleotide number, followed by the substitution. For example, 997G>T denotes a substitution of G to T at position 997 of the DNA sequence.
  • Deletions are designated by 'del' after the nucleotide number. For example, 997delT denotes the deletion of a T at position 997 of the DNA. A deletion of several consecutive bases should be described in the form 997-998delTG, which denotes a deletion of TG at positions 997 and 998 of the DNA.
  • Insertions are designated by 'ins' after the nucleotide numbers bordering the insertion. For example, 997-998insT represents an insertion of T between bases 997 and 998 of the DNA. In the alignments produced this will be represented by a period (.), but the numbering of the reference sequence will not be altered to include this base. Insertions of multiple bases are designated using the same form: 997-998insTG denotes an insertion of TG between positions 997 and 998 of the DNA.
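
The three nucleotide forms above are regular enough to be read with simple patterns. The following is a minimal sketch rather than a full nomenclature parser: it assumes positive positions and unambiguous bases (A, C, G, T) only, and the function name parse_mutation and the returned dictionary keys are hypothetical.

```python
import re

SUBSTITUTION = re.compile(r"^(\d+)([ACGT])>([ACGT])$")
DELETION = re.compile(r"^(\d+)(?:-(\d+))?del([ACGT]+)$")
INSERTION = re.compile(r"^(\d+)-(\d+)ins([ACGT]+)$")


def parse_mutation(text):
    """Parse one nucleotide-level mutation into a small dictionary."""
    m = SUBSTITUTION.match(text)
    if m:
        return {"type": "substitution", "position": int(m.group(1)),
                "ref": m.group(2), "alt": m.group(3)}
    m = DELETION.match(text)
    if m:
        start = int(m.group(1))
        end = int(m.group(2)) if m.group(2) else start
        return {"type": "deletion", "start": start, "end": end, "bases": m.group(3)}
    m = INSERTION.match(text)
    if m:
        return {"type": "insertion", "after": int(m.group(1)),
                "before": int(m.group(2)), "bases": m.group(3)}
    raise ValueError(f"unrecognised mutation: {text!r}")


for example in ["997G>T", "997delT", "997-998delTG", "997-998insT"]:
    print(example, parse_mutation(example))
```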

Protein Sequence Numbering

  • For amino acid-based systems, the start codon of the mature protein is labeled codon 1.
  • The codon 5' to this is numbered -1.
  • All numbering is based on the reference sequence.
  • The single letter amino acid code is used in all protein alignments.
  • Nucleotide sequences may be displayed in codons; in this case the numbering follows that for protein sequences.
  • To avoid confusion with the nucleotide numbering, the prefix "p." may be added to the nomenclature to denote a protein sequence.
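
The relationship between cDNA nucleotide numbers and codon numbers can be sketched as below. The length of the leader (the codons preceding the mature protein) varies between genes, so it is passed in as a parameter; the value 24 used in the example is chosen purely for illustration, and the function name codon_number is hypothetical.

```python
def codon_number(nucleotide, leader_codons):
    """Map a cDNA nucleotide number (counted from the A of the ATG) to a codon number.

    The first codon of the mature protein is codon 1 and the codon 5' to it
    is codon -1, so 0 is never returned.
    """
    if nucleotide < 1:
        raise ValueError("expects a cDNA position at or after the A of the ATG")
    codon_index = (nucleotide - 1) // 3   # 0-based codon index from the ATG
    offset = codon_index - leader_codons  # 0 for the first mature codon
    return offset + 1 if offset >= 0 else offset


LEADER_CODONS = 24  # illustrative value only; use the value for the gene of interest

for n in (1, 72, 73, 75, 76):
    print(n, codon_number(n, LEADER_CODONS))
```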

Mutations in protein sequences follow a similar format:

  • For amino acid nomenclature the reference amino acid is listed first, followed by the codon and then the mutation. For example, Y97S represents the substitution of the Tyrosine at codon 97 by a Serine.
  • Stop codons are always designated by X. For example, T97X represents a Threonine replaced by a stop codon.
  • Deletions are again designated using 'del'. For example, T97del is the deletion of a Threonine at codon 97.
  • Insertions again follow the 'ins' convention. For example, T97-98ins represents a Threonine inserted between codons 97 and 98.
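
The protein forms can be read in the same spirit. The sketch below handles only the four forms listed above, with an optional "p." prefix, and turns each into a plain description; the function name describe_protein_change is hypothetical.

```python
import re

PROTEIN_CHANGE = re.compile(r"^(?:p\.)?([A-Z])(\d+)(?:-(\d+))?(del|ins|[A-Z])$")


def describe_protein_change(text):
    """Return a plain-English description of one protein-level change."""
    m = PROTEIN_CHANGE.match(text)
    if not m:
        raise ValueError(f"unrecognised change: {text!r}")
    residue, codon, second, change = m.groups()
    if change == "ins":
        if second is None:
            raise ValueError("an insertion needs the two bordering codons")
        return f"insertion of {residue} between codons {codon} and {second}"
    if second is not None:
        raise ValueError("a codon range is only used for insertions in this sketch")
    if change == "del":
        return f"deletion of {residue} at codon {codon}"
    if change == "X":
        return f"{residue} at codon {codon} replaced by a stop codon"
    return f"{residue} at codon {codon} replaced by {change}"


for example in ["Y97S", "T97X", "T97del", "T97-98ins", "p.Y97S"]:
    print(example, "->", describe_protein_change(example))
```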

Some tools provide sequence alignments where identity and mismatches are highlighted. In these tools, the following conventions are used.

  • The entry for each allele is displayed with respect to the reference sequence.
  • Where identity to the reference sequence is present the base will be displayed as a hyphen (-).
  • Non-identity to the reference sequence is shown by displaying the appropriate base at that position.
  • Where an insertion or deletion has occurred this will be represented by a period (.).
  • If the sequence is unknown at any point in the alignment, this will be represented by an asterisk (*).
  • In protein alignments for null alleles, the 'Stop' codons will be represented by the letter X.
  • In protein alignments, sequence following the termination codon will not be marked and will appear blank.
  • These conventions are used for both nucleotide and protein alignments.
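
Given these conventions, a displayed row can be turned back into a full sequence by copying the reference wherever a hyphen appears. The sketch below assumes that the reference row and the allele row have the same length; the function names expand_row and ungapped are hypothetical.

```python
def expand_row(reference, row):
    """Replace each hyphen in a displayed row with the reference character.

    Periods (indels) and asterisks (unknown sequence) are left unchanged.
    """
    return "".join(r if c == "-" else c for r, c in zip(reference, row))


def ungapped(sequence):
    """Remove alignment periods to recover the contiguous sequence."""
    return sequence.replace(".", "")


reference = "CGGGGGCCCTGGCCCTGACC"
row = "-----------------C--"

full = expand_row(reference, row)
print(full)
print(ungapped(expand_row("ATG.CC", "--*T-.")))
```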

References

  1. Antonarakis SE and the Nomenclature Working Group
    Recommendations for a Nomenclature System for Human Gene Mutations
    Human Mutation (1998) 11:1-3
  2. SGE Marsh, ED Albert, WF Bodmer, RE Bontrop, B Dupont, HA Erlich, M Fernández-Vina, DE Geraghty, R Holdsworth,
    CK Hurley, M Lau, KW Lee, B Mach, WR Mayr, M Maiers, CR Müller, P Parham, EW Petersdorf, T Sasazuki, JL Strominger,
    A Svejgaard, PI Terasaki, JM Tiercy, J Trowsdale
    Nomenclature for Factors of the HLA System, 2010
    Tissue Antigens (2010) 75:291-455
  3. Thompson JD, Higgins DG, Gibson TJ
    CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
    Nucleic Acids Research (1994) 22:4673-4680