EnsemblEnsembl mirror

Ensembl Core API Tutorial

Introduction

This tutorial demonstrates general API concepts, applicable across all parts of the Ensembl API. Assuming you have some familiarity with Perl, you will be shown how to get at information through the API, such as extracting genomic sequences and finding repeats.

The Perl API provides a level of abstraction over the Ensembl Core databases and is used by the Ensembl web interface, pipeline, and gene-build systems. To external users the API may be useful to automate the extraction of particular data, to customize Ensembl to fulfil a particular purpose, or to store additional data in Ensembl. As a brief introduction this tutorial focuses primarily on the retrieval of data from the Ensembl Core databases.

The Perl API is only one of many ways of accessing the data stored in Ensembl. Additionally there is a genome browser web interface, and the BioMart system. BioMart may be a more appropriate tool for certain types of data mining.

API Documentation

The Ensembl Core API has a decent set of code documentation in the form of standard Perl POD (Plain Old Documentation). This is documentation is mixed in with the actual code, but can be automatically extracted and formatted using some software tools. A web-browsable version is available on the Ensembl website.

To aid in your understanding of the relationships between objects used in Ensembl, you may wish to refer to the following Core API Overview. It shows the most commonly used objects in the core API as well as some links to other Ensembl APIs.

If you have your PERL5LIB environment variable set correctly (see the Perl API Installation instructions) you can use the command perldoc. For example the following command will bring up some documentation about the Slice class and each of its methods:

perldoc Bio::EnsEMBL::Slice

For additional information you can contact ensembl-dev, the Ensembl development mailing list.

Obtaining and Installing the Code

The Ensembl Perl API can be downloaded in gzipped TAR format from our FTP site, and is also made available through GitHub. Instructions for how to install the code.

Code Conventions

Several naming conventions are used throughout the API. Learning these conventions will aid in your understanding of the code.

  • Variable names are underscore-separated all lower-case words:
    $slice, @exons, %exon_hash, $gene_adaptor
  • Class and package names are mixed-case words that begin with capital letters.
    Bio::EnsEMBL::Gene, Bio::EnsEMBL::Exon, Bio::EnsEMBL::Slice, Bio::EnsEMBL::DBSQL::GeneAdaptor
  • Method names are entirely lower-case, underscore separated words. Class names in the method are an exception to this convention; these words begin with an upper-case letter and are not underscore separated. The word dbID is another exception which denotes the unique database identifier of an object. No method names begin with a capital letter, even if they refer to a class.
    fetch_all_by_Slice(), get_all_Genes(), translation(), fetch_by_dbID()
  • Method names that begin with an underscore '_' are intended to be private and should not be called externally from the class in which they are defined.
  • Object adaptors are responsible for the creation of various objects. The adaptor is named after the object it creates, and the methods responsible for the retrieval of these objects all start with the word fetch. All of the fetch methods returns only objects of the type that the adaptor creates. Therefore the object name is not required in the method name. For example, all fetch methods in the Gene adaptor return Gene objects. Non-adaptor methods generally avoid the use of the word fetch.
    fetch_all_by_Slice(), fetch_by_dbID(), fetch_by_region()
  • Methods which begin with get_all or fetch_all return references to lists. Many methods in Ensembl pass lists by reference, rather than by value, for efficiency. This might take some getting used to, but it results in more efficient code, especially when very large lists are passed around (as they often are in Ensembl).
    get_all_Transcripts(), fetch_all_by_Slice(), get_all_Exons()

The following examples demonstrate some of Perl's list reference syntax. You do not need to understand the API concepts in this example. The important thing to note is the language syntax; the concepts will be described later.

# get a slice adaptor for the human core database
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

# Fetch all clones from a slice adaptor (returns a list reference)
my $clones_ref = $slice_adaptor->fetch_all('clone');

# If you want a copy of the contents of the list referenced by
# the $clones_ref reference...
my @clones = @{$clones_ref};

# Get the first clone from the list via the reference:
my $first_clone = $clones_ref->[0];

# Iterate through all of the genes on a clone
foreach my $gene ( @{ $first_clone->get_all_Genes() } ) {
    print $gene->stable_id(), "\n";
}

# More memory efficient way of doing the same thing
my $genes = $first_clone->get_all_Genes();
while ( my $gene = shift @{$genes} ) {
    print $gene->stable_id(), "\n";
}

# Retrieve a single Slice object (not a list reference)
my $clone = $slice_adaptor->fetch_by_region( 'clone', 'AL031658.11' );
# No dereferencing needed:
print $clone->seq_region_name(), "\n";

A note about lazy loading and memory usage

Some of the data that makes up the objects returned from the Ensembl API is lazy loaded. By using lazy loading, we are able to minimize the number of database queries and only "fill in" the data in the object that the program actually asked for. This makes the code faster and its memory footprint smaller, but it also means that the more data that the program requests from an object the larger it becomes. The consequence of this is that looping over a large number of these objects in some cases might grow the memory footprint of the program considerably.

By using a while-shift loop rather than a foreach loop, the growth of the memory footprint due to lazy loading of data is more likely to stay small. This is why the comment on the last loop above says that it is a "more memory efficient way", and this is also why we use this convention for most similar loop constructs in the remainder of this API tutorial.

NB: This strategy obviously won't work if the contents of the list being iterated over is needed at some later point after the end of the loop.

Connecting to the Database: The Registry

All data used and created by Ensembl is stored in MySQL relational databases. If you want to access this database the first thing you have to do is to connect to it. This is done behind the scenes by Ensembl using the standard Perl DBI module. However, if your computer is behind a firewall, you need to allow ougoing connections to the corresponding ports. You will need to know two things before you start:

host
the name of the host where the Ensembl database lives
user
the user name used to access the database

First, we need to import all Perl modules that we will be using. Since we need a connection to an Ensembl database through the Registry we first have to import the Registry module which we use to establish this connection. Almost every Ensembl script that you will write will contain a use statement like the following:

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org', # alternatively 'useastdb.ensembl.org'
    -user => 'anonymous'
);

We've made a connection to an Ensembl Registry and passed parameters using the -attribute => 'somevalue' syntax present in many of the Ensembl object constructors. Formatted correctly, this syntax lets you see exactly what arguments and values you are passing.

In addition to the parameters provided above the optional port and pass parameters can be used specify the TCP port to connect via and the password to use respectively. These values have sensible defaults and can often be omitted.

The registry may be used to, for example, get a list of all Ensembl databases installed on a given database host:

use Bio::EnsEMBL::Registry;

my $registry = 'Bio::EnsEMBL::Registry';

$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org', # alternatively 'useastdb.ensembl.org'
    -user => 'anonymous'
);

my @db_adaptors = @{ $registry->get_all_DBAdaptors() };

foreach my $db_adaptor (@db_adaptors) {
    my $db_connection = $db_adaptor->dbc();

    printf(
        "species/group\t%s/%s\ndatabase\t%s\nhost:port\t%s:%s\n\n",
        $db_adaptor->species(),   $db_adaptor->group(),
        $db_connection->dbname(), $db_connection->host(),
        $db_connection->port()
    );
}

Object Adaptors

Before we launch into the ways the API can be used to retrieve and process data from the Ensembl databases it is best to mention the fundamental relationships the Ensembl objects have with the database.

The Ensembl API allows manipulation of the database data through various objects. For example, some of the more heavily used objects are the Gene, Slice and Exon objects. More details of how to effectively use these objects will be covered later. These objects are retrieved and stored in the database through the use of object adaptors. Object adaptors have internal knowledge of the underlying database schema and use this knowledge to fetch, store and remove objects (and data) from the database. This way you can write code and use the Ensembl Core API without having to know anything about the underlying databases you are using. The database adaptor that we created in the previous section is a special adaptor which has the responsibility of maintaining the database connection and creating other object adaptors.

Object adaptors are obtained from the Registry via a method named get_adaptor(). To obtain a Slice adaptor or a Gene adaptor (which retrieve Slice and Gene objects respectively) for Human, do the following after having loaded the Registry, here called $registry, as above:

my $gene_adaptor  = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

Don't worry if you don't immediately see how useful this could be. Just remember that you don't need to know anything about how the database is structured, but you can retrieve the necessary data (neatly packaged in objects) by asking for it from the correct adaptor. Throughout the rest of this document we are going to work through the ways the Ensembl objects can be used to derive the information you want.

Slices

A Slice object represents a single continuous region of a genome. Slices can be used to obtain sequence, features or other information from a particular region of interest. To retrieve a slice it is first necessary to get a slice adaptor. Here we get such an adaptor for the latest version of the Human database.

my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

The slice adaptor provides several ways to obtain slices, but we will start with the fetch_by_region() method which is the most commonly used. This method takes numerous arguments but most of them are optional. In order, the arguments are: coord_system_name, seq_region_name, start, end, strand, coord_system_version. The following are several examples of how to use the fetch_by_region() method:

# get a slice adaptor for the human core database
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );

# Obtain a slice covering the entire chromosome X
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'X' );

# Obtain a slice covering the entire clone AL359765.6
$slice = $slice_adaptor->fetch_by_region( 'clone', 'AL359765.6' );

# Obtain a slice covering an entire NT contig
$slice = $slice_adaptor->fetch_by_region( 'supercontig', 'NT_011333' );

# Obtain a slice covering the region from 1MB to 2MB (inclusively) of
# chromosome 20
$slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 1e6, 2e6 );

Another useful way to obtain a slice is with respect to a gene:

my $slice = $slice_adaptor->fetch_by_gene_stable_id( 'ENSG00000099889', 5e3 );

This will return a slice that contains the sequence of the gene specified by its stable Ensembl ID. It also returns 5000bp of flanking sequence at both the 5' and 3' ends, which is useful if you are interested in the environs that a gene inhabits. You needn't have the flanking sequence it you don't want it — in this case set the number of flanking bases to zero or simply omit the second argument entirely. Note that for historical reasons the fetch_by_gene_stable_id() method always returns a slice on the forward strand even if the gene is on the reverse strand.

To retrieve a set of slices from a particular coordinate system the fetch_all() method can be used:

# Retrieve slices of every chromosome in the database
@slices = @{ $slice_adaptor->fetch_all('chromosome') };

# Retrieve slices of every BAC clone in the database
@slices = @{ $slice_adaptor->fetch_all('clone') };

For certain types of analysis it is necessary to break up regions into smaller manageable pieces. The method split_Slices() can be imported from the Bio::EnsEMBL::Utils::Slice module to break up larger slices into smaller component slices.

use Bio::EnsEMBL::Utils::Slice qw(split_Slices);

# ...

my $slices = $slice_adaptor->fetch_all('chromosome');

# Base pair overlap between returned slices
my $overlap = 0;

# Maximum size of returned slices
my $max_size = 100_000;

# Break chromosomal slices into smaller 100k component slices
$slices = split_Slices( $slices, $max_size, $overlap );

To obtain sequence from a slice the seq() or subseq() methods can be used:

my $sequence = $slice->seq();
print $sequence, "\n";

$sequence = $slice->subseq( 100, 200 );

We can query the slice for information about itself:

# The method coord_system() returns a Bio::EnsEMBL::CoordSystem object
my $coord_sys  = $slice->coord_system()->name();
my $seq_region = $slice->seq_region_name();
my $start      = $slice->start();
my $end        = $slice->end();
my $strand     = $slice->strand();

print "Slice: $coord_sys $seq_region $start-$end ($strand)\n";

Many object adaptors can provide a set of features which overlap a slice. The slice itself also provides a means to obtain features which overlap its region. The following are two ways to obtain a list of genes which overlap a slice:

@genes = @{ $gene_adaptor->fetch_all_by_Slice($slice) };

# Another way of doing the same thing:
@genes = @{ $slice->get_all_Genes() };

Features

Features are objects in the database which have a defined location on the genome. All features in Ensembl inherit from the Bio::EnsEMBL::Feature class and have the following location defining attributes: start, end, strand, slice.

In addition to locational attributes all features have internal database identifiers accessed via the method dbID(). All feature objects can be retrieved from their associated object adaptors using a slice object or the feature's internal identifier (dbID). The following example illustrates how Transcript features and DnaDnaAlignFeature features can be obtained from the database. All features in the database can be retrieved in similar ways from their own object adaptors.

my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
my $tr_adaptor    = $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
my $daf_adaptor   = $registry->get_adaptor( 'Human', 'Core', 'DnaAlignFeature' );

# Get a slice of chromosome 20, 10MB-11MB
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '20', 10e6, 11e6 );

# Fetch all of the transcripts overlapping chromosome 20, 10MB-11MB
my $transcripts = $tr_adaptor->fetch_all_by_Slice($slice);
while ( my $tr = shift @{$transcripts} ) {
    my $dbID      = $tr->dbID();
    my $start     = $tr->start();
    my $end       = $tr->end();
    my $strand    = $tr->strand();
    my $stable_id = $tr->stable_id();

    print "Transcript $stable_id [$dbID] $start-$end ($strand)\n";
}

# Fetch all of the DNA-DNA alignments overlapping chromosome 20, 10MB-11MB
my $dafs = $daf_adaptor->fetch_all_by_Slice($slice);
while ( my $daf = shift @{$dafs} ) {
    my $dbID     = $daf->dbID();
    my $start    = $daf->start();
    my $end      = $daf->end();
    my $strand   = $daf->strand();
    my $hseqname = $daf->hseqname();

    print "DNA Alignment $hseqname [$dbID] $start-$end ($strand)\n";
}

# Fetch a transcript by its internal identifier
my $transcript = $tr_adaptor->fetch_by_dbID(100);

# Fetch a dnaAlignFeature by its internal identifiers
my $dna_align_feat = $daf_adaptor->fetch_by_dbID(100);

All features also have the methods transform(), transfer(), and project() which are described in detail in the Transform, Transfer and Project sections of this tutorial.

Genes, Transcripts, and Exons

Genes, exons and transcripts are also features and can be treated in the same way as any other feature within Ensembl. A transcript in Ensembl is a grouping of exons. A gene in Ensembl is a grouping of transcripts which share any overlapping (or partially overlapping) exons. Transcripts also have an associated Translation object which defines the UTR and CDS composition of the transcript. Introns are not defined explicitly in the database but can be obtained by the Transcript method get_all_Introns().

Like all Ensembl features the start of an exon is always less than or equal to the end of the exon, regardless of the strand it is on. The start of the transcript is the start of the first exon of a transcript on the forward strand or the end of the last exon of a transcript on the reverse strand. The start and end of a gene are defined to be the lowest start value of its transcripts and the highest end value respectively.

Genes, translations, transcripts and exons all have stable identifiers. These are identifiers that are assigned to Ensembl's predictions, and maintained in subsequent releases. For example, if a transcript (or a sufficiently similar transcript) is re-predicted in a future release then it will be assigned the same stable identifier as its predecessor.

The following is an example of the retrieval of a set of genes, transcripts and exons:

sub feature2string
{
    my $feature = shift;

    my $stable_id  = $feature->stable_id();
    my $seq_region = $feature->slice->seq_region_name();
    my $start      = $feature->start();
    my $end        = $feature->end();
    my $strand     = $feature->strand();

    return sprintf( "%s: %s:%d-%d (%+d)",
        $stable_id, $seq_region, $start, $end, $strand );
}

my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', 'X', 1e6, 10e6 );

my $genes = $slice->get_all_Genes();
while ( my $gene = shift @{$genes} ) {
    my $gstring = feature2string($gene);
    print "$gstring\n";

    my $transcripts = $gene->get_all_Transcripts();
    while ( my $transcript = shift @{$transcripts} ) {
        my $tstring = feature2string($transcript);
        print "\t$tstring\n";

        foreach my $exon ( @{ $transcript->get_all_Exons() } ) {
            my $estring = feature2string($exon);
            print "\t\t$estring\n";
        }
    }
}

In addition to the methods which are present on every feature, the transcript class has many other methods which are commonly used. Several methods can be used to obtain transcript related sequences. For historical reasons some of these methods return strings while others return Bio::Seq objects. The following example demonstrates the use of some of these methods:

# The spliced_seq() method returns the concatenation of the exon
# sequences. This is the cDNA of the transcript
print "cDNA: ", $transcript->spliced_seq(), "\n";

# The translateable_seq() method returns only the CDS of the transcript
print "CDS: ", $transcript->translateable_seq(), "\n";

# UTR sequences are obtained via the five_prime_utr() and
# three_prime_utr() methods
my $fiv_utr = $transcript->five_prime_utr();
my $thr_utr = $transcript->three_prime_utr();

print "5' UTR: ", ( defined $fiv_utr ? $fiv_utr->seq() : "None" ), "\n";
print "3' UTR: ", ( defined $thr_utr ? $thr_utr->seq() : "None" ), "\n";

# The protein sequence is obtained from the translate() method. If the
# transcript is non-coding, undef is returned.
my $protein = $transcript->translate();

print "Translation: ", ( defined $protein ? $protein->seq() : "None" ), "\n";

Translations and ProteinFeatures

Translation objects and protein sequence can be extracted from a Transcript object. It is important to remember that some Ensembl transcripts are non-coding (pseudo-genes, ncRNAs, etc.) and have no translation. The primary purpose of a Translation object is to define the CDS and UTRs of its associated Transcript object. Peptide sequence is obtained directly from a Transcript object not a Translation object as might be expected. Once you have a Translation you can go back to its Transcript. If you retrieved the Translation using a stable identifier then the API will fetch the appropriate Transcript automatically. The following example obtains the protein sequence of a Transcript and the Translation's stable identifier:

my $stable_id = 'ENST00000528762';

my $transcript_adaptor =
  $registry->get_adaptor( 'Human', 'Core', 'Transcript' );
my $transcript = $transcript_adaptor->fetch_by_stable_id($stable_id);

print $transcript->translation()->stable_id(), "\n";
print $transcript->translate()->seq(),         "\n";

print $transcript->translation()->transcript()->stable_id(), "\n";

ProteinFeatures are features which are on an amino acid sequence rather than a nucleotide sequence. The method get_all_ProteinFeatures() can be used to obtain a set of protein features from a Translation object.

$translation = $transcript->translation();

my $pfeatures = $translation->get_all_ProteinFeatures();
while ( my $pfeature = shift @{$pfeatures} ) {
    my $logic_name = $pfeature->analysis()->logic_name();

    printf(
        "%d-%d %s %s %s\n",
        $pfeature->start(), $pfeature->end(), $logic_name,
        $pfeature->interpro_ac(),
        $pfeature->idesc()
    );
}

If only the protein features created by a particular analysis are desired the name of the analysis can be provided as an argument. To obtain the subset of features which are considered to be 'domain' features the convenience method get_all_DomainFeatures() can be used:

my $seg_features    = $translation->get_all_ProteinFeatures('Seg');
my $domain_features = $translation->get_all_DomainFeatures();

PredictionTranscripts

PredictionTranscripts are the results of ab initio gene finding programs that are stored in Ensembl. Example programs include Genscan and SNAP. Prediction transcripts have the same interface as normal transcripts and thus they can be used in the same way.

my $ptranscripts = $slice->get_all_PredictionTranscripts();

while ( my $ptranscript = shift @{$ptranscripts} ) {
    my @exons = @{ $ptranscript->get_all_Exons() };
    my $type  = $ptranscript->analysis()->logic_name();

    printf "%s prediction, has %d exons\n", $type, scalar @exons;

    foreach my $exon (@exons) {
        printf( "%d - %d (%+d) %d\n",
            $exon->start(), $exon->end(), $exon->strand(), $exon->phase() );
    }
}

Alignment Features

Two types of alignments are stored in the core Ensembl database: alignments of DNA sequence to the genome and alignments of protein sequence to the genome. These can be retrieved as DnaDnaAlignFeatures and DnaPepAlignFeatures respectively. A single gapped alignment is represented by a single feature with a cigar line. A cigar line is a compact representation of a gapped alignment as single string containing letters M (match) D (deletion), and I (insertion) prefixed by integer lengths (the number may be omitted if it is 1). A gapped alignment feature can be broken into its component ungapped alignments by the method ungapped_features() which returns a list of FeaturePair objects. The following example shows the retrieval of some alignment features.

# Retrieve dna-dna alignment features from the slice region
my @features = @{ $slice->get_all_DnaAlignFeatures('Vertrna') };
print_align_features( \@features );

# Retrieve protein-dna alignment features from the slice region
@features = @{ $slice->get_all_ProteinAlignFeatures('Swall') };
print_align_features( \@features );

sub print_align_features
{
    my $features_ref = shift;

    foreach my $feature ( @{$features_ref} ) {
        print_feature_pairs( [$feature] );

        print "Percent identity: ", $feature->percent_id(),   "\n";
        print "Cigar string: ",     $feature->cigar_string(), "\n";

        my @ungapped = $feature->ungapped_features();

        print "ungapped:\n";
        print_feature_pairs( \@ungapped );
        print "\n";
    }
}

sub print_feature_pairs
{
    my $features_ref = shift;

    foreach my $feature ( @{$features_ref} ) {
        # Display the 'hit' name and coordinates together with the
        # genomic coordinates.
        printf(
            "%s %d-%d (%+d)\t=> %d-%d (%+d)\n",
            $feature->hseqname(), $feature->hstart(), $feature->hend(),
            $feature->hstrand(),  $feature->start(),  $feature->end(),
            $feature->strand()
        );
    }
}

Repeats

Repetitive regions found by RepeatMasker and TRF (Tandem Repeat Finder) are represented in the Ensembl database as RepeatFeatures. Low-complexity regions are found by the program Dust and are also stored as RepeatFeatures. RepeatFeatures can be retrieved and used in the same way as other Ensembl features.

my @repeats = @{ $slice->get_all_RepeatFeatures() };

foreach my $repeat (@repeats) {
    printf( "%s %d-%d\n",
        $repeat->display_id(), $repeat->start(), $repeat->end() );
}

RepeatFeatures are used to perform repeat masking of the genomic sequence. Hard or soft-masked genomic sequence can be retrieved from Slice objects using the get_repeatmasked_seq() method. Hard-masking replaces sequence in repeat regions with Ns. Soft-masking replaces sequence in repeat regions with lower-case sequence.

my $unmasked_seq   = $slice->seq();
my $hardmasked_seq = $slice->get_repeatmasked_seq();
my $softmasked_seq = $slice->get_repeatmasked_seq( undef, 1 );

# Soft-mask sequence using TRF results only
my $tandem_masked_seq = $slice->get_repeatmasked_seq( ['TRF'], 1 );

Markers

Markers are imported into the Ensembl database from UniSTS and several other sources. A marker in Ensembl consists of a pair of primer sequences, an expected product size and a set of associated identifiers known as synonyms. Markers are placed on the genome electronically using an analysis program such as ePCR and their genomic positions are retrievable as MarkerFeatures. Map locations (genetic, radiation hybrid and in situ hybridization) for markers obtained from actual experimental evidence are also accessible.

Markers can be fetched via their name from the MarkerAdaptor.

my $marker_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Marker' );

# Obtain marker by synonym (this returns a list, but there's seldom more
# than one marker in the list)
my $marker = $marker_adaptor->fetch_all_by_synonym('D9S1038E')->[0];

# Display the various names associated with the same marker
foreach my $synonym ( @{ $marker->get_all_MarkerSynonyms() } ) {
    if ( defined $synonym->source() ) {
        print $synonym->source(), ':';
    }
    print $synonym->name(), ' ';
}
print "\n";

# Display the primer info
printf( "left primer : %s\n", $marker->left_primer() );
printf( "right primer: %s\n", $marker->right_primer() );
printf(
    "product size: %d-%d\n",
    $marker->min_primer_dist(),
    $marker->max_primer_dist()
);

# Display out genetic/RH/FISH map information
print "Map locations:\n";
foreach my $map_loc ( @{ $marker->get_all_MapLocations() } ) {
    printf( "\t%s %s %s\n",
        $map_loc->map_name(), $map_loc->chromosome_name(),
        $map_loc->position() );
}

MarkerFeatures, which represent genomic positions of markers, can be retrieved and manipulated in the same way as other Ensembl features.

# Obtain the positions for an already retrieved marker
foreach my $marker_feature ( @{ $marker->get_all_MarkerFeatures() } ) {
    printf( "%s %d-%d\n",
        $marker_feature->seq_region_name(),
        $marker_feature->start(),
        $marker_feature->end() );
}

# Retrieve all marker features in a given region
my $marker_features = $slice->get_all_MarkerFeatures();
while ( my $marker_feature = shift @{$marker_features} ) {
    printf( "%s %s %d-%d\n",
        $marker_feature->display_id(), $marker_feature->seq_region_name(),
        $marker_feature->start(),      $marker_feature->end() );
}

MiscFeatures

MiscFeatures are features with arbitrary attributes which are placed into arbitrary groupings. MiscFeatures can be retrieved as any other feature and are classified into distinct sets by a set code. Generally it only makes sense to retrieve all features which have a particular set code because very diverse types of MiscFeatures are stored in the database.

MiscFeature attributes are represented by Attribute objects and can be retrieved via a get_all_Attributes() method.

The following example retrieves all MiscFeatures representing ENCODE regions on a given slice and prints out their attributes:

my $encode_regions = $slice->get_all_MiscFeatures('encode');

while ( my $encode_region = shift @{$encode_regions} ) {
    my @attributes = @{ $encode_region->get_all_Attributes() };
    foreach my $attribute ( @attributes ) {
        printf "%s:%s\n", $attribute->name(), $attribute->value();
    }
}

This example retrieves all misc features representing a BAC clone via its name and prints out their location and other information:

my $mf_adaptor = $registry->get_adaptor( 'Human', 'Core', 'MiscFeature' );

my $clones =
  $mf_adaptor->fetch_all_by_attribute_type_value( 'clone_name', 'RP11-62N12' );

while ( my $clone = shift @{$clones} ) {
    my $slice = $clone->slice();

    printf( "%s %s %d-%d\n",
        $slice->coord_system->name(),
        $slice->seq_region_name(),
        $clone->start(), $clone->end() );

    my @attributes = @{ $clone->get_all_Attributes() };
    foreach my $attribute ( @attributes ) {
        printf "\t%s:%s\n", $attribute->name(), $attribute->value();
    }
}

External References

Ensembl cross references its genes, transcripts and translations with identifiers from other databases. A DBEntry object represents a cross reference and is often referred to as an Xref. The following code snippet retrieves and prints DBEntries for a gene, its transcripts and its translations:

# Define a helper subroutine to print DBEntries
sub print_DBEntries
{
    my $db_entries = shift;

    foreach my $dbe ( @{$db_entries} ) {
        printf "\tXREF %s (%s)\n", $dbe->display_id(), $dbe->dbname();
    }
}

my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

# Get the 'COG6' gene from human
my $gene = $gene_adaptor->fetch_by_display_label('COG6');

print "GENE ", $gene->stable_id(), "\n";
print_DBEntries( $gene->get_all_DBEntries() );

foreach my $transcript ( @{ $gene->get_all_Transcripts() } ) {
    print "TRANSCRIPT ", $transcript->stable_id(), "\n";
    print_DBEntries( $transcript->get_all_DBEntries() );

    # Watch out: pseudogenes have no translation
    if ( defined $transcript->translation() ) {
        my $translation = $transcript->translation();

        print "TRANSLATION ", $translation->stable_id(), "\n";
        print_DBEntries( $translation->get_all_DBEntries() );
    }
}

Often it is useful to obtain all of the DBEntries associated with a gene and its associated transcripts and translation as in the above example. As a shortcut to calling get_all_DBEntries() on all of the above objects the get_all_DBLinks() method can be used instead. The above example could be shortened by using the following:

print_DBEntries( $gene->get_all_DBLinks() );

Coordinates

We have already discussed the fact that slices and features have coordinates, but we have not defined exactly what these coordinates mean.

Ensembl, and many other bioinformatics applications, use inclusive coordinates which start at 1. The first nucleotide of a DNA sequence is 1 and the first amino acid of a protein sequence is also 1. The length of a sequence is defined as end - start + 1.

In some rare cases inserts are specified with a start which is one greater than the end. For example a feature with a start of 10 and an end of 9 would be a zero length feature between base pairs 9 and 10.

Slice coordinates are relative to the start of the underlying DNA sequence region. The strand of the slice represents its orientation relative to the default orientation of the sequence region. By convention the start of the slice is always less than the end, and does not vary with its strandedness. Most slices you will encounter will have a strand of 1, and this is what we will consider in our examples. It is legal to create a slice which extends past the boundaries of a sequence region. Sequence retrieved from regions where the sequence is not defined will consist of Ns.

All features retrieved from the database have an associated slice (accessible via the slice() method). A feature's coordinates are always relative to this associated slice, i.e. the start and end attributes define a feature's position relative to the start of the slice the feature is on (or the end of the slice if it is a negative strand slice). The strand attribute of a feature is relative to the strand of the slice. By convention the start of a feature is always less than or equal to the end of the feature regardless of its strand (except in the case of an insert). It is legal to have features with coordinates which are less than one or greater than the length of the slice. Such cases are common when features that partially overlap a slice are retrieved from the database.

Consider, for example, the following figure of two features associated with a slice:

 [-----] (Feature A)

    |================================| (Slice)

          [--------] (Feature B)

 A  C  T  A  A  A  T  C  T  T  G   (Sequence)
 1  2  3  4  5  6  7  8  9  10 11 12 13

The slice itself has a start of 2, an end of 13, and a length of 12 even though the underlying sequence region only has a length of 11. Retrieving the sequence of such a slice would give the string CTAAATCTTGNN — the undefined region of sequence is represented by Ns. Feature A has a start of 0, an end of 2, and a strand of 1. Feature B has a start of 3, an end of 6, and a strand of -1.

Coordinate Systems

Sequences stored in Ensembl are associated with coordinate systems. What the coordinate systems are varies from species to species. For example, the homo_sapiens database has the following coordinate systems: contig, clone, supercontig, chromosome. Sequence and features may be retrieved from any coordinate system despite the fact they are only stored internally in a single coordinate system. The database stores the relationship between these coordinate systems and the API provides means to convert between them. The API has a CoordSystem object and object adaptor, however, these are most often used internally. The following example fetches a chromosome coordinate system object from the database:

my $cs_adaptor = $registry->get_adaptor( 'Human', 'Core', 'CoordSystem' );
my $cs = $cs_adaptor->fetch_by_name('chromosome');

printf "Coordinate system: %s %s\n", $cs->name(), $cs->version();

A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of the human chromosome coordinate system might be something like NCBI35 or NCBI36, depending on the version of the Core databases used.

Slice objects have an associated CoordSystem object and a seq_region_name() method that returns its name which uniquely defines the sequence that they are positioned on. You may have noticed that the coordinate system of the sequence region was specified when obtaining a slice in the fetch_by_region() method. Similarly the version may also be specified (though it can almost always be omitted):

$slice =
  $slice_adaptor->fetch_by_region( 'chromosome', 'X', 1e6, 10e6, '1', 'NCBI36' );

Sometimes it is useful to obtain full slices of every sequence in a given coordinate system; this may be done using the SliceAdaptor method fetch_all():

my @chromosomes = @{ $slice_adaptor->fetch_all('chromosome') };
my @clones      = @{ $slice_adaptor->fetch_all('clone') };

Now suppose that you wish to write code which is independent of the species used. Not all species have the same coordinate systems; the available coordinate systems depends on the style of assembly used for that species (WGS, clone-based, etc.). You can obtain the list of available coordinate systems for a species using the CoordSystemAdaptor and there is also a special pseudo-coordinate system named toplevel. The toplevel coordinate system is not a real coordinate system, but is used to refer to the highest level coordinate system in a given region. The toplevel coordinate system is particularly useful in genomes that are incompletely assembled. For example, the latest zebrafish genome consists of a set of assembled chromosomes, and a set of supercontigs that are not part of any chromosome. In this example, the toplevel coordinate system sometimes refers to the chromosome coordinate system and sometimes to the supercontig coordinate system depending on the region it is used in.

# List all coordinate systems in this database:
my @coord_systems = @{ $cs_adaptor->fetch_all() };
foreach $cs (@coord_systems) {
    printf "Coordinate system: %s %s\n", $cs->name(), $cs->version;
}

# Get all slices on the highest coordinate system:
my @slices = @{ $slice_adaptor->fetch_all('toplevel') };

Transform

Features on a slice in a given coordinate system may be moved to another slice in the same coordinate system or to another coordinate system entirely. This is useful if you are working with a particular coordinate system but you are interested in obtaining the features coordinates in another coordinate system.

The transform() method can be used to move a feature to any coordinate system which is in the database. The feature will be placed on a slice which spans the entire sequence that the feature is on in the requested coordinate system.

if ( my $new_feature = $feature->transform('clone') ) {
    printf(
        "Feature's clonal position is: %s %d-%d (%+d)\n",
        $new_feature->slice->seq_region_name(),
        $new_feature->start(), $feature->end(), $new_feature->strand()
    );
} else {
    print "Feature is not defined in clonal coordinate system\n";
}

The transform() method returns a copy of the original feature in the new coordinate system, or undef if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:

                  |~~~~~~~| (Feature A) |~~~~| (Feature B)

 (ctg 1) [=============]
         (ctg 2) (------==========] (ctg 2)
                      (ctg 3)   (--============] (ctg3)

Both Feature A and Feature B are defined in the chromosomal coordinate system described by the tiling path of contigs. However, Feature A is not defined in the contig coordinate system because it spans both Contig 1 and Contig 2. Feature B, on the other hand, is still defined in the contig coordinate system.

The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:

my $new_feature = $feature->transform('toplevel');

printf(
    "Feature's toplevel position is: %s %s %d-%d (%+d)\n",
    $new_feature->slice->coord_system->name(),
    $new_feature->slice->seq_region_name(),
    $new_feature->start(),
    $new_feature->end(),
    $new_feature->strand()
);

Transfer

Another method that is available on all Ensembl features is the transfer() method. The transfer() method is similar to the previously described transform() method, but rather than taking a coordinate system argument it takes a slice argument. This is useful when you want a feature's coordinates to be relative to a certain region. Calling transform() on the feature will return a copy of the which is shifted onto the provided slice. If the feature would be placed on a gap or across a coordinate system boundary, then undef is returned instead. It is illegal to transfer a feature to a slice on a sequence region which is cannot be placed on. For example, a feature which is on chromosome X cannot be transferred to a slice on chromosome 20 and attempting to do so will raise an exception. It is legal to transfer a feature to a slice on which it has coordinates past the slice end or before the slice start. The following example illustrates the use of the transfer method:

my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '2',
1e6, 2e6 );

my $new_slice =
  $slice_adaptor->fetch_by_region( 'chromosome', '2', 1.5e6, 2e6 );

foreach my $simple_feature ( @{ $slice->get_all_SimpleFeatures('Eponine') } ) {
    printf(
        "Before:\t%6d - %6d\n",
        $simple_feature->start(),
        $simple_feature->end()
    );

    my $new_feature = $simple_feature->transfer($new_slice); if (
    !defined $new_feature ) { print "Could not transfer feature\n"; }
    else { printf( "After:\t%6d - %6d\n", $new_feature->start(),
    $new_feature->end() ); } }

In the above example a slice from another coordinate system could also have been used, provided you had an idea about what sequence region the features would be mapped to.

Project

When moving features between coordinate systems it is usually sufficient to use the transfer() or transform() methods. Sometimes, however, it is necessary to obtain coordinates in a another coordinate system even when a coordinate system boundary is crossed. Even though the feature is considered to be undefined in this case, the feature's coordinates in can still be obtained in the requested coordinate system using the project() method.

Both slices and features have their own project() methods, which take the same arguments and have the same return values. The project() method takes a coordinate system name as an argument and returns a reference to a list of ProjectionSegment objects. A projection segment has three attributes, accessible by the methods from_start(), from_end(), to_Slice(). The from_start() and from_end() methods return integers representing the part of the feature or slice that is used to form that part of the projection. The to_Slice() method returns a slice object representing the part of the region that the slice or feature was projected to. The following example illustrates the use of the project() method on a feature. The project() method on a slice can be used in the same way. As with the feature transform() method the pseudo coordinate system toplevel can be used to indicate you wish to project to the highest possible level.

printf( "Feature at: %s %d-%d (%+d) projects to\n",
    $feature->seq_region_name(), $feature->start(),
    $feature->end(),             $feature->strand() );

my $projection = $feature->project('clone');

foreach my $segment ( @{$projection} ) {
    my $to_slice = $segment->to_Slice();

    printf(
        "\t%s %d-%d (%+d)\n",
        $to_slice->seq_region_name(), $to_slice->start(),
        $to_slice->end(),             $to_slice->strand()
    );
}

An example script and sample input for mapping coordinates between assembly versions can be found in the Assembly converter directory on the Ensembl GitHub website.

Feature Convenience Methods

We have described how a feature's position on the genome is defined by an associated slice and a start, end, and strand on that slice. Often it is more convenient to retrieve a feature's absolute position on the underlying sequence region rather than its relative position on a slice. For convenience a number of methods are provided that can be used to easily obtain a feature's absolute coordinates.

# Shortcuts to doing $feat->slice()->coord_system()->name() and
# $feat->slice()->seq_region_name();
print $feature->coord_system_name(), ' ', $feature->seq_region_name(), ' ';

# Get and display the feature's position on the sequence region
printf( "%d-%d (%+d)\n",
    $feature->seq_region_start(),
    $feature->seq_region_end(),
    $feature->seq_region_strand() );

Another useful method is display_id(). This will return a string that can be used as the name or identifier for a particular feature. For a gene or transcript this method would return the name/symbol, for an alignment feature this would return the hit sequence name (hseqname), etc.

# The display_id() method returns a suitable display value for any
# feature type
print $feature->display_id(), "\n";

The feature_Slice() method will return a slice which is the exact overlap of the feature the method was called on. This slice can then be used to obtain the underlying sequence of the feature or to retrieve other features that overlap the same region, etc.

$feature_slice = $feature->feature_Slice();

# Display the sequence of the feature region
print $feature_slice->seq(), "\n";

# Display the sequence of the feature region + 5000bp flanking sequence
print $feature_slice->expand( 5000, 5000 )->seq(), "\n";

# Get all genes which overlap the feature
$genes = $feature_slice->get_all_Genes();

Other Scripts

Other useful scripts can be found in the misc-scripts directory on the Ensembl FTP site:

  • print_alternate_loci.pl - prints alternate loci for a given species and reference chromosome
  • print_refseq_transcripts.pl - prints refseq transcripts for a given species and logic name