ENA European Nucleotide Archive


The European Nucleotide Archive (ENA) provides a comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. ENA is developed and operated by the EMBL-European Bioinformatics Institute (EMBL-EBI), an academic research institute based in the UK and part of the European Molecular Biology Laboratory (EMBL). ENA is one of the three databases that make up the International Nucleotide Sequence Database Collaboration (INSDC).

GFBio Data Center

The EBI is a world-renowned center for research and services in bioinformatics and is the European node for globally coordinated efforts to collect and disseminate biological data. The EBI operates as the central hub of the European infrastructure ELIXIR, whose goal is to orchestrate the collection, quality control and archiving of the large amounts of biological data produced by life science experiments. As such, the EBI’s mission is to ensure that the growing body of information from molecular biology and genome research is placed in the public domain and is freely accessible to all parts of the scientific community in ways that promote scientific progress. The ENA team has almost 30 years of experience in capturing nucleotide sequence data, including the metadata that describe the experimental design used to produce them. ENA currently holds about 2.5 petabytes of data, with an approximate doubling time of 20 months.
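To give a feel for what a 20-month doubling time implies, the following minimal sketch projects archive size forward from the 2.5 petabyte figure quoted above. It assumes simple exponential growth at a constant doubling time, which is an idealisation; the function name and its parameters are illustrative and not part of any ENA tool.

```python
def projected_size_pb(months_ahead: float,
                      current_pb: float = 2.5,
                      doubling_months: float = 20.0) -> float:
    """Project archive size in petabytes, assuming constant exponential growth.

    size(t) = current_pb * 2 ** (t / doubling_months)
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_pb * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # Each 20-month step doubles the projected holdings.
    for months in (0, 20, 40, 60):
        print(f"{months:>3} months: {projected_size_pb(months):.1f} PB")
```

Under these assumptions the holdings would grow eightfold, to roughly 20 petabytes, within five years.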

Scientific data curation services (incl. taxonomic services)

Nucleotide sequence data and associated information (metadata) are deposited in ENA using one of three submission routes: (1) programmatic submission, (2) interactive submission using the submission interface Webin, and (3) a semi-automated route, where metadata are submitted using the interactive interface and data are deposited via an established institutional data submission service. Regardless of the submission route, all data and metadata are subject to the same validation tests. ENA issues permanent identifiers to all conceptual objects of the ENA metadata model and supports consistent description of these objects using checklists of information elements specific to each object.
Scientific data curation is progressively moving from manual review of sequence annotation towards the more impactful definition of validation rules for all supported data classes and of checklists for the data objects. This approach allows more scalable and sustainable quality checks of incoming data. The ENA team also provides a helpdesk that supports depositors in resolving issues related to data submission and data retrieval.
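The idea of rule-based validation against a checklist can be illustrated with a small sketch. The checklist below, its field names and its patterns are hypothetical and chosen only for illustration; they do not reproduce any actual ENA checklist or the validation code ENA runs.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldRule:
    """One information element of a checklist (hypothetical structure)."""
    name: str
    mandatory: bool = False
    pattern: Optional[str] = None  # regex the whole value must match, if set


# Hypothetical checklist for a sample object; NOT an actual ENA checklist.
SAMPLE_CHECKLIST = [
    FieldRule("collection_date", mandatory=True,
              pattern=r"\d{4}(-\d{2}(-\d{2})?)?"),
    FieldRule("country", mandatory=True),
    FieldRule("depth_m", pattern=r"\d+(\.\d+)?"),
]


def validate(record: Dict[str, str], checklist: List[FieldRule]) -> List[str]:
    """Return a list of error messages; an empty list means the record passed."""
    errors = []
    for rule in checklist:
        value = record.get(rule.name, "").strip()
        if not value:
            # Empty optional fields are fine; empty mandatory ones are not.
            if rule.mandatory:
                errors.append(f"missing mandatory field: {rule.name}")
            continue
        if rule.pattern and not re.fullmatch(rule.pattern, value):
            errors.append(f"invalid value for {rule.name}: {value!r}")
    return errors


if __name__ == "__main__":
    print(validate({"collection_date": "2021-05-03", "country": "Germany"},
                   SAMPLE_CHECKLIST))
    print(validate({"collection_date": "May 2021"}, SAMPLE_CHECKLIST))
```

Because the rules are data rather than manual review steps, the same check applies to every record whichever submission route it arrived by, which is what makes this style of curation scale.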
Taxonomic classification of all sequence data is based on the NCBI Taxonomy index, and all incoming sequences are validated against this index. Organisms not yet classified in the NCBI Taxonomy are handled according to rules summarised in the ENA documentation. Essentially, data depositors report basic details of the unclassified sequenced organism and an amendment is requested from the NCBI Taxonomy; such requests are typically resolved within a few working days.
As a primary data archive, ENA serves data for direct browsing, search and download, and provides infrastructure services to domain-specific databases that add value to the primary data.

Data domains

ENA's main focus is on long-term archiving and provisioning of nucleotide sequence data.

IT services

ENA provides software tools for efficient sequence data submission and retrieval. For a complete overview, see https://www.ebi.ac.uk/ena/browser/guides.

User services