The GISAID EpiFlu™ Influenza Database - Curation of the Data
Background: In response to the growing need of the global influenza community to share genetic sequences and associated epidemiological and clinical data, the GISAID initiative launched the EpiFlu™ database in May 2008 as a tool for sharing and analyzing such data. The EpiFlu™ database is publicly accessible (www.gisaid.org) and has introduced a unique sharing mechanism that protects the rights of the data submitter while facilitating further research and the development of vaccines and policies for drug use. It enshrines the principle of acknowledging the contributions of all participants in order to sustain a collaborative ethos throughout the influenza community. All users identify themselves and agree not to attach any restrictions to the data, to acknowledge both the originator of the specimen and the submitter of the data, and to seek collaborations with the data providers. With this distinctive mechanism, GISAID’s EpiFlu™ database provides an alternative to current public-domain databases. As of May 11, 2011, GISAID’s EpiFlu™ database comprised 194,244 nucleotide sequences from 56,673 isolates; approximately 20% of them (36,935 nucleotide sequences from 15,102 isolates) were submitted directly to GISAID’s EpiFlu™. With the majority of the latter available only in GISAID’s database, EpiFlu™ has emerged as the world’s most comprehensive collection of influenza sequence data.
Methods and Results: As of January 2011, the Federal Republic of Germany is the official host of the EpiFlu™ database. Three German institutions are engaged in the development and maintenance of the database: the Max Planck Institute for Informatics (MPII) is responsible for the development of the software, the Federal Office for Agriculture and Food (BLE) hosts the GISAID portal, and the Friedrich-Loeffler-Institute (FLI) performs quality control and data curation. The quality of the data is crucial for detailed analyses of very large molecular datasets.
With the rapidly rising volume of sequence data, a systematic and scalable procedure for data curation is becoming increasingly essential. Improving the effectiveness and quality of data curation is therefore an important aspect of the GISAID EpiFlu™ database. GISAID data curation is a two-stage process with automatic and manual annotation steps. During submission, an automatic procedure checks each sequence for the correct annotation of segment designation, virus type, virus subtype and lineage. Open reading frames are also assigned automatically, and the segment sequence is examined for completeness. These automatic tasks are accomplished by a series of BLAST searches and alignments against specific reference datasets. If the output of this protocol differs from the original annotation, the submitter is informed about both the extent and the nature of the required change. Subsequently, the submitter is responsible for the release of the sequence to the GISAID community. After release, the data are manually inspected in a second curation phase. The metadata of each submission are checked for completeness and plausibility. Entries are checked for the correct assignment of isolate names or the origin of the sequence. If a correction is needed, the curator contacts the submitter and requests amendments. In addition, the curator monitors the quality of the submitted sequences with regard to the correct use of the IUPAC code and the absence of ambiguities. The data curator also monitors the availability of sequences and encourages those holding sequences that have long remained unreleased to share their data promptly with the GISAID community.
Conclusion: The meticulous curation of the GISAID EpiFlu™ database enhances data quality and, consequently, the scientific exploitation of the influenza sequence collection.
The GISAID team will continue to develop and improve programs and procedures for analysis and annotation, and will provide effective tailor-made solutions for the influenza scientific community.
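As an illustration of the sequence-quality check mentioned in the methods (correct use of the IUPAC code and absence of ambiguities), the following is a minimal sketch only, not GISAID's actual curation software. The function name `check_iupac`, the report fields, and the decision to treat gap characters as invalid are assumptions made for this example.

```python
# Minimal sketch of an IUPAC nucleotide-code check (hypothetical helper,
# not part of the GISAID EpiFlu software).

# Standard IUPAC nucleotide symbols: the unambiguous bases plus ambiguity codes.
UNAMBIGUOUS = set("ACGTU")
AMBIGUITY_CODES = set("RYSWKMBDHVN")
IUPAC_NUCLEOTIDES = UNAMBIGUOUS | AMBIGUITY_CODES


def check_iupac(sequence: str) -> dict:
    """Report invalid symbols and ambiguity codes in a nucleotide sequence.

    Assumption: whitespace is ignored, case is ignored, and gap
    characters such as '-' are reported as invalid.
    """
    seq = "".join(sequence.split()).upper()
    invalid = sorted({c for c in seq if c not in IUPAC_NUCLEOTIDES})
    ambiguous = sum(1 for c in seq if c in AMBIGUITY_CODES)
    return {
        "length": len(seq),
        "invalid_symbols": invalid,
        "ambiguous_count": ambiguous,
        "ambiguous_fraction": ambiguous / len(seq) if seq else 0.0,
    }


if __name__ == "__main__":
    # A curator could flag entries with invalid symbols or many ambiguities.
    print(check_iupac("ACGTNNRY"))
    print(check_iupac("acgx-"))
```

A report with a non-empty `invalid_symbols` list, or a high `ambiguous_fraction`, would be a candidate for the curator's follow-up with the submitter; any threshold for "high" would have to be chosen by the curation team.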