CRG Viral Beacon - Info
Summary
CRG Covid Viral Beacon is a tool for those interested in SARS-CoV-2 variability, mainly at the genomic scale, but also in related changes at the amino acid level, along with other metadata. It has been developed as a branch of the GA4GH Beacon standard, as a special use case for testing and demonstrating new features in Beacon v2 (and, implicitly, those of Beacon v1).
As a use case, SARS-CoV-2 gave us the opportunity to work with a small genome, while its importance and urgency pushed us to focus on the utility of the features.
It is well suited for all the features that Beacon v2 currently has: diversity of queries, filters, additional schemas for core entities, handover to other solutions or extended data, etc.
Although a virus genome is not limited by human-related data constraints, it is not exempt from licensing diversity, which allows additional experimentation with usage limitations.
It has been organized as an iterative project, starting with a quick solution for determining requirements and usage and gradually shifting to an orthodox Beacon v2 API interface. For example, in the first iteration data is presented “as is”, without any harmonization or pre-processing on our side.
Although driven by the EGA Team at CRG, it should be considered a joint effort between our institution, our partners and our funders.
Data in COVID Viral Beacon
We would like to thank:
- ENA for providing us with SARS-CoV-2 data.
- Galaxy project and CRG's Biocore team for their collaboration in data analysis for ENA-Illumina data and ENA-ONT data, respectively.
- More information about the variants and data used in Viral Beacon can be found on the Pipeline page.
- Summary statistics for the data used in Viral Beacon show the distribution of platforms, variants and metadata.
- All publicly available datasets used in Viral Beacon are available for download.
Query options in Beacon
Query by variant
This query uses a single genomic position with a reference and an alternate allele, for example 1105A>T, where 1105 is the numeric position in the genome, A is the reference allele and T is the alternate allele. Beacon returns only those variants in the database that have an A > T change at that position.
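To make the notation concrete, here is a minimal sketch of how a query string such as 1105A>T could be split into its three parts. The function name, the regular expression and the restriction to the letters A, C, G and T are assumptions made for illustration; this is not the parser used by Viral Beacon.

```python
import re
from typing import NamedTuple


class VariantQuery(NamedTuple):
    position: int  # numeric position in the genome
    ref: str       # reference allele
    alt: str       # alternate allele


# Assumed pattern: digits, reference base(s), ">", alternate base(s).
_VARIANT_RE = re.compile(r"^([0-9]+)([ACGT]+)>([ACGT]+)$")


def parse_variant_query(text: str) -> VariantQuery:
    """Split a string like '1105A>T' into position, ref and alt (hypothetical helper)."""
    match = _VARIANT_RE.match(text.strip().upper())
    if match is None:
        raise ValueError(f"not a valid variant query: {text!r}")
    position, ref, alt = match.groups()
    return VariantQuery(int(position), ref, alt)


print(parse_variant_query("1105A>T"))
# VariantQuery(position=1105, ref='A', alt='T')
```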
Query by region
This query uses a genomic range (start and end positions) to extract all variants found within that range. Only numeric input is accepted.
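The behaviour of a region query can be sketched as a validation step followed by a range filter. The helper names, the in-memory list of variants and the choice of inclusive bounds are assumptions for illustration only; the actual service may treat range boundaries differently.

```python
import re
from typing import List, Tuple

Variant = Tuple[int, str, str]  # (position, ref, alt)


def parse_region(start: str, end: str) -> Tuple[int, int]:
    """Accept only numeric input and return the range as integers (hypothetical helper)."""
    if not (re.fullmatch(r"[0-9]+", start) and re.fullmatch(r"[0-9]+", end)):
        raise ValueError("start and end must be numeric")
    start_pos, end_pos = int(start), int(end)
    if start_pos > end_pos:
        raise ValueError("start must not be greater than end")
    return start_pos, end_pos


def variants_in_region(variants: List[Variant], start: int, end: int) -> List[Variant]:
    """Keep variants whose position lies in [start, end] (inclusive bounds assumed)."""
    return [v for v in variants if start <= v[0] <= end]


# Made-up variants, used only to show the filtering step.
example = [(50, "A", "T"), (150, "C", "G"), (200, "G", "A"), (250, "T", "C")]
start, end = parse_region("100", "200")
print(variants_in_region(example, start, end))
# [(150, 'C', 'G'), (200, 'G', 'A')]
```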
Query by region name/feature
This query uses names or aliases (such as gene_id/protein_id) to refer to genomic regions, based on the genome annotation of the SARS-CoV-2 RefSeq assembly (NC_045512.2).
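Conceptually, a query by name resolves the name or alias to a coordinate range and then behaves like a region query. The sketch below assumes a small hand-written lookup table; the two entries and their coordinates reflect our reading of the NC_045512.2 annotation and should be checked against the RefSeq record before use. The table layout and function name are hypothetical.

```python
from typing import Dict, Tuple

# Illustrative subset only; coordinates are assumptions to verify against NC_045512.2.
REGIONS: Dict[str, Tuple[int, int]] = {
    "ORF1ab": (266, 21555),
    "S": (21563, 25384),
}

# Aliases (here, locus IDs) pointing at the canonical names above.
ALIASES: Dict[str, str] = {
    "GU280_gp01": "ORF1ab",
    "GU280_gp02": "S",
}


def resolve_region(name: str) -> Tuple[int, int]:
    """Translate a region name or alias into (start, end) coordinates."""
    key = name.strip()
    key = ALIASES.get(key, key)
    if key not in REGIONS:
        raise KeyError(f"unknown region name: {name!r}")
    return REGIONS[key]


print(resolve_region("GU280_gp01"))
# (266, 21555)
```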
Query by motif
This query takes one or more short k-mer sequences and searches for them in the SARS-CoV-2 reference sequence using the FIMO tool from the MEME Suite. The reference genome sequence on the server is >NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1 and can be downloaded from here: NC_045512.
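As a rough intuition for what a k-mer search returns, the sketch below finds exact occurrences of a k-mer in a sequence using plain string matching. This is a deliberately simplified stand-in and not FIMO, which scores matches against motif models; the function name, the 1-based reporting of positions and the short toy sequence are all assumptions.

```python
from typing import List


def find_kmer(sequence: str, kmer: str) -> List[int]:
    """Return 1-based start positions of exact (possibly overlapping) k-mer matches."""
    if not kmer:
        raise ValueError("k-mer must not be empty")
    sequence, kmer = sequence.upper(), kmer.upper()
    hits = []
    start = sequence.find(kmer)
    while start != -1:
        hits.append(start + 1)                   # convert 0-based index to 1-based
        start = sequence.find(kmer, start + 1)   # step by one to allow overlaps
    return hits


toy_sequence = "ATTAAAGGTTTATACCTTCC"  # short toy sequence, for illustration only
print(find_kmer(toy_sequence, "TT"))
# [2, 9, 10, 17]
```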
Query by amino acid
This query finds amino acid mutations. It takes the name of a CDS region (selected through autocomplete only) and the exact amino acid mutation being searched for (e.g. 'F2L', where Phenylalanine changes into Leucine at position 2). Beacon searches for exact matches of the amino acid mutation at the given position for the given ID. If there are overlapping transcripts, more than one annotation will be reported.
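The mutation notation can be unpacked as shown in this minimal sketch, which reads a string such as F2L as reference residue, position and new residue. The function name and the restriction to the 20 standard one-letter codes are assumptions for illustration; the service itself may accept other symbols (for example for stop codons).

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}

_MUTATION_RE = re.compile(r"^([A-Z])([0-9]+)([A-Z])$")


def describe_mutation(mutation: str) -> str:
    """Turn a string like 'F2L' into a readable description (hypothetical helper)."""
    match = _MUTATION_RE.match(mutation.strip().upper())
    if match is None:
        raise ValueError(f"not a valid amino acid mutation: {mutation!r}")
    ref, position, alt = match.groups()
    if ref not in AMINO_ACIDS or alt not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {mutation!r}")
    return f"{AMINO_ACIDS[ref]} at position {int(position)} changes to {AMINO_ACIDS[alt]}"


print(describe_mutation("F2L"))
# Phenylalanine at position 2 changes to Leucine
```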
Metadata information
For now, Covid Viral Beacon shows all the metadata available for the variants; in the future some columns might be deprecated or grouped together. The following information is currently available in the database:
Variant basic
genomicRegion Classification(s) of the variant according to the genomic region affected (e.g. intergenic, 5UTR, 3UTR, coding)
effect Predicted effect of variant at nucleotide level (e.g. STOP GAINED, NON_SYNONYMOUS_CODING)
functionalClass Predicted effect of variant at protein level for protein-affecting variants (e.g. nonsense, missense)
locusName Name of genomic region/locus affected by the variant (e.g. ORF1ab)
locusId ID of locus (loci) affected by variant (e.g. GU280_gp01)
aminoacidChange Change(s) at amino acid level for protein-affecting variants (e.g. E2904G)
annotationToolVersion Tool used for annotation and prediction of variant effects (e.g. SnpEffVersion=4.3t (build 2017-11-24 1018))
Sample metadata
frequency Proportion of samples in which the variant occurs, relative to the total number of samples in the database (ranges from 0 to 1)
caller Variant calling software/pipeline (e.g. GATKvxxx)
runPlatform Sequencing platform type (e.g. Illumina MiSeq)
runId Run accession ID in original database (ENA-Consensus) (e.g. SRR10903401)
sampleId Biosample accession ID in original database (ENA-Consensus) (e.g. SRS6007144)
sampleType Anatomical origin of biosample (e.g. Bronchoalveolar lavage fluid)
collectionDate Date of biosample collection
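The frequency field listed above is a simple proportion. As a minimal sketch, assuming it is computed as the number of samples carrying the variant divided by the total number of samples in the database (the function name and the input counts are illustrative, not taken from the service):

```python
def variant_frequency(samples_with_variant: int, total_samples: int) -> float:
    """Proportion of samples carrying a variant, between 0 and 1."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    if not 0 <= samples_with_variant <= total_samples:
        raise ValueError("samples_with_variant must be between 0 and total_samples")
    return samples_with_variant / total_samples


# Made-up counts: a variant seen in 3 of 12 samples.
print(variant_frequency(3, 12))
# 0.25
```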
Host information
hostAge Individual age at the time of biosample collection (e.g. 82)
hostSex Sex of individual (e.g. female)
geoOrigin Individual's country or region of origin (birthplace or place of residence, regardless of ethnic origin)
disease Disease(s) diagnosed in the individual
diseaseOutcome Outcome of disease in individual