CRG Viral Beacon - Pipeline
CRG Covid Viral Beacon is a tool for those interested in the SARS-CoV-2 variability, mainly at genomic scale, but also related changes such as at aminoacid level, motifs along with other metadata.
In this document, we describe the workflows we have followed to analyse the covid-19 genomic variants from the dataset available from different repositories.
The vast majority of available SARS-CoV2 data (GISAID, ENA-Consensus) is in the form of assembled genome sequences. This is unfortunate as any variation that exists within individual samples is obliterated and converted to the most frequent base, during the assembly process. Raw sequencing reads are required to detect within-sample sequence variants.
For the purpose of CRG’s Covid-19 Viral Beacon we selected available public resources containing both raw and consensus sequence data. Our goal was to integrate covid-19 genomic data from different resources representing different geographic origins, demographics etc to avoid skewness of the data. At the time of data freeze, 24th of May 2020, 3960 raw read files from Oxford Nanopore Technologies (ENA-ONT) and 1497 raw reads files from ENA-Illumina were available from ENA-Consensus.Along with that, 31,885 and 3245 complete genome assemblies were available from Global Initiative on Sharing All Influenza Data (GISAID) and ENA-Consensus respectively (Table 1).
|Resources||# of runs||# of samples|
Table 1 - Available data from the selected repositories in covid-19 Viral Beacon at the time of data freeze, 24th of May, 2020
GISAID data provided on this website is subject to GISAID’s Terms and Conditions.
We have defined different approaches to obtain, process and analyse the data accordingly to the data type: 1) Raw reads (ENA-ONT, ENA-Illumina data) and 2) Assembled genome (ENA-Consensus and GISAID data).
Variant analysis on raw reads data
The raw read data was split by sequencing technology, such as ENA-Illumina Sequencing and Oxford Nanopore Sequencing. For ENA-Illumina Sequencing we re-used the public available data processed from the Galaxy Covid-19 project. The analysis performed by Galaxy Covid-19 project fits with the Covid-19 viral beacon as it focuses on the detection of within sample sequence variants. The project also provides on a daily basis the raw reads and genome assemblies accessions, re-used by our project. More information and detailed Galaxy pipeline can be accessed at https://covid19.galaxyproject.org/.
To analyse variants from Nanopore data, we used Master of Pores (Cozzuto et al, 2020), a Nextflow pipeline for analysis of Nanopore data from direct RNA sequencing, in collaboration with the Centre for Genomic Regulation (CRG) Bioinformatics Core. The Master of Pores pipeline was custom modified for the purposes of the Viral Beacon project. Further information about the pipeline can be found in https://biocorecrg.github.io/master_of_pores/.
Variant analysis on genome assembly data
We queried ENA-Consensus and GISAID for assembled genome sequences that were deposited between 1st of January 2020, and 24th of May. We further filter this dataset by human host, and complete genome sequence.
Regarding assembled genome sequences, we created a single nucleotide polymorphism (SNP) calling workflow to be applied to ENA-Consensus and GISAID complete genomes data.
The pipeline first joins the SARS-CoV2 reference genome (NC_045512) with the sample's genome assembly to create a multi-fasta file containing two genomes. Then, we use Mafft (Katoh et al. 2002), global pair alignment using Needleman-Wunsch algorithm with 1000 maximum iterations. The gap values were set to default. The output aligned sequence file is then used as input for the application Snp-Sites (Page et al, 2016) to call SNPs on aligned data to get SNPs in vcf format. Finally, we annotate each SNP using snpEff tool (Cingolani, P., et. al., 2012). For the file annotation, we have used old style annotations instead of Sequence Ontology and HGVS, and we have excluded downstream, intergenic intro, upstream and 5’ prime utr and 3’ prime utr changes as default parameters for viral and bacterial genomes. Please see Table 2 below for parameters used in each tool.
Table 2 - Genome Assembly SNP Workflow Parameters
Formatting the data files
Files downloaded from the Galaxy database although they have the “.vcf” extension, they have been post processed and some mandatory columns in vcf spec have been removed. Therefore, first we convert Galaxy files into vcf files (check vcf spec). With that purpose we added the following columns:
- Added column with the sample name and the genotypes associated (1 in this case as we are working with viral data).
- Add column FORMAT (add "GT" value for each variant).
- Add the following tag in the header: ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
In order to calculate the parameters displayed in Beacon (TYPE, SVLEN, AC, AN, NS…), bcftools v1.9 was used in different steps of the analysis. Check bcftools manual for details.
‘Frequency’ of the variant is the summary frequency shown in Beacon that is calculated taking into account the number of times a particular variant appears in the sample run divided by the total number of runs in each dataset (ENA-Illumina, ENA-ONT, ENA-Consensus and GISAID).
Frequency calculation per dataset
‘Frequency per dataset’ for the variant is the summary frequency shown in Beacon that is calculated by number of times a particular variant appears in the sample run divided by the total number of runs in each dataset (ENA-ONT, ENA-Consensus and GISAID).
For ENA-Illumina this summary frequency is calculated differently. In ENA-Illumina data each vcf in the header provides the following information:
This information is specific to each variant in each vcf.
An example of a vcf file:
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_045512 61 . G C 104.0 PASS DP=340;AF=0.020588;SB=7;DP4=284,37,5,2
NC_045512 61 . G T 181.0 PASS DP=340;AF=0.032353;SB=2;DP4=284,37,11,0
In order to calculate the allele frequency for the whole ENA-Illumina dataset we proceed in the following way:
- 1 - First we extract in one file all the DP and AF values for all variants in all runs.
- 2 - For each variant we calculate the DP x AF value.
- 3 - For each variant, the sum of all DP x AF in all runs is calculated.
- 4 - For each variant the sum of all DP in all runs is calculated.
- 5 - For calculating the final AF for each variant we divide the value obtained in 3 by the value obtained in 4.
All publicly available dataset used in Viral Beacon can be downloaded from here:
Note: We do not have permission to share GISAID data, please contact them directly.