CRG Viral Beacon - Pipeline
CRG Covid Viral Beacon is a tool for those interested in the SARS-CoV-2 variability, mainly at genomic scale, but also related changes such as at aminoacid level, motifs along with other metadata.
In this document, we describe the workflows we have followed to analyse the covid-19 genomic variants from the dataset available from different repositories.
The vast majority of available SARS-CoV2 data (ENA-Consensus) is in the form of assembled genome sequences. This is unfortunate as any variation that exists within individual samples is obliterated and converted to the most frequent base, during the assembly process. Raw sequencing reads are required to detect within-sample sequence variants.
For the purpose of CRG’s Covid-19 Viral Beacon we selected available public resources containing both raw and consensus sequence data. Our goal was to integrate covid-19 genomic data from different resources representing different geographic origins, demographics etc to avoid skewness of the data.
For the purpose of CRG’s Covid-19 Viral Beacon we selected available public resources containing both raw and consensus sequence data (Table 1).
|Resources||# of runs||# of samples|
Table 1 - Available data from the selected repositories in Covid-19 Viral Beacon at the time of data freeze, 2021-06-19 14:12.
We have defined different approaches to obtain, process and analyse the data accordingly to the data type: 1) Raw reads (ENA-ONT, ENA-Illumina data) and 2) Assembled genome (ENA-Consensus data).
Variant analysis on raw reads data
The raw read data was split by sequencing technology, such as ENA-Illumina Sequencing and Oxford Nanopore Sequencing. For ENA-Illumina Sequencing we re-used the public available data processed from the Galaxy Covid-19 project. The analysis performed by Galaxy Covid-19 project fits with the Covid-19 viral beacon as it focuses on the detection of within sample sequence variants. The project also provides on a daily basis the raw reads and genome assemblies accessions, re-used by our project. More information and detailed Galaxy pipeline can be accessed at https://covid19.galaxyproject.org/.
To analyse variants from Nanopore data, we used Master of Pores (Cozzuto et al, 2020), a Nextflow pipeline for analysis of Nanopore data from direct RNA sequencing, in collaboration with the Centre for Genomic Regulation (CRG) Bioinformatics Core. The Master of Pores pipeline was custom modified for the purposes of the Viral Beacon project. Further information about the pipeline can be found in https://biocorecrg.github.io/master_of_pores/.
Variant analysis on genome assembly data
With the continued spread of SARS-CoV-2 virus around the world, we needed to quickly identify novel mutations in newly available SARS-CoV-2 assembled sequences data.
The ENA consensus dataset contains all assembled sequences deposited in ENA.
We further filtered the data by human host, and complete genome sequence, having a minimum sequence length of 29400, no stretches of >99 Ns in a row, 12 CDS features contained in sequence and all annotated features are complete; for example if a stretch of Ns occurs over the start codon of a protein, the CDS feature would be marked as partial and the genome would not be complete.
With the added urgency to identify further structural variation, such as insertions and deletions, for studying the molecular evolution and epidemiology of the virus we selected the python package, MicroGMT: A Mutation Tracker for SARS-CoV-2 and Other Microbial Genome Sequences, (Xing Y. et al, 2020) which takes either raw sequence reads or assembled genome sequences as input and compares against database sequences to identify and characterize indels and point mutations.
Finally, we annotated each variant using snpEff tool (Cingolani, P., et. al., 2012). We used old style annotations instead of Sequence Ontology and HGVS, and excluded downstream, intergenic intro, upstream and 5’ prime utr and 3’ prime utr changes as default parameters for viral and bacterial genomes. Please, see Table 2 below for parameters used.
Table 2 - Genome Assembly SNP Workflow Parameters
Formatting the data files
Files downloaded from the Galaxy database although they have the “.vcf” extension, they have been post processed and some mandatory columns in vcf spec have been removed. Therefore, first we convert Galaxy files into vcf files (check vcf spec). With that purpose we added the following columns:
- Added column with the sample name and the genotypes associated (1 in this case as we are working with viral data).
- Add column FORMAT (add "GT" value for each variant).
- Add the following tag in the header: ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
In order to calculate the parameters displayed in Beacon (TYPE, SVLEN, AC, AN, NS…), bcftools v1.9 was used in different steps of the analysis. Check bcftools manual for details.
‘Frequency’ of the variant is the summary frequency shown in Beacon that is calculated taking into account the number of times a particular variant appears in the sample run divided by the total number of runs in each dataset (ENA-Illumina, ENA-ONT and ENA-Consensus).
Frequency calculation per dataset
‘Frequency per dataset’ for the variant is the summary frequency shown in Beacon that is calculated by number of times a particular variant appears in the sample run divided by the total number of runs in each dataset (ENA-ONT and ENA-Consensus).
For ENA-Illumina this summary frequency is calculated differently. In ENA-Illumina data each vcf in the header provides the following information:
This information is specific to each variant in each vcf.
An example of a vcf file:
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_045512 61 . G C 104.0 PASS DP=340;AF=0.020588;SB=7;DP4=284,37,5,2
NC_045512 61 . G T 181.0 PASS DP=340;AF=0.032353;SB=2;DP4=284,37,11,0
In order to calculate the allele frequency for the whole ENA-Illumina dataset we proceed in the following way:
- 1 - First we extract in one file all the DP and AF values for all variants in all runs.
- 2 - For each variant we calculate the DP x AF value.
- 3 - For each variant, the sum of all DP x AF in all runs is calculated.
- 4 - For each variant the sum of all DP in all runs is calculated.
- 5 - For calculating the final AF for each variant we divide the value obtained in 3 by the value obtained in 4.
All publicly available dataset used in Viral Beacon can be downloaded