Logo for CRG Viral Beacon

CRG Viral Beacon - Pipeline

Enabled by data from

Summary

CRG Covid Viral Beacon is a tool for those interested in the SARS-CoV-2 variability, mainly at genomic scale, but also related changes such as at aminoacid level, motifs along with other metadata.

In this document, we describe the workflows we have followed to analyse the covid-19 genomic variants from the dataset available from different repositories.

The vast majority of available SARS-CoV2 data (ENA-Consensus) is in the form of assembled genome sequences. This is unfortunate as any variation that exists within individual samples is obliterated and converted to the most frequent base, during the assembly process. Raw sequencing reads are required to detect within-sample sequence variants.

For the purpose of CRG’s Covid-19 Viral Beacon we selected available public resources containing both raw and consensus sequence data. Our goal was to integrate covid-19 genomic data from different resources representing different geographic origins, demographics etc to avoid skewness of the data.

Datasets

For the purpose of CRG’s Covid-19 Viral Beacon we selected available public resources containing both raw and consensus sequence data (Table 1).

Resources # of runs # of samples
ENA-Consensus 11611 11611
ENA-Illumina 3288 3194
ENA-ONT 11144 11086
GISAID 90856 90856

Table 1 - Available data from the selected repositories in Covid-19 Viral Beacon at the time of data freeze, 2020-05-12 00:00.


Variant analyses

We have defined different approaches to obtain, process and analyse the data accordingly to the data type: 1) Raw reads (ENA-ONT, ENA-Illumina data) and 2) Assembled genome (ENA-Consensus data).

Variant analysis on raw reads data

The raw read data was split by sequencing technology, such as ENA-Illumina Sequencing and Oxford Nanopore Sequencing. For ENA-Illumina Sequencing we re-used the public available data processed from the Galaxy Covid-19 project. The analysis performed by Galaxy Covid-19 project fits with the Covid-19 viral beacon as it focuses on the detection of within sample sequence variants. The project also provides on a daily basis the raw reads and genome assemblies accessions, re-used by our project. More information and detailed Galaxy pipeline can be accessed at https://covid19.galaxyproject.org/.

To analyse variants from Nanopore data, we used Master of Pores (Cozzuto et al, 2020), a Nextflow pipeline for analysis of Nanopore data from direct RNA sequencing, in collaboration with the Centre for Genomic Regulation (CRG) Bioinformatics Core. The Master of Pores pipeline was custom modified for the purposes of the Viral Beacon project. Further information about the pipeline can be found in https://biocorecrg.github.io/master_of_pores/.

Variant analysis on genome assembly data

We queried ENA-Consensus for assembled genome sequences that were deposited between 1st of January 2020, and 24th of May. We further filter this dataset by human host, and complete genome sequence.

Regarding assembled genome sequences, we created a single nucleotide polymorphism (SNP) calling workflow to be applied to ENA-Consensus complete genomes data.

The pipeline first joins the SARS-CoV2 reference genome (NC_045512) with the sample's genome assembly to create a multi-fasta file containing two genomes. Then, we use Mafft (Katoh et al. 2002), global pair alignment using Needleman-Wunsch algorithm with 1000 maximum iterations. The gap values were set to default. The output aligned sequence file is then used as input for the application Snp-Sites (Page et al, 2016) to call SNPs on aligned data to get SNPs in vcf format. Finally, we annotate each SNP using snpEff tool (Cingolani, P., et. al., 2012). For the file annotation, we have used old style annotations instead of Sequence Ontology and HGVS, and we have excluded downstream, intergenic intro, upstream and 5’ prime utr and 3’ prime utr changes as default parameters for viral and bacterial genomes. Please see Table 2 below for parameters used in each tool.

Software Version Parameters Availability
Mafft v7.453 --globalpair
--maxiterate 1000
--inputorder
https://github.com/GSLBiotech/mafft
Snp-Sites v2.5.1 -v
-o
https://github.com/sanger-pathogens/snp-sites
snpEFF v4.5 -formatEff
-classic
-no-downstream
-no-intergenic
-no-intron
-no-upstream
-no-utr
https://github.com/pcingola/SnpEff

Table 2 - Genome Assembly SNP Workflow Parameters


Formatting the data files

Files downloaded from the Galaxy database although they have the “.vcf” extension, they have been post processed and some mandatory columns in vcf spec have been removed. Therefore, first we convert Galaxy files into vcf files (check vcf spec). With that purpose we added the following columns:

  • Added column with the sample name and the genotypes associated (1 in this case as we are working with viral data).
  • Add column FORMAT (add "GT" value for each variant).
  • Add the following tag in the header: ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

In order to calculate the parameters displayed in Beacon (TYPE, SVLEN, AC, AN, NS…), bcftools v1.9 was used in different steps of the analysis. Check bcftools manual for details.


Frequency calculation

‘Frequency’ of the variant is the summary frequency shown in Beacon that is calculated taking into account the number of times a particular variant appears in the sample run divided by the total number of runs in each dataset (ENA-Illumina, ENA-ONT and ENA-Consensus).


Frequency calculation per dataset

‘Frequency per dataset’ for the variant is the summary frequency shown in Beacon that is calculated by number of times a particular variant appears in the sample run divided by the total number of runs in each dataset (ENA-ONT and ENA-Consensus).

For ENA-Illumina this summary frequency is calculated differently. In ENA-Illumina data each vcf in the header provides the following information:

##INFO=<ID=AF,Number=1,Type=Float,Description="Allele Frequency">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw Depth">

This information is specific to each variant in each vcf.

An example of a vcf file:

#CHROM POS ID REF ALT QUAL FILTER INFO
NC_045512 61 . G C 104.0 PASS DP=340;AF=0.020588;SB=7;DP4=284,37,5,2
NC_045512 61 . G T 181.0 PASS DP=340;AF=0.032353;SB=2;DP4=284,37,11,0

In order to calculate the allele frequency for the whole ENA-Illumina dataset we proceed in the following way:

  • 1 - First we extract in one file all the DP and AF values for all variants in all runs.
  • 2 - For each variant we calculate the DP x AF value.
  • 3 - For each variant, the sum of all DP x AF in all runs is calculated.
  • 4 - For each variant the sum of all DP in all runs is calculated.
  • 5 - For calculating the final AF for each variant we divide the value obtained in 3 by the value obtained in 4.

Data download

All publicly available dataset used in Viral Beacon can be downloaded from here: