CRG Viral Beacon - Warnings
We have defined different filtering approaches according to the data source:
- Regarding assembled genomes from GISAID, we filtered by collection date between the 1st of January and 24th of May. We further filtered by virus name “hCoV-19/..” , host “Human”, and complete sequences, with more than 29.000 nucleotides.
- For ENA-Consensus, we obtained the sample accessions from Galaxy SARS-CoV-2 (Fasta file), then we applied the same filtering as described for GISAID assembled genome sequences data.
- Nanopore and ENA-Illumina data were filtered by collection date between the 1st of January and 24th of May. We further filtered by organism taxon id: 2697049, and host: Homo sapiens
- In their post processing process, Galaxy pipeline is including only one annotation tag for each of the variants while in some cases we have found more than one annotation for the same variant (multiple annotations in the ENA-Consensus file in the same cds and categorized as “mat_peptide”, in some cases there may be also some overlapping genes). ViralBeacon will include all annotations in the future update.
Nanopore mapping quality
We created density plots with the mapping quality for the ENA-ONT (Oxford Nanopore) dataset. The following plots show the distribution of mapping quality for the 3960 runs available at the data freeze, filtered by PASS (Figure 1) and lowQual (Figure 2). The PASS threshold value was set to MQ with the value of 12 or higher. 27,774 variants were reported with low quality, and removed from the analysis.
ENA-Illumina mapping quality
We also created a density plot with the mapping quality for the ENA-Illumina dataset (Figure 3).
During the analysis of the Nanopore (ONT) dataset, we reported the existence of insertions, deletions, single nucleotide polymorphisms and multi nucleotide polymorphism in the data.
Table 1 - Observed variants in position 28881 for Nanopore data.
We verified the same positions for the GISAID dataset in order to validate the variants.
Table 2 - Observed variants in position 28881 for GISAID data.
(Positions, etc. that should be masked by user)
These are positions Goldman et al. 2020 recommends to mask. In this Beacon dataset we are not masking these regions, however we want to report them as warnings, here are the positions that are suggested to be masked:
- Genomic positions at the alignment ends affected by low coverage and high rate of apparent sequencing/mapping errors):
- Genomic positions that appear to be highly homoplasic & with no phylogenetic signal and/or low prevalence (probably recurrent artifacts or otherwise hypermutable low-fitness sites):
- 187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700
- Homoplasic genomic positions that are exclusive to a single sequencing lab or geographic location, regardless of phylogenetic signal (probably occurring as a common source of error):
- 4050, 13402
- Genomic Positions that despite strong phylogenetic signal are also strongly homoplasic (presumptive hypermutability):
- 11083, 15324, 21575
Assemblies ambiguity in ENA-Consensus data
The consensus sequence shows which residues are conserved, and which residues are variable. A consensus is constructed from the most frequent residues at each site (alignment column), so that the total fraction of rows represented by the selected residues in that column reaches at least a specified threshold. IUPAC ambiguity codes (such as R for an A or G nucleotide) are counted as fractional support for each nucleotide in the ambiguity set (A and G, in this case), thus two rows with R are counted the same as one row with A and one row with G. When more than one nucleotide is necessary to reach the desired threshold, this is represented by the best-fit ambiguity symbol in the consensus; for protein sequences, this will always be an X. For example, assume a column contains 6 A’s, 3 G’s and 1 T. If the consensus threshold is set to 60% or below, then the consensus will be A. If the consensus threshold is set to between 60% and 90%, then the consensus will be R. If the consensus threshold is set to over 90%, then the consensus will be D (Table 3).
|M||A or C||K|
|R||A or G||Y|
|W||A or T||W|
|S||C or G||S|
|Y||C or T||R|
|K||G or T||M|
|V||A or C or G||B|
|H||A or C or T||D|
|D||A or G or T||H|
|B||C or G or T||V|
|N||G or A or T or C||N|
Table 3 - IUPAC ambiguity codes, meaning and complement.
We have observed that the genome assemblies for Sars-CoV-2 data contain a large number of ambiguities and the consensus data is reported following IUPAC code.
For the ENA consensus sequences, 57% of the variants called are ambiguous reported in following tables 4, 5 and 6.
|REF||A||C||G||T||Total # of variants|
Table 4 - ENA consensus Spectrum of Mutations values in absolute and percentage values.
Table 5 - ENA consensus sequences Ambiguous and Confident calls.
|Count||70 A>M||223 A>R||42 C>S||977 C>Y||808 G>K||1 A>V||7 A>H||4 A>D||2 C>B|
|78 C>M||171 G>R||102 G>S||355 T>Y||61 T>K||1 C>V||8 C>H||9 G>D||11 G>B|
|-||-||-||-||-||2 G>V||2 T>H||2 T>D||2 T>B|
Table 6 - ENA consensus ambiguity spectrum of mutations absolute values. 3050 variants have been found with ambiguous codes.