Logo for CRG Viral Beacon

CRG Viral Beacon - Warnings

Enabled by data from

Investigating artefacts

Data filtering

We have defined different filtering approaches according to the data source:

  • Regarding assembled genomes from ENA-Consensus, we filtered by collection date between the 1st of January and last data freeze (2020-05-12 00:00). We further filtered by virus name “hCoV-19/..” , host “Human”, and complete sequences, with more than 29.000 nucleotides.
  • Nanopore and ENA-Illumina data were filtered by collection date between the 1st of January and 24th of May. We further filtered by organism taxon id: 2697049, and host: Homo sapiens

Annotation anomalies

  • In their post processing process, Galaxy pipeline is including only one annotation tag for each of the variants while in some cases we have found more than one annotation for the same variant (multiple annotations in the ENA-Consensus file in the same cds and categorized as “mat_peptide”, in some cases there may be also some overlapping genes). ViralBeacon will include all annotations in the future update.

Mapping quality

Nanopore mapping quality

We created density plots with the mapping quality for the ENA-ONT (Oxford Nanopore) dataset. The following plots show the distribution of mapping quality for the 3960 runs available at the data freeze, filtered by PASS (Figure 1) and lowQual (Figure 2). The PASS threshold value was set to MQ with the value of 12 or higher. 27,774 variants were reported with low quality, and removed from the analysis.

Figure 1: Mapping quality for ENA-ONT Data, filtered by PASS. Density plot of the mapping quality for the individual Oxford Nanopore Technologies dataset files. The plot shows the distribution of mapping quality for the 3960 Runs for ENA-ONT.
Figure 2: Mapping quality density plot for ENA-ONT data, filtered by lowqual variants. Density plot of the average mapping quality for the individual Oxford Nanopore Technologies dataset files. The plot shows the distribution of average mapping quality for the 3960 runs for ENA-ONT.

ENA-Illumina mapping quality

We also created a density plot with the mapping quality for the ENA-Illumina dataset (Figure 3).

Figure 3: Mapping quality density plot for ENA-Illumina data. Density plot of the average mapping quality for the individual ENA-Illumina dataset files. The plot shows the distribution of average mapping quality for the 1497 runs for ENA-Illumina.

Masked regions

(Positions, etc. that should be masked by user)

These are positions Goldman et al. 2020 recommends to mask. In this Beacon dataset we are not masking these regions, however we want to report them as warnings, here are the positions that are suggested to be masked:

  • Genomic positions at the alignment ends affected by low coverage and high rate of apparent sequencing/mapping errors):
    • 1-55
    • 29804–29903
  • Genomic positions that appear to be highly homoplasic & with no phylogenetic signal and/or low prevalence (probably recurrent artifacts or otherwise hypermutable low-fitness sites):
    • 187, 1059, 2094, 3037, 3130, 6990, 8022, 10323, 10741, 11074, 13408, 14786, 19684, 20148, 21137, 24034, 24378, 25563, 26144, 26461, 26681, 28077, 28826, 28854, 29700
  • Homoplasic genomic positions that are exclusive to a single sequencing lab or geographic location, regardless of phylogenetic signal (probably occurring as a common source of error):
    • 4050, 13402
  • Genomic Positions that despite strong phylogenetic signal are also strongly homoplasic (presumptive hypermutability):
    • 11083, 15324, 21575

Assemblies ambiguity in ENA-Consensus data

The consensus sequence shows which residues are conserved, and which residues are variable. A consensus is constructed from the most frequent residues at each site (alignment column), so that the total fraction of rows represented by the selected residues in that column reaches at least a specified threshold. IUPAC ambiguity codes (such as R for an A or G nucleotide) are counted as fractional support for each nucleotide in the ambiguity set (A and G, in this case), thus two rows with R are counted the same as one row with A and one row with G. When more than one nucleotide is necessary to reach the desired threshold, this is represented by the best-fit ambiguity symbol in the consensus; for protein sequences, this will always be an X. For example, assume a column contains 6 A’s, 3 G’s and 1 T. If the consensus threshold is set to 60% or below, then the consensus will be A. If the consensus threshold is set to between 60% and 90%, then the consensus will be R. If the consensus threshold is set to over 90%, then the consensus will be D (Table 3).

IUPAC Code Meaning Complement
AAT
CCG
GGC
T/UTA
MA or CK
RA or GY
WA or TW
SC or GS
YC or TR
KG or TM
VA or C or GB
HA or C or TD
DA or G or TH
BC or G or TV
NG or A or T or CN

Table 3 - IUPAC ambiguity codes, meaning and complement.


We have observed that the genome assemblies for Sars-CoV-2 data contain a large number of ambiguities and the consensus data is reported following IUPAC code.

For the ENA consensus sequences, 57% of the variants called are ambiguous reported in following tables 4, 5 and 6.

ALT
REFACGTTotal # of variants
A-60271110441
C 59-34832925
G19740-352589
T6423446-344
32033435112942299

Table 4 - ENA consensus Spectrum of Mutations values in absolute and percentage values.


ambiguous305057.02000374
called229942.97999626
total5349-

Table 5 - ENA consensus sequences Ambiguous and Confident calls.


Ambiguity Code MRSYKVHDB
Count 70 A>M223 A>R42 C>S977 C>Y808 G>K1 A>V7 A>H4 A>D2 C>B
78 C>M171 G>R102 G>S355 T>Y61 T>K1 C>V8 C>H9 G>D11 G>B
-----2 G>V2 T>H2 T>D2 T>B

Table 6 - ENA consensus ambiguity spectrum of mutations absolute values. 3050 variants have been found with ambiguous codes.