Summary

An outbreak of coronavirus disease 2019 (COVID-19) caused by the 2019 novel coronavirus (SARS-CoV-2) began in the city of Wuhan in China and has widely spread worldwide. Currently, it is vital to explore potential intermediate hosts of SARS-CoV-2 to control COVID-19 spread. Therefore, we reinvestigated published data from pangolin lung samples from which SARS-CoV-like CoVs were detected by Liu et al. [1]. We found genomic and evolutionary evidence of the occurrence of a SARS-CoV-2-like CoV (named Pangolin-CoV) in dead Malayan pangolins. Pangolin-CoV is 91.02% and 90.55% identical to SARS-CoV-2 and BatCoV RaTG13, respectively, at the whole-genome level. Aside from RaTG13, Pangolin-CoV is the most closely related CoV to SARS-CoV-2. The S1 protein of Pangolin-CoV is much more closely related to SARS-CoV-2 than to RaTG13. Five key amino acid residues involved in the interaction with human ACE2 are completely consistent between Pangolin-CoV and SARS-CoV-2, but four amino acid mutations are present in RaTG13. Both Pangolin-CoV and RaTG13 lost the putative furin recognition sequence motif at S1/S2 cleavage site that can be observed in the SARS-CoV-2. Conclusively, this study suggests that pangolin species are a natural reservoir of SARS-CoV-2-like CoVs.

Graphical Abstract

img1111.jpg

Keywords

pangolin SARS-CoV-2 COVID-19 origin

Results and Discussion

Similar to the case for SARS-CoV and MERS-CoV [2], the bat is still a probable species of origin for 2019 novel coronavirus (SARS-CoV-2) because SARS-CoV-2 shares 96% whole-genome identity with a bat CoV, BatCoV RaTG13, from Rhinolophus affinis from Yunnan Province [3]. However, SARS-CoV and MERS-CoV usually pass into intermediate hosts, such as civets or camels, before leaping to humans [4]. This fact indicates that SARS-CoV-2 was probably transmitted to humans by other animals. Considering that the earliest coronavirus disease 2019 (COVID-19) patient reported no exposure at the seafood market [5], it is vital to find the intermediate SARS-CoV-2 host to block interspecies transmission. On 24 October 2019, Liu and his colleagues from the Guangdong Wildlife Rescue Center of China [1] first detected the existence of a SARS-CoV-like CoV from lung samples of two dead Malayan pangolins with a frothy liquid in their lungs and pulmonary fibrosis, and this fact was discovered close to when the COVID-19 outbreak occurred. Using their published results, we showed that all virus contigs assembled from two lung samples (lung07 and lung08) exhibited low identities, ranging from 80.24% to 88.93%, with known SARSr-CoVs. Hence, we conjectured that the dead Malayan pangolins may carry a new CoV closely related to SARS-CoV-2.

Assessing the Probability of SARS-CoV-2-like CoV Presence in Pangolin Species

To confirm our assumption, we downloaded raw RNA sequencing (RNA-seq) data (SRA: PRJNA573298) for those two lung samples from the SRA and conducted consistent quality control and contaminant removal, as described by Liu’s study [1]. We found 1,882 clean reads from the lung08 sample that mapped to the SARS-CoV-2 reference genome (GenBank: MN908947) [6] and covered 76.02% of the SARS-CoV-2 genome. We performed de novo assembly of those reads and obtained 36 contigs with lengths ranging from 287 bp to 2,187 bp, with a mean length of 700 bp. Via Blast analysis against proteins from 2,845 CoV reference genomes, including RaTG13, SARS-CoV-2s, and other known CoVs, we found that 22 contigs were best matched to SARS-CoV-2s (70.6%–100% amino acid identity; average: 95.41%) and that 12 contigs matched to bat SARS-CoV-like CoV (92.7%–100% amino acid identity; average: 97.48%) (Table S1). These results indicate that the Malayan pangolin might carry a novel CoV (here named Pangolin-CoV) that is similar to SARS-CoV-2.

Draft Genome of Pangolin-CoV and Its Genomic Characteristics

Using a reference-guided scaffolding approach, we created a Pangolin-CoV draft genome (19,587 bp) based on the above 34 contigs. To reduce the effect of raw read errors on scaffolding quality, small fragments that aligned against the reference genome with a length less than 25 bp were manually discarded if they were unable to be covered by any large fragments or reference genome. Remapping 1,882 reads against the draft genome resulted in 99.99% genome coverage (coverage depth range: 1X–47X) (Figure 1A). The mean coverage depth was 7.71X across the whole genome, which was two times higher than the lowest common 3X read coverage depth for SNP calling based on low-coverage sequencing in the 1000 Genomes Project pilot phase [7]. Similar coverage levels are also sufficient to detect rare or low-abundance microbial species from metagenomic datasets [8], indicating that our assembled Pangolin-CoV draft genome is reliable for further analyses. Based on Simplot analysis [9], Pangolin-CoV showed high overall genome sequence identity to RaTG13 (90.55%) and SARS-CoV-2 (91.02%) throughout the genome (Figure 1B), although there was a higher identity (96.2%) between SARS-CoV-2 and RaTG13 [3]. Other SARS-CoV-like CoVs similar to Pangolin-CoV were bat SARSr-CoV ZXC21 (85.65%) and bat SARSr-CoV ZC45 (85.01%). While this manuscript was under review, two similar preprint studies found that CoVs in pangolins shared 90.3% [10] and 92.4% [11] DNA identity with SARS-CoV-2, approximating the 91.02% identity to SARS-CoV-2 observed here and supporting our findings. Taken together, these results indicate that Pangolin-CoV might be the common origin of SARS-CoV-2 and RaTG13.

img222.jpg

Figure 1. Genome-Related Analysis

(A) Sequence depth of reads remapped to Pangolin-CoV.

(B) Similarity plot based on the full-length genome sequence of Pangolin-CoV. Full-length genome sequences of SARS-CoV-2 (Beta-CoV/Wuhan-Hu-1), BatCoV RaTG13, bat SARSr-CoV 21, bat SARSr-CoV45, bat SARSr-CoV WIV1, and SARS-CoV BJ01 were used as reference sequences.

(C) Comparison of common genome organization similarity among SARS-CoV-2, Pangolin-CoV, and BatCoV RaTG13.

Related to Table S2.

The Pangolin-CoV genome organization was characterized by sequence alignment against SARS-CoV-2 (GenBank: MN908947) and RaTG13. The Pangolin-CoV genome consists of six major open reading frames (ORFs) common to CoVs and four other accessory genes (Figure 1C; Table S2). Further analysis indicated that Pangolin-CoV genes aligned to SARS-CoV-2 genes with coverage ranging from 45.8% to 100% (average coverage 76.9%). Pangolin-CoV genes shared high average nucleotide and amino acid identity with both SARS-CoV-2 (GenBank: MN908947) (93.2% nucleotide/94.1% amino acid identity) and RaTG13 (92.8% nucleotide/93.5% amino acid identity) genes (Figure 1C; Table S2). Surprisingly, some Pangolin-CoV genes showed higher amino acid sequence identity to SARS-CoV-2 genes than to RaTG13 genes, including orf1b (73.4%/72.8%), the spike (S) protein (97.5%/95.4%), orf7a (96.9%/93.6%), and orf10 (97.3%/94.6%). The high S protein amino acid identity implies functional similarity between Pangolin-CoV and SARS-CoV-2.

Phylogenetic Relationships among Pangolin-CoV, RaTG13, and SARS-CoV-2

To determine the evolutionary relationships among Pangolin-CoV, SARS-CoV-2, and previously identified CoVs, we estimated phylogenetic trees based on the nucleotide sequences of the whole-genome sequence, RNA-dependent RNA polymerase gene (RdRp), non-structural protein genes ORF1a and ORF1b, and main structural proteins encoded by the S and M genes. In all phylogenies, Pangolin-CoV, RaTG13, and SARS-CoV-2 were clustered into a well-supported group, here named the “SARS-CoV-2 group” (Figures 2S1, and S2). This group represents a novel Betacoronavirus group. Within this group, RaTG13 and SARS-CoV-2 were grouped together, and Pangolin-CoV was their closest common ancestor. However, whether the basal position of the SARS-CoV-2 group is SARSr-CoV ZXC21 and/or SARSr-CoV ZC45 is still under debate. Such debate also occurred in both the Wu et al. [6] and Zhou et al. [3] studies. A possible explanation is a past history of recombination in the Betacoronavirus group [6]. It is noteworthy that the discovered evolutionary relationships of CoVs shown by the whole genome, RdRp gene, and S gene were highly consistent with those exhibited by complete genome information in the Zhou et al. study [3]. This correspondence indicates that our Pangolin-CoV draft genome has enough genomic information to trace the true evolutionary position of Pangolin-CoV in CoVs.

img444.jpg

Figure 2. Phylogenetic Relationship of CoVs Based on the Whole Genome and RdRp Gene Nucleotide Sequences

Red text denotes the Malayan Pangolin-CoV. Pink text denotes SARS-CoV-2. Green text denotes a bat CoV with 96% similarity at the genome level to SARS-CoV-2. Blue text denotes the reference CoVs used in Figure 1B. Detailed information can be found in the STAR Methods. Related to Figures S1–S3.

Dualism of the S Protein of Pangolin-CoV

The CoV S protein consists of two subunits (S1 and S2), mediates infection of receptor-expressing host cells, and is a critical target for antiviral neutralizing antibodies [12]. S1 contains a receptor-binding domain (RBD) that consists of an approximately 193 amino acid fragment, which is responsible for recognizing and binding the cell surface receptor [1314]. Zhou et al. experimentally confirmed that SARS-CoV-2 is able to use human, Chinese horseshoe bat, civet, and pig ACE2 proteins as an entry receptor in ACE2-expressing cells [3], suggesting that the RBD of SARS-CoV-2 mediates infection in humans and other animals. To gain sequence-level insight into the pathogenic potential of Pangolin-CoV, we first investigated the amino acid variation pattern of the S1 proteins from Pangolin-CoV, SARS-CoV-2, RaTG13, and other representative SARS/SARSr-CoVs. The amino acid phylogenetic tree showed that the S1 protein of Pangolin-CoV is more closely related to that of 2019-CoV than to that of RaTG13. Within the RBD, we further found that Pangolin-CoV and SARS-CoV-2 were highly conserved, with only one amino acid change (500H/500Q) (Figure 3), which is not one of the five key residues involved in the interaction with human ACE2 [314]. These results indicate that Pangolin-CoV could have pathogenic potential similar to that of SARS-CoV-2. In contrast, RaTG13 has changes in 17 amino acid residues, 4 of which are among the key amino acid residues (Figure 3). There are evidences suggesting that the change of 472L (SARS-CoV) to 486F (SARS-CoV-2) (corresponding to the second key amino acid residue change in Figure 3) may make stronger van der Waals contact with M82 (ACE2) [15]. Besides, the major substitution of 404V in the SARS-CoV-RBD with 417K in the SARS-CoV-2-RBD (see 420 alignment position in Figure 3 and without amino acid change between the SARS-CoV-2 and RaTG13) may result in tighter association because of the salt bridge formation between 417K and 30D of ACE2 [15]. Nevertheless, further investigation is still needed about whether those mutations affect the affinity for ACE2. Whether the Pangolin-CoV or RaTG13 are potential infectious agents to humans remains to be determined.

img555.jpg

Figure 3. Amino Acid Sequence Alignment of the S1 Protein and Its Phylogeny

The receptor-binding motif of SARS-CoV and the homologous region of other CoVs are indicated by the gray box. The key amino acid residues involved in the interaction with human ACE2 are marked with the orange box. Bat SARS-CoV-like CoVs had been reported to not use ACE2 and have amino acid deletions at two motifs marked by the yellow box. Detailed information can be found in the STAR Methods.

The S1/S2 cleavage site in the S protein is also an important determinant of the transmissibility and pathogenicity of SARS-CoV/SARS-CoVr viruses [16]. The trimetric S protein is processed at the S1/S2 cleavage site by host cell proteases during infection. Following cleavage, also known as priming, the protein is divided into an N-terminal S1-ectodomain that recognizes a cognate cell surface receptor and a C-terminal S2-membrane anchored protein that drives fusion of the viral envelope with a cellular membrane. We found that the SARS-CoV-2 S protein contains a putative furin recognition motif (PRRARSV) (Figure 4) similar to that of MERS-CoV, which has a PRSVRSV motif that is likely cleaved by furin [1617] during virus egress. Conversely, the furin sequence motif at the S1/S2 site is missing in the S protein of Pangolin-CoV and all other SARS/SARSr-CoVs. This difference indicates the SARS-CoV-2 might gain a distinct mechanism to promote its entry into host cells [18]. Interestingly, aside from MERS-CoV, similar sequence patterns to the SARS-CoV-2 were also presented in some members of Alphacoronavirus, Betacoronavirus, and Gammcoronavirus [19], raising an interesting question regarding whether this furin sequence motif in SARS-CoV-2 might be derived from those existing S proteins of other coronaviruses or alternatively if the SARS-CoV-2 might be the recombinant of Pangolin-CoV or RaTG13 and other coronaviruses with a similar furin recognition motif in the unknown intermediate host.

img666.jpg

Figure 4. CoV S Protein S1/S2 Cleavage Sites

Four amino acid insertions (SPRRs) unique to SARS-CoV-2 are marked in yellow. Conserved S1/S2 cleavage sites are marked in green.

Amino Acid Variations in the Nucleocapsid (N) Protein for Potential Diagnosis

The N protein is the most abundant protein in CoVs. The N protein is a highly immunogenic phosphoprotein, and it is normally very conserved. The CoV N protein is often used as a marker in diagnostic assays. To gain further insight into the diagnostic potential of Pangolin-CoV, we investigated the amino acid variation pattern of the N proteins from Pangolin-CoV, SARS-CoV-2, RaTG13, and other representative SARS-CoVs. Phylogenetic analysis based on the N protein supported the classification of Pangolin-CoV as a sister taxon of SARS-CoV-2 and RaTG13 (Figure S3). We further found seven amino acid mutations that differentiated our defined “SAR-CoV-2 group” CoVs (12N, 26 G, 27S, 104D, 218A, 335T, 346N, and 350Q) from other known SARS-CoVs (12S, 26D, 27N, 104E, 218T, 335H, 346Q, and 350N). Two amino acid sites (38P and 268Q) are shared by Pangolin-CoV, RaTG13, and SARS-CoVs, which are mutated to 38S and 268A in SARS-CoV-2. Only one amino acid residue shared by Pangolin-CoV and other SARS-CoVs (129E) is consistently different in both SARS-CoV-2 and RaTG13 (129D). The observed amino acid changes in the N protein would be useful for developing antigens with improved sensitivity for SARS-CoV-2 serological detection.

Conclusion

Based on published metagenomic data, this study provides the first report on a potential closely related kin (Pangolin-CoV) of SARS-CoV-2, which was discovered from dead Malayan pangolins after extensive rescue efforts. Aside from RaTG13, the Pangolin-CoV is the CoV most closely related to SARS-CoV-2. Due to unavailability of the original sample, we did not perform further experiments to confirm our findings, including PCR validation, serological detection, or even isolation of the virus particles. Our discovered Pangolin-CoV genome showed 91.02% nucleotide identity with the SARS-CoV-2 genome. However, whether pangolin species are good candidates for SARS-CoV-2 origin is still under debate. Considering the wide spread of SARSr-CoVs in natural reservoirs, such as bats, camels, and pangolins, our findings would be meaningful for finding novel intermediate SARS-CoV-2 hosts to block interspecies transmission..