Read this report on the emergent Coronavirus lineage known as B.1.1.7 https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563
It was quickly identified as a “variant of concern” given the large number of mutations with phenotypic implications.
A major concern is that the Spike D614G amino acid change (caused by an A-to-G nucleotide mutation at position 23,403 in the Wuhan reference strain) is spreading quickly.
We want to compare the genome of this newer viral strain with that of the refrence sequence (AKA, the type strain) to confirm what’s being said in recent reports. Your task is to retrieve amino acid and DNA sequences for the sars-cov-2 surface glycoprotein. We need you to provide us with separate files of these sequences for the original type (reference) strain, along with all strains deposited into NCBI since Jan 8th, 2021. These sequence files need to have a strong chain of custody, so document all steps, including the data retrieval!
Fortunately, the NCBI has made a special piece of software available for accessing Sars-Cov-2 genomic information.
Get to work!
Here, you will be using a couple command-line tools from NCBI to download sars-cov-2 genomic data
Everything you do must be documented so that it is completely reproducible. That includes version numbers of all software, etc..
You will be downloading the complete genome of the Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1. This is a fully annotated complete genome, and you will be downloading not only the DNA sequence, but associated metadata and protein info and documenting every step.
This is a special NCBI tool for easy downloading of genes and genomes. Download it into your Case Study 2 main directory. You can run it from within there.
Getting it installed and working properly is the first task:
Download and install the version for your own computer.
If on Linux or Mac, be sure to set the executable permission with:
chmod +x datasets
Once you’ve done that
Find the “help” documentation in the program and then find the help documentation for the download command within the program
Once you have it working, be sure to document the version number in a documentation.txt file associated with this project directory.
Also be sure to document the date you are doing this!!
Search NCBI for: “sars-cov-2”
Browse the info from NCBI and see what is available
Find the accession number for the reference sars-cov-2 genome, known as the viral “RefSeq”
Use the datasets program to download just the sars-cov-2 refseq, along with all other data available for that refseq.
Useful file info for NCBI genomes:
fna: FastA format file containing Nucleotide sequence (i.e. DNA).
gbff: GenBank genome Feature File containing genome sequence and annotation.
gff: General Feature Format containing genomic regions (i.e. genes, transcripts).
faa: FastA format file containing Amino acid sequence (i.e. protein, peptide).
gpff: GenBank Protein Feature File containing protein sequence and annotation.
cds: A sequence of nucleotides that corresponds with the sequence of amino acids in a protein.
pdb: Text-based description of 3-dimensional structures of molecules.
Find the amino acid and nucleotide sequences for “surface glycoprotein” in sars-cov-2
You’ll need to browse through these files, and then use a combination of tools like seqtk and grep to pull it out
Now, that was the type collection. Let’s also download some more recent strains so we can compare the nucleotide and amino acid sequences for the “surface glycoprotein” between the variant strains.
We want to look for an A-to-G nucleotide mutation at position 23,403 in the Wuhan reference strain. That’s a job better suited for a different language like R, Python or Perl, so we just want to get the data we need into files.
Let’s use the Datasets tool again, but this time, instead of using the –refseq flag, let’s use some of the other built-in flags to select strains that were released after 01/01/2023. Be sure to save these as a new zip file name so you don’t overwrite your reference genome. Maybe do this in a subfolder (any commands to make that subfolder should be documented as well!)
Again, find the amino acid and nucleotide sequences for “surface glycoprotein” in these new strains you’ve downloaded.
The main difference is that each of these files will have data from multiple isolate genomes.
The amino acid and nucleotide sequences for each dataset should be in their own files. So 4 files in total:
newer.faa #amino-acid seqs of surface glycoprotein from all strains since 01/08/2020
newer.fna #nucleotide seqs of surface glycoprotein from all strains since 01/08/2020
refseq.faa #nucleotide seqs of surface glycoprotein from reference genome
refseq.fna #nucleotide seqs of surface glycoprotein from reference genome
Bundle those 4 files, along with documentation.txt, into an archive called “sars_cov2_Surface_Glycoprotein_Sequences.zip” (documentation.txt should include the datasets software version number, along with ALL the code you used to generate these 4 files and the zip archive .. and the two bonus tasks below)
Include the code to add the following to “documentation.txt” before you bundle it into the zip archive:
Report the number of strains you downloaded from NCBI (not including the reference strain)
Report the number of unique Surface Glycoprotein DNA variants you found among these new strains (Since Jan 1, 2023)