Bioinformatics Data Skills

Case Study 2: Fully document the downloading of a data set

Read this report on the emergent Coronavirus lineage known as B.1.1.7 https://virological.org/t/preliminary-genomic-characterisation-of-an-emergent-sars-cov-2-lineage-in-the-uk-defined-by-a-novel-set-of-spike-mutations/563

It was quickly identified as a “variant of concern” given the large number of mutations with phenotypic implications.

A major concern is that the Spike D614G amino acid change (caused by an A-to-G nucleotide mutation at position 23,403 in the Wuhan reference strain) is spreading quickly.

We want to compare the genome of this newer viral strain with that of the reference sequence (AKA the type strain) to confirm what’s being said in recent reports. Your task is to retrieve amino acid and DNA sequences for the SARS-CoV-2 surface glycoprotein. We need you to provide us with separate files of these sequences for the original type (reference) strain, along with all strains deposited into NCBI since Jan 1st, 2023. These sequence files need to have a strong chain of custody, so document all steps, including the data retrieval!

Fortunately, the NCBI has made a special piece of software available for accessing SARS-CoV-2 genomic information.

Get to work!

Here, you will be using a couple of command-line tools from NCBI to download SARS-CoV-2 genomic data.

Everything you do must be documented so that it is completely reproducible. That includes version numbers of all software, etc.

You will be downloading the complete genome of the Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1. This is a fully annotated complete genome, and you will be downloading not only the DNA sequence, but associated metadata and protein info and documenting every step.

The tool you need is datasets, which can be found at https://www.ncbi.nlm.nih.gov/datasets/docs/command-line-start/

This is a special NCBI tool for easy downloading of genes and genomes. Download it into your Case Study 2 main directory. You can run it from within there.

Getting it installed and working properly is the first task:
Download and install the version for your own computer.
If on Linux or Mac, be sure to set the executable permission with:
```
chmod +x datasets
```
Once you’ve done that, find the “help” documentation in the program, and then find the help documentation for the download command within the program.
Once you have it working, be sure to document the version number in a documentation.txt file associated with this project directory.
Also be sure to document the date you are doing this!!
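
One possible way to capture the date and version number is sketched below. It assumes the datasets binary sits in the current directory, and the `log` helper is just an illustrative name, not something the assignment or NCBI provides.

```shell
# Sketch: append each command and its output to documentation.txt.
DOC=documentation.txt

# Hypothetical helper: write the command line, then run it and record its output.
log() {
    printf '$ %s\n' "$*" >> "$DOC"
    "$@" >> "$DOC" 2>&1
}

log date                        # record when the work was done

# Only call the NCBI tool if it has actually been downloaded here.
if [ -x ./datasets ]; then
    log ./datasets --version    # record the software version
    ./datasets --help           # top-level help, printed to the screen
    ./datasets download --help  # help for the download command
fi
```

Writing the command line next to its output means documentation.txt shows both what you ran and what came back.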

Now, your tasks.

Search NCBI for: “sars-cov-2”
Browse the info from NCBI and see what is available
Find the accession number for the reference sars-cov-2 genome, known as the viral “RefSeq”
Use the datasets program to download just the sars-cov-2 refseq, along with all other data available for that refseq.
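
As a rough sketch, the RefSeq download might look like the following. The `--include` option and its values are assumptions based on recent datasets releases (older releases used different options), so confirm them with `./datasets download virus genome --help` before relying on this. The file name refseq.zip and folder name refseq are arbitrary choices.

```shell
# Sketch: fetch only the SARS-CoV-2 RefSeq genome plus its extra data, then unpack it.
# ASSUMPTION: --include and its values match your datasets version; check the help text.
CMD="./datasets download virus genome taxon sars-cov-2 --refseq --include genome,cds,protein,annotation --filename refseq.zip"

# Document the exact command before running it.
printf '$ %s\n' "$CMD" >> documentation.txt

# Only run the download if the NCBI tool is present in this directory.
if [ -x ./datasets ]; then
    $CMD                            # download the archive
    unzip -o refseq.zip -d refseq   # unpack into its own folder
    find refseq -type f             # list the files you received
fi
```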

Useful file info for NCBI genomes:

fna: FastA format file containing Nucleotide sequence (i.e. DNA).

gbff: GenBank genome Feature File containing genome sequence and annotation.

gff: General Feature Format containing genomic regions (i.e. genes, transcripts).

faa: FastA format file containing Amino acid sequence (i.e. protein, peptide).

gpff: GenBank Protein Feature File containing protein sequence and annotation.

cds: A sequence of nucleotides that corresponds with the sequence of amino acids in a protein.

pdb: Text-based description of 3-dimensional structures of molecules.

Find the amino acid and nucleotide sequences for “surface glycoprotein” in sars-cov-2

You’ll need to browse through these files, and then use a combination of tools like seqtk and grep to pull them out.
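
Here is a minimal sketch of that extraction, run on a made-up FASTA file so you can see the idea before touching real data. The file names (demo.faa, demo_spike.faa) and the record IDs are invented for illustration; substitute the protein or CDS file from your own download.

```shell
# Sketch: pull every FASTA record whose header mentions "surface glycoprotein".
# demo.faa is a toy stand-in for a real protein FASTA file.
cat > demo.faa <<'EOF'
>YP_000001.1 ORF1ab polyprotein [toy example]
MESLVPGFNE
KTHVQLSLPV
>YP_000002.1 surface glycoprotein [toy example]
MFVFLVLLPL
VSSQCVNLTT
>YP_000003.1 envelope protein [toy example]
MYSFVSEETG
EOF

# See which headers match before extracting anything.
grep '^>' demo.faa | grep -i 'surface glycoprotein'

# awk switches printing on at a matching header and off at the next header,
# so multi-line sequences stay intact.
awk '/^>/ { keep = (tolower($0) ~ /surface glycoprotein/) } keep' demo.faa > demo_spike.faa

# Same idea with seqtk, if it is installed: list the matching IDs, then subset.
#   grep -i 'surface glycoprotein' demo.faa | cut -c2- | cut -d' ' -f1 > ids.txt
#   seqtk subseq demo.faa ids.txt > demo_spike.faa
```

A plain `grep` on the header alone would drop the sequence lines, which is why the sketch tracks whether it is currently inside a matching record.
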
Now, that was the type (reference) strain. Let’s also download some more recent strains so we can compare the nucleotide and amino acid sequences for the “surface glycoprotein” between the reference and the variant strains.

We want to look for an A-to-G nucleotide mutation at position 23,403 in the Wuhan reference strain. That’s a job better suited for a different language like R, Python or Perl, so we just want to get the data we need into files.

Let’s use the datasets tool again, but this time, instead of using the --refseq flag, let’s use some of the other built-in flags to select strains that were released after 01/01/2023. Be sure to save these under a new zip file name so you don’t overwrite your reference genome. Maybe do this in a subfolder (any commands to make that subfolder should be documented as well!)
Again, find the amino acid and nucleotide sequences for “surface glycoprotein” in these new strains you’ve downloaded.

The main difference is that each of these files will have data from multiple isolate genomes.
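
A hedged sketch of the subfolder and download step follows. The `--released-after` flag is an assumption taken from recent datasets releases, and the folder and zip names are arbitrary; check `./datasets download virus genome --help` for the exact option names on your version, including whichever option selects protein and CDS output. Be aware that the full set of genomes released since that date may be very large.

```shell
# Sketch: keep the newer strains in their own folder with their own zip name.
mkdir -p newer
printf '$ %s\n' "mkdir -p newer" >> documentation.txt

# ASSUMPTION: --released-after is the flag name in your datasets version.
CMD="./datasets download virus genome taxon sars-cov-2 --released-after 2023-01-01 --filename newer/newer.zip"
printf '$ %s\n' "$CMD" >> documentation.txt

# Only run the download if the NCBI tool is present in this directory.
if [ -x ./datasets ]; then
    $CMD
    unzip -o newer/newer.zip -d newer
fi
```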

The amino acid and nucleotide sequences for each dataset should be in their own files. So 4 files in total:

 newer.faa   #amino-acid seqs of surface glycoprotein from all strains since 01/01/2023

 newer.fna   #nucleotide seqs of surface glycoprotein from all strains since 01/01/2023

 refseq.faa  #amino-acid seqs of surface glycoprotein from reference genome

 refseq.fna  #nucleotide seqs of surface glycoprotein from reference genome

Bundle those 4 files, along with documentation.txt, into an archive called “sars_cov2_Surface_Glycoprotein_Sequences.zip” (documentation.txt should include the datasets software version number, along with ALL the code you used to generate these 4 files and the zip archive, plus the code for the two bonus tasks below).
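
One way to do the bundling is sketched here, assuming the `zip` and `unzip` utilities are installed and the five files sit in the current directory. The check loop is an extra safeguard of my own, not a requirement of the assignment.

```shell
# Sketch: verify the five files exist, log the zip command, then build the archive.
ARCHIVE=sars_cov2_Surface_Glycoprotein_Sequences.zip
FILES="newer.faa newer.fna refseq.faa refseq.fna documentation.txt"

missing=0
for f in $FILES; do
    if [ ! -s "$f" ]; then          # -s: file exists and is not empty
        echo "missing or empty: $f"
        missing=1
    fi
done

if [ "$missing" -eq 0 ]; then
    # Log first, so the copy of documentation.txt inside the zip includes this step.
    printf '$ zip %s %s\n' "$ARCHIVE" "$FILES" >> documentation.txt
    zip "$ARCHIVE" $FILES
    unzip -l "$ARCHIVE"             # list the contents as a final check
fi
```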

BONUS

Include the code to add the following to “documentation.txt” before you bundle it into the zip archive:

Report the number of strains you downloaded from NCBI (not including the reference strain)
Report the number of unique Surface Glycoprotein DNA variants you found among these new strains (Since Jan 1, 2023)
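
Both counts can be sketched on a toy file as shown below. The file demo.fna and its strain names are invented; with real data you would point the commands at your genome file (for the strain count) and newer.fna (for the variant count). The sketch assumes one record per strain and consistent letter case in the sequences.

```shell
# Sketch: count records and distinct sequences in a FASTA file.
# demo.fna is a toy stand-in; strainA and strainB carry the same sequence.
cat > demo.fna <<'EOF'
>strainA surface glycoprotein
ATGTTTGTTT
TTCTTGTT
>strainB surface glycoprotein
ATGTTTGTTTTTCTTGTT
>strainC surface glycoprotein
ATGTTTGTTTTTCTTGGT
EOF

# Number of records = number of header lines.
n_records=$(grep -c '^>' demo.fna)

# Join each record's sequence onto one line, then count distinct sequences.
n_unique=$(awk '/^>/ { if (seq != "") print seq; seq = ""; next }
                { seq = seq $0 }
                END { if (seq != "") print seq }' demo.fna | sort -u | wc -l | tr -d ' ')

echo "Number of strains downloaded: $n_records" >> documentation.txt
echo "Unique surface glycoprotein DNA variants: $n_unique" >> documentation.txt
```

Joining the wrapped lines first matters: two identical sequences that are line-wrapped differently would otherwise be counted as different variants.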

What to turn in:

A cleaned-up plain-text file that documents the code you used to accomplish everything
Be sure to include comments explaining what is going on in your code
Upload that plain-text file to Canvas. You’ll be graded on completeness, documentation, and readability.