Bioinformatics Data Skills

Assignment 7

You will be creating some SLURM scripts that will run on the UofU CHPC Cluster.

1. Read through Chapter 4 at least twice

2. Watch the first 6 videos in the CHPC series in the course website resources

3. Write and submit the following 3 jobs via SLURM scripts:

Each of these three jobs should be a self-contained SLURM script that you submit to the “kingspeak-shared” partition.

Job 1: Testing this out

Resources = 1 core, 1 minute of wall time, 10 MB of memory
Task:
```
cat <(ls -l1 ~) 
```
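
A minimal sketch of how the Job 1 resources could map onto `#SBATCH` directives. The job name, the output file name, and `YOUR_ACCOUNT` are placeholders rather than values given in this assignment; the CHPC videos cover which account pairs with the kingspeak-shared partition.

```shell
#!/bin/bash
#SBATCH --job-name=job1_test          # placeholder name
#SBATCH --partition=kingspeak-shared  # from the assignment
#SBATCH --account=YOUR_ACCOUNT        # placeholder: replace with your CHPC account
#SBATCH --ntasks=1                    # 1 core
#SBATCH --time=00:01:00               # 1 minute of wall time
#SBATCH --mem=10M                     # 10 MB of memory
#SBATCH --output=job1_%j.out          # STDOUT goes here; %j expands to the job ID

# List the home directory, one entry per line, via process substitution.
cat <(ls -l1 ~)
```

Assuming the script is saved as `job1.slurm` (a hypothetical name), it would be submitted with `sbatch job1.slurm` and monitored with `squeue -u $USER`.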

Job 2: Downloading surface glycoprotein sequences

Resources = 1 core, 30 minutes of wall time, 6400 MB of memory
Tasks:
1. Download and set up the NCBI datasets program we used before
2. Use datasets to download just the SARS-CoV-2 genomes released since 01/01/2023, extracting the CDS (coding sequences)
- That command should look like…
```
PATH/TO/datasets download virus genome taxon sars-cov2 --include cds --filename sars-cov2.zip --released-after 01/01/2023 --annotated --complete-only
```

3. Extract and format the nucleotide sequences for just the “surface glycoprotein” CDS
4. Calculate the number of sequences you downloaded
5. Calculate the number of unique nucleotide variants for that gene
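
The extraction and counting steps can be sketched on a toy file as follows. This is one possible approach, not the required one. The function names, the toy sequences, and the header layout (`>accession:coords product [isolate=...]`) are assumptions for illustration; check the headers in your own cds.fna before relying on the pattern.

```shell
#!/bin/bash
# Join wrapped sequence lines so each record is one header line + one sequence line.
linearize_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$1"
}

# Keep only records whose header contains the given product name ($2).
extract_by_product() {
    awk -v product="$2" '/^>/ { keep = (index($0, product) > 0) } keep' "$1"
}

# Number of records = number of header lines.
count_seqs() { grep -c "^>" "$1"; }

# Number of distinct sequences, ignoring headers.
count_unique_seqs() { linearize_fasta "$1" | grep -v "^>" | sort -u | wc -l; }

# Toy input (made-up records) standing in for the real cds.fna.
demo_dir=$(mktemp -d)
cat > "$demo_dir/cds.fna" <<'EOF'
>A.1:1-12 surface glycoprotein [isolate=X]
ATGTTT
GTTTTT
>A.1:13-20 ORF3a protein [isolate=X]
ATGGATTT
>B.1:1-12 surface glycoprotein [isolate=Y]
ATGTTTGTTTTT
>C.1:1-12 surface glycoprotein [isolate=Z]
ATGTTTGTCTTT
EOF

extract_by_product "$demo_dir/cds.fna" "surface glycoprotein" \
    > "$demo_dir/surface_glycoproteins.fasta"
echo "sequences: $(count_seqs "$demo_dir/surface_glycoproteins.fasta")"
echo "unique variants: $(count_unique_seqs "$demo_dir/surface_glycoproteins.fasta")"
```

On the toy file this reports 3 sequences and 2 unique variants, because records A and B are identical once the wrapped lines are joined.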

Job 3: Clustering those sequences

Resources = 12 cores, 4 hours of wall time (to start with, at least; increase if necessary), 28000 MB of memory
Modules = vsearch/040218

Tasks:

Run the following commands on the surface glycoprotein sequences you extracted from the cds.fna file (tweak file names/paths if necessary)

     vsearch --cluster_fast surface_glycoproteins.fasta --id 0.97 --centroids centroids.fasta --consout consensus.fasta --msaout msa.fasta --profile cluster_profiles.txt

     grep "^>" centroids.fasta > centroid_ids.txt

     grep -Fwf centroid_ids.txt cds.fna

This vsearch command clusters the nucleotide sequences by similarity. In other words, it finds groups of nucleotide variants that are at least 97% identical to their cluster's representative (centroid) sequence, and then profiles the makeup of each cluster at each nucleotide position. This is useful for several reasons. First, if you have 9999 identical sequences and 1 with a single nucleotide change, that change could be real or it could be a sequencing error; clustering groups such near-identical variants together. Second, it performs an alignment step that can be useful for downstream processes. Third, the clusters reflect sequence similarity, so we can get a quick feel for diversity at this locus, including where the clusters are found geographically.

The last two lines of that script return the full sequence names for the representative sequence of each identified cluster. Those full names have location information embedded in them!
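
To see why the lookup step is needed: vsearch truncates sequence labels at the first space by default (its `--notrunclabels` option turns this off), so the headers in centroids.fasta are shorter than those in cds.fna. The toy sketch below uses made-up records and an assumed header layout to show the two grep commands recovering the full headers; the output file name `centroid_full_names.txt` is hypothetical, and the behavior is worth verifying against your own output.

```shell
#!/bin/bash
# Toy demonstration of mapping centroid IDs back to full headers.
# All file contents below are made up for illustration.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Stand-in for the downloaded CDS file, with full headers.
cat > cds.fna <<'EOF'
>A.1:1-12 surface glycoprotein [isolate=X]
ATGTTTGTTTTT
>A.1:13-20 ORF3a protein [isolate=X]
ATGGATTT
>C.1:1-12 surface glycoprotein [isolate=Z]
ATGTTTGTCTTT
EOF

# Stand-in for vsearch centroid output, with labels cut at the first space.
cat > centroids.fasta <<'EOF'
>A.1:1-12
ATGTTTGTTTTT
>C.1:1-12
ATGTTTGTCTTT
EOF

# Collect the centroid header lines (including the leading ">").
grep "^>" centroids.fasta > centroid_ids.txt

# -F: fixed strings, -w: whole-word matches, -f: read patterns from a file.
grep -Fwf centroid_ids.txt cds.fna > centroid_full_names.txt
cat centroid_full_names.txt
```

Here the two surface glycoprotein headers are printed in full, including their `[isolate=...]` fields, while the ORF3a record is left out.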

Each of these 3 jobs will produce an outfile containing STDOUT.

What to turn in:

Your 3 SLURM scripts, concatenated into a single plain-text file.
Your 3 outfiles, concatenated into a single plain-text file.

Upload those plain-text files to Canvas. You’ll be graded on completeness, documentation, and readability.