You will use the CHPC server to download and process sequences from the Sequence Read Archive (SRA).
The SRA is a bit different from the NCBI nuccore database: it holds raw data from high-throughput sequencing technologies. Most of the time, when metagenomes are reported in the literature, the raw data from the sequencer are deposited here, along with the study metadata. Go to the SRA website and …
I’ve got one such list of accession numbers for you in this file.
These accessions are from a UVU student-led research project comparing soil microbes along a chronosequence of alpine wildfires with nearby unburned controls. The goal was to see how long it takes soil fungi to recover after a burn. To do this, we sequenced ITS1 amplicons. Each soil sample was returned as its own file (accession).
You need to download these data and run itsxpress on them all.
You’ll be using the sra-toolkit module, which is available on the CHPC.
See the course website for resources on configuring sra-toolkit and on using the prefetch and fastq-dump programs (both are part of sra-toolkit).
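As a rough illustration of how prefetch and fastq-dump fit together, here is a minimal sketch, not a finished solution. It assumes the accession list is a plain-text file with one accession per line and that sra-toolkit is already loaded and configured. The function names `run` and `download_accessions`, the `DRY_RUN` variable, and the file and directory names are made up for this example.

```shell
# Sketch only. Assumes sra-toolkit is loaded and configured.
# run, download_accessions, and DRY_RUN are hypothetical names for this example.

# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Usage: download_accessions <accession_list.txt> <output_dir>
download_accessions() {
    local acc_file="$1"
    local out_dir="$2"
    local acc

    mkdir -p "$out_dir"

    # Read one accession per line (also handles a last line with no newline).
    while IFS= read -r acc || [ -n "$acc" ]; do
        acc="${acc%$'\r'}"              # drop a trailing carriage return, if any
        [ -z "$acc" ] && continue       # skip blank lines

        # Fetch the run from the SRA.
        run prefetch "$acc"

        # Convert to FASTQ: --split-files writes the reads of each spot to
        # separate files (_1, _2), --gzip compresses the output.
        run fastq-dump --split-files --gzip --outdir "$out_dir" "$acc"
    done < "$acc_file"
}
```

For example, `DRY_RUN=1 download_accessions accessions.txt fastq` (both names are placeholders) prints the commands without running them, which is a cheap way to check the loop before spending cluster time.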
1. Download and extract the fastq files from the accessions listed in that file above
2. Run itsxpress on all of them to extract just the ITS1 region
3. Make sure to request enough time and resources from the kingspeak-shared cluster to get the job done
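For step 3, a batch script header along these lines is one way to request resources from kingspeak-shared. Treat it as a sketch under assumptions: the job name, the `YOUR_ACCOUNT` placeholder, and the time, task, and memory values are all invented for illustration, so replace them with whatever your own allocation and data size call for.

```shell
#!/bin/bash
# Sketch of a SLURM batch script header. Every value below is a placeholder.
#SBATCH --job-name=its1_extract
#SBATCH --partition=kingspeak-shared
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --time=04:00:00
#SBATCH --ntasks=4
#SBATCH --mem=16G
#SBATCH --output=its1_extract_%j.log

# SLURM sets SLURM_NTASKS inside a job; fall back to 1 when run outside SLURM.
THREADS="${SLURM_NTASKS:-1}"
echo "Job ${SLURM_JOB_ID:-not-in-slurm} will use ${THREADS} task(s)"

# ... module loads and the download / itsxpress steps would follow here ...
```

Saved as, say, `its1_job.sh` (a hypothetical name), the script would be submitted with `sbatch its1_job.sh`. Reading the thread count from `SLURM_NTASKS` keeps the later commands in step with whatever you actually requested.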
Upload the plain-text file(s) to Canvas. You’ll be graded on completeness, documentation, and readability.
Note that itsxpress is NOT an available module. But its dependencies are! So you load those dependencies, and then import itsxpress as a Python library:
module load python/3.7.3
module load hmmer3/3.1b2
module load vsearch
# import itsxpress
python $HOME/programs/load_itsxpress.py
…and the contents of load_itsxpress.py are just one line:
import itsxpress
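Once the dependencies are loaded, the ITS1 extraction over many files can be sketched as a loop. This is a hedged example, not the required approach: it assumes the `itsxpress` command-line program that ships with the Python package is on your PATH, that the FASTQ files are named `<accession>_1.fastq.gz` (plus `<accession>_2.fastq.gz` for paired-end runs), and that the samples are fungal. The function names, the `DRY_RUN` variable, and the output naming are invented for illustration.

```shell
# Sketch only. Assumes the itsxpress command is on PATH and its
# dependencies (python, hmmer3, vsearch) are already loaded.
# run, extract_its1, and DRY_RUN are hypothetical names for this example.

# Run a command, or just print it when DRY_RUN is set to a non-empty value.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Usage: extract_its1 <input_dir> <output_dir> [threads]
extract_its1() {
    local in_dir="$1"
    local out_dir="$2"
    local threads="${3:-1}"
    local r1 r2 sample

    mkdir -p "$out_dir"

    for r1 in "$in_dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue        # nothing matched the pattern
        sample="$(basename "$r1" _1.fastq.gz)"
        r2="$in_dir/${sample}_2.fastq.gz"

        if [ -f "$r2" ]; then
            # Paired-end run: pass both read files.
            run itsxpress --fastq "$r1" --fastq2 "$r2" \
                --region ITS1 --taxa Fungi --threads "$threads" \
                --log "$out_dir/${sample}.log" \
                --outfile "$out_dir/${sample}_ITS1.fastq.gz"
        else
            # Single-end run: only one read file exists.
            run itsxpress --fastq "$r1" --single_end \
                --region ITS1 --taxa Fungi --threads "$threads" \
                --log "$out_dir/${sample}.log" \
                --outfile "$out_dir/${sample}_ITS1.fastq.gz"
        fi
    done
}
```

A call such as `DRY_RUN=1 extract_its1 fastq its1_out 4` (directory names are placeholders) prints one itsxpress command per sample, so you can confirm the file pairing before running the real job.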