Bioinformatics Data Skills

Case Study 1: Use an existing tool to accompish a customized task

The internally transcribed spacer region is a region between highly conserved the small subunit (SSU) of rRNA and the large subunit (LSU) of the rRNA.
It refers to the two variable length spacer regions that flank the 5.8S coding region.
In amplicon sequencing studies it is common practice to trim off the conserved (SSU, 5,8S or LSU) regions.
The conserved sites, including where the primers sit, can mess up taxonomic assignment algorithms, and is generally not informative.

Your collaborator has taken 15 samples from peat bogs around the world and has amplified and sequenced the ITS1 region of rDNA from each of these samples. She has sent you the files to identify the ‘species’ found in each sample. Each file represents all of the forward reads (the reverse reads weren’t good quality) from a respective sample. The primers she used are allegedly universal for pretty much all eukaryotes, so there should be a wide variety of organisms present. However, the ITS regions are highly variable (length and composition) and you need to remove the flanking 18S and 5.8S regions before trying to assign taxonomy.

So, you’re going to be using an existing piece of software to extract just the ITS1 region from all these files of DNA reads. The tool you will use is called itsxpress

Getting it installed and working properly is the first task:
It depends on a lot of other software, including HMMER3, biopython, and others
You will need to install the conda system in order to even install itsxpress
miniconda installation instructions
Vsearch installation
BBtools installation
Once you have installed itsxpress and its dependencies, make sure it’s in your $PATH (add it if necessary) so the program can be found, and then take a look at how to use it from the help page.
You need to run itsxpress on all the fastq files in our data set, making sure to stash all the output information and diagnostics for each file.

That data set, once unzipped, contains 15 samples (each in its own file) of ITS amplicons. The sample names are:

    CC1,CC2,CC3,CC4,CC5,CR1,CR2,CR3,CR4,CR5,PF1,PF3,PF5,SW1,SW2

The problem is that the amplicons have overhanging regions outside of the ITS1 that we want, so wee need to extract just the regions of reads in each sample that corresponds to the ITS1 section and remove the partial 18S and 5.8S sections.

In other words, your documented script should automatically perform the following for each file:

1. Extract the ITS1 regions from each sample (AKA, from each file)
2. Non-default parameters for ITSxpress should be:
  - these are single reads (not paired)
  - save *only* the ITS1 region
  - search for ITS1 reads from *all* taxonomic groups
3. Store all diagnostic log information in a unique file for each sample
4. Be sure that each file gets its own log file

Further, we want to get some summary information about how many reads in each file successfully had an ITS1 detection and extraction
Use grep, sed, awk, cut, paste (or whatever tools you need) to extract summary info out of the log files about how well itsxpress performed

1. How many input sequences were there?
2. How many output ITS1 sequences made it through?
3. How long did it take to run for each file?
4. Save this selected summary info (for each sample) into a new file called "ITSxpress_summary_info.txt"
  - This file should be organized well enough so that for each sample, it's easy to see the selected summary info associated with it... something like this (looks ugly, but it's tab-separated):



    Filename    Seqs in Seqs out    Elapsed time
    DNA.CAC.CC1.ITS_S2_L001_R1_001.fastq    28009   15229   00:02:36
    DNA.CAC.CR5.ITS_S20_L001_R1_001.fastq   13630   5723    00:01:27
    . . .

Now your files are ready for taxonomic assignment!

What to turn in:

Store a cleaned-up plain-text file that documents the code you used to accomplish everything
Be sure to include comments explaining what is going on in your code
Upload that plain-text file to Canvas. You’ll be graded on completeness, documentation, and readability.
Copy/Paste the contents of ITSxpress_summary_info.txt into the text entry box of the Canvas assignment as well.