The internal transcribed spacer (ITS) is the region between the highly conserved small subunit (SSU) and large subunit (LSU) rRNA genes.
It refers to the two variable-length spacer regions (ITS1 and ITS2) that flank the 5.8S coding region.
In amplicon sequencing studies, it is common practice to trim off the conserved (SSU, 5.8S, or LSU) regions.
The conserved sites, including where the primers sit, can confuse taxonomic assignment algorithms and are generally not informative.
Getting ITSxpress installed and working properly is the first task:
BBtools installation
Once you have installed itsxpress and its dependencies, make sure it’s in your $PATH (add it if necessary) so the program can be found, and then take a look at how to use it from the help page.
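A quick way to confirm the program can be found is sketched below. The function name `check_on_path` and the install directory in the comment are placeholders, not part of ITSxpress; adjust them to your own setup.

```shell
# Report whether a command is reachable through $PATH.
# Returns 0 and prints the resolved location if found, 1 otherwise.
check_on_path() {   # usage: check_on_path COMMAND
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

# Example (the directory below is a placeholder for wherever you installed it):
# export PATH="$HOME/path/to/itsxpress/bin:$PATH"
# check_on_path itsxpress && itsxpress --help
```

If the check fails, add the install directory to `$PATH` (for example in your shell startup file) and try again.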
You need to run itsxpress on all the fastq files in our data set, making sure to stash all the output information and diagnostics for each file.
That data set, once unzipped, contains 15 samples (each in its own file) of ITS amplicons. The sample names are:
CC1,CC2,CC3,CC4,CC5,CR1,CR2,CR3,CR4,CR5,PF1,PF3,PF5,SW1,SW2
The problem is that the amplicons have overhanging regions outside of the ITS1 that we want. We need to extract just the part of each read that corresponds to the ITS1 section and remove the partial 18S and 5.8S sections.
1. Extract the ITS1 regions from each sample (AKA, from each file)
2. Non-default parameters for ITSxpress should be:
- these are single reads (not paired)
- save *only* the ITS1 region
- search for ITS1 reads from *all* taxonomic groups
3. Store all diagnostic log information in a unique file for each sample
4. Be sure that each file gets its own log file
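The steps above could be sketched as a loop like the one below. This is only one possible layout, not the required solution: the directory arguments, the output naming scheme, and the `ITSXPRESS_BIN` variable (used to swap in another command for a dry run) are all assumptions. The flags shown are the ones listed on the ITSxpress help page as far as I know; confirm them with `itsxpress --help` for your installed version.

```shell
# Run ITSxpress on every .fastq file in a directory, writing one trimmed
# output file and one log file per input file.
run_itsxpress_all() {   # usage: run_itsxpress_all IN_DIR OUT_DIR LOG_DIR
    local in_dir="$1" out_dir="$2" log_dir="$3"
    # Hypothetical override: set ITSXPRESS_BIN=echo to print the commands
    # instead of running them.
    local runner="${ITSXPRESS_BIN:-itsxpress}"
    local fq base
    mkdir -p "$out_dir" "$log_dir"
    for fq in "$in_dir"/*.fastq; do
        [ -e "$fq" ] || continue              # glob matched nothing
        base="$(basename "$fq" .fastq)"       # e.g. DNA.CAC.CC1.ITS_S2_L001_R1_001
        # --single_end: unpaired reads; --region ITS1: keep only ITS1;
        # --taxa All: search all taxonomic groups; --log: per-sample log file.
        "$runner" --fastq "$fq" --single_end --region ITS1 --taxa All \
            --outfile "$out_dir/${base}.ITS1.fastq" \
            --log "$log_dir/${base}.log"
    done
}

# Example call (directory names are placeholders):
# run_itsxpress_all raw_fastq its1_out its1_logs
```

Because the log name is derived from the input file name, each sample automatically gets its own log file.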
Further, we want some summary information about how many reads in each file had an ITS1 region successfully detected and extracted.
Use grep, sed, awk, cut, paste (or whatever tools you need) to extract summary info from the log files about how well itsxpress performed:
1. How many input sequences were there?
2. How many output ITS1 sequences made it through?
3. How long did it take to run for each file?
4. Save this selected summary info (for each sample) into a new file called "ITSxpress_summary_info.txt"
- This file should be organized well enough so that for each sample, it's easy to see the selected summary info associated with it... something like this (looks ugly, but it's tab-separated):
Filename	Seqs in	Seqs out	Elapsed time
DNA.CAC.CC1.ITS_S2_L001_R1_001.fastq	28009	15229	00:02:36
DNA.CAC.CR5.ITS_S20_L001_R1_001.fastq	13630	5723	00:01:27
. . .
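One way to assemble such a table is sketched below. It makes several assumptions that you should check against your own files: each log is named after its FASTQ file with a `.log` suffix, and the value you want is the last whitespace-separated field on the first log line matching a pattern. The exact wording of the log lines is not given here, so the three patterns are passed in as arguments; open one log and pick patterns that match what you actually see. The function names are hypothetical.

```shell
# Print the last whitespace-separated field of the first line matching PATTERN.
last_field() {   # usage: last_field PATTERN FILE
    grep -m 1 -E -- "$1" "$2" | awk '{print $NF}'
}

# Build a tab-separated summary with one row per log file in LOG_DIR.
summarize_logs() {   # usage: summarize_logs LOG_DIR IN_PATTERN OUT_PATTERN TIME_PATTERN
    local log_dir="$1" in_pat="$2" out_pat="$3" time_pat="$4"
    local log sample
    printf 'Filename\tSeqs in\tSeqs out\tElapsed time\n'
    for log in "$log_dir"/*.log; do
        [ -e "$log" ] || continue                    # glob matched nothing
        sample="$(basename "$log" .log).fastq"       # assumed naming: <fastq name>.log
        printf '%s\t%s\t%s\t%s\n' "$sample" \
            "$(last_field "$in_pat" "$log")" \
            "$(last_field "$out_pat" "$log")" \
            "$(last_field "$time_pat" "$log")"
    done
}

# Example call (the directory and the three patterns are placeholders):
# summarize_logs its1_logs 'IN_PATTERN' 'OUT_PATTERN' 'TIME_PATTERN' > ITSxpress_summary_info.txt
```

If a pattern matches nothing in a log, that column is left empty for the sample, which makes missing values easy to spot.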