Bioinformatics Data Skills

Assignment 4


1. Read through Chapter 10 at least once

2. You don’t yet have the chops to work through the Example (Inspecting and trimming low-quality bases), so read it carefully but don’t try to follow along in your terminal yet. You should, however, work through the “Indexed FASTA files” section in your terminal.

3. Here’s a single, confusing-looking line of code that converts fastq to fasta format. It reads its input from STDIN and writes its output to STDOUT. (The first~step line addresses it uses, such as 1~4, are a GNU sed extension.)

        sed -n '1~4s/^@/>/p;2~4p'
    

Download Fastq File 1 and File 2 and run that line of code on each of them.
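
To see what each part of that sed line does, here is a minimal sketch on a made-up two-record fastq. The reads and the variable names toy_fastq and toy_fasta are invented for illustration and are not part of the assignment files. It assumes GNU sed.

```shell
# Toy two-record FASTQ (made-up data, not from the assignment files)
toy_fastq='@read1
ACGT
+
IIII
@read2
TTGA
+
FFFF'

# -n          : turn off sed's default printing, so only the rules below emit lines
# 1~4s/^@/>/p : on lines 1, 5, 9, ... swap the leading @ for > and print the result
# 2~4p        : print lines 2, 6, 10, ... (the sequence lines)
toy_fasta=$(printf '%s\n' "$toy_fastq" | sed -n '1~4s/^@/>/p;2~4p')

printf '%s\n' "$toy_fasta"
```

The "+" separator lines and the quality lines match neither rule, so they are dropped.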

4. Now, run that line of code on each file again, but this time, pipe the output to head to inspect the first 5 lines.
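
As a reminder of how piping into head works, here is a small sketch that uses seq as a stand-in for any command that prints many lines. The variable name first_five is only for illustration.

```shell
# seq prints 100 lines; head -n 5 keeps only the first five of them
first_five=$(seq 1 100 | head -n 5)

printf '%s\n' "$first_five"
```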

5. Write a piped command with process substitution that combines the first 8 lines of Assignment_4_File_1.fastq with the first 4 lines of the same file converted to fasta format, and redirect the output to a file named “Fastq_vs_Fasta.txt”.
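
If process substitution is new to you, the sketch below shows the general shape on throwaway text rather than on the assignment files. Each <( ... ) runs a command and hands its output to cat as if it were a file. Process substitution is a bash feature, so the demo is run through bash explicitly. The variable name combined and the toy strings are invented for illustration.

```shell
# cat receives two "files": the output of each command inside <( ... )
combined=$(bash -c 'cat <(printf "first\nsecond\n" | head -n 1) <(printf "a\nb\nc\n" | tr a-z A-Z | head -n 2)')

printf '%s\n' "$combined"
```

In an interactive bash session you can type the cat <( ... ) <( ... ) part directly and add a redirection such as > some_file.txt at the end.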

6. Repeat task 5, but for Assignment_4_File_2.fastq and redirect the output to “Fastq_vs_Fasta_2.txt”

7. Install the seqtk program (you’ll need to explore it yourself, online and in its help files, to see how it works): https://github.com/lh3/seqtk

        seqtk # returns list of available seqtk subcommands

Use: seqtk SUBCOMMAND -OPTIONS FILENAME

8. Use seqtk to trim the first 50 nucleotides from each sequence in Assignment_4_File_1.fastq and output the results to Assignment_4_File_1_trimmed.fastq

9. Using the “seqtk comp” command, write a piped chain of commands that returns THE NAME OF THE SEQUENCE THAT HAS THE GREATEST NUMBER OF ‘A’ NUCLEOTIDES.
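
One common pattern for finding the row with the largest value in a given column is sort, then head, then cut. The sketch below runs it on a hand-made, tab-separated table rather than on real seqtk output. The three-column layout (name, length, count) and the names toy_table and top_name are assumptions for illustration, so check which columns seqtk comp actually prints before adapting it.

```shell
# Made-up table: name <TAB> length <TAB> count (layout is an assumption)
toy_table=$(printf 'seqA\t10\t3\nseqB\t10\t7\nseqC\t10\t5\n')

# sort -k3,3nr : sort on column 3 only, numerically, largest first
# head -n 1    : keep the top row
# cut -f 1     : keep only the first (name) column
top_name=$(printf '%s\n' "$toy_table" | sort -t "$(printf '\t')" -k3,3nr | head -n 1 | cut -f 1)

printf '%s\n' "$top_name"
```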

10. Use seqtk to store a fasta version of the reverse-complement of “Assignment_4_File_1_trimmed.fastq” in a new file called “Assignment_4_File_1_trimmed_RC.fasta”

Use pipe(s) to do this in one line of code.

11. For the seqtk gc program, what command would return the regions that meet the following requirements for the Assignment_4_File_1.fastq file:

  • AT rich instead of GC rich
  • min region length of 20
  • min AT fraction of 0.75

12. For the seqtk rename program, what command would rename each sequence header to have the prefix “Seq_”?

  • Be sure to keep track of any commands you used in your terminal to install seqtk (task 7)

13. The third file download (Assignment_4_File_3.fasta) contains a Fasta-formatted list of sequences, but it’s in an odd format. How does it differ from other ‘standard’ Fastas? Which seqtk command would convert it to that ‘standard’ format?


What to turn in:

  • Store a cleaned-up plain-text file that documents the code you used for tasks 4-13
  • Be sure to include comments explaining what is going on in your code
  • At the bottom of the file, give your answer to the question in task 13, including:
    • an example of the first 2 lines of Assignment_4_File_3.fasta
    • a brief description of what’s different about this fasta file
    • the seqtk command you used to transform it to ‘standard’ format
    • an example of the first 2 lines of your converted Assignment_4_File_3.fasta file in its ‘standardized format’

Upload that plain-text file to Canvas. You’ll be graded on completeness, documentation, and readability.