Bioinformatics Data Skills
Utah Valley University - BIOL 3150
Handy links:
Computer requirements for this class: CLICK HERE
Command Line Projects and the Unix Philosophy
Week 1
Ideology of ‘Robust and Reproducible’ Bioinformatics
Topics:
- What are “data skills?” | Reproducibility and open science | How to learn bioinformatics | Documentation | The importance of caution
Assignments:
- Purchase the textbook
- Read through Chapter 1 of the textbook … twice, and carefully
- Find and explore the supplemental materials for the chapter on GitHub
- Go through the resources below (Do this every week before class!)
- Assignment 1 - Reflection piece on why you want to learn command line skills and best practices
- Set up your computer environment (Command-line, Git)
Resources
- Intro to Linux Video
- Download Git
- Supplemental files for all textbook chapters - REPOSITORY
- Navigating your computer with a terminal video
- Recorded Lesson (old) Part 1
- Recorded Lesson (old) Part 2
Practice
Make sure you’ve watched the videos above and can navigate in your command line terminal.
Do you know what the following commands do?
pwd
cd ~
cd ..
ls -a
ls -l
For your consideration:
- “Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.” –Brian Kernighan
- “Since the computer is a sharp enough tool to be really useful, you can cut yourself on it.” – John Tukey
Week 2
Proper Project Organization
Topics:
- One directory per project | data as ‘read-only’ | rules for naming things | project structure | documentation
Assignments:
- Read through BDS Chapter 2 at least once
- Work through BDS Chapter 2, following along in your own terminal
- Assignment 2 - Create organized project template using code
Resources
- Guide to reproducible code… read this!
- File paths explained
- Example project structure
- Beginners’ guide to the Bash terminal video (need to watch if you haven’t used a terminal before)
- Common Unix commands
- Recorded Lesson (old) Part 1
- Recorded Lesson (old) Part 2
Practice
- Re-create your project directory template by copy-pasting each line of code from your assignment to make sure it gives the same result
- Spend time making sure that you intuitively understand relative filepaths and get comfy with the terminal
- Spend 2-3 hours mucking about in your terminal reworking the lines from Chapter 2 over and over until it feels normal
For your consideration:
- If you are learning to play the piano, and you settle for a couple hours a week of instruction without practicing on your own, you’re gonna be a really crappy piano player, like me. –Geoff Zahn
Unix refresher and sequence data types
Week 3
The Unix Shell
Topics:
- The Unix philosophy | text streams | pipes and redirection | process control | process substitution
Assignments:
- If you’re using a Mac, you should go ahead and install homebrew
- Next (Mac users only), paste the commands from this script into your terminal. That will use ‘homebrew’ to give you the added functionality of ‘GNU’ commands along with some other stuff you’ll need.
- Read through BDS Chapter 3
- Work through BDS Chapter 3, following along in your own terminal
- Assignment 3 - Running shell scripts, redirecting, pipes, background processes
- Read/watch ALL of the resources below. Be able to write a for-loop.
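Here is a minimal for-loop sketch to compare against once you've gone through the resources. The scratch directory and the sample_*.txt file names are made up for illustration; they are not course files.

```shell
#!/usr/bin/env bash
# Minimal for-loop sketch. The scratch directory and sample_*.txt names
# are hypothetical stand-ins, not files from the course repository.
workdir=$(mktemp -d)
printf 'a\nb\n'    > "$workdir/sample_1.txt"
printf 'a\nb\nc\n' > "$workdir/sample_2.txt"

# Loop over every file matching the wildcard and report its line count.
# The redirect after "done" captures the output of the whole loop.
for f in "$workdir"/sample_*.txt; do
    n=$(wc -l < "$f")
    # $((n)) strips the leading spaces that some versions of wc print
    echo "$(basename "$f") has $((n)) lines"
done > "$workdir/line_counts.txt"

cat "$workdir/line_counts.txt"
```

The pattern is always the same: a variable name (f), a list of things to loop over (here, a wildcard), and the commands between do and done that run once per item.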
Resources
- Basic Unix Commands
- Very Useful Tutorial
- On the Value of Command-Line Bullshittery
- On the Annoyance of Command-Line Bullshittery
- So is this stuff even useful for bioinformatics? YES!
- Video walkthroughs of some command line stuff:
- Part 1 - first commands
- Part 2 - pipes and wildcards
- Part 3 - relative filepaths
- Command line program flags/parameters
- How to avoid two potentially dangerous command line errors
- For-loops video walkthrough in BASH
- Bonus tips:
- Recorded Lesson (old) Part 1
- Recorded Lesson (old) Part 2 (mostly trying to get things to work on Windows)
Practice
For your consideration:
- “This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” –Doug McIlroy
Week 4
Working with Sequence Data
Topics:
- fasta and fastq file formats | using existing tools to work with sequence data
Assignments:
- Read through BDS Chapter 10 at least once
- Don’t work through the examples yet (we can return to them once we have more skills)
- Assignment 4 - converting between formats, inspecting and trimming reads, using pre-made command-line tools
Resources
- seqtk
- fasta and fastq formats
- VIDEO on fasta, fastq, and gzip compression
- phred scores
- fasta format for peptide sequences
- Recorded Lesson (old) Part 1
- Recorded Lesson (old) Part 2
- Recorded Lesson (old) Part 3
Practice
How many sequences are stored (in total) in the fastq files associated with Assignment_4?
How many sequences end with the sequence “AT” in each fastq file?
Which fastq file associated with Assignment_4 contains the following sequence:
CCTTCATGCTGTCCTGCAATTACGATAGCATTTCTTTGACGACGAC
For your consideration:
- “Treat data as read-only.” –Vince Buffalo
- Never directly edit any fasta or fastq file! If you have to make edits, redirect them to a new version of the raw file.
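One way that "redirect to a new version" idea might look in practice, as a rough sketch. The file names raw.fasta and cleaned.fasta and the toy sequences are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Sketch: "edit" a fasta file without touching the raw copy.
# raw.fasta and cleaned.fasta are hypothetical names; the sequences are toys.
workdir=$(mktemp -d)
printf '>seq1 sample=A\nACGT\n>seq2 sample=B\nGGCC\n' > "$workdir/raw.fasta"

# Make the raw file read-only so accidental edits fail loudly.
chmod a-w "$workdir/raw.fasta"

# On header lines only, strip everything after the first space, and write
# the result to a NEW file. No in-place editing (no sed -i) of raw data.
sed '/^>/ s/ .*//' "$workdir/raw.fasta" > "$workdir/cleaned.fasta"

cat "$workdir/cleaned.fasta"
```

The raw file is never opened for writing, so if the sed expression turns out to be wrong, you just delete cleaned.fasta and try again.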
Using Existing Tools in the Command Line
Week 5
Combining Unix Skills and Command-Line Software
Topics:
- Interfacing with command-line tools | redirecting stdout and stderr | customizing parameters
Assignments:
- Case study 1 - Using command-line skills to run existing software on many files
- uses chapters: 2,3,10 (for-loop, grep, redirection, flags)
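A small sketch of the general pattern for this week: run one command on many files in a for-loop, sending results (stdout) and error messages (stderr) to separate files. It uses grep as a stand-in for "existing software", and every file name here is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: run one command on many files, keeping stdout and stderr apart.
# grep stands in for a real tool; all file names are hypothetical.
workdir=$(mktemp -d)
printf '>s1\nACGT\n>s2\nTTGA\n' > "$workdir/a.fasta"
printf '>s3\nGGCC\n'            > "$workdir/b.fasta"

# missing.fasta does not exist, to show where error messages end up.
for f in "$workdir"/*.fasta "$workdir/missing.fasta"; do
    out="${f%.fasta}.count"   # results go here
    log="${f%.fasta}.log"     # error messages go here
    # ">" redirects stdout, "2>" redirects stderr
    grep -c '^>' "$f" > "$out" 2> "$log" ||
        echo "problem with $(basename "$f"), see its .log file"
done

cat "$workdir/a.count" "$workdir/b.count"
cat "$workdir/missing.log"
```

Keeping a separate log per input file makes it much easier to figure out which file caused trouble when you are processing hundreds of them.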
Resources
- Eukaryotic 18S regions and priming sites image
- ITSxpress repository
- Installing miniconda
More Powerful Unix Tools
Week 6
Unix Data Tools
Topics:
- Regular expressions (regex)
- sed, grep
- Chaining together links in a ‘pipeline’
- Intro to process substitution (if we have time)
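To give a feel for how these pieces chain together, here is a minimal sketch of a regular expression used with both grep and sed in one pipeline. The toy sequences and file names are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: one regular expression used with grep and sed in a pipeline.
# toy.fasta, marked.txt, and the sequences are invented for illustration.
workdir=$(mktemp -d)
printf '>s1\nAAGGCCGTT\n>s2\nTTTTAAAA\n>s3\nGGCCGGGCCG\n' > "$workdir/toy.fasta"

# 1. grep -v drops header lines (those starting with ">")
# 2. grep -E keeps sequences containing GGCC followed by G or A
# 3. sed -E replaces every match with a visible marker
grep -v '^>' "$workdir/toy.fasta" |
    grep -E 'GGCC[GA]' |
    sed -E 's/GGCC[GA]/[hit]/g' > "$workdir/marked.txt"

cat "$workdir/marked.txt"
```

Each link in the chain does one small job, and you can test the pipeline by running it one stage at a time and looking at the intermediate output.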
Assignments:
- Work through BDS Chapter 7
- Spend 3 hours practicing everything we’ve done so far
- Try putting things together in original ways to get new insights on sequence data
- Ask silly questions like: “What are the counts of Cytosine bases in all seqs that contain the pattern ‘GGCCG’?”
After playing around a bit, I came up with the following for that silly question:
cat Assignment_3_Combined_Files.fasta | grep -B 1 "GGCCG" | seqtk comp | cut -f4
- Playing around like this in freeform is the best way to build your skills between projects.
- Your practice tasks (below) will get you started
Resources
- Introduction to regular expressions video
- Regular expression tester … handy tool
- sed video playlist Definitely worth your time!
Practice
Week 7
Unix Data Tools, Continued
Topics:
- More handy shell programs: cut, paste, sort, uniq, tr, rename, tee, xargs, awk
- Manipulating text data from one format to another
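As a rough sketch of how several of these programs combine, the pipeline below turns a semicolon-delimited file into a tab-separated table of counts. The file toy_taxa.txt and its contents are made up for illustration and are not the Chapter 7 practice files.

```shell
#!/usr/bin/env bash
# Sketch: cut, sort, uniq, and awk chained to reshape text data.
# toy_taxa.txt and phylum_counts.tsv are hypothetical names.
workdir=$(mktemp -d)
printf 'id1;Ascomycota\nid2;Basidiomycota\nid3;Ascomycota\n' > "$workdir/toy_taxa.txt"

# cut  : pull out the second ";"-delimited field
# sort : group identical values together (uniq needs sorted input)
# uniq : -c prefixes each distinct value with its count
# awk  : swap the columns and separate them with a tab
cut -d ';' -f 2 "$workdir/toy_taxa.txt" |
    sort |
    uniq -c |
    awk '{print $2 "\t" $1}' > "$workdir/phylum_counts.tsv"

cat "$workdir/phylum_counts.tsv"
```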
Assignments:
- Continue working through BDS Chapter 7
- Assignment 5 - using ‘awk’ & ‘process substitution’ to interrogate a table
- Assignment 6 - convert between tabular and fasta formatted data | process/command substitution | advanced paste
Resources
Practice
Here’s an awful-looking one-line command that prints out the phylum from each line of Chapter_7_Practice_File_2.txt along with a number sequence next to it showing which line of the file it came from.
It uses both process and command substitution, but essentially, it’s just the paste command pasting together the phylum in the first field and the numbers 1-34 in the second field.
I want you to break it apart, look at each component, and understand why it works!
paste <(cat Chapter_7_Practice_File_2.txt | cut -d ";" -f 2) <(seq $(wc -l Chapter_7_Practice_File_2.txt | cut -d " " -f 1))
If you wanted to use process substitution again to extend this whole command in order to add a header to the output, what would you do? (i.e., add a first row that is “PHYLUM LINE_NUMBER”)
Finding and Retrieving Data
Week 8
Online Repositories and Approaches to Downloading
Topics:
- NCBI / SRA
- Searches, filters, metadata
- Database files and formats
- Documenting data acquisition
- Checksums
- File compression
Assignments:
Work through BDS Chapter 6
Case Study 2 - Reproducibly downloading stuff (BDS p. 120)
- Full documentation
- Checksums
- Markdown README
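A minimal sketch of the checksum part of that workflow. The file is created locally as a stand-in for a real download, and reads.fastq and checksums.sha256 are hypothetical names. It assumes sha256sum from GNU coreutils is available; on a Mac without the GNU tools, shasum -a 256 is the usual equivalent.

```shell
#!/usr/bin/env bash
# Sketch: record and later verify a checksum for a downloaded file.
# reads.fastq is created locally as a stand-in for a real download;
# both file names are hypothetical.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/reads.fastq"

# Record the checksum right after downloading. Running inside the data
# directory keeps the recorded file path relative and portable.
(cd "$workdir" && sha256sum reads.fastq > checksums.sha256)

# Later, or on another machine, confirm the file is unchanged.
(cd "$workdir" && sha256sum -c checksums.sha256)
```

Committing the checksum file alongside your README means anyone re-downloading the data can confirm they got exactly the same bytes you did.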
Resources
- tmux tutorial
- curl vs wget comparison
Practice
Working with Supercomputers
Week 9
Interfacing with Remote Machines
Topics:
- tmux, ssh, public keys
- navigating the HPC
- good HPC citizenship
- SLURM scripts and commands
Assignments:
- Work through BDS Chapter 4 before class this week
- Assignment 7 - build and submit 3 separate jobs on the HPC
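For orientation, here is a bare-bones sketch of what a SLURM batch script can look like. The job name, account, and partition values are placeholders, not real CHPC settings; check the CHPC resources below for the values that apply to you.

```shell
#!/bin/bash
# Sketch of a minimal SLURM batch script. Lines starting with #SBATCH are
# read by the scheduler; plain bash treats them as comments.
# my_account and my_partition are hypothetical placeholders.
#SBATCH --job-name=hello_hpc
#SBATCH --time=00:05:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH --output=hello_hpc_%j.out
#SBATCH --account=my_account
#SBATCH --partition=my_partition

# SLURM sets SLURM_JOB_ID inside a running job; fall back to a label
# so the script also runs outside the scheduler for testing.
job_id="${SLURM_JOB_ID:-not_in_slurm}"
echo "Job $job_id running on $(uname -n)"
```

If this were saved as hello_hpc.sh (a made-up name), you would submit it with sbatch hello_hpc.sh, watch it with squeue -u "$USER", and cancel it with scancel followed by the job ID. In the --output file name, %j is replaced by the job ID.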
Resources
- Video series on the CHPC
- SLURM commands cheat sheet
- SLURM Presentation from CHPC
- How public key encryption actually works
Practice
Week 10
Interfacing with Remote Machines, Continued
Topics:
- Installing other software not found in “modules”
- File transfers
- Customizing your remote workspace
Assignments:
- Assignment 8 - Download and process SRA data on the CHPC
Resources:
sra-toolkit is available as a module on the CHPC, but you’ll need to configure it before first use by running:
vdb-config -i
prefetch instructions
fastq-dump instructions from the Edwards Lab
fasterq-dump has largely replaced fastq-dump, but if you want to use the old-school method:
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip SRR_ID
- FileZilla is a free FTP client that really comes in handy for moving files to and from remote servers
Practice
- See if you can get itsxpress to run
Shell scripts
Week 11
TBD
Topics:
- Let’s use this time to explore topics of interest
- We can also talk about bioinformatics collaborations and your role as a data expert
Assignments:
- TBD
Resources
- Effectively collaborating as a bioinformaticist
- Very cool website that lets you play with BINF tools interactively
Practice
- TBD
Week 12
Bioinformatics Shell Scripting
Topics:
- Turning a workflow into a script
- Bash script parameters ($1 $2 $3 …)
- if, then, else, fi
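A small sketch of positional parameters and conditionals working together. describe_path is a hypothetical helper written for this example. Inside a function, $1 and $# refer to the function's own arguments, the same way they refer to the script's arguments at the top level of a script.

```shell
#!/usr/bin/env bash
# Sketch of positional parameters and if/then/elif/else/fi.
# describe_path is a hypothetical helper, not part of any course script.
describe_path() {
    # $# is the number of arguments; insist on exactly one
    if [ "$#" -ne 1 ]; then
        echo "usage: describe_path PATH" >&2
        return 1
    fi

    # $1 is the first argument
    if [ -d "$1" ]; then
        echo "$1 is a directory"
    elif [ -f "$1" ]; then
        echo "$1 is a regular file"
    else
        echo "$1 does not exist"
    fi
}

describe_path .
describe_path /no/such/path
```

Checking the argument count first and printing a usage message to stderr is a common habit: it stops the script from doing something unexpected when the user forgets an argument.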
Assignments:
Work through BDS Chapter 12
Remember that “create a new project” script you wrote at the beginning of the semester?
- Turn it into an interactive script where the user provides the name of the project
- It should then generate a full project directory structure based on that name
Resources
- Using positional bash arguments in a script
- elif statements in bash scripts
Practice
Build a bash script that can:
- determine the file extension of fasta, fasta.gz, fastq, fastq.gz
- use conditional statements to print the number of sequences in the file, regardless of format (as long as it’s one of those 4)
- this forum exchange might help
Putting it all together
Week 13
Composing Full Pipelines
Topics:
The duct tape of bioinformatics
Good pipelines need:
- Documentation
- Version control
- Validation
Assignments:
- Continue working through BDS Chapter 12
Resources
- Best practices in pipeline building and sharing
Week 14
Running a Pipeline on a Remote Machine
Topics:
- TBD - Depends on class project
Assignments:
- Case Study 3 - TBD - Depends on class project
Resources
Practice
Week 15
TBD - Depends on class project
Topics:
- TBD - Depends on class project
Assignments:
- Case Study 4 - TBD - Depends on class project
Week 16
Where to go from here?
Topics:
- Class discussion
Assignments:
- Assignment 10 - Reflection piece on what you’ve learned and what next steps you’ll take