A few words of wisdom

  • File extensions don’t actually mean anything. A .csv file doesn’t have to be comma-separated values. A .txt file could actually be a gzipped fasta.
  • Use relative file paths whenever possible. Each project has its own directory and file paths start from there.
  • There are usually 50 ways to accomplish the same task. There is no ‘right way’ as long as it works, is consistent, and is reproducible.
  • File/Dir names NEVER have spaces in them. File names should be useful to both you and the code you write. You should use a naming system that can be programmatically accessed.
    • sample_01_R1.fasta is a good name
    • Sample1R1.fasta is a bad name
    • sample 1 forwardreads.fasta is unacceptable and will get you kicked out of class

Bash programs

bash_programs purpose info
pwd where am I? https://ss64.com/bash/pwd.html
ls what stuff is here? https://ss64.com/bash/ls.html
cd change directory <path> https://ss64.com/bash/cd.html
cat stick stuff together and print to screen https://ss64.com/bash/cat.html
touch reset timestamp / make empty file https://ss64.com/bash/touch.html
nano bash text editor https://www.howtogeek.com/howto/42980/the-beginners-guide-to-nano-the-linux-command-line-text-editor/
cp copy <from> <to> https://ss64.com/bash/cp.html
mv move <from> <to> https://ss64.com/bash/mv.html
rm remove FOREVER https://ss64.com/bash/rm.html
cut pick fields (columns) -f field #s -d delimiter https://ss64.com/bash/cut.html
grep find lines in a file that match a pattern https://ss64.com/bash/grep.html
wget pull a file from a URL and save it as a file locally https://linuxize.com/post/wget-command-examples/
sed streaming editor https://www.tutorialspoint.com/sed/sed_quick_guide.htm
mkdir make new directory https://man7.org/linux/man-pages/man1/mkdir.1.html

Project structure

Every project you work on should strive to be self-contained. It should also document itself.

  • README file to explain project motivation, data sources, hypotheses, etc.
  • Workflow file that documents EVERYTHING that you’ve done, from acquiring raw data through all processing/analysis steps
  • Data should live inside the project (if not using cloud solutions)
  • File names should be intuitive and programmatic


Special characters in Bash

See this link for a list of special characters.

If any file name or pattern you’re trying to match contains one of these characters and you want it to be interpreted literally, you need to include a back-slash right in front of the character.

For example, the file Sample 1.fasta has a space. That’s a special character! So to point to this file in a script, you would have to specify Sample\ 1.fasta which interprets the space as literally just a space character instead of its special meaning.

If your search string includes a back-slash for some insane but likely reason, you’d have to escape it as well: File\1.fasta can only be addressed as File\\1.fasta


Manipulating ‘streams’

All terminal information is text-based and flows through 3 main ‘streams.’ These streams have names and numbers:

  • <stdin>
  • <stdout>
  • <stderr>

(You can have other streams, but only need them in special scenarios)

<stdin> The input text. Can come from keyboard entry, a file, or another program. This is what is typically fed into a program. AKA: 0>

<stdout> The output from a program. Typical default is to send it to the terminal screen. AKA: 1>

<stderr> The error info (if it exists). AKA: 2>

Streams can be redirected in various ways to make them more useful.

> This redirects the <stdout> to a file (overwrites any existing file). Equivalent to 1>

>> This redirects the <stdout> to a file (appends to bottom of any existing file). Equivalent to 1>>

| This redirects the <stdout> to a program. You can chain together many programs to process subsequent outputs in turn… a pipeline.

2> This redirects the <stderr> to a file (overwrites)

3> This redirects the <a-third-secret-stream> to a file (overwrites)


Wildcards and pattern matching

These wildcards and special patterns ARE DIFFERENT FROM standard Bash special characters and they are used in Regular Expressions for pattern matching…

symbol meaning
* matches 0 to infinite of ANYTHING
? or . depending on regex flavor matches up to one instance of any character
[1,2,3] matches exactly one instance of 1, 2, OR 3
x{4} the preceding character will be matched exactly 4 times
^ matches BEGINNING OF LINE
[^a] matches any character EXCEPT ‘a’
$ matches END OF LINE
\ ‘escape character’ any special character will be interpreted literally instead of its special meaning

File paths vs $PATH

In Bash, $ denotes ‘the value of a variable’

Thus, you have a variable named PATH, and $PATH is the literal contents of that variable. Your $PATH contains a colon-separated list of paths (locations) in your computer that Bash automatically knows about. Any programs (like ls, grep, etc.) that live somewhere in your $PATH can be executed from anywhere on your computer.

If you want to be able to call up a program like seqtk, once you download it, you need to either move it into an existing PATH directory, or add its location to your $PATH.

If you want to add a new directory to your PATH: See here


Fasta and Fastq files

Fasta (usually denoted by .fna .fa .fasta extension)

These ought to follow a consistent every-other-line format, but in practice sometimes the sequence lines are split over several lines… because life isn’t fair

>name

Sequence

Fastq (usually denoted by .fnq .fq .fastq extension)

These ought to follow a standard every-four-lines pattern.

@name

sequence

+

quality scores

Quality score interpretation:

The fact that sequence header (names) lines start with ‘@’ paired with the fact that ‘@’ is a valid quality score makes counting sequences more difficult than with a fasta file.

Programs we’ve installed

seqtk – This “sequence toolkit” contains a crapload of useful stuff for working with fasta and fastq files

Nifty things

import newimage.png (Gives a screenshot tool)