Chapter 7 Practice:

grep & sed


Here’s a gzipped fasta file.

Download it, unzip it into a sensible working directory, and see if you can produce each of the outputs below using ‘sed’ and ‘grep’ (along with a few other tools such as cut, sort, and tr).

The code for the first practice problem is shown. After that, I’m only showing the desired output.


1. A sorted list of the genera found in “Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;”
# I'll show you the line of code for the first one:
cat Chapter_7_Practice_File_1.fa | grep ";Planococcaceae;." | sort -u | cut -d ";" -f 6

Bhargavaea
Caryophanon
Chryseomicrobium
Chungangia
Filibacter
Jeotgalibacillus
Kurthia
Lysinibacillus
Paenisporosarcina
Planococcus
Planomicrobium
Psychrobacillus
Rummeliibacillus
Solibacillus
Sporosarcina
Ureibacillus
Viridibacillus
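
To see what each stage of the pipeline in problem 1 contributes, it can help to run it piece by piece on a small file. The sketch below uses a made-up file, `toy.fa` (a hypothetical name with invented sequences, not the real practice file), and assumes the same header layout of semicolon-separated ranks ending in the genus.

```shell
# Build a tiny FASTA file (hypothetical toy data, not the real practice file).
cat > toy.fa <<'EOF'
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Kurthia;
ACGTACGT
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Bhargavaea;
GGGTTTAA
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Kurthia;
ACGTTTTT
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;
CCCCGGGG
>Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;
TTTTAAAA
EOF

# Stage 1: keep headers in the family that also name a genus.
# The trailing "." requires at least one character after ";Planococcaceae;",
# so the header that stops at the family level is dropped.
grep ";Planococcaceae;." toy.fa

# Stage 2: sort the matching headers and collapse identical ones.
grep ";Planococcaceae;." toy.fa | sort -u

# Stage 3: keep only the 6th semicolon-delimited field (the genus).
genera=$(grep ";Planococcaceae;." toy.fa | sort -u | cut -d ";" -f 6)
echo "$genera"
```

On this toy input the last command prints Bhargavaea and Kurthia, one per line. Field 6 is the genus because the leading ">Bacteria" counts as field 1.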
2. A count of the number of times the genus “Aquincola” turns up in this data set
1
3. An alphabetical list of all the phyla that begin with the letter “A”
Acetothermia
Acidobacteria
Actinobacteria
Aminicenantes
Aquificae
Armatimonadetes
Atribacteria
4. How many DNA sequences in this file contain “AAAAA”?
5533
5. Return just the DNA sequence for “Aquincola” with T’s replaced by U’s.
AUUGAACGCUGGCGGCAUGCCUUACACAUGCAAGUCGAACGGUAACGCGGGGCAACCUGGCGACGAGUGGCGAACGGGUGAGUAAUGCAUCGGAACGUGCCCAGAAGUGGGGGAUAGCCCGGCGAAAGCCGGAUUAAUACCGCAUGAGACCUGAGGGUGAAAGCGGGGGAUCGCAAGACCUCGCGCUUUUGGAGCGGCCGAUGUCAGAUUAGGUAGUUGGUGGGGUAAAGGCCUACCAAGCCGACGAUCUGUAGCUGGUCUGAGAGGACGACCAGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGGGAGGCAGCAGUGGGGAAUUUUGGACAAUGGGCGCAAGCCUGAUCCAGCCAUGCCGCGUGCGGGAAGAAGGCCUUCGGGUUGUAAACCGCUUUUGUCGGGGAAGAAAAGCUCUGGGUUAAUACCCUGGGGUGAUGACGGUACCCGAAGAAUAAGCACCGGCUAACUACGUGCCAGCAGCCGCGGUAAUACGUAGGGUGCAAGCGUUAAUCGGAAUUACUGGGCGUAAAGCGUGCGCAGGCGGUUGUGUAAGACAGAUGUGAAAUCCCCGGGCUCAACCUGGGAACUGCAUUUGUGACUGCACAGCUGGAGUGCGGCAGAGGGGGAUGGAAUUCCGCGUGUAGCAGUGAAAUGCGUAGAUAUGCGGAGGAACACCGAUGGCGAAGGCAAUCCCCUGGGCCUGCACUGACGCUCAUGCACGAAAGCGUGGGGAGCAAACAGGAUUAGAUACCCUGGUAGUCCACGCCCUAAACGAUGUCAACUGGUUGUUGGGAGGGUUUCUUCUCAGUAACGAAGCUAACGCGUGAAGUUGACCGCCUGGGGAGUACGGCCGCAAGGUUGAAACUCAAAGGAAUUGACGGGGACCCGCACAAGCGGUGGAUGAUGUGGUUUAAUUCGAUGCAACGCGAAAAACCUUACCUACCCUUGACAUGCCAGGAAUCCUGCAGAGAUGUGGGAGUGCUCGAAAGAGAGCCUGGACACAGGUGCUGCAUGGCCGUCGUCAGCUCGUGUCGUGAGAUGUUGGGUUAAGUCCCGCAACGAGCGCAACCCUUGUCAUUAGUUGCUACGAAAGGGCACUCUAAUGAGACUGCCGGUGACAAACCGGAGGAAGGUGGGGAUGACGUCAGGUCCUCAUGGCCCUUAUGGGUAGGGCUACACACGUCAUACAAUGGCCGGUACAGAGGGCUGCCAACCCGCGAGGGGGAGCUAAUCCCAGAAAACCGGUCGUAGUCCGGAUCGCAGUCUGCAACUCGACUGCGUGAAGUCGGAAUCGCUAGUAAUCGCGGAUCAGCUUGCCGCGGUGAAUACGUUCCCGGGUCUUGUACACACCGCCCGUCACACCAUGGGAGCGGGUUCUGCCAGAAGUAGUUAGCCUAACCGCAAGGAGGGCGAUUACCACGGCAGGGUUCGUGACUGGGGUG
6. Which taxa contain the DNA sequence “TGTTGGGTTAAGTCCCCC”?
Bacteria;Firmicutes;Clostridia;Clostridiales;Natranaerovirga;
Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Verminephrobacter;
Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Prosthecobacter;
7. Which taxa have DNA sequences that start with “AAAAA”?
Bacteria;Cyanobacteria/Chloroplast;Chloroplast;Chloroplast;Chlorophyta;
Bacteria;Armatimonadetes;Chthonomonadetes;Chthonomonadales;Chthonomonadaceae;Chthonomonas/Armatimonadetes_gp3;
8. Which taxa have DNA sequences that end with “AAAAA”?
Bacteria;Chloroflexi;Thermoflexia;Thermoflexales;Thermoflexaceae;Thermoflexus;
Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Kaistia;
Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;
Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Roseospira;
Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family_I;GpI;
9. Combine all the DNA sequences into one long sequence stored on a single line and replace every instance of “GCC” with “---”

Too big to print the whole thing, but here’s a sample:

GAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGTCTCTTCGGAGATACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTAATCT---CTGCACATCGGGATAA---TGGGAAACTGGGTCTAATACCGAATAGGACCTCGAGGCGCAT---TTGTGGTGGAAAGCTTTTGCGGTGTGGGATGG---CGCG---TATCAGCTTGTTGGTGGGGTGACG---TACCAAGGCGACGACGGGTA---G---TGAGAGGGTGTCCG---ACACTGGGACTGAGATACG---CAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAA---TGATGCAGCGAC---GCGTGGGGGATGACGGNCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACCTGCAGAAGAAGCACCG---AACTACGT---AGCA---GCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAAACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCAC---GTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGGGATCCGT---GTAGCTAACGCATTAAGTACCCC---TGGGGAGTACG---GCAAGGCTAAAACTCAAAGGAATTGACGGGG---CGCACAAGCGGCGGAGCATGTGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGAC---GGCAGAGATGTCGGTTCCCTTGTG---TGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTT---AGCGGGTAAT---GGGGACTCGTGAGAGACT---GGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCAT---CCTTATGTCCAGGGCTTCACACATGCTACAATG---GGTACAAAGGGCTGCGAT---GCAAGGTTAAGCGAATCCTTTTAAA---GGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCCCGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG---TTGTACACACC---CGTCACGTCATGAAAGTCGGTAACACCCGAA---AGTG---TAACCTTTGGGAGGGAGCTGTCGAAGGTGGGATCGGCGATTGGGACGAAGTCGTGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAG---TAACACATGCAAGTCGAGCGGAAACGACACTAACAATCCTTCGGGTGCGTTAATGGGCGTCGAGCGGCGGACGGGTGAGTAAT---TAGGAAATT---TTGATGTGGGGGATAACCATTGGAAACGATGGCTAATACCGCATAAT---TACGG---AAAGAGGGGGACCTTCGG---TCTC...........
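
Problem 9 chains two steps, and the order they run in can change the result. As a rough illustration on invented data (the file name `toy_join.fa` and its contents are hypothetical), the sketch below joins the sequence lines first and substitutes second, then does it the other way around for comparison. It assumes each sequence sits on a single line under its header.

```shell
# Hypothetical toy data: the first sequence ends in "G", the second starts with "CC".
cat > toy_join.fa <<'EOF'
>Taxon_A;
AAGCCTTG
>Taxon_B;
CCAAGCCG
EOF

# Drop header lines, delete every newline, then substitute on the joined line.
joined_first=$(grep -v "^>" toy_join.fa | tr -d '\n' | sed 's/GCC/---/g')

# Substitute line by line, then join: a "GCC" that spans two lines is missed.
replaced_first=$(grep -v "^>" toy_join.fa | sed 's/GCC/---/g' | tr -d '\n')

echo "$joined_first"    # AA---TT---AA---G
echo "$replaced_first"  # AA---TTGCCAA---G
```

The two results differ only where a “GCC” straddles the boundary between two sequences, so it is worth deciding which behavior you want before running this on the full file.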
10. Here’s another file. Download it for this last practice task. Write some code that finds each of the taxa listed in this new file, along with their DNA sequences, in the first file. To clarify: the new file is just a list of taxa, and I want you to use it to pull all the corresponding DNA entries out of the first file.

To download this file, follow the link, then right-click and “Save as” a txt file. Or, if you want a sneak peek at some of next week’s material, run the following line of code in your terminal while in your working directory for this practice set:

wget http://gzahn.github.io/binf-data-skills/Data/Chapter_7_Practice_File_2.txt

Note that your code should read the file itself; you shouldn’t be opening it and copy-pasting its contents.
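
One way to let a command read its search terms from a file is grep’s `-f` option. The sketch below is only an illustration on invented data: `toy_db.fa` and `toy_taxa.txt` are hypothetical files, and it assumes one taxon per line in the list, no blank lines, and each sequence on a single line under its header. The real files may need some cleanup first (a blank line in the pattern file, for example, would match every line).

```shell
# Hypothetical stand-in for the FASTA file.
cat > toy_db.fa <<'EOF'
>Bacteria;PhylumA;GenusOne;
ACGTACGT
>Bacteria;PhylumB;GenusTwo;
GGGGCCCC
>Bacteria;PhylumC;GenusThree;
TTTTAAAA
EOF

# Hypothetical stand-in for the list of taxa.
cat > toy_taxa.txt <<'EOF'
Bacteria;PhylumA;GenusOne;
Bacteria;PhylumC;GenusThree;
EOF

# -f reads one pattern per line from toy_taxa.txt,
# -F treats each pattern as a fixed string rather than a regular expression,
# -A 1 also prints the line after each match (the sequence).
hits=$(grep -A 1 -F -f toy_taxa.txt toy_db.fa)
echo "$hits"
```

When matches are not adjacent, grep’s context options print a `--` line between groups, which resembles the separator in the sample output below. The `-A` option is not part of POSIX grep, although GNU and BSD grep both provide it.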

Too big to print the whole thing, but here’s a sample of the output:

>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;
TCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTCAGCTTTTGTTGGGGTGCTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCTCGTACTTCGGGATAAGCTTGGGAAACTGGGTCTAATACCGGATAGGACCATCATTTAGTGTTGGTGGTGGAAAGTTTTTTCGGTACGAGATGAGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGGACGGCCACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGACAGGGACGAAGCTTTTGTGACGGTACCTGTATAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATACCGGGGCTTAACTCCGGAGCTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGGTGGGCGCTAGGTGTAGGGGTCTTCCACGACTTCTGTGCCGTAGCTAACGCATTAAGCGCCCCGCCTGGGGNGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGCTTGACATATACAGGACGGCTGCAGAGATGTAGTTTCCCTTGTGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCTTATGTTGCCAGCACGTTATGGTGGGGACTCATGAGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCCAGGGCTTCACACATGCTACAATGGTCGGTACAACGCGTTGCCAGCCCGTGAGGGTGAGCGAATCGCTGAAAGCCGGCCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACGTCATGAAAGTTGGTAACACCCGAAGCCAGTGGCCTAACCTTTTTTGGG
--
>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;
TGCTTAACACATGCAAGTCGAACGGAAAGGCCTTGTGCTTGCACAAGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTGTGGGATAAGCCTGGGAAACTGGGTCTAATACCATATAGGACCGCATCTTGGATGGTGTGGTGGAAAGCTTTTGCGGTGTGGGATGAGCCTGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGGCGACGGGTATCCGGCCTGAGAGGGTGTACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGAAGGCCTTCGGGTTGTAAACTCCTTTCGCTATCGACGAAGCCTTCGGGTGACGGTAGGTAGATAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCCTAGGTGGTTTGTCGCGTCGTCTGTGAAATCCCGGGGCTTAACTTCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGGTGGGCGCTAGGTGTAGGGGGCTTCCACGTCTTCTGTGCCGTAGCTAACGCATTAAGCGCCCCGCCTGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGCTTGACATATACAGGATCGGGCTAGAGATAGTCTTTCCCTTGTGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCTTATGTTGCCAGCACGTTATGGTGGGAACTCATGAGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCCAGGGCTTCACACATGCTACAATGGTCGATACAGTGGGCAGCGACATCGTAAGGTGGAGCGAATCCCTGAAAGTCGGCCTTAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACGTCATGAAAGTTGGTAACACCCGAAGCCAGTGGCCTAAACTTGTTAGGGAGCTGTCGAAGGTGGGATCGGCGATTGGGACGAAGTCGTAACAAGGTAGCC

……