Here’s a gzipped fasta file.
Download it, unzip it into a sensible directory, and see if you can reproduce the following outputs from it using ‘sed’ and ‘grep’ (along with a few other tools such as cut, sort, and tr).
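A minimal sketch of the unzip step, assuming the download is named `Chapter_7_Practice_File_1.fa.gz` (the `.gz` name is an assumption, inferred from the `.fa` file used in the first practice command) and that `gzip`/`gunzip` are installed. So that the block runs on its own, it first builds a tiny stand-in archive from a made-up record; the directory name `chapter7_practice` is also just illustrative. With the real download, only the `mkdir`, `mv`, and `gunzip` steps apply.

```shell
# Work in a scratch directory so nothing in your real folders is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for the downloaded archive (one made-up record, NOT the course data).
printf '>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Kurthia;\nACGTACGT\n' > Chapter_7_Practice_File_1.fa
gzip Chapter_7_Practice_File_1.fa        # leaves Chapter_7_Practice_File_1.fa.gz

# The actual steps: make a directory for the practice set, move the archive in, unzip it.
mkdir -p chapter7_practice               # hypothetical directory name
mv Chapter_7_Practice_File_1.fa.gz chapter7_practice/
cd chapter7_practice
gunzip Chapter_7_Practice_File_1.fa.gz   # replaces the .gz with the plain .fa

# Quick sanity check: count header lines (those starting with ">").
n_headers=$(grep -c '^>' Chapter_7_Practice_File_1.fa)
echo "$n_headers"
```

Counting the `>` lines right after unzipping is a cheap way to confirm the file is intact before starting on the problems.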
The code for the first practice problem is shown. After that, I’m only showing the desired output.
# I'll show you the line of code for the first one:
grep ";Planococcaceae;." Chapter_7_Practice_File_1.fa | sort -u | cut -d ";" -f 6
Bhargavaea
Caryophanon
Chryseomicrobium
Chungangia
Filibacter
Jeotgalibacillus
Kurthia
Lysinibacillus
Paenisporosarcina
Planococcus
Planomicrobium
Psychrobacillus
Rummeliibacillus
Solibacillus
Sporosarcina
Ureibacillus
Viridibacillus
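To see what each stage of the command above contributes, here is a small sketch that runs the same three filters on a toy file. The five records are made up for illustration (only the taxon names are borrowed from outputs on this page), and the sketch assumes headers consist of nothing but a `>` followed by the semicolon-separated taxonomy.

```shell
# Toy FASTA with made-up sequences (NOT the course data).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Kurthia;
ACGTACGT
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Planococcus;
ACGTTTGT
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Kurthia;
ACGAACGT
>Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Kaistia;
ACGTACGG
>Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;
TTGTACGT
EOF

# Stage 1: grep keeps headers with at least one character after ";Planococcaceae;"
#          (the trailing "." is what drops the record that has no genus).
# Stage 2: sort -u sorts the surviving headers and collapses identical lines.
# Stage 3: cut splits each header on ";" and prints field 6, the genus.
genera=$(grep ";Planococcaceae;." "$fasta" | sort -u | cut -d ";" -f 6)
echo "$genera"
```

On this toy input the pipeline prints Kurthia and Planococcus: the duplicate Kurthia header is collapsed, the Rhizobiaceae record is filtered out, and the header with no genus never matches.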
1
Acetothermia
Acidobacteria
Actinobacteria
Aminicenantes
Aquificae
Armatimonadetes
Atribacteria
5533
AUUGAACGCUGGCGGCAUGCCUUACACAUGCAAGUCGAACGGUAACGCGGGGCAACCUGGCGACGAGUGGCGAACGGGUGAGUAAUGCAUCGGAACGUGCCCAGAAGUGGGGGAUAGCCCGGCGAAAGCCGGAUUAAUACCGCAUGAGACCUGAGGGUGAAAGCGGGGGAUCGCAAGACCUCGCGCUUUUGGAGCGGCCGAUGUCAGAUUAGGUAGUUGGUGGGGUAAAGGCCUACCAAGCCGACGAUCUGUAGCUGGUCUGAGAGGACGACCAGCCACACUGGGACUGAGACACGGCCCAGACUCCUACGGGAGGCAGCAGUGGGGAAUUUUGGACAAUGGGCGCAAGCCUGAUCCAGCCAUGCCGCGUGCGGGAAGAAGGCCUUCGGGUUGUAAACCGCUUUUGUCGGGGAAGAAAAGCUCUGGGUUAAUACCCUGGGGUGAUGACGGUACCCGAAGAAUAAGCACCGGCUAACUACGUGCCAGCAGCCGCGGUAAUACGUAGGGUGCAAGCGUUAAUCGGAAUUACUGGGCGUAAAGCGUGCGCAGGCGGUUGUGUAAGACAGAUGUGAAAUCCCCGGGCUCAACCUGGGAACUGCAUUUGUGACUGCACAGCUGGAGUGCGGCAGAGGGGGAUGGAAUUCCGCGUGUAGCAGUGAAAUGCGUAGAUAUGCGGAGGAACACCGAUGGCGAAGGCAAUCCCCUGGGCCUGCACUGACGCUCAUGCACGAAAGCGUGGGGAGCAAACAGGAUUAGAUACCCUGGUAGUCCACGCCCUAAACGAUGUCAACUGGUUGUUGGGAGGGUUUCUUCUCAGUAACGAAGCUAACGCGUGAAGUUGACCGCCUGGGGAGUACGGCCGCAAGGUUGAAACUCAAAGGAAUUGACGGGGACCCGCACAAGCGGUGGAUGAUGUGGUUUAAUUCGAUGCAACGCGAAAAACCUUACCUACCCUUGACAUGCCAGGAAUCCUGCAGAGAUGUGGGAGUGCUCGAAAGAGAGCCUGGACACAGGUGCUGCAUGGCCGUCGUCAGCUCGUGUCGUGAGAUGUUGGGUUAAGUCCCGCAACGAGCGCAACCCUUGUCAUUAGUUGCUACGAAAGGGCACUCUAAUGAGACUGCCGGUGACAAACCGGAGGAAGGUGGGGAUGACGUCAGGUCCUCAUGGCCCUUAUGGGUAGGGCUACACACGUCAUACAAUGGCCGGUACAGAGGGCUGCCAACCCGCGAGGGGGAGCUAAUCCCAGAAAACCGGUCGUAGUCCGGAUCGCAGUCUGCAACUCGACUGCGUGAAGUCGGAAUCGCUAGUAAUCGCGGAUCAGCUUGCCGCGGUGAAUACGUUCCCGGGUCUUGUACACACCGCCCGUCACACCAUGGGAGCGGGUUCUGCCAGAAGUAGUUAGCCUAACCGCAAGGAGGGCGAUUACCACGGCAGGGUUCGUGACUGGGGUG
Bacteria;Firmicutes;Clostridia;Clostridiales;Natranaerovirga;
Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Verminephrobacter;
Bacteria;Verrucomicrobia;Verrucomicrobiae;Verrucomicrobiales;Verrucomicrobiaceae;Prosthecobacter;
Bacteria;Cyanobacteria/Chloroplast;Chloroplast;Chloroplast;Chlorophyta;
Bacteria;Armatimonadetes;Chthonomonadetes;Chthonomonadales;Chthonomonadaceae;Chthonomonas/Armatimonadetes_gp3;
Bacteria;Chloroflexi;Thermoflexia;Thermoflexales;Thermoflexaceae;Thermoflexus;
Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Kaistia;
Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;
Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Rhodospirillaceae;Roseospira;
Bacteria;Cyanobacteria/Chloroplast;Cyanobacteria;Family_I;GpI;
Too big to print the whole thing, but here’s a sample:
GAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGTCTCTTCGGAGATACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTAATCT---CTGCACATCGGGATAA---TGGGAAACTGGGTCTAATACCGAATAGGACCTCGAGGCGCAT---TTGTGGTGGAAAGCTTTTGCGGTGTGGGATGG---CGCG---TATCAGCTTGTTGGTGGGGTGACG---TACCAAGGCGACGACGGGTA---G---TGAGAGGGTGTCCG---ACACTGGGACTGAGATACG---CAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAA---TGATGCAGCGAC---GCGTGGGGGATGACGGNCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGCAAGTGACGGTACCTGCAGAAGAAGCACCG---AACTACGT---AGCA---GCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTTGTTCGTGAAAACCGGGGGCTTAACCCTCGGCGTGCGGGCGATACGGGCAGACTGGAGTACTGCAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCAC---GTAAACGGTGGGTACTAGGTGTGGGTTTCCTTCCTTGGGATCCGT---GTAGCTAACGCATTAAGTACCCC---TGGGGAGTACG---GCAAGGCTAAAACTCAAAGGAATTGACGGGG---CGCACAAGCGGCGGAGCATGTGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGTTTGACATGCACAGGAC---GGCAGAGATGTCGGTTCCCTTGTG---TGTGTGCAGGTGGTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCTCATGTT---AGCGGGTAAT---GGGGACTCGTGAGAGACT---GGGGTCAACTCGGAGGAAGGTGGGGATGACGTCAAGTCATCAT---CCTTATGTCCAGGGCTTCACACATGCTACAATG---GGTACAAAGGGCTGCGAT---GCAAGGTTAAGCGAATCCTTTTAAA---GGTCTCAGTTCGGATCGGGGTCTGCAACTCGACCCCGTGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGG---TTGTACACACC---CGTCACGTCATGAAAGTCGGTAACACCCGAA---AGTG---TAACCTTTGGGAGGGAGCTGTCGAAGGTGGGATCGGCGATTGGGACGAAGTCGTGTTTGATCCTGGCTCAGATTGAACGCTGGCGGCAG---TAACACATGCAAGTCGAGCGGAAACGACACTAACAATCCTTCGGGTGCGTTAATGGGCGTCGAGCGGCGGACGGGTGAGTAAT---TAGGAAATT---TTGATGTGGGGGATAACCATTGGAAACGATGGCTAATACCGCATAAT---TACGG---AAAGAGGGGGACCTTCGG---TCTC...........
To download this file, follow that link, then right-click and “Save as” a txt file. Or, if you want a sneak peek at some of next week’s material, run the following line of code in your terminal from your working directory for this practice set:
wget http://gzahn.github.io/binf-data-skills/Data/Chapter_7_Practice_File_2.txt
Note that your code should operate on the file itself; you shouldn’t be opening it and copy-pasting from it.
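As a rough illustration of what working on the file itself can look like: each command takes the file name as an argument, and results are redirected into a new file instead of being pasted by hand. The stand-in file, its three made-up records, and the output name `unique_headers.txt` are all hypothetical, so the block runs without a download; with the real data you would point the commands at `Chapter_7_Practice_File_2.txt`.

```shell
workdir=$(mktemp -d)
infile="$workdir/standin_practice_file_2.txt"   # hypothetical stand-in, NOT the course data

printf '%s\n' \
  '>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;' \
  'ACGTACGT' \
  '>Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;' \
  'TTGTACGT' \
  '>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;' \
  'ACGAACGT' > "$infile"

# Commands read the file by name...
n_records=$(grep -c '^>' "$infile")

# ...and results go to a new file via redirection instead of copy-paste.
grep '^>' "$infile" | sort -u > "$workdir/unique_headers.txt"
n_unique=$(wc -l < "$workdir/unique_headers.txt" | tr -d ' ')

echo "$n_records records, $n_unique unique headers"
```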
Too big to print the whole thing, but here’s a sample of the output:
>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;
TCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGGAAAGGCCTCAGCTTTTGTTGGGGTGCTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCTCGTACTTCGGGATAAGCTTGGGAAACTGGGTCTAATACCGGATAGGACCATCATTTAGTGTTGGTGGTGGAAAGTTTTTTCGGTACGAGATGAGCCCGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGTCGACGGGTAGCCGGCCTGAGAGGGTGGACGGCCACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGACAGGGACGAAGCTTTTGTGACGGTACCTGTATAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCGTAGGTGGTTTGTCGCGTCGTCTGTGAAATACCGGGGCTTAACTCCGGAGCTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGGAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTAACTGACGCTGAGGAGCGAAAGCATGGGGAGCGAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGGTGGGCGCTAGGTGTAGGGGTCTTCCACGACTTCTGTGCCGTAGCTAACGCATTAAGCGCCCCGCCTGGGGNGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGCTTGACATATACAGGACGGCTGCAGAGATGTAGTTTCCCTTGTGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCTTATGTTGCCAGCACGTTATGGTGGGGACTCATGAGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCCAGGGCTTCACACATGCTACAATGGTCGGTACAACGCGTTGCCAGCCCGTGAGGGTGAGCGAATCGCTGAAAGCCGGCCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACGTCATGAAAGTTGGTAACACCCGAAGCCAGTGGCCTAACCTTTTTTGGG
--
>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;
TGCTTAACACATGCAAGTCGAACGGAAAGGCCTTGTGCTTGCACAAGGTACTCGAGTGGCGAACGGGTGAGTAACACGTGGGTGATCTGCCCTGCACTGTGGGATAAGCCTGGGAAACTGGGTCTAATACCATATAGGACCGCATCTTGGATGGTGTGGTGGAAAGCTTTTGCGGTGTGGGATGAGCCTGCGGCCTATCAGCTTGTTGGTGGGGTAATGGCCTACCAAGGCGGCGACGGGTATCCGGCCTGAGAGGGTGTACGGACACATTGGGACTGAGATACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCGACGCCGCGTGGGGGATGAAGGCCTTCGGGTTGTAAACTCCTTTCGCTATCGACGAAGCCTTCGGGTGACGGTAGGTAGATAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGCGTAAAGAGCTCCTAGGTGGTTTGTCGCGTCGTCTGTGAAATCCCGGGGCTTAACTTCGGGCGTGCAGGCGATACGGGCATAACTTGAGTGCTGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGATGGCGAAGGCAGGTCTCTGGGCAGTTACTGACGCTGAGGAGCGAAAGCATGGGTAGCGAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGGTGGGCGCTAGGTGTAGGGGGCTTCCACGTCTTCTGTGCCGTAGCTAACGCATTAAGCGCCCCGCCTGGGGAGTACGGCCGCAAGGCTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGCGGAGCATGTGGATTAATTCGATGCAACGCGAAGAACCTTACCTGGGCTTGACATATACAGGATCGGGCTAGAGATAGTCTTTCCCTTGTGGTCTGTATACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCTTATGTTGCCAGCACGTTATGGTGGGAACTCATGAGAGACTGCCGGGGTTAACTCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGTCCAGGGCTTCACACATGCTACAATGGTCGATACAGTGGGCAGCGACATCGTAAGGTGGAGCGAATCCCTGAAAGTCGGCCTTAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAGTCGCTAGTAATCGCAGATCAGCAACGCTGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACGTCATGAAAGTTGGTAACACCCGAAGCCAGTGGCCTAAACTTGTTAGGGAGCTGTCGAAGGTGGGATCGGCGATTGGGACGAAGTCGTAACAAGGTAGCC
……
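The `--` line in the sample above looks like the group separator that `grep` prints between non-adjacent matches when a context option such as `-A` is used. Whether the intended solution uses `-A` is an assumption; the sketch below only shows where such a separator comes from. It runs on a made-up three-record file and relies on each sequence sitting on a single line.

```shell
# Toy file with made-up sequences (NOT the course data).
fasta=$(mktemp)
cat > "$fasta" <<'EOF'
>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;
ACGTACGT
>Bacteria;Proteobacteria;Alphaproteobacteria;Sphingomonadales;Sphingomonadaceae;Sphingobium;
TTGTACGT
>Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Corynebacteriaceae;Corynebacterium;
ACGAACGT
EOF

# -A 1 prints each matching header plus the one line after it (the sequence).
# The two matches are not adjacent in the file, so grep puts "--" between the groups.
hits=$(grep -A 1 ";Corynebacterium;" "$fasta")
printf '%s\n' "$hits"
```

If the separator is unwanted downstream, one option is to filter it out afterwards with `grep -v '^--$'`. That is safe here because neither a header nor a sequence line consists of exactly two dashes.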