Working with files and paths in R

Usually, the data you analyze will be sitting in an Excel or CSV file. You will have to find it and import it into R using code. You will probably also need to write things out from R, such as statistical tables, graphs, or cleaned data, into new files. All of these tasks require you to navigate your computer’s storage. Let’s take a look at a few things we can do within R:

This command shows us the “path” to our current working directory. A path is like a set of directions for how to get from the very root of your computer to the folder you’re currently working in.

getwd()
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/gzahn.github.io/data-course/Repository/Code_Examples"

The leading “/” represents the root of my entire filesystem, and each slash after that marks the start of a new subdirectory.
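A minimal sketch of how R lets you take a path apart and put it back together, using the base functions `dirname()`, `basename()`, and `file.path()`. The path below is a made-up example, not a real location on any particular machine.

```r
# Hypothetical path used only for illustration
p <- "/home/user/projects/analysis/data.csv"

dirname(p)    # everything up to the last slash: the containing directory
basename(p)   # the final piece: the file name

# Split the path into its individual pieces
parts <- strsplit(p, "/", fixed = TRUE)[[1]]
parts         # the first element is "" because the path starts at the root "/"

# Rebuild a path from pieces; file.path() inserts the separators for you
rebuilt <- file.path(dirname(p), basename(p))
identical(rebuilt, p)
```

Building paths with `file.path()` rather than pasting strings together by hand tends to make typos in the separators less likely.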

You can list the files in the current working directory with the following:

list.files()[1:10] # just the first 10 to save space
##  [1] "animation.html"             "animation.Rmd"             
##  [3] "assign_letter_grades.R"     "badplot.jpg"               
##  [5] "better_than_excel.R"        "building_basic_models.R"   
##  [7] "cleaning_bird_data.R"       "custom_images_for_Points.R"
##  [9] "dada2_R_example_script.R"   "DNA_packages.R"

That command can be modified as well:

list.files(pattern = "x") # just filenames that have "x" in them
##  [1] "better_than_excel.R"         "dada2_R_example_script.R"   
##  [3] "exam2_review.R"              "Example_day1"               
##  [5] "Example_Project"             "excel_instructions.txt"     
##  [7] "function_example.R"          "handy_bash_aliases.txt"     
##  [9] "md_example.R"                "MeanAD_example.R"           
## [11] "plot_examples.R"             "plot_examples.Rmd"          
## [13] "plots_examples.html"         "plots_examples.Rmd"         
## [15] "ShortRead_package_example.R" "Vegan_Example"

You can search within any directory on your computer by telling list.files() which “path” to search in:

list.files(path = "/home/gzahn/Desktop/Bioinformatics/")
## [1] "ENTREZ_QIIME"                        "ENTREZ_QIIME.zip"                   
## [3] "Fungal_Alignments"                   "install_old_R_Packages.R"           
## [5] "RDP_Training_Set_ITS2_Outgroups.zip"
list.files(path = "/home/gzahn/Desktop/Bioinformatics/",
           recursive = TRUE,
           pattern = "\\.nex$") # pattern is a regular expression: escape the "." and anchor to the end of the name
## [1] "Fungal_Alignments/AF48v6dex3.nex"       
## [2] "Fungal_Alignments/combined214_nuc.nex"  
## [3] "Fungal_Alignments/nuc_5.8S_199_taxa.nex"
## [4] "Fungal_Alignments/nuc_SSU_211_taxa.nex"

Note how “recursive = TRUE” tells it to descend into subdirectories of a given path. Those 4 files live in the “Fungal_Alignments” subdirectory within that path.
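One thing worth knowing is that `pattern` is treated as a regular expression, not a shell-style wildcard, so `.` matches any character and `*` means “repeat the previous thing.” The sketch below creates a few empty files in a temporary folder (the file names are invented for the demonstration) and compares a loose pattern with an anchored one. `glob2rx()` from the utils package converts a wildcard into the equivalent regular expression.

```r
# Work in a throwaway folder so nothing real is touched
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Hypothetical file names, invented for this demonstration
file.create(file.path(demo_dir, c("tree.nex", "annex_notes.txt", "tree.nexus")))

# "." matches ANY character, so this also catches "annex_notes.txt"
loose <- list.files(demo_dir, pattern = ".nex")

# Escape the dot and anchor to the end of the name with "$"
strict <- list.files(demo_dir, pattern = "\\.nex$")

# glob2rx() translates a shell-style wildcard into a regular expression
from_glob <- list.files(demo_dir, pattern = glob2rx("*.nex"))

loose
strict
from_glob

unlink(demo_dir, recursive = TRUE)  # clean up
```

The loose pattern returns all three files, while the escaped and anchored versions return only the file that really ends in ".nex".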

Now a closely related function:

mypath <- "~/Desktop/GIT_REPOSITORIES/Data_Course/Data"
list.dirs(path = mypath, recursive = FALSE)
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/data-shell" 
## [2] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/Fastq_16S"  
## [3] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights"    
## [4] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/Messy_Take2"

You can save this list of directories in case you want to work with it later:

data_directories <- list.dirs(path = mypath, recursive = FALSE)
data_directories[3]
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights"
list.files(path = data_directories[3],full.names = TRUE)
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679884.csv"
## [2] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679921.csv"

You can ask questions about whether files or directories exist in a given location:

file.exists("/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679884.csv")
## [1] TRUE
dir.exists("/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/fights") # misspelled "flights"
## [1] FALSE
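
These checks are most useful as guards in a script. A small sketch, assuming a hypothetical output folder named "results" inside R's temporary directory: create the folder only if it is missing, then write a file into it and confirm it arrived.

```r
# Hypothetical output folder, placed in tempdir() so the example is harmless
out_dir <- file.path(tempdir(), "results")

# Create the folder only if it is not already there
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

out_file <- file.path(out_dir, "summary.csv")

# Write a few rows of a built-in data set, then confirm the file exists
write.csv(head(mtcars), out_file)
file.exists(out_file)

# file.exists() is vectorized: it returns one TRUE/FALSE per path
checks <- file.exists(c(out_file, file.path(out_dir, "no_such_file.csv")))
checks

unlink(out_dir, recursive = TRUE)  # clean up
```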

You can create, modify, and peek inside files as well:

list.files(path = data_directories[3],full.names = TRUE)
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679884.csv"
## [2] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679921.csv"
file.create(file.path(data_directories[3],"testfile")) # Says "TRUE" if it worked
## [1] TRUE
list.files(path = data_directories[3],full.names = TRUE)
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679884.csv"
## [2] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/2679921.csv"
## [3] "/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/testfile"
# be careful using file.remove() ... it's permanent!
file.remove("/home/gzahn/Desktop/GIT_REPOSITORIES/Data_Course/Data/flights/testfile") # Says "TRUE" if it worked
## [1] TRUE

Here are some other functions you should play with:

file.rename()
file.append()
file.copy()
file.size()
readLines()
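
A safe way to experiment with these is inside R's temporary directory, where mistakes cannot damage real data. The following is just one possible walk-through; the folder and file names are invented for the demonstration.

```r
# Everything happens inside a throwaway folder
play_dir <- file.path(tempdir(), "file_play")
dir.create(play_dir, showWarnings = FALSE)

a <- file.path(play_dir, "a.txt")
b <- file.path(play_dir, "b.txt")
writeLines(c("line 1", "line 2"), a)
writeLines("line 3", b)

file.append(a, b)            # add the contents of b.txt to the end of a.txt
contents <- readLines(a)     # read the file back as a character vector
contents

a_copy <- file.path(play_dir, "a_copy.txt")
file.copy(from = a, to = a_copy)                  # duplicate a.txt
same_size <- file.size(a) == file.size(a_copy)    # sizes are reported in bytes

renamed <- file.path(play_dir, "renamed.txt")
file.rename(from = b, to = renamed)   # b.txt no longer exists under its old name
remaining <- list.files(play_dir)
remaining

unlink(play_dir, recursive = TRUE)    # clean up
```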

The most important thing is to make sure you know how file paths work!

Do you know what is going on with the next 4 lines of code?

getwd()
## [1] "/home/gzahn/Desktop/GIT_REPOSITORIES/gzahn.github.io/data-course/Repository/Code_Examples"
list.files()[1:10]
##  [1] "animation.html"             "animation.Rmd"             
##  [3] "assign_letter_grades.R"     "badplot.jpg"               
##  [5] "better_than_excel.R"        "building_basic_models.R"   
##  [7] "cleaning_bird_data.R"       "custom_images_for_Points.R"
##  [9] "dada2_R_example_script.R"   "DNA_packages.R"
list.files(path = "..",full.names = TRUE)
## [1] "../Assignments"   "../Code_Examples" "../Data"          "../Exercises"    
## [5] "../Tools"
list.files(path = "../Assignments")
##  [1] "Assignment_1"         "Assignment_10"        "Assignment_2"        
##  [4] "Assignment_3"         "Assignment_4"         "Assignment_5"        
##  [7] "Assignment_6"         "Assignment_7"         "Assignment_8"        
## [10] "Assignment_9"         "Assignment_DNA_Trees"
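
The ".." in those paths means “the folder one level up from here,” so the results depend on the working directory. A rough sketch of the same idea on a made-up folder tree built in R's temporary directory (the folder and file names are hypothetical); `normalizePath()` converts a relative path into an absolute one.

```r
# Build a small, hypothetical folder tree:  course/Code  and  course/Data
root <- file.path(tempdir(), "course")
dir.create(file.path(root, "Code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Data"), showWarnings = FALSE)
file.create(file.path(root, "Data", "birds.csv"))

# Temporarily make "Code" the working directory
old_wd <- setwd(file.path(root, "Code"))

siblings   <- list.files(path = "..")        # look one level up
data_files <- list.files(path = "../Data")   # up one level, then down into Data
abs_data   <- normalizePath("../Data")       # the same location as an absolute path

siblings
data_files
abs_data

setwd(old_wd)                    # restore the original working directory
unlink(root, recursive = TRUE)   # clean up
```

Relative paths like these keep working when the whole project folder is moved to another computer, whereas absolute paths usually do not.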

This is all VERY useful once you start working with hundreds or thousands of data files for a given project. Say I want to search my entire Desktop, and all the folders inside of it, for fasta DNA sequence files whose names match a pattern:

mypath <- "~/Desktop"
fastas <- list.files(mypath,recursive = TRUE,pattern = "5\\.8S.*\\.fasta$",full.names = TRUE) # any file that has "5.8S" in the name and ends with ".fasta"
fastas
## [1] "/home/gzahn/Desktop/UVU/Journal_Reviews/WNAN_2021/ITS1_all_5.8S.5_8S.fasta"
## [2] "/home/gzahn/Desktop/UVU/Teaching/Courses/Mycology/5.8S.5_8S.fasta"

Since R did all the searching and saved the location of those files, I can have it automatically read them in and work with them. For example:

fna <- ShortRead::readFasta(fastas)
ShortRead::sread(fna)
## DNAStringSet object of length 213:
##       width seq
##   [1]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGGGGGGCATGCCTGTTCGAGCGTCATTG
##   [2]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGGGGGGCATGCCTGTTCGAGCGTCATTA
##   [3]   352 AAACTTTCAACAACGGATCTCTTGGCTCTGGCA...CGGATCAGGTAGGGATACCCGCTGAACTTAAG
##   [4]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGGGGGGCATGCCTGTTCGAGCGTCATTA
##   [5]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGGGGGGCATGCCTGTTCGAGCGTCATTA
##   ...   ... ...
## [209]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGGAGGGCATGCCTGTCCGAGCGTCATTA
## [210]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGGAGGGCATGCCTGTTCGAGCGTCATTA
## [211]   158 AAACTTTCAACAACGGATCTCTTGGTTCTGGCA...TTCCGAAGGGCATGCCTGTTCGAGCGTCATTG
## [212]   158 AAACTTTCAACAACGGATCTCTTGGCTCTGGCA...TTCCGAAGGGCATGCCTGTCCGAGCGTCATTA
## [213]   158 CAACTTTCAACAACGGATCTCTTGGCTCTCGCA...TTCCGGAGGGCATGCCTGTTTGAGTGTCATGT
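
The same find-then-read approach works for any file type. A small sketch with plain CSV files instead of fasta: write two tiny tables to a temporary folder (the file names and contents are invented for the demonstration), find them with `list.files()`, and read them all with `lapply()`.

```r
# Hypothetical data: two small CSV files in a throwaway folder
csv_dir <- file.path(tempdir(), "many_files")
dir.create(csv_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.0)),
          file.path(csv_dir, "site_A.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5, value = c(1.2, 0.7)),
          file.path(csv_dir, "site_B.csv"), row.names = FALSE)

# Find every CSV file, keeping full paths so they can be read from anywhere
csv_paths <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a list of data frames, named after the files
tables <- lapply(csv_paths, read.csv)
names(tables) <- basename(csv_paths)

# Stack them into one data frame
combined <- do.call(rbind, tables)
nrow(combined)

unlink(csv_dir, recursive = TRUE)  # clean up
```

Using `full.names = TRUE` is the important detail here: without it, `list.files()` returns bare file names that `read.csv()` can only find if they happen to sit in the working directory.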

What is the absolute path to your Desktop on your computer?

Can you list all the files there?

Can you navigate around and find files in a different directory?