Regular expressions are a powerful tool to search text files and to extract and edit elements in those files. They can make quick work of formatting problems and highly repetitive tasks. If you find yourself repetitively clicking, cutting, pasting, deleting, etc. to clean up a problematic file, you might be able to solve your problem quickly with a regular expression.
First we will work with regular expressions in our text editor. I will be using Notepad++.
\w  a single word character [letter, number, or _]
\d  a single number character [0-9]
\t  a single tab space
\s  a single space, tab, or line break
\n  a single line break (or try \r)
With wildcards we can quickly swap spaces for line breaks, which can be useful when tidying a list.
INPUT: larry curly moe
SEARCH: \s
REPLACE: \n
RESULT:
larry
curly
moe
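The same swap can be tried from a UNIX shell. This is only a sketch: it assumes GNU sed (the version that ships with Git Bash and most Linux systems), which accepts \s in the search and \n in the replacement. The variable names are made up for illustration.

```bash
# Sample input from the example above
stooges='larry curly moe'

# Replace every whitespace character with a line break (GNU sed assumed)
result=$(printf '%s\n' "$stooges" | sed 's/\s/\n/g')

printf '%s\n' "$result"
```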
Add these on to any of the wildcards:
\w+  one or more consecutive word characters
\w*  zero or more consecutive word characters
\w{3}  exactly 3 consecutive word characters
\w{3,}  3 or more consecutive word characters
\w{3,5}  3, 4, or 5 consecutive word characters
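To see how these differ, each one can be tested against the same short word list. A minimal sketch, assuming GNU grep (for \w); the word list is invented, and -x is used so that a line is kept only when the whole line matches.

```bash
# Invented test words, one per line
words=$(printf 'a\nabc\nabcd\nabcdef')

# -E allows {n,m} without backslashes; -x keeps a line only if all of it matches
one_or_more=$(printf '%s\n' "$words" | grep -Ex '\w+')     # all four words
exactly_3=$(printf '%s\n' "$words" | grep -Ex '\w{3}')     # abc
three_plus=$(printf '%s\n' "$words" | grep -Ex '\w{3,}')   # abc, abcd, abcdef
three_to_5=$(printf '%s\n' "$words" | grep -Ex '\w{3,5}')  # abc, abcd

printf '%s\n' "$three_to_5"
```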
INPUT: Phenotypic plasticity, the ability of a single genotype
to express different phenotypes depending on envi-
ronmental conditions, is a key determinant of organ-
ismal performance and greatly influences ecological
interactions. Its role in evolution is also increasingly
acknowledged (Bradshaw, 1965; Schlichting & Pig-
liucci, 1998; Agrawal, 2001; Pigliucci, 2001; West-
Eberhard, 2003; DeWitt & Scheiner, 2004; Van Snick
Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfen-
nig et al., 2010).
SEARCH: \n #find the line breaks. On my system I had to remove both \n and \r
REPLACE: (nothing)
RESULT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on envi- ronmental conditions, is a key determinant of organ- ismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged (Bradshaw, 1965; Schlichting & Pig- liucci, 1998; Agrawal, 2001; Pigliucci, 2001; West- Eberhard, 2003; DeWitt & Scheiner, 2004; Van Snick Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfen- nig et al., 2010).
This still needs work. The word spacing is variable and there are stray “-” characters in the text.
INPUT: from above
SEARCH: -\s+ #this will find the "-" and the spaces that follow
REPLACE: (nothing) # this will delete
RESULT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on environmental conditions, is a key determinant of organismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged (Bradshaw, 1965; Schlichting & Pigliucci, 1998; Agrawal, 2001; Pigliucci, 2001; WestEberhard, 2003; DeWitt & Scheiner, 2004; Van Snick Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfennig et al., 2010).
Now we just need to fix the word spacing.
INPUT: from above
SEARCH: \s+
REPLACE: " " #input a single space
RESULT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on environmental conditions, is a key determinant of organismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged (Bradshaw, 1965; Schlichting & Pigliucci, 1998; Agrawal, 2001; Pigliucci, 2001; WestEberhard, 2003; DeWitt & Scheiner, 2004; Van Snick Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfennig et al., 2010).
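The three passes above can be chained in a shell pipeline. This is a rough sketch rather than a description of what the editor does internally: it assumes GNU sed, uses tr to turn line breaks into spaces (so that words on adjacent lines do not run together), and works on a shortened copy of the passage with one double space added for illustration.

```bash
# Shortened copy of the passage; the double space after "key" is added for illustration
text='to express different phenotypes depending on envi-
ronmental conditions, is a key  determinant of organ-
ismal performance'

# 1) line breaks -> spaces, 2) delete "-" plus following whitespace,
# 3) collapse each run of whitespace into one space
clean=$(printf '%s' "$text" | tr '\n' ' ' | sed -E 's/-\s+//g; s/\s+/ /g')

printf '%s\n' "$clean"
```

As in the editor example, a genuine hyphen at the end of a line (West-Eberhard) is deleted too, so the result still needs a read-through.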
The * quantifier
INPUT: vanilla , chocolate, strawberry , rocky road, mint chocolate chip
SEARCH: \s*,\s*
REPLACE: , #note the replacement is "," and a space " "
RESULT: vanilla, chocolate, strawberry, rocky road, mint chocolate chip
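A shell version of this clean-up, as a sketch assuming GNU sed. The uneven spacing in the sample string is my own, added so that each case (space before, after, both, or neither) appears at least once.

```bash
# Invented spacing around the commas
flavors='vanilla , chocolate,  strawberry ,rocky road, mint chocolate chip'

# \s* matches zero or more whitespace characters on either side of each comma
tidy=$(printf '%s\n' "$flavors" | sed -E 's/\s*,\s*/, /g')

printf '%s\n' "$tidy"
```

Because * also matches zero characters, a comma with no space around it is still found and given exactly one space after it.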
.* for “all the rest”
INPUT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on environmental conditions, is a key determinant of organismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged (Bradshaw, 1965; Schlichting & Pigliucci, 1998; Agrawal, 2001; Pigliucci, 2001; WestEberhard, 2003; DeWitt & Scheiner, 2004; Van Snick Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfennig et al., 2010).
SEARCH: \s+\(.*\) #note that "(" and ")" are regex operators, so we escape them with a "\" in front to match the literal characters
REPLACE: (nothing)
RESULT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on environmental conditions, is a key determinant of organismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged.
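One thing worth knowing about .* is that it is greedy: it takes as much as it can. The sketch below (assuming GNU sed with -E, where \( and \) are literal parentheses as in the editor) repeats the deletion on a shortened sentence, then uses an invented string with two parentheticals to show the difference between .* and the more cautious [^)]*.

```bash
# Shortened version of the last sentence of the passage
sentence='Its role in evolution is also increasingly acknowledged (Bradshaw, 1965; Pfennig et al., 2010).'

# With -E, \( and \) match literal parentheses; .* takes everything between them
trimmed=$(printf '%s\n' "$sentence" | sed -E 's/\s+\(.*\)//')

# .* is greedy: with two parentheticals it runs from the first "(" to the last ")"
greedy=$(printf '%s\n' 'first (a) middle (b) last' | sed -E 's/\s+\(.*\)//')

# [^)]* stops at the first ")" instead, so each parenthetical is removed separately
careful=$(printf '%s\n' 'first (a) middle (b) last' | sed -E 's/\s+\([^)]*\)//g')

printf '%s\n' "$trimmed" "$greedy" "$careful"
```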
Working with genus and species names
INPUT:
Haemorhous cassinii
Haemorhous purpureus
Haemorhous mexicanus
SEARCH: (\w)\w+ (\w+) #capture first letter, skip rest of word and space, capture all of next word
REPLACE: \1_\2 # paste capture element 1 add an "_" and paste capture element 2.
RESULT:
H_cassinii
H_purpureus
H_mexicanus
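The same capture-and-paste idea works in the shell. A minimal sketch, assuming GNU sed: with the -E option, unescaped parentheses capture just as they do in the editor, and \1 and \2 paste the captures back.

```bash
# The three species names from the example above, one per line
names=$(printf 'Haemorhous cassinii\nHaemorhous purpureus\nHaemorhous mexicanus')

# (\w) captures the first letter, \w+ skips the rest of the genus,
# (\w+) captures the species name
short=$(printf '%s\n' "$names" | sed -E 's/(\w)\w+ (\w+)/\1_\2/')

printf '%s\n' "$short"
```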
Regular expressions can also be implemented in the UNIX shell. This can be particularly useful when handling very large files like those generated in genomics analyses.
For some basic operations in the shell, see Navigating the UNIX shell.
In the shell there are several different basic tools that use regular expressions, including:
- grep: for searching text
- sed: to transform text (find and replace functions)
- awk: functions similarly to grep and sed, but with a different approach to implementation
Note that the way that regular expressions are coded in each of these is different than the preceding examples.
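To make the differences concrete, here is one small, invented table handled by each of the three tools. It is a sketch assuming the GNU versions of grep and sed (for \w); the table contents and variable names are placeholders, not real data.

```bash
# Invented three-line table: accession, symbol, type
table=$(printf 'NM_1 ADI1 mRNA\nNM_2 RPS7 mRNA\nXR_3 LOC1 ncRNA')

# grep: select (here, count) the lines that match
n_nm=$(printf '%s\n' "$table" | grep -c '^NM_')

# sed: rewrite each line; by default captures are written \( \) rather than ( )
swapped=$(printf '%s\n' "$table" | sed 's/^\(\w*\) \(\w*\).*/\2 \1/')

# awk: a pattern picks the lines, and an action prints whitespace-separated fields
symbols=$(printf '%s\n' "$table" | awk '/^NM_/ { print $2 }')

printf '%s\n' "$n_nm" "$swapped" "$symbols"
```

Note the sed line: without the -E option, plain ( and ) are literal characters and the escaped forms \( and \) do the capturing, which is the reverse of the text editor.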
For this example I downloaded the complete transcript sequences of the Mallard Duck Anas platyrhynchos.
Open the shell in the terminal of RStudio. If you are using a Windows machine, be sure to set the options for the terminal to Git Bash. Navigate to the folder containing the FASTA.
Now we will copy every line starting with “>” to a new file called Duck_headers.txt
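Before running this on a large file, the idea can be rehearsed on a toy FASTA. The sketch below builds a tiny, invented file in a temporary folder (the file names and sequences are placeholders) and pulls out its header lines.

```bash
# Build a tiny, invented FASTA file in a temporary folder
workdir=$(mktemp -d)
printf '>seq1 toy gene one (AAA1), mRNA\nACGTACGT\n>seq2 toy gene two (BBB2), mRNA\nGGCCTTAA\n' > "$workdir/toy.fna"

# ^ anchors the match to the start of the line; the quotes stop the shell
# from treating > as a redirection
grep '^>' "$workdir/toy.fna" > "$workdir/toy_headers.txt"

# -c counts matching lines instead of printing them
n_headers=$(grep -c '^>' "$workdir/toy.fna")

cat "$workdir/toy_headers.txt"
```

Counting the headers with -c is a quick way to check how many sequences a FASTA file holds.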
#The file is compressed, so first we will unzip it:
gunzip GCF_015476345.1_ZJU1.0_rna.fna.gz
#Then we can use grep to pull all of the headers
grep '^>' GCF_015476345.1_ZJU1.0_rna.fna > Duck_headers.txt
head Duck_headers.txt
RESULT:
>NM_001285955.2 Anas platyrhynchos acireductone dioxygenase 1 (ADI1), mRNA
>NM_001285957.1 Anas platyrhynchos ribosomal protein S7 (RPS7), mRNA
>NM_001289841.1 Anas platyrhynchos ornithine decarboxylase antizyme 1 (OAZ1), mRNA
>NM_001291578.1 Anas platyrhynchos antizyme inhibitor 1 (AZIN1), mRNA
>NM_001310341.1 Anas platyrhynchos TNF superfamily member 13b (TNFSF13B), mRNA
>NM_001310342.1 Anas platyrhynchos RAB11A, member RAS oncogene family (RAB11A), mRNA
>NM_001310343.1 Anas platyrhynchos fatty acid binding protein 2 (FABP2), mRNA
>NM_001310344.1 Anas platyrhynchos free fatty acid receptor 3 (LOC101802760), mRNA
>NM_001310345.1 Anas platyrhynchos growth hormone 1 (GH1), mRNA
>NM_001310346.1 Anas platyrhynchos ras homolog family member A (RHOA), mRNA
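Using sed to clean up the headers
#Here we can capture specific elements of the pattern with \( and \)
#\1 is the ">" plus accession and version; \2 is the symbol inside the last parentheses
#A plain ">" is used because "\>" is a word-boundary anchor in GNU sed
sed 's/\(>\w*\..\).*.(\(...*\)).*/\1 \2/' Duck_headers.txt > num_symbol.txt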
head num_symbol.txt
>NM_001285955.2 ADI1
>NM_001285957.1 RPS7
>NM_001289841.1 OAZ1
>NM_001291578.1 AZIN1
>NM_001310341.1 TNFSF13B
>NM_001310342.1 RAB11A
>NM_001310343.1 FABP2
>NM_001310344.1 LOC101802760
>NM_001310345.1 GH1
>NM_001310346.1 RHOA
Using sed to find a gene by name
What if we want the full sequence of a specific gene in this file? We can use sed to capture a range of lines from the FASTA.
#Here I modified the FASTA to add a line containing a single space before each transcript header. Then I selected the lines from a line containing "BCO2" through the next line beginning with a space. I used a pipe "|" to pass the output from the first command to the second.
sed 's/>/ \n>/g' GCF_015476345.1_ZJU1.0_rna.fna | sed -n '/BCO2/,/^ /p' > BCO2.txt
head BCO2.txt
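The range selection is easier to follow on a toy example. This sketch assumes GNU sed (for \n in the replacement) and uses an invented three-record FASTA. The final step, which deletes the separator line again, is my own addition and not part of the command above.

```bash
# Invented three-record FASTA
fasta='>t1 toy gene AAA1
ACGT
>t2 toy gene BCO2
GGCC
TTAA
>t3 toy gene CCC3
AAAA'

# Step 1: put a line holding one space before every header (GNU sed \n)
# Step 2: print from a line containing BCO2 through the next line starting with a space
# Step 3 (my addition): drop the separator line again
record=$(printf '%s\n' "$fasta" | sed 's/>/ \n>/g' | sed -n '/BCO2/,/^ /p' | sed '/^ $/d')

printf '%s\n' "$record"
```

The -n option switches off sed's automatic printing, so only the lines selected by the range and the p command reach the output.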