Regular Expressions

Regular expression are a powerful tool to search text files and extract and edit elements in those files. They can make quick work of formatting problems and highly repetitive tasks. If you find yourself repetitively clicking, cutting, pasting, deleting, etc. to clean up a problematic file, you might be able to solve your problem quickly with a regular expression.

First we will work with regular expressions in our text editor. I will be using Notepad++

Wildcards

\w a single word character [letter,number or _]
\d a single number character [0-9]
\t a single tab space
\s a single space, tab, or line break
\n a single line break (or try \r)

WIth wildcards we can quickly swap spaces for line breaks, which can be useful when tidying list.

INPUT: larry curly moe
SEARCH: \s
REPLACE: \n
RESULT: 
larry 
curly
moe

Quantifiers

add these on to any of the wildcards
\w+ one or more consecutive word characters
\w* zero or more consecutive word characters
\w{3} exactly 3 consecutive word characters
\w{3,} 3 or more consecutive word characters
\w{3,5} 3, 4, or 5 consecutive word characters

Repair text cut and pasted from a pdf

INPUT: Phenotypic plasticity, the ability of a single genotype 
to  express  different  phenotypes  depending  on  envi-
ronmental conditions, is a key determinant of organ-
ismal performance and greatly influences ecological 
interactions. Its role in evolution is also increasingly 
acknowledged  (Bradshaw,  1965;  Schlichting  &  Pig-
liucci,  1998;  Agrawal,  2001; Pigliucci,  2001;  West-
Eberhard, 2003; DeWitt & Scheiner, 2004; Van Snick 
Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfen-
nig  et  al.,  2010).

SEARCH: \n #find the line breaks. In my system I had to remove with both \n and \r

REPLACE: (nothing)

RESULT: Phenotypic plasticity, the ability of a single genotype   to  express  different  phenotypes  depending  on  envi-  ronmental conditions, is a key determinant of organ-  ismal performance and greatly influences ecological   interactions. Its role in evolution is also increasingly   acknowledged  (Bradshaw,  1965;  Schlichting  &  Pig-  liucci,  1998;  Agrawal,  2001; Pigliucci,  2001;  West-  Eberhard, 2003; DeWitt & Scheiner, 2004; Van Snick   Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfen-  nig  et  al.,  2010).

This still needs work. The word spcaing is variable and there are “-” in the text.

INPUT: from above

SEARCH: -\s+ #this will find the "-" and the spaces that follow

REPLACE: (nothing) # this will delete 

RESULT: Phenotypic plasticity, the ability of a single genotype  to  express  different  phenotypes  depending  on  environmental conditions, is a key determinant of organismal performance and greatly influences ecological  interactions. Its role in evolution is also increasingly  acknowledged  (Bradshaw,  1965;  Schlichting  &  Pigliucci,  1998;  Agrawal,  2001; Pigliucci,  2001;  WestEberhard, 2003; DeWitt & Scheiner, 2004; Van Snick  Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfennig  et  al.,  2010).

Now we just need to fix the line spacing

INPUT: from above

SEARCH: \s+

REPLACE: " " #input a single space

RESULT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on environmental conditions, is a key determinant of organismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged (Bradshaw, 1965; Schlichting & Pigliucci, 1998; Agrawal, 2001; Pigliucci, 2001; WestEberhard, 2003; DeWitt & Scheiner, 2004; Van Snick Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfennig et al., 2010).

The zero or more `*` quantifier

INPUT: vanilla , chocolate, strawberry ,   rocky road,   mint chocolate chip

SEARCH: \s*,\s*

REPLACE: ,  #note the replacement is "," and a space " "

RESULT: vanilla, chocolate, strawberry, rocky road, mint chocolate chip

`.*` for “all the rest”

INPUT: Phenotypic plasticity, the ability of a single genotype  to  express  different  phenotypes  depending  on  environmental conditions, is a key determinant of organismal performance and greatly influences ecological  interactions. Its role in evolution is also increasingly  acknowledged  (Bradshaw,  1965;  Schlichting  &  Pigliucci,  1998;  Agrawal,  2001; Pigliucci,  2001;  WestEberhard, 2003; DeWitt & Scheiner, 2004; Van Snick  Gray & Stauffer, 2004; Fusco & Minelli, 2010; Pfennig  et  al.,  2010).

SEARCH: \s+\(.*\) #note that "(" and ")" are regex operators so we need to escape and use the literal character by adding the "\" in front of them

Replace: (nothing)

RESULT: Phenotypic plasticity, the ability of a single genotype to express different phenotypes depending on environmental conditions, is a key determinant of organismal performance and greatly influences ecological interactions. Its role in evolution is also increasingly acknowledged.

Capturing and repalcing text elements

Working with genus and species names

INPUT: 
Haemorhous cassinii 
Haemorhous purpureus 
Haemorhous mexicanus
SEARCH: (\w)\w+ (\w+) #capture first letter, skip rest of word and space, capture all of next word
REPLACE: \1_\2 # paste capture element 1 add an "_" and paste capture element 2. 
RESULT:
H_cassinii  
H_purpureus 
H_mexicanus

Regular expression and very large file in the shell

Regular expressions can also be implemented in the UNIX shell. This is can be particularly useful when handling very large files like those generated in genomics analyses.

For some basic operation in the shell see - Navigating the UNIX shell

In the shell there are several different basic tools that use regular expressions including:
- grep for searching text
- sed to transform text - find and replace functions
- awk functions similar to grep and sed, but a different approach to implementation

Note that the way that regular expressions are coded in each of these is different than the preceding examples.

Using grep to pull headers from a FASTA

For this example I downloaded the complete transcript sequences of the Mallard Duck Anas platyrhynchus.

Open the shell in the terminal of Rstudio. If you are using a windows machine, be sure to set the options foe the terminal to Git Bash. Navigate to the folder containing the FASTA.

Now we will copy every line starting this “>” to a new file called Duck_headers.txt

#The file is compressed, so first we will unzip it: 

gunzip GCF_015476345.1_ZJU1.0_rna.fna.gz

#Then we can use grep to pull all of the headers

grep '^>' GCF_015476345.1_ZJU1.0_rna.fna > Duck_headers.txt

head Duck_header.txt

RESULT: 

>NM_001285955.2 Anas platyrhynchos acireductone dioxygenase 1 (ADI1), mRNA
>NM_001285957.1 Anas platyrhynchos ribosomal protein S7 (RPS7), mRNA
>NM_001289841.1 Anas platyrhynchos ornithine decarboxylase antizyme 1 (OAZ1), mRNA
>NM_001291578.1 Anas platyrhynchos antizyme inhibitor 1 (AZIN1), mRNA
>NM_001310341.1 Anas platyrhynchos TNF superfamily member 13b (TNFSF13B), mRNA
>NM_001310342.1 Anas platyrhynchos RAB11A, member RAS oncogene family (RAB11A), mRNA
>NM_001310343.1 Anas platyrhynchos fatty acid binding protein 2 (FABP2), mRNA
>NM_001310344.1 Anas platyrhynchos free fatty acid receptor 3 (LOC101802760), mRNA
>NM_001310345.1 Anas platyrhynchos growth hormone 1 (GH1), mRNA
>NM_001310346.1 Anas platyrhynchos ras homolog family member A (RHOA), mRNA

Using `sed` to clean up the headers

#Here we can capture specific elements of the pattern with \( and \)

sed 's/\(\>\w*\..\).*.(\(...*\)).*/\1 \2/' Duck_headers.txt > num_symbol.txt

head num_symbol.txt

>NM_001285955.2 ADI1
>NM_001285957.1 RPS7
>NM_001289841.1 OAZ1
>NM_001291578.1 AZIN1
>NM_001310341.1 TNFSF13B
>NM_001310342.1 RAB11A
>NM_001310343.1 FABP2
>NM_001310344.1 LOC101802760
>NM_001310345.1 GH1
>NM_001310346.1 RHOA\

Using `sed` to find a gene by name

What if we want the full sequence of a specific gene in this file. We can use sed to capture a range of lines from the FASTA

#Here I modified the FASTA to add a space between transcripts. Then I selected the line between the first line containing "BCO2" and nest line beginning with a space. I used a pipe "|" to pass the output from the first to the second command. 

sed 's/>/ \n>/g' GCF_015476345.1_ZJU1.0_rna.fna | sed -n '/BCO2/,/^ /p' > BCO2.txt

head BCO2.txt

Regular Expressions

Matthew Toomey

8/16/2022

Wildcards

Quantifiers

Repair text cut and pasted from a pdf

The zero or more `*` quantifier

`.*` for “all the rest”

Capturing and repalcing text elements

Regular expression and very large file in the shell

Using grep to pull headers from a FASTA

Using `sed` to clean up the headers

Using `sed` to find a gene by name

Regular Expressions

Matthew Toomey

8/16/2022

Wildcards

Quantifiers

Repair text cut and pasted from a pdf

The zero or more * quantifier

.* for “all the rest”

Capturing and repalcing text elements

Regular expression and very large file in the shell

Using grep to pull headers from a FASTA

Using sed to clean up the headers

Using sed to find a gene by name

The zero or more `*` quantifier

`.*` for “all the rest”

Using `sed` to clean up the headers

Using `sed` to find a gene by name