Regular Expressions and Pattern Matching

So Why Regex?
Parse files of data and information:
    fasta
    embl / genbank format
    html (web pages)
    user input to programs
Check format
Find illegal characters (validation)
Search for sequence motifs

Simple Patterns
Place the regex between a pair of forward slashes (/ /).
try:
    #!/usr/bin/perl
    while (<STDIN>) {
        if (/abc/) {
            print "1 >> $_";
        }
    }
Run the script.
Type in something that contains abc:
    abcfoobar
Type in something that doesn't:
    fgh cba foobar
    ab c foobar
The print statement runs only if abc is matched within the typed input.

Can also match strings from files.

Simple Patterns (2)
genomes_desc.txt contains a few text lines containing information about three genomes.
try:
    #!/usr/bin/perl
    open IN, "

Flexible matching
There are many characters with special meanings: metacharacters.
star (*) matches any number of instances
    /ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
           => abc or abbbbbbbc or ac
plus (+) matches at least one instance
    /ab+c/ => 'a' followed by one or more 'b' followed by 'c'
           => abc or abbc or abbbbbbbbbbbbbbc NOT ac
question mark (?) matches zero or one instance
    /ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'
           => abc or ac
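
The three quantifiers above can be compared side by side. This is a minimal sketch: the test strings and the helper name matching_quantifiers are made up for illustration, and the patterns are anchored with ^ and $ so that the whole string has to fit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the quantifiers (*, +, ?) whose pattern matches the whole string.
sub matching_quantifiers {
    my ($s) = @_;
    my @hits;
    push @hits, '*' if $s =~ /^ab*c$/;    # zero or more b
    push @hits, '+' if $s =~ /^ab+c$/;    # one or more b
    push @hits, '?' if $s =~ /^ab?c$/;    # zero or one b
    return join ' ', @hits;
}

foreach my $s ('ac', 'abc', 'abbbbc') {
    print "$s matches: ", matching_quantifiers($s), "\n";
}
```

Only abc satisfies all three; ac fails the plus, and abbbbc fails the question mark.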

More General Quantifiers
Match a character a specific number or range of instances.
{x} will match x number of instances.
    /ab{3}c/ => abbbc
{x,y} will match between x and y instances.
    /a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x+ instances.
    /abc{3,}/ => abccc or abccccccccc or abcccccccccccccccccccccccccccccc... (three or more c, with no upper limit)
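
One thing worth checking with counted quantifiers is that an unanchored pattern may start matching anywhere in the string. The sketch below assumes a made-up string with five a's. The helper matches and the use of qr// (a precompiled pattern) are illustration choices, not part of the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns 1 if the precompiled pattern matches the string, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

my $five_a = 'aaaaabc';    # five a's: one more than {2,4} allows

# Unanchored: the regex may start at the second 'a', so it still matches.
my $unanchored = matches($five_a, qr/a{2,4}bc/);

# Anchored with ^ and $: the whole string must fit, so five a's are rejected.
my $anchored = matches($five_a, qr/^a{2,4}bc$/);

print "unanchored: $unanchored, anchored: $anchored\n";
```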

More metacharacters
dot (.) refers to any character, even tab (\t) and space, but not newline (\n).
    /a.*c/ => 'a' followed by any number of any characters followed by 'c'

Escaping
But I want to use these symbols in my regex!?!
To use a * , + , ? or . in the pattern when not a metacharacter, need to 'escape' them with a backslash.
    /C\. elegans/ => C. elegans only
    /C. elegans/  => Ca , Cb , C3 , C> , C. , etc...
The 'delimiter' of the regex, forward slash "/", and the 'escape' character, backslash "\", are also metacharacters. These need to be escaped if required in the regex.
Important when trying to match URLs and email addresses.
    /joe\.bloggs\@darwin\.co\.uk/
    /www\.envgen\.nox\.ac\.uk\/biolinux\.html/
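
Perl can also do the escaping for you. The sketch below uses \Q...\E, which escapes every metacharacter in an interpolated variable, and m{...}, an alternative delimiter that saves escaping forward slashes. The variable names and the http:// prefix on the URL are assumptions added for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $name = 'C. elegans';

# Interpolated as-is, the dot is still a metacharacter, so 'Ca elegans' matches.
my $loose  = ('Ca elegans' =~ /$name/)     ? 1 : 0;

# \Q...\E escapes every metacharacter in the interpolated text.
my $strict = ('Ca elegans' =~ /\Q$name\E/) ? 1 : 0;
my $exact  = ('C. elegans' =~ /\Q$name\E/) ? 1 : 0;

# With m{...} as the delimiter, the forward slash no longer needs escaping.
my $url     = 'http://www.envgen.nox.ac.uk/biolinux.html';
my $url_hit = ($url =~ m{www\.envgen\.nox\.ac\.uk/biolinux\.html}) ? 1 : 0;

print "loose=$loose strict=$strict exact=$exact url=$url_hit\n";
```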

Using metacharacters.
The file nemaglobins.embl contains 21 EMBL database entries, each containing a globin protein within its sequence.
try:
    #!/usr/bin/perl
    $count;
    open IN, "

Grouping Patterns
Can group patterns in parentheses "()".
Useful when coupled with quantifiers:
    /elegans+/     => eleganssssssssssssss
    /(elegans)+/   => eleganselegans...elegans  (elegans repeated 1, 2, ... n times)
    /eleg(ans){4}/ => elegansansansans  (ans repeated 4 times)
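
A quick way to see what the parentheses change is to run the grouped and ungrouped patterns on the same strings. A minimal sketch, with variable names chosen for illustration and ^ and $ added so the whole string must match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Without a group, + applies only to the final 's'.
my $s_run      = ('elegansssss'      =~ /^elegans+$/)     ? 1 : 0;

# With a group, + applies to the whole word, so the extra s's do not fit.
my $s_as_group = ('elegansssss'      =~ /^(elegans)+$/)   ? 1 : 0;
my $word_run   = ('eleganselegans'   =~ /^(elegans)+$/)   ? 1 : 0;

# {4} applies to the group 'ans' only.
my $ans_four   = ('elegansansansans' =~ /^eleg(ans){4}$/) ? 1 : 0;

print "s_run=$s_run s_as_group=$s_as_group word_run=$word_run ans_four=$ans_four\n";
```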

Alternatives
Want either this pattern or that pattern.
Two ways:
1.) the vertical bar '|': either the left side matches or the right side matches
    /(human|mouse|rat)/ => any string with human or mouse or rat.
Combine with previous examples:
    /Fugu( |\t)+rubripes/ matches if Fugu and rubripes are separated by any mixture of spaces and tabs

2.) character class is a list of characters within '[]'. It will match any single character within the class.
    /[wxyz1234\t]/ => any of the nine.
a range can be specified with '-'
    /[w-z1-4\t]/ => as above
to match a literal hyphen, put it first (or last) in the class
    /[-a-zA-Z]/ => any letter or a hyphen
negate a class with '^' as its first character
    /[^z]/   => any character except z
    /[^abc]/ => any character except a or b or c
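
Both ways of writing alternatives can be tried on the same input. This is a minimal sketch: the test strings and the helper names is_hyphenated_word and has_non_abc are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Both patterns accept Fugu and rubripes separated by any mix of spaces and tabs.
my $binomial   = "Fugu \t  rubripes";
my $with_bar   = ($binomial =~ /Fugu( |\t)+rubripes/) ? 1 : 0;
my $with_class = ($binomial =~ /Fugu[ \t]+rubripes/)  ? 1 : 0;

# A leading hyphen in a class is literal; underscore is not listed, so it fails.
sub is_hyphenated_word {
    my ($text) = @_;
    return $text =~ /^[-a-zA-Z]+$/ ? 1 : 0;
}

# A leading ^ negates the class: true if any character other than a, b or c occurs.
sub has_non_abc {
    my ($text) = @_;
    return $text =~ /[^abc]/ ? 1 : 0;
}

print "bar=$with_bar class=$with_class\n";
print "wild-type: ", is_hyphenated_word('wild-type'), "\n";
print "wild_type: ", is_hyphenated_word('wild_type'), "\n";
print "abcabc: ", has_non_abc('abcabc'), " abcd: ", has_non_abc('abcd'), "\n";
```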

Other Shortcuts
    \d => any digit [0-9]
    \w => any "word" character [A-Za-z0-9_]
    \s => any white space [\t\n\r\f ]
    \D => any character except a digit [^\d]
    \W => any character except a "word" character [^\w]
    \S => any character except a white space [^\s]
Can use any of these in conjunction with quantifiers:
    /\s*/ => any amount of white space
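
The shortcuts are most useful once they are anchored. The sketch below uses two hypothetical helpers, is_blank and is_integer, and also shows that an unanchored /\s*/ succeeds on any string, since zero white space counts as a match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True for lines that are empty or contain only white space.
sub is_blank {
    my ($line) = @_;
    return $line =~ /^\s*$/ ? 1 : 0;
}

# True for strings made of digits (note: $ also tolerates one trailing newline).
sub is_integer {
    my ($text) = @_;
    return $text =~ /^\d+$/ ? 1 : 0;
}

# Unanchored /\s*/ matches even when there is no white space at all.
my $always = ('abc' =~ /\s*/) ? 1 : 0;

print "blank: ", is_blank(" \t\n"), " not blank: ", is_blank(' a '), "\n";
print "integer: ", is_integer('2024'), " not integer: ", is_integer('20a4'), "\n";
print "always: $always\n";
```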

Using alternatives to find a hydrophobic region...
try:
    open IN, "< nippo_sigpept.fsa" or die;
    while (<IN>) {
        if (/>/) {       #a header line
            $count++;    #keep running total of sequence number
        }
        else {           #not a header
            if (/[VILMFWCA]{8,}/) {
                $match++;
            }
        }
    }
    print "Hydrophobic region found in $match sequences from $count\n";
Could also have used /(V|I|L|M|F|W|C|A){8,}/

Binding Operator
So far we have been matching against $_.
The binding operator "=~" matches the pattern on the right against the string on the left.
Usually add the m operator (optional).
    $sumthing = 'Ascaris suum is a nematode';
    if ($sumthing =~ m/suum.*nematode/) {
        print "this organism infects pigs!\n";
    }
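
The binding operator also has a negated form, !~, which is true when the pattern does not match. A minimal sketch using the same example string; the extra variable names and the 'insect' pattern are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sumthing = 'Ascaris suum is a nematode';

# =~ is true when the pattern matches the string on its left.
my $is_suum = ($sumthing =~ m/suum.*nematode/) ? 1 : 0;

# !~ is the negated form: true when the pattern does NOT match.
my $not_insect = ($sumthing !~ m/insect/) ? 1 : 0;

# $_ is only the default target; binding it explicitly gives the same result.
my ($implicit, $explicit) = (0, 0);
for ($sumthing) {                          # aliases $_ to $sumthing
    $implicit = 1 if /nematode/;
    $explicit = 1 if $_ =~ /nematode/;
}

print "is_suum=$is_suum not_insect=$not_insect implicit=$implicit explicit=$explicit\n";
```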

Anchors (2): Word Boundary
\b matches the start or end of a word.
    /\bmus\b/ => mus but not musculus
    /la\b/    => Drosophila but not Plasmodium
    /\btes/   => Comamonas testosteroni but not Pan troglodytes
\b is zero-width, so a word followed by a newline still ends at a boundary.
Be careful with full stops: they're characters too!
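
What counts as a boundary depends on \w, so punctuation and underscores behave differently. The sketch below assumes a hypothetical helper has_word_mus and made-up test strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if 'mus' occurs as a whole word.
sub has_word_mus {
    my ($text) = @_;
    return $text =~ /\bmus\b/ ? 1 : 0;
}

# A full stop and a newline are non-word characters, so they end a word.
# An underscore is a word character (\w), so mus_musculus has no boundary after mus.
foreach my $text ('mus', 'musculus', 'mus.', "mus\n", 'mus_musculus') {
    (my $shown = $text) =~ s/\n/\\n/;    # make the newline visible when printing
    print "$shown => ", has_word_mus($text), "\n";
}
```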

Memory Variables
Able to extract sections of the pattern match and store them in a variable.
Anything matched inside parentheses "()" is written into a special variable.
The first instance is $1, the second $2, the fourth $4 and so on.
Extract from file:
    Organism: Homo sapiens...
Extract from Perl script:
    while ($line = <IN>) {
        if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
            $genus = $1;      #stores Homo
            $species = $2;    #stores sapiens
        }
    }
The quantifier goes inside the parentheses: (\w+) stores the whole word, whereas (\w)+ would store only its last character.
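
The position of the quantifier relative to the parentheses decides what is stored. A minimal sketch on the example line, using a list-context match (which returns the captured groups directly) instead of $1 and $2. The variable names are chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = 'Organism: Homo sapiens';

# Quantifier inside the group: each group stores a whole word.
my ($genus, $species) = $line =~ m/Organism:\s(\w+)\s(\w+)/;

# Quantifier outside the group: the group is overwritten on every
# repetition, so only the last character of each word is kept.
my ($last_g, $last_s) = $line =~ m/Organism:\s(\w)+\s(\w)+/;

print "inside:  $genus $species\n";
print "outside: $last_g $last_s\n";
```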

Substitutions
Able to replace a pattern within a string with another string.
Use the "s" operator:
    s/abc/xyz/ => find abc and replace with xyz
By default only the first instance of a match is replaced.
Using the 'g' modifier (global) will find and replace all instances.
    $line = 'abccdcbabc';
    $line =~ s/abc/xyz/g;
    print $line;    #produces xyzcdcbxyz
1. Run dna2rna.pl
2. Now look at dna2rna.pl
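
The effect of the 'g' modifier is easy to check, because s/// returns the number of replacements it made. A minimal sketch on the same string; the variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Without g: only the first abc is replaced.
my $first_only = 'abccdcbabc';
my $n_first    = ($first_only =~ s/abc/xyz/);

# With g: every abc is replaced, and the count is returned.
my $all   = 'abccdcbabc';
my $n_all = ($all =~ s/abc/xyz/g);

print "first only: $first_only ($n_first replacement)\n";
print "global:     $all ($n_all replacements)\n";
```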

dna2rna.pl
    #!/usr/bin/perl
    print "Enter DNA sequence\n";
    while ($line = <STDIN>) {
        chomp $line;    #remove trailing \n
        if ($line =~ m/[^AGCT]/i) {
            #case insensitive matching is set by the 'i' modifier
            print "your sequence contained an invalid nucleotide:$&\nPlease try again\n";
            #'$&' is a special variable which stores what the
            #regular expression matched. Don't worry about it for now.
        }
        else {
            $line =~ s/t/u/g;    #replace all lower case 't'
            $line =~ s/T/U/g;    #replace all upper case 'T'
            print "The RNA sequence is:\n$line\n";
            print "Try again or ctrl C to quit\n";
        }
    }

EMBL file revisited
Using shortcuts and anchors to help make it more robust:
    if (/AC   .*/) {    #that's three spaces
can be rewritten as:
    if (/^AC\s{3}(.*)\n$/) {    #more certain to return what you want
        $accession = $1;        #now have info stored to use later.
    }
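
The difference between the two patterns shows up on lines that merely contain "AC" followed by three spaces. The sketch below uses two hypothetical EMBL-style lines invented for illustration; the helper names loose_hit and accession are also assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lines in EMBL style, invented for this sketch.
my $ac_line = "AC   AB000001;\n";
my $de_line = "DE   MAC   domain protein\n";

# The loose pattern matches 'AC   ' anywhere in the line.
sub loose_hit {
    my ($line) = @_;
    return $line =~ /AC   .*/ ? 1 : 0;
}

# The anchored pattern only accepts lines that start with AC,
# and returns the rest of the line (undef if there is no match).
sub accession {
    my ($line) = @_;
    return $line =~ /^AC\s{3}(.*)\n$/ ? $1 : undef;
}

my $acc = accession($ac_line);
print "accession: $acc\n";
print "loose pattern on DE line: ", loose_hit($de_line), "\n";
print "anchored pattern on DE line: ",
      (defined accession($de_line) ? 'match' : 'no match'), "\n";
```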

Now It's Your Turn :o)
nemaglobins.embl contains entries for complete cds of nematode sequences.
For each entry print the ACcession, OrganiSm name and AGCT content of the SeQuence.
Output should read:
    Accession: AC00000 Species: Toxocara canis A: 34 G: 65 C: 24 T: 75
Hints:
    The lines of interest are AC, OS, and SQ.
    Three regular expressions - one for each query.
    Use a series of if and elsif statements to search for regular expressions.
    Print when matched.
Bonus point - remove the semi-colon from the accession id.
Shout if you need help.
