Perl Crash Course: Basic Regular Expressions

by Nadia Kozievich
revision: Vinny Alves


Introduction

A regular expression (or regex) is a simple, rather mindless way of matching a series of symbols to a pattern you have in mind. The origins of regular expressions lie in automata theory and formal language theory, both of which are part of theoretical computer science.

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Regular expressions are used by many text editors, utilities, and programming languages to search and manipulate text based on patterns. For example, Perl, Ruby and Tcl have a powerful regular expression engine built directly into their syntax. Several utilities provided by Unix distributions, including the editor ed and the filter grep, were the first to popularize the concept of regular expressions.


What They Are

Regular expressions are a syntax, implemented in Perl and certain other environments, making it not only possible but easy to do some of the following:

#Complex string comparisons
$string =~ m/log2008/; # m before the first slash is the "match" operator, not required if using / / as delimiters

#Complex string selections
$string =~ m/log(date)txt/;
$date = $1;

#Complex string replacements
$string =~ s/originaltext/newtext/; # s before the first slash is the "substitute" operator. (tr, the "translate" operator, maps single characters and is not a regular expression.)
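
The three operations above can be tried together in a short script. This is only a minimal sketch: the sample file name and the date pattern are made up for illustration, and the replacement uses the s/// (substitute) operator, which replaces matched text, whereas tr/// maps individual characters one to one.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the article.
my $string = "backup-log2008-03-15.txt";

# Comparison: does the string contain "log2008"?
my $found = ($string =~ m/log2008/) ? 1 : 0;

# Selection: capture the date that follows "log" into $1.
my $date = '';
if ($string =~ m/log(\d{4}-\d{2}-\d{2})\.txt/) {
    $date = $1;
}

# Replacement: s/// substitutes the first match with new text.
# Working on a copy leaves $string itself untouched.
(my $renamed = $string) =~ s/backup/archive/;

print "found=$found date=$date renamed=$renamed\n";
```

Always test that the match succeeded before reading $1; after a failed match it still holds whatever the last successful match captured.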

Perl's regular expression syntax has a great deal in common with POSIX regular expressions, so the two look very similar. Let's start with a simple example of a Perl-based regular expression:

/food/

Notice that the string food is enclosed between two forward slashes. Just as with POSIX regular expressions, you can build a more complex string through the use of quantifiers:

/fo+/

This matches f followed by one or more occurrences of o. Because the pattern is not anchored, any string that contains such a sequence matches, including food, fool, and fo4. Here is another example of using a quantifier:

/fo{2,4}/

This matches f followed by two to four occurrences of o. Some potential matches include fool, fooool, and foosball.
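
To see the two quantifiers in action, here is a small sketch that tries both patterns against a handful of words. The helper name contains_match and the word list are illustrative choices, not part of the original text.

```perl
use strict;
use warnings;

# Return 1 if $text contains a match for the compiled pattern $re, else 0.
sub contains_match {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

# qr// compiles a pattern once so it can be passed around like a value.
for my $word (qw(food fo4 f fun fool foosball)) {
    printf "%-9s /fo+/=%d  /fo{2,4}/=%d\n",
        $word,
        contains_match(qr/fo+/, $word),
        contains_match(qr/fo{2,4}/, $word);
}
```

Note that fo4 satisfies /fo+/ (one o is enough) but not /fo{2,4}/, which needs at least two.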

Note that you can use just about anything you want as a delimiter. If using / /, then the leading m (like m/ /) is not required; with any other delimiter the m is mandatory. Here are some more examples:

m| |;

m[ ];

m{ };
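
Alternative delimiters are most useful when the pattern itself contains slashes. The following sketch compares the two spellings; the path is a made-up example.

```perl
use strict;
use warnings;

# Hypothetical path used only for illustration.
my $path = "/var/log/app/log2008.txt";

# With slashes as delimiters, every / inside the pattern must be escaped.
my $with_slashes = ($path =~ /^\/var\/log\//) ? 1 : 0;

# With braces as delimiters the same pattern needs no escaping,
# but the leading m is now required.
my $with_braces = ($path =~ m{^/var/log/}) ? 1 : 0;

print "slashes=$with_slashes braces=$with_braces\n";
```

Both tests are equivalent; the second is simply easier to read.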

Doing String Comparisons

We start with string comparisons because they're the easiest, and yet most of what's contained here is applicable in selecting and replacing text.

Quantifiers

Quantifiers and anchors let you look for strings containing one or more instances of the letter p, strings containing at least two p's, or even strings with the letter p as their beginning or ending character. Here are several examples:

  • p+ matches any string containing at least one p.
  • p* matches any string containing zero or more p's.
  • p? matches any string containing zero or one p.
  • p{2} matches any string containing a sequence of two p's.
  • p{2,3} matches any string containing a sequence of two or three p's.
  • p{2,} matches any string containing a sequence of at least two p's.
  • p$ matches any string with p at the end of it.
  • ^p matches any string beginning with p.
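
The list above can be checked mechanically. This sketch runs every pattern from the list against a few sample strings and reports which ones match; the helper name matching_patterns and the sample strings are assumptions made for the example.

```perl
use strict;
use warnings;

# The patterns from the list, kept as plain strings so they print nicely.
my @patterns = ('p+', 'p*', 'p?', 'p{2}', 'p{2,3}', 'p{2,}', 'p$', '^p');

# Return the patterns that match somewhere in $text.
sub matching_patterns {
    my ($text) = @_;
    return grep { $text =~ /$_/ } @patterns;
}

for my $text ('', 'p', 'apple', 'pop', 'cat') {
    printf "%-7s matches: %s\n", "'$text'", join(' ', matching_patterns($text));
}
```

Because p* and p? are satisfied by zero occurrences, they match every string, even the empty one and cat. On their own they are rarely useful as tests; they earn their keep inside larger patterns.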

Now for some examples:

$string =~ m/\s*rem/i;             # true if text contains 0 or more whitespace characters followed by rem or REM.
                                   # The trailing i specifies case insensitivity
$string =~ m/^\S{1,8}\.\S{0,3}$/;  # check for DOS 8.3 filename; the trailing $ rejects longer extensions.
                                   # \S means non-whitespace characters. More on that later.

Simple String Comparisons

The most basic string comparison is

$string =~ m/log2008/;

The above returns true if $string contains the substring "log2008", false otherwise. If you want only those strings where log2008 appears at the very beginning, you could write the following:

$string =~ m/^log2008/;

Similarly, the $ anchor indicates "end of string". If you wanted to find out if the sought text was the very last text in the string, you could write this:

$string =~ m/log2008$/;

Now, if you want the comparison to be true only if $string contains log2008 and nothing but log2008, simply anchor it like this:

$string =~ m/^log2008$/;

Now what if you want the comparison to be case insensitive? All you do is add the letter i after the ending delimiter:

$string =~ m/^log2008$/i;
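
The effect of each anchor is easiest to see side by side. The following sketch classifies a few sample strings by trying the strictest pattern first; the function name classify and the sample strings are invented for the illustration.

```perl
use strict;
use warnings;

# Describe how "log2008" appears in $text, strictest test first.
sub classify {
    my ($text) = @_;
    return 'exact'            if $text =~ m/^log2008$/;
    return 'exact, any case'  if $text =~ m/^log2008$/i;
    return 'at start'         if $text =~ m/^log2008/;
    return 'at end'           if $text =~ m/log2008$/;
    return 'somewhere inside' if $text =~ m/log2008/;
    return 'absent';
}

for my $text ('log2008', 'LOG2008', 'log2008.txt', 'old-log2008',
              'my log2008 file', 'log2009') {
    printf "%-16s %s\n", $text, classify($text);
}
```

One detail worth knowing: $ also matches just before a trailing newline, so "log2008\n" still counts as exact here. Use \z instead of $ when the match must end at the absolute end of the string.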


Using Simple "Wildcards" and "Repetitions"

Calling these "wildcards" is not Perl's official terminology, but it is the most intuitive way to think of them and will not lead to any coding mistakes.

.   Match any character (except newline)
\w  Match "word" character (alphanumeric plus "_")
\W  Match non-word character
\s  Match whitespace character (spaces, tabs, form-feeds, etc)
\S  Match non-whitespace character
\d  Match digit character
\D  Match non-digit character
\t  Match tab
\n  Match newline
\r  Match carriage return
\f  Match form-feed
\a  Match alarm (bell, beep, etc)
\e  Match escape
\b  Match word boundary
\B  Match non-boundary
\021  Match octal char (in this case 21 octal)
\xf0  Match hex char (in this case f0 hexadecimal)
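
A few of these escapes in use, as a rough sketch. The sample line is made up, and the /g modifier (which returns every match in list context) goes slightly beyond what this page covers.

```perl
use strict;
use warnings;

# Hypothetical log line containing a space-separated prefix and a tab.
my $line = "Error 42:\tdisk_full on node7";

my @numbers = $line =~ /\d+/g;     # runs of digits
my @words   = $line =~ /\w+/g;     # runs of word characters (note the underscore)
my @fields  = split /\s+/, $line;  # \s covers both the spaces and the tab

# \b requires a word boundary on each side, so "node" inside "node7" fails
# while the standalone word "on" succeeds.
my $whole_word_node = ($line =~ /\bnode\b/) ? 1 : 0;
my $whole_word_on   = ($line =~ /\bon\b/)   ? 1 : 0;

print "numbers: @numbers\n";
print "words:   @words\n";
print "fields:  ", scalar(@fields), "\n";
print "node as a whole word: $whole_word_node, on as a whole word: $whole_word_on\n";
```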





Using Groups ( ) in Matching

Groups are regular expression characters surrounded by parentheses. They have two major uses:

  • To allow alternative phrases as in /(log2008|log2009|log2007)/i.
  • As a means of retrieving selected text in selection and substitution.

Powerful regular expressions can be made with groups. At its simplest, you can match either an all-lowercase or a capitalized name like this:

if ($string =~ m/(L|l)og (day|month)\.txt/){   # \. matches a literal dot; a bare . would match any character
        print "Found the log description!\n";
}

Detect all strings containing vowels:

if ($string =~ m/(A|E|I|O|U|a|e|i|o|u)/){
        print "Vowels!\n";
}

Detect if the line starts with any of three Brazilian presidents:

if ($string =~ m/^(Lula|Itamar|Sarney)/i){
        print "$string\n";
}
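
The second use of groups, retrieving the matched text, can be sketched as follows. Groups are numbered by their opening parenthesis from left to right, so in this pattern $1 holds the L or l and $2 holds the period. The function name log_period is an assumption made for the example.

```perl
use strict;
use warnings;

# Return "day" or "month" if $name looks like a log file name, else undef.
sub log_period {
    my ($name) = @_;
    # Group 1 is (L|l), group 2 is (day|month).
    return $name =~ m/(L|l)og (day|month)\.txt/ ? $2 : undef;
}

for my $name ('Log day.txt', 'log month.txt', 'log year.txt') {
    my $period = log_period($name);
    print "$name => ", (defined $period ? $period : 'no match'), "\n";
}
```

The captured value is read only when the match succeeds, because a failed match leaves the capture variables from the previous successful match in place.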

Using Character Classes [ ]

Character classes have three main advantages:

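  • Shorthand notation, as [AEIOUY] instead of (A|E|I|O|U|Y).
  • Character ranges, such as [A-Z].
  • A range notation shared with the tr operator, which maps one set of characters to another one to one, as in tr/a-z/A-Z/. Note that tr is not a regular expression and takes its ranges without square brackets.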

A caret (^) immediately following the opening square bracket means "Anything but these characters", and effectively negates the character class. For instance, to match anything that is not a vowel, do this:

if ($string =~ /[^AEIOUYaeiouy]/){
        print "This string contains a non-vowel";
}

Contrast to this:

if ($string !~ /[AEIOUYaeiouy]/){
        print "This string contains no vowels at all";
}

Print all people whose name begins with A through E:

if ($string =~ m/^[A-E]/){
        print "$string\n";
}
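
The difference between a negated class and a negated match is easy to get wrong, so here is a small sketch that runs both tests on the same words, plus a tr call for comparison. The function names are invented, and the vowel class here counts both Y and y as vowels, which is an assumption of this example.

```perl
use strict;
use warnings;

# 1 if $s contains at least one character that is not a vowel.
sub has_non_vowel {
    my ($s) = @_;
    return $s =~ /[^AEIOUYaeiouy]/ ? 1 : 0;
}

# 1 if $s contains no vowel anywhere.
sub has_no_vowel {
    my ($s) = @_;
    return $s !~ /[AEIOUYaeiouy]/ ? 1 : 0;
}

# Uppercase a copy of $s with tr, which maps a-z onto A-Z one to one.
sub shout {
    my ($s) = @_;
    (my $copy = $s) =~ tr/a-z/A-Z/;
    return $copy;
}

for my $word (qw(audio aeiou brrr)) {
    printf "%-6s non-vowel=%d no-vowel=%d upper=%s\n",
        $word, has_non_vowel($word), has_no_vowel($word), shout($word);
}
```

The word audio contains a non-vowel (the d) yet is far from vowel-free, which is exactly the distinction between the two tests.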


Matching: Putting it All Together

Print everyone whose last name is Lula, Itamar or Sarney. Each element of the list consists of a first name, whitespace, a last name, and possibly more whitespace and more info after the last name.

if ($string =~ m/^\S+\s+(Lula|Itamar|Sarney)/i){
        print "$string\n";
}

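Print every line containing a seven-digit phone number of the form 555-1234. The pattern requires a closing parenthesis, whitespace or hyphen immediately before the number, and whitespace or punctuation immediately after it.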

if ($string =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/){
          print "Phone line: $string\n";
}
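
It is worth running the phone pattern against a few lines to see what it accepts and what it misses. This sketch reuses the pattern exactly as written above; the function name has_phone and the sample lines are made up.

```perl
use strict;
use warnings;

# 1 if $line contains a number like 555-1234 with the required
# characters on both sides, else 0.
sub has_phone {
    my ($line) = @_;
    return $line =~ m/[\)\s\-]\d{3}-\d{4}[\s\.\,\?]/ ? 1 : 0;
}

for my $line ('Call (212) 555-1234 today',
              'Is it 555-1234?',
              '555-1234 is my number',
              'My number is 555-1234',
              'Order 12555-12345') {
    printf "%d  %s\n", has_phone($line), $line;
}
```

The third and fourth lines are rejected because each bracketed class must consume a real character, so a number at the very start or very end of the string does not qualify. Lines read from a file without chomp still end in a newline, which satisfies the trailing \s.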

Symbol Explanations

char	meaning
=~	binds a string to a pattern; true if the string matches
!~	binds a string to a pattern; true if the string doesn't match
^	beginning of string
$	end of string
.	any character except newline
*	match 0 or more times
+	match 1 or more times
?	match 0 or 1 times; or: shortest match
|	alternative
( )	grouping; “storing”
[ ]	set of characters
{ }	repetition modifier
\	escape character


Examples

expression	matches...
abc		abc (that exact character sequence, but anywhere in the string)
^abc		abc at the beginning of the string
abc$		abc at the end of the string
a|b		either of a and b
^abc|abc$	the string abc at the beginning or at the end of the string
ab{2,4}c	an a followed by two, three or four b’s followed by a c
ab{2,}c		an a followed by at least two b’s followed by a c
ab*c		an a followed by any number (zero or more) of b’s followed by a c
ab+c		an a followed by one or more b’s followed by a c
ab?c		an a followed by an optional b followed by a c; that is, either abc or ac
a.c		an a followed by any single character (not newline) followed by a c
a\.c		a.c exactly
[abc]		any one of a, b and c
[Aa]bc		either of Abc and abc
[abc]+		any (nonempty) string of a’s, b’s and c’s (such as a, abba, acbabcacaa)
[^abc]+		any (nonempty) string which does not contain any of a, b and c (such as defg)
\d\d		any two decimal digits, such as 42; same as \d{2}
\w+		a “word”: a nonempty sequence of alphanumeric characters and low lines (underscores), such as foo and 12bar8 and foo_1
100\s*mk	the strings 100 and mk optionally separated by any amount of white space (spaces, tabs, newlines)
abc\b		abc when followed by a word boundary (e.g. in abc! but not in abcd)
perl\B		perl when not followed by a word boundary (e.g. in perlert but not in perl stuff)
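
A handful of rows from the table can be verified with a tiny test harness. This is a sketch only: the test strings and expected results are chosen for this example, and the patterns are kept in single-quoted strings so that backslashes reach the regex engine untouched.

```perl
use strict;
use warnings;

# Each case: pattern, text to test, expected result (1 = match, 0 = no match).
my @cases = (
    [ 'ab{2,4}c',  'xabbbcx',    1 ],
    [ 'ab{2,4}c',  'abc',        0 ],
    [ 'ab?c',      'ac',         1 ],
    [ 'a.c',       "a\nc",       0 ],  # . does not match a newline
    [ 'a\.c',      'abc',        0 ],  # \. matches only a literal dot
    [ '^abc|abc$', 'xxabc',      1 ],
    [ '[^abc]+',   'abc',        0 ],
    [ 'abc\b',     'abcd',       0 ],
    [ 'perl\B',    'perlert',    1 ],
    [ 'perl\B',    'perl stuff', 0 ],
);

# Run every case, print the outcome, and return the number of surprises.
sub run_cases {
    my $failures = 0;
    for my $case (@cases) {
        my ($pattern, $text, $expected) = @$case;
        my $got = ($text =~ /$pattern/) ? 1 : 0;
        $failures++ if $got != $expected;
        (my $shown = $text) =~ s/\n/\\n/g;  # make the newline visible
        printf "%-10s %-14s %s\n", $pattern, "'$shown'", $got ? 'match' : 'no match';
    }
    return $failures;
}

my $failures = run_cases();
print "unexpected results: $failures\n";
```

Adding rows to @cases is a cheap way to test your understanding of a pattern before using it in real code.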






3 comments to Perl Crash Course: Basic Regular Expressions

  • Dave Doyle

    Very nice. Although if I can contribute: You may want to make some distinctions regarding the ‘.’ since, without any modifiers to the regular expression (which you do not yet discuss), it will not match “\n”.

  • Dave Doyle

    Whoops. I see you actually did mention the ‘.’ not matching newlines in the examples but the line “. Match any character” doesn’t. That’s the one I meant.
