I've been reading about finite state machines lately, in order to better understand how lexical transducers works. It's gotten me thinking about string algorithms and the nitty gritty details of regular expressions (regexes). Since I've been doing some Perl hacking lately, I picked up a copy of Mastering Algorithms with Perl and started reading the chapter on string algorithms. It's got some practical advice about optimizing regular expressions:
* consider anchoring matches if applicable
* avoid alternation with the vertical bar
* avoid needless repetition quantifiers
* if you use alternation, combine the zero-width positive lookahead assertion with a character class
* leading or trailing wildcards are unnecessary
Wise words, indeed. But to understand the "why" behind these tips, you've got to understand how regexes are implemented. Basically, regexes are grounded in the concept of a finite state machine.
A finite state machine (FSM) has a finite number of states, including a start and end state, with transitions from one to another. An FSM performs matches by walking through a string one character at a time and changing the state of the finite machine as it goes alone. If you look for a pattern with an anchor (say, at the beginning of a line), the finite state machine will find that it cannot transitions through its states immediately as it processes each line instead of having to go through nearly all of the characters. If the lines in a text are short, it's not as much of a problem, but if they're long, performance will suffer with the unanchored regex.
But here's where things get interesting. Most regex engines go beyond strict finite machines in their implementation, as the authors of the Perl algorithms book observe:
"Perl's regular exprssions aren't, strictly speaking, regular. They're "superregular"—they include tricks that can't be implemented with the theoretical basis of regular expressions, a deterministic finite automaton [...] One of these tricks is backreferences: \1, \2. Strict regular expressions would not know how to refer back to what has been matched they have no memory of what they have seen."
Okay, but why are these backreferences necessary? One linguistic example springs to mind: reduplication ("a morphological process by which the root or stem of a word, or only part of it, is repeated"). You can try searching for example of reduplication in the Mayan language Tzeltal (also spelled 'Tseltal') using Perl regexes. With the permission of the author, Brian Stross, I put up a regular expression search interface to two text collections: Demons and Monsters and Love in the Armpit.
Let's say you wanted to search one of the texts for all of the CVC words (words of the pattern consonant-vowel-consonant). You could try this regex (where the pattern occurs between word boundaries, indicated by \b, and vowels and consonants are treated as character classes, with a non-vowel, non-whitespace character class for consonants):
\b([^aeiou\W][aeiou][^aeiou\W])\b
You'll get matches for words like puy 'snail' or sok 'with'. But let's say you wanted to find words that have a repeated syllable (e.g., jimjim 'whirling'). You could try using this regex (which uses the backreference \1 to refer back to whatever match is found for the part of the regex between parentheses):
\b([^aeiou\W][aeiou][^aeiou\W])\1\b
Of course, as an alternative, you could try finding reduplication in texts the old-fashioned way, by hand, toiling for hours. But regexes are a whole lot easier.