There Is No Such Thing As The Regex
By Artyom Bologov
I have a new challenge for myself: solving RosettaCode only using ed
scripts.
Not even sed
, plain old ed
.
This makes me try to run as many versions of this venerable editor.
Or at least read the manuals—it's hard to get BSD/Mac ed while on Guix.
Several things I noticed:
- CLI flags differ dramatically between versions, both in meaning and quantity, with 10 for GNU ed and 1 for Plan 9 ed.
- Command lists are slightly incompatible too—GNU ed
z
command isb
in Plan 9 ed, for example. - And the regex syntax is not really portable, with many essential features missing from some versions.
I'm saying "essential", but who's to decide?
Apparently, GNU, BSD, MacOS, and Plan 9 ed
maintainers chose their regex version consciously.
And, apparently, they thought their version is The Regex.
But there's no such thing as The Regex. There are only heaps of slightly incompatible syntaxes and extensions to the basic idea. And there's no basic idea either, as I'll try to show.
I'll use the chrestomathic example: parsing email addresses. Don't expect the RFC-compliant pattern, rather let's simplify this to essentials. Email address comes in three parts:
- A user name with almost anything in it.
- And an alphanumeric and dashed domain name with periods.
- Separated by an at sign.
This should be enough to showcase the features of most dialects. But (here's a spoiler) it might be too simple for some regex to shine.
Regular Languages and Kleene Star
So they wanted to describe languages in the middle of previous century. The definition from Wikipedia is: "[Regular language] is a formal language that can be defined by a regular expression." Not making things clearer. Luckily, Wikipedia defines it more rigorously below. Here's my summary:
- Ø
- Empty language, nothingness.
- a
- Some symbol.
- {a...}
- A set of symbols.
- X*
- Something with Kleene star, repeated zero or more times.
- ∪ and •
- Union and concatenation, set operations on sub-languages.
Quite different from the popular idea of regex as we know it, right? The only comforting resemblance is Kleene star. Let's try to define a regular expression describing emails in it.
Notice that these ellipses are just me being lazy and not wanting to list out every possible character. Because that's what happens when you mess with math people. No mercy.
"Regular expressions" as they initially were... look quite alien. But they'll fix it in the course of history, right?
Thompson & Basic POSIX/UNIX Regex
Ken Thompson fixed regex and turned them into something we're familliar with.
The transitionary state can be seen in the
Regular Expression Search Algorithm paper.
There's still • operator, even if used internally.
∪ turns into a vertical bar |
.
And Kleene star remains as is.
That's it really.
What followed is so-called Basic Regular Expressions. Notable additions:
- Period
.
as the placeholder for any single char. - Character classes delimited by square brackets.
-
^
for start of line and$
for end of line. - Curly braces for "match N to M times" scenario.
- Parentheses for regex grouping.
- Kleene star is stil there, of course.
One problematic thing is that one has to mark braces and parentheses by backslash. To highlight their special status. What about square brackets? No backslash. Inconsistent and error-prone.
Here's how POSIX Basic Regular Expressions version of email address might look like:
A lot of backslashes, ugh.
It's only intuitive that one might want less backslashes.
One superpower of BRE, though: backreferences!
\n
(backslash with a number) means referencing the previous regex group.
The parenthesized one.
Most of the flawors below lost this ability for some reason.
A shame.
Plan 9 Regex
These are closer to initial Thompson ones, but add some quality of life features:
-
+
for "one or more match". -
?
for "one or zero matches". - No backslashes!
What Plan 9 regex miss is curly braces and spelled-out char classes. Yes, no quantified matches and convenient Unicode-aware matching. So overall Plan 9 regex are a syntactic evolution over BRE, but they miss a lot of semantic necessities. For whatever reason.
Extended POSIX Regex
The synthesis (not necessarily historical, but rather ideological) of BREs and Plan 9: ERE. Extended Regular Expressions. No backslashes, curly braces are back, as are char classes. That's where we get closer to conventional regex as we know them:
GNU & Emacs Regex
GNU regex (as present in GNU ed, which I'm currently using, and in GNU Emacs as the main dissemination source) are extending over POSIX ones. Mainly via backslash sequences. Lots of them. Notable highlights:
- \w
- For any word (alphanumeric and underscore) char.
- \W
- For any NON-word char.
- \b
- For word boundary.
- \B
- For inside-word char boundary.
- \s
- For spaces.
- \S
- For NON-spaces.
You probably got the idea—some letter for char class, and its capital version for inversion or special case. No, wait, there are lots of special cases. I'm not even sure what's the heuristic. But still, lots of special sequences for fun and profit:
Our email pattern gets shorter and shorter, which is a win. (Ignore the fact that I'm excluding dashes from domain names now.) But it's also confined to GNU software, which sucks. Because there is other software. Like...
SQL & Postgres
SQL has these weird LIKE
patterns:
But these are unrelated to regex, right? They look so alien, after all. Unfortunately, there are systems where this alien syntax is valid regex. Like Postgres.
Postgres regex, as they call it, is a frivolous mix of POSIX and SQL LIKE.
Period (.
), our trusted friend, is missing in Postgres syntax, and is replaced by underscore (_
).
A soul-calming .*
sequence is %
instead.
The rest is plain POSIX EREs.
Scheme Regex (SRFI 115)
Scheme (and Lisps in general) are famous for the Domain-Specific Languages. You can build a subset of a language that maps 1-to-1 to the problem domain. Which is a blessing, isn't it? But this also gives rise to really complex and Turing-complete languages one can conjure. Like Scheme Regular Expressions (SRE) described in SRFI 115.
These encode regular expressions as parenthesised lists. The patterns form trees of expressions that are parseable but somewhat verbose. And, I bet, they don't look like the idea of The Regex:
Here's your portion of parentheses disgust for today. No need to thank me.
Perl Regex
I'll skip this section, because I'm unfamiliar with Perl. But what I know is that Perl regex/string processing it Turing-complete. They basically are Extended Extended Extended POSIX regex. Pretty with the basics, yet syntactically horrifying with advanced things (like lookaheads and recursive matching.)
I'm not listing an email example here, because it's going to look the same as GNU one. See? No longer a useful litmus test!
Modern Regex: Python, JS, .NET
Most modern languages and runtimes have their own regex libraries/syntax. Some (JS) even go as far as having a dedicated language syntax for regex literals. Given the ubiquity of regex, they must be similar, transferable, and still powerful, right? Only the latter is true.
See this StackOverflow question for a staggering showcase of the syntaxes modern RE systems boast. I'm linking to it in part because my example with email is no longer useful. Most modern RE systems will have something similar(ly short) to Perl/GNU regex.
Post Modern Regex: Any Pattern-Matching Language?
What about PEG? Parsing Expression Grammar, I mean. The thing specifying parsers. A parser for email might read:
Which looks almost like POSIX ERE, right? Yet it's called PEG. So one has two options here:
- Say that regexes are what looks like ones and embrace PEG.
- Say that regexes are what's called that (including Scheme Regex) and reject PEG regex-yness.
Either way, regex proves to be a complicated concept with no set syntax/definition.
And any other pattern-matching language has a chance to be called regex, really.
Be it scanf
patterns, BNF/yacc parsers, or templating engines.
The Regex, Really?
Given the variety of syntaxes listed in this post, there seem to be no basic properties to regex.
The only thing making appearance everywhere is Kleene star.
The only one.
Other "obvious" and "basic" things like .
and {}
are often missing or re-imagined.
So, is there The Regex?
No.
A longer answer would be:
There's no definitive spec/version/implementation. Some implementations, like Perl and GNU, are more successful/useful/influential. Some syntactic features appear more than the others. But there's no consensus and there are lots of weird outliers.Use whatever you like and don't let anyone tell you it is not regex. (Just remember to mention which syntax/implementation you use when sharing your sed one-liners with people.)