Andys Extended Regular Expressions

Download from http://www.nyangau.org/ere/ere.zip.

This document describes what ERE features are supported, and with what syntax, by Andys Extended Regular Expression Handler. This code module is used by programs such as Andys Source Code Folding Editor and Andys Binary Folding Editor.

The source to this module is available for you to use in your own programs, should you wish to do so.

The level of functionality is most closely modelled on EREs as supported by the UNIX egrep and awk programs. Some enhancements introduced by POSIX, those typically found in UNIX ex and vi, and most found in GNU software are supported too.

Here follows an explanation of supported features, although the hard core user might prefer to go straight to the BNF definition :-

Null ERE

The empty or null ERE is a valid ERE. It matches nothing, successfully.

Concatenation

Two EREs, when written one after another form a new ERE which matches whatever the first did, followed by whatever the second did. Obvious really.

Basic characters

Most characters in an ERE can specified just by typing them.

abc    is the ERE which matches a then b then c

Certain characters have special meanings, and may need to be 'escaped' by preceeding them with a backslash.

1\*2    is the ERE which matches 1 then * then 2

Backslash can be used to describe characters which are difficult to type :-

\n    is a way of saying newline           ASCII 10
\t    is a way of saying tab               ASCII  9
\r    is a way of saying carriage return   ASCII 13
\b    is a way of saying backspace         ASCII  8
\f    is a way of saying formfeed
\e    is a way of saying escape            ASCII 27
\x07  is a way of saying bel               ASCII  7

Note: Support for this notation is an extension over the normal ERE support provided by most programs.

In most applications that use EREs, the searching is done line by line. So the line itself won't actually include any newline or carriage return characters, thus rendering \n and \r useless. Programs could in theory offer multi-line searching.

How do you get a real backslash in an ERE - easy, escape it with a backslash :-

c:\\config.sys   matches that most famous of DOS files

This way of specifying characters starting with a backslash is pretty much the C and C++ way of doing things. Strictly speaking, according to POSIX, if 0x2e was put in an ERE it should mean 'any character' (ASCII character 0x2e is '.', which means any character, see later), but this code treats it as match a full stop.

Sometimes the backslash is used to introduce a special ERE feature. For example \< means 'start of word' (see later). The character after the backslash is generally chosen to something that you wouldn't need to 'escape', so this syntax shouldn't cause any problems. It does mean you shouldn't generally escape a character with a backslash if you don't really need to.

Beware EBCDIC systems: The \xHH notation will define the character with EBCDIC numerical representation HH hex.

Bracket elements

The name 'bracket element' is something I've invented to describe those things which identify characters or sets of characters within bracket expressions.

A bracket element can be just a basic character.

It can be a range of characters, which is written as follows :-

a-z    is the set of characters in the range 'a' to 'z'
a-j-t  is the set of characters in the range 'a' to 't'

It can be the set of character which pass a given test. ie: it can be a POSIX character class :-

[:alnum:]   means all alphanumeric characters
[:lower:]   means all lowercase characters
... etc.

You might think that [:alnum:] is the same as A-Za-z0-9, and it is - providing you are using the ASCII character set. POSIX introduces these named specifications in an attempt to cover mulitple character sets (such as EBCDIC).

POSIX collating symbols and POSIX equivelence classes are not supported. The code recognises when these are used and generates a suitable error message.

When using character ranges on an EBCDIC system, an error will result if you try to define ranges not entirely contained within the A-Z, a-z or 0-9 intervals. Characters in these intervals are considered to follow on from each other (eg: 'A' then 'B' etc.), despite there being discontiguities in their EBCDIC numerical values.

Bracket expressions

A bracket expression contains a set of bracket elements, and these define the set of acceptable characters, or possibly the set of non-acceptable characters.

[abc]   matches any one character which can be a, b or c
[^abc]  matches any one character which isn't a, b or c

How do you get a -, a ^ or a ] in a bracket element or bracket expression? Easy, just escape it with a backslash.

Note: Most ERE handlers allow you to make the first character in the bracket expression a - or ], and not to treat them as 'range' or 'end of bracket expression'. This code however, requires you to escape these characters with a blackslash.

[\])}]  matches ] or ) or }

Shorthands

The GNU people defined the term 'word constituent character' to include any alphanumeric character, or the underscore _ character. \w is a shorthand for any word consitiuent character, and \W for any non word constituent character.

\d is a shorthand for any digit, and \D for any non-digit.

\s is a shorthand for any white-space, and \S for any non-white-space.

\w  is thus equivelent to [[:alnum:]_] or [A-Za-z0-9_]
\W  is thus equivelent to [^[:alnum:]_] or [^A-Za-z0-9_]
\d  is thus equivelent to [[:digit:]_] or [0-9]
\D  is thus equivelent to [^[:digit:]_] or [^0-9]
\s  is thus equivelent to [[:space:]_] or [ \t]
\S  is thus equivelent to [^[:space:]_] or [^ \t]

Just saves some typing I suppose.

These shorthands aren't valid within bracket expressions.

Any character

Sometimes you want to be able to match any character, without caring what it actually is. The dot . character means exactly that :-

a.b   matches a, followed by any character, followed by b

Letters and dots used together make an excellent crossword cheat tool.

Not a character

A Microsoft extension to EREs is to allow you to say 'any character except this character' by preceeding it with a tilde ~.

a~bc  is the ERE which matches a, then not b, then c
      this might match aXc, aYc, but not abc

The only merit in this ERE extension is that it is shorter to type than the equivelent bracket expression :-

~a   is shorter to type than [^a]

Anchors

The following 'anchors' are supported. If the specific condition is satisfied they 'match' the zero length string where they are encountered. Put another way, they match nothing and don't consume any characters in the string being searched.

^       matches if we are at the start of the string
$       matches if we are at the end of the string
\`      GNU alternative way of writing ^
\'      GNU alternative way of writing $
\<      matches if we are at the start of a word (or the whole string)
\>      matches if we are at the end of a word (or the whole string)
\B      matches if we are within a word
\y      matches if we are at the start or end of a word (or whole string)

Here 'word' is according to the GNU definition as given above.

These anchors allow you write useful EREs like :-

^xyz     xyz, only if its at the start of the line
\<fred   which matches fred, freddy but not alfred

You can also write silly EREs like :-

x$y  which can't ever match
     because the end of the string can't be followed by anything!

Repetitions

EREs may match repetitions of things. For example :-

ab?c      matches a, then zero or one occurrance of b, then c
ab+c      matches a, then one or more occurrances of b, then c
ab*c      matches a, then zero or more occurrances of b, then c
ab{M}c    matches a, then M occurrances of b, then c
ab{M,}c   matches a, then M or more occurrances of b, then c
ab{M,N}c  matches a, then between M and N occurrances of b, then c

In the above, M and N are numbers given in decimal, where if both are given, M must be <= N. If too large numbers are used, the resulting ERE may be too complex for the ERE handler to process.

Some useful examples of repetitions :-

[A-Za-z_][A-Za-z0-9_]*  matches any legal C or C++ identifier
\w{10,}                 at least 10 word constituent characters
\<[0-9]{5}\>            a 5 digit number

It doesn't make sense to repeat the null ERE, you can match nothing as many times as you like!

Alternation

You can search for one thing or another using the | symbol :-

fred|bill      matches fred or bill
fred|bill|rob  matches any one of these 3 names

Remember, to get a real pipe-bar | symbol, you'll need to escape it with a backslash.

Nested EREs

Brackets may be used to group EREs into sub-EREs, so that operators such as the repetitions or alternation operator may be applied to them.

(frog|toad)+  matches one or more occurance frog or toad

Nesting may be performed several levels deep.

Note: Normal non-Extended Regular Expressions, as used by UNIX grep and sed, typically require you to identify sub-REs using backslash escaped brackets. EREs however, use the brackets on their own to delimit sub-EREs. Backslash escaped brackets match bracket characters in the string.

Backreferences

Every time a nested ERE is matched, what it matched is recorded. You can do a 'backreference' later in the main ERE which essentially says that whatever was matched before must be matched again.

Upto the first 9 nested EREs may be backreferenced.

(a|b)(c|d)\2\1  will match acca, or adda or bccb or bddb

The \2 says the second nested ERE must be matched again.

Backreferences can be done within nested EREs to backreference nested nested EREs.

frog|((a|bc)d\1)  matches frog or ada or bcdbc

Note that in the above example, adbc doesn't match. Backreferences are literal. It is a repeat of what was matched, not the nested EREs which matched it.

Programs such as grep and vi use $ and $ to delimit backreference-able parts of a regular expression. In EREs, this notation denotes real brackets, as ( and ) is already taken to mean nested ERE.

Substitutions

Sometimes EREs are used to find or match information, and sometimes they are used as the 'find' part of a 'find and replace' operation. Programs providing 'find and replace' are provided with the extent of each match, and also the extents of the top level sub-EREs within them. So a replacement pattern can easily refer to these, (typically by using the backreference \N notation in the replacement string). eg:

find                   ([0-9]+),"(~"*)"
replace-with           set \2 = \1
given-input            23,"Age"         
gives-output           set Age = 23

Consider using the ERE (.+)(.+) against the string abcd. Clearly there will be a match of length 4 characters, as an ERE matches the longest string it can. What isn't clear is what \1 and \2 are in the substitution, afterwards. eg: We could have any one of :-

\1 = a    \2 = bcd
\1 = ab   \2 = cd
\1 = abc  \2 = d

Which you get is undefined in this implementation of EREs.

Upto the first 9 nested EREs may be referenced.

BNF definition

Some basics just before we start :-

<ch>      ::= any typable character
<digit>   ::= any digit from '0' to '9'
<digits>  ::= <digit> { <digit> }
<digit19> ::= any digit from '1' to '9'
<hex>     ::= any digit or letter from 'a' to 'f' or 'A' to 'F'

Basic character definitions :-

<c> ::= '\n'                    newline
      | '\t'                    tab
      | '\r'                    carriage return
      | '\b'                    backspace
      | '\f'                    formfeed
      | '\e'                    escape
      | '\x' <hex> ( <hex> )    specify character in hex, one or two hex digits
      | '\' <ch>                other character, escaping any special meaning
      | <ch>                    normal character

Bracket element (something which goes in a bracket expression) :-

<be> ::= <c>                    a character
       | <c> '-' <c>            a range of characters
       | '[:alnum:]'            POSIX alphanumeric characters
       | '[:alpha:]'            POSIX alphabetic characters
       | '[:blank:]'            POSIX space and tab characters
       | '[:cntrl:]'            POSIX control characters
       | '[:digit:]'            POSIX numeric characters
       | '[:graph:]'            POSIX printable and visible (non-space) chars
       | '[:lower:]'            POSIX lowercase characters
       | '[:print:]'            POSIX alphanumeric characters
       | '[:punct:]'            POSIX punctuation characters
       | '[:space:]'            POSIX whitespace characters
       | '[:upper:]'            POSIX uppercase characters
       | '[:xdigit:]'           POSIX hexadecimal digits

       | '[.' ??? '.]'          POSIX collating symbols and
       | '[=' ??? '=]'          POSIX equivelence classes
                                ARE NOT SUPPORTED

Bracket expression :-

<bx> ::= [ '^' ] { <be> }       defines a set of acceptable characters
                                or a set of non-acceptable (if '^' present)

Extended Regular Expression :-

<re> ::=                        empty regular expression
       | <c>                    character
       | '~' <c>                not specified character
                                shorthand for '[^' <c> ']'
       | '\w'                   matches any 'word consituent' character
                                shorthand for '[[:alnum:]_]'
       | '\W'                   matches any non 'word consituent' character
                                shorthand for '[^[:alnum:]_]'
       | '.'                    matches any character (but not end of line)
       | '[' <bx> ']'           matches characters in the bracket expression
       | '^'                    matches empty string at the start of the 'line'
       | '$'                    matches empty string at the end of the 'line'
       | '\`'                   synonym for '^'
       | '\''                   synonym for '$'
       | '\<'                   matches empty string at the start of a 'word'
       | '\>'                   matches empty string at the end of a 'word'
       | '\B'                   matches empty string within 'word'
       | '\y'                   matches empty string at start or end of 'word'
                                shorthand for '(\<|\>)'
                                note: not '\b', as this clashes with backspace
       | <re> <re>              2 <re>'s concatenated form a <re>
       | '(' <re> ')'           nested regular expression
       | '\' <digit19>          backreference to nested regular expression
       | <re> '?'               zero or one occurrance of <re>
       | <re> '+'               one or more occurrances of <re>
       | <re> '*'               zero or more occurrances of <re>
       | <re> '{' <digits> '}'  matches M occurances of <re>
       | <re> '{' <digits> ',}' matches at least M occurances of <re>
       | <re> '{' <digits> ',' <digits> '}'
                                matches between M and N occurances of <re>
       | <re> '|' <re>          matches one <re> or the other

Versions of ERE Handler

The original version of this code was written in 1990 and it supported a subset of the features listed in this document. Despite its complexity, no bugs were ever found in this initial version.

Since then, the POSIX guys starting drawing up specifications, various tools using EREs added some handy extensions, and the GNU guys added quite a few useful extras.

In June 1998, the ERE Handler was significantly improved, introducing support for :-

Escaped specification of awkward characters (such as \t).
POSIX character classes (such as [:digit:]).
POSIX collating symbols (they are explicitly disallowed).
POSIX equivelence classes (they are explicitly disallowed).
GNU style word constituent and non word constituent characters (ie: \w and \W).
MS style 'not a character' (as in ~x).
Anchors in positions other than the start or end of an ERE.
GNU style aliases for start and end of line (ie: \` and \' meaning ^ and $).
UNIX ex and vi and GNU style start and end of a word anchors (as in \< and \>).
GNU style within word and edge of word anchors (as in \B and \y).
Counted repetitions (via re{M,N} syntax).
Support for backreferences (at multiple levels too).
Relaxed rules as to what constitutes a valid ERE, better error strings.
More complicated EREs (ie: size of ERE, levels of nesting etc.).
Faster searching (although not as fast as GNU regex etc.).

Support for \d, \D, \s and \S was added in August 2002.

A bug with more than 9 backreferences was fixed in November 2004.

If you use a program using the ERE Handler, which pre-dates this version, it is worth re-acquiring a later version to pick up these features.

Copying

This ERE module, including its source code, are public domain. Caveat Emptor.

This documentation is written by the ERE Handler author, Andy Key

andy.z.key@googlemail.com