Download from
http://www.nyangau.org/ere/ere.zip
.
This document describes what ERE features are supported, and with what syntax, by Andys Extended Regular Expression Handler. This code module is used by programs such as Andys Source Code Folding Editor and Andys Binary Folding Editor.
The source to this module is available for you to use in your own programs, should you wish to do so.
The level of functionality is most closely modelled on EREs as supported
by the UNIX egrep
and awk
programs.
Some enhancements introduced by POSIX, those typically found in UNIX
ex
and vi
, and most found in GNU software are
supported too.
Here follows an explanation of supported features, although the hard core user might prefer to go straight to the BNF definition :-
The empty or null ERE is a valid ERE. It matches nothing, successfully.
Two EREs, when written one after another form a new ERE which matches whatever the first did, followed by whatever the second did. Obvious really.
Most characters in an ERE can specified just by typing them.
abc is the ERE which matches a then b then c
Certain characters have special meanings, and may need to be 'escaped' by preceeding them with a backslash.
1\*2 is the ERE which matches 1 then * then 2
Backslash can be used to describe characters which are difficult to type :-
\n is a way of saying newline ASCII 10 \t is a way of saying tab ASCII 9 \r is a way of saying carriage return ASCII 13 \b is a way of saying backspace ASCII 8 \f is a way of saying formfeed \e is a way of saying escape ASCII 27 \x07 is a way of saying bel ASCII 7
Note: Support for this notation is an extension over the normal ERE support provided by most programs.
In most applications that use EREs, the searching is done line by line.
So the line itself won't actually include any newline or carriage return
characters, thus rendering \n
and \r
useless.
Programs could in theory offer multi-line searching.
How do you get a real backslash in an ERE - easy, escape it with a backslash :-
c:\\config.sys matches that most famous of DOS files
This way of specifying characters starting with a backslash is pretty
much the C and C++ way of doing things.
Strictly speaking, according to POSIX, if 0x2e
was put in an ERE
it should mean 'any character' (ASCII character 0x2e is '.', which means any
character, see later), but this code treats it as match a full stop.
Sometimes the backslash is used to introduce a special ERE feature.
For example \<
means 'start of word' (see later).
The character after the backslash is generally chosen to something that
you wouldn't need to 'escape', so this syntax shouldn't cause any
problems.
It does mean you shouldn't generally escape a character with a backslash
if you don't really need to.
Beware EBCDIC systems: The \xHH
notation will define the
character with EBCDIC numerical representation HH hex.
The name 'bracket element' is something I've invented to describe those things which identify characters or sets of characters within bracket expressions.
A bracket element can be just a basic character.
It can be a range of characters, which is written as follows :-
a-z is the set of characters in the range 'a' to 'z' a-j-t is the set of characters in the range 'a' to 't'
It can be the set of character which pass a given test. ie: it can be a POSIX character class :-
[:alnum:] means all alphanumeric characters [:lower:] means all lowercase characters ... etc.
You might think that [:alnum:]
is the same as
A-Za-z0-9
, and it is - providing you are using the ASCII
character set.
POSIX introduces these named specifications in an attempt to cover
mulitple character sets (such as EBCDIC).
POSIX collating symbols and POSIX equivelence classes are not supported. The code recognises when these are used and generates a suitable error message.
When using character ranges on an EBCDIC system, an error will result if
you try to define ranges not entirely contained within the A-Z
,
a-z
or 0-9
intervals.
Characters in these intervals are considered to follow on from each other
(eg: 'A' then 'B' etc.), despite there being discontiguities in their EBCDIC
numerical values.
A bracket expression contains a set of bracket elements, and these define the set of acceptable characters, or possibly the set of non-acceptable characters.
[abc] matches any one character which can be a, b or c [^abc] matches any one character which isn't a, b or c
How do you get a -
, a ^
or a ]
in a
bracket element or bracket expression?
Easy, just escape it with a backslash.
Note: Most ERE handlers allow you to make the first character
in the bracket expression a -
or ]
, and not to
treat them as 'range' or 'end of bracket expression'.
This code however, requires you to escape these characters with a
blackslash.
[\])}] matches ] or ) or }
The GNU people defined the term 'word constituent character' to
include any alphanumeric character, or the underscore _
character.
\w
is a shorthand for any word consitiuent character, and
\W
for any non word constituent character.
\d
is a shorthand for any digit, and \D
for any
non-digit.
\s
is a shorthand for any white-space, and \S
for
any non-white-space.
\w is thus equivelent to [[:alnum:]_] or [A-Za-z0-9_] \W is thus equivelent to [^[:alnum:]_] or [^A-Za-z0-9_] \d is thus equivelent to [[:digit:]_] or [0-9] \D is thus equivelent to [^[:digit:]_] or [^0-9] \s is thus equivelent to [[:space:]_] or [ \t] \S is thus equivelent to [^[:space:]_] or [^ \t]
Just saves some typing I suppose.
These shorthands aren't valid within bracket expressions.
Sometimes you want to be able to match any character, without caring
what it actually is.
The dot .
character means exactly that :-
a.b matches a, followed by any character, followed by b
Letters and dots used together make an excellent crossword cheat tool.
A Microsoft extension to EREs is to allow you to say
'any character except this character' by preceeding it with a tilde
~
.
a~bc is the ERE which matches a, then not b, then c this might match aXc, aYc, but not abc
The only merit in this ERE extension is that it is shorter to type than the equivelent bracket expression :-
~a is shorter to type than [^a]
The following 'anchors' are supported. If the specific condition is satisfied they 'match' the zero length string where they are encountered. Put another way, they match nothing and don't consume any characters in the string being searched.
^ matches if we are at the start of the string $ matches if we are at the end of the string \` GNU alternative way of writing ^ \' GNU alternative way of writing $ \< matches if we are at the start of a word (or the whole string) \> matches if we are at the end of a word (or the whole string) \B matches if we are within a word \y matches if we are at the start or end of a word (or whole string)
Here 'word' is according to the GNU definition as given above.
These anchors allow you write useful EREs like :-
^xyz xyz, only if its at the start of the line \<fred which matches fred, freddy but not alfred
You can also write silly EREs like :-
x$y which can't ever match because the end of the string can't be followed by anything!
EREs may match repetitions of things. For example :-
ab?c matches a, then zero or one occurrance of b, then c ab+c matches a, then one or more occurrances of b, then c ab*c matches a, then zero or more occurrances of b, then c ab{M}c matches a, then M occurrances of b, then c ab{M,}c matches a, then M or more occurrances of b, then c ab{M,N}c matches a, then between M and N occurrances of b, then c
In the above, M and N are numbers given in decimal, where if both are given, M must be <= N. If too large numbers are used, the resulting ERE may be too complex for the ERE handler to process.
Some useful examples of repetitions :-
[A-Za-z_][A-Za-z0-9_]* matches any legal C or C++ identifier \w{10,} at least 10 word constituent characters \<[0-9]{5}\> a 5 digit number
It doesn't make sense to repeat the null ERE, you can match nothing as many times as you like!
You can search for one thing or another using the |
symbol :-
fred|bill matches fred or bill fred|bill|rob matches any one of these 3 names
Remember, to get a real pipe-bar |
symbol, you'll need to
escape it with a backslash.
Brackets may be used to group EREs into sub-EREs, so that operators such as the repetitions or alternation operator may be applied to them.
(frog|toad)+ matches one or more occurance frog or toad
Nesting may be performed several levels deep.
Note: Normal non-Extended Regular Expressions,
as used by UNIX grep
and sed
,
typically require you to identify sub-REs using backslash escaped brackets.
EREs however, use the brackets on their own to delimit sub-EREs.
Backslash escaped brackets match bracket characters in the string.
Every time a nested ERE is matched, what it matched is recorded. You can do a 'backreference' later in the main ERE which essentially says that whatever was matched before must be matched again.
Upto the first 9 nested EREs may be backreferenced.
(a|b)(c|d)\2\1 will match acca, or adda or bccb or bddb
The \2
says the second nested ERE must be matched again.
Backreferences can be done within nested EREs to backreference nested nested EREs.
frog|((a|bc)d\1) matches frog or ada or bcdbc
Note that in the above example, adbc
doesn't match.
Backreferences are literal.
It is a repeat of what was matched, not the nested EREs which matched it.
Programs such as grep
and vi
use
\(
and \)
to delimit backreference-able parts of
a regular expression.
In EREs, this notation denotes real brackets, as (
and
)
is already taken to mean nested ERE.
\N
notation in the replacement string).
eg:
find ([0-9]+),"(~"*)" replace-with set \2 = \1 given-input 23,"Age" gives-output set Age = 23
Consider using the ERE (.+)(.+)
against the string
abcd
.
Clearly there will be a match of length 4 characters, as an ERE matches
the longest string it can.
What isn't clear is what \1
and \2
are in the
substitution, afterwards.
eg: We could have any one of :-
\1 = a \2 = bcd \1 = ab \2 = cd \1 = abc \2 = d
Which you get is undefined in this implementation of EREs.
Upto the first 9 nested EREs may be referenced.
BNF definition
Some basics just before we start :-
<ch> ::= any typable character <digit> ::= any digit from '0' to '9' <digits> ::= <digit> { <digit> } <digit19> ::= any digit from '1' to '9' <hex> ::= any digit or letter from 'a' to 'f' or 'A' to 'F'
Basic character definitions :-
<c> ::= '\n' newline | '\t' tab | '\r' carriage return | '\b' backspace | '\f' formfeed | '\e' escape | '\x' <hex> ( <hex> ) specify character in hex, one or two hex digits | '\' <ch> other character, escaping any special meaning | <ch> normal character
Bracket element (something which goes in a bracket expression) :-
<be> ::= <c> a character | <c> '-' <c> a range of characters | '[:alnum:]' POSIX alphanumeric characters | '[:alpha:]' POSIX alphabetic characters | '[:blank:]' POSIX space and tab characters | '[:cntrl:]' POSIX control characters | '[:digit:]' POSIX numeric characters | '[:graph:]' POSIX printable and visible (non-space) chars | '[:lower:]' POSIX lowercase characters | '[:print:]' POSIX alphanumeric characters | '[:punct:]' POSIX punctuation characters | '[:space:]' POSIX whitespace characters | '[:upper:]' POSIX uppercase characters | '[:xdigit:]' POSIX hexadecimal digits | '[.' ??? '.]' POSIX collating symbols and | '[=' ??? '=]' POSIX equivelence classes ARE NOT SUPPORTED
Bracket expression :-
<bx> ::= [ '^' ] { <be> } defines a set of acceptable characters or a set of non-acceptable (if '^' present)
Extended Regular Expression :-
<re> ::= empty regular expression | <c> character | '~' <c> not specified character shorthand for '[^' <c> ']' | '\w' matches any 'word consituent' character shorthand for '[[:alnum:]_]' | '\W' matches any non 'word consituent' character shorthand for '[^[:alnum:]_]' | '.' matches any character (but not end of line) | '[' <bx> ']' matches characters in the bracket expression | '^' matches empty string at the start of the 'line' | '$' matches empty string at the end of the 'line' | '\`' synonym for '^' | '\'' synonym for '$' | '\<' matches empty string at the start of a 'word' | '\>' matches empty string at the end of a 'word' | '\B' matches empty string within 'word' | '\y' matches empty string at start or end of 'word' shorthand for '(\<|\>)' note: not '\b', as this clashes with backspace | <re> <re> 2 <re>'s concatenated form a <re> | '(' <re> ')' nested regular expression | '\' <digit19> backreference to nested regular expression | <re> '?' zero or one occurrance of <re> | <re> '+' one or more occurrances of <re> | <re> '*' zero or more occurrances of <re> | <re> '{' <digits> '}' matches M occurances of <re> | <re> '{' <digits> ',}' matches at least M occurances of <re> | <re> '{' <digits> ',' <digits> '}' matches between M and N occurances of <re> | <re> '|' <re> matches one <re> or the other
The original version of this code was written in 1990 and it supported a subset of the features listed in this document. Despite its complexity, no bugs were ever found in this initial version.
Since then, the POSIX guys starting drawing up specifications, various tools using EREs added some handy extensions, and the GNU guys added quite a few useful extras.
In June 1998, the ERE Handler was significantly improved, introducing support for :-
\t
).
[:digit:]
).
\w
and \W
).
~x
).
\`
and \'
meaning
^
and $
).
ex
and vi
and GNU style start and end of
a word anchors (as in \<
and \>
).
\B
and \y
).
re{M,N}
syntax).
regex
etc.).
Support for \d
, \D
, \s
and
\S
was added in August 2002.
A bug with more than 9 backreferences was fixed in November 2004.
If you use a program using the ERE Handler, which pre-dates this version, it is worth re-acquiring a later version to pick up these features.
This ERE module, including its source code, are public domain. Caveat Emptor.