7.4.1 Regexp Syntax Summary
This is a quick-reference summary of the regular expression syntax
used in Monotone.
Quoting
\x- where x is non-alphanumeric is a literal x
\Q...\E- treat enclosed characters as literal
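As a rough illustration of the quoting rules above, here is a minimal sketch using Python's `re` module. This is an assumption made purely for illustration: Monotone uses PCRE, and Python's `re` is a different engine that does not accept `\Q...\E`, so `re.escape` stands in for it. All variable names are hypothetical.

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
literal_dot = re.compile(r"a\.b")
matches_dot = literal_dot.fullmatch("a.b") is not None
matches_other = literal_dot.fullmatch("axb") is not None

# Stand-in for \Q...\E (assumption: not PCRE itself): escape every
# special character in a string so that it is matched literally.
quoted = re.escape("1+1=2?")
matches_quoted = re.fullmatch(quoted, "1+1=2?") is not None
```

Here the unescaped pattern `a.b` would also match "axb", while the escaped form matches only a literal dot.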
Characters
\a- alarm, that is, the BEL character (hex 07)
\cx- “control-x”, where x is any character
\e- escape (hex 1B)
\f- formfeed (hex 0C)
\n- newline (hex 0A)
\r- carriage return (hex 0D)
\t- tab (hex 09)
\ddd- character with octal code ddd, or backreference
\xhh- character with hex code hh
\x{hhh...}- character with hex code hhh...
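The character escapes above can be tried out with a small sketch. It assumes Python's `re` module as a stand-in for PCRE, and it exercises only the hex, tab and octal forms; the other escapes in the list, such as `\x{hhh...}`, are not exercised here.

```python
import re

# \x41 is "A" (hex 41), \t is a tab, \102 is "B" (octal 102).
pattern = re.compile(r"\x41\t\102")
found = pattern.fullmatch("A\tB") is not None
```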
Character Types
.- any character except newline;
in dotall mode, any character whatsoever
\C- one byte, even in UTF-8 mode (best avoided)
\d- a decimal digit
\D- a character that is not a decimal digit
\h- a horizontal whitespace character
\H- a character that is not a horizontal whitespace character
\p{xx}- a character with the xx property
\P{xx}- a character without the xx property
\R- a newline sequence
\s- a whitespace character
\S- a character that is not a whitespace character
\v- a vertical whitespace character
\V- a character that is not a vertical whitespace character
\w- a “word” character
\W- a “non-word” character
\X- an extended Unicode sequence
‘\d’, ‘\D’, ‘\s’, ‘\S’, ‘\w’, and ‘\W’
recognize only ASCII characters.
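A short sketch of the most common character types follows. It assumes Python's `re` module rather than PCRE; the `re.ASCII` flag is used so that `\d`, `\w` and `\s` recognize only ASCII characters, mirroring the note above. The sample strings are invented for the example.

```python
import re

text = "rev 42, build 7"

# re.ASCII restricts \d, \w and \s to ASCII characters.
numbers = re.findall(r"\d+", text, re.ASCII)
words = re.findall(r"\w+", text, re.ASCII)

# \s+ matches any run of whitespace, here two spaces and a tab.
pieces = re.split(r"\s+", "a  b\tc", flags=re.ASCII)
```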
General category property codes for ‘\p’ and ‘\P’
C- Other
Cc- Control
Cf- Format
Cn- Unassigned
Co- Private use
Cs- Surrogate
L- Letter
Ll- Lower case letter
Lm- Modifier letter
Lo- Other letter
Lt- Title case letter
Lu- Upper case letter
L&- Ll, Lu, or Lt
M- Mark
Mc- Spacing mark
Me- Enclosing mark
Mn- Non-spacing mark
N- Number
Nd- Decimal number
Nl- Letter number
No- Other number
P- Punctuation
Pc- Connector punctuation
Pd- Dash punctuation
Pe- Close punctuation
Pf- Final punctuation
Pi- Initial punctuation
Po- Other punctuation
Ps- Open punctuation
S- Symbol
Sc- Currency symbol
Sk- Modifier symbol
Sm- Mathematical symbol
So- Other symbol
Z- Separator
Zl- Line separator
Zp- Paragraph separator
Zs- Space separator
Script names for ‘\p’ and ‘\P’
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, Buhid,
Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, Cypriot,
Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, Gothic,
Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu,
Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham,
Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, Runic,
Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tamil,
Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
Character Classes
[...]- positive character class
[^...]- negative character class
[x-y]- range (can be used for hex characters)
[[:xxx:]]- positive POSIX named set
[[:^xxx:]]- negative POSIX named set
alnum- alphanumeric
alpha- alphabetic
ascii- 0-127
blank- space or tab
cntrl- control character
digit- decimal digit
graph- printing, excluding space
lower- lower case letter
print- printing, including space
punct- printing, excluding alphanumeric
space- whitespace
upper- upper case letter
word- same as ‘\w’
xdigit- hexadecimal digit
In PCRE, POSIX character set names recognize only ASCII
characters. You can use ‘\Q...\E’ inside a character class.
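Positive and negative classes with ranges can be sketched as follows. The example assumes Python's `re` module instead of PCRE and sticks to plain ranges; the POSIX named sets and `\Q...\E` inside a class are not exercised here.

```python
import re

# Positive class with two ranges: runs of lower-case hex digits.
hex_runs = re.findall(r"[0-9a-f]+", "deadbeef-xyz-c0ffee")

# Negative class: delete everything that is not a lower-case letter.
letters_only = re.sub(r"[^a-z]", "", "a1b2-c3")
```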
Quantifiers
?- 0 or 1, greedy
?+- 0 or 1, possessive
??- 0 or 1, lazy
*- 0 or more, greedy
*+- 0 or more, possessive
*?- 0 or more, lazy
+- 1 or more, greedy
++- 1 or more, possessive
+?- 1 or more, lazy
{n}- exactly n
{n,m}- at least n, no more than m, greedy
{n,m}+- at least n, no more than m, possessive
{n,m}?- at least n, no more than m, lazy
{n,}- n or more, greedy
{n,}+- n or more, possessive
{n,}?- n or more, lazy
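The difference between greedy and lazy quantifiers is easiest to see on a small input. The sketch below assumes Python's `re` module as a stand-in for PCRE; the possessive forms are left out because older Python versions do not accept them.

```python
import re

text = "<a><b>"

# Greedy: .+ takes as much as possible, then backtracks to the last ">".
greedy = re.match(r"<.+>", text).group()

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.match(r"<.+?>", text).group()

# {n,m} bounds the repetition count.
two_to_three = re.fullmatch(r"a{2,3}", "aaa") is not None
too_many = re.fullmatch(r"a{2,3}", "aaaa") is not None
```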
Anchors and Simple Assertions
\b- word boundary
\B- not a word boundary
^- start of subject;
also after internal newline in multiline mode
\A- start of subject
$- end of subject;
also before newline at end of subject;
also before internal newline in multiline mode
\Z- end of subject;
also before newline at end of subject
\z- end of subject
\G- first matching position in subject
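A minimal sketch of the anchors follows, assuming Python's `re` module rather than PCRE. It covers only `^`, `$`, `\A` and `\b`; `\Z`, `\z` and `\G` are not exercised, since Python treats them differently or does not provide them.

```python
import re

text = "one two\nthree four\n"

# In multiline mode, ^ also matches after each internal newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# \A matches only at the start of the subject, even in multiline mode.
subject_start = re.findall(r"\A\w+", text, re.MULTILINE)

# $ also matches just before a newline at the end of the subject.
ends_with_four = re.search(r"four$", text) is not None

# \b requires a word boundary, so only words starting with "t" match.
t_words = re.findall(r"\bt\w*", text)
```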
Match Point Reset
\K- reset start of match
Alternation
expr|expr|expr...
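Alternation can be illustrated with a short sketch, assuming Python's `re` module as a stand-in for PCRE. The alternatives and sample words are invented for the example.

```python
import re

# Any one of the alternatives may match.
vcs = re.compile(r"mtn|git|hg")
names = [w for w in ["mtn", "svn", "hg"] if vcs.fullmatch(w)]

# Alternatives are tried from left to right, so "a" wins over "ab" here.
first_wins = re.match(r"a|ab", "ab").group()
```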
Capturing
(...)- capturing group
(?<name>...)- named capturing group (like Perl)
(?'name'...)- named capturing group (like Perl)
(?P<name>...)- named capturing group (like Python)
(?:...)- non-capturing group
(?|...)- non-capturing group; reset group numbers for
capturing groups in each alternative
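Named and non-capturing groups can be sketched as below. The example assumes Python's `re` module, so it uses the Python-style `(?P<name>...)` form from the list above; the `(?|...)` form is not exercised. The key/value input is made up.

```python
import re

# Named capturing groups (Python syntax).
m = re.match(r"(?P<key>\w+)=(?P<value>\w+)", "branch=main")
key, value = m.group("key"), m.group("value")

# (?:...) groups without capturing, so only one group is numbered here.
n = re.match(r"(?:v|rev)(\d+)", "rev42")
number = n.group(1)
group_count = n.re.groups
```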
Atomic Groups
(?>...)- atomic, non-capturing group
Comment
(?#....)- comment (not nestable)
Option Setting
(?i)- caseless
(?J)- allow duplicate names
(?m)- multiline
(?s)- single line (dotall)
(?U)- default ungreedy (lazy)
(?x)- extended (ignore white space)
(?-...)- unset option(s)
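The inline options can be tried with a small sketch. It assumes Python's `re` module as a stand-in for PCRE and covers only `(?i)`, `(?s)` and `(?x)`, placed at the start of the pattern; `(?J)` and `(?U)` are not exercised.

```python
import re

# (?i): caseless matching.
caseless = re.fullmatch(r"(?i)monotone", "MonoTone") is not None

# (?s): dot also matches a newline.
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None
without_dotall = re.fullmatch(r"a.b", "a\nb") is not None

# (?x): unescaped white space is ignored and # starts a comment.
extended = re.fullmatch(r"""(?x)
    \d{4}   # year
    -
    \d{2}   # month
""", "2024-05") is not None
```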
Lookahead and Lookbehind Assertions
(?=...)- positive look ahead
(?!...)- negative look ahead
(?<=...)- positive look behind
(?<!...)- negative look behind
Each top-level branch of a look behind must be of a fixed length.
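Lookahead and lookbehind assertions check the surrounding text without consuming it. A minimal sketch, assuming Python's `re` module rather than PCRE and using invented sample strings:

```python
import re

# Positive lookahead: names that are followed by ".txt".
stems = re.findall(r"\w+(?=\.txt)", "a.txt b.c notes.txt")

# Positive lookbehind (fixed length): digits preceded by "$".
prices = re.findall(r"(?<=\$)\d+", "cost $15 or 20")

# Negative lookahead: "foo" not followed by "bar".
plain_foo = re.search(r"foo(?!bar)", "foobar foobaz").start()
```

In the last case the first "foo" is rejected because "bar" follows it, so the match starts at the second "foo".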
Backreferences
\n- reference by number (can be ambiguous)
\gn- reference by number
\g{n}- reference by number
\g{-n}- relative reference by number
\k<name>- reference by name (like Perl)
\k'name'- reference by name (like Perl)
\g{name}- reference by name (like Perl)
\k{name}- reference by name (like .NET)
(?P=name)- reference by name (like Python)
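Backreferences match the same text that an earlier group captured. The sketch below assumes Python's `re` module, so it uses only the `\n` and `(?P=name)` forms; the `\g` and `\k` forms are not exercised.

```python
import re

# \1 repeats whatever group 1 matched: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group()

# (?P=q) repeats the named group q: the closing quote must equal
# the opening quote.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()
```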
Subroutine References (possibly recursive)
(?R)- recurse whole pattern
(?n)- call subpattern by absolute number
(?+n)- call subpattern by relative number
(?-n)- call subpattern by relative number
(?&name)- call subpattern by name (like Perl)
(?P>name)- call subpattern by name (like Python)
Conditional Patterns
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
(?(n)...- absolute reference condition
(?(+n)...- relative reference condition
(?(-n)...- relative reference condition
(?(<name>)...- named reference condition (like Perl)
(?('name')...- named reference condition (like Perl)
(?(name)...- named reference condition (PCRE only)
(?(R)...- overall recursion condition
(?(Rn)...- specific group recursion condition
(?(R&name)...- specific recursion condition
(?(DEFINE)...- define subpattern for reference
(?(assert)...- assertion condition
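A conditional pattern chooses a branch depending on whether a group has matched. A minimal sketch, assuming Python's `re` module as a stand-in for PCRE and covering only the absolute reference condition `(?(n)...)`:

```python
import re

# If group 1 (an opening "<") matched, a closing ">" is required;
# otherwise nothing more is required.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

with_brackets = bracketed.fullmatch("<a>") is not None
without_brackets = bracketed.fullmatch("a") is not None
unbalanced_open = bracketed.fullmatch("<a") is not None
unbalanced_close = bracketed.fullmatch("a>") is not None
```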
Backtracking Control
The following act immediately when they are reached:
(*ACCEPT)- force successful match
(*FAIL)- force backtrack; synonym ‘(*F)’
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
(*COMMIT)- overall failure, no advance of starting point
(*PRUNE)- advance to next starting character
(*SKIP)- advance start to current matching position
(*THEN)- local failure, backtrack to next alternation
Newline Conventions
These are recognized only at the very start of the pattern or after a
‘(*BSR_...)’ option.
(*CR)
(*LF)
(*CRLF)
(*ANYCRLF)
(*ANY)
What ‘\R’ Matches
These are recognized only at the very start of the pattern or after a
‘(*...)’ option that sets the newline convention.
(*BSR_ANYCRLF)
(*BSR_UNICODE)