Monday, August 14, 2017

POSIX Extended Regular Expressions for Oracle

POSIX (Portable Operating System Interface for uniX) is a set of standards that defines functionality a UNIX operating system should support. For regular expressions, the standard describes three flavors: BRE (Basic Regular Expressions), ERE (Extended Regular Expressions), and SRE (Simple Regular Expressions), which is deprecated in favor of BRE. Most modern regular expression flavors are extensions of ERE, and Oracle's regular expression support is based on ERE as well.

Oracle does not follow the POSIX ERE standard strictly. The POSIX standard states that it is illegal to use a backslash to escape a character that is not a metacharacter. Oracle allows this and simply ignores the backslash. For example, the character b is not a metacharacter, so when it is prefixed with a backslash, \b is treated the same as the literal b. This means that every POSIX ERE regular expression can be used with Oracle, but not every regular expression that works in Oracle is accepted by a fully POSIX ERE compliant system.
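The portability point can be seen with any engine that is strict about escapes. The sketch below uses Python's re module purely as a stand-in for such an engine. It is not a POSIX ERE implementation and not Oracle's engine, and the helper name compiles is hypothetical. It rejects a backslash placed in front of an ordinary letter, which, as described above, Oracle would accept and read as the plain letter.

```python
import re

def compiles(pattern):
    """Return True if this engine accepts the pattern, False if it rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Escaping a real metacharacter (the period) is accepted.
print(compiles(r"a\.b"))

# A backslash before an ordinary letter: tolerated by Oracle according to the
# text above, but rejected by Python's re as an unknown escape.
print(compiles(r"\q"))
```

A pattern written against Oracle's lenient behavior may therefore need its stray backslashes removed before it is reused elsewhere.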
Regular Expression Metacharacters in Oracle
Metacharacters are characters that carry a special meaning inside a pattern instead of matching themselves literally. They describe how the source string is to be searched and processed by the regular expression. The topics below define the different operators in Oracle.
POSIX Metacharacters in Oracle
The metacharacters listed below are supported in regular expressions passed to the SQL regular expression condition and functions. These metacharacters conform to the POSIX standard.
POSIX Metacharacters List
Metacharacter
Description
*
Matches zero or more occurrences of the preceding subexpression.
?
Matches zero or one occurrence of the preceding subexpression.
+
Matches one or more occurrences of the preceding subexpression.
|
Matches any one of the alternatives. This is similar to the OR operator.
.
Matches any single character in the database character set except NULL and, by default, the newline character. When the 'n' match parameter is specified, it also matches the newline character.
\
Any metacharacter preceded by the backslash symbol is treated as a literal character to search for. A double backslash (\\) treats the backslash (\) itself as a literal character.
\n
This is the backreference expression, where n is an integer between 1 and 9. It matches the nth subexpression enclosed in parentheses that precedes \n.
^
Matches the beginning of the source string by default. In multiline mode, it matches the beginning of any line in the source string.
$
Matches the end of the source string by default. In multiline mode, it matches the end of any line in the source string.
(…)
This is the grouping expression, which treats the expression within the parentheses as a single unit. The group can be a character literal or an expression with operators.
[…]
This is the matching list expression. It matches any single character from the source string that appears in the list.
[^…]
This is the non-matching list expression. It matches any single character from the source string that does not appear in the list.
[.element.]
This is the collating element operator in the POSIX standard. It is used inside a bracket expression, as in [[.ch.]], and lets a multi-character collating element be treated as a single character. For example, the string ch consists of two characters in English, whereas if Traditional Spanish is defined in the locale, it is considered a single character.
[[:class:]]
Matches any character belonging to the specified character class. For example, the class [[:alpha:]] matches any alphabetic character in the source string. The table below lists the classes from the POSIX standard.

Class
Description
[[:alnum:]]
Matches all alphanumeric characters.
[[:alpha:]]
Matches all alphabetic characters.
[[:blank:]]
Matches the blank characters, that is, space and tab.
[[:cntrl:]]
Matches all non-printing control characters.
[[:digit:]]
Matches all numeric digits.
[[:xdigit:]]
Matches all hexadecimal digits (0-9, A-F, and a-f).
[[:punct:]]
Matches all punctuation characters.
[[:upper:]]
Matches all uppercase letters.
[[:lower:]]
Matches all lowercase letters.
[[:graph:]]
Matches all [[:punct:]], [[:upper:]], [[:lower:]], and [[:digit:]] characters.
[[:print:]]
Matches all printable characters.
[[:space:]]
Matches all whitespace characters: space, tab, carriage return, newline, form feed, and vertical tab.
[a-z]
Matches all lowercase letters. This is similar to [[:lower:]]. To match a subset of lowercase letters, specify the start and end of a range. For example, [a-m] matches any lowercase letter between a and m in the source string.
[A-Z]
Matches all uppercase letters. This is similar to [[:upper:]]. To match a subset of uppercase letters, specify the start and end of a range. For example, [A-D] matches any uppercase letter between A and D in the source string.
[0-9]
Matches all numeric digits. This is similar to [[:digit:]]. To match a subset of digits, specify the start and end of a range. For example, [0-5] matches the digits between 0 and 5 in the source string.
[A-Za-z0-9]
Matches all the alphanumeric characters. This is similar to [[:alnum:]]. The combination can be changed as per the requirement like [A-Z0-9], [a-mA-N], [0-7a-oA-H], etc.
[=class=]
This is the character equivalence class operator. It is used inside a bracket expression and matches all the characters that belong to the same equivalence class in the current locale. For example, in a Spanish locale the expression [[=n=]] matches both N and ñ in the source string El Niño.
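{m}
Matches exactly m occurrences of the preceding subexpression.
{m,}
Matches at least m occurrences of the preceding subexpression.
{m,n}
Matches at least m, but not more than n, occurrences of the preceding subexpression.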
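Several of the operators in the table can be tried out quickly with a short script. The following is a minimal sketch that uses Python's re module as a stand-in. It is not Oracle's engine, and it does not support the POSIX character classes, collating elements, or equivalence classes, so only constructs that behave the same way in both are shown. The helper names whole and found are hypothetical.

```python
import re

def whole(pattern, text):
    """True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def found(pattern, text, flags=0):
    """Every non-overlapping match of pattern in text, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

# Quantifiers: *, ?, + and the interval form {m,n}
print(whole(r"ab*c", "ac"))        # True: * accepts zero occurrences of b
print(whole(r"ab?c", "abbc"))      # False: ? accepts at most one b
print(whole(r"ab+c", "ac"))        # False: + needs at least one b
print(whole(r"a{2,3}", "aaaa"))    # False: four a's exceed the upper bound

# Alternation and escaping a metacharacter
print(found(r"cat|dog", "cat, dog, bird"))   # ['cat', 'dog']
print(whole(r"1\+1", "1+1"))                 # True: \+ is a literal plus sign

# Matching and non-matching lists
print(found(r"[a-m]", "Oracle"))    # ['a', 'c', 'l', 'e']
print(found(r"[^a-m]", "Oracle"))   # ['O', 'r']

# Grouping with a backreference: \1 repeats whatever group 1 matched
print(found(r"(ab)\1", "abab ab"))  # ['abab']

# Anchors in default and multiline mode. Python uses re.MULTILINE;
# Oracle's regular expression functions take an 'm' match parameter.
text = "one\ntwo"
print(found(r"^[a-z]+", text))                 # ['one']
print(found(r"^[a-z]+", text, re.MULTILINE))   # ['one', 'two']
print(found(r"[a-z]+$", text))                 # ['two']
print(found(r"[a-z]+$", text, re.MULTILINE))   # ['one', 'two']
```

In Oracle itself, the same patterns would be passed as string arguments to the regular expression condition and functions mentioned earlier.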


Thank you,
Boobal Ganesan
