Thursday, January 24, 2008

using regular expressions in Oracle

I needed to select rows of data from a column where some of the records begin with a date in the form:
YYYYMMDD

The other records in the column had random text strings. For example:
N MESSAGE
- -------
1 20080124: Hello. I have written to you today to compliment you
2 I have a problem with one of the products you sell.

I thought it would be perfect to use regular expressions to pluck out only the records I needed. Luckily, Oracle now supports for regular expressions as of version 10G:
Writing Better SQL Using Regular Expressions

I needed to first create the regular expression to match the date. In simple terms, I want to match the eight-digit string.

First, here is a regular expression that matches any digit, 0-9:
[0-9]

Secondly, we can expand this match to include the entire eight digit date:
[0-9]\{8\}

The escaped braces repeat the initial match for any number eight ("8") times. I will then put a carat (^) in front of the regular expression to match the beginning of the pattern buffer. The pattern buffer in this case is the MESSAGE column in our table:
^[0-9]\{8\}

OK! So we have our regular expression. Let's put it in a SELECT statement. We will use Oracle 10Gs new regexp_like condition to perform a regular expression match:
SELECT n,message
FROM message_table
WHERE regexp_like (message, '^[0-9]\{8\}');

N MESSAGE
- -------
1 20080124: Hello. I have written to you today to compliment you

Sweet! This works.

There is a slight problem. I did notice a substantial performance penalty for using regexp_like. For a very small table (<20,000 rows), it took about a minute to return the results of the search. Eeek. A normal WHERE clause came back in less than a second.

If I have any further information regarding performance tuning using regular expressions in Oracle, I will let you know.

Here are some more complex examples

TAG

4 comments:

Seech said...
This comment has been removed by the author.
Seech said...

FWIW, with SQL Server this would work:

SELECT n,message
FROM message_table
WHERE IsDate(Left(message, 8))

If Oracle has a similar function, it might give better performance than using regexp_like (it would also be more accurate, if there's a possiblity that the records with "random text strings" could have a 'message' starting with 8 "random" digits.)

That said, the Oracle regexp_like function is darn nifty. In SQL Server, you'd have to write a UDF to get that functionality.

Cacasodo said...

Good call. I'll check this out Monday when I'm back in the office.

I assume SQL Svr's UDF is a user-defined field?

Seech said...

User Defined Function.

You could write one that serves as wrapper around VBScript.RegExp. Or in SS 2005 (or later) using the .NET CLR stuff. But I'd expect it to be s-l-o-w, compared to a native SQL function.

Feel free to drop me a line or ask me a question.