imitatio creatio co we łbie piszczy

15Jul/09Off

PL/Perl regex vs builtin regex_replace in postgres

Maybe you (like me) wondered which one is faster... to use builtin regexp_replace function or to use Perl's regex engine via PL/Perl.

Here you are - this is a test I made (PostgreSQL 8.3.7, Perl 5.8.8, on CentOS 5,3)

First we have a plperl function:

CREATE OR REPLACE FUNCTION test_perl( TEXT ) RETURNS TEXT AS
$code$
my $str = shift;
$str =~ s{[[:space:]]+}{}g;
$str =~ s{^\+48}{};
$str =~ s{[^[:digit:]]+}{#}g;
$str =~ s{#$}{};
$str =~ s{^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$}{$1$2$3};
return $str if $str;
return undef;
$code$
LANGUAGE plperl IMMUTABLE STRICT;

Then we have it's regexp_replace equivalent (sorry - it's not pretty, but it does exactly the same thing, five replace operations ):


CREATE OR REPLACE FUNCTION test_rr( TEXT ) RETURNS TEXT AS
$code$
SELECT nullif( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(
$1, '[[:space:]]+', '', 'g' ), '^\+48', '' ),
'[^[:digit:]]+', '#', 'g' ), '#$', '' ),
'^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$', '\1\2\3' ), '' )
$code$
LANGUAGE SQL IMMUTABLE STRICT;

Test run:

=> select count(*) from testtable where test_rr(telefon) is not null;
Time: 7051.000 ms
Time: 7629.000 ms
Time: 7534.000 ms
=> select count(*) from testtable where test_perl(telefon) is not null;
Time: 4087.000 ms
Time: 3931.000 ms
Time: 4056.000 ms

It appears that Perl regex engine is almost twice faster than the builtin one.

I will follow with more tests, because I have a very intensive usage of regular expressions here... Probably it will be even faster to move proessing out of the database :-(

The use of database as do-everything place is tempting, especially in postgres - but maybe it's not the right tool for the job.


Update: yes, we moved processing out of the database. Flat files processed with regular expressions... Works like a charm :)