imitatio creatio co we łbie piszczy

15Jul/09Off

PL/Perl regex vs builtin regex_replace in postgres

Maybe you (like me) wondered which one is faster... to use builtin regexp_replace function or to use Perl's regex engine via PL/Perl.

Here you are - this is a test I made (PostgreSQL 8.3.7, Perl 5.8.8, on CentOS 5,3)

First we have a plperl function:

CREATE OR REPLACE FUNCTION test_perl( TEXT ) RETURNS TEXT AS
$code$
my $str = shift;
$str =~ s{[[:space:]]+}{}g;
$str =~ s{^\+48}{};
$str =~ s{[^[:digit:]]+}{#}g;
$str =~ s{#$}{};
$str =~ s{^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$}{$1$2$3};
return $str if $str;
return undef;
$code$
LANGUAGE plperl IMMUTABLE STRICT;

Then we have it's regexp_replace equivalent (sorry - it's not pretty, but it does exactly the same thing, five replace operations ):


CREATE OR REPLACE FUNCTION test_rr( TEXT ) RETURNS TEXT AS
$code$
SELECT nullif( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(
$1, '[[:space:]]+', '', 'g' ), '^\+48', '' ),
'[^[:digit:]]+', '#', 'g' ), '#$', '' ),
'^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$', '\1\2\3' ), '' )
$code$
LANGUAGE SQL IMMUTABLE STRICT;

Test run:

=> select count(*) from testtable where test_rr(telefon) is not null;
Time: 7051.000 ms
Time: 7629.000 ms
Time: 7534.000 ms
=> select count(*) from testtable where test_perl(telefon) is not null;
Time: 4087.000 ms
Time: 3931.000 ms
Time: 4056.000 ms

It appears that Perl regex engine is almost twice faster than the builtin one.

I will follow with more tests, because I have a very intensive usage of regular expressions here... Probably it will be even faster to move proessing out of the database :-(

The use of database as do-everything place is tempting, especially in postgres - but maybe it's not the right tool for the job.


Update: yes, we moved processing out of the database. Flat files processed with regular expressions... Works like a charm :)

5Sep/08Off

Postgres partitioning performance – rules vs triggers

Rafal Pietrak asked a question about postgres performance in partitioning scenarios.

The problem is, in classical partitioning approach you decide into which partition put the data basing only on the inserted data itself.

But we consider also situation when you want to make this decision basing on current database content.

For example we have some "driving" or "routing" table which tells us which partition is currently active.

Please read the above post for more background.

I prepared 4 test cases, for all combinations of rule versus trigger and static versus dynamic aka table-driven partitioning.

Test was performed on PostgreSQL 8.3.3 on Linux, commodity desktop box.

To make things short, here are the results of two test runs (links point to test scripts):
/what is measured: INSERT of 10000 rows/

Partitioning with RULEs, no dynamic routing:
2444.293 ms 2516.314 ms

Partitioning with RULEs, with dynamic routing:
42380.037 ms 39248.666 ms

Partitioning with TRIGGER, no dynamic routing:
14512.787 ms 14669.310 ms

Partitioning with TRIGGER, with dynamic routing:
13486.808 ms 13904.370 ms

Conclusion:

If you have to do some database lookup to decide which partition data belongs to, use a trigger on master table.
If you have a well defined static set of rules, use PostgreSQL rule system.