imitatio creatio co we łbie piszczy

15 Jul 2009

PL/Perl regex vs builtin regexp_replace in PostgreSQL

Maybe you, like me, have wondered which is faster: the builtin regexp_replace function or Perl's regex engine via PL/Perl.

Here is a test I ran (PostgreSQL 8.3.7, Perl 5.8.8, on CentOS 5.3).

First we have a plperl function:

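CREATE OR REPLACE FUNCTION test_perl( TEXT ) RETURNS TEXT AS
$code$
my $str = shift;
$str =~ s{[[:space:]]+}{}g;      # strip all whitespace
$str =~ s{^\+48}{};              # drop a leading +48 prefix
$str =~ s{[^[:digit:]]+}{#}g;    # turn each run of non-digits into '#'
$str =~ s{#$}{};                 # drop a trailing '#'
# three groups of three digits separated by '#' become nine plain digits
$str =~ s{^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$}{$1$2$3};
# Return NULL only for the empty string, like nullif(..., '') in the SQL version.
# A plain truth test ("return $str if $str") would also turn the string "0" into NULL.
return length($str) ? $str : undef;
$code$
LANGUAGE plperl IMMUTABLE STRICT;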

Then we have its regexp_replace equivalent (sorry, it's not pretty, but it does exactly the same thing: five replace operations):


CREATE OR REPLACE FUNCTION test_rr( TEXT ) RETURNS TEXT AS
$code$
SELECT nullif(
    regexp_replace(
        regexp_replace(
            regexp_replace(
                regexp_replace(
                    regexp_replace( $1, '[[:space:]]+', '', 'g' ),  -- strip all whitespace
                    '^\+48', '' ),                                   -- drop a leading +48 prefix
                '[^[:digit:]]+', '#', 'g' ),                         -- non-digit runs become '#'
            '#$', '' ),                                              -- drop a trailing '#'
        '^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$', '\1\2\3' ),
    '' )
$code$
LANGUAGE SQL IMMUTABLE STRICT;
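
Before timing anything, it may be worth checking that the two functions really agree. The query below is only a sketch: the sample inputs are made up (the contents of testtable are not shown in this post), and it assumes both functions above are installed. One caveat: on PostgreSQL 8.3 standard_conforming_strings is off by default, so backslashes inside ordinary '...' literals are treated as escapes; the sketch assumes test_rr was created under settings that keep '\+' and '\1' intact.

```sql
-- Hypothetical sample inputs, chosen only for illustration.
-- If the two functions are equivalent, both result columns match on every row
-- and "same_result" is true throughout.
SELECT raw_number,
       test_perl(raw_number) AS perl_result,
       test_rr(raw_number)   AS rr_result,
       test_perl(raw_number) IS NOT DISTINCT FROM test_rr(raw_number) AS same_result
FROM (VALUES ('+48 123 456 789'),
             ('123-456-789'),
             ('abc'),
             ('   ')) AS samples(raw_number);
```

IS NOT DISTINCT FROM is used instead of = so that two NULL results also count as equal.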

Test run:

=> select count(*) from testtable where test_rr(telefon) is not null;
Time: 7051.000 ms
Time: 7629.000 ms
Time: 7534.000 ms
=> select count(*) from testtable where test_perl(telefon) is not null;
Time: 4087.000 ms
Time: 3931.000 ms
Time: 4056.000 ms

It appears that the Perl regex engine is almost twice as fast as the builtin one: on average about 4.0 s versus 7.4 s, a factor of roughly 1.8.

I will follow up with more tests, because I use regular expressions very intensively here... It will probably be even faster to move processing out of the database :-(

Using the database as a do-everything place is tempting, especially in Postgres, but maybe it's not the right tool for the job.


Update: yes, we moved processing out of the database. Flat files processed with regular expressions... Works like a charm :)

5 Sep 2008

Postgres partitioning performance – rules vs triggers

Rafal Pietrak asked a question about postgres performance in partitioning scenarios.

The problem is that in the classical partitioning approach you decide which partition the data goes into based only on the inserted data itself.

But we also consider the situation where you want to make this decision based on the current database content.

For example, we have some "driving" or "routing" table which tells us which partition is currently active.
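
To make the table-driven idea concrete, here is a minimal sketch of a routing table consulted by a trigger. It is not taken from the test scripts: the table, column and function names (measurements, routing, active_partition, measurements_route) are all hypothetical, and only two partitions are shown.

```sql
-- Hypothetical schema for illustration; not the author's test scripts.
CREATE TABLE measurements (id integer NOT NULL, payload text);
CREATE TABLE measurements_p1 () INHERITS (measurements);
CREATE TABLE measurements_p2 () INHERITS (measurements);

-- Routing table: a single row naming the partition that is currently active.
CREATE TABLE routing (active_partition text NOT NULL);
INSERT INTO routing VALUES ('measurements_p1');

CREATE OR REPLACE FUNCTION measurements_route() RETURNS trigger AS
$code$
DECLARE
    target text;
BEGIN
    -- The decision depends on database content, not on the inserted row.
    SELECT active_partition INTO target FROM routing;

    IF target = 'measurements_p1' THEN
        INSERT INTO measurements_p1 VALUES (NEW.*);
    ELSIF target = 'measurements_p2' THEN
        INSERT INTO measurements_p2 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'no usable active partition: %', target;
    END IF;

    RETURN NULL;  -- the row is already stored in a partition; skip the master table
END;
$code$
LANGUAGE plpgsql;

CREATE TRIGGER measurements_route_trg
    BEFORE INSERT ON measurements
    FOR EACH ROW EXECUTE PROCEDURE measurements_route();
```

With this in place, switching the active partition is a single statement such as UPDATE routing SET active_partition = 'measurements_p2'; and subsequent inserts into measurements land in the new partition.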

Please read the above post for more background.

I prepared four test cases covering all combinations of rule versus trigger and static versus dynamic (table-driven) partitioning.

The tests were performed on PostgreSQL 8.3.3 on Linux, on a commodity desktop box.

To keep things short, here are the results of two test runs (links point to test scripts).
What is measured: INSERT of 10000 rows.

Partitioning with RULEs, no dynamic routing:
2444.293 ms 2516.314 ms

Partitioning with RULEs, with dynamic routing:
42380.037 ms 39248.666 ms

Partitioning with TRIGGER, no dynamic routing:
14512.787 ms 14669.310 ms (but see the update below)

Partitioning with TRIGGER, with dynamic routing:
13486.808 ms 13904.370 ms

Conclusion:

If you have to do some database lookup to decide which partition the data belongs to, use a trigger on the master table.
If you have a well-defined static set of rules, use the PostgreSQL rule system.
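
For comparison, a static set of rules might look like the sketch below. The schema is hypothetical (events, events_low, events_high and the 5000 boundary are invented for illustration) and is not the one used in the test scripts.

```sql
-- Hypothetical schema for illustration; not the author's test scripts.
CREATE TABLE events (id integer NOT NULL, payload text);
CREATE TABLE events_low  (CHECK (id <  5000)) INHERITS (events);
CREATE TABLE events_high (CHECK (id >= 5000)) INHERITS (events);

-- Static routing: the target partition depends only on the inserted row.
CREATE RULE events_insert_low AS
    ON INSERT TO events
    WHERE NEW.id < 5000
    DO INSTEAD
    INSERT INTO events_low VALUES (NEW.*);

CREATE RULE events_insert_high AS
    ON INSERT TO events
    WHERE NEW.id >= 5000
    DO INSTEAD
    INSERT INTO events_high VALUES (NEW.*);

-- Example insert through the master table.
INSERT INTO events VALUES (42, 'routed by rule');
```

Because a rule rewrites the whole statement, its conditions are handled once per INSERT statement rather than once per row, which is a plausible reason for the good bulk-insert numbers in the static case.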

Update (2010/06/15)

depesz wrote an article in which he pointed out some problems with the above test case.

First, he points out that the trigger in the "Partitioning with TRIGGER, no dynamic routing" example is horribly written. And he is right. In this particular case, the partition assignment logic is so simple that it can be expressed in 5 lines instead of 1005 lines of code. This makes the trigger-based solution an order of magnitude faster.

Second, he says that bulk inserts are not so common - so measuring them is not-so-relevant. This heavily depends on the flavour of database you are working on. For OLTP systems, depesz's statement is obviously true, but for analytic (OLAP and BI) systems, bulk inserts are more common.

Third, in general I agree that rules should be avoided. Triggers are more readable and understood by most database-literate programmers. Rules ... well the first rule of using PostgreSQL rules is "DO NOT USE RULES".
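
To illustrate the first point: when the partition can be computed from the row, the trigger body shrinks to a few lines. The sketch below is not depesz's code and not the original test script. It assumes a hypothetical items table split into four partitions by id modulo 4, and it uses format() and EXECUTE ... USING, which need PostgreSQL 9.1 or later (newer than the 8.3 used in the tests).

```sql
-- Hypothetical schema for illustration; requires PostgreSQL 9.1+.
CREATE TABLE items (id integer NOT NULL, payload text);
CREATE TABLE items_p0 () INHERITS (items);
CREATE TABLE items_p1 () INHERITS (items);
CREATE TABLE items_p2 () INHERITS (items);
CREATE TABLE items_p3 () INHERITS (items);

CREATE OR REPLACE FUNCTION items_route() RETURNS trigger AS
$code$
BEGIN
    -- Compute the partition name from the row itself (assumes non-negative ids)
    -- and insert the whole row into it with one dynamic statement.
    EXECUTE format('INSERT INTO %I SELECT ($1).*', 'items_p' || (NEW.id % 4))
        USING NEW;
    RETURN NULL;  -- the row is already stored in a partition; skip the master table
END;
$code$
LANGUAGE plpgsql;

CREATE TRIGGER items_route_trg
    BEFORE INSERT ON items
    FOR EACH ROW EXECUTE PROCEDURE items_route();
```

The point of the design is that the trigger no longer needs one branch per partition, so its length stays constant however many partitions exist.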

(end of update)