imitatio creatio co we łbie piszczy

15Jul/09Off

PL/Perl regex vs builtin regex_replace in postgres

Maybe you (like me) wondered which one is faster... to use builtin regexp_replace function or to use Perl's regex engine via PL/Perl.

Here you are - this is a test I made (PostgreSQL 8.3.7, Perl 5.8.8, on CentOS 5,3)

First we have a plperl function:

CREATE OR REPLACE FUNCTION test_perl( TEXT ) RETURNS TEXT AS
$code$
my $str = shift;
$str =~ s{[[:space:]]+}{}g;
$str =~ s{^\+48}{};
$str =~ s{[^[:digit:]]+}{#}g;
$str =~ s{#$}{};
$str =~ s{^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$}{$1$2$3};
return $str if $str;
return undef;
$code$
LANGUAGE plperl IMMUTABLE STRICT;

Then we have it's regexp_replace equivalent (sorry - it's not pretty, but it does exactly the same thing, five replace operations ):


CREATE OR REPLACE FUNCTION test_rr( TEXT ) RETURNS TEXT AS
$code$
SELECT nullif( regexp_replace( regexp_replace( regexp_replace( regexp_replace( regexp_replace(
$1, '[[:space:]]+', '', 'g' ), '^\+48', '' ),
'[^[:digit:]]+', '#', 'g' ), '#$', '' ),
'^([[:digit:]]{3})#([[:digit:]]{3})#([[:digit:]]{3})$', '\1\2\3' ), '' )
$code$
LANGUAGE SQL IMMUTABLE STRICT;

Test run:

=> select count(*) from testtable where test_rr(telefon) is not null;
Time: 7051.000 ms
Time: 7629.000 ms
Time: 7534.000 ms
=> select count(*) from testtable where test_perl(telefon) is not null;
Time: 4087.000 ms
Time: 3931.000 ms
Time: 4056.000 ms

It appears that Perl regex engine is almost twice faster than the builtin one.

I will follow with more tests, because I have a very intensive usage of regular expressions here... Probably it will be even faster to move proessing out of the database :-(

The use of database as do-everything place is tempting, especially in postgres - but maybe it's not the right tool for the job.


Update: yes, we moved processing out of the database. Flat files processed with regular expressions... Works like a charm :)

11Jun/08Off

variable length positive lookbehind in perl regex

perlre says it does not support lookbehind matches with arbitrary length.

here is a workaround


#!perl -l -w
use strict;
#
print my $txt = 'Alice has a fish. It is a nice fish.
Bob has a dog. John has a cat.
Line 3 is not important.
Filip has a perl.
---------------------------';
#
=pod WE WANT THIS but this gives Perl error.
while ( $txt =~ m{(?<=$behind_re)($match_re)(?=$ahead_re)}g ) {
print $&;
}
=cut
#
my $behind_re = qr/[A-Z][a-z]+ has a /; # Filip has a
my $match_re = qr/\w+/i; # perl
my $ahead_re = qr/\./; # .
#
while ( $txt =~ m{($behind_re)}g ) { # we look for behind match globally,
if ( $txt =~ m{\G($match_re)(?=$ahead_re)} ) { # we search for the rest anchored at each found position
print $&; # voila
}
}
#

21Mar/08Off

explain analyze totals

Same as depesz and Greg, but using user-defined function:


CREATE FUNCTION ea_totals(text) RETURNS text AS $body$
my $result;
my $com = shift;
$com =~ /^\s*explain\s+/i or die qq{Not an explain query\n};
my $rv = spi_exec_query($com);
$rv->{status} eq 'SPI_OK_UTILITY' or die qq{Not a proper explain?\n};
for (map { $_->{'QUERY PLAN'} } @{$rv->{rows}}) {
my $string = ' ' x 10;
if ( /actual time=\d+\.\d*\.\.(\d+\.\d*) rows=\d+ loops=(\d+)/ ) {
$string = sprintf("%10.1f", $1 * $2);
}
$result .= "$string $_ \n";
}
return $result;
$body$ LANGUAGE plperl;

Usage:

SELECT * FROM ea_totals( $$ explain analyze query here $$);

18Feb/08Off

windows/unix unicode hell

Task: convert legal UTF8 into Windows native unicode (lilttle-endian, UTF-16).

Solution: UTF8ToUnicode.pl

This way you will identical file as if you opened input with Notepad and saved as "Unicode".

Tagged as: , , 2 Comments