imitatio creatio co we łbie piszczy

9Jun/101

Import zbiorów TERYT do bazy danych SQL

Główny Urząd Statystyczny administruje bazą danych podziału administracyjnego Polski, z dokładnością do ulic i nazw miejscowych.

Baza ta nazywa się TERYT i jest udostępniana publicznie w formie plików XML, w serwisie stat.gov.pl.

W tym artykule pokażę jak zaimportować te dane do bazy relacyjnej (na przykładzie PostgreSQL oraz Microsoft SQL Server).

Pobieranie

Najpierw musimy pobrać aktualną wersję plików TERYT. Poniżej polecenia wget, oczywiście można też uzyć przeglądarki i zapisać pliki pod odpowiednimi nazwami.

wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=147" -O teryt_wmrodz.zip
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=203" -O teryt_terc.zip
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=205" -O teryt_simc.zip
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=222" -O teryt_ulic.zip

Transformacja do CSV

Następnie należy przetłumaczyć te zbiory do formatu tabelarycznego.
Możemy użyć do tego celu skryptu w języku sed z tej strony.

Uruchomienie (bash):


for zb in wmrodz terc simc ulic; do
    unzip -p teryt_$zb.zip | sed -n -f teryt.sed > teryt_$zb.csv
done

Powinno to stworzyć 4 pliki CSV.

Utworzenie tabel

Chcemy zaimportować dane do relacyjnej bazy danych, więc na początek musimy ją stworzyć :-)

Poniżej jest DDL tworzący odpowiednie tabele. Starałem się używać ANSI SQL żeby kod działał w różnych bazach. Starałem się też, aby wszystkie tabele miały odpowiednie klucze (główne i obce).

create table teryt_wmrodz(
        rm varchar(2) not null primary key,
        nazwa_rm varchar(30) not null unique,
        stan_na varchar(10) not null
);
create table teryt_terc(
        wojewodztwo varchar(2) not null,
        powiat varchar(2),
        gmina varchar(2),
        rodz varchar(2),
        nazwa varchar(100) not null,
        nazdod varchar(100) not null,
        stan_na varchar(10) not null,
        constraint teryt_terc_key unique ( wojewodztwo, powiat, gmina, rodz ),
        constraint teryt_terc_key2 unique ( wojewodztwo, powiat, gmina, nazdod )
);
create table teryt_simc(
        wojewodztwo varchar(2) not null,
        powiat varchar(2) not null,
        gmina varchar(2) not null,
        rodz_gmi varchar(2) not null,
        RM varchar(2) not null,
        MZ varchar(2) not null,
        nazwa varchar(100) not null,
        sym varchar(10) not null primary key,
        sympod varchar(10) not null references teryt_simc,
        stan_na varchar(10) not null
);
create table teryt_ulic(
        wojewodztwo varchar(2) not null,
        powiat varchar(2) not null,
        gmina varchar(2) not null,
        rodz_gmi varchar(2) not null,
        symbol varchar(10) not null references teryt_simc (sym),
        sym_ul varchar(10) not null,
        cecha varchar(10) not null,
        NAZWA_1 varchar(100) not null,
        NAZWA_2 varchar(100),
        stan_na varchar(10)  not null,
        constraint teryt_ulic_pkey primary key (symbol,sym_ul),
        constraint teryt_ulic_fkey_teryt_terc foreign key (wojewodztwo,powiat,gmina,rodz_gmi) references teryt_terc (wojewodztwo, powiat, gmina, rodz)
);

Ładowanie danych do bazy

Można użyć dowolnej bazy relacyjnej i narzędzia do ładowania; ja testowałem to na dwóch:

PostgreSQL

Import przez zwykłe COPY, poniżej polecenia:

truncate teryt_terc, teryt_wmrodz, teryt_ulic, teryt_simc;
SET client_encoding TO 'UTF-8';
\copy teryt_wmrodz from 'teryt_wmrodz.csv' with csv delimiter '|'
\copy teryt_terc from 'teryt_terc.csv' with csv delimiter '|'
\copy teryt_simc from 'teryt_simc.csv' with csv delimiter '|'
\copy teryt_ulic from 'teryt_ulic.csv' with csv delimiter '|'

MSSQL

Import za pomocą DTSWizard, dostępny spod SSMS.

Nie będę tu wrzucał screenów z SSMS, import jest prosty, jednak po nim trzeba upewnić się, że kodowanie znaków jest poprawne.

Celebracja

Voilà! Możemy się "cieszyć" danymi TERYT w formie relacyjnej. Chcesz wiedzieć jakie są dzielnice Krakowa? Ile jest w Polsce ulic imienia Antoniego Malczewskiego? Ile miejscowości o nazwie na literę K jest w województwie zachodniopomorskim? Proszę bardzo! Toż to dziecinnie proste, wystarczy znać SQL.

23Apr/100

pgbouncer mini HOWTO + benchmark

pgbouncer is a lightweight connection pooler for PostgreSQL.

I've decided to write this mini howto in order to prove that pgbouncer:
- is easy to install and configure
- is really useful, even in minimal setups (same machine as postgres, 10 clients).

As a prerequisite, we will need:
- postgres up and running (well, that's what pgbouncer was made for).
- standard set of development tools needed to compile C programs (gcc+make+binutils).
I assume you already have these.

My Linux distro is Ubuntu 9.10, kernel/software versions:

filip@srv:~$ uname -r
2.6.27.7-smp
filip@srv:~$ pg_config --version
PostgreSQL 8.4.3
filip@srv:~$ gcc --version
gcc (Ubuntu 4.4.1-4ubuntu9) 4.4.1

Installing pgbouncer

First we need libevent (event notification library). Luckilly, in recent Ubuntu version it's packaged, so let's use it:

filip@srv:~$ apt-cache search libevent
libevent-1.4-2 - An asynchronous event notification library
libevent-core-1.4-2 - An asynchronous event notification library (core)
libevent-dev - Development libraries, header files and docs for libevent
(...)

We need libevent-dev package, to have header and include files needed for pgbouncer compilation.

filip@srv:~$ sudo apt-get install libevent-dev
(...) done.

OK so now we have libevent installed. Next we go for pgbouncer itself.

It's not packaged for Ubuntu at the moment, so we have to compile it from sources. I chose to install to /usr/local - simply because I had no better idea. YMMV.

Download, unpack and configure:

filip@srv:~/src$ wget http://pgfoundry.org/frs/download.php/2608/pgbouncer-1.3.2.tgz
(...) `pgbouncer-1.3.2.tgz' saved [166756/166756]
filip@srv:~/src$ tar xzf pgbouncer-1.3.2.tgz
filip@srv:~/src$ cd pgbouncer-1.3.2/
filip@srv:~/src/pgbouncer-1.3.2$ ./configure --prefix=/usr/local
(...)
configure: creating ./config.status
config.status: creating config.mak
config.status: creating include/config.h

OK it's configured, lets compile and install:

filip@srv:~/src/pgbouncer-1.3.2$ make
(...)
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/home/filip/src/pgbouncer-1.3.2/doc'
filip@srv:~/src/pgbouncer-1.3.2$ sudo make install
(...)

Pgbouncer is now installed.

All in one command, for your convenience:

wget http://pgfoundry.org/frs/download.php/2608/pgbouncer-1.3.2.tgz \
&& tar xzf pgbouncer-1.3.2.tgz \
&& cd pgbouncer-1.3.2 \
&& ./configure --prefix=/usr/local \
&& make && sudo make install

Configuring pgbouncer

Now let's create a config file. We do this by copying distributed ini file sample to /etc:

filip@srv:~$ sudo cp /usr/local/share/doc/pgbouncer/pgbouncer.ini /etc/pgbouncer.ini

Next we edit this file and configure needed options.

Most important is pool_mode. I choose transaction pooling mode because it gives most performance boost for normal postgresql usage.

As for connection settings, we have pgbouncer on port 6432, routing connections to port 5432 (my postgres is on same host and standard port).

File paths are adjusted for typical Ubuntu setup. pgbouncer will be run from postgres account, so we just point auth_file to pg_auth.

Here is my pgbouncer.ini file:

[databases]
* = port=5432
[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/log/postgresql/pgbouncer.pid
listen_addr = *
listen_port = 6432
unix_socket_dir = /var/run/postgresql
auth_type = trust
auth_file = /var/lib/postgresql/8.4/main/global/pg_auth
admin_users = postgres
stats_users = postgres
pool_mode = transaction
server_reset_query = DISCARD ALL;
server_check_query = select 1
server_check_delay = 10
max_client_conn = 1000
default_pool_size = 20
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1

Starting pgbouncer

Now let's start the beast:

filip@srv:~$ sudo su - postgres
postgres@srv:~$ pgbouncer -d /etc/pgbouncer.ini
2010-04-23 18:37:05.969 20068 LOG File descriptor limit: 10000 (H:15000), max_client_conn: 1000, max fds possible: 1010

OK, now let's check if this really works. Connect to admin console (virtual db "pgbouncer"):

filip@srv:~$ psql -Upostgres -p6432 pgbouncer
psql (8.4.3, server 8.0/bouncer)
WARNING: psql version 8.4, server version 8.0.
         Some psql features might not work.
Type "help" for help.

postgres@pgbouncer=#

It works!.

Benchmarking

Now let's do some testing. We use good'ol' pgbench.

Create and initialize test database with scale 10:

filip@srv:~$ createdb bench
filip@srv:~$ /usr/lib/postgresql/8.4/bin/pgbench -i -s 10 bench
(...)
vacuum...done.

Test with 10 clients, separate connection for each transaction, direct connection to postgres, run test for one minute:

(mandela)filip@ratel:~$ /usr/lib/postgresql/8.4/bin/pgbench -c 10 -C -T 60 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 10
duration: 60 s
number of transactions actually processed: 1528
tps = 25.410170 (including connections establishing)
tps = 53.186072 (excluding connections establishing)

Now the same test, but connecting via pgbouncer:

(mandela)filip@ratel:~$ /usr/lib/postgresql/8.4/bin/pgbench -c 10 -C -T 60 -p 6432 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 10
duration: 60 s
number of transactions actually processed: 2601
tps = 43.308068 (including connections establishing)
tps = 55.619391 (excluding connections establishing)

Looks promising.

Now both tests repeated several times, just bare results for brevity:


(direct)
tps = 25.583194 (including connections establishing)
tps = 55.247968 (excluding connections establishing)
(pgbouncer)
tps = 51.769025 (including connections establishing)
tps = 73.188059 (excluding connections establishing)
(direct)
tps = 25.857126 (including connections establishing)
tps = 64.090508 (excluding connections establishing)
(pgbouncer)
tps = 61.633963 (including connections establishing)
tps = 87.375610 (excluding connections establishing)
(direct)
tps = 21.134134 (including connections establishing)
tps = 50.005559 (excluding connections establishing)
(pgbouncer)
tps = 50.122482 (including connections establishing)
tps = 74.693641 (excluding connections establishing)
(direct)
tps = 18.925272 (including connections establishing)
tps = 49.249117 (excluding connections establishing)
(pgbouncer)
tps = 63.616117 (including connections establishing)
tps = 94.977040 (excluding connections establishing)
(direct)
tps = 22.444140 (including connections establishing)
tps = 43.382705 (excluding connections establishing)
(pgbouncer)
tps = 68.886017 (including connections establishing)
tps = 102.644402 (excluding connections establishing)
(direct)
tps = 19.979776 (including connections establishing)
tps = 52.215144 (excluding connections establishing)
(pgbouncer)
tps = 57.047613 (including connections establishing)
tps = 85.300031 (excluding connections establishing)

Lets make it more readable, and calculate performance gain (pgbench vs direct):

direct vs pgbouncer pgbench results - non persistent connections
Test # direct incl connections direct excl connections pgbouncer incl connections pgbouncer excl connections
1 25,41 53,19 43,31 55,62
2 25,58 55,25 51,77 73,19
3 25,86 64,09 61,63 87,38
4 21,13 50,01 50,12 74,69
5 18,93 49,25 63,62 94,98
6 22,44 43,38 68,89 102,64
7 19,98 52,22 57,05 85,3
AVG 22,76 52,48 56,63 81,97
PGBOUNCER GAIN PERCENT 148,78% 56,19%

We can see that - including time consumed by connections handling - pgbouncer gives about 150% speedup compared to raw postgres.

56,19% is also a very interesting result - this difference probably comes from postgres session initialization, but maybe pgbouncer handles connections/disconnections faster than postgresql itself.

These results are very good but of course they are heavily influenced by pgbench "-C" switch (separate connection for each transaction). Let's see what are the results for pgbench without "-C":

direct vs pgbouncer pgbench results - persistent connections
Test # direct incl conn direct excl conn pgbouncer incl conn pgbouncer excl conn
1 77,64 77,78 55,67 55,7
2 73,16 73,38 79,59 79,6
3 80,91 81,03 67,45 67,46
4 61,8 61,94 78,97 79,02
5 79,45 79,57 80,66 80,7
AVG 74,59 74,74 72,47 72,5
PGBOUNCER GAIN PERCENT -2,85% -3,00%

As you see, for persistent connections there is no gain - even a small overhead.

BTW, both test were executed like this:

filip@srv:~$ for n in 1 2 3 4 5 ; do echo "$n (direct)"; /usr/lib/postgresql/8.4/bin/pgbench -c 10 -T 60 bench   2>&1 | grep tps; echo "$n (pgbouncer)"; /usr/lib/postgresql/8.4/bin/pgbench -c 10 -T 60 -p 6432 bench 2>&1 | grep tps; done

Note: this was on a very weak machine - desktop class PC from circa 2006.

Conclusion and disclamer

Using pgbench is definitely a good idea if you have clients connecting many times repeatedly.

These result would probably be very similar using any other decent connection pooler.

I am not an expert on TPC-B and I also do not take responsibility for any damage made to your system and/or database by the code written above.

I do not guarantee that you achieve same results - maybe the whole test was crippled and useless.

As always, please let me know if you see any errors / omissions in the article.

8Mar/100

dynamic SQL parameters in PL/PgSQL functions

It pays to read documentation.

From PostgreSQL 8.3 to 8.4 there was one interesting addon for plpgsql: ability to pass parameter values directly into EXECUTE.

Of course depesz wrote about it. And I did read it; But still (by routine) I was using something like:

sql_query := 'SELECT foo FROM bar JOIN baz USING (barbaz) WHERE baz.id = ANY (__PARAM__::integer[])';
sql_query := replace(sql_query, '__PARAM__', quote_nullable(my_param);
EXECUTE sql_query INTO my_foos;

While it can be written in more elegant and error-prone way:

EXECUTE 'SELECT foo FROM bar JOIN baz USING (barbaz) WHERE baz.id = ANY ($1)'
INTO my_foos
USING my_param;

5Mar/10Off

postgres schemagrep function

From time to time, I need to "grep" database schema to locate some database objects.

One possibility is to dump the whole schema to file (pg_dump -s DBNAME) and use your text editor to browse it. It is quite handy and I actually use it a lot.

But sometimes, it is more convenient to have this "grep-like" possibility directly from psql.

So... here it is, the "schemagrep" function.
Please note that argument for this function is a regex pattern, so you can look for almost everything.
The search is case-insensitive.


begin;

create or replace function schemagrep_relkind( "char" ) returns text as $$
select case $1
    when 'r' then 'TABLE'
    when 'i' then 'INDEX'
    when 'S' then 'SEQUENCE'
    when 'v' then 'VIEW'
    when 'c' then 'COMPOSITE TYPE'
    when 't' then 'TOAST TABLE'
end
$$ language sql immutable strict;

drop function if exists schemagrep( text );
create function schemagrep( text ) returns table(
    schema name,
    object_name name,
    match_type text,
    psql_hint text
) as $$
select x.*, case when x.match_type ~ 'FUNCTION' then E'\\df+ ' when x.match_type ~ 'COMMENT' then E'\\d+ ' else E'\\d ' end || nspname || '.' || relname
from (
    select n.nspname, c.relname, schemagrep_relkind(c.relkind) || ' NAME' as match_type
    from pg_class c, pg_namespace n where c.relname ~* $1 and c.relnamespace = n.oid
  union all
    select distinct n.nspname, c.relname, schemagrep_relkind(c.relkind) || ' ATTRIBUTE'
    from pg_class c, pg_attribute a, pg_namespace n where a.attrelid = c.oid and c.relkind <> 'v' and a.attname ~* $1 and c.relnamespace = n.oid
  union all
    select n.nspname, proname, 'FUNCTION DEFINITION'
    from pg_proc, pg_namespace n where prosrc ~* $1 and pronamespace = n.oid
  union all
    select schemaname, viewname, 'VIEW DEFINITION'
    from pg_views where definition ~* $1
  union all
    select n.nspname, relname, 'CLASS CONSTRAINT'
    from pg_class c, pg_constraint cx, pg_namespace n where cx.conrelid = c.oid and cx.consrc ~* $1 and relnamespace = n.oid
  union all
    select distinct n.nspname, c.relname, 'COMMENT'
    from pg_class c, pg_description d, pg_namespace n where d.objoid = c.oid and d.description ~* $1 and c.relnamespace = n.oid
) x
$$ language sql stable strict;

grant execute on function schemagrep(text) to public;

commit;

Test it by running

SELECT * FROM schemagrep( 'anything' );

For me, it works like this:

testdb=# select * from schemagrep('account');
 schema |      object_name      |     match_type      |            psql_hint
--------+-----------------------+---------------------+----------------------------------
 public | pgbench_accounts      | TABLE NAME          | \d public.pgbench_accounts
 public | pgbench_accounts_pkey | INDEX NAME          | \d public.pgbench_accounts_pkey
 public | sp_acc_createaccount  | FUNCTION DEFINITION | \df+ public.sp_acc_createaccount
(3 rows)

That's it! Of course the function can be extended or modified to your needs, because it's in pure SQL :-)

8Feb/10Off

iterate over a table in PostgreSQL vs SAS

I have struggled how to achieve something like this in SAS.

(you can read it as pseudocode but it's real PL/PgSQL):

DO $this$
DECLARE my_table name;
BEGIN;
    -- iterate over table names
    FOR my_table IN SELECT nazwa FROM tabele
    LOOP
        sql := 'CREATE TABLE XXX_backup AS SELECT * FROM XXX';
        sql := replace(sql, 'XXX', my_table);
        execute sql;
    END LOOP;
END;
$this$ language plpgsql;

After consulting a collegue [thx Ludwik] and exploring this helpful doc:
Performing Multiple Statements for Each Record in a SAS® Data Set, tadam:

%MACRO COPYTABLES(SCANFILE,SCANFIELD);
*get number of tables;
	DATA _NULL_;
		IF 0 THEN SET &SCANFILE NOBS=X;
		CALL SYMPUT('RECCOUNT',X);
		STOP;
	RUN;
*iterate;
	%DO I=1 %TO &RECCOUNT;
*get table name;
	DATA _NULL_;
		SET &SCANFILE (FIRSTOBS=&I);
		CALL SYMPUT('TABLENAME',&SCANFIELD);
		STOP;
	RUN;
*copy;
	DATA &TABLENAME._backup;
		SET &TABLENAME;
	RUN;
	%END;
%MEND COPYSETS;

%COPYSETS(tabele,nazwa);

When you switch from procedural coding to macro coding, you have twist your brain upside down...
I had to read a white paper (SAS docs lacking examples) and scratch my head for a moment before writing this simple procedural loop in SAS 4GL.

Now tell me that SAS isn't weird ;-)