Importing TERYT datasets into an SQL database
Główny Urząd Statystyczny (GUS, the Central Statistical Office of Poland) maintains a database of Poland's administrative division, detailed down to streets and locality names.
The database is called TERYT and is published as XML files on stat.gov.pl.
In this article I will show how to import this data into a relational database (using PostgreSQL and Microsoft SQL Server as examples).
Downloading
First we need to download the current version of the TERYT files. The wget commands are below; of course you can also use a browser and save the files under the appropriate names.
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=147" -O teryt_wmrodz.zip
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=203" -O teryt_terc.zip
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=205" -O teryt_simc.zip
wget "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id=222" -O teryt_ulic.zip
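If wget is not at hand, the same downloads can be sketched in Python. This is only an illustration: the file ids and target names are copied from the wget commands above, the helper names (`teryt_downloads`, `download_all`) are made up here, and the URLs may well no longer be served by stat.gov.pl.

```python
import urllib.request

# URL pattern and file ids copied from the wget commands above.
BASE_URL = "http://www.stat.gov.pl/broker/access/prefile/downloadPreFile.jspa?id={id}"
TERYT_FILES = {
    "wmrodz": 147,
    "terc": 203,
    "simc": 205,
    "ulic": 222,
}


def teryt_downloads():
    """Return (url, filename) pairs matching the wget commands."""
    return [
        (BASE_URL.format(id=file_id), f"teryt_{name}.zip")
        for name, file_id in TERYT_FILES.items()
    ]


def download_all():
    """Fetch every archive; needs network access and still-valid URLs."""
    for url, filename in teryt_downloads():
        urllib.request.urlretrieve(url, filename)


# Only list what would be fetched; call download_all() to actually download.
for url, filename in teryt_downloads():
    print(filename, "<-", url)
```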
Transformation to CSV
Next, these datasets have to be converted to a tabular format.
We can use the sed script from this page for this purpose.
Running it (bash):
for zb in wmrodz terc simc ulic; do
unzip -p teryt_$zb.zip | sed -n -f teryt.sed > teryt_$zb.csv
done
This should create 4 CSV files.
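Before loading, it may be worth checking that every row has the expected number of fields. The sketch below assumes the '|' delimiter used by the COPY commands in this article and takes the field counts from the column lists of the CREATE TABLE statements; the helper name `bad_rows` and the sample rows are made up for illustration.

```python
import csv
import io

# Expected number of fields per file, taken from the column lists of the
# CREATE TABLE statements in this article.
EXPECTED_FIELDS = {"wmrodz": 3, "terc": 7, "simc": 10, "ulic": 10}


def bad_rows(lines, expected):
    """Return (line_number, field_count) for rows with the wrong width."""
    reader = csv.reader(lines, delimiter="|")
    return [
        (number, len(row))
        for number, row in enumerate(reader, start=1)
        if len(row) != expected
    ]


# Hypothetical sample standing in for teryt_wmrodz.csv; real rows will differ.
sample = io.StringIO("01|wieś|2010-01-01\n02|kolonia\n")
print(bad_rows(sample, EXPECTED_FIELDS["wmrodz"]))  # [(2, 2)]
```

For a real file, pass `open("teryt_wmrodz.csv", encoding="utf-8", newline="")` instead of the in-memory sample.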
Creating the tables
We want to import the data into a relational database, so first we have to create it :-)
Below is the DDL that creates the tables. I tried to stick to ANSI SQL so that the code works in different databases. I also tried to give every table the appropriate keys (primary and foreign).
create table teryt_wmrodz(
rm varchar(2) not null primary key,
nazwa_rm varchar(30) not null unique,
stan_na varchar(10) not null
);
create table teryt_terc(
wojewodztwo varchar(2) not null,
powiat varchar(2),
gmina varchar(2),
rodz varchar(2),
nazwa varchar(100) not null,
nazdod varchar(100) not null,
stan_na varchar(10) not null,
constraint teryt_terc_key unique ( wojewodztwo, powiat, gmina, rodz ),
constraint teryt_terc_key2 unique ( wojewodztwo, powiat, gmina, nazdod )
);
create table teryt_simc(
wojewodztwo varchar(2) not null,
powiat varchar(2) not null,
gmina varchar(2) not null,
rodz_gmi varchar(2) not null,
RM varchar(2) not null,
MZ varchar(2) not null,
nazwa varchar(100) not null,
sym varchar(10) not null primary key,
sympod varchar(10) not null references teryt_simc,
stan_na varchar(10) not null
);
create table teryt_ulic(
wojewodztwo varchar(2) not null,
powiat varchar(2) not null,
gmina varchar(2) not null,
rodz_gmi varchar(2) not null,
symbol varchar(10) not null references teryt_simc (sym),
sym_ul varchar(10) not null,
cecha varchar(10) not null,
NAZWA_1 varchar(100) not null,
NAZWA_2 varchar(100),
stan_na varchar(10) not null,
constraint teryt_ulic_pkey primary key (symbol,sym_ul),
constraint teryt_ulic_fkey_teryt_terc foreign key (wojewodztwo,powiat,gmina,rodz_gmi) references teryt_terc (wojewodztwo, powiat, gmina, rodz)
);
Loading the data into the database
Any relational database and loading tool will do; I tested this on two:
PostgreSQL
Import with plain COPY, using the commands below:
truncate teryt_terc, teryt_wmrodz, teryt_ulic, teryt_simc;
SET client_encoding TO 'UTF-8';
\copy teryt_wmrodz from 'teryt_wmrodz.csv' with csv delimiter '|'
\copy teryt_terc from 'teryt_terc.csv' with csv delimiter '|'
\copy teryt_simc from 'teryt_simc.csv' with csv delimiter '|'
\copy teryt_ulic from 'teryt_ulic.csv' with csv delimiter '|'
MSSQL
Import with DTSWizard, available from SSMS.
I won't post SSMS screenshots here; the import is simple, but afterwards you have to make sure the character encoding is correct.
Celebration
Voilà! We can now "enjoy" the TERYT data in relational form. Want to know what the districts of Kraków are? How many streets in Poland are named after Antoni Malczewski? How many localities with names starting with K there are in the Zachodniopomorskie voivodeship? Here you go! It's child's play, you only need to know SQL.
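As an illustration of the last question, here is a small sketch that runs such a query against an in-memory SQLite database from Python. The table is a cut-down stand-in for teryt_simc, the rows and sym values are made up, and 32 is assumed to be the TERC code of the Zachodniopomorskie voivodeship; against the real data the same SELECT would be run in PostgreSQL or SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-in for teryt_simc: only the columns used by the query.
conn.execute(
    "create table teryt_simc (wojewodztwo text, nazwa text, sym text primary key)"
)
# Made-up rows for illustration; the sym values are placeholders, not real ids.
conn.executemany(
    "insert into teryt_simc values (?, ?, ?)",
    [
        ("32", "Kołobrzeg", "0000001"),
        ("32", "Koszalin", "0000002"),
        ("32", "Szczecin", "0000003"),
        ("12", "Kraków", "0000004"),
    ],
)


def count_k_localities(conn, woj):
    """Count localities whose name starts with K in the given voivodeship."""
    row = conn.execute(
        "select count(*) from teryt_simc where wojewodztwo = ? and nazwa like 'K%'",
        (woj,),
    ).fetchone()
    return row[0]


print(count_k_localities(conn, "32"))  # 2
```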
pgbouncer mini HOWTO + benchmark
pgbouncer is a lightweight connection pooler for PostgreSQL.
I've decided to write this mini howto in order to prove that pgbouncer:
- is easy to install and configure
- is really useful, even in minimal setups (same machine as postgres, 10 clients).
As a prerequisite, we will need:
- postgres up and running (well, that's what pgbouncer was made for).
- standard set of development tools needed to compile C programs (gcc+make+binutils).
I assume you already have these.
My Linux distro is Ubuntu 9.10, kernel/software versions:
filip@srv:~$ uname -r
2.6.27.7-smp
filip@srv:~$ pg_config --version
PostgreSQL 8.4.3
filip@srv:~$ gcc --version
gcc (Ubuntu 4.4.1-4ubuntu9) 4.4.1
Installing pgbouncer
First we need libevent (an event notification library). Luckily, recent Ubuntu versions package it, so let's use that:
filip@srv:~$ apt-cache search libevent
libevent-1.4-2 - An asynchronous event notification library
libevent-core-1.4-2 - An asynchronous event notification library (core)
libevent-dev - Development libraries, header files and docs for libevent
(...)
We need the libevent-dev package to get the header files required to compile pgbouncer.
filip@srv:~$ sudo apt-get install libevent-dev
(...)
done.
OK so now we have libevent installed. Next we go for pgbouncer itself.
It's not packaged for Ubuntu at the moment, so we have to compile it from source. I chose to install to /usr/local, simply because I had no better idea. YMMV.
Download, unpack and configure:
filip@srv:~/src$ wget http://pgfoundry.org/frs/download.php/2608/pgbouncer-1.3.2.tgz
(...)
`pgbouncer-1.3.2.tgz' saved [166756/166756]
filip@srv:~/src$ tar xzf pgbouncer-1.3.2.tgz
filip@srv:~/src$ cd pgbouncer-1.3.2/
filip@srv:~/src/pgbouncer-1.3.2$ ./configure --prefix=/usr/local
(...)
configure: creating ./config.status
config.status: creating config.mak
config.status: creating include/config.h
OK, it's configured, let's compile and install:
filip@srv:~/src/pgbouncer-1.3.2$ make
(...)
make[1]: Nothing to be done for `all'.
make[1]: Leaving directory `/home/filip/src/pgbouncer-1.3.2/doc'
filip@srv:~/src/pgbouncer-1.3.2$ sudo make install
(...)
Pgbouncer is now installed.
All in one command, for your convenience:
wget http://pgfoundry.org/frs/download.php/2608/pgbouncer-1.3.2.tgz \
  && tar xzf pgbouncer-1.3.2.tgz \
  && cd pgbouncer-1.3.2 \
  && ./configure --prefix=/usr/local \
  && make && sudo make install
Configuring pgbouncer
Now let's create a config file. We do this by copying the sample ini file shipped with the distribution to /etc:
filip@srv:~$ sudo cp /usr/local/share/doc/pgbouncer/pgbouncer.ini /etc/pgbouncer.ini
Next we edit this file and set the options we need.
The most important one is pool_mode. I chose transaction pooling mode because it gives the biggest performance boost for typical PostgreSQL usage.
As for connection settings, pgbouncer listens on port 6432 and routes connections to port 5432 (my postgres runs on the same host, on the standard port).
File paths are adjusted for a typical Ubuntu setup. pgbouncer will run under the postgres account, so we just point auth_file to pg_auth.
Here is my pgbouncer.ini file:
[databases]
* = port=5432

[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/log/postgresql/pgbouncer.pid
listen_addr = *
listen_port = 6432
unix_socket_dir = /var/run/postgresql
auth_type = trust
auth_file = /var/lib/postgresql/8.4/main/global/pg_auth
admin_users = postgres
stats_users = postgres
pool_mode = transaction
server_reset_query = DISCARD ALL;
server_check_query = select 1
server_check_delay = 10
max_client_conn = 1000
default_pool_size = 20
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
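A quick sanity check of such a file can be sketched with Python's configparser. This is only an approximation: pgbouncer has its own ini parser, so a file that Python accepts is not guaranteed to be valid for pgbouncer (or the other way round). The text below is an abbreviated copy of the settings above and `load_settings` is a made-up helper name.

```python
import configparser

# Abbreviated copy of the settings shown above.
INI_TEXT = """
[databases]
* = port=5432

[pgbouncer]
listen_port = 6432
auth_type = trust
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
"""

# Pool modes documented by pgbouncer.
POOL_MODES = {"session", "transaction", "statement"}


def load_settings(text):
    """Parse ini text and return the [pgbouncer] section as a dict of strings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["pgbouncer"])


settings = load_settings(INI_TEXT)
print(settings["pool_mode"] in POOL_MODES, settings["listen_port"])
```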
Starting pgbouncer
Now let's start the beast:
filip@srv:~$ sudo su - postgres
postgres@srv:~$ pgbouncer -d /etc/pgbouncer.ini
2010-04-23 18:37:05.969 20068 LOG File descriptor limit: 10000 (H:15000), max_client_conn: 1000, max fds possible: 1010
OK, now let's check if this really works. Connect to the admin console (virtual db "pgbouncer"):
filip@srv:~$ psql -Upostgres -p6432 pgbouncer
psql (8.4.3, server 8.0/bouncer)
WARNING: psql version 8.4, server version 8.0.
Some psql features might not work.
Type "help" for help.
postgres@pgbouncer=#
It works!
Benchmarking
Now let's do some testing. We'll use good ol' pgbench.
Create and initialize a test database with scale 10:
filip@srv:~$ createdb bench
filip@srv:~$ /usr/lib/postgresql/8.4/bin/pgbench -i -s 10 bench
(...)
vacuum...done.
Test with 10 clients, separate connection for each transaction, direct connection to postgres, run test for one minute:
(mandela)filip@ratel:~$ /usr/lib/postgresql/8.4/bin/pgbench -c 10 -C -T 60 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 10
duration: 60 s
number of transactions actually processed: 1528
tps = 25.410170 (including connections establishing)
tps = 53.186072 (excluding connections establishing)
Now the same test, but connecting via pgbouncer:
(mandela)filip@ratel:~$ /usr/lib/postgresql/8.4/bin/pgbench -c 10 -C -T 60 -p 6432 bench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 10
duration: 60 s
number of transactions actually processed: 2601
tps = 43.308068 (including connections establishing)
tps = 55.619391 (excluding connections establishing)
Looks promising.
Now both tests repeated several times, just bare results for brevity:
(direct)
tps = 25.583194 (including connections establishing)
tps = 55.247968 (excluding connections establishing)
(pgbouncer)
tps = 51.769025 (including connections establishing)
tps = 73.188059 (excluding connections establishing)
(direct)
tps = 25.857126 (including connections establishing)
tps = 64.090508 (excluding connections establishing)
(pgbouncer)
tps = 61.633963 (including connections establishing)
tps = 87.375610 (excluding connections establishing)
(direct)
tps = 21.134134 (including connections establishing)
tps = 50.005559 (excluding connections establishing)
(pgbouncer)
tps = 50.122482 (including connections establishing)
tps = 74.693641 (excluding connections establishing)
(direct)
tps = 18.925272 (including connections establishing)
tps = 49.249117 (excluding connections establishing)
(pgbouncer)
tps = 63.616117 (including connections establishing)
tps = 94.977040 (excluding connections establishing)
(direct)
tps = 22.444140 (including connections establishing)
tps = 43.382705 (excluding connections establishing)
(pgbouncer)
tps = 68.886017 (including connections establishing)
tps = 102.644402 (excluding connections establishing)
(direct)
tps = 19.979776 (including connections establishing)
tps = 52.215144 (excluding connections establishing)
(pgbouncer)
tps = 57.047613 (including connections establishing)
tps = 85.300031 (excluding connections establishing)
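Collecting these numbers by hand gets tedious, so here is a rough sketch of pulling the two tps figures out of pgbench output with a regular expression. It assumes the exact wording printed by the pgbench 8.4 runs above; the helper name `parse_tps` is made up.

```python
import re

# Matches the two summary lines printed by pgbench in the runs above.
TPS_LINE = re.compile(
    r"tps = ([0-9.]+) \((including|excluding) connections establishing\)"
)


def parse_tps(output):
    """Return {'including': tps, 'excluding': tps} for one pgbench run."""
    return {kind: float(value) for value, kind in TPS_LINE.findall(output)}


# Tail of the first direct run, copied from above.
sample = """\
number of transactions actually processed: 1528
tps = 25.410170 (including connections establishing)
tps = 53.186072 (excluding connections establishing)
"""
print(parse_tps(sample))
```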
Let's make it more readable and calculate the performance gain (pgbouncer vs direct):
| Test # | direct incl connections | direct excl connections | pgbouncer incl connections | pgbouncer excl connections |
|---|---|---|---|---|
| 1 | 25,41 | 53,19 | 43,31 | 55,62 |
| 2 | 25,58 | 55,25 | 51,77 | 73,19 |
| 3 | 25,86 | 64,09 | 61,63 | 87,38 |
| 4 | 21,13 | 50,01 | 50,12 | 74,69 |
| 5 | 18,93 | 49,25 | 63,62 | 94,98 |
| 6 | 22,44 | 43,38 | 68,89 | 102,64 |
| 7 | 19,98 | 52,22 | 57,05 | 85,3 |
| AVG | 22,76 | 52,48 | 56,63 | 81,97 |
| PGBOUNCER GAIN PERCENT | | | 148,78% | 56,19% |
We can see that, including the time spent on connection handling, pgbouncer gives about a 150% speedup compared to raw postgres.
The 56,19% gain excluding connection time is also a very interesting result. The difference probably comes from postgres session initialization, but maybe pgbouncer also handles connects and disconnects faster than postgresql itself.
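For the record, the average and gain figures can be recomputed from the raw tps values. The sketch below does it for the "including connections" columns, with the numbers copied from the seven runs above; `gain_percent` is just an illustrative helper name.

```python
from statistics import mean

# tps including connection time, copied from the seven runs above.
direct_incl = [25.410170, 25.583194, 25.857126, 21.134134,
               18.925272, 22.444140, 19.979776]
bouncer_incl = [43.308068, 51.769025, 61.633963, 50.122482,
                63.616117, 68.886017, 57.047613]


def gain_percent(baseline, candidate):
    """Relative change of mean(candidate) over mean(baseline), in percent."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0


print(f"direct avg:    {mean(direct_incl):.2f} tps")
print(f"pgbouncer avg: {mean(bouncer_incl):.2f} tps")
print(f"gain:          {gain_percent(direct_incl, bouncer_incl):.2f}%")
```

The gain is computed from the unrounded averages, which is why it differs slightly from what you get by dividing the rounded table values.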
These results are very good, but of course they are heavily influenced by the pgbench "-C" switch (separate connection for each transaction). Let's see the results for pgbench without "-C":
| Test # | direct incl conn | direct excl conn | pgbouncer incl conn | pgbouncer excl conn |
|---|---|---|---|---|
| 1 | 77,64 | 77,78 | 55,67 | 55,7 |
| 2 | 73,16 | 73,38 | 79,59 | 79,6 |
| 3 | 80,91 | 81,03 | 67,45 | 67,46 |
| 4 | 61,8 | 61,94 | 78,97 | 79,02 |
| 5 | 79,45 | 79,57 | 80,66 | 80,7 |
| AVG | 74,59 | 74,74 | 72,47 | 72,5 |
| PGBOUNCER GAIN PERCENT | | | -2,85% | -3,00% |
As you can see, for persistent connections there is no gain, only a small overhead.
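Why the two cases differ can be shown with a deliberately naive model that only counts how many postgres backends have to be started. It is a toy, not a description of how pgbouncer is implemented: it assumes a pooler keeps its server connections open and reuses them, and it ignores timeouts, authentication and everything else. The function name and the 100 transactions per client are made up.

```python
def backend_starts(clients, tx_per_client, reconnect_each_tx, pool_size=None):
    """Count postgres backend startups in a toy model.

    pool_size=None models clients connecting to postgres directly.
    With a pooler, at most min(clients, pool_size) server connections
    are opened and then reused for every later client connection.
    """
    client_connects = clients * tx_per_client if reconnect_each_tx else clients
    if pool_size is None:
        return client_connects
    return min(clients, pool_size)


for reconnect in (True, False):
    direct = backend_starts(10, 100, reconnect)
    pooled = backend_starts(10, 100, reconnect, pool_size=20)
    print(f"reconnect_each_tx={reconnect}: direct={direct}, pooled={pooled}")
```

With a reconnect per transaction (like "-C") the direct case starts 1000 backends against 10 behind the pooler; with persistent connections both start 10, so only the pooler's own overhead remains.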
BTW, both tests were executed with a loop like this one (the variant without -C is shown):
filip@srv:~$ for n in 1 2 3 4 5 ; do
  echo "$n (direct)"
  /usr/lib/postgresql/8.4/bin/pgbench -c 10 -T 60 bench 2>&1 | grep tps
  echo "$n (pgbouncer)"
  /usr/lib/postgresql/8.4/bin/pgbench -c 10 -T 60 -p 6432 bench 2>&1 | grep tps
done
Note: this was on a very weak machine - desktop class PC from circa 2006.
Conclusion and disclaimer
Using pgbouncer is definitely a good idea if you have clients that connect repeatedly, many times over.
These results would probably be very similar with any other decent connection pooler.
I am not an expert on TPC-B and I also do not take responsibility for any damage made to your system and/or database by the code written above.
I do not guarantee that you will achieve the same results - maybe the whole test was crippled and useless.
As always, please let me know if you see any errors / omissions in the article.
dynamic SQL parameters in PL/PgSQL functions
It pays to read documentation.
PostgreSQL 8.4 brought one interesting addition to plpgsql over 8.3: the ability to pass parameter values directly into EXECUTE.
Of course depesz wrote about it. And I did read it, but still (out of habit) I kept writing something like:
sql_query := 'SELECT foo FROM bar JOIN baz USING (barbaz) WHERE baz.id = ANY (__PARAM__::integer[])';
sql_query := replace(sql_query, '__PARAM__', quote_nullable(my_param));
EXECUTE sql_query INTO my_foos;
While it can be written in a more elegant and less error-prone way:
EXECUTE 'SELECT foo FROM bar JOIN baz USING (barbaz) WHERE baz.id = ANY ($1)'
INTO my_foos
USING my_param;
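The same principle (send the value separately instead of pasting it into the SQL text) applies in most client libraries too. As an analogy only, here is a sketch using Python's sqlite3 module; it is not PL/pgSQL, and the `baz` table contents and the `label` column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table baz (id integer, label text)")
conn.executemany("insert into baz values (?, ?)", [(1, "plain"), (2, "O'Reilly")])


def find_by_label(conn, label):
    """Look up ids by label, passing the value as a bound parameter."""
    rows = conn.execute("select id from baz where label = ?", (label,))
    return [row[0] for row in rows]


# The embedded quote needs no escaping because it never becomes SQL text.
print(find_by_label(conn, "O'Reilly"))  # [2]
```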
postgres schemagrep function
From time to time, I need to "grep" database schema to locate some database objects.
One possibility is to dump the whole schema to file (pg_dump -s DBNAME) and use your text editor to browse it. It is quite handy and I actually use it a lot.
But sometimes, it is more convenient to have this "grep-like" possibility directly from psql.
So... here it is, the "schemagrep" function.
Please note that the argument of this function is a regex pattern, so you can look for almost anything.
The search is case-insensitive.
begin;
create or replace function schemagrep_relkind( "char" ) returns text as $$
select case $1
when 'r' then 'TABLE'
when 'i' then 'INDEX'
when 'S' then 'SEQUENCE'
when 'v' then 'VIEW'
when 'c' then 'COMPOSITE TYPE'
when 't' then 'TOAST TABLE'
end
$$ language sql immutable strict;
drop function if exists schemagrep( text );
create function schemagrep( text ) returns table(
schema name,
object_name name,
match_type text,
psql_hint text
) as $$
select x.*, case when x.match_type ~ 'FUNCTION' then E'\\df+ ' when x.match_type ~ 'COMMENT' then E'\\d+ ' else E'\\d ' end || nspname || '.' || relname
from (
select n.nspname, c.relname, schemagrep_relkind(c.relkind) || ' NAME' as match_type
from pg_class c, pg_namespace n where c.relname ~* $1 and c.relnamespace = n.oid
union all
select distinct n.nspname, c.relname, schemagrep_relkind(c.relkind) || ' ATTRIBUTE'
from pg_class c, pg_attribute a, pg_namespace n where a.attrelid = c.oid and c.relkind <> 'v' and a.attname ~* $1 and c.relnamespace = n.oid
union all
select n.nspname, proname, 'FUNCTION DEFINITION'
from pg_proc, pg_namespace n where prosrc ~* $1 and pronamespace = n.oid
union all
select schemaname, viewname, 'VIEW DEFINITION'
from pg_views where definition ~* $1
union all
select n.nspname, relname, 'CLASS CONSTRAINT'
from pg_class c, pg_constraint cx, pg_namespace n where cx.conrelid = c.oid and cx.consrc ~* $1 and relnamespace = n.oid
union all
select distinct n.nspname, c.relname, 'COMMENT'
from pg_class c, pg_description d, pg_namespace n where d.objoid = c.oid and d.description ~* $1 and c.relnamespace = n.oid
) x
$$ language sql stable strict;
grant execute on function schemagrep(text) to public;
commit;
Test it by running
SELECT * FROM schemagrep( 'anything' );
For me, it works like this:
testdb=# select * from schemagrep('account');
schema | object_name | match_type | psql_hint
--------+-----------------------+---------------------+----------------------------------
public | pgbench_accounts | TABLE NAME | \d public.pgbench_accounts
public | pgbench_accounts_pkey | INDEX NAME | \d public.pgbench_accounts_pkey
public | sp_acc_createaccount | FUNCTION DEFINITION | \df+ public.sp_acc_createaccount
(3 rows)
That's it! Of course the function can be extended or modified to your needs, because it's in pure SQL :-)
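To get a feel for the matching semantics: the `~*` operator used throughout the function is a case-insensitive regex match that succeeds if the pattern occurs anywhere in the string. A rough Python equivalent is `re.search` with `re.IGNORECASE`, keeping in mind that PostgreSQL and Python regex dialects differ in details. The helper `name_grep` is made up; the names come from the sample output above plus pgbench_branches.

```python
import re


def name_grep(pattern, names):
    """Return the names matching pattern anywhere, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [name for name in names if regex.search(name)]


names = [
    "pgbench_accounts",
    "pgbench_accounts_pkey",
    "pgbench_branches",
    "sp_acc_createaccount",
]
print(name_grep("ACCOUNT", names))
print(name_grep("^pgbench_.*pkey$", names))
```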
iterate over a table in PostgreSQL vs SAS
I struggled to work out how to achieve something like this in SAS
(you can read it as pseudocode, but it's real PL/pgSQL):
DO $this$
DECLARE
  my_table name;
  sql text;
BEGIN
  -- iterate over table names
  FOR my_table IN SELECT nazwa FROM tabele
  LOOP
    -- build and run one CREATE TABLE ... AS statement per listed table
    sql := 'CREATE TABLE XXX_backup AS SELECT * FROM XXX';
    sql := replace(sql, 'XXX', my_table);
    EXECUTE sql;
  END LOOP;
END;
$this$ language plpgsql;
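For comparison, the same loop in a general-purpose language is just as short. Here is a sketch in Python against an in-memory SQLite database; the tables t1 and t2 and their contents are made up, only `tabele` and `nazwa` follow the snippet above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical data: two small tables plus the "tabele" list naming them.
conn.executescript("""
    create table t1 (x integer);
    create table t2 (x integer);
    insert into t1 values (1), (2);
    insert into t2 values (3);
    create table tabele (nazwa text);
    insert into tabele values ('t1'), ('t2');
""")


def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def backup_tables(conn):
    """Create <name>_backup as a copy of every table listed in tabele."""
    names = [row[0] for row in conn.execute("select nazwa from tabele")]
    for name in names:
        # Identifiers cannot be bound as parameters, so they are quoted by hand.
        src = quote_ident(name)
        dst = quote_ident(name + "_backup")
        conn.execute(f"create table {dst} as select * from {src}")
    return names


backup_tables(conn)
print(conn.execute("select count(*) from t1_backup").fetchone()[0])  # 2
```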
After consulting a colleague [thx Ludwik] and exploring this helpful doc:
Performing Multiple Statements for Each Record in a SAS® Data Set, tadam:
%MACRO COPYSETS(SCANFILE,SCANFIELD);
*get number of tables;
DATA _NULL_;
IF 0 THEN SET &SCANFILE NOBS=X;
CALL SYMPUT('RECCOUNT',X);
STOP;
RUN;
*iterate;
%DO I=1 %TO &RECCOUNT;
*get table name;
DATA _NULL_;
SET &SCANFILE (FIRSTOBS=&I);
CALL SYMPUT('TABLENAME',&SCANFIELD);
STOP;
RUN;
*copy;
DATA &TABLENAME._backup;
SET &TABLENAME;
RUN;
%END;
%MEND COPYSETS;
%COPYSETS(tabele,nazwa);
When you switch from procedural coding to macro coding, you have to twist your brain upside down...
I had to read a white paper (the SAS docs lack examples) and scratch my head for a while before writing this simple procedural loop in SAS 4GL.
Now tell me that SAS isn't weird ;-)