pgloader import / mysql to postgresql
When loading old (Django) projects in K8S, we've decided to give PostgreSQL a go as default database. Here are some notes that aid in importing.
- After looking around, pgloader seems to be the right tool for the job. Feature and stability wise it beats any other solution. And it's available on recent Debian/Ubuntu.
- We need access to the remote PostgreSQL db; because pgloader will not provide an SQL dump (for reasons). (For larger databases this makes sense, but for smaller ones doing an scp(1) of an SQL dump file would've been easier. *Shrug*)
- Index names preferably need to be made unique. See the steps below.
- The schema name needs to be adjusted. Explained below.
- It's worth looking through the default casting rules; especially if you're using non-standard datatypes. (For instance: unsigned smallints will get converted to regular integers, as postgres does not do unsigned types. If you're not using that 16th bit, you could save space.)
Dump before commencing
You'll want to dump the database first (our example source database is
called project_db
). For smaller ones, this could be:
# mysqldump --defaults-file=/etc/mysql/debian.cnf project_db \
--skip-extended-insert --routines \
>/root/dump-first.sql
# mysqldump --defaults-file=/etc/mysql/debian.cnf project_db \
--skip-extended-insert --routines --compatible=postgresql \
>/root/dump-first-compat.sql
Unique index names
You'll need to ensure that index names are unique in the source database. You could let pgloader make all names unique, but chances are your ORM (Django) already did this for you.
-- List duplicate index names
SELECT DISTINCT t.iname FROM (
SELECT s1.table_name, s1.index_name AS iname
FROM information_schema.statistics s1
INNER JOIN information_schema.statistics s2 ON
s2.index_name = s1.index_name AND s2.table_name <> s1.table_name
WHERE s1.index_name <> 'PRIMARY' AND s1.table_schema = DATABASE()) t
ORDER BY iname;
-- Create "fix" statements
SELECT CONCAT('ALTER TABLE ', tname, ' RENAME INDEX ', iname, ' TO ',
tname, '_00000000_', iname, ';') AS cmd
FROM (
SELECT DISTINCT table_name AS tname, index_name AS iname
FROM information_schema.statistics
WHERE table_schema = 'project_db' AND index_name <> 'PRIMARY' AND
NOT index_name REGEXP '_.*[0-9]') t
ORDER BY tname, iname;
-- Run the above, and then check again and fix as appropriate
Create a pgloader command-file and import
In this example project_db
is the source, acme_project_db
is the
destination and project
is the project. The schema alteration is
explained further on.
The import step is a matter of creating the command-file and letting
pgloader do its magic. Put the following in import.pgloader
:
LOAD DATABASE
FROM mysql://debian-sys-maint@localhost/project_db
INTO postgresql://acme_project@remotedb/acme_project_db
-- Check unique index names first
WITH preserve index names
-- pgloader after 3.4 can do this with a WITH-statement
BEFORE LOAD DO
$$ DROP SCHEMA IF EXISTS project CASCADE; $$
ALTER SCHEMA 'project_db' RENAME TO 'project'
AFTER LOAD DO
$$ ALTER DATABASE acme_project_db SET search_path TO project, public; $$
;
Execute it. It may emit some WARNINGs, but it should not produce any ERRORs. If it does, you'll need to go back and fix those.
# pgloader import.pgloader
Checking for success
Do a dump on the destination DB and do basic checks.
$ pg_dump -Ustolon -W -hlocalhost acme_project_db --inserts \
>dump-postgres.sql
Check if total dump sizes are somewhat equal, and do an INSERT count.
# grep --text ^INSERT /root/dump-first-compat.sql | wc -l
8966
$ grep --text ^INSERT dump-postgres.sql | wc -l
8966
Schema name change explanation
In MySQL, a schema is the same as a database. In PostgreSQL they are not equal. pgloader will import your data into a schema with the same name as your database by default.
That schema is not searched by default. So after import, you'll be
looking at 0 relations in your psql
shell, until you switch
schema-search.
acme_project_db=# \d
Did not find any relations.
acme_project_db=# SET search_path TO project, public;
SET
acme_project_db=# \d
... list of relations ...
You can do one of two things:
-
Have pgloader import into the
public
schema:LOAD DATABASE ... WITH create no schema ALTER SCHEMA 'project_db' RENAME TO 'public';
-
Alter the default schema search path, and set a sane schema name:
LOAD DATABASE ... ALTER SCHEMA 'project_db' RENAME TO 'project'; AFTER LOAD DO $$ ALTER DATABASE acme_project_db SET search_path TO project, public; $$
That second solution works, because the ALTER DATABASE
will make it
the default, and according to the PostgreSQL schema
documentation:
The first schema in the search path that exists is the default location for creating new objects. [...] Also, since myschema is the first element in the path, new objects would by default be created in it.
This means that a schema-agnostic Django project won't notice that we've put everything in a named schema. And the schemas may come in handy in the future.
pgloader WITH-statements explained
The WITH-statements are by default:
[...] and the default WITH clause is: no truncate, create tables, include drop, create indexes, reset sequences, foreign keys, downcase identifiers, uniquify index names. [...] Please note that CASCADE is used to ensure that tables are dropped even if there are foreign keys pointing to them.
We'll tweak them:
WITH
-- Django indexes should already be unique, no need to make them more unique
preserve index names,
-- Default doesn't drop schema (especially not public), but we'll want to
-- flush it clean ('include drop' only (cascade!) drops the remote targets
-- (does not exist in pgloader 3.4.x!)
drop schema