pgloader import / mysql to postgresql

Written by
Walter Doekes
Published on 2019-09-20

When loading old (Django) projects in K8S, we've decided to give PostgreSQL a go as default database. Here are some notes that aid in importing.

After looking around, pgloader seems to be the right tool for the job. Feature and stability wise it beats any other solution. And it's available on recent Debian/Ubuntu.
We need access to the remote PostgreSQL db; because pgloader will not provide an SQL dump (for reasons). (For larger databases this makes sense, but for smaller ones doing an scp(1) of an SQL dump file would've been easier. *Shrug*)
Index names preferably need to be made unique. See the steps below.
The schema name needs to be adjusted. Explained below.
It's worth looking through the default casting rules; especially if you're using non-standard datatypes. (For instance: unsigned smallints will get converted to regular integers, as postgres does not do unsigned types. If you're not using that 16th bit, you could save space.)

Dump before commencing

You'll want to dump the database first (our example source database is called project_db). For smaller ones, this could be:

# mysqldump --defaults-file=/etc/mysql/debian.cnf project_db \
  --skip-extended-insert --routines \
  >/root/dump-first.sql

# mysqldump --defaults-file=/etc/mysql/debian.cnf project_db \
  --skip-extended-insert --routines --compatible=postgresql \
  >/root/dump-first-compat.sql

Unique index names

You'll need to ensure that index names are unique in the source database. You could let pgloader make all names unique, but chances are your ORM (Django) already did this for you.

-- List duplicate index names
SELECT DISTINCT t.iname FROM (
  SELECT s1.table_name, s1.index_name AS iname
  FROM information_schema.statistics s1
  INNER JOIN information_schema.statistics s2 ON
    s2.index_name = s1.index_name AND s2.table_name <> s1.table_name
  WHERE s1.index_name <> 'PRIMARY' AND s1.table_schema = DATABASE()) t
ORDER BY iname;

-- Create "fix" statements
SELECT CONCAT('ALTER TABLE ', tname, ' RENAME INDEX ', iname, ' TO ',
    tname, '_00000000_', iname, ';') AS cmd
FROM (
  SELECT DISTINCT table_name AS tname, index_name AS iname
  FROM information_schema.statistics
  WHERE table_schema = 'project_db' AND index_name <> 'PRIMARY' AND
    NOT index_name REGEXP '_.*[0-9]') t
ORDER BY tname, iname;

-- Run the above, and then check again and fix as appropriate

Create a pgloader command-file and import

In this example project_db is the source, acme_project_db is the destination and project is the project. The schema alteration is explained further on.

The import step is a matter of creating the command-file and letting pgloader do its magic. Put the following in import.pgloader:

LOAD DATABASE
  FROM mysql://debian-sys-maint@localhost/project_db
  INTO postgresql://acme_project@remotedb/acme_project_db

  -- Check unique index names first
  WITH preserve index names

  -- pgloader after 3.4 can do this with a WITH-statement
  BEFORE LOAD DO
    $$ DROP SCHEMA IF EXISTS project CASCADE; $$

  ALTER SCHEMA 'project_db' RENAME TO 'project'
  AFTER LOAD DO
    $$ ALTER DATABASE acme_project_db SET search_path TO project, public; $$
;

Execute it. It may emit some WARNINGs, but it should not produce any ERRORs. If it does, you'll need to go back and fix those.

# pgloader import.pgloader

Checking for success

Do a dump on the destination DB and do basic checks.

$ pg_dump -Ustolon -W -hlocalhost acme_project_db --inserts \
  >dump-postgres.sql

Check if total dump sizes are somewhat equal, and do an INSERT count.

# grep --text ^INSERT /root/dump-first-compat.sql | wc -l
8966

$ grep --text ^INSERT dump-postgres.sql | wc -l
8966

Schema name change explanation

In MySQL, a schema is the same as a database. In PostgreSQL they are not equal. pgloader will import your data into a schema with the same name as your database by default.

That schema is not searched by default. So after import, you'll be looking at 0 relations in your psql shell, until you switch schema-search.

acme_project_db=# \d
Did not find any relations.

acme_project_db=# SET search_path TO project, public;
SET

acme_project_db=# \d
... list of relations ...

You can do one of two things:

Have pgloader import into the public schema:

LOAD DATABASE
...
WITH create no schema
ALTER SCHEMA 'project_db' RENAME TO 'public';

Alter the default schema search path, and set a sane schema name:

LOAD DATABASE
...
ALTER SCHEMA 'project_db' RENAME TO 'project';
AFTER LOAD DO
  $$ ALTER DATABASE acme_project_db SET search_path TO project, public; $$

That second solution works, because the ALTER DATABASE will make it the default, and according to the PostgreSQL schema documentation:

The first schema in the search path that exists is the default location for creating new objects. [...] Also, since myschema is the first element in the path, new objects would by default be created in it.

This means that a schema-agnostic Django project won't notice that we've put everything in a named schema. And the schemas may come in handy in the future.

pgloader WITH-statements explained

The WITH-statements are by default:

[...] and the default WITH clause is: no truncate, create tables, include drop, create indexes, reset sequences, foreign keys, downcase identifiers, uniquify index names. [...] Please note that CASCADE is used to ensure that tables are dropped even if there are foreign keys pointing to them.

We'll tweak them:

WITH
  -- Django indexes should already be unique, no need to make them more unique
  preserve index names,
  -- Default doesn't drop schema (especially not public), but we'll want to
  -- flush it clean ('include drop' only (cascade!) drops the remote targets
  -- (does not exist in pgloader 3.4.x!)
  drop schema

pgloader import / mysql to postgresql