Migration from mine2updater

Not compatible with PDBj dump files

pdb-mine-builder is not compatible with the PostgreSQL dump files distributed by PDBj (e.g., mine2_pdbj.dump). The schema structure, column types, and data transformations differ significantly. You cannot restore a PDBj dump into a pdb-mine-builder database or vice versa. pdb-mine-builder builds its database from source data files (CIF/mmJSON), not from database dumps.

pdb-mine-builder is a complete rewrite of the original mine2updater — the RDB updater for PDBj's Mine 2 system. This page covers what changed and why.

Why Rewrite?

The original mine2updater served PDBj well, but had several limitations:

Unmaintained dependencies — Node.js libpq native bindings, OpenBabel for fingerprinting
No chemical search integration — Fingerprints stored as raw bigint columns (byte0–byte15) via OpenBabel, no built-in substructure or similarity search
PostgreSQL 12 era — No support for newer PostgreSQL features
No tests or type checking — Difficult to maintain and extend safely
mmJSON only — CIF parsing not supported (mmJSON conversion required upstream)
No schema migration tooling — Dynamic ALTER TABLE based schema upgrades, no version tracking

Technology Comparison

Aspect	mine2updater	pdb-mine-builder
Language	JavaScript (Node.js 14+)	Python 3.12+
PostgreSQL	12+	17+
DB driver	libpq (C binding)	psycopg3
Parser	Built-in mmJSON parser	gemmi (CIF + mmJSON)
Default format	mmJSON	CIF
Chemical search	OpenBabel fingerprints (bigint columns)	RDKit PostgreSQL cartridge (mol type)
Schema definition	YAML def files	SQLAlchemy Core
Schema migration	Dynamic ALTER TABLE	Alembic (versioned)
CLI	Shell script + `node mine2.js`	Typer + Rich
Config validation	None	Pydantic
Package manager	npm	Pixi (conda + PyPI)
Parallel processing	Node.js cluster	ProcessPoolExecutor
Tests	None	pytest
Linting and static checks	None	ruff

Key Changes

CIF as Default Format

mine2updater exclusively used mmJSON files. pdb-mine-builder defaults to CIF (mmCIF) format and parses both CIF and mmJSON via gemmi. CIF is the canonical format distributed by wwPDB and avoids the need for upstream mmJSON conversion.

mmJSON is still supported — set format: mmjson in config.yml for pipelines that support it.

RDKit Instead of OpenBabel

The original system used OpenBabel to generate FP2 fingerprints, stored as 16 bigint columns (byte0–byte15) in cc.brief_summary. Searches required custom bit-manipulation SQL.
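To illustrate what that bit manipulation involves, here is a minimal Python sketch of a Tanimoto comparison over a fingerprint split into 16 integer words. It assumes each of byte0–byte15 holds one 64-bit slice of the fingerprint (the packing is not spelled out here), and `tanimoto_words` is a hypothetical helper, not code from mine2updater.

```python
# Assumption: each byteN column stores 64 fingerprint bits as a signed bigint.
MASK64 = 0xFFFFFFFFFFFFFFFF  # reinterpret signed bigint values as unsigned 64-bit


def popcount(word: int) -> int:
    """Count the set bits in one 64-bit word."""
    return bin(word & MASK64).count("1")


def tanimoto_words(a: list[int], b: list[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as equal-length word lists."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same number of words")
    common = sum(popcount(x & y) for x, y in zip(a, b))  # bits set in both
    union = sum(popcount(x | y) for x, y in zip(a, b))   # bits set in either
    return common / union if union else 0.0
```

With the RDKit cartridge this arithmetic moves into the database, so queries no longer need to touch individual words.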

pdb-mine-builder uses the RDKit PostgreSQL cartridge, which provides:

Native mol column type for molecular structures
Built-in substructure search (@> operator)
Tanimoto similarity search (% operator with Morgan fingerprints)
Molecular descriptor functions (MW, LogP, TPSA, etc.)

-- Substructure search (pdb-mine-builder)
SELECT * FROM cc.brief_summary WHERE mol @> 'c1ccccc1'::mol;

-- Similarity search
SELECT *, tanimoto_sml(morganbv_fp(mol), morganbv_fp('CCO'::mol)) AS similarity
FROM cc.brief_summary
WHERE morganbv_fp(mol) % morganbv_fp('CCO'::mol);

SMILES values are generated from molecular structure via ccd2rdmol + RDKit, not taken from pdbx_chem_comp_descriptor records.
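When the similarity query above is issued from Python with bound parameters, the `%` operator has to be written as `%%`, because psycopg treats a single `%` as the start of a placeholder. The sketch below only builds the query text and parameters. `similarity_query`, the `ORDER BY` and `LIMIT` clause, and the connection string in the usage comment are illustrative assumptions, not part of pdb-mine-builder.

```python
def similarity_query(smiles: str, limit: int = 20) -> tuple[str, tuple]:
    """Build a parameterized Tanimoto similarity query for cc.brief_summary."""
    sql = (
        "SELECT *, tanimoto_sml(morganbv_fp(mol), morganbv_fp(%s::mol)) AS similarity "
        "FROM cc.brief_summary "
        # '%%' is a literal '%' (the RDKit similarity operator) once parameters are bound
        "WHERE morganbv_fp(mol) %% morganbv_fp(%s::mol) "
        "ORDER BY similarity DESC LIMIT %s"
    )
    return sql, (smiles, smiles, limit)


# Usage with psycopg 3 (assumed connection string):
#   import psycopg
#   sql, params = similarity_query("CCO", limit=10)
#   with psycopg.connect("dbname=pmb") as conn:
#       rows = conn.execute(sql, params).fetchall()
```

Passing the SMILES string as a parameter instead of formatting it into the SQL text avoids quoting problems and SQL injection.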

No Foreign Key Constraints

mine2updater defined foreign keys in its YAML schema files and managed them during schema upgrades. pdb-mine-builder removes all foreign key constraints to improve bulk loading performance. Data integrity is ensured by the pipeline logic, not database constraints.

Alembic Migrations

Schema changes in mine2updater were handled by comparing the running database against the YAML definition and issuing ALTER TABLE statements dynamically. This had no version tracking and could fail on complex changes.

pdb-mine-builder uses Alembic for versioned, reproducible schema migrations:

pixi run db-migrate "add new column"  # Generate migration
pixi run db-upgrade                   # Apply migrations
pixi run db-downgrade                 # Rollback
pixi run db-history                   # View history

Removed Columns

docid (bigint): No longer used as a primary key; pdb-mine-builder uses text-based primary keys directly. The column is retained in most brief_summary tables for compatibility and is removed only from prd_family.brief_summary (replaced by name).
byte0–byte15 (bigint): Removed from cc.brief_summary and replaced by the RDKit mol column.

New Columns

The following columns are added by pdb-mine-builder and do not exist in the original mine2 schema. They are marked with a [pmb] prefix in column descriptions.

Schema	Table	Column	Type	Description
`cc`	`brief_summary`	`canonical_smiles`	text	Canonical SMILES generated by RDKit via ccd2rdmol
`pdbj`	`pdbx_struct_assembly_gen`	`_hash_asym_id_list`	text	SHA-256 hash for composite PK deduplication
`pdbj`	`pdbx_struct_assembly_gen`	`_hash_oper_expression`	text	SHA-256 hash for composite PK deduplication
`prd_family`	`brief_summary`	`name`	text	Family name (from pdbx_reference_molecule_family)
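The two `_hash_*` columns can be pictured with a short Python sketch: hash the text value, then use the fixed-length digest in place of the text when building a composite key. The table above only says the columns hold SHA-256 hashes; the hex encoding, the UTF-8 input, and the row field names used here are assumptions for illustration.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a text value as 64 hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row for each (assembly_id, hashed list, hashed expression) key."""
    seen: dict[tuple, dict] = {}
    for row in rows:
        key = (
            row["assembly_id"],                  # hypothetical field names
            sha256_hex(row["asym_id_list"]),
            sha256_hex(row["oper_expression"]),
        )
        seen.setdefault(key, row)  # later duplicates of the same key are dropped
    return list(seen.values())
```

A digest has a fixed length regardless of how long the original list or expression is, which keeps the composite key small.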

Removed Tables

Tables present in mine2 but removed in pdb-mine-builder:

Schema	Table	Reason
`cc`	`link_entry_pdbjplus`	PDBjPlus-specific, not needed
`pdbj`	`history_pdbmlplus`	PDBjPlus-specific, not needed
`pdbj`	`citation_author_pdbmlplus`	PDBjPlus-specific, not needed
`pdbj`	`exptl_crystal_pdbmlplus`	PDBjPlus-specific, not needed
`pdbj`	`refine_ls_shell_pdbmlplus`	PDBjPlus-specific, not needed
`pdbj`	`reflns_shell_pdbmlplus`	PDBjPlus-specific, not needed
`pdbj`	`software_pdbmlplus`	PDBjPlus-specific, not needed
`pdbj`	`struct_ref_src_pdbmlplus`	PDBjPlus-specific, not needed

Schema Changes

Removed schemas (present in mine2 rdb_docs but not in pdb-mine-builder):

Schema	Reason
`empiar`	Out of scope (EMPIAR data)
`misc`	Consolidated into other schemas
`sifts`	Out of scope (SIFTS mapping data)

Schema-only definitions (no pipeline yet):

Schema	Notes
`emdb`	Electron Microscopy Data Bank
`ihm`	Integrative/Hybrid Methods

These schemas have table definitions but no data-loading pipeline. See the Database Overview for current status.

prd_family pipeline

In mine2, prd_family existed only as a schema definition with no loading pipeline. pdb-mine-builder now provides a full CIF pipeline that loads it from family-all.cif.gz.

Mtime-Based Skip Optimization

pdb-mine-builder tracks file modification times in an entry_metadata table. During incremental updates, unchanged entries are automatically skipped. Use --force to bypass this check.

mine2updater did not have this optimization — it processed all entries on every run.
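The skip decision can be sketched in a few lines of Python. This is an illustration of the idea only: a plain dict stands in for the entry_metadata table, and `needs_update` is a hypothetical name, not the function pdb-mine-builder uses.

```python
import os


def needs_update(entry_id: str, path: str, recorded_mtimes: dict[str, float],
                 force: bool = False) -> bool:
    """Return True when an entry's file should be (re)processed."""
    if force:
        return True  # mirrors --force: bypass the mtime check
    current = os.stat(path).st_mtime
    # Unknown entries and entries whose file changed since the last run are processed
    return recorded_mtimes.get(entry_id) != current
```

After a successful load, the caller would store the new mtime for the entry so that the next incremental run skips it.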

Bulk Load with COPY Protocol

pdb-mine-builder supports a dedicated bulk load mode using PostgreSQL's COPY protocol for initial data loading, which is significantly faster than row-by-row INSERT:

pixi run pmb load pdbj --force
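For readers unfamiliar with COPY, the following Python sketch shows how rows are serialized in PostgreSQL's default text format: tab-separated fields, `\N` for NULL, and backslash escapes for special characters. It is not pdb-mine-builder's loader; in practice psycopg 3 handles this encoding through `cursor.copy()` and `write_row()`, and the helper names here are hypothetical.

```python
def copy_escape(value) -> str:
    """Encode one field for COPY ... FROM STDIN in text format."""
    if value is None:
        return r"\N"  # NULL marker in COPY text format
    text = str(value)
    # Escape the backslash first so the escapes added afterwards are not doubled
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_copy_line(row) -> str:
    """Serialize one row as a single tab-separated, newline-terminated line."""
    return "\t".join(copy_escape(v) for v in row) + "\n"
```

Streaming many such lines through one COPY statement avoids the per-statement overhead of individual INSERTs.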

Database Compatibility

pdb-mine-builder produces a database that follows the Mine 2 schema layout: schema names, table names, and column naming conventions are preserved, apart from the removals and additions listed above. Basic queries should work with minimal changes:

Replace docid-based lookups with primary key lookups
Replace byte0–byte15 fingerprint queries with RDKit operators
Schema names and table names are unchanged
Column names follow the same CIF/mmJSON category naming convention

Important

pdb-mine-builder is an independent reimplementation of the Mine 2 database builder. While the schema structure is largely compatible, full query compatibility with PDBj's official services is not guaranteed.

SQL queries written for PDBj's Mine 2 RDB web API or REST API may not work as-is against a pdb-mine-builder database
Column types, NULL handling, and data transformations may differ in subtle ways
The brief_summary tables are constructed differently (e.g., docid is not used as a key, and the byte0–byte15 fingerprint columns are replaced by mol)
New columns or tables added by PDBj's official Mine 2 system may not be present

If you rely on queries from PDBj's official documentation or web interface, test them against your local database before use in production.

Configuration Comparison

mine2updater (config.yml):

rdb:
  nworkers: 16
  constring: "dbname='mine2' user='pdbj' password='pdbj_pwd' port=5432"

obabel: /usr/bin/obabel

pipelines:
  pdb:
    deffile: ${CWD}schemas/pdbj.def.yml
    data: ${CWD}data/mmjson-noatom/

pdb-mine-builder (config.yml):

rdb:
  constring: "dbname=pmb user=pdbj port=5433"
  nworkers: 8

pipelines:
  pdbj:
    format: cif
    data: /path/to/data/pdb/cif/

Key config differences:

obabel setting removed (RDKit handles chemistry natively)
deffile removed (schemas defined in Python code)
format field added (choose between cif and mmjson)
Pipeline name pdb renamed to pdbj
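The four differences above can be applied mechanically to a parsed mine2updater config. The sketch below works on a plain dict (as returned by a YAML loader) and uses only the keys shown in the two examples; `convert_config` is a hypothetical helper, and data paths are left untouched, so they still need to point at files in the chosen format.

```python
import copy


def convert_config(old: dict, fmt: str = "cif") -> dict:
    """Return a pdb-mine-builder style config derived from a mine2updater one."""
    new = copy.deepcopy(old)          # leave the caller's dict unchanged
    new.pop("obabel", None)           # obabel setting removed
    pipelines = new.get("pipelines", {})
    if "pdb" in pipelines:
        pipelines["pdbj"] = pipelines.pop("pdb")  # pipeline pdb renamed to pdbj
    for settings in pipelines.values():
        settings.pop("deffile", None)             # schemas now live in Python code
        settings.setdefault("format", fmt)        # new format field: cif or mmjson
    return new
```

Connection settings such as constring and nworkers carry over unchanged and should be reviewed by hand.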
