Migration from mine2updater

Not compatible with PDBj dump files

pdb-mine-builder is not compatible with the PostgreSQL dump files distributed by PDBj (e.g., mine2_pdbj.dump). The schema structure, column types, and data transformations differ significantly. You cannot restore a PDBj dump into a pdb-mine-builder database or vice versa. pdb-mine-builder builds its database from source data files (CIF/mmJSON), not from database dumps.

pdb-mine-builder is a complete rewrite of the original mine2updater — the RDB updater for PDBj's Mine 2 system. This page covers what changed and why.

Why Rewrite?

The original mine2updater served PDBj well, but had several limitations:

  • Unmaintained dependencies — Node.js libpq native bindings, OpenBabel for fingerprinting
  • No chemical search integration — Fingerprints stored as raw bigint columns (byte0–byte15) via OpenBabel, no built-in substructure or similarity search
  • PostgreSQL 12 era — No support for newer PostgreSQL features
  • No tests or type checking — Difficult to maintain and extend safely
  • mmJSON only — CIF parsing not supported (mmJSON conversion required upstream)
  • No schema migration tooling — Dynamic ALTER TABLE based schema upgrades, no version tracking

Technology Comparison

| Aspect | mine2updater | pdb-mine-builder |
| --- | --- | --- |
| Language | JavaScript (Node.js 14+) | Python 3.12+ |
| PostgreSQL | 12+ | 17+ |
| DB driver | libpq (C binding) | psycopg3 |
| Parser | Built-in mmJSON parser | gemmi (CIF + mmJSON) |
| Default format | mmJSON | CIF |
| Chemical search | OpenBabel fingerprints (bigint columns) | RDKit PostgreSQL cartridge (mol type) |
| Schema definition | YAML def files | SQLAlchemy Core |
| Schema migration | Dynamic ALTER TABLE | Alembic (versioned) |
| CLI | Shell script + node mine2.js | Typer + Rich |
| Config validation | None | Pydantic |
| Package manager | npm | Pixi (conda + PyPI) |
| Parallel processing | Node.js cluster | ProcessPoolExecutor |
| Tests | None | pytest |
| Type checking | None | ruff |

Key Changes

CIF as Default Format

mine2updater exclusively used mmJSON files. pdb-mine-builder defaults to CIF (mmCIF) format and parses both CIF and mmJSON via gemmi. CIF is the canonical format distributed by wwPDB and avoids the need for upstream mmJSON conversion.

mmJSON is still supported — set format: mmjson in config.yml for pipelines that support it.

RDKit Instead of OpenBabel

The original system used OpenBabel to generate FP2 fingerprints, stored as 16 bigint columns (byte0–byte15) in cc.brief_summary. Searches required custom bit-manipulation SQL.
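To make the "custom bit-manipulation" concrete, here is a minimal Python sketch of the arithmetic such a search has to perform on a fingerprint split into 16 64-bit words. It is an illustration only: the function names are hypothetical, and the SQL that mine2 actually used is not shown in this page.

```python
MASK64 = (1 << 64) - 1  # bigint is signed; reinterpret each value as 64 unsigned bits


def popcount(words):
    """Count the set bits across a sequence of 64-bit words."""
    return sum(bin(w & MASK64).count("1") for w in words)


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as 16 integers each."""
    if len(fp_a) != 16 or len(fp_b) != 16:
        raise ValueError("expected 16 words per fingerprint")
    common = popcount(a & b for a, b in zip(fp_a, fp_b))
    union = popcount(a | b for a, b in zip(fp_a, fp_b))
    return common / union if union else 0.0


def is_possible_substructure(query_fp, target_fp):
    """Screening test: every bit set in the query must also be set in the target."""
    return all((q & t & MASK64) == (q & MASK64) for q, t in zip(query_fp, target_fp))
```

A fingerprint screen like this can only rule candidates out; a real substructure match still needs a graph comparison, which is what the cartridge described next provides.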

pdb-mine-builder uses the RDKit PostgreSQL cartridge, which provides:

  • Native mol column type for molecular structures
  • Built-in substructure search (@> operator)
  • Tanimoto similarity search (% operator with Morgan fingerprints)
  • Molecular descriptor functions (MW, LogP, TPSA, etc.)
-- Substructure search (pdb-mine-builder)
SELECT * FROM cc.brief_summary WHERE mol @> 'c1ccccc1'::mol;

-- Similarity search
SELECT *, tanimoto_sml(morganbv_fp(mol), morganbv_fp('CCO'::mol)) AS similarity
FROM cc.brief_summary
WHERE morganbv_fp(mol) % morganbv_fp('CCO'::mol);

SMILES values are generated from molecular structure via ccd2rdmol + RDKit, not taken from pdbx_chem_comp_descriptor records.

No Foreign Key Constraints

mine2updater defined foreign keys in its YAML schema files and managed them during schema upgrades. pdb-mine-builder removes all foreign key constraints to improve bulk loading performance. Data integrity is ensured by the pipeline logic, not database constraints.
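Without foreign keys, any referential check has to happen in application code. The sketch below shows one way such a check could look. It is an assumption for illustration, not the pipeline's actual logic, and find_orphans, the row layout, and the column name are hypothetical.

```python
def find_orphans(child_rows, parent_keys, fk_column):
    """Return child rows whose fk_column value is not among parent_keys.

    Rows with a None reference are accepted, as they would be under a
    nullable foreign key.
    """
    known = set(parent_keys)
    return [
        row
        for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in known
    ]


# Hypothetical usage: rows are dicts keyed by column name.
entries = ["1abc", "2xyz"]
entities = [
    {"pdbid": "1abc", "id": "1"},
    {"pdbid": "9zzz", "id": "1"},  # refers to an entry that was never loaded
]
orphans = find_orphans(entities, entries, "pdbid")
```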

Alembic Migrations

Schema changes in mine2updater were handled by comparing the running database against the YAML definition and issuing ALTER TABLE statements dynamically. This had no version tracking and could fail on complex changes.

pdb-mine-builder uses Alembic for versioned, reproducible schema migrations:

pixi run db-migrate "add new column"  # Generate migration
pixi run db-upgrade # Apply migrations
pixi run db-downgrade # Rollback
pixi run db-history # View history
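The idea behind versioned migrations can be shown without Alembic itself. The following sketch uses the standard-library sqlite3 module and a hand-rolled schema_version table. It is not Alembic's API or storage format, and the table names and revision identifiers are made up for the example.

```python
import sqlite3

# Ordered (revision, SQL) pairs; names and statements are illustrative only.
MIGRATIONS = [
    ("0001_create_entry", "CREATE TABLE entry (pdbid TEXT PRIMARY KEY)"),
    ("0002_add_title", "ALTER TABLE entry ADD COLUMN title TEXT"),
]


def upgrade(conn):
    """Apply every migration not yet recorded in the version table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (revision TEXT PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT revision FROM schema_version")}
    newly_applied = []
    for revision, statement in MIGRATIONS:
        if revision in applied:
            continue  # already applied on an earlier run
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version (revision) VALUES (?)", (revision,))
        newly_applied.append(revision)
    conn.commit()
    return newly_applied
```

Because each applied revision is recorded, running the upgrade twice is a no-op the second time, which is the property that comparing a live database against a YAML definition could not guarantee.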

Removed Columns

  • docid (bigint): No longer used as a primary key; pdb-mine-builder uses text-based primary keys directly. The column is retained in most brief_summary tables for compatibility, but is removed from prd_family.brief_summary (replaced by name).
  • byte0–byte15 (bigint): Replaced by the RDKit mol column in cc.brief_summary.

New Columns

The following columns are added by pdb-mine-builder and do not exist in the original mine2 schema. They are marked with a [pmb] prefix in column descriptions.

| Schema | Table | Column | Type | Description |
| --- | --- | --- | --- | --- |
| cc | brief_summary | canonical_smiles | text | Canonical SMILES generated by RDKit via ccd2rdmol |
| pdbj | pdbx_struct_assembly_gen | _hash_asym_id_list | text | SHA-256 hash for composite PK deduplication |
| pdbj | pdbx_struct_assembly_gen | _hash_oper_expression | text | SHA-256 hash for composite PK deduplication |
| prd_family | brief_summary | name | text | Family name (from pdbx_reference_molecule_family) |
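As a rough illustration of the hash columns, the sketch below derives a fixed-length SHA-256 digest from an arbitrarily long text value so that the digest, rather than the raw text, can take part in a composite key. The function name, the UTF-8 encoding, the hex output, and the handling of missing values are all assumptions; the source does not specify them.

```python
import hashlib


def hash_key_part(value):
    """Return the SHA-256 hex digest of a text value (None is hashed as "")."""
    text = "" if value is None else value
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# A long asym_id_list collapses to a 64-character key component.
long_list = ",".join(f"A{i}" for i in range(5000))
digest = hash_key_part(long_list)
```

The digest always has the same length regardless of input size, which keeps key and index entries small even when asym_id_list enumerates thousands of chains.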

Removed Tables

Tables present in mine2 but removed in pdb-mine-builder:

| Schema | Table | Reason |
| --- | --- | --- |
| cc | link_entry_pdbjplus | PDBjPlus-specific, not needed |
| pdbj | history_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | citation_author_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | exptl_crystal_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | refine_ls_shell_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | reflns_shell_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | software_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | struct_ref_src_pdbmlplus | PDBjPlus-specific, not needed |

Schema Changes

Removed schemas (present in mine2 rdb_docs but not in pdb-mine-builder):

| Schema | Reason |
| --- | --- |
| empiar | Out of scope (EMPIAR data) |
| misc | Consolidated into other schemas |
| sifts | Out of scope (SIFTS mapping data) |

Schema-only definitions (no pipeline yet):

| Schema | Notes |
| --- | --- |
| emdb | Electron Microscopy Data Bank |
| ihm | Integrative/Hybrid Methods |

These schemas have table definitions but no data-loading pipeline. See the Database Overview for current status.

prd_family Pipeline

prd_family existed as schema-only in mine2 (no loading pipeline). pdb-mine-builder now provides a full CIF pipeline using family-all.cif.gz.

Mtime-Based Skip Optimization

pdb-mine-builder tracks file modification times in an entry_metadata table. During incremental updates, unchanged entries are automatically skipped. Use --force to bypass this check.

mine2updater did not have this optimization — it processed all entries on every run.
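The skip decision can be sketched in a few lines. Here a plain dict stands in for the entry_metadata table, and needs_update is a hypothetical name; how pdb-mine-builder actually stores and compares modification times is not described in this page.

```python
import os


def needs_update(path, recorded_mtimes, force=False):
    """Decide whether a source file should be (re)processed.

    recorded_mtimes maps a file path to the mtime stored after its last
    successful load (a stand-in for the entry_metadata table).
    """
    if force:
        return True  # equivalent in spirit to passing --force
    current = os.path.getmtime(path)
    # An unknown path has no recorded mtime, so it is always processed.
    return recorded_mtimes.get(path) != current
```

Comparing for inequality rather than "newer than" also reprocesses a file whose timestamp moved backwards, for example after restoring an older copy.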

Bulk Load with COPY Protocol

pdb-mine-builder supports a dedicated bulk load mode using PostgreSQL's COPY protocol for initial data loading, which is significantly faster than row-by-row INSERT:

pixi run pmb load pdbj --force
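For readers unfamiliar with COPY, the sketch below serializes rows into PostgreSQL's default COPY text format: tab-separated fields, one row per line, \N for NULL, and backslash escapes for special characters. It only illustrates what travels over the wire. The helper names are hypothetical, and in practice a driver such as psycopg 3 performs this formatting itself when rows are written through its COPY support.

```python
def copy_escape(value):
    """Encode one field for PostgreSQL COPY text format."""
    if value is None:
        return r"\N"  # default NULL marker
    text = str(value)
    # Escape the backslash first so the escapes added below are not doubled.
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_copy_text(rows):
    """Serialize rows to tab-delimited, newline-terminated COPY text."""
    return "".join("\t".join(copy_escape(v) for v in row) + "\n" for row in rows)
```

The speed advantage comes from streaming many rows in one command instead of parsing and executing one INSERT statement per row.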

Database Compatibility

pdb-mine-builder produces a database that follows the Mine 2 schema structure: schema names, table names, and the CIF/mmJSON category-based column naming convention are preserved. Basic queries should work after two kinds of change:

  • Replace docid-based lookups with primary key lookups
  • Replace byte0–byte15 fingerprint queries with RDKit operators
Important

pdb-mine-builder is an independent reimplementation of the Mine 2 database builder. While the schema structure is largely compatible, full query compatibility with PDBj's official services is not guaranteed.

  • SQL queries written for PDBj's Mine 2 RDB web API or REST API may not work as-is against a pdb-mine-builder database
  • Column types, NULL handling, and data transformations may differ in subtle ways
  • The brief_summary tables are constructed differently (e.g., docid is no longer the primary key, and the byte0–byte15 fingerprint columns are replaced by an RDKit mol column)
  • New columns or tables added by PDBj's official Mine 2 system may not be present

If you rely on queries from PDBj's official documentation or web interface, test them against your local database before use in production.

Configuration Comparison

mine2updater (config.yml):

rdb:
  nworkers: 16
  constring: "dbname='mine2' user='pdbj' password='pdbj_pwd' port=5432"

obabel: /usr/bin/obabel

pipelines:
  pdb:
    deffile: ${CWD}schemas/pdbj.def.yml
    data: ${CWD}data/mmjson-noatom/

pdb-mine-builder (config.yml):

rdb:
  constring: "dbname=pmb user=pdbj port=5433"
  nworkers: 8

pipelines:
  pdbj:
    format: cif
    data: /path/to/data/pdb/cif/

Key config differences:

  • obabel setting removed (RDKit handles chemistry natively)
  • deffile removed (schemas defined in Python code)
  • format field added (choose between cif and mmjson)
  • Pipeline name pdb renamed to pdbj
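The differences listed above can be applied mechanically to an already-parsed configuration. The sketch below works on plain dicts (as a YAML loader would produce) and is a hypothetical helper, not a tool shipped with pdb-mine-builder. It assumes the existing mmJSON data directory is kept, so it sets format to mmjson rather than the cif default; whether a given pipeline accepts mmJSON should be checked separately.

```python
import copy


def convert_config(old):
    """Translate a mine2updater-style config dict to the newer layout."""
    # Copy everything except the keys that are dropped or rebuilt below.
    new = {
        key: copy.deepcopy(value)
        for key, value in old.items()
        if key not in ("obabel", "pipelines")
    }
    new["pipelines"] = {}
    for name, settings in old.get("pipelines", {}).items():
        # deffile is gone: schemas are defined in Python code.
        converted = {k: v for k, v in settings.items() if k != "deffile"}
        # Assumption: the old data directory holds mmJSON files.
        converted.setdefault("format", "mmjson")
        new["pipelines"]["pdbj" if name == "pdb" else name] = converted
    return new


old_config = {
    "rdb": {"nworkers": 16, "constring": "dbname='mine2' user='pdbj' port=5432"},
    "obabel": "/usr/bin/obabel",
    "pipelines": {
        "pdb": {
            "deffile": "${CWD}schemas/pdbj.def.yml",
            "data": "${CWD}data/mmjson-noatom/",
        }
    },
}
new_config = convert_config(old_config)
```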