Migration from mine2updater
pdb-mine-builder is not compatible with the PostgreSQL dump files distributed by PDBj (e.g., mine2_pdbj.dump). The schema structure, column types, and data transformations differ significantly. You cannot restore a PDBj dump into a pdb-mine-builder database or vice versa. pdb-mine-builder builds its database from source data files (CIF/mmJSON), not from database dumps.
pdb-mine-builder is a complete rewrite of the original mine2updater — the RDB updater for PDBj's Mine 2 system. This page covers what changed and why.
Why Rewrite?
The original mine2updater served PDBj well, but had several limitations:
- Unmaintained dependencies: Node.js libpq native bindings, OpenBabel for fingerprinting
- No chemical search integration: Fingerprints stored as raw bigint columns (byte0–byte15) via OpenBabel, no built-in substructure or similarity search
- PostgreSQL 12 era: No support for newer PostgreSQL features
- No tests or type checking: Difficult to maintain and extend safely
- mmJSON only: CIF parsing not supported (mmJSON conversion required upstream)
- No schema migration tooling: Dynamic ALTER TABLE based schema upgrades, no version tracking
Technology Comparison
| Aspect | mine2updater | pdb-mine-builder |
|---|---|---|
| Language | JavaScript (Node.js 14+) | Python 3.12+ |
| PostgreSQL | 12+ | 17+ |
| DB driver | libpq (C binding) | psycopg3 |
| Parser | Built-in mmJSON parser | gemmi (CIF + mmJSON) |
| Default format | mmJSON | CIF |
| Chemical search | OpenBabel fingerprints (bigint columns) | RDKit PostgreSQL cartridge (mol type) |
| Schema definition | YAML def files | SQLAlchemy Core |
| Schema migration | Dynamic ALTER TABLE | Alembic (versioned) |
| CLI | Shell script + node mine2.js | Typer + Rich |
| Config validation | None | Pydantic |
| Package manager | npm | Pixi (conda + PyPI) |
| Parallel processing | Node.js cluster | ProcessPoolExecutor |
| Tests | None | pytest |
| Linting | None | ruff |
Key Changes
CIF as Default Format
mine2updater exclusively used mmJSON files. pdb-mine-builder defaults to CIF (mmCIF) format and parses both CIF and mmJSON via gemmi. CIF is the canonical format distributed by wwPDB and avoids the need for upstream mmJSON conversion.
mmJSON is still supported — set format: mmjson in config.yml for pipelines that support it.
RDKit Instead of OpenBabel
The original system used OpenBabel to generate FP2 fingerprints, stored as 16 bigint columns (byte0–byte15) in cc.brief_summary. Searches required custom bit-manipulation SQL.
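To make the legacy storage model concrete, the sketch below packs a 1024-bit fingerprint (the size of OpenBabel's FP2) into 16 signed 64-bit words and applies the usual bitwise screening and Tanimoto logic. The word order, the bit-to-column mapping, and the function names are assumptions for illustration; this is not mine2updater's actual code.

```python
from typing import Iterable, List

WORDS = 16          # one word per column, byte0..byte15 (assumed order)
BITS_PER_WORD = 64  # a PostgreSQL bigint is a signed 64-bit integer
MASK64 = (1 << 64) - 1


def to_signed64(value: int) -> int:
    """Reinterpret an unsigned 64-bit value as a signed bigint."""
    return value - (1 << 64) if value >= (1 << 63) else value


def pack_fingerprint(on_bits: Iterable[int]) -> List[int]:
    """Pack set bit positions (0..1023) into 16 signed 64-bit words."""
    words = [0] * WORDS
    for bit in on_bits:
        if not 0 <= bit < WORDS * BITS_PER_WORD:
            raise ValueError(f"bit {bit} out of range")
        words[bit // BITS_PER_WORD] |= 1 << (bit % BITS_PER_WORD)
    return [to_signed64(w) for w in words]


def may_contain(target: List[int], query: List[int]) -> bool:
    """Substructure screen: every query bit must also be set in the target."""
    return all((t & q) == q for t, q in zip(target, query))


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    common = sum(bin((x & y) & MASK64).count("1") for x, y in zip(a, b))
    total = sum(bin((x | y) & MASK64).count("1") for x, y in zip(a, b))
    return common / total if total else 0.0
```

A fingerprint screen of this kind only rules candidates out; a hit still needs a real substructure match, which is part of what the RDKit operators described next take care of inside the database.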
pdb-mine-builder uses the RDKit PostgreSQL cartridge, which provides:
- Native mol column type for molecular structures
- Built-in substructure search (@> operator)
- Tanimoto similarity search (% operator with Morgan fingerprints)
- Molecular descriptor functions (MW, LogP, TPSA, etc.)
-- Substructure search (pdb-mine-builder)
SELECT * FROM cc.brief_summary WHERE mol @> 'c1ccccc1'::mol;
-- Similarity search
SELECT *, tanimoto_sml(morganbv_fp(mol), morganbv_fp('CCO'::mol)) AS similarity
FROM cc.brief_summary
WHERE morganbv_fp(mol) % morganbv_fp('CCO'::mol);
SMILES values are generated from molecular structure via ccd2rdmol + RDKit, not taken from pdbx_chem_comp_descriptor records.
No Foreign Key Constraints
mine2updater defined foreign keys in its YAML schema files and managed them during schema upgrades. pdb-mine-builder removes all foreign key constraints to improve bulk loading performance. Data integrity is ensured by the pipeline logic, not database constraints.
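Without foreign keys, a pipeline has to catch dangling references itself. One possible shape for such a check is sketched below; the row layout, the pdbid column, and the function name are hypothetical and not taken from pdb-mine-builder.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, object]


def find_orphans(child_rows: Iterable[Row], fk_column: str,
                 parent_keys: Set[object]) -> List[Row]:
    """Return child rows whose reference does not match any parent key.

    Rows with a NULL (None) reference are accepted, mirroring how a
    nullable foreign key would behave in the database.
    """
    return [
        row for row in child_rows
        if row.get(fk_column) is not None and row[fk_column] not in parent_keys
    ]


# Hypothetical data: parent keys from a summary table, children from a category table.
parents = {"1abc", "2xyz"}
children = [
    {"pdbid": "1abc", "id": "1"},
    {"pdbid": "9zzz", "id": "1"},   # dangling reference
    {"pdbid": None, "id": "2"},     # NULL reference, allowed
]
orphans = find_orphans(children, "pdbid", parents)
```

Running a check like this before the bulk write keeps the load fast while still rejecting inconsistent input.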
Alembic Migrations
Schema changes in mine2updater were handled by comparing the running database against the YAML definition and issuing ALTER TABLE statements dynamically. This had no version tracking and could fail on complex changes.
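The following sketch illustrates the general "diff and ALTER" technique and why it is fragile. It is a simplified stand-in written for this page, not mine2updater's code; the function name and the dictionary-based schema representation are assumptions.

```python
from typing import Dict, List


def diff_columns(table: str, existing: Dict[str, str],
                 desired: Dict[str, str]) -> List[str]:
    """Derive ALTER TABLE statements by comparing column name -> type maps."""
    statements: List[str] = []
    for name, col_type in desired.items():
        if name not in existing:
            statements.append(f'ALTER TABLE {table} ADD COLUMN "{name}" {col_type};')
        elif existing[name] != col_type:
            statements.append(
                f'ALTER TABLE {table} ALTER COLUMN "{name}" TYPE {col_type};'
            )
    for name in existing:
        if name not in desired:
            statements.append(f'ALTER TABLE {table} DROP COLUMN "{name}";')
    return statements


# A rename is indistinguishable from "add one column, drop another",
# so the data in the old column would be lost.
plan = diff_columns(
    "demo.t",
    existing={"docid": "bigint", "title": "text"},
    desired={"docid": "bigint", "name": "text"},
)
```

Because the diff only sees the start and end states, it cannot express intent such as a rename or a data backfill, and nothing records which changes a given database has already received.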
pdb-mine-builder uses Alembic for versioned, reproducible schema migrations:
pixi run db-migrate "add new column" # Generate migration
pixi run db-upgrade # Apply migrations
pixi run db-downgrade # Rollback
pixi run db-history # View history
Removed Columns
- docid (bigint): Retained in most brief_summary tables for compatibility, but not used as a primary key. pdb-mine-builder uses text-based primary keys directly. Removed from prd_family.brief_summary (replaced by name).
- byte0–byte15 (bigint): Replaced by the RDKit mol column in cc.brief_summary.
New Columns
The following columns are added by pdb-mine-builder and do not exist in the original mine2 schema. They are marked with a [pmb] prefix in column descriptions.
| Schema | Table | Column | Type | Description |
|---|---|---|---|---|
| cc | brief_summary | canonical_smiles | text | Canonical SMILES generated by RDKit via ccd2rdmol |
| pdbj | pdbx_struct_assembly_gen | _hash_asym_id_list | text | SHA-256 hash for composite PK deduplication |
| pdbj | pdbx_struct_assembly_gen | _hash_oper_expression | text | SHA-256 hash for composite PK deduplication |
| prd_family | brief_summary | name | text | Family name (from pdbx_reference_molecule_family) |
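The two _hash_ columns can be pictured as fixed-length digests of values that may be arbitrarily long, which makes them more practical as parts of a composite key. The sketch below shows one way to derive them; the UTF-8 encoding, the hex output, and the helper names are assumptions, and the real pipeline may normalize values differently.

```python
import hashlib
from typing import Dict

# Source columns assumed to feed the hash columns listed above.
HASHED_COLUMNS = ("asym_id_list", "oper_expression")


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def add_hash_columns(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row with a _hash_<column> entry per hashed column."""
    out = dict(row)
    for column in HASHED_COLUMNS:
        out[f"_hash_{column}"] = sha256_hex(row[column])
    return out


# Illustrative row only; real values come from pdbx_struct_assembly_gen.
example = add_hash_columns(
    {"assembly_id": "1", "oper_expression": "1", "asym_id_list": "A,B,C"}
)
```

Two rows with identical source values always produce identical hashes, so duplicates collide on the key regardless of how long the original lists are.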
Removed Tables
Tables present in mine2 but removed in pdb-mine-builder:
| Schema | Table | Reason |
|---|---|---|
| cc | link_entry_pdbjplus | PDBjPlus-specific, not needed |
| pdbj | history_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | citation_author_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | exptl_crystal_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | refine_ls_shell_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | reflns_shell_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | software_pdbmlplus | PDBjPlus-specific, not needed |
| pdbj | struct_ref_src_pdbmlplus | PDBjPlus-specific, not needed |
Schema Changes
Removed schemas (present in mine2 rdb_docs but not in pdb-mine-builder):
| Schema | Reason |
|---|---|
| empiar | Out of scope (EMPIAR data) |
| misc | Consolidated into other schemas |
| sifts | Out of scope (SIFTS mapping data) |
Schema-only definitions (no pipeline yet):
| Schema | Notes |
|---|---|
| emdb | Electron Microscopy Data Bank |
| ihm | Integrative/Hybrid Methods |
These schemas have table definitions but no data-loading pipeline. See the Database Overview for current status.
prd_family existed as schema-only in mine2 (no loading pipeline). pdb-mine-builder now provides a full CIF pipeline using family-all.cif.gz.
Mtime-Based Skip Optimization
pdb-mine-builder tracks file modification times in an entry_metadata table. During incremental updates, unchanged entries are automatically skipped. Use --force to bypass this check.
mine2updater did not have this optimization — it processed all entries on every run.
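The skip logic amounts to comparing each file's current modification time with the value recorded at the last load. A minimal sketch follows, with an in-memory dictionary standing in for the entry_metadata table; the function name, the path-based key, and the force parameter's wiring are assumptions.

```python
import os
from typing import Dict, Iterable, List


def select_changed(paths: Iterable[str], recorded_mtimes: Dict[str, float],
                   force: bool = False) -> List[str]:
    """Return the files that need (re)processing.

    A file is selected when it has no recorded mtime, when its mtime differs
    from the recorded one, or when force is set (analogous to --force).
    """
    todo: List[str] = []
    for path in paths:
        current = os.stat(path).st_mtime
        if force or recorded_mtimes.get(path) != current:
            todo.append(path)
    return todo
```

After a successful load, the caller would store the new mtime for each processed file so that the next incremental run can skip it.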
Bulk Load with COPY Protocol
pdb-mine-builder supports a dedicated bulk load mode using PostgreSQL's COPY protocol for initial data loading, which is significantly faster than row-by-row INSERT:
pixi run pmb load pdbj --force
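COPY is fast largely because rows are streamed to the server in one simple text format instead of being sent as individual statements. The sketch below serializes rows into PostgreSQL's COPY text format (tab-delimited, \N for NULL, backslash escapes). It is an illustration of the format, not pdb-mine-builder's loader; in practice psycopg 3's copy support can handle this escaping itself.

```python
from typing import Iterable, Iterator, Optional, Sequence


def copy_escape(value: Optional[object]) -> str:
    """Encode one field for COPY ... FROM STDIN in text format."""
    if value is None:
        return r"\N"  # NULL marker in COPY text format
    text = str(value)
    # The backslash must be escaped first so later escapes are not doubled.
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def copy_lines(rows: Iterable[Sequence[Optional[object]]]) -> Iterator[str]:
    """Yield one tab-delimited, newline-terminated line per row."""
    for row in rows:
        yield "\t".join(copy_escape(v) for v in row) + "\n"


# Illustrative rows; column meaning is hypothetical.
payload = "".join(copy_lines([("1abc", None, "two\twords"), ("2xyz", 3, "ok")]))
```

The resulting payload is what a COPY ... FROM STDIN statement consumes in a single stream, which avoids per-row statement overhead.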
Database Compatibility
pdb-mine-builder produces a database that follows the Mine 2 schema structure: schema names, table names, and column naming conventions are preserved, while some columns and column types differ as described above. Basic queries should work with minimal changes:
- Replace docid-based lookups with primary key lookups
- Replace byte0–byte15 fingerprint queries with RDKit operators
- Schema names and table names are unchanged
- Column names follow the same CIF/mmJSON category naming convention
pdb-mine-builder is an independent reimplementation of the Mine 2 database builder. While the schema structure is largely compatible, full query compatibility with PDBj's official services is not guaranteed.
- SQL queries written for PDBj's Mine 2 RDB web API or REST API may not work as-is against a pdb-mine-builder database
- Column types, NULL handling, and data transformations may differ in subtle ways
- The brief_summary tables are constructed differently (e.g., docid is not used as a primary key and is absent from prd_family.brief_summary; the byte0–byte15 fingerprint columns are replaced by mol)
- New columns or tables added by PDBj's official Mine 2 system may not be present
If you rely on queries from PDBj's official documentation or web interface, test them against your local database before use in production.
Configuration Comparison
mine2updater (config.yml):
rdb:
nworkers: 16
constring: "dbname='mine2' user='pdbj' password='pdbj_pwd' port=5432"
obabel: /usr/bin/obabel
pipelines:
pdb:
deffile: ${CWD}schemas/pdbj.def.yml
data: ${CWD}data/mmjson-noatom/
pdb-mine-builder (config.yml):
rdb:
  constring: "dbname=pmb user=pdbj port=5433"
  nworkers: 8
pipelines:
  pdbj:
    format: cif
    data: /path/to/data/pdb/cif/
Key config differences:
- obabel setting removed (RDKit handles chemistry natively)
- deffile removed (schemas defined in Python code)
- format field added (choose between cif and mmjson)
- Pipeline name pdb renamed to pdbj
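The differences listed above can be applied mechanically to an already-parsed configuration. The helper below is a hypothetical sketch, not a tool shipped with pdb-mine-builder: it works on a plain dictionary (as produced by any YAML loader), removes obabel whether it sits at the top level or under rdb, and defaults format to mmjson on the assumption that existing data directories still hold mmJSON files.

```python
import copy
from typing import Any, Dict


def migrate_config(old: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new config dict adapted to the pdb-mine-builder layout."""
    new = copy.deepcopy(old)  # never mutate the caller's config

    # obabel is no longer used; drop it wherever it was defined.
    new.pop("obabel", None)
    new.get("rdb", {}).pop("obabel", None)

    pipelines = new.get("pipelines", {})
    if "pdb" in pipelines:
        pipelines["pdbj"] = pipelines.pop("pdb")  # pipeline renamed

    for settings in pipelines.values():
        settings.pop("deffile", None)             # schemas now live in Python code
        settings.setdefault("format", "mmjson")   # assumption: old data is mmJSON
    return new
```

Connection settings and data paths are left untouched on purpose; they still need manual review, for example when switching a pipeline to format: cif with a CIF data directory.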