Skip to main content

Database Overview

pdb-mine-builder creates a PostgreSQL database with multiple schemas, each containing tables derived from PDB (Protein Data Bank) data sources.

Schemas

SchemaPrimary KeyEntriesTablesSizeDescription
pdbjpdbid~250k250183 GBMain structure data from mmCIF/mmJSON
vrptpdbid~250k69152 GBValidation reports
contactspdbid~250k213 GBProtein-protein contact data
cccomp_id~50k12811 MBChemical component dictionary (with RDKit)
ccmodelmodel_id~23k8174 MBChemical component 3D models
prdprd_id~1.2k1750 MBBIRD reference dictionary
prd_familyfamily_prd_id~200102.3 MBBIRD family classifications
chem(source, id)~50k1Unified chemical compounds (cc + prd)
emdbemdb_id79Electron Microscopy Data Bank (experimental)
ihmpdbid114Integrative/Hybrid Methods (experimental)

Total: 369 tables, ~349 GB (excluding emdb/ihm which have no data pipeline yet, as of 2026-03-16)

Schema Search — Search across all schemas, tables, and columns in one place. Table Relations — Explore relationships between tables interactively.

emdb / ihm schemas

The emdb and ihm schemas have table definitions but have not been thoroughly tested with production data. They currently have no data-loading pipeline. If there is community demand, these schemas will receive full pipeline support and validation. Feedback and contributions are welcome via GitHub Issues.

Data Flow

PDBj rsync servers

│ pixi run pmb sync

Local data files (CIF / mmJSON / JSON)

│ pixi run pmb load / update

Pipelines: parse → transform → bulk upsert


PostgreSQL (one schema per data source)

Schema Design

  • Each schema has a brief_summary table with key metadata for quick lookups
  • Primary keys are text-based identifiers (e.g., pdbid, comp_id)
  • No foreign key constraints (for loading performance)
  • All columns except primary keys are nullable
  • Table names match CIF/mmJSON category names for straightforward mapping
  • Columns added by pdb-mine-builder (not in the original mine2 schema) are marked with [pmb] prefix in their descriptions

Common Query Patterns

Find an entry

-- PDB structure
SELECT * FROM pdbj.brief_summary WHERE pdbid = '1abc';

-- Chemical component
SELECT * FROM cc.brief_summary WHERE comp_id = 'ATP';

List tables in a schema

SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'pdbj'
ORDER BY table_name;

Count entries per schema

SELECT 'pdbj' AS schema, COUNT(*) FROM pdbj.brief_summary
UNION ALL
SELECT 'cc', COUNT(*) FROM cc.brief_summary
UNION ALL
SELECT 'vrpt', COUNT(*) FROM vrpt.brief_summary
UNION ALL
SELECT 'chem', COUNT(*) FROM chem.compounds;