Database Overview

pdb-mine-builder creates a PostgreSQL database with multiple schemas, each containing tables derived from PDB (Protein Data Bank) data sources.

Schemas

Schema	Primary Key	Entries	Tables	Size	Description
pdbj	`pdbid`	~250k	250	183 GB	Main structure data from mmCIF/mmJSON
vrpt	`pdbid`	~250k	69	152 GB	Validation reports
contacts	`pdbid`	~250k	2	13 GB	Protein-protein contact data
cc	`comp_id`	~50k	12	811 MB	Chemical component dictionary (with RDKit)
ccmodel	`model_id`	~23k	8	174 MB	Chemical component 3D models
prd	`prd_id`	~1.2k	17	50 MB	BIRD reference dictionary
prd_family	`family_prd_id`	~200	10	2.3 MB	BIRD family classifications
chem	`(source, id)`	~50k	1	—	Unified chemical compounds (cc + prd)
emdb	`emdb_id`	—	79	—	Electron Microscopy Data Bank (experimental)
ihm	`pdbid`	—	114	—	Integrative/Hybrid Methods (experimental)

Total: 369 tables, ~349 GB (excluding emdb/ihm which have no data pipeline yet, as of 2026-03-16)

Schema Search — Search across all schemas, tables, and columns in one place. Table Relations — Explore relationships between tables interactively.

emdb / ihm schemas

The emdb and ihm schemas have table definitions but have not been thoroughly tested with production data. They currently have no data-loading pipeline. If there is community demand, these schemas will receive full pipeline support and validation. Feedback and contributions are welcome via GitHub Issues.

Data Flow

PDBj rsync servers
       │
       │  pixi run pmb sync
       ▼
Local data files (CIF / mmJSON / JSON)
       │
       │  pixi run pmb load / update
       ▼
Pipelines: parse → transform → bulk upsert
       │
       ▼
PostgreSQL (one schema per data source)

Schema Design

Each schema has a brief_summary table with key metadata for quick lookups
Primary keys are text-based identifiers (e.g., pdbid, comp_id)
No foreign key constraints (for loading performance)
All columns except primary keys are nullable
Table names match CIF/mmJSON category names for straightforward mapping
Columns added by pdb-mine-builder (not in the original mine2 schema) are marked with [pmb] prefix in their descriptions

Common Query Patterns

Find an entry

-- PDB structure
SELECT * FROM pdbj.brief_summary WHERE pdbid = '1abc';

-- Chemical component
SELECT * FROM cc.brief_summary WHERE comp_id = 'ATP';

List tables in a schema

SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'pdbj'
ORDER BY table_name;

Count entries per schema

SELECT 'pdbj' AS schema, COUNT(*) FROM pdbj.brief_summary
UNION ALL
SELECT 'cc', COUNT(*) FROM cc.brief_summary
UNION ALL
SELECT 'vrpt', COUNT(*) FROM vrpt.brief_summary
UNION ALL
SELECT 'chem', COUNT(*) FROM chem.compounds;

Schemas​

Data Flow​

Schema Design​

Common Query Patterns​

Find an entry​

List tables in a schema​

Count entries per schema​