Database Overview
pdb-mine-builder creates a PostgreSQL database with multiple schemas, each containing tables derived from PDB (Protein Data Bank) data sources.
Schemas
| Schema | Primary Key | Entries | Tables | Size | Description |
|---|---|---|---|---|---|
| pdbj | pdbid | ~250k | 250 | 183 GB | Main structure data from mmCIF/mmJSON |
| vrpt | pdbid | ~250k | 69 | 152 GB | Validation reports |
| contacts | pdbid | ~250k | 2 | 13 GB | Protein-protein contact data |
| cc | comp_id | ~50k | 12 | 811 MB | Chemical component dictionary (with RDKit) |
| ccmodel | model_id | ~23k | 8 | 174 MB | Chemical component 3D models |
| prd | prd_id | ~1.2k | 17 | 50 MB | BIRD reference dictionary |
| prd_family | family_prd_id | ~200 | 10 | 2.3 MB | BIRD family classifications |
| chem | (source, id) | ~50k | 1 | — | Unified chemical compounds (cc + prd) |
| emdb | emdb_id | — | 79 | — | Electron Microscopy Data Bank (experimental) |
| ihm | pdbid | — | 114 | — | Integrative/Hybrid Methods (experimental) |
Total, as of 2026-03-16: 369 tables, ~349 GB, excluding emdb and ihm, which have no data pipeline yet.
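The figures in the table above can be compared against a live database with a catalog query. The following is a minimal sketch using standard PostgreSQL catalogs and size functions. It assumes all tables are ordinary (non-partitioned) tables, and it measures size including indexes and TOAST data, which may or may not be how the sizes in the table were measured.

```sql
-- Table count and total on-disk size per schema, read from the system catalogs.
-- Assumption: relkind = 'r' (ordinary tables) covers every table of interest.
SELECT n.nspname AS schema_name,
       COUNT(*) AS table_count,
       pg_size_pretty(SUM(pg_total_relation_size(c.oid))) AS total_size
FROM pg_class AS c
JOIN pg_namespace AS n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname IN ('pdbj', 'vrpt', 'contacts', 'cc', 'ccmodel',
                    'prd', 'prd_family', 'chem', 'emdb', 'ihm')
GROUP BY n.nspname
ORDER BY SUM(pg_total_relation_size(c.oid)) DESC;
```

Schemas that have not been created yet simply do not appear in the result.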
- Schema Search: search across all schemas, tables, and columns in one place.
- Table Relations: explore relationships between tables interactively.
emdb / ihm schemas
The emdb and ihm schemas have table definitions but have not been thoroughly tested with production data. They currently have no data-loading pipeline. If there is community demand, these schemas will receive full pipeline support and validation. Feedback and contributions are welcome via GitHub Issues.
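To check whether these schemas hold any rows in a given installation, one option is the statistics view `pg_stat_user_tables`. This is only a sketch: `n_live_tup` is an estimate maintained by PostgreSQL's statistics system, not an exact count, so treat the numbers as approximate.

```sql
-- Approximate row counts for the experimental schemas.
-- n_live_tup is a statistics estimate, not an exact COUNT(*).
SELECT schemaname,
       relname AS table_name,
       n_live_tup AS estimated_rows
FROM pg_stat_user_tables
WHERE schemaname IN ('emdb', 'ihm')
ORDER BY n_live_tup DESC, schemaname, relname
LIMIT 20;
```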
Data Flow
```text
PDBj rsync servers
│
│ pixi run pmb sync
▼
Local data files (CIF / mmJSON / JSON)
│
│ pixi run pmb load / update
▼
Pipelines: parse → transform → bulk upsert
│
▼
PostgreSQL (one schema per data source)
```
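The "bulk upsert" step in the flow above means that re-loading an entry updates its existing row instead of failing or duplicating it. The source does not show how pdb-mine-builder implements this, so the following is only an illustration of the general PostgreSQL technique (`INSERT ... ON CONFLICT`). The table `demo_summary` and its `title` column are hypothetical and are not part of any pdb-mine-builder schema.

```sql
-- Hypothetical scratch table, used only to illustrate upsert semantics.
CREATE TEMP TABLE demo_summary (
    pdbid text PRIMARY KEY,
    title text
);

-- Initial load: two entries.
INSERT INTO demo_summary (pdbid, title)
VALUES ('1abc', 'first load'),
       ('2xyz', 'first load');

-- Second load: '1abc' already exists and is updated, '3def' is inserted.
INSERT INTO demo_summary (pdbid, title)
VALUES ('1abc', 'updated'),
       ('3def', 'new entry')
ON CONFLICT (pdbid) DO UPDATE
SET title = EXCLUDED.title;

-- Three rows remain: 1abc (updated), 2xyz (first load), 3def (new entry).
SELECT pdbid, title FROM demo_summary ORDER BY pdbid;
```

Because the conflict target is the primary key, running the same load twice leaves the table in the same state.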
Schema Design
- Each schema has a `brief_summary` table with key metadata for quick lookups
- Primary keys are text-based identifiers (e.g., `pdbid`, `comp_id`)
- No foreign key constraints (for loading performance)
- All columns except primary keys are nullable
- Table names match CIF/mmJSON category names for straightforward mapping
- Columns added by pdb-mine-builder (not in the original mine2 schema) are marked with a `[pmb]` prefix in their descriptions
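The `[pmb]` marker makes it possible to list the columns that pdb-mine-builder adds. The sketch below assumes that the column descriptions are stored as PostgreSQL column comments (readable through `pg_description`); the source does not confirm this, so if the descriptions live elsewhere the query will return no rows.

```sql
-- Columns in the pdbj schema whose comment starts with '[pmb]'.
-- Assumption: column descriptions are stored as PostgreSQL column comments.
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       a.attname AS column_name,
       d.description
FROM pg_description AS d
JOIN pg_class AS c ON c.oid = d.objoid
JOIN pg_namespace AS n ON n.oid = c.relnamespace
JOIN pg_attribute AS a ON a.attrelid = c.oid AND a.attnum = d.objsubid
WHERE d.classoid = 'pg_class'::regclass
  AND d.objsubid > 0              -- column comments only, not table comments
  AND n.nspname = 'pdbj'
  AND d.description LIKE '[pmb]%' -- brackets are literal characters in LIKE
ORDER BY c.relname, a.attnum;
```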
Common Query Patterns
Find an entry
```sql
-- PDB structure
SELECT * FROM pdbj.brief_summary WHERE pdbid = '1abc';

-- Chemical component
SELECT * FROM cc.brief_summary WHERE comp_id = 'ATP';
```
List tables in a schema
```sql
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'pdbj'
ORDER BY table_name;
```
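Once a table name is known, the same `information_schema` approach can list its columns. This sketch uses `pdbj.brief_summary` as the example; apart from the primary key, the `is_nullable` column should report `YES` throughout, since all non-key columns are nullable.

```sql
-- Columns of one table, in definition order.
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'pdbj'
  AND table_name = 'brief_summary'
ORDER BY ordinal_position;
```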
Count entries per schema
```sql
SELECT 'pdbj' AS schema, COUNT(*) AS entries FROM pdbj.brief_summary
UNION ALL
SELECT 'cc', COUNT(*) FROM cc.brief_summary
UNION ALL
SELECT 'vrpt', COUNT(*) FROM vrpt.brief_summary
UNION ALL
SELECT 'chem', COUNT(*) FROM chem.compounds;
```
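Because the schemas share identifiers but have no foreign key constraints, the database does not guarantee that an entry present in one schema also exists in another. A join on the shared key can check this explicitly. The sketch below assumes that `vrpt.brief_summary` exposes the `pdbid` column listed as that schema's primary key.

```sql
-- Entries in pdbj that have no matching row in vrpt.
-- Nothing enforces this relationship, so it has to be checked by query.
SELECT p.pdbid
FROM pdbj.brief_summary AS p
LEFT JOIN vrpt.brief_summary AS v ON v.pdbid = p.pdbid
WHERE v.pdbid IS NULL
ORDER BY p.pdbid
LIMIT 20;
```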