
Updating the Database

After syncing data files, use the update or load commands to parse them and load records into PostgreSQL.

Pipelines

Each pipeline processes a specific type of PDB data:

| Pipeline | Description | Format | Notes |
| --- | --- | --- | --- |
| pdbj | Main structure data (~248k entries) | CIF / mmJSON | File-per-entry, atom_site skipped |
| cc | Chemical component dictionary (~40k compounds) | CIF / mmJSON | Single file (components.cif.gz) |
| ccmodel | Chemical component models | CIF / mmJSON | Single file (chem_comp_model.cif.gz) |
| prd | BIRD reference dictionary | CIF / mmJSON | Dual file (prd-all.cif.gz + prdcc-all.cif.gz) |
| vrpt | Validation reports | CIF | Nested directory structure |
| contacts | Protein-protein contact data | JSON | Array format |
| emdb | Electron Microscopy Data Bank | -- | Schema only, no pipeline |
| ihm | Integrative/Hybrid Methods | -- | Schema only, no pipeline |

The first four pipelines (pdbj, cc, ccmodel, prd) can read either CIF or mmJSON. Set format: cif or format: mmjson in config.yml to choose; CIF is the default.
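
To check which format a config file currently selects, a small helper can read the first format: line and fall back to cif, the documented default. This is only a sketch: the function name configured_format is hypothetical, and it assumes the key is written on a line of its own as format: <value>. Where the key sits inside config.yml is not specified here, so adjust the pattern to your file.

```shell
# Print the first `format:` value found in a config file (default: config.yml).
# Falls back to "cif", the documented default, when no such line exists.
# Assumption: the key is written on its own line as `format: <value>`.
configured_format() {
  cfg_file="${1:-config.yml}"
  # Skip spaces or quotes after the colon, then capture the alphabetic value.
  cfg_value=$(sed -n 's/^[[:space:]]*format:[^A-Za-z]*\([A-Za-z]*\).*$/\1/p' \
    "$cfg_file" 2>/dev/null | head -n 1) || cfg_value=""
  printf '%s\n' "${cfg_value:-cif}"
}
```

Calling configured_format with no argument reads config.yml in the current directory. If the file contains several format: lines, only the first one is reported.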

Initial Load (Bulk)

For first-time database population, use the load command. It uses PostgreSQL's COPY protocol, which is significantly faster than issuing individual INSERT statements.

Warning: The load command truncates all tables in the target schema before loading. Do not use it on a database with data you want to keep.

# Load a single pipeline
pixi run pmb load cc --force

# Load multiple pipelines
pixi run pmb load cc ccmodel prd --force

# Load with entry limit (useful for testing)
pixi run pmb load pdbj --limit 1000 --force

The --force flag skips the interactive confirmation prompt for the truncate operation.

Bulk Load Mode (Optional)

For very large initial loads (especially pdbj with 248k+ entries), you can tune PostgreSQL for maximum write throughput:

# 1. Enable bulk load mode (disables fsync, autovacuum)
pixi run db-bulkload-mode

# 2. Run data loading
pixi run pmb load pdbj --force
pixi run pmb load cc ccmodel prd vrpt contacts --force

# 3. Restore safe settings
pixi run db-safe-mode

# 4. Run VACUUM ANALYZE to update statistics
psql -d pmb -c "VACUUM ANALYZE;"

Warning: Bulk load mode disables crash safety (fsync=off). If PostgreSQL crashes during a bulk load, the data directory may be corrupted and you will need to reinitialize it:

pixi run db-stop
rm -rf "${PGDATA:?PGDATA is not set}"  # aborts instead of deleting if PGDATA is unset or empty
pixi run db-init
pixi run db-start
# Re-run loading from scratch
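
Because fsync stays off until db-safe-mode runs, it can help to wrap the four bulk-load steps so that safe settings are restored even when the load command fails. The following is a sketch under stated assumptions: the function name bulk_load is hypothetical, and it only uses the commands shown in this section (pixi run db-bulkload-mode, pixi run pmb load, pixi run db-safe-mode, and psql -d pmb).

```shell
# Run a bulk load and restore safe PostgreSQL settings afterwards, even if
# the load itself fails. Arguments are the pipeline names to load.
# Note: load --force truncates the target schemas without prompting.
bulk_load() {
  # 1. Enable bulk load mode; stop here if that fails.
  pixi run db-bulkload-mode || return 1

  # 2. Load data, remembering the exit status instead of aborting.
  load_status=0
  pixi run pmb load "$@" --force || load_status=$?

  # 3. Always restore safe settings before doing anything else.
  pixi run db-safe-mode || return 1

  # 4. Refresh planner statistics only after a successful load.
  if [ "$load_status" -eq 0 ]; then
    psql -d pmb -c "VACUUM ANALYZE;" || return 1
  fi
  return "$load_status"
}
```

For example, bulk_load pdbj runs the pdbj load inside bulk load mode. The wrapper does not protect against a PostgreSQL crash or the script being killed mid-load; in those cases the recovery steps above still apply.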

Incremental Updates

After the initial load, use the update command for ongoing incremental updates:

# Update all pipelines
pixi run pmb update

# Update specific pipelines
pixi run pmb update pdbj cc

# Limit entries processed (useful for testing)
pixi run pmb update pdbj --limit 100

# Force reprocessing (ignore mtime cache)
pixi run pmb update pdbj --force

The update command tracks file modification times (mtime) for file-per-entry pipelines (pdbj, vrpt, contacts). Unchanged entries are automatically skipped, making incremental updates fast.
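
The idea behind mtime-based skipping can be illustrated with a plain file comparison. This is not pmb's implementation: the function name needs_update and the per-entry stamp file are hypothetical stand-ins for whatever cache the tool keeps internally.

```shell
# Succeed (exit 0) when an entry file should be reprocessed: either no stamp
# file exists yet, or the entry was modified after the stamp was written.
# The stamp file is a hypothetical stand-in for pmb's internal mtime cache.
needs_update() {
  entry_file="$1"
  stamp_file="$2"
  [ ! -e "$stamp_file" ] || [ "$entry_file" -nt "$stamp_file" ]
}
```

In these terms, --force corresponds to skipping the check and reprocessing every entry regardless of its modification time.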

CLI Options

| Option | Short | Description |
| --- | --- | --- |
| --limit | -l | Limit number of entries to process |
| --workers | -w | Number of worker processes (overrides nworkers in config) |
| --force | -f | Reprocess all entries, ignoring mtime cache (pdbj, vrpt, contacts) |
| --log | | Custom log file path (default: logs/<pipeline>_YYYYMMDD_HHMMSS.log) |
| --verbose | -v | Enable DEBUG-level logging |
| --config | -c | Path to config file (default: config.yml) |
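
Since the default log names embed a YYYYMMDD_HHMMSS timestamp, they sort chronologically by name, which makes the most recent log for a pipeline easy to find. The helper below is a sketch: latest_log is a hypothetical name, and it assumes logs were written to the default location rather than redirected with --log.

```shell
# Print the newest default-named log for a pipeline, e.g. logs/pdbj_*.log.
# Relies on the YYYYMMDD_HHMMSS timestamp sorting chronologically by name.
# Returns non-zero when no matching log exists.
latest_log() {
  log_pipeline="$1"
  log_dir="${2:-logs}"
  log_newest=""
  # Glob results are sorted by name, so the last match is the newest.
  for log_file in "$log_dir/${log_pipeline}"_[0-9]*.log; do
    [ -e "$log_file" ] || continue
    log_newest="$log_file"
  done
  [ -n "$log_newest" ] && printf '%s\n' "$log_newest"
}
```

For example, tail -n 50 "$(latest_log pdbj)" shows the end of the latest pdbj log. The underscore in the pattern keeps cc from matching ccmodel logs.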

Full Cycle: sync + update

The all command runs sync followed by update in a single step:

pixi run pmb all

This is equivalent to:

pixi run pmb sync
pixi run pmb update

Reset Schemas

To drop all tables in a schema and start over:

# Reset a single schema
pixi run pmb reset cc

# Reset multiple schemas
pixi run pmb reset cc pdbj

# Reset all schemas
pixi run pmb reset all

# Skip confirmation prompt
pixi run pmb reset all --force

Warning: reset drops all tables and data in the specified schema(s). This cannot be undone.

Database Statistics

View current table counts, row counts, and last update timestamps:

pixi run pmb stats

Backward Compatibility

Legacy pipeline names with format suffixes (pdbj-cif, cc-json, etc.) are still accepted but deprecated. They emit a warning and resolve to the base pipeline name. Format selection is now controlled by the format field in config.yml.
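
When migrating older scripts, the mapping from legacy names to base names can be mimicked with a small helper. This is an illustration, not pmb's own resolution code: base_pipeline_name is a hypothetical name, and only the -cif and -json suffixes named above are handled.

```shell
# Map a legacy pipeline name (e.g. pdbj-cif, cc-json) to its base name and
# print a deprecation warning on stderr. Other names pass through unchanged.
# Only the -cif and -json suffixes mentioned in the text are recognized.
base_pipeline_name() {
  case "$1" in
    *-cif|*-json)
      echo "warning: '$1' is deprecated; use '${1%-*}' and set format in config.yml" >&2
      printf '%s\n' "${1%-*}"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}
```

A script could then call pixi run pmb update "$(base_pipeline_name pdbj-cif)" while its callers are being updated to the base names.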

A typical first-time setup looks like this:

# 1. Sync data from PDBj
pixi run pmb sync pdbj cc prd

# 2. (Optional) Enable bulk load mode
pixi run db-bulkload-mode

# 3. Load data into PostgreSQL
pixi run pmb load cc --force
pixi run pmb load prd --force
pixi run pmb load pdbj --limit 1000 --force # Test with subset first
pixi run pmb load pdbj --force # Then load everything

# 4. Restore safe mode and vacuum
pixi run db-safe-mode
psql -d pmb -c "VACUUM ANALYZE;"

# 5. Check stats
pixi run pmb stats

For ongoing updates after the initial load:

pixi run pmb sync
pixi run pmb update