
Syncing Data

The sync command downloads data using rsync. All sync targets (source URLs, destinations, and rsync options) are defined in config.yml under the sync section. No targets are hardcoded, so you have full control over what is synced and from where.

How It Works

pdb-mine-builder runs rsync for each configured sync target. Only changed files are transferred on subsequent runs, making incremental syncs fast.

Usage

# Sync all configured targets
pixi run pmb sync

# Sync specific targets
pixi run pmb sync pdbj cc prd

# Preview what would be synced (no actual download)
pixi run pmb sync --dry-run
pixi run pmb sync pdbj --dry-run

Configuring Sync Targets

All sync targets are defined in the sync section of config.yml. Each target specifies:

  • source (or sources): rsync source URL(s)
  • dest: local destination directory
  • options: rsync options (default: ["-av", "--size-only"])

sync:
  pdbj:
    source: "data.pdbj.org::ftp_data/structures/divided/mmCIF/"
    dest: ${DATA_DIR}/data/structures/divided/mmCIF/
    options: ["-av", "--delete", "--size-only"]
  cc:
    source: "data.pdbj.org::ftp_data/monomers/components.cif.gz"
    dest: ${DATA_DIR}/data/monomers/
  prd:
    sources:
      - "data.pdbj.org::ftp_data/bird/prd/prd-all.cif.gz"
      - "data.pdbj.org::ftp_data/bird/prd/prdcc-all.cif.gz"
    dest: ${DATA_DIR}/data/bird/prd/

See Configuration - Sync Targets and config.example.yml for the full list of available targets.
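
To make the mapping from a target entry to an rsync invocation concrete, here is a minimal sketch. It is not pdb-mine-builder's actual code: the function name `build_rsync_command`, the dict layout, and the choice to pass all `sources` to a single rsync call are assumptions made for illustration. The field names and the default options are the ones documented above.

```python
from typing import Any

# Default options documented on this page.
DEFAULT_OPTIONS = ["-av", "--size-only"]


def build_rsync_command(target: dict[str, Any], dry_run: bool = False) -> list[str]:
    """Hypothetical helper: turn one sync target into an rsync argv list."""
    # A target has either a single `source` or a list of `sources`.
    if "sources" in target:
        sources = list(target["sources"])
    else:
        sources = [target["source"]]
    # Fall back to the documented default when no options are given.
    options = list(target.get("options", DEFAULT_OPTIONS))
    if dry_run:
        # rsync's own --dry-run flag reports changes without transferring.
        options.append("--dry-run")
    return ["rsync", *options, *sources, target["dest"]]


# Example: the prd target, with a literal destination instead of ${DATA_DIR}.
prd = {
    "sources": [
        "data.pdbj.org::ftp_data/bird/prd/prd-all.cif.gz",
        "data.pdbj.org::ftp_data/bird/prd/prdcc-all.cif.gz",
    ],
    "dest": "/data/bird/prd/",
}
print(" ".join(build_rsync_command(prd, dry_run=True)))
```

The resulting list could be handed to `subprocess.run`. Keeping the command as a list rather than a single string avoids shell quoting problems with patterns such as `--include=*/`.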

Tip: You only need to configure the targets that match your chosen format. If you use CIF (the default), you do not need the mmJSON targets.

Regional Mirrors

The default config.example.yml uses PDBj (Japan) servers. For CIF targets, you can use a regional wwPDB mirror by changing the source URLs:

Region          Server
Japan (PDBj)    data.pdbj.org::ftp_data/...
US (RCSB)       rsync.rcsb.org::ftp_data/...
Europe (PDBe)   rsync.ebi.ac.uk::pub/databases/pdb/data/...

sync:
  pdbj:
    # Change source to RCSB (US)
    source: "rsync.rcsb.org::ftp_data/structures/divided/mmCIF/"
    dest: ${DATA_DIR}/data/structures/divided/mmCIF/
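
If you maintain several CIF targets, rewriting each source by hand is error-prone. The sketch below swaps the server prefix programmatically, using the prefixes from the Region/Server listing above. The helper name `switch_mirror` and the region keys are hypothetical, and the sketch assumes the path after the prefix is laid out identically on every mirror, which you should verify for your targets.

```python
# Server prefixes copied from the Region/Server listing above.
MIRROR_PREFIXES = {
    "pdbj": "data.pdbj.org::ftp_data/",
    "rcsb": "rsync.rcsb.org::ftp_data/",
    "pdbe": "rsync.ebi.ac.uk::pub/databases/pdb/data/",
}


def switch_mirror(source: str, region: str) -> str:
    """Hypothetical helper: rewrite a source URL to another mirror's prefix."""
    new_prefix = MIRROR_PREFIXES[region]
    for prefix in MIRROR_PREFIXES.values():
        if source.startswith(prefix):
            # Keep the path after the prefix unchanged (assumed identical layout).
            return new_prefix + source[len(prefix):]
    raise ValueError(f"unrecognized mirror prefix: {source}")


print(switch_mirror("data.pdbj.org::ftp_data/structures/divided/mmCIF/", "rcsb"))
# prints: rsync.rcsb.org::ftp_data/structures/divided/mmCIF/
```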

Multiple Sources

Some targets (e.g., prd) need multiple files. Use the sources field (list) instead of source:

sync:
  prd:
    sources:
      - "data.pdbj.org::ftp_data/bird/prd/prd-all.cif.gz"
      - "data.pdbj.org::ftp_data/bird/prd/prdcc-all.cif.gz"
    dest: ${DATA_DIR}/data/bird/prd/

Custom rsync Options

Each target can have custom rsync options. The default is ["-av", "--size-only"]:

sync:
  pdbj:
    source: "data.pdbj.org::ftp_data/structures/divided/mmCIF/"
    dest: ${DATA_DIR}/data/structures/divided/mmCIF/
    options: ["-av", "--delete", "--size-only"] # add --delete to remove stale files
  vrpt:
    source: "data.pdbj.org::ftp/validation_reports/"
    dest: ${DATA_DIR}/validation_reports/
    options: ["-av", "--size-only", "--include=*/", "--include=*_validation.cif.gz", "--exclude=*"]
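
The include/exclude options in the vrpt target depend on rsync's rule ordering: rules are checked in sequence and the first match decides. `--include=*/` keeps every directory so rsync can descend into it, `--include=*_validation.cif.gz` keeps the wanted files, and `--exclude=*` drops everything else. The sketch below is a simplified model of that behaviour for these three rules only, not a reimplementation of rsync's filter engine. The function names and the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Filter options from the vrpt target above, in the order they are given.
RULES = [("include", "*/"), ("include", "*_validation.cif.gz"), ("exclude", "*")]


def first_match(name: str, is_dir: bool) -> bool:
    """Return True if the first matching rule includes this entry."""
    for action, pattern in RULES:
        if pattern.endswith("/"):
            # A trailing slash restricts the pattern to directories.
            if not is_dir:
                continue
            pattern = pattern[:-1]
        if fnmatchcase(name, pattern):
            return action == "include"
    return True  # entries that match no rule are included


def is_transferred(path: str) -> bool:
    """Simplified model: each parent directory and the file must be included."""
    *dirs, filename = path.split("/")
    return all(first_match(d, True) for d in dirs) and first_match(filename, False)


# Hypothetical paths, for illustration only.
for path in ("ab/1abc/1abc_validation.cif.gz", "ab/1abc/1abc_validation.pdf.gz"):
    print(path, "->", "transfer" if is_transferred(path) else "skip")
```

Reordering the options changes the result: with `--exclude=*` first, every entry would match it and nothing would be transferred.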

CLI Options

Option      Short   Description
--config    -c      Path to config file (default: config.yml)
--dry-run   -n      Show what would be synced without downloading

Disk Space Considerations

The full PDBj dataset is large. Approximate sizes:

  • pdbj (mmCIF): ~100 GB (248k+ compressed files)
  • vrpt: ~50 GB
  • cc: ~1 GB (single compressed file)
  • contacts: ~10 GB

Note: Start with a small subset, using the --limit flag of the update command, before syncing the full dataset. You can sync specific targets incrementally as needed.
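
Before a large sync, it can help to compare the approximate sizes listed above with the free space on the destination filesystem. The following is a minimal sketch. The helper names are hypothetical, the sizes are the rough figures from this page rather than measured values, and the path passed to `shutil.disk_usage` should be replaced with your actual data directory.

```python
import shutil

# Approximate sizes in GB, taken from the list above.
APPROX_SIZE_GB = {"pdbj": 100, "vrpt": 50, "cc": 1, "contacts": 10}


def required_gb(targets: list[str]) -> int:
    """Sum the approximate on-disk size of the chosen targets."""
    return sum(APPROX_SIZE_GB[t] for t in targets)


def has_room(targets: list[str], free_bytes: int) -> bool:
    """True if the free space covers the estimated size of the targets."""
    return free_bytes >= required_gb(targets) * 10**9


# Free bytes on the filesystem holding the current directory.
free = shutil.disk_usage(".").free
for selection in (["cc"], ["pdbj", "vrpt", "cc", "contacts"]):
    print(selection, required_gb(selection), "GB needed, fits:", has_room(selection, free))
```

All four listed targets together come to roughly 161 GB, so leaving extra headroom is sensible given that the figures are estimates.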