title: CPBind Extraction Plan
tags: [plan, data-scraping, cpbind, dataset]
created: 2026-05-14
updated: 2026-05-27
status: active
related:


03_cpbind — Implementation Plan

For Hermes: Use the subagent-driven-development skill to implement this plan task by task.

Goal: Extract cyclic peptide-protein binding affinity data from CPBind (CPSea’s high-affinity subset with computed Rosetta ddG + Vina scores).

Architecture: Download CPBind_unique.zip (3.9 GB) from Zenodo — the deduplicated 101K-entry subset. Extract only the property CSVs (CPBind_unique_affinity.csv, CPBind_unique_basic.tsv). Parse, normalize, and output standardized CSV. Mark all entries as affinity_is_computed=True since affinities are from Rosetta ddG + Vina, not experimental.

Tech Stack: Python, pandas

Key considerations:

  • CPBind has 476K entries; CPBind_unique has 101K (deduplicated by Foldseek clustering). Use CPBind_unique for the first pass.
  • CPBind.zip is 19.3 GB; CPBind_unique.zip is 3.9 GB. Both contain PDB structures we don’t need.
  • We ONLY need the property CSVs from inside the zip (affinity + basic files).
  • All entries are cyclic peptides by definition (the dataset is filtered for cyclic).
  • Affinities are computed (Rosetta ddG + Vina), not experimental. Must set affinity_is_computed=True.
  • The property files contain no sequences. Sequences live in the PDB structure files, which we skip; the structure names encode only the AFDB ID and residue range. The index file maps names to entries. We’ll extract peptide length and receptor info from the basic file.
  • The basic file has filter metrics. The affinity file has ddG and Vina scores.
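
Since the structure names in the index file are the only link back to the source protein, a small parser is useful early on. The sketch below assumes the name format given under Task 3, {AFDB_id}_{first_residue}_{last_residue}_relaxed, with no file extension; the example name and the StructureName type are hypothetical and should be checked against real index entries.

```python
from typing import NamedTuple


class StructureName(NamedTuple):
    afdb_id: str
    first_residue: int
    last_residue: int


def parse_structure_name(name: str) -> StructureName:
    """Split '{AFDB_id}_{first_residue}_{last_residue}_relaxed' into its parts."""
    # Split from the right so any underscores inside the AFDB ID are preserved.
    afdb_id, first, last, suffix = name.rsplit("_", 3)
    if suffix != "relaxed":
        raise ValueError(f"unexpected suffix in {name!r}")
    return StructureName(afdb_id, int(first), int(last))


# Hypothetical entry, for illustration only.
parsed = parse_structure_name("AF-P12345-F1_10_24_relaxed")
print(parsed.afdb_id, parsed.first_residue, parsed.last_residue)
```

Names that do not match the assumed format raise ValueError, which makes format drift visible instead of silently producing wrong IDs.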

Task 1: Create source directory structure

Files:

  • Create: data-scraping/sources/03_cpbind/
  • Create: data-scraping/sources/03_cpbind/raw/
  • Create: data-scraping/sources/03_cpbind/src/
  • Create: data-scraping/sources/03_cpbind/output/
mkdir -p data-scraping/sources/03_cpbind/{raw,src,output}

Verify: ls data-scraping/sources/03_cpbind/ shows raw, src, output dirs.


Task 2: Download CPBind_unique.zip and extract property CSVs

Objective: Download the zip, extract ONLY the property CSV/TSV files, then delete the zip.

Files:

  • Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_affinity.csv
  • Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_basic.tsv
  • Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_index.txt

Step 1: Download

cd data-scraping/sources/03_cpbind/raw
curl -L -C - -o CPBind_unique.zip "https://zenodo.org/records/16794716/files/CPBind_unique.zip?download=1"

Expected: ~3.9 GB file downloaded.

Step 2: Extract only property files + index

unzip -j CPBind_unique.zip "CPBind_unique/CPBind_unique_Properties/*" "CPBind_unique/CPBind_unique_index.txt" -d .
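
If unzip is not available, the same selective extraction can be sketched with Python's standard zipfile module. The helper name extract_members and the substring filters are assumptions based on the archive paths used in the unzip command above; the actual layout inside the zip should be confirmed first.

```python
import shutil
import zipfile
from pathlib import Path


def extract_members(zip_path, dest_dir, wanted_parts=("_Properties/", "_index.txt")):
    """Extract only members whose archive path contains one of wanted_parts.

    Files are written flat into dest_dir, like `unzip -j`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if not any(part in info.filename for part in wanted_parts):
                continue  # skips the PDB structures
            target = dest / Path(info.filename).name
            # Stream the member to disk instead of loading it into memory.
            with zf.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            extracted.append(target.name)
    return sorted(extracted)
```

Calling extract_members("CPBind_unique.zip", ".") from the raw/ directory would mirror the unzip step; only the matching members are decompressed.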

Step 3: Confirm which files were extracted, including the basic TSV

# Check what files were extracted
ls -lh *.csv *.tsv *.txt 2>/dev/null

Expected: CPBind_unique_affinity.csv, CPBind_unique_basic.tsv, CPBind_unique_hydrophobic.csv, CPBind_unique_validity.csv, CPBind_unique_index.txt

Step 4: Delete the zip

rm CPBind_unique.zip

Step 5: Inspect the extracted files

head -3 CPBind_unique_affinity.csv
head -3 CPBind_unique_basic.tsv
wc -l CPBind_unique_affinity.csv CPBind_unique_basic.tsv

Verify: Property CSVs are real data (not HTML 404), column headers visible.
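
The "not HTML 404" check can also be automated. The following is a minimal sketch, assuming that an error page would start with an HTML doctype or html tag; the function names are hypothetical and the check is a heuristic, not a full validation of the CSV contents.

```python
from pathlib import Path


def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML page rather than tabular data."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))


def check_raw_files(raw_dir, names):
    """Return a dict of problems keyed by file name; an empty dict means all passed."""
    problems = {}
    for name in names:
        path = Path(raw_dir) / name
        if not path.is_file():
            problems[name] = "missing"
        elif path.stat().st_size == 0:
            problems[name] = "empty"
        elif looks_like_html(path):
            problems[name] = "looks like an HTML error page"
    return problems
```

Running check_raw_files on raw/ with the expected file names before deleting the zip would catch a bad download while a retry is still cheap.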


Task 3: Write 03_cpbind.py extraction script

Objective: Parse CPBind property files and output standardized CSV.

Files:

  • Create: data-scraping/sources/03_cpbind/src/03_cpbind.py

Key implementation notes:

  1. No download function — raw files already downloaded manually (too large for urllib).

  2. Schema mapping from CPBind columns:

    • CPBind_unique_basic.tsv: has structure IDs, peptide length, receptor info, filter metrics
    • CPBind_unique_affinity.csv: has Rosetta ddG and Vina scores per structure
  3. Affinity handling:

    • Rosetta ddG: units are kcal/mol. Lower (more negative) = stronger binding.
    • Vina score: units are kcal/mol. Lower (more negative) = stronger binding.
    • These are NOT directly comparable to experimental Kd/IC50. We’ll store them as-is in affinity_value with unit “kcal/mol” and type “ddG” (Rosetta) or “Vina_score”.
    • Convert ddG to an approximate Kd using Kd ≈ exp(ddG / RT), where R = 0.001987 kcal/(mol·K) and T = 298 K. This gives Kd in mol/L, so multiply by 1e9 before storing it in affinity_value_nM for cross-source comparison, and flag the value as a computed approximation.
    • affinity_is_computed = True for ALL entries.
    • is_binder based on the dataset’s own filter: ddG < -25 and Vina < -6 (original CPBind definition).
  4. Peptide sequence: Not directly available in the property files. The structure names in the index file follow the format {AFDB_id}_{first_residue}_{last_residue}_relaxed, so they yield the AFDB ID and residue range but not the sequence itself. Store peptide length from the basic file.

  5. Cyclization types: From basic file or inferred from structure naming (CysCys = disulfide, HeadTail = head_to_tail, IsoPep = isopeptide/thioether).

  6. Target protein: AFDB ID from structure name (before the residue range). Not human-readable protein names.

  7. Follow the pattern from 01_ppikb.py and 02_cyclicpepedia.py for directory structure:

    SOURCE_DIR = Path(__file__).parent.parent
    RAW_DIR = SOURCE_DIR / "raw"
    OUTPUT_DIR = SOURCE_DIR / "output"
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
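
The affinity handling in note 3 can be sketched as two small helpers. This assumes the ddG values can be treated as kcal/mol and uses the constants and thresholds stated above; the function names are hypothetical. Note that exp(ddG / RT) yields Kd in mol/L, so the result is scaled by 1e9 to get nM.

```python
import math

R_KCAL = 0.001987  # gas constant in kcal/(mol*K)
T_KELVIN = 298.0


def ddg_to_kd_nm(ddg_kcal_per_mol: float) -> float:
    """Approximate Kd in nM from a binding free energy in kcal/mol."""
    kd_molar = math.exp(ddg_kcal_per_mol / (R_KCAL * T_KELVIN))
    return kd_molar * 1e9  # mol/L -> nM


def is_binder(ddg: float, vina: float) -> bool:
    """Apply the plan's filter: ddG < -25 and Vina < -6."""
    return ddg < -25 and vina < -6


print(ddg_to_kd_nm(-10.0))  # tens of nM
print(ddg_to_kd_nm(-25.0))  # far below the picomolar range
```

At the ddG < -25 threshold the converted Kd falls many orders of magnitude below anything measured experimentally, which is a concrete reason to keep the converted values flagged as computed approximations.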

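Sketch (column names are placeholders until the Task 2 inspection):

import math
from pathlib import Path

import pandas as pd

SOURCE_DIR = Path(__file__).parent.parent
RAW_DIR = SOURCE_DIR / "raw"
RT = 0.001987 * 298  # kcal/mol at T = 298 K

# PLACEHOLDER column names: the real headers are unknown until Task 2 inspection.
KEY_COL, DDG_COL, VINA_COL = "name", "ddG", "vina"

def extract():
    # Load basic.tsv (TSV)
    df_basic = pd.read_csv(RAW_DIR / "CPBind_unique_basic.tsv", sep="\t")
    # Load affinity.csv
    df_aff = pd.read_csv(RAW_DIR / "CPBind_unique_affinity.csv")
    # Load index.txt (structure names; needed later for AFDB ID / cyclization parsing)
    with open(RAW_DIR / "CPBind_unique_index.txt") as f:
        index_entries = [line.strip() for line in f if line.strip()]

    # Merge basic + affinity on structure ID
    df = df_basic.merge(df_aff, on=KEY_COL, how="left")

    rows = []
    for _, row in df.iterrows():
        ddg, vina = row[DDG_COL], row[VINA_COL]
        rows.append({
            "entry_id": row[KEY_COL],
            "affinity_value": ddg,
            "affinity_unit": "kcal/mol",
            "affinity_type": "ddG",
            # Kd = exp(ddG / RT) is in mol/L; 1e9 converts to nM
            "affinity_value_nM": math.exp(ddg / RT) * 1e9,
            "affinity_is_computed": True,
            "is_binder": bool(ddg < -25 and vina < -6),
        })
        # TODO: cyclization type and AFDB ID from the naming convention
    return pd.DataFrame(rows)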

Verify: Script runs without error, outputs 03_cpbind.csv to output/.
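
Because the join key between the basic and affinity files is still unknown, it may help to make the merge fail loudly on bad keys. The sketch below uses pandas' validate and indicator options on toy frames; the column names "name", "length", and "ddG" and the helper name are hypothetical stand-ins for the real headers.

```python
import pandas as pd


def merge_basic_affinity(df_basic, df_aff, key):
    """Left-merge affinity onto basic and count rows without an affinity match."""
    merged = df_basic.merge(
        df_aff,
        on=key,
        how="left",
        validate="one_to_one",  # raises pandas.errors.MergeError on duplicate keys
        indicator=True,         # adds a "_merge" column describing each row's origin
    )
    n_unmatched = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), n_unmatched


# Toy frames with hypothetical column names.
basic = pd.DataFrame({"name": ["a", "b", "c"], "length": [8, 10, 12]})
aff = pd.DataFrame({"name": ["a", "b"], "ddG": [-30.0, -27.5]})
merged, n_unmatched = merge_basic_affinity(basic, aff, key="name")
print(len(merged), n_unmatched)
```

A non-zero unmatched count would mean some structures have filter metrics but no affinity scores, which is worth investigating before the output is written.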


Task 4: Run extraction and verify output

Step 1: Run the script

cd data-scraping/sources/03_cpbind/src
python3 03_cpbind.py

Step 2: Verify output

# Paths are relative to src/, where Step 1 left the shell
wc -l ../output/03_cpbind.csv
head -2 ../output/03_cpbind.csv

Expected: ~101K rows, all marked affinity_is_computed=True.

Step 3: Sanity checks

  • Every entry has affinity_is_computed=True
  • ddG and Vina values are negative (binding)
  • Peptide lengths are reasonable (5-50 residues)
  • No duplicate entry_ids
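
The checks above can be expressed as one function that returns a list of failures. This is a sketch only: affinity_is_computed and entry_id come from the plan, while the column names "ddG", "vina_score", and "peptide_length" are assumptions about the final output schema.

```python
import pandas as pd


def sanity_check(df, min_len=5, max_len=50):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    failures = []
    if not df["affinity_is_computed"].all():
        failures.append("affinity_is_computed is not True everywhere")
    for col in ("ddG", "vina_score"):
        if (df[col] >= 0).any():
            failures.append(f"{col} has non-negative values")
    if not df["peptide_length"].between(min_len, max_len).all():
        failures.append("peptide_length outside expected range")
    if df["entry_id"].duplicated().any():
        failures.append("duplicate entry_id values")
    return failures


# Toy data with one deliberate duplicate ID and one positive Vina score.
toy = pd.DataFrame({
    "entry_id": ["x1", "x2", "x2"],
    "affinity_is_computed": [True, True, True],
    "ddG": [-30.0, -28.0, -26.0],
    "vina_score": [-7.0, 0.5, -6.5],
    "peptide_length": [8, 12, 20],
})
print(sanity_check(toy))
```

In practice the function would be run on pd.read_csv("../output/03_cpbind.csv") from src/ after the extraction finishes.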

Open Questions / Risks

  1. Large download: 3.9 GB for CPBind_unique. May need to fall back to Kaggle API if Zenodo is slow.
  2. Column names unknown: Need to inspect actual CSV headers after extraction to confirm mapping. Plan will be updated after Task 2 inspection.
  3. No peptide sequences: CPBind property files likely don’t have sequences — only structure IDs. Sequences are encoded in PDB files (which we skip). We’ll store what we can from the basic file and index.
  4. Computed vs experimental: CPBind affinities are all computed. They’re useful for training but not for experimental benchmarks. The final collation step should keep these clearly separated.