title: CPBind Extraction Plan
tags: [plan, data-scraping, cpbind, dataset]
created: 2026-05-14
updated: 2026-05-27
status: active
related:
03_cpbind — Implementation Plan
For Hermes: Use subagent-driven-development skill to implement this plan task-by-task.
Goal: Extract cyclic peptide-protein binding affinity data from CPBind (CPSea’s high-affinity subset with computed Rosetta ddG + Vina scores).
Architecture: Download CPBind_unique.zip (3.9 GB) from Zenodo — the deduplicated 101K-entry subset. Extract only the property CSVs (CPBind_unique_affinity.csv, CPBind_unique_basic.tsv). Parse, normalize, and output standardized CSV. Mark all entries as affinity_is_computed=True since affinities are from Rosetta ddG + Vina, not experimental.
Tech Stack: Python, pandas
Key considerations:
- CPBind has 476K entries, CPBind_unique has 101K (deduplicated by Foldseek clustering). Use CPBind_unique for first pass.
- CPBind.zip is 19.3 GB, CPBind_unique.zip is 3.9 GB. Both contain PDB structures we don’t need.
- We ONLY need the property CSVs from inside the zip (affinity + basic files).
- All entries are cyclic peptides by definition (the dataset is filtered for cyclic).
- Affinities are computed (Rosetta ddG + Vina), not experimental. Must set affinity_is_computed=True.
- No sequences are in the property files. The structure filenames listed in the index file identify each entry (AFDB ID plus residue range); the sequences themselves are in the PDB structures. We’ll extract peptide length and receptor info from the basic file.
- The basic file has filter metrics. The affinity file has ddG and Vina scores.
Task 1: Create source directory structure
Files:
- Create: data-scraping/sources/03_cpbind/
- Create: data-scraping/sources/03_cpbind/raw/
- Create: data-scraping/sources/03_cpbind/src/
- Create: data-scraping/sources/03_cpbind/output/
mkdir -p data-scraping/sources/03_cpbind/{raw,src,output}
Verify: ls data-scraping/sources/03_cpbind/ shows raw, src, output dirs.
Task 2: Download CPBind_unique.zip and extract property CSVs
Objective: Download the zip, extract ONLY the property CSV/TSV files, then delete the zip.
Files:
- Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_affinity.csv
- Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_basic.tsv
- Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_index.txt
Step 1: Download
cd data-scraping/sources/03_cpbind/raw
curl -C - -o CPBind_unique.zip "https://zenodo.org/records/16794716/files/CPBind_unique.zip?download=1"
Expected: ~3.9 GB file downloaded.
Step 2: Extract only property files + index
unzip -j CPBind_unique.zip "CPBind_unique/CPBind_unique_Properties/*" "CPBind_unique/CPBind_unique_index.txt" -d .
Step 3: Extract the basic file if it’s TSV inside Properties
# Check what files were extracted
ls -lh *.csv *.tsv *.txt 2>/dev/null
Expected: CPBind_unique_affinity.csv, CPBind_unique_basic.tsv, CPBind_unique_hydrophobic.csv, CPBind_unique_validity.csv, CPBind_unique_index.txt
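The selective extraction in Step 2 can also be done from Python, which avoids depending on the local unzip build. The following is a minimal sketch, assuming the archive layout implied by the unzip patterns above (a CPBind_unique_Properties directory plus CPBind_unique_index.txt); the function name `extract_property_files` is hypothetical.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath

# Assumed archive layout, taken from the unzip patterns in Step 2.
PROPERTIES_DIR = "CPBind_unique_Properties"
INDEX_NAME = "CPBind_unique_index.txt"


def extract_property_files(zip_path, dest_dir):
    """Copy only the property files and the index out of the archive.

    Paths are flattened (like `unzip -j`), so every file lands directly
    in dest_dir. Returns the list of written paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            member = PurePosixPath(info.filename)
            in_properties = PROPERTIES_DIR in member.parts[:-1]
            if not (in_properties or member.name == INDEX_NAME):
                continue  # skip PDB structures and anything else
            target = dest / member.name
            # Stream each member to disk so it never sits fully in memory.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            written.append(target)
    return written
```

Calling `extract_property_files("CPBind_unique.zip", ".")` from the raw/ directory should leave the same files as the unzip command, provided the layout assumption holds.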
Step 4: Delete the zip
rm CPBind_unique.zip
Step 5: Inspect the extracted files
head -3 CPBind_unique_affinity.csv
head -3 CPBind_unique_basic.tsv
wc -l CPBind_unique_affinity.csv CPBind_unique_basic.tsv
Verify: Property CSVs are real data (not an HTML 404 page), column headers visible.
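The "real data, not HTML" check can be automated so that a bad download fails fast. This is a rough sketch: the helper names are hypothetical, and the header split is naive (it assumes the affinity file is comma-separated and that headers contain no quoted delimiters).

```python
from pathlib import Path


def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML page (e.g. an error page)."""
    with open(path, "rb") as f:
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith((b"<!doctype", b"<html"))


def read_header(path, sep):
    """Return the column names from the first line of a delimited text file."""
    with open(path, encoding="utf-8") as f:
        return f.readline().rstrip("\r\n").split(sep)


def inspect_raw_files(raw_dir):
    """Print the header of each property file, raising on HTML content."""
    raw = Path(raw_dir)
    for name, sep in (("CPBind_unique_affinity.csv", ","),
                      ("CPBind_unique_basic.tsv", "\t")):
        path = raw / name
        if looks_like_html(path):
            raise ValueError(f"{name} looks like an HTML page, not data")
        print(name, read_header(path, sep))
```

The printed headers are what the schema mapping in Task 3 needs, since the actual column names are not yet known.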
Task 3: Write 03_cpbind.py extraction script
Objective: Parse CPBind property files and output standardized CSV.
Files:
- Create: data-scraping/sources/03_cpbind/src/03_cpbind.py
Key implementation notes:
- No download function: raw files are already downloaded manually (too large for urllib).
- Schema mapping from CPBind columns:
  - CPBind_unique_basic.tsv: has structure IDs, peptide length, receptor info, filter metrics
  - CPBind_unique_affinity.csv: has Rosetta ddG and Vina scores per structure
- Affinity handling:
  - Rosetta ddG: units are kcal/mol. Lower (more negative) = stronger binding.
  - Vina score: units are kcal/mol. Lower (more negative) = stronger binding.
  - These are NOT directly comparable to experimental Kd/IC50. We’ll store them as-is in affinity_value with unit “kcal/mol” and type “ddG” (Rosetta) or “Vina_score”.
  - Convert ddG to approximate Kd using: Kd ≈ exp(ddG / RT), where R = 0.001987 kcal/(mol·K) and T = 298 K. This gives Kd in mol/L, so multiply by 1e9 to get nM. Store in affinity_value_nM for cross-source comparison, but flag as computed approximation.
  - affinity_is_computed = True for ALL entries.
  - is_binder based on the dataset’s own filter: ddG < -25 and Vina < -6 (original CPBind definition).
- Peptide sequence: Not directly available in property files. Parse the structure filenames in the index file instead; the format is {AFDB_id}_{first_residue}_{last_residue}_relaxed. Store peptide length from the basic file.
- Cyclization types: From basic file or inferred from structure naming (CysCys = disulfide, HeadTail = head_to_tail, IsoPep = isopeptide/thioether).
- Target protein: AFDB ID from structure name (before the residue range). Not human-readable protein names.
- Follow the pattern from 01_ppikb.py and 02_cyclicpepedia.py for directory structure:
  SOURCE_DIR = Path(__file__).parent.parent
  RAW_DIR = SOURCE_DIR / "raw"
  OUTPUT_DIR = SOURCE_DIR / "output"
  OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
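The ddG-to-Kd conversion and the binder flag from the affinity notes above could be sketched as follows. The function names are hypothetical. The conversion treats the Rosetta score as a free energy in kcal/mol, which is this plan's assumption (Rosetta scores are commonly reported in Rosetta Energy Units), so the result is a ranking aid rather than a physical dissociation constant.

```python
import math

R_KCAL = 0.001987   # gas constant in kcal/(mol*K), as given in the plan
T_KELVIN = 298.0    # temperature used in the plan


def ddg_to_kd_nm(ddg, temperature=T_KELVIN):
    """Approximate Kd in nM from a binding energy in kcal/mol.

    Kd = exp(ddG / RT) is in mol/L; the factor 1e9 converts to nM.
    """
    kd_molar = math.exp(ddg / (R_KCAL * temperature))
    return kd_molar * 1e9


def is_binder(ddg, vina, ddg_cutoff=-25.0, vina_cutoff=-6.0):
    """Apply the filter quoted in the plan: ddG < -25 and Vina < -6."""
    return ddg < ddg_cutoff and vina < vina_cutoff
```

Note that scores near the -25 cutoff map to Kd values many orders of magnitude below 1 nM, which is one more reason to keep affinity_value_nM flagged as a computed approximation.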
Pseudocode:
import pandas as pd

# RAW_DIR is defined as in the directory-structure pattern above.
def extract():
    # Load basic.tsv (tab-separated)
    df_basic = pd.read_csv(RAW_DIR / "CPBind_unique_basic.tsv", sep="\t")
    # Load affinity.csv
    df_aff = pd.read_csv(RAW_DIR / "CPBind_unique_affinity.csv")
    # Load index.txt, skipping blank lines
    with open(RAW_DIR / "CPBind_unique_index.txt") as f:
        index_entries = [line.strip() for line in f if line.strip()]
    # Merge basic + affinity on structure ID
    # (<common_key> is a placeholder until the headers are inspected in Task 2)
    df = df_basic.merge(df_aff, on=<common_key>, how="left")
    # Parse entries
    rows = []
    for _, row in df.iterrows():
        # Extract cyclization type from naming convention
        # Map ddG + Vina to affinity fields
        # Convert ddG to approximate Kd_nM
        # Build standardized row and append it to rows
        ...
    return pd.DataFrame(rows)
Verify: Script runs without error, outputs 03_cpbind.csv to output/.
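The name parsing that the script relies on could look like the sketch below. It assumes index entries follow the {AFDB_id}_{first_residue}_{last_residue}_relaxed format exactly, with no file extension, and that the cyclization tokens (CysCys, HeadTail, IsoPep) appear somewhere in a name or path; where they actually appear is not confirmed. Both function names and the example ID used in the usage note are hypothetical.

```python
import re

# Anchored on the right so underscores inside the AFDB ID do not break parsing.
NAME_RE = re.compile(r"^(?P<afdb_id>.+)_(?P<first>\d+)_(?P<last>\d+)_relaxed$")

# Token-to-label mapping from the plan; IsoPep is mapped to "isopeptide" here.
CYCLIZATION_TOKENS = {
    "CysCys": "disulfide",
    "HeadTail": "head_to_tail",
    "IsoPep": "isopeptide",
}


def parse_structure_name(name):
    """Split a structure name into AFDB ID and residue range, or return None."""
    match = NAME_RE.match(name.strip())
    if match is None:
        return None
    return {
        "afdb_id": match.group("afdb_id"),
        "first_residue": int(match.group("first")),
        "last_residue": int(match.group("last")),
    }


def infer_cyclization(text):
    """Return the cyclization label for the first known token found in text."""
    for token, label in CYCLIZATION_TOKENS.items():
        if token in text:
            return label
    return None
```

For a made-up name such as "AF-P12345-F1_10_24_relaxed", `parse_structure_name` returns the ID "AF-P12345-F1" with residues 10 and 24. Names that do not match return None, so unparsed entries can be counted and reported instead of silently dropped.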
Task 4: Run extraction and verify output
Step 1: Run the script
cd data-scraping/sources/03_cpbind/src
python3 03_cpbind.py
Step 2: Verify output (paths are relative to src/ after the cd above)
wc -l ../output/03_cpbind.csv
head -2 ../output/03_cpbind.csv
Expected: ~101K rows, all marked affinity_is_computed=True.
Step 3: Sanity checks
- Every entry has affinity_is_computed=True
- ddG and Vina values are negative (binding)
- Peptide lengths are reasonable (5-50 residues)
- No duplicate entry_ids
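These checks can be scripted so they run the same way after every extraction. The sketch below uses only the standard library and assumes the output CSV has the columns entry_id, affinity_value, affinity_is_computed and peptide_length; peptide_length is a hypothetical column name, and the function name `check_output` is also an assumption.

```python
import csv


def check_output(csv_path, min_len=5, max_len=50):
    """Return a list of human-readable problems found in the output CSV."""
    problems = []
    seen_ids = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            if row["affinity_is_computed"] != "True":
                problems.append(f"line {line_no}: affinity_is_computed is not True")
            if float(row["affinity_value"]) >= 0:
                problems.append(f"line {line_no}: affinity_value is not negative")
            length = int(float(row["peptide_length"]))
            if not min_len <= length <= max_len:
                problems.append(f"line {line_no}: peptide length {length} out of range")
            if row["entry_id"] in seen_ids:
                problems.append(f"line {line_no}: duplicate entry_id {row['entry_id']}")
            seen_ids.add(row["entry_id"])
    return problems
```

An empty list means all four checks passed; otherwise each message points at the offending line of 03_cpbind.csv.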
Open Questions / Risks
- Large download: 3.9 GB for CPBind_unique. May need to fall back to Kaggle API if Zenodo is slow.
- Column names unknown: Need to inspect actual CSV headers after extraction to confirm mapping. Plan will be updated after Task 2 inspection.
- No peptide sequences: CPBind property files likely don’t have sequences — only structure IDs. Sequences are encoded in PDB files (which we skip). We’ll store what we can from the basic file and index.
- Computed vs experimental: CPBind affinities are all computed. They’re useful for training but not for experimental benchmarks. The final collation step should keep these clearly separated.