title: CPBind Extraction Plan
tags: [plan, data-scraping, cpbind, dataset]
created: 2026-05-14
updated: 2026-05-27
status: active
related:

03_cpbind — Implementation Plan

For Hermes: Use the subagent-driven-development skill to implement this plan task by task.

Goal: Extract cyclic peptide-protein binding affinity data from CPBind (CPSea’s high-affinity subset with computed Rosetta ddG + Vina scores).

Architecture: Download CPBind_unique.zip (3.9 GB) from Zenodo — the deduplicated 101K-entry subset. Extract only the property CSVs (CPBind_unique_affinity.csv, CPBind_unique_basic.tsv). Parse, normalize, and output standardized CSV. Mark all entries as affinity_is_computed=True since affinities are from Rosetta ddG + Vina, not experimental.

Tech Stack: Python, pandas

Key considerations:

CPBind has 476K entries, CPBind_unique has 101K (deduplicated by Foldseek clustering). Use CPBind_unique for first pass.
CPBind.zip is 19.3 GB, CPBind_unique.zip is 3.9 GB. Both contain PDB structures we don’t need.
We ONLY need the property CSVs from inside the zip (affinity + basic files).
All entries are cyclic peptides by definition (the dataset is filtered for cyclic).
Affinities are computed (Rosetta ddG + Vina), not experimental. Must set affinity_is_computed=True.
The property files contain no sequences. The structure names listed in the index file encode only the AFDB ID and a residue range; the sequences themselves live in the PDB structure files, which we skip. We’ll extract peptide length and receptor info from the basic file.
The basic file has filter metrics. The affinity file has ddG and Vina scores.

Task 1: Create source directory structure

Files:

Create: data-scraping/sources/03_cpbind/
Create: data-scraping/sources/03_cpbind/raw/
Create: data-scraping/sources/03_cpbind/src/
Create: data-scraping/sources/03_cpbind/output/

mkdir -p data-scraping/sources/03_cpbind/{raw,src,output}

Verify: ls data-scraping/sources/03_cpbind/ shows raw, src, output dirs.

Task 2: Download CPBind_unique.zip and extract property CSVs

Objective: Download the zip, extract ONLY the property CSV/TSV files, then delete the zip.

Files:

Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_affinity.csv
Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_basic.tsv
Create: data-scraping/sources/03_cpbind/raw/CPBind_unique_index.txt

Step 1: Download

cd data-scraping/sources/03_cpbind/raw
curl -C - -o CPBind_unique.zip "https://zenodo.org/records/16794716/files/CPBind_unique.zip?download=1"

Expected: ~3.9 GB file downloaded.

Step 2: Extract only property files + index

unzip -j CPBind_unique.zip "CPBind_unique/CPBind_unique_Properties/*" "CPBind_unique/CPBind_unique_index.txt" -d .
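
The same selective extraction could also be done from Python, which avoids depending on the unzip binary. This is a minimal sketch using only the standard library; the archive layout (PROPERTIES_DIR and INDEX_FILE) is assumed from the unzip patterns above and has not been confirmed against the real zip, and extract_property_files is a hypothetical helper name.

```python
import shutil
import zipfile
from pathlib import Path, PurePosixPath

# Assumed archive layout, copied from the unzip patterns in this plan.
PROPERTIES_DIR = "CPBind_unique/CPBind_unique_Properties/"
INDEX_FILE = "CPBind_unique/CPBind_unique_index.txt"


def extract_property_files(zip_path, dest_dir):
    """Extract only the property files and the index, flattening paths like `unzip -j`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            wanted = info.filename.startswith(PROPERTIES_DIR) or info.filename == INDEX_FILE
            if not wanted:
                continue  # skips the PDB structures we do not need
            # Keep only the base name so files land directly in dest_dir.
            target = dest / PurePosixPath(info.filename).name
            with zf.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
            written.append(target.name)
    return sorted(written)
```

Calling extract_property_files("CPBind_unique.zip", ".") from raw/ would return the list of extracted file names, which can be compared against the expected list in Step 3.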

Step 3: Check which files were extracted (and confirm the basic file is a TSV)

# Check what files were extracted
ls -lh *.csv *.tsv *.txt 2>/dev/null

Expected: CPBind_unique_affinity.csv, CPBind_unique_basic.tsv, CPBind_unique_hydrophobic.csv, CPBind_unique_validity.csv, CPBind_unique_index.txt

Step 4: Delete the zip

rm CPBind_unique.zip

Step 5: Inspect the extracted files

head -3 CPBind_unique_affinity.csv
head -3 CPBind_unique_basic.tsv
wc -l CPBind_unique_affinity.csv CPBind_unique_basic.tsv

Verify: Property CSVs are real data (not HTML 404), column headers visible.
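
The "not HTML 404" check can be automated with a quick look at the first bytes of each file. The sketch below is one possible heuristic, not a complete validator; looks_like_html and the marker list are illustrative names chosen here.

```python
HTML_MARKERS = (b"<!doctype", b"<html")


def looks_like_html(path, probe_bytes=512):
    """Return True if the file starts like an HTML page rather than tabular data."""
    with open(path, "rb") as f:
        # Only the first few hundred bytes are needed, so large files stay cheap.
        head = f.read(probe_bytes).lstrip().lower()
    return head.startswith(HTML_MARKERS)
```

Running it over CPBind_unique_affinity.csv and CPBind_unique_basic.tsv should return False for both if the download succeeded.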

Task 3: Write 03_cpbind.py extraction script

Objective: Parse CPBind property files and output standardized CSV.

Files:

Create: data-scraping/sources/03_cpbind/src/03_cpbind.py

Key implementation notes:

No download function — raw files already downloaded manually (too large for urllib).
Schema mapping from CPBind columns:
- CPBind_unique_basic.tsv: has structure IDs, peptide length, receptor info, filter metrics
- CPBind_unique_affinity.csv: has Rosetta ddG and Vina scores per structure
Affinity handling:
- Rosetta ddG: units are kcal/mol. Lower (more negative) = stronger binding.
- Vina score: units are kcal/mol. Lower (more negative) = stronger binding.
- These are NOT directly comparable to experimental Kd/IC50. We’ll store them as-is in affinity_value with unit “kcal/mol” and type “ddG” (Rosetta) or “Vina_score”.
- Convert ddG to an approximate Kd using Kd [M] ≈ exp(ddG / RT), where R = 0.001987 kcal/(mol·K) and T = 298 K (RT ≈ 0.592 kcal/mol). Multiply by 1e9 to get nM and store the result in affinity_value_nM for cross-source comparison, flagged as a computed approximation.
- affinity_is_computed = True for ALL entries.
- is_binder based on the dataset’s own filter: ddG < -25 and Vina < -6 (original CPBind definition).
Peptide sequence: Not available in the property files. The structure names in the index file follow the format {AFDB_id}_{first_residue}_{last_residue}_relaxed, which gives the AFDB ID and residue range but not the sequence itself. Store peptide length from the basic file.
Cyclization types: From basic file or inferred from structure naming (CysCys = disulfide, HeadTail = head_to_tail, IsoPep = isopeptide/thioether).
Target protein: AFDB ID from structure name (before the residue range). Not human-readable protein names.
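
The affinity handling described in these notes can be sketched as two small helpers. The constants and thresholds come from the notes; the function names ddg_to_kd_nm and is_binder are hypothetical, and the conversion treats the Rosetta ddG as if it were a true binding free energy, which is an approximation.

```python
import math

R_KCAL = 0.001987  # gas constant in kcal/(mol*K), as given in the notes
T_KELVIN = 298.0


def ddg_to_kd_nm(ddg_kcal_per_mol):
    """Approximate Kd in nM from ddG, using Kd[M] = exp(ddG / RT)."""
    kd_molar = math.exp(ddg_kcal_per_mol / (R_KCAL * T_KELVIN))
    return kd_molar * 1e9  # convert mol/L to nmol/L


def is_binder(ddg, vina):
    """Apply the dataset's own filter: ddG < -25 and Vina < -6."""
    return ddg < -25 and vina < -6
```

With these constants a ddG of -10 kcal/mol maps to a Kd of roughly 46 nM, while values at the ddG < -25 threshold map to Kd far below anything measurable experimentally. That gap is one more reason to keep the converted values flagged as computed approximations.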

Follow the pattern from 01_ppikb.py and 02_cyclicpepedia.py for directory structure:

from pathlib import Path

SOURCE_DIR = Path(__file__).parent.parent
RAW_DIR = SOURCE_DIR / "raw"
OUTPUT_DIR = SOURCE_DIR / "output"
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

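Skeleton (column names are unconfirmed until the Task 2 inspection):

import pandas as pd

def extract():
    # Load basic.tsv (tab-separated) and affinity.csv (comma-separated)
    df_basic = pd.read_csv(RAW_DIR / "CPBind_unique_basic.tsv", sep="\t")
    df_aff = pd.read_csv(RAW_DIR / "CPBind_unique_affinity.csv")
    # Load index.txt, keeping one structure name per non-empty line
    with open(RAW_DIR / "CPBind_unique_index.txt") as f:
        index_entries = [line.strip() for line in f if line.strip()]

    # Merge basic + affinity on the structure ID column.
    # Assumption: the first column the two files share is the structure ID.
    common = [c for c in df_basic.columns if c in df_aff.columns]
    if not common:
        raise ValueError("basic and affinity files share no column to merge on")
    key = common[0]
    df = df_basic.merge(df_aff, on=key, how="left")

    # Parse entries
    rows = []
    for _, row in df.iterrows():
        # TODO: extract cyclization type from naming convention
        # TODO: map ddG + Vina to affinity fields
        # TODO: convert ddG to approximate Kd_nM
        # Build standardized row (only the fields known so far)
        rows.append({"entry_id": row[key], "affinity_is_computed": True})

    return pd.DataFrame(rows)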

Verify: Script runs without error, outputs 03_cpbind.csv to output/.
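
The naming convention from the Task 3 notes ({AFDB_id}_{first_residue}_{last_residue}_relaxed) and the cyclization tokens could be handled with helpers along these lines. This is a sketch under assumptions: the example name is invented, the position of the CysCys/HeadTail/IsoPep token in real names or columns is not yet known, and parse_structure_name and infer_cyclization are hypothetical names.

```python
import re

# Greedy AFDB ID, then two integer residue indices, then the literal suffix.
NAME_RE = re.compile(r"^(?P<afdb_id>.+)_(?P<first>\d+)_(?P<last>\d+)_relaxed$")

# Token-to-label mapping from the notes; IsoPep is listed there as
# isopeptide/thioether, so the label below is a simplification.
CYCLIZATION_TOKENS = {
    "CysCys": "disulfide",
    "HeadTail": "head_to_tail",
    "IsoPep": "isopeptide",
}


def parse_structure_name(name):
    """Split a structure name into AFDB ID and residue range, or return None."""
    stem = name.strip()
    if stem.lower().endswith(".pdb"):
        stem = stem[:-4]  # tolerate names that still carry the file extension
    match = NAME_RE.match(stem)
    if match is None:
        return None
    return {
        "afdb_id": match["afdb_id"],
        "first_residue": int(match["first"]),
        "last_residue": int(match["last"]),
    }


def infer_cyclization(text):
    """Return the cyclization label for the first known token found in text."""
    for token, label in CYCLIZATION_TOKENS.items():
        if token in text:
            return label
    return None
```

Returning None for names that do not match makes it easy to count unparsed entries during the Task 4 sanity checks instead of failing on the first surprise.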

Task 4: Run extraction and verify output

Step 1: Run the script

cd data-scraping/sources/03_cpbind/src
python3 03_cpbind.py

Step 2: Verify output

# Run from src/, where Step 1 left the shell
wc -l ../output/03_cpbind.csv
head -2 ../output/03_cpbind.csv

Expected: ~101K rows, all marked affinity_is_computed=True.

Step 3: Sanity checks

Every entry has affinity_is_computed=True
ddG and Vina values are negative (binding)
Peptide lengths are reasonable (5-50 residues)
No duplicate entry_ids
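
These checks could be scripted so they run after every extraction. The sketch below uses only the standard library so it stays independent of the extraction script. The column names entry_id and affinity_is_computed come from this plan; ddG, vina_score, and peptide_length are placeholder names until the real output schema is fixed.

```python
import csv


def sanity_check(csv_path, ddg_col="ddG", vina_col="vina_score", length_col="peptide_length"):
    """Return a list of human-readable problems found in the output CSV."""
    problems = []
    seen_ids = set()
    with open(csv_path, newline="") as f:
        # Data starts on line 2 because line 1 is the header.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            if row["affinity_is_computed"] != "True":
                problems.append(f"line {line_no}: affinity_is_computed is not True")
            if float(row[ddg_col]) >= 0 or float(row[vina_col]) >= 0:
                problems.append(f"line {line_no}: non-negative ddG or Vina score")
            if not 5 <= int(float(row[length_col])) <= 50:
                problems.append(f"line {line_no}: peptide length outside 5-50")
            if row["entry_id"] in seen_ids:
                problems.append(f"line {line_no}: duplicate entry_id {row['entry_id']}")
            seen_ids.add(row["entry_id"])
    return problems
```

An empty list means all four checks passed; otherwise each message points at the offending line of 03_cpbind.csv.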

Open Questions / Risks

Large download: 3.9 GB for CPBind_unique. May need to fall back to Kaggle API if Zenodo is slow.
Column names unknown: Need to inspect actual CSV headers after extraction to confirm mapping. Plan will be updated after Task 2 inspection.
No peptide sequences: CPBind property files likely don’t have sequences — only structure IDs. Sequences are encoded in PDB files (which we skip). We’ll store what we can from the basic file and index.
Computed vs experimental: CPBind affinities are all computed. They’re useful for training but not for experimental benchmarks. The final collation step should keep these clearly separated.
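
For the unknown column names, a small header inspection may help settle the merge key before the extraction script is written. This is a sketch that assumes the basic file is tab-separated and the affinity file comma-separated, as their extensions suggest; read_header and shared_columns are hypothetical helper names.

```python
import csv


def read_header(path, delimiter=","):
    """Return the column names from the first line of a delimited file."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter=delimiter))


def shared_columns(basic_path, affinity_path):
    """List columns present in both files, in the basic file's order."""
    basic = read_header(basic_path, delimiter="\t")
    affinity = read_header(affinity_path)
    return [column for column in basic if column in affinity]
```

If shared_columns returns exactly one name, that column is the natural candidate for the structure ID used in the merge; zero or several names would mean the mapping needs a manual look.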
