title: Cyclic Peptide–Protein Binding Affinity Extraction Pipeline Plan
tags: [plan, data-scraping, pipeline, dataset]
created: 2026-05-12
updated: 2026-05-27
status: active
related:
Cyclic Peptide–Protein Binding Affinity: Extraction Pipeline Plan
Architecture
data-scraping/
├── plan.md # Original search plan
├── summary.md # Curation summary
├── paper_collection.json # Paper metadata
├── pipeline_plan.md # This file
├── sources/
│ ├── _template.py # Template for new source scripts
│ ├── 01_ppikb/
│ │ ├── raw/ # Downloaded PPIKB xlsx files
│ │ ├── src/ # PPIKB extraction scripts
│ │ └── output/ # Extracted & normalized CSVs
│ ├── 02_cyclicpepedia/
│ │ ├── raw/ # Downloaded CyclicPepedia xlsx files
│ │ ├── src/ # CyclicPepedia extraction scripts
│ │ └── output/ # Extracted & normalized CSVs
│ ├── 03_cpbind/
│ │ ├── raw/
│ │ ├── src/
│ │ └── output/
│ ├── 04_pepbi/
│ │ ├── raw/
│ │ ├── src/
│ │ └── output/
│ ├── 05_paper_norman_2021/
│ │ ├── raw/
│ │ ├── src/
│ │ └── output/
│ └── ... # Additional paper sources
Standardized Schema (per intermediate CSV)
Every source script MUST output a CSV with these columns to {source_dir}/output/:
| Column | Type | Required | Description |
|---|---|---|---|
| source_id | str | YES | Source script name (e.g. “01_ppikb”) |
| entry_id | str | YES | Unique ID within source |
| peptide_name | str | no | Name/identifier of peptide |
| peptide_sequence | str | YES | Sequence (one-letter if canonical, HELM for modified) |
| cyclization_type | str | YES | One of: disulfide, head-to-tail, thioether, stapled_hydrocarbon, stapled_lactam, thioether_bipyridyl, bicyclic, CPPC_DRP, other |
| cyclization_detail | str | no | Additional detail (e.g., “i,i+4 olefin stapling”) |
| peptide_length | int | YES | Number of residues |
| has_noncanonical | bool | no | Contains N-methyl, D-amino acid, etc. |
| target_protein | str | YES | Protein name |
| target_uniprot | str | no | UniProt accession |
| target_pdb | str | no | PDB ID of complex |
| affinity_value | float | YES* | Numeric affinity (set to NaN if binary only) |
| affinity_unit | str | YES* | nM, uM, mM, pM |
| affinity_type | str | YES | Kd, Ki, IC50, EC50, Ka, Kd_on, Kd_off, bind/nobind |
| affinity_is_binary | bool | YES | True if only binder/nonbinder (no numeric value) |
| assay_method | str | YES | SPR, ITC, FP, MST, BLI, AS-MS, ELISA, AlphaScreen, TR-FRET, competition, enzymatic, other |
| assay_conditions | str | no | pH, temp, buffer |
| is_binder | bool | YES | Binary: Kd/Ki/IC50 < 10 uM or explicitly stated binder |
| is_strong_binder | bool | no | Kd < 100 nM |
| is_moderate_binder | bool | no | 100 nM <= Kd < 1 uM |
| is_weak_binder | bool | no | 1 uM <= Kd < 10 uM |
| is_non_binder | bool | no | Explicitly non-binding or Kd > 100 uM |
| affinity_is_computed | bool | YES | True if in silico (Rosetta, Vina, etc.), False if experimental |
| doi | str | no | DOI of source paper |
| pubmed_id | str | no | PMID |
| year | int | no | Publication year |
| notes | str | no | Free text |
*Required if affinity_is_binary is False.
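The binder columns in the schema are plain threshold rules, so they can be derived from `affinity_value` and `affinity_unit`. The following is a minimal sketch, not the pipeline's actual code: the function name `binder_flags`, the `explicit_binder` argument, and the `TO_NM` table are hypothetical. The schema does not assign a label to values between 10 uM and 100 uM, so this sketch leaves every flag False in that range. The strength flags are defined on Kd in the schema; the sketch assumes the caller passes a Kd when those flags matter.

```python
import math

# Factors converting each unit allowed by the schema to nanomolar.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}

FLAG_NAMES = (
    "is_binder",
    "is_strong_binder",
    "is_moderate_binder",
    "is_weak_binder",
    "is_non_binder",
)


def binder_flags(value, unit=None, explicit_binder=None):
    """Derive the schema's binder flags from one affinity measurement.

    explicit_binder (hypothetical argument) carries a binder/non-binder
    statement from the source: True, False, or None if nothing is stated.
    """
    flags = dict.fromkeys(FLAG_NAMES, False)

    # Binary-only entries have NaN as affinity_value; rely on the statement.
    if value is None or math.isnan(value):
        flags["is_binder"] = explicit_binder is True
        flags["is_non_binder"] = explicit_binder is False
        return flags

    nm = value * TO_NM[unit]
    flags["is_binder"] = nm < 10_000 or explicit_binder is True
    flags["is_strong_binder"] = nm < 100
    flags["is_moderate_binder"] = 100 <= nm < 1_000
    flags["is_weak_binder"] = 1_000 <= nm < 10_000
    flags["is_non_binder"] = nm > 100_000 or explicit_binder is False
    return flags


strong = binder_flags(50, "nM")
gap = binder_flags(50, "uM")  # falls in the unlabelled 10-100 uM range
```

Here `strong` has `is_binder` and `is_strong_binder` set, while `gap` has no flag set because the schema does not cover that range.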
Source Prioritization
Tier 1: Large structured databases (download + filter)
These give the most data per effort and are immediately downloadable.
| # | Source | Expected entries | Effort | Notes |
|---|---|---|---|---|
| 01 | PPIKB | 500-5000 cyclic entries | Low | Filter Main.xlsx for cyclic peptides. Has Kd/Ki/IC50. |
| 02 | CyclicPepedia | 100-1000 binding entries | Low | Has target + bioassay data; needs cross-referencing |
| 03 | CPBind | ~100K+ entries | Low | Computed affinity only; separate label for training |
| 04 | PEPBI | ~20-50 cyclic entries | Low | ITC thermodynamic data; mostly linear but worth extracting |
Tier 2: Papers with large-scale screening data (structured extraction)
These papers have supplementary tables or structured data with many peptide-protein pairs.
| # | Source | Expected entries | Key data |
|---|---|---|---|
| 05 | Norman et al. 2021 (SARS-CoV-2 spike, mRNA display) | 10-50 cyclic peptides with Kd | thioether cyclic peptides |
| 06 | Patel et al. 2020 (BET bromodomain DNA-encoded) | 20-100 cyclic peptides with Kd/IC50 | multiple bromodomains |
| 08 | Hacker et al. 2020 (linear/mono/bicyclic vs streptavidin) | 20-50 entries | direct comparison of formats |
| 10 | Zhao et al. 2025 (DEL macrocyclization strategies) | 10-50 entries | multiple cyclization chemistries |
| 14 | Linciano et al. 2024 (yeast display) | 10-30 entries | multiple targets |
| 15 | Smith et al. 2023 (FGF-R selective/promiscuous) | 20-50 entries | selectivity profile |
| 27 | Hacker et al. 2020 supplementary | potentially large dataset | mRNA display hit tables |
Tier 3: Papers with focused studies (manual extraction from text/tables)
These have smaller datasets but high-quality, well-characterized data.
| # | Source | Expected entries | Key data |
|---|---|---|---|
| 07 | de Araujo et al. 2022 (MDM2 stapled systematic) | 10-20 entries | multiple stapling methods compared |
| 09 | Lee et al. 2025 (linearizable macrocyclic) | 5-15 entries | AS-MS + SPR |
| 11 | Glas et al. 2017 (macrocycle kinetics) | 5-10 entries | kon/koff/Kd from SPR |
| 12 | Schneider et al. 2021 (MDM2 CPP-bicyclic) | 5-10 entries | Kd, IC50 |
| 13 | Villequey et al. 2024 (FGFR3c bicyclic) | 10-20 entries | Kd, IC50, thermal stability |
| 16 | Goldbach et al. 2019 (amylase macrocycle) | 5-10 entries | ITC full thermodynamics |
| 17 | Li et al. 2024 (AI-designed binders) | 5-10 entries | Kd via SPR |
| 18 | Rettie et al. 2024 (RFpeptides) | 5-15 entries | experimental Kd |
| 19 | Gaucher et al. 2022 (VEGF cyclization ITC) | 5-15 entries | full ΔG/ΔH/ΔS |
| 20 | Kruger et al. 2017 (non-natural macrocyclic PPI) | 5-10 entries | crystal structure + Kd |
| 21 | EphA2 bicyclic (BCY18469) | 5-10 entries | Kd from SPR/ITC |
| 22 | Li et al. 2023 (MYC bicyclic) | 5-10 entries | IC50 values |
| 24 | Manschwetus et al. 2019 (stapled PKI) | 5-10 entries | Kd values |
| 31 | Phage display disulfide constrained (Gao 2024) | 20-50 entries | sub-uM affinities |
| 33 | Landscaping HDM2 stapled peptides | 10-20 entries | Kd, helicity, stability |
Tier 4: Additional papers (expand as we go)
More papers can be added using the same template. The 80 papers in paper_collection.json provide the source list.
Script Template
Each source script follows this pattern:
- Download raw data (from URL, DOI, or manual upload)
- Parse and extract relevant fields
- Normalize to the standardized schema
- Output to `{source_dir}/output/{source_id}.csv`
- Print summary statistics
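The steps above could be laid out roughly as follows. This is a sketch under stated assumptions rather than the contents of `_template.py`: the raw record keys (`id`, `seq`, `target`, `kd_nm`, `cyclization`, `assay`), the helper names, and the `99_example` source are all hypothetical, and the download and parse steps are replaced by made-up in-memory records.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path

# Column order taken from the standardized schema.
SCHEMA_COLUMNS = [
    "source_id", "entry_id", "peptide_name", "peptide_sequence",
    "cyclization_type", "cyclization_detail", "peptide_length",
    "has_noncanonical", "target_protein", "target_uniprot", "target_pdb",
    "affinity_value", "affinity_unit", "affinity_type", "affinity_is_binary",
    "assay_method", "assay_conditions", "is_binder", "is_strong_binder",
    "is_moderate_binder", "is_weak_binder", "is_non_binder",
    "affinity_is_computed", "doi", "pubmed_id", "year", "notes",
]


def normalize(raw, source_id):
    """Map one parsed record (hypothetical keys) onto the schema."""
    row = dict.fromkeys(SCHEMA_COLUMNS, "")
    row.update({
        "source_id": source_id,
        "entry_id": raw["id"],
        "peptide_sequence": raw["seq"],
        # len() is only a residue count for one-letter sequences, not HELM.
        "peptide_length": len(raw["seq"]),
        "cyclization_type": raw.get("cyclization", "other"),
        "target_protein": raw["target"],
        "affinity_value": raw["kd_nm"],
        "affinity_unit": "nM",
        "affinity_type": "Kd",
        "affinity_is_binary": False,
        "assay_method": raw.get("assay", "other"),
        "is_binder": raw["kd_nm"] < 10_000,  # schema rule: below 10 uM
        "affinity_is_computed": False,
    })
    return row


def write_output(rows, source_dir, source_id):
    """Write rows to {source_dir}/output/{source_id}.csv and return the path."""
    out_dir = Path(source_dir) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{source_id}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=SCHEMA_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


def summarize(rows):
    """Print and return entry counts per cyclization type."""
    by_type = Counter(r["cyclization_type"] for r in rows)
    print(f"{len(rows)} entries; cyclization types: {dict(by_type)}")
    return by_type


# Made-up records standing in for the download and parse steps.
raw_records = [
    {"id": "x1", "seq": "CAAAAC", "target": "ExampleProtein",
     "kd_nm": 250.0, "cyclization": "disulfide", "assay": "SPR"},
    {"id": "x2", "seq": "CGGGGGC", "target": "ExampleProtein",
     "kd_nm": 50_000.0},
]
rows = [normalize(r, "99_example") for r in raw_records]
with tempfile.TemporaryDirectory() as tmp:
    out_path = write_output(rows, Path(tmp) / "99_example", "99_example")
    with out_path.open(newline="", encoding="utf-8") as fh:
        written = list(csv.DictReader(fh))
summary = summarize(rows)
```

Writing every schema column, even when empty, keeps the per-source CSVs uniform so a later merge step does not need per-source handling.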
The master collate.py script then:
- Reads all `intermediate/*.csv` files
- Deduplicates across sources (by DOI + peptide_sequence + target_protein)
- Resolves conflicting affinity values (prefer experimental > computed, SPR > FP > other)
- Computes derived features (binder categories)
- Outputs final dataset
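The deduplication and conflict-resolution rules could be expressed as a grouping step followed by a preference ordering. The sketch below is illustrative only and is not `collate.py` itself: the function names, the key normalization (lower-casing and stripping whitespace), and the assumption that `affinity_is_computed` has already been parsed to a Python bool are mine. One caveat: `doi` is optional in the schema, so rows without a DOI would be grouped on sequence and target alone.

```python
from collections import defaultdict

# Assay preference from the plan: SPR > FP > other.
ASSAY_RANK = {"SPR": 0, "FP": 1}
OTHER_RANK = 2


def dedup_key(row):
    """Key rows by DOI + peptide_sequence + target_protein."""
    return (
        row["doi"].strip().lower(),
        row["peptide_sequence"].strip(),
        row["target_protein"].strip().lower(),
    )


def preference(row):
    """Lower tuples win: experimental before computed, then assay rank."""
    return (
        bool(row["affinity_is_computed"]),
        ASSAY_RANK.get(row["assay_method"], OTHER_RANK),
    )


def collate(rows):
    """Keep one preferred row per deduplication key."""
    groups = defaultdict(list)
    for row in rows:
        groups[dedup_key(row)].append(row)
    # min() keeps the first row encountered when preferences tie.
    return [min(group, key=preference) for group in groups.values()]


# Made-up rows: three measurements of one pair, plus one unrelated pair.
example_rows = [
    {"doi": "10.0000/example", "peptide_sequence": "CAAAAC",
     "target_protein": "ExampleProtein", "assay_method": "other",
     "affinity_is_computed": True, "affinity_value": 12.0},
    {"doi": "10.0000/example", "peptide_sequence": "CAAAAC",
     "target_protein": "ExampleProtein", "assay_method": "FP",
     "affinity_is_computed": False, "affinity_value": 300.0},
    {"doi": "10.0000/EXAMPLE", "peptide_sequence": "CAAAAC",
     "target_protein": "exampleprotein", "assay_method": "SPR",
     "affinity_is_computed": False, "affinity_value": 250.0},
    {"doi": "10.0000/example", "peptide_sequence": "CGGGGGC",
     "target_protein": "ExampleProtein", "assay_method": "ITC",
     "affinity_is_computed": False, "affinity_value": 900.0},
]
merged = collate(example_rows)
```

In this example the three rows for the first pair collapse to the experimental SPR value, and the unrelated pair passes through unchanged.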
Execution Order
We’ll go Tier 1 first (largest data, least effort), then Tier 2, then Tier 3:
- `01_ppikb.py`: PPIKB (likely 500-5000 cyclic entries)
- `02_cyclicpepedia.py`: CyclicPepedia cross-reference
- `03_cpbind.py`: Computed affinities from CPBind
- `04_pepbi.py`: PEPBI thermodynamic data
- Then Tier 2 papers one by one…
- Then Tier 3 papers one by one…
- Finally: `collate.py`, which merges everything
Each script is independently runnable and produces a verifiable intermediate CSV.
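One way to make an intermediate CSV verifiable is a small check of the schema's required columns, including the conditional rule that `affinity_value` and `affinity_unit` are required when `affinity_is_binary` is False. This is a hedged sketch: the function name `validate_csv`, the message wording, and the assumption that booleans are serialized as the strings `True` and `False` are not part of the plan.

```python
import csv
import io
import math

# Columns marked YES in the schema.
REQUIRED = [
    "source_id", "entry_id", "peptide_sequence", "cyclization_type",
    "peptide_length", "target_protein", "affinity_type",
    "affinity_is_binary", "assay_method", "is_binder",
    "affinity_is_computed",
]
# Columns marked YES*: required only when affinity_is_binary is False.
CONDITIONAL = ["affinity_value", "affinity_unit"]
ALLOWED_UNITS = {"nM", "uM", "mM", "pM"}


def validate_csv(handle):
    """Return a list of problems found in one intermediate CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED + CONDITIONAL if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    # Data starts on line 2, assuming no field spans multiple lines.
    for line_no, row in enumerate(reader, start=2):
        for col in REQUIRED:
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: {col} is empty")
        if (row["affinity_is_binary"] or "").strip() == "False":
            try:
                numeric = float((row["affinity_value"] or "").strip())
            except ValueError:
                numeric = math.nan
            if math.isnan(numeric):
                problems.append(f"line {line_no}: affinity_value is not numeric")
            unit = (row["affinity_unit"] or "").strip()
            if unit not in ALLOWED_UNITS:
                problems.append(
                    f"line {line_no}: affinity_unit {unit!r} not in allowed set"
                )
    return problems


# Made-up CSV text: one valid row and one with a missing value and bad unit.
header_line = ",".join(REQUIRED + CONDITIONAL)
good = "01_ppikb,e1,CAAAAC,disulfide,6,ExampleProtein,Kd,False,SPR,True,False,250.0,nM"
bad = "01_ppikb,e2,CAAAAC,disulfide,6,ExampleProtein,Kd,False,SPR,True,False,,ug/mL"
report = validate_csv(io.StringIO("\n".join([header_line, good, bad]) + "\n"))
```

A check like this could run at the end of each source script, so a malformed file is caught before the merge step rather than during it.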