title: Cyclic Peptide–Protein Binding Affinity Extraction Pipeline Plan
tags: [plan, data-scraping, pipeline, dataset]
created: 2026-05-12
updated: 2026-05-27
status: active
related:


Cyclic Peptide–Protein Binding Affinity: Extraction Pipeline Plan

Architecture

data-scraping/
├── plan.md                          # Original search plan
├── summary.md                       # Curation summary
├── paper_collection.json             # Paper metadata
├── pipeline_plan.md                  # This file
├── sources/
│   ├── _template.py                  # Template for new source scripts
│   ├── 01_ppikb/
│   │   ├── raw/                      # Downloaded PPIKB xlsx files
│   │   ├── src/                      # PPIKB extraction scripts
│   │   └── output/                   # Extracted & normalized CSVs
│   ├── 02_cyclicpepedia/
│   │   ├── raw/                      # Downloaded CyclicPepedia xlsx files
│   │   ├── src/                      # CyclicPepedia extraction scripts
│   │   └── output/                   # Extracted & normalized CSVs
│   ├── 03_cpbind/
│   │   ├── raw/
│   │   ├── src/
│   │   └── output/
│   ├── 04_pepbi/
│   │   ├── raw/
│   │   ├── src/
│   │   └── output/
│   ├── 05_paper_norman_2021/
│   │   ├── raw/
│   │   ├── src/
│   │   └── output/
│   └── ...                           # Additional paper sources
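
Each source in the tree above repeats the same raw/, src/, output/ layout. A minimal sketch of scaffolding that layout for a new source follows; the function name `scaffold_source` and its `root` argument are hypothetical and not part of the existing scripts.

```python
from pathlib import Path

SUBDIRS = ("raw", "src", "output")  # per-source layout from the tree above


def scaffold_source(root: Path, source_id: str) -> Path:
    """Create sources/<source_id>/{raw,src,output} under root; safe to re-run."""
    source_dir = root / "sources" / source_id
    for sub in SUBDIRS:
        # exist_ok=True means an already scaffolded source is left untouched.
        (source_dir / sub).mkdir(parents=True, exist_ok=True)
    return source_dir
```

For example, `scaffold_source(Path("data-scraping"), "05_paper_norman_2021")` would create the three subdirectories for that source if they do not exist yet.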

Standardized Schema (per intermediate CSV)

Every source script MUST write a CSV with the following columns to {source_dir}/output/{source_id}.csv:

| Column | Type | Required | Description |
|---|---|---|---|
| source_id | str | YES | Source script name (e.g. “01_ppikb”) |
| entry_id | str | YES | Unique ID within source |
| peptide_name | str | no | Name/identifier of peptide |
| peptide_sequence | str | YES | Sequence (one-letter if canonical, HELM for modified) |
| cyclization_type | str | YES | One of: disulfide, head-to-tail, thioether, stapled_hydrocarbon, stapled_lactam, thioether_bipyridyl, bicyclic, CPPC_DRP, other |
| cyclization_detail | str | no | Additional detail (e.g., “i,i+4 olefin stapling”) |
| peptide_length | int | YES | Number of residues |
| has_noncanonical | bool | no | Contains N-methyl, D-amino acid, etc. |
| target_protein | str | YES | Protein name |
| target_uniprot | str | no | UniProt accession |
| target_pdb | str | no | PDB ID of complex |
| affinity_value | float | YES* | Numeric affinity (set to NaN if binary only) |
| affinity_unit | str | YES* | nM, uM, mM, pM |
| affinity_type | str | YES | Kd, Ki, IC50, EC50, Ka, Kd_on, Kd_off, bind/nobind |
| affinity_is_binary | bool | YES | True if only binder/nonbinder (no numeric value) |
| assay_method | str | YES | SPR, ITC, FP, MST, BLI, AS-MS, ELISA, AlphaScreen, TR-FRET, competition, enzymatic, other |
| assay_conditions | str | no | pH, temp, buffer |
| is_binder | bool | YES | Binary: Kd/Ki/IC50 < 10 uM or explicitly stated binder |
| is_strong_binder | bool | no | Kd < 100 nM |
| is_moderate_binder | bool | no | 100 nM <= Kd < 1 uM |
| is_weak_binder | bool | no | 1 uM <= Kd < 10 uM |
| is_non_binder | bool | no | Explicitly non-binding or Kd > 100 uM |
| affinity_is_computed | bool | YES | True if in silico (Rosetta, Vina, etc.), False if experimental |
| doi | str | no | DOI of source paper |
| pubmed_id | str | no | PMID |
| year | int | no | Publication year |
| notes | str | no | Free text |

*Required if affinity_is_binary is False.
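
The binder flags in the schema follow directly from the numeric affinity and its unit. The sketch below shows one way to derive them. The helper names `to_nm` and `binder_flags` are hypothetical, and applying the Kd thresholds to any numeric affinity type is an assumption made here for brevity.

```python
import math

# Multipliers that convert each allowed affinity_unit to nanomolar.
TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}


def to_nm(value: float, unit: str) -> float:
    """Convert an affinity value to nM; raises KeyError on an unknown unit."""
    return value * TO_NM[unit]


def binder_flags(value: float, unit: str) -> dict:
    """Derive the is_* columns from a numeric affinity using the schema thresholds."""
    if math.isnan(value):
        # Binary-only rows carry no number; their flags must come from the source.
        return {}
    nm = to_nm(value, unit)
    return {
        "is_binder": nm < 10_000,                # < 10 uM
        "is_strong_binder": nm < 100,            # < 100 nM
        "is_moderate_binder": 100 <= nm < 1_000,
        "is_weak_binder": 1_000 <= nm < 10_000,
        "is_non_binder": nm > 100_000,           # > 100 uM
    }
```

Under these thresholds a value between 10 uM and 100 uM sets none of the flags, so such rows are neither binders nor non-binders unless the source states otherwise.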

Source Prioritization

Tier 1: Large structured databases (download + filter)

These give the most data per effort and are immediately downloadable.

| # | Source | Expected entries | Effort | Notes |
|---|---|---|---|---|
| 01 | PPIKB | 500-5000 cyclic entries | Low | Filter Main.xlsx for cyclic peptides. Has Kd/Ki/IC50. |
| 02 | CyclicPepedia | 100-1000 bind entries | Low | Has target + bioassay data; needs cross-referencing |
| 03 | CPBind | ~100K+ entries | Low | Computed affinity only; separate label for training |
| 04 | PEPBI | ~20-50 cyclic entries | Low | ITC thermodynamic data; mostly linear but worth extracting |

Tier 2: Papers with large-scale screening data (structured extraction)

These papers have supplementary tables or structured data with many peptide-protein pairs.

| # | Source | Expected entries | Key data |
|---|---|---|---|
| 05 | Norman et al. 2021 (SARS-CoV-2 spike, mRNA display) | 10-50 cyclic peptides with Kd | thioether cyclic peptides |
| 06 | Patel et al. 2020 (BET bromodomain DNA-encoded) | 20-100 cyclic peptides with Kd/IC50 | multiple bromodomains |
| 08 | Hacker et al. 2020 (linear/mono/bicyclic vs streptavidin) | 20-50 entries | direct comparison of formats |
| 10 | Zhao et al. 2025 (DEL macrocyclization strategies) | 10-50 entries | multiple cyclization chemistries |
| 14 | Linciano et al. 2024 (yeast display) | 10-30 entries | multiple targets |
| 15 | Smith et al. 2023 (FGF-R selective/promiscuous) | 20-50 entries | selectivity profile |
| 27 | Hacker et al. 2020 supplementary | potentially large dataset | mRNA display hit tables |

Tier 3: Papers with focused studies (manual extraction from text/tables)

These have smaller datasets but high-quality, well-characterized data.

| # | Source | Expected entries | Key data |
|---|---|---|---|
| 07 | de Araujo et al. 2022 (MDM2 stapled systematic) | 10-20 entries | multiple stapling methods compared |
| 09 | Lee et al. 2025 (linearizable macrocyclic) | 5-15 entries | AS-MS + SPR |
| 11 | Glas et al. 2017 (macrocycle kinetics) | 5-10 entries | kon/koff/Kd from SPR |
| 12 | Schneider et al. 2021 (MDM2 CPP-bicyclic) | 5-10 entries | Kd, IC50 |
| 13 | Villequey et al. 2024 (FGFR3c bicyclic) | 10-20 entries | Kd, IC50, thermal stability |
| 16 | Goldbach et al. 2019 (amylase macrocycle) | 5-10 entries | ITC full thermodynamics |
| 17 | Li et al. 2024 (AI-designed binders) | 5-10 entries | Kd via SPR |
| 18 | Rettie et al. 2024 (RFpeptides) | 5-15 entries | experimental Kd |
| 19 | Gaucher et al. 2022 (VEGF cyclization ITC) | 5-15 entries | full ΔG/ΔH/ΔS |
| 20 | Kruger et al. 2017 (non-natural macrocyclic PPI) | 5-10 entries | crystal structure + Kd |
| 21 | EphA2 bicyclic (BCY18469) | 5-10 entries | Kd from SPR/ITC |
| 22 | Li et al. 2023 (MYC bicyclic) | 5-10 entries | IC50 values |
| 24 | Manschwetus et al. 2019 (stapled PKI) | 5-10 entries | Kd values |
| 31 | Phage display disulfide constrained (Gao 2024) | 20-50 entries | sub-uM affinities |
| 33 | Landscaping HDM2 stapled peptides | 10-20 entries | Kd, helicity, stability |

Tier 4: Additional papers (expand as we go)

More papers can be added using the same template. The 80 papers in paper_collection.json provide the source list.

Script Template

Each source script follows this pattern:

  1. Download raw data (from URL, DOI, or manual upload)
  2. Parse and extract relevant fields
  3. Normalize to the standardized schema
  4. Output to {source_dir}/output/{source_id}.csv
  5. Print summary statistics
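
Steps 3 to 5 above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of _template.py: the names `REQUIRED`, `CONDITIONAL`, `missing_fields`, and `write_output` are hypothetical, and rows are assumed to be plain dicts keyed by schema column names.

```python
import csv
from collections import Counter
from pathlib import Path

# Columns marked YES in the schema; the YES* columns are handled separately.
REQUIRED = [
    "source_id", "entry_id", "peptide_sequence", "cyclization_type",
    "peptide_length", "target_protein", "affinity_type",
    "affinity_is_binary", "assay_method", "is_binder", "affinity_is_computed",
]
CONDITIONAL = ["affinity_value", "affinity_unit"]  # required unless binary-only


def missing_fields(row: dict) -> list:
    """Return the required columns that are absent or empty in row."""
    needed = list(REQUIRED)
    if not row.get("affinity_is_binary"):
        needed += CONDITIONAL
    return [c for c in needed if row.get(c) in (None, "")]


def write_output(source_dir: Path, source_id: str, rows: list) -> Path:
    """Check rows, write output/<source_id>.csv, and print a short summary."""
    bad = {r.get("entry_id"): m for r in rows if (m := missing_fields(r))}
    if bad:
        raise ValueError(f"rows missing required fields: {bad}")
    out_path = source_dir / "output" / f"{source_id}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Column order follows first appearance across rows (a simplification).
    fieldnames = list(dict.fromkeys(k for r in rows for k in r))
    with out_path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    counts = dict(Counter(r["affinity_type"] for r in rows))
    print(f"{source_id}: {len(rows)} rows, by affinity_type: {counts}")
    return out_path
```

Failing before anything is written keeps a partially valid CSV from reaching the collation step; a real script might prefer to log and skip bad rows instead.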

The master collate.py script then:

  1. Reads all intermediate CSV files (sources/*/output/*.csv)
  2. Deduplicates across sources (by DOI + peptide_sequence + target_protein)
  3. Resolves conflicting affinity values (prefer experimental > computed, SPR > FP > other)
  4. Computes derived features (binder categories)
  5. Outputs final dataset
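
The deduplication and conflict-resolution steps can be sketched as follows, assuming the intermediate CSVs live under each source's output/ directory as in the Architecture tree. The function names are hypothetical, and normalizing case and whitespace in the key is an assumption rather than a stated rule.

```python
import csv
from pathlib import Path

ASSAY_RANK = {"SPR": 0, "FP": 1}  # every other assay_method ranks after these


def load_rows(root: Path) -> list:
    """Read every intermediate CSV under root/sources/*/output/."""
    rows = []
    for path in sorted(root.glob("sources/*/output/*.csv")):
        with path.open(newline="") as fh:
            rows.extend(csv.DictReader(fh))
    return rows


def preference(row: dict) -> tuple:
    """Sort key: experimental before computed, then SPR, FP, anything else."""
    computed = str(row.get("affinity_is_computed") or "").strip().lower() == "true"
    return (computed, ASSAY_RANK.get(row.get("assay_method") or "", 2))


def dedup_key(row: dict) -> tuple:
    """DOI + peptide_sequence + target_protein, lightly normalized."""
    return (
        (row.get("doi") or "").strip().lower(),
        (row.get("peptide_sequence") or "").strip(),
        (row.get("target_protein") or "").strip().lower(),
    )


def collate(rows: list) -> list:
    """Keep the most preferred row for each deduplication key."""
    best = {}
    for row in rows:
        key = dedup_key(row)
        if key not in best or preference(row) < preference(best[key]):
            best[key] = row
    return list(best.values())
```

Because doi is optional in the schema, rows without a DOI collapse onto the same key whenever sequence and target match in this sketch; a real implementation may want a fallback such as pubmed_id or source_id.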

Execution Order

We’ll go Tier 1 first (largest data, least effort), then Tier 2, then Tier 3:

  1. 01_ppikb.py — PPIKB (likely 500-5000 cyclic entries)
  2. 02_cyclicpepedia.py — CyclicPepedia cross-reference
  3. 03_cpbind.py — Computed affinities from CPBind
  4. 04_pepbi.py — PEPBI thermodynamic data
  5. Then Tier 2 papers one by one…
  6. Then Tier 3 papers one by one…
  7. Finally: collate.py — merge everything

Each script is independently runnable and produces a verifiable intermediate CSV.
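
Since each script runs on its own, the order above can be driven by a small runner. The sketch below is one possible shape; `run_in_order` is a hypothetical helper, and the script paths passed to it are left to the caller because their exact locations are not fixed in this plan.

```python
import subprocess
import sys
from pathlib import Path


def run_in_order(scripts: list) -> list:
    """Run each script with the current interpreter; stop at the first failure."""
    completed = []
    for script in scripts:
        result = subprocess.run([sys.executable, str(script)])
        if result.returncode != 0:
            # Stopping here keeps collate.py from merging incomplete outputs.
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(Path(script).name)
    return completed
```

Passing the Tier 1 scripts first and collate.py last reproduces the order listed above.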