title: Cyclic Peptide–Protein Binding Affinity Dataset Curation Plan
tags: [plan, data-scraping, dataset, cyclic-peptide, binding-affinity]
created: 2026-05-10
updated: 2026-05-27
status: active
related:


Cyclic Peptide–Protein Binding Affinity Dataset Curation Plan

Goal

Curate a dataset of cyclic peptides with measured binding affinity (or binary bind/no-bind labels) against protein targets, suitable for training and evaluating ML models at Lemna.


1. Background & Keyword Landscape

Core terms

  • cyclic peptide, macrocyclic peptide, bicyclic peptide
  • stapled peptide, stapled helix, hydrocarbon-stapled peptide
  • thioether-macrocyclic peptide, disulfide-cyclized peptide, head-to-tail cyclic peptide
  • peptide macrocycle, constrained peptide, conformationally constrained peptide

Binding affinity measurement terms

  • Kd, KD (dissociation constant)
  • Ki (inhibition constant)
  • IC50 (half-maximal inhibitory concentration)
  • EC50 (half-maximal effective concentration)
  • Ka (equilibrium association constant, 1/Kd); kon, koff (kinetic association/dissociation rate constants)
  • ΔG, ΔH, ΔS (thermodynamic parameters, often from ITC)
  • binding / no binding (binary classification)
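
Because the terms above mix equilibrium constants (Kd, Ka) with thermodynamic quantities (ΔG), it may help to put them on one scale before merging sources. The following is a minimal sketch, assuming the standard relation ΔG° = RT ln(Kd / 1 M) and a default temperature of 298.15 K; the function names are illustrative and not part of any existing pipeline.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_to_delta_g(kd_molar: float, temp_k: float = 298.15) -> float:
    """Standard binding free energy in kcal/mol from Kd in mol/L (1 M standard state)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temp_k * math.log(kd_molar)


def kd_to_ka(kd_molar: float) -> float:
    """Equilibrium association constant (1/M) as the reciprocal of Kd."""
    return 1.0 / kd_molar


def kd_to_pkd(kd_molar: float) -> float:
    """pKd = -log10(Kd in mol/L), a convenient regression target."""
    return -math.log10(kd_molar)


if __name__ == "__main__":
    # A 1 nM binder corresponds to roughly -12.3 kcal/mol at 298.15 K.
    print(round(kd_to_delta_g(1e-9), 2), kd_to_pkd(1e-9))
```

IC50 and EC50 values depend on assay conditions, so this conversion applies only to true dissociation constants.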

Assay methods

  • SPR (surface plasmon resonance)
  • ITC (isothermal titration calorimetry)
  • FP (fluorescence polarization)
  • MST (microscale thermophoresis)
  • BLI (bio-layer interferometry)
  • ELISA, AlphaScreen, TR-FRET
  • affinity selection-mass spectrometry (AS-MS)
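  • phage display / mRNA display / RaPID system (selection platforms rather than affinity assays; hits are usually characterized with the methods above)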

Target context terms

  • protein-protein interaction (PPI) inhibitor/modulator
  • protein-targeting, receptor binding
  • drug target, therapeutic target

2. Existing Databases & Resources (to check and potentially integrate)

| Database | Content | Relevance | URL |
|----------|---------|-----------|-----|
| CPSea | 2.7M cyclic peptide-receptor complexes (AFDB-derived) + CPBind subset with affinity scores (Rosetta ddG < -25, Vina < -6) + CPBind_affinity.csv | HIGH: largest resource, has computed affinity | https://github.com/YZY010418/CPSea, Zenodo |
| CycPeptMPDB | 7,991 cyclic peptides with membrane permeability data | MEDIUM: permeability not binding, but has structures | http://cycpeptmpdb.com |
| PPIKB | Peptide-protein interaction structures + quantitative affinity measurements | HIGH: affinity + structure + targets | https://ppikb.duanlab.ac |
| BindingDB | Protein-ligand binding affinities (mostly small molecules, some peptides) | MEDIUM: filter for cyclic peptide entries | https://www.bindingdb.org |
| PEPBI | 329 peptide-protein complexes with ΔG, ΔH, ΔS (ITC-derived) | MEDIUM: small but high-quality thermodynamic data | Nature Sci Data 2025 |
| PepBDB / PepSet | Peptide-protein complexes from PDB | LOW-MEDIUM: structural, affinity not always present | |
| PepBenchmark | 29 canonical + 6 non-canonical peptide ML datasets | MEDIUM: includes some affinity datasets | https://github.com/ZGCI |
| CyclicPepedia | 8,751 cyclic peptides, 59 targets | MEDIUM: broad coverage, check for affinity data | https://www.biosino.org/iMAC/cyclicpepedia |
| CREMP | 36 macrocyclic peptide conformer ensembles (structural, not affinity) | LOW: structural only | Zenodo |

3. Search Strategy with Paperclip

Phase 3A: Broad discovery searches

Use multiple keyword combinations to capture the full literature:

  1. cyclic peptide protein binding affinity Kd
  2. macrocyclic peptide protein target IC50 Ki
  3. stapled peptide protein binding affinity measured
  4. bicyclic peptide protein target SPR ITC
  5. thioether macrocyclic peptide protein binding
  6. disulfide cyclized peptide protein target affinity
  7. head-to-tail cyclic peptide receptor binding
  8. peptide-protein interaction inhibitor cyclic macrocyclic
  9. RaPID system cyclic peptide binding affinity
  10. mRNA display cyclic peptide protein target Kd
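
The queries above mostly pair a peptide-class term with an affinity or assay term, so further combinations can be generated rather than typed by hand. A minimal sketch, assuming a small subset of the Section 1 keywords and a fixed context phrase; `build_queries` is a hypothetical helper, and how the strings are submitted to Paperclip is deliberately left out because its interface is not described here.

```python
from itertools import product

# Illustrative subsets of the Section 1 keyword lists.
PEPTIDE_TERMS = ["cyclic peptide", "macrocyclic peptide", "stapled peptide", "bicyclic peptide"]
AFFINITY_TERMS = ["Kd", "IC50", "Ki", "SPR", "ITC"]


def build_queries(peptide_terms, affinity_terms, context="protein binding affinity"):
    """Return de-duplicated query strings of the form '<peptide> <context> <affinity>'."""
    seen = set()
    queries = []
    for peptide, affinity in product(peptide_terms, affinity_terms):
        query = f"{peptide} {context} {affinity}"
        key = query.lower()
        if key not in seen:  # skip case-insensitive duplicates
            seen.add(key)
            queries.append(query)
    return queries


if __name__ == "__main__":
    for query in build_queries(PEPTIDE_TERMS, AFFINITY_TERMS):
        print(query)
```

Generated queries are a supplement to the hand-written list, not a replacement: several of the hand-written entries (for example the RaPID and mRNA display queries) do not fit this template.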

Phase 3B: Targeted affinity-measurement searches

Focus on specific assay types and quantitative measurements:

  1. "cyclic peptide" protein "isothermal titration calorimetry" binding
  2. "cyclic peptide" protein SPR "dissociation constant"
  3. "cyclic peptide" "fluorescence polarization" protein affinity
  4. cyclic peptide protein binding "no binding" OR "non-binder"
  5. macrocyclic peptide selectivity "protein-protein interaction" affinity

Phase 3C: Methods & screening searches

  1. phage display cyclic peptide binder protein target
  2. mRNA display macrocyclic peptide protein binder affinity
  3. OBOC cyclic peptide protein target screening
  4. affinity selection mass spectrometry cyclic peptide protein
  5. in vitro selection cyclic peptide de novo protein target

Phase 3D: Dataset / database papers

  1. cyclic peptide protein interaction database dataset
  2. peptide protein binding affinity benchmark machine learning
  3. cyclic peptide binder design computational dataset

4. Data Extraction Pipeline

For each relevant paper found:

Fields to extract

  • Peptide info: sequence (one-letter if canonical, HELM notation if modified), cyclization type (disulfide, head-to-tail, thioether, stapled, lactam, other), ring size, molecular weight
  • Protein target info: name, UniProt ID, PDB ID (if available), protein family/class
  • Binding data: Kd/Ki/IC50/EC50 value, unit, assay method, assay conditions (pH, temp, buffer if available), binary bind/no-bind label if available
  • Structure: PDB ID of complex if available
  • Source: DOI, year, authors
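
The fields above could be captured in a single record type so that every extracted entry has the same shape. This is a sketch under assumptions: the field names, the choice of which fields are required, and the validation rules are illustrative, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, asdict
from typing import Optional

CYCLIZATION_TYPES = {"disulfide", "head-to-tail", "thioether", "stapled", "lactam", "other"}
AFFINITY_TYPES = {"Kd", "Ki", "IC50", "EC50"}


@dataclass
class BindingRecord:
    # Required: peptide, target, and source identifiers
    sequence: str              # one-letter code or HELM string
    sequence_notation: str     # "one-letter" or "HELM"
    cyclization_type: str
    target_name: str
    doi: str
    # Optional peptide and target details
    ring_size: Optional[int] = None
    molecular_weight: Optional[float] = None
    uniprot_id: Optional[str] = None
    target_pdb_id: Optional[str] = None
    protein_family: Optional[str] = None
    # Binding data: quantitative value and/or binary label
    affinity_type: Optional[str] = None
    affinity_value: Optional[float] = None
    affinity_unit: Optional[str] = None
    assay_method: Optional[str] = None
    assay_conditions: Optional[str] = None  # pH, temperature, buffer as free text
    binds: Optional[bool] = None
    # Structure and remaining source metadata
    complex_pdb_id: Optional[str] = None
    year: Optional[int] = None
    authors: Optional[str] = None

    def __post_init__(self):
        if self.cyclization_type not in CYCLIZATION_TYPES:
            raise ValueError(f"unknown cyclization type: {self.cyclization_type}")
        if self.affinity_type is not None and self.affinity_type not in AFFINITY_TYPES:
            raise ValueError(f"unknown affinity type: {self.affinity_type}")
        if self.affinity_value is None and self.binds is None:
            raise ValueError("record needs a quantitative value or a binary label")


# Placeholder values only, not a real measurement.
example = BindingRecord(
    sequence="CAAAAC",
    sequence_notation="one-letter",
    cyclization_type="disulfide",
    target_name="EXAMPLE_TARGET",
    doi="10.0000/placeholder",
    binds=True,
)
```

`asdict(example)` yields a flat dictionary that can be written directly to CSV or JSON.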

Classification criteria

  • Binder: reported Kd, Ki, or IC50 < 10 µM (or explicitly stated as a binder)
  • Non-binder: explicitly reported as non-binding, or Kd > 100 µM (provisional cutoff, to be refined); values between 10 µM and 100 µM are not assigned to either class by these cutoffs
  • Strong binder: Kd < 100 nM
  • Moderate binder: 100 nM ≤ Kd < 1 µM
  • Weak binder: 1 µM ≤ Kd < 10 µM
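
The cutoffs above can be applied mechanically once values share a unit. A minimal sketch, with two labeled assumptions: a value that falls exactly on a boundary is assigned to the weaker class, and anything from 10 µM to 100 µM is returned as "unassigned" because the criteria do not yet cover that range. The function and table names are illustrative.

```python
# Conversion factors to nanomolar; both micro-sign code points are accepted.
UNIT_TO_NM = {
    "M": 1e9,
    "mM": 1e6,
    "uM": 1e3,
    "\u00b5M": 1e3,  # micro sign
    "\u03bcM": 1e3,  # Greek small letter mu
    "nM": 1.0,
    "pM": 1e-3,
}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nM; raises KeyError for an unknown unit."""
    return value * UNIT_TO_NM[unit]


def classify(value: float, unit: str) -> str:
    """Map a Kd/Ki/IC50 value onto the plan's provisional affinity classes."""
    nm = to_nanomolar(value, unit)
    if nm <= 0:
        raise ValueError("affinity must be positive")
    if nm < 100:
        return "strong binder"
    if nm < 1_000:
        return "moderate binder"
    if nm < 10_000:
        return "weak binder"
    if nm > 100_000:
        return "non-binder"
    return "unassigned"  # assumption: 10-100 uM is left unlabeled for now


if __name__ == "__main__":
    for value, unit in [(50, "nM"), (0.5, "uM"), (5, "uM"), (50, "uM"), (200, "uM")]:
        print(value, unit, "->", classify(value, unit))
```

Entries that a paper explicitly reports as binding or non-binding would override this label, as the criteria state.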

5. Priority Actions & Deliverables

| Step | Action | Status |
|------|--------|--------|
| 1 | Create data-scraping/ folder | DONE |
| 2 | Broad keyword research (web) | DONE |
| 3 | Draft plan for review | DONE |
| 4 | Execute Paperclip searches (Phase 3A-D) | TODO |
| 5 | Check existing databases (CPSea, PPIKB, BindingDB, CyclicPepedia) for downloadable datasets | TODO |
| 6 | Extract data from top papers | TODO |
| 7 | Normalize & deduplicate entries | TODO |
| 8 | Save curated dataset as CSV/JSON in data-scraping/ | TODO |
| 9 | Summary statistics & quality report | TODO |
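
Steps 7 and 8 could look roughly like the sketch below. It assumes entries have already been converted to nanomolar, treats rows with the same sequence, UniProt ID, and affinity type as replicates, and merges them by geometric mean; the merge rule, the column names, and the rows themselves are placeholders, not decisions made in this plan. Sequence case is left untouched because lowercase letters are sometimes used for D-amino acids.

```python
import csv
import io
import math
from collections import defaultdict


def deduplicate(rows):
    """Merge replicate measurements of the same peptide/target/affinity type."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["sequence"].strip(), row["uniprot_id"].strip().upper(), row["affinity_type"])
        groups[key].append(row)
    merged = []
    for (sequence, uniprot_id, affinity_type), members in groups.items():
        values = [float(m["affinity_nm"]) for m in members]
        geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
        merged.append({
            "sequence": sequence,
            "uniprot_id": uniprot_id,
            "affinity_type": affinity_type,
            "affinity_nm": geo_mean,
            "n_measurements": len(values),
            "dois": ";".join(sorted({m["doi"] for m in members})),
        })
    return merged


def to_csv(rows):
    """Serialize merged rows to CSV text (empty string if there are no rows)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows for illustration only.
rows = [
    {"sequence": "CAAAAC", "uniprot_id": "P00000", "affinity_type": "Kd", "affinity_nm": 10, "doi": "10.0000/a"},
    {"sequence": "CAAAAC ", "uniprot_id": "p00000", "affinity_type": "Kd", "affinity_nm": 1000, "doi": "10.0000/b"},
    {"sequence": "CGGGGC", "uniprot_id": "P00000", "affinity_type": "IC50", "affinity_nm": 250, "doi": "10.0000/a"},
]

if __name__ == "__main__":
    print(to_csv(deduplicate(rows)))
```

Keeping `n_measurements` and the contributing DOIs in the merged row makes it possible to report replicate spread in the step 9 quality report.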

6. Open Questions for User

  • Binding affinity threshold: Use 10 µM for binder/non-binder split? Or different cutoff?
  • Cyclization types: Include all (disulfide, head-to-tail, thioether, stapled, lactam, etc.) or focus on specific classes?
  • Non-canonical amino acids: Include peptides with N-methylated residues, D-amino acids, or other modifications?
  • Binary vs. regression: Do you want both continuous (Kd) and binary (bind/no-bind) labels, or prioritize one?
  • Synthetic vs. natural: Include natural cyclic peptide products (cyclotides, etc.) or only de novo designed ones?
  • Dataset size target: Are you aiming for hundreds, thousands, or tens of thousands of entries?
  • Primary use case: Model training (need large dataset)? Benchmarking (need high-quality, diverse)? Both?