title: Cyclic Peptide–Protein Binding Affinity Dataset Curation Plan
tags: [plan, data-scraping, dataset, cyclic-peptide, binding-affinity]
created: 2026-05-10
updated: 2026-05-27
status: active
related:
Cyclic Peptide–Protein Binding Affinity Dataset Curation Plan
Goal
Curate a dataset of cyclic peptides with measured binding affinity (or binary bind/non-bind) against protein targets, suitable for training/evaluating ML models at Lemna.
1. Background & Keyword Landscape
Core terms
- cyclic peptide, macrocyclic peptide, bicyclic peptide
- stapled peptide, stapled helix, hydrocarbon-stapled peptide
- thioether-macrocyclic peptide, disulfide-cyclized peptide, head-to-tail cyclic peptide
- peptide macrocycle, constrained peptide, conformationally constrained peptide
Binding affinity measurement terms
- Kd, KD (dissociation constant)
- Ki (inhibition constant)
- IC50 (half-maximal inhibitory concentration)
- EC50 (half-maximal effective concentration)
- Ka (equilibrium association constant); kon, koff (kinetic association/dissociation rate constants)
- ΔG, ΔH, ΔS (thermodynamic parameters, often from ITC)
- binding / no binding (binary classification)
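Because sources report either Kd or ΔG, it can help to put them on one scale. The sketch below uses the standard relation ΔG° = RT ln(Kd), with Kd in mol/L and a 1 M standard state. The function names and the 298.15 K default temperature are assumptions for illustration; use the assay temperature where the paper reports it.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def kd_to_delta_g(kd_molar: float, temp_kelvin: float = 298.15) -> float:
    """Standard binding free energy (kcal/mol) from Kd in mol/L."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temp_kelvin * math.log(kd_molar)


def delta_g_to_kd(delta_g_kcal: float, temp_kelvin: float = 298.15) -> float:
    """Inverse conversion: Kd in mol/L from a binding free energy in kcal/mol."""
    return math.exp(delta_g_kcal / (R_KCAL * temp_kelvin))


# A 1 nM binder at 25 °C corresponds to roughly -12.3 kcal/mol.
print(round(kd_to_delta_g(1e-9), 2))
```

This conversion applies to Kd only; IC50 and EC50 depend on assay conditions and should not be converted this way without further assumptions.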
Assay methods
- SPR (surface plasmon resonance)
- ITC (isothermal titration calorimetry)
- FP (fluorescence polarization)
- MST (microscale thermophoresis)
- BLI (bio-layer interferometry)
- ELISA, AlphaScreen, TR-FRET
- affinity selection-mass spectrometry (AS-MS)
- phage display / mRNA display / RaPID system
Target context terms
- protein-protein interaction (PPI) inhibitor/modulator
- protein-targeting, receptor binding
- drug target, therapeutic target
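One way to turn the term lists above into candidate search strings is a simple cross product. This is a minimal sketch: the `build_queries` helper, the query template, and the shortened term lists are assumptions for illustration, and the output is plain text that would still need to be pasted into or adapted for whatever search tool is used.

```python
from itertools import product

# Shortened samples of the term lists above.
CORE_TERMS = ["cyclic peptide", "macrocyclic peptide", "stapled peptide"]
MEASUREMENT_TERMS = ["Kd", "IC50", "Ki"]
ASSAY_TERMS = ["SPR", "ITC"]


def build_queries(core, measurements, assays):
    """Combine one term from each list into a query, skipping duplicates."""
    queries, seen = [], set()
    for c, m, a in product(core, measurements, assays):
        query = f'"{c}" protein binding {m} {a}'
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


queries = build_queries(CORE_TERMS, MEASUREMENT_TERMS, ASSAY_TERMS)
print(len(queries))
print(queries[0])
```

The full cross product grows quickly, so in practice a hand-picked subset is likely more useful than every combination.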
2. Existing Databases & Resources (to check and potentially integrate)
| Database | Content | Relevance | URL |
|---|---|---|---|
| CPSea | 2.7M cyclic peptide-receptor complexes (AFDB-derived) + CPBind subset with affinity scores (Rosetta ddG < -25, Vina < -6) + CPBind_affinity.csv | HIGH — largest resource, has computed affinity | https://github.com/YZY010418/CPSea, Zenodo |
| CycPeptMPDB | 7,991 cyclic peptides with membrane permeability data | MEDIUM — permeability not binding, but has structures | http://cycpeptmpdb.com |
| PPIKB | Peptide-protein interaction structures + quantitative affinity measurements | HIGH — affinity + structure + targets | https://ppikb.duanlab.ac |
| BindingDB | Protein-ligand binding affinities (mostly small molecules, some peptides) | MEDIUM — filter for cyclic peptide entries | https://www.bindingdb.org |
| PEPBI | 329 peptide-protein complexes with ΔG, ΔH, ΔS (ITC-derived) | MEDIUM — small but high-quality thermodynamic data | Nature Sci Data 2025 |
| PepBDB / PepSet | Peptide-protein complexes from PDB | LOW-MEDIUM — structural, affinity not always present | |
| PepBenchmark | 29 canonical + 6 non-canonical peptide ML datasets | MEDIUM — includes some affinity datasets | https://github.com/ZGCI |
| CyclicPepedia | 8,751 cyclic peptides, 59 targets | MEDIUM — broad coverage, check for affinity data | https://www.biosino.org/iMAC/cyclicpepedia |
| CREMP | Conformer ensembles for 36,198 macrocyclic peptides (structural, not affinity) | LOW — structural only | Zenodo |
3. Search Strategy with Paperclip
Phase 3A: Broad discovery searches
Use multiple keyword combinations to capture the full literature:
- cyclic peptide protein binding affinity Kd
- macrocyclic peptide protein target IC50 Ki
- stapled peptide protein binding affinity measured
- bicyclic peptide protein target SPR ITC
- thioether macrocyclic peptide protein binding
- disulfide cyclized peptide protein target affinity
- head-to-tail cyclic peptide receptor binding
- peptide-protein interaction inhibitor cyclic macrocyclic
- RaPID system cyclic peptide binding affinity
- mRNA display cyclic peptide protein target Kd
Phase 3B: Targeted affinity-measurement searches
Focus on specific assay types and quantitative measurements:
- "cyclic peptide" protein "isothermal titration calorimetry" binding
- "cyclic peptide" protein SPR "dissociation constant"
- "cyclic peptide" "fluorescence polarization" protein affinity
- cyclic peptide protein binding "no binding" OR "non-binder"
- macrocyclic peptide selectivity "protein-protein interaction" affinity
Phase 3C: Methods & screening searches
- phage display cyclic peptide binder protein target
- mRNA display macrocyclic peptide protein binder affinity
- OBOC cyclic peptide protein target screening
- affinity selection mass spectrometry cyclic peptide protein
- in vitro selection cyclic peptide de novo protein target
Phase 3D: Dataset / database papers
- cyclic peptide protein interaction database dataset
- peptide protein binding affinity benchmark machine learning
- cyclic peptide binder design computational dataset
4. Data Extraction Pipeline
For each relevant paper found:
Fields to extract
- Peptide info: sequence (one-letter if canonical, HELM notation if modified), cyclization type (disulfide, head-to-tail, thioether, stapled, lactam, other), ring size, molecular weight
- Protein target info: name, UniProt ID, PDB ID (if available), protein family/class
- Binding data: Kd/Ki/IC50/EC50 value, unit, assay method, assay conditions (pH, temp, buffer if available), binary bind/no-bind label if available
- Structure: PDB ID of complex if available
- Source: DOI, year, authors
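The fields above could be captured in a single record type so that every extracted entry has the same shape. The sketch below is one possible layout, not a fixed schema: the class name `BindingRecord`, the attribute names, and the validation rules are assumptions, and the example values are placeholders rather than real data.

```python
from dataclasses import dataclass, asdict
from typing import Optional

CYCLIZATION_TYPES = {"disulfide", "head-to-tail", "thioether", "stapled", "lactam", "other"}
AFFINITY_TYPES = {"Kd", "Ki", "IC50", "EC50"}


@dataclass
class BindingRecord:
    # Required: peptide, target, and source identifiers
    sequence: str                 # one-letter code or HELM string
    cyclization_type: str
    target_name: str
    doi: str
    year: int
    # Optional peptide and target details
    notation: str = "one-letter"  # or "HELM" for modified peptides
    ring_size: Optional[int] = None
    molecular_weight: Optional[float] = None
    uniprot_id: Optional[str] = None
    target_pdb_id: Optional[str] = None
    protein_family: Optional[str] = None
    # Binding data
    affinity_type: Optional[str] = None
    affinity_value: Optional[float] = None
    affinity_unit: Optional[str] = None
    assay_method: Optional[str] = None
    assay_conditions: Optional[str] = None
    binds: Optional[bool] = None  # binary label, if reported
    # Structure and remaining source info
    complex_pdb_id: Optional[str] = None
    authors: Optional[str] = None

    def __post_init__(self):
        if self.cyclization_type not in CYCLIZATION_TYPES:
            raise ValueError(f"unknown cyclization type: {self.cyclization_type}")
        if self.affinity_type is not None and self.affinity_type not in AFFINITY_TYPES:
            raise ValueError(f"unknown affinity type: {self.affinity_type}")
        if self.affinity_value is not None and self.affinity_unit is None:
            raise ValueError("affinity_value requires affinity_unit")


# Placeholder entry (made-up values, not from any paper).
example = BindingRecord(
    sequence="CAAAAAAC", cyclization_type="disulfide", target_name="ExampleTarget",
    doi="10.0000/placeholder", year=2024,
    affinity_type="Kd", affinity_value=50.0, affinity_unit="nM", assay_method="SPR",
)
print(asdict(example)["affinity_value"])
```

Keeping the value and unit as reported, and normalizing later, preserves what the source paper actually stated.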
Classification criteria
- Binder: reported Kd, Ki, or IC50 < 10 µM, or explicitly stated to bind
- Non-binder: explicitly reported as non-binding, or Kd > 100 µM (cutoff to be refined)
- Unlabeled: values from 10 µM to 100 µM fall in neither class under the current cutoffs; their treatment is still to be decided
- Strong binder: Kd < 100 nM
- Moderate binder: 100 nM ≤ Kd < 1 µM
- Weak binder: 1 µM ≤ Kd < 10 µM
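The cutoffs above can be applied mechanically once values share a unit. The following is a minimal sketch under stated assumptions: the function names are hypothetical, values exactly on a boundary are assigned to the weaker class, and anything between 10 µM and 100 µM is returned as "unlabeled" because the criteria do not yet cover that range.

```python
# Multipliers to nanomolar; both the micro sign and Greek mu are accepted.
UNIT_TO_NM = {"M": 1e9, "mM": 1e6, "uM": 1e3, "µM": 1e3, "μM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nM; raises KeyError for unknown units."""
    return value * UNIT_TO_NM[unit]


def classify_kd(value: float, unit: str) -> str:
    """Map a Kd to the draft classes (boundary handling is an assumption)."""
    kd_nm = to_nanomolar(value, unit)
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    if kd_nm < 100:
        return "strong binder"
    if kd_nm < 1_000:
        return "moderate binder"
    if kd_nm < 10_000:
        return "weak binder"
    if kd_nm > 100_000:
        return "non-binder"
    return "unlabeled"  # 10 µM to 100 µM: not covered by the current cutoffs


for value, unit in [(50, "nM"), (0.5, "µM"), (5, "µM"), (50, "µM"), (200, "µM")]:
    print(value, unit, "->", classify_kd(value, unit))
```

Entries that a paper explicitly reports as non-binding would need a separate flag, since they often come without a numeric Kd.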
5. Priority Actions & Deliverables
| Step | Action | Status |
|---|---|---|
| 1 | Create data-scraping/ folder | DONE |
| 2 | Broad keyword research (web) | DONE |
| 3 | Draft plan for review | DONE |
| 4 | Execute Paperclip searches (Phase 3A-D) | TODO |
| 5 | Check existing databases (CPSea, PPIKB, BindingDB, CyclicPepedia) for downloadable datasets | TODO |
| 6 | Extract data from top papers | TODO |
| 7 | Normalize & deduplicate entries | TODO |
| 8 | Save curated dataset as CSV/JSON in data-scraping/ | TODO |
| 9 | Summary statistics & quality report | TODO |
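For step 7, one possible deduplication rule is to group entries that share peptide, target, affinity type, and assay, then merge repeated measurements by geometric mean. This is a sketch of that idea only: the dictionary keys, the grouping key, and the choice of geometric mean are assumptions, and the DOIs in the example are placeholders.

```python
import math
from collections import defaultdict


def dedupe(records):
    """Merge records that share (sequence, target, affinity type, assay)."""
    groups = defaultdict(list)
    for r in records:
        # Only whitespace is stripped: case is kept because lowercase letters
        # are often used for D-amino acids in one-letter sequences.
        key = (r["sequence"].strip(), r["uniprot_id"], r["affinity_type"], r["assay"])
        groups[key].append(r)

    merged = []
    for (sequence, uniprot_id, affinity_type, assay), rows in groups.items():
        logs = [math.log10(r["value_nm"]) for r in rows]
        merged.append({
            "sequence": sequence,
            "uniprot_id": uniprot_id,
            "affinity_type": affinity_type,
            "assay": assay,
            "value_nm": 10 ** (sum(logs) / len(logs)),  # geometric mean
            "n_measurements": len(rows),
            "dois": sorted({r["doi"] for r in rows}),
        })
    return merged


# Placeholder rows (made-up values and DOIs).
rows = [
    {"sequence": "CAAAAAAC", "uniprot_id": "P00000", "affinity_type": "Kd",
     "assay": "SPR", "value_nm": 10.0, "doi": "10.0000/example-a"},
    {"sequence": " CAAAAAAC", "uniprot_id": "P00000", "affinity_type": "Kd",
     "assay": "SPR", "value_nm": 1000.0, "doi": "10.0000/example-b"},
    {"sequence": "CAAAAAAC", "uniprot_id": "P00000", "affinity_type": "IC50",
     "assay": "FP", "value_nm": 250.0, "doi": "10.0000/example-a"},
]
for row in dedupe(rows):
    print(row["affinity_type"], row["n_measurements"], round(row["value_nm"], 1))
```

Keeping Kd and IC50 in separate groups avoids averaging quantities that are not directly comparable.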
6. Open Questions for User
- Binding affinity threshold: Use 10 µM for binder/non-binder split? Or different cutoff?
- Cyclization types: Include all (disulfide, head-to-tail, thioether, stapled, lactam, etc.) or focus on specific classes?
- Non-canonical amino acids: Include peptides with N-methylated residues, D-amino acids, or other modifications?
- Binary vs. regression: Do you want both continuous (Kd) and binary (bind/no-bind) labels, or prioritize one?
- Synthetic vs. natural: Include natural cyclic peptide products (cyclotides, etc.) or only de novo designed ones?
- Dataset size target: Are you aiming for hundreds, thousands, or tens of thousands of entries?
- Primary use case: Model training (need large dataset)? Benchmarking (need high-quality, diverse)? Both?