title: Cyclic Peptide–Protein Binding Affinity Dataset Curation Summary
tags: [data-scraping, dataset, summary, cyclic-peptide, binding-affinity]
created: 2026-05-11
updated: 2026-05-27
status: active
related:
Cyclic Peptide–Protein Binding Affinity Dataset: Curation Summary
Overview
Search completed across Paperclip (23 queries), web searches, and database checks.
Total papers collected: 80 (62 with experimental binding data)
Key Databases (Downloadable)
1. PPIKB — PRIMARY RESOURCE
- URL: https://ppikb.duanlab.ac
- Downloads: https://ppikb.duanlab.ac/downloads/
- Content: 40,347 entries total (19,509 research + ~2,815 patent entries)
- Data includes: Kd, Ki, IC50 values, peptide sequences, protein targets, PDB IDs
- Files available:
  - `Main.xlsx`: Full affinity dataset (sequences, affinity data, DOIs, PMIDs)
  - `Branch.xlsx`: Refined dataset with matched PDB crystal structures
  - `research.xlsx`: Research paper-derived affinity data
  - `structure.xlsx`: PDB structural data
- Filtering needed: Filter for cyclic peptides (look for cyclic/cyclized/disulfide/macrocycle keywords in sequence or annotation)
- License: CC-BY 4.0
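
The keyword filter described above could be sketched as follows. This is a minimal illustration, not a tested PPIKB loader: the keyword list is assembled from the terms named in this summary, and the field names (`name`, `annotation`) and the toy rows are hypothetical.

```python
import re

# Keyword list is an assumption, assembled from the terms named in this summary.
# "cycl" covers cyclic, cyclized, cyclo, bicyclic, and macrocycle.
CYCLIC_PATTERN = re.compile(
    r"cycl|disulfide|stapled|thioether|head-to-tail|lactam",
    re.IGNORECASE,
)

def is_probably_cyclic(*text_fields):
    """Return True if any non-empty text field mentions a cyclization keyword."""
    return any(bool(f) and CYCLIC_PATTERN.search(f) is not None for f in text_fields)

# Toy rows with hypothetical field names; real PPIKB columns may differ.
rows = [
    {"name": "cyclo(RGDfK)", "annotation": "head-to-tail cyclic peptide"},
    {"name": "p53 17-28", "annotation": "linear peptide"},
    {"name": "PMI-stapled", "annotation": "hydrocarbon stapled helix"},
]
cyclic_rows = [r for r in rows if is_probably_cyclic(r["name"], r["annotation"])]
```

The pattern should only be applied to peptide-side fields: `cycl` also matches target names such as cyclin or cyclophilin, so matching against target columns would produce false positives.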
2. CPSea / CPBind — COMPUTED AFFINITY + STRUCTURES
- URL: https://github.com/YZY010418/CPSea
- Kaggle: https://www.kaggle.com/datasets/ziyiyang180104/cpsea
- Zenodo: https://zenodo.org/records/17324994
- Content: 2.7M cyclic peptide-receptor complexes (AFDB-derived)
- Key subsets:
  - `CPBind_affinity.csv`: 476K entries with Rosetta ddG < -25 and Vina score < -6
  - `CPBind_basic.tsv`: Basic info and filter metrics
  - `CPBind_validity.csv`: Ramachandran and interaction quality
  - `CPBind_hydrophobic.csv`: GRAVY, logP, rTPSA
  - `CPBind_cluster.tsv`: Foldseek cluster assignments
- Note: Affinity is COMPUTED (Rosetta ddG + Vina), not experimental. Useful for training, but needs an experimental validation benchmark.
- License: Research/academic use
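
If the stated cutoffs need to be re-checked or tightened after download, a sketch like the one below may help. The column names (`complex_id`, `rosetta_ddg`, `vina_score`) are assumptions, not the actual CPBind headers, and the sample rows are made up.

```python
import csv
import io

# Cutoffs as stated in this summary for CPBind_affinity.csv.
DDG_CUTOFF = -25.0
VINA_CUTOFF = -6.0

def passes_cpbind_cutoffs(row, ddg_col="rosetta_ddg", vina_col="vina_score"):
    """Check both computed-affinity cutoffs; column names are hypothetical."""
    return float(row[ddg_col]) < DDG_CUTOFF and float(row[vina_col]) < VINA_CUTOFF

# Made-up sample standing in for the real CSV file.
sample = io.StringIO(
    "complex_id,rosetta_ddg,vina_score\n"
    "a,-31.2,-7.4\n"
    "b,-18.0,-8.1\n"
    "c,-27.5,-5.2\n"
)
kept = []
for row in csv.DictReader(sample):
    if passes_cpbind_cutoffs(row):
        row["affinity_origin"] = "computed"  # keep computed data distinguishable
        kept.append(row)
```

Tagging each retained row as computed at load time keeps these entries separable from experimental measurements later on.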
3. CyclicPepedia — NATURAL + SYNTHETIC CYCLIC PEPTIDES
- URL: https://www.biosino.org/iMAC/cyclicpepedia/
- Downloads: https://www.biosino.org/iMAC/cyclicpepedia/download
- Content: 8,751 cyclic peptides, 59 targets, 821 sources
- Key files:
  - `Peptide_basic_info.xlsx` (2.4 MB)
  - `Peptide_to_target.xlsx`: Maps peptides to protein targets
  - `Target.xlsx`: Target protein info
  - `Bioassay.xlsx`: Bioassay data (280 KB)
  - `Peptide_sequence_info.xlsx`
  - `structure_complex.zip`: 3D PDB files of peptide-target complexes
  - `structure_pdb.zip`: 3D PDB files of cyclic peptides
- Note: Bioassay data likely includes some binding affinity measurements but may not be comprehensive.
4. CycPeptMPDB — PERMEABILITY (NOT BINDING)
- URL: http://cycpeptmpdb.com
- Content: 7,991 cyclic peptides with permeability data
- Note: Permeability only, NOT binding affinity. Useful as supplementary ADME data but does not directly serve our purpose.
5. PEPBI — THERMODYNAMIC DATA (LINEAR PEPTIDES)
- Source: Nature Scientific Data (2025)
- Content: 329 peptide-protein complexes with ΔG, ΔH, ΔS (ITC)
- Note: Mostly linear peptides (5-20 aa). Could filter for cyclic entries but likely very few.
6. PepBenchmark — ML BENCHMARK
- URL: https://github.com/ZGCI (PepBenchmark)
- Content: 29 canonical + 6 non-canonical peptide datasets
- Note: Includes some binding affinity prediction tasks. Check for cyclic peptide subsets.
Paper Collection Summary
80 papers catalogued in paper_collection.json
By cyclization type:
| Type | Count | Notes |
|---|---|---|
| Thioether macrocyclic | ~12 | Most common in mRNA/RaPID display papers |
| Hydrocarbon stapled | ~8 | MDM2/p53, helix mimetics |
| Disulfide constrained | ~6 | Phage display, DRPs |
| Lactam stapled | ~2 | Helical peptides |
| Head-to-tail cyclic | ~5 | Native cyclization |
| Bicyclic | ~6 | Including bismuth, CPP-conjugated |
| Multiple/various | ~15 | Comparison studies |
| Other (thioether-bipyridyl, CPPC, etc.) | ~6 | Specialized chemistries |
By assay type:
| Assay | Count |
|---|---|
| SPR (Kd, kon, koff) | ~25 |
| FP (Kd, IC50) | ~15 |
| ITC (Kd, ΔG, ΔH, ΔS) | ~8 |
| AS-MS (bind/no-bind) | ~3 |
| Enzymatic IC50 | ~10 |
| Competition binding | ~8 |
| Yeast/phage display enrichment | ~5 |
| Computed (ddG, Vina score) | ~5 |
By target protein (top):
| Target | Papers |
|---|---|
| MDM2/MDMX | ~5 |
| BET bromodomains | ~3 |
| SARS-CoV-2 Spike | ~3 |
| FGF receptors | ~3 |
| PTP1B | ~2 |
| Keap1 | ~2 |
| TNFα | ~2 |
| Various kinases | ~3 |
| GPCRs | ~2 |
| Other diverse targets | ~40+ |
Recommended Dataset Assembly Strategy
Phase 1: Download & Filter Databases
- PPIKB: Download `Main.xlsx` and `Branch.xlsx`. Filter for cyclic peptides (search for cyclic disulfide, macrocyclic, stapled, thioether, head-to-tail cyclization keywords in peptide name/description)
- CyclicPepedia: Download `Peptide_to_target.xlsx`, `Target.xlsx`, `Bioassay.xlsx`, `Peptide_basic_info.xlsx`. Join tables to extract binding affinity entries
- CPSea/CPBind: Download `CPBind_affinity.csv` for computed affinity scores (Rosetta ddG + Vina). Use as supplementary training data
Phase 2: Literature Extraction
- Read the 62 papers with experimental binding data using Paperclip’s full-text access
- Extract individual (peptide, protein, affinity, assay, conditions) tuples
- Prioritize papers with large-scale datasets (e.g., RaPID libraries, DEL screening, phage display hit validation)
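
The CyclicPepedia table join mentioned above might look roughly like this. It is a sketch under stated assumptions: the join keys (`peptide_id`, `target_id`) and the other field names are hypothetical, since the actual spreadsheet columns have not been inspected here, and the rows are placeholders.

```python
def join_bioassays(peptide_to_target, targets, bioassays):
    """Inner-join three tables on hypothetical peptide_id / target_id keys."""
    target_by_id = {t["target_id"]: t for t in targets}
    assays_by_peptide = {}
    for assay in bioassays:
        assays_by_peptide.setdefault(assay["peptide_id"], []).append(assay)

    joined = []
    for link in peptide_to_target:
        target = target_by_id.get(link["target_id"])
        if target is None:
            continue  # link points at a target missing from the target table
        for assay in assays_by_peptide.get(link["peptide_id"], []):
            joined.append({
                "peptide_id": link["peptide_id"],
                "target_name": target["name"],
                "assay_type": assay["assay_type"],
                "value": assay["value"],
            })
    return joined

# Placeholder rows; P2 links to a target that is absent, so it is dropped.
links = [{"peptide_id": "P1", "target_id": "T1"},
         {"peptide_id": "P2", "target_id": "T9"}]
targets = [{"target_id": "T1", "name": "example target"}]
bioassays = [{"peptide_id": "P1", "assay_type": "IC50", "value": 120.0},
             {"peptide_id": "P2", "assay_type": "Kd", "value": 15.0}]
joined = join_bioassays(links, targets, bioassays)
```

This assumes every bioassay row for a peptide applies to each of its linked targets. If `Bioassay.xlsx` carries its own target column, the join should use both keys instead.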
Phase 3: Normalize & Combine
- Standardize affinity values to Kd (nM) where possible
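- Convert IC50 to Ki using Cheng-Prusoff where the assay conditions (probe or substrate concentration and its Kd or Km) are reported, and treat Ki as an approximate Kd where Kd is not directly reported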
- Add binary binder/non-binder labels at 10 µM threshold
- Deduplicate across sources (PPIKB, CyclicPepedia, literature extraction)
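
The normalization steps above can be sketched as below. For a competitive binding assay, the Cheng-Prusoff relation is Ki = IC50 / (1 + [L]/Kd_L), where [L] is the labeled probe concentration and Kd_L is the probe's own dissociation constant. The function and parameter names are illustrative, and the conversion only applies when those assay conditions are reported.

```python
# Conversion factors to nM; both Unicode micro characters are accepted.
UNIT_TO_NM = {
    "pM": 1e-3,
    "nM": 1.0,
    "uM": 1e3,
    "\u00b5M": 1e3,  # micro sign
    "\u03bcM": 1e3,  # Greek small letter mu
    "mM": 1e6,
    "M": 1e9,
}

def to_nanomolar(value, unit):
    """Convert an affinity value to nM; raises KeyError on unknown units."""
    return value * UNIT_TO_NM[unit]

def cheng_prusoff_ki(ic50_nm, probe_conc_nm, probe_kd_nm):
    """Ki = IC50 / (1 + [L]/Kd_L) for a competitive binding assay."""
    return ic50_nm / (1.0 + probe_conc_nm / probe_kd_nm)

def is_binder(kd_nm, threshold_nm=10_000.0):
    """Binary label at the 10 µM threshold (10,000 nM)."""
    return kd_nm < threshold_nm

ki = cheng_prusoff_ki(ic50_nm=200.0, probe_conc_nm=10.0, probe_kd_nm=10.0)
```

With the probe at a concentration equal to its own Kd, the IC50 is twice the Ki, so the example gives a Ki of 100 nM. Enzymatic IC50 values follow the same form with substrate concentration and Km in place of [L] and Kd_L, assuming competitive inhibition.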
Phase 4: Quality Control
- Flag computed vs. experimental affinity
- Flag assay type and rank by preference (SPR > ITC > FP > competition > enzymatic > computed)
- Remove entries with contradictory affinity values
- Annotate cyclization type, peptide length, modification type
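
One possible way to apply the assay ranking and detect contradictory values is sketched below. The 10-fold disagreement threshold is an assumption chosen for illustration; this summary does not define what counts as contradictory. The `kd_nm` field and the toy entries are also hypothetical.

```python
from collections import defaultdict

# Preference order taken from the assay ranking in this summary (lower is better).
ASSAY_RANK = {"SPR": 0, "ITC": 1, "FP": 2, "competition": 3, "enzymatic": 4, "computed": 5}

def group_entries(entries):
    """Group entries by (peptide sequence, target accession)."""
    groups = defaultdict(list)
    for e in entries:
        groups[(e["peptide_sequence"], e["target_uniprot"])].append(e)
    return groups

def is_contradictory(group, max_fold=10.0):
    """Flag a group whose positive Kd values span more than max_fold (assumed threshold)."""
    values = [e["kd_nm"] for e in group]
    return max(values) / min(values) > max_fold

def preferred_entry(group):
    """Pick the entry with the best-ranked assay; unknown assays rank last."""
    return min(group, key=lambda e: ASSAY_RANK.get(e["assay_method"], len(ASSAY_RANK)))

# Toy entries with placeholder identifiers.
entries = [
    {"peptide_sequence": "SEQ1", "target_uniprot": "T1", "kd_nm": 80.0, "assay_method": "FP"},
    {"peptide_sequence": "SEQ1", "target_uniprot": "T1", "kd_nm": 50.0, "assay_method": "SPR"},
    {"peptide_sequence": "SEQ2", "target_uniprot": "T1", "kd_nm": 10.0, "assay_method": "ITC"},
    {"peptide_sequence": "SEQ2", "target_uniprot": "T1", "kd_nm": 5000.0, "assay_method": "FP"},
]
groups = group_entries(entries)
flags = {key: is_contradictory(g) for key, g in groups.items()}
```

The same grouping doubles as a deduplication key across PPIKB, CyclicPepedia, and literature-extracted entries, with `preferred_entry` selecting which duplicate to keep.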
Data Schema (Planned)
| Field | Type | Description |
|---|---|---|
| entry_id | str | Unique identifier |
| peptide_sequence | str | One-letter code or HELM notation |
| cyclization_type | str | disulfide, head-to-tail, thioether, stapled, lactam, etc. |
| peptide_length | int | Number of residues |
| has_noncanonical | bool | Contains N-methyl, D-amino acids, etc. |
| target_protein | str | Protein name |
| target_uniprot | str | UniProt accession |
| target_pdb | str | PDB ID of complex (if available) |
| affinity_value | float | Numeric affinity value |
| affinity_unit | str | nM, µM, mM |
| affinity_type | str | Kd, Ki, IC50, EC50 |
| assay_method | str | SPR, ITC, FP, MST, BLI, AS-MS, etc. |
| assay_conditions | str | pH, temp, buffer (if available) |
| is_binder | bool | Binary label (affinity < 10 µM) |
| is_strong_binder | bool | Kd < 100 nM |
| is_moderate_binder | bool | 100 nM ≤ Kd < 1 µM |
| is_weak_binder | bool | 1 µM ≤ Kd < 10 µM |
| is_non_binder | bool | Explicitly non-binding or Kd > 100 µM |
| source | str | PPIKB, CyclicPepedia, literature DOI, CPBind |
| source_doi | str | DOI of source paper |
| source_year | int | Year of publication |
| notes | str | Additional context |
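
A reduced version of the planned schema could be expressed as a dataclass, with the binder tiers derived from a Kd in nM. This sketch covers only a subset of the fields in the table, and it assigns boundary values (exactly 100 nM, 1 µM) to the weaker tier, which is an assumption. Values between 10 µM and 100 µM receive no tier, mirroring the thresholds as listed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AffinityEntry:
    """Subset of the planned schema; remaining fields would be added the same way."""
    entry_id: str
    peptide_sequence: str
    cyclization_type: str
    target_protein: str
    affinity_value: float
    affinity_unit: str
    affinity_type: str
    assay_method: str
    source: str
    target_uniprot: Optional[str] = None
    source_doi: Optional[str] = None
    notes: str = ""

def binder_tier(kd_nm: float) -> Optional[str]:
    """Map a Kd in nM to a tier name, or None for the unlabeled 10-100 µM range."""
    if kd_nm < 100.0:
        return "strong"
    if kd_nm < 1_000.0:
        return "moderate"
    if kd_nm < 10_000.0:
        return "weak"
    if kd_nm > 100_000.0:
        return "non_binder"
    return None

# Placeholder entry for illustration only.
example = AffinityEntry(
    entry_id="demo-0001",
    peptide_sequence="SEQ1",
    cyclization_type="thioether",
    target_protein="example target",
    affinity_value=50.0,
    affinity_unit="nM",
    affinity_type="Kd",
    assay_method="SPR",
    source="literature",
)
tier = binder_tier(example.affinity_value)
```

Deriving the four boolean tier columns from a single function keeps them mutually exclusive, which separate hand-set flags would not guarantee.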
Files Created
| File | Description |
|---|---|
| `plan.md` | Original curation plan |
| `paper_collection.json` | 80 papers with metadata |
| `summary.md` | This file: curation summary and strategy |