title: Cyclic Peptide–Protein Binding Affinity Dataset Curation Summary
tags: [data-scraping, dataset, summary, cyclic-peptide, binding-affinity]
created: 2026-05-11
updated: 2026-05-27
status: active
related:


Cyclic Peptide–Protein Binding Affinity Dataset: Curation Summary

Overview

Searches are complete across Paperclip (23 queries), web searches, and database checks. Total papers collected: 80, of which 62 report experimental binding data.


Key Databases (Downloadable)

1. PPIKB — PRIMARY RESOURCE

  • URL: https://ppikb.duanlab.ac
  • Downloads: https://ppikb.duanlab.ac/downloads/
  • Content: 40,347 entries total (including 19,509 research-derived and ~2,815 patent-derived entries)
  • Data includes: Kd, Ki, IC50 values, peptide sequences, protein targets, PDB IDs
  • Files available:
    • Main.xlsx — Full affinity dataset (sequences, affinity data, DOIs, PMIDs)
    • Branch.xlsx — Refined dataset with matched PDB crystal structures
    • research.xlsx — Research paper–derived affinity data
    • structure.xlsx — PDB structural data
  • Filtering needed: select cyclic peptides by searching the sequence or annotation fields for keywords such as cyclic, cyclized, disulfide, or macrocycle
  • License: CC-BY 4.0
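
The keyword filter described above could be sketched roughly as follows. This is a minimal illustration, not the PPIKB schema: the field names (`peptide_name`, `annotation`, `sequence`) and the sample rows are hypothetical, and the real column headers in Main.xlsx would need to be checked after download.

```python
import re

# Keywords come from the filtering note above; "macrocycl" also matches "macrocyclic".
CYCLIC_PATTERN = re.compile(r"cyclic|cyclized|disulfide|macrocycl", re.IGNORECASE)

def is_cyclic(row, fields=("peptide_name", "annotation", "sequence")):
    """Return True if any of the given text fields mentions a cyclization keyword.

    The field names are assumptions for illustration, not PPIKB column names.
    """
    return any(CYCLIC_PATTERN.search(str(row.get(f) or "")) for f in fields)

# Hypothetical rows standing in for records loaded from Main.xlsx.
rows = [
    {"peptide_name": "pep-1", "annotation": "head-to-tail cyclized", "sequence": "RGDFK"},
    {"peptide_name": "pep-2", "annotation": "N-terminal acetylation", "sequence": "RGDFK"},
    {"peptide_name": "pep-3", "annotation": "Disulfide-bridged macrocycle", "sequence": "CRGDC"},
]

cyclic_rows = [r for r in rows if is_cyclic(r)]
print([r["peptide_name"] for r in cyclic_rows])
```

A keyword match is only a first pass. Entries whose cyclization is encoded some other way (for example, only in a structure file) would be missed and may need manual review.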

2. CPSea / CPBind — COMPUTED AFFINITY + STRUCTURES

  • URL: https://github.com/YZY010418/CPSea
  • Kaggle: https://www.kaggle.com/datasets/ziyiyang180104/cpsea
  • Zenodo: https://zenodo.org/records/17324994
  • Content: 2.7M cyclic peptide-receptor complexes (AFDB-derived)
  • Key subsets:
    • CPBind_affinity.csv — 476K entries with Rosetta ddG < -25 and Vina score < -6
    • CPBind_basic.tsv — Basic info and filter metrics
    • CPBind_validity.csv — Ramachandran and interaction quality
    • CPBind_hydrophobic.csv — GRAVY, logP, rTPSA
    • CPBind_cluster.tsv — Foldseek cluster assignments
  • Note: Affinity is COMPUTED (Rosetta ddG + Vina), not experimental. Useful as training data, but models trained on it should be benchmarked against experimental measurements.
  • License: Research/academic use
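
As a sanity check on the stated cutoffs, a downloaded affinity file could be re-screened and every retained row tagged as computed. The sketch below assumes hypothetical column names (`complex_id`, `rosetta_ddg`, `vina_score`) and uses an in-memory sample instead of the real CPBind_affinity.csv, whose header should be inspected first.

```python
import csv
import io

DDG_MAX = -25.0   # Rosetta ddG cutoff quoted above
VINA_MAX = -6.0   # Vina score cutoff quoted above

def passes_cutoffs(row):
    """True if a row satisfies both cutoffs (column names are assumptions)."""
    return float(row["rosetta_ddg"]) < DDG_MAX and float(row["vina_score"]) < VINA_MAX

# Hypothetical sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "complex_id,rosetta_ddg,vina_score\n"
    "cp_0001,-31.2,-7.4\n"
    "cp_0002,-18.9,-8.1\n"
    "cp_0003,-27.5,-5.2\n"
)

kept = []
for row in csv.DictReader(sample_csv):
    if passes_cutoffs(row):
        row["affinity_source"] = "computed"  # never mix with experimental values unflagged
        kept.append(row)

print([r["complex_id"] for r in kept])
```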

3. CyclicPepedia — NATURAL + SYNTHETIC CYCLIC PEPTIDES

  • URL: https://www.biosino.org/iMAC/cyclicpepedia/
  • Downloads: https://www.biosino.org/iMAC/cyclicpepedia/download
  • Content: 8,751 cyclic peptides, 59 targets, 821 sources
  • Key files:
    • Peptide_basic_info.xlsx (2.4 MB)
    • Peptide_to_target.xlsx — Maps peptides to protein targets
    • Target.xlsx — Target protein info
    • Bioassay.xlsx — Bioassay data (280 KB)
    • Peptide_sequence_info.xlsx
    • structure_complex.zip — 3D PDB files of peptide-target complexes
    • structure_pdb.zip — 3D PDB files of cyclic peptides
  • Note: Bioassay data likely includes some binding affinity measurements but may not be comprehensive.

4. CycPeptMPDB — PERMEABILITY (NOT BINDING)

  • URL: http://cycpeptmpdb.com
  • Content: 7,991 cyclic peptides with permeability data
  • Note: Permeability only, NOT binding affinity. Useful as supplementary ADME data but does not directly serve our purpose.

5. PEPBI — THERMODYNAMIC DATA (LINEAR PEPTIDES)

  • Source: Scientific Data (Nature), 2025
  • Content: 329 peptide-protein complexes with ΔG, ΔH, ΔS (ITC)
  • Note: Mostly linear peptides (5-20 aa). Could filter for cyclic entries but likely very few.

6. PepBenchmark — ML BENCHMARK

  • URL: https://github.com/ZGCI (PepBenchmark)
  • Content: 29 canonical + 6 non-canonical peptide datasets
  • Note: Includes some binding affinity prediction tasks. Check for cyclic peptide subsets.

Paper Collection Summary

80 papers catalogued in paper_collection.json

By cyclization type:

| Type | Count | Notes |
| --- | --- | --- |
| Thioether macrocyclic | ~12 | Most common in mRNA/RaPID display papers |
| Hydrocarbon stapled | ~8 | MDM2/p53, helix mimetics |
| Disulfide constrained | ~6 | Phage display, DRPs |
| Lactam stapled | ~2 | Helical peptides |
| Head-to-tail cyclic | ~5 | Native cyclization |
| Bicyclic | ~6 | Including bismuth, CPP-conjugated |
| Multiple/various | ~15 | Comparison studies |
| Other (thioether-bipyridyl, CPPC, etc.) | ~6 | Specialized chemistries |

By assay type:

| Assay | Count |
| --- | --- |
| SPR (Kd, kon, koff) | ~25 |
| FP (Kd, IC50) | ~15 |
| ITC (Kd, ΔG, ΔH, ΔS) | ~8 |
| AS-MS (bind/no-bind) | ~3 |
| Enzymatic IC50 | ~10 |
| Competition binding | ~8 |
| Yeast/phage display enrichment | ~5 |
| Computed (ddG, Vina score) | ~5 |

By target protein (top):

| Target | Papers |
| --- | --- |
| MDM2/MDMX | ~5 |
| BET bromodomains | ~3 |
| SARS-CoV-2 Spike | ~3 |
| FGF receptors | ~3 |
| PTP1B | ~2 |
| Keap1 | ~2 |
| TNFα | ~2 |
| Various kinases | ~3 |
| GPCRs | ~2 |
| Other diverse targets | ~40+ |

Phase 1: Download & Filter Existing Databases (HIGHEST PRIORITY — immediate data)

  1. PPIKB: Download Main.xlsx and Branch.xlsx. Filter for cyclic peptides by searching the peptide name/description for keywords such as cyclic, disulfide, macrocyclic, stapled, thioether, and head-to-tail cyclization.
  2. CyclicPepedia: Download Peptide_to_target.xlsx, Target.xlsx, Bioassay.xlsx, Peptide_basic_info.xlsx. Join tables to extract binding affinity entries
  3. CPSea/CPBind: Download CPBind_affinity.csv for computed affinity scores (Rosetta ddG + Vina). Use as supplementary training data
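
The CyclicPepedia table join in step 2 might look roughly like the sketch below. The key and column names (`peptide_id`, `target_id`, `assay_type`, `value`, `unit`) are hypothetical placeholders, since the actual spreadsheet headers have not been inspected here, and the small lists stand in for rows loaded from the .xlsx files.

```python
from collections import defaultdict

# Hypothetical rows standing in for Peptide_to_target.xlsx, Target.xlsx, and Bioassay.xlsx.
peptide_to_target = [
    {"peptide_id": "P1", "target_id": "T1"},
    {"peptide_id": "P2", "target_id": "T2"},
]
targets = [
    {"target_id": "T1", "target_name": "Target A"},
    {"target_id": "T2", "target_name": "Target B"},
]
bioassays = [
    {"peptide_id": "P1", "assay_type": "Kd", "value": 12.0, "unit": "nM"},
    {"peptide_id": "P1", "assay_type": "MIC", "value": 4.0, "unit": "µg/mL"},
    {"peptide_id": "P2", "assay_type": "IC50", "value": 3.5, "unit": "µM"},
]

# Only these measurement types count as binding affinity for this dataset.
AFFINITY_TYPES = {"Kd", "Ki", "IC50", "EC50"}

target_by_id = {t["target_id"]: t for t in targets}
assays_by_peptide = defaultdict(list)
for assay in bioassays:
    if assay["assay_type"] in AFFINITY_TYPES:
        assays_by_peptide[assay["peptide_id"]].append(assay)

joined = []
for link in peptide_to_target:
    target = target_by_id.get(link["target_id"])
    if target is None:
        continue  # skip links to targets missing from the target table
    for assay in assays_by_peptide.get(link["peptide_id"], []):
        joined.append({
            "peptide_id": link["peptide_id"],
            "target_name": target["target_name"],
            "affinity_type": assay["assay_type"],
            "affinity_value": assay["value"],
            "affinity_unit": assay["unit"],
        })

for record in joined:
    print(record)
```

One caveat with this shape of join: if a peptide maps to several targets and the bioassay rows do not name the target, each assay is attached to every target, so such cases would need manual disambiguation.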

Phase 2: Extract Data from Key Papers

  • Read the 62 papers with experimental binding data using Paperclip’s full-text access
  • Extract individual (peptide, protein, affinity, assay, conditions) tuples
  • Prioritize papers with large-scale datasets (e.g., RaPID libraries, DEL screening, phage display hit validation)

Phase 3: Normalize & Combine

  • Standardize affinity values to Kd (nM) where possible
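  • Where Kd is not directly reported, convert IC50 to Ki using the Cheng-Prusoff equation (this requires the competing ligand or substrate concentration and its Kd or Km) and treat Ki as an approximate Kd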
  • Add binary binder/non-binder labels at 10 µM threshold
  • Deduplicate across sources (PPIKB, CyclicPepedia, literature extraction)
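
The normalization steps above can be sketched as a few small helpers. This is a minimal illustration under stated assumptions: the function names are hypothetical, the Cheng-Prusoff form used is the competitive one, Ki = IC50 / (1 + [L]/Kd_L), with the resulting Ki treated as an approximate Kd, and the deduplication key (sequence, target, affinity type, assay method) is one possible choice rather than a settled rule.

```python
# Conversion factors to nM; micro is normalized to "u" before lookup.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}

def to_nM(value, unit):
    """Convert a concentration to nM, accepting both Unicode micro/mu characters."""
    unit = unit.strip().replace("\u00b5", "u").replace("\u03bc", "u")
    return value * UNIT_TO_NM[unit]

def cheng_prusoff_ki(ic50, probe_conc, probe_kd):
    """Competitive Cheng-Prusoff: Ki = IC50 / (1 + [L] / Kd_L).

    All three arguments must share one unit. For enzymatic assays, pass the
    substrate concentration and its Km instead of the probe values.
    """
    return ic50 / (1.0 + probe_conc / probe_kd)

def is_binder(kd_nM, threshold_nM=10_000.0):
    """Binary label at the 10 µM threshold named above."""
    return kd_nM < threshold_nM

def deduplicate(entries):
    """Keep the first entry per (sequence, target, affinity type, assay) key (an assumed key)."""
    seen = set()
    unique = []
    for e in entries:
        key = (e["peptide_sequence"], e["target_uniprot"], e["affinity_type"], e["assay_method"])
        if key not in seen:
            seen.add(key)
            unique.append(e)
    return unique

# Illustrative values only: an IC50 of 2 µM measured with 10 nM probe whose Kd is 10 nM.
ic50_nM = to_nM(2.0, "µM")
approx_kd_nM = cheng_prusoff_ki(ic50_nM, probe_conc=10.0, probe_kd=10.0)
print(ic50_nM, approx_kd_nM, is_binder(approx_kd_nM))
```

Entries converted this way are approximations, so it may be worth keeping the original value, unit, and type alongside the normalized Kd rather than overwriting them.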

Phase 4: Quality Control

  • Flag computed vs. experimental affinity
  • Flag assay type and rank entries by assay preference (SPR > ITC > FP > competition > enzymatic > computed)
  • Remove entries with contradictory affinity values
  • Annotate cyclization type, peptide length, modification type
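
A rough sketch of the flagging steps above follows. The assay ordering is taken from the list. The definition of "contradictory" is an assumption made here for illustration (experimental Kd values for the same peptide and target differing by more than 10-fold), as are the field names `kd_nM`, `peptide_sequence`, and `target_protein` used for grouping.

```python
from collections import defaultdict

# Preference order from the QC list above; lower rank means more preferred.
ASSAY_PRIORITY = ["SPR", "ITC", "FP", "competition", "enzymatic", "computed"]
RANK = {name: i for i, name in enumerate(ASSAY_PRIORITY)}

def annotate(entry):
    """Return a copy of the entry with computed/experimental and assay-rank flags."""
    out = dict(entry)
    out["is_computed"] = entry["assay_method"] == "computed"
    out["assay_rank"] = RANK.get(entry["assay_method"], len(ASSAY_PRIORITY))
    return out

def contradictory_keys(entries, max_fold=10.0):
    """Return (peptide, target) pairs whose experimental Kd values span more than max_fold.

    The 10-fold default is an assumed threshold, not one fixed by the plan.
    Kd values are assumed to be positive and already normalized to nM.
    """
    groups = defaultdict(list)
    for e in entries:
        if e["assay_method"] != "computed":
            groups[(e["peptide_sequence"], e["target_protein"])].append(e["kd_nM"])
    return {key for key, vals in groups.items() if max(vals) / min(vals) > max_fold}

# Illustrative entries only.
entries = [
    {"peptide_sequence": "CRGDC", "target_protein": "Target A", "assay_method": "SPR", "kd_nM": 10.0},
    {"peptide_sequence": "CRGDC", "target_protein": "Target A", "assay_method": "ITC", "kd_nM": 15.0},
    {"peptide_sequence": "ACDEF", "target_protein": "Target B", "assay_method": "SPR", "kd_nM": 5.0},
    {"peptide_sequence": "ACDEF", "target_protein": "Target B", "assay_method": "FP", "kd_nM": 500.0},
]

flagged = contradictory_keys(entries)
annotated = [annotate(e) for e in entries]
print(flagged)
```

Flagging rather than deleting at this stage keeps the decision reversible. Removal could then be applied only to the flagged pairs after checking whether the discrepancy comes from differing assay conditions.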

Data Schema (Planned)

| Field | Type | Description |
| --- | --- | --- |
| entry_id | str | Unique identifier |
| peptide_sequence | str | One-letter code or HELM notation |
| cyclization_type | str | disulfide, head-to-tail, thioether, stapled, lactam, etc. |
| peptide_length | int | Number of residues |
| has_noncanonical | bool | Contains N-methyl, D-amino acids, etc. |
| target_protein | str | Protein name |
| target_uniprot | str | UniProt accession |
| target_pdb | str | PDB ID of complex (if available) |
| affinity_value | float | Numeric affinity value |
| affinity_unit | str | nM, µM, mM |
| affinity_type | str | Kd, Ki, IC50, EC50 |
| assay_method | str | SPR, ITC, FP, MST, BLI, AS-MS, etc. |
| assay_conditions | str | pH, temp, buffer (if available) |
| is_binder | bool | Binary label (affinity < 10 µM) |
| is_strong_binder | bool | Kd < 100 nM |
| is_moderate_binder | bool | 100 nM ≤ Kd < 1 µM |
| is_weak_binder | bool | 1 µM ≤ Kd < 10 µM |
| is_non_binder | bool | Explicitly non-binding or Kd > 100 µM |
| source | str | PPIKB, CyclicPepedia, literature DOI, CPBind |
| source_doi | str | DOI of source paper |
| source_year | int | Year of publication |
| notes | str | Additional context |
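
One way the planned schema could be expressed in code is a dataclass for the stored fields plus a helper that derives the binder labels from a normalized Kd. This is a sketch only: which fields are required versus optional is an assumption, and the tier boundaries are treated as half-open intervals so that a value sitting exactly on a boundary receives exactly one tier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AffinityEntry:
    # Required fields (the required/optional split is an assumption).
    entry_id: str
    peptide_sequence: str
    cyclization_type: str
    target_protein: str
    affinity_value: float
    affinity_unit: str
    affinity_type: str
    assay_method: str
    source: str
    # Optional fields, filled in when the source provides them.
    peptide_length: Optional[int] = None
    has_noncanonical: bool = False
    target_uniprot: Optional[str] = None
    target_pdb: Optional[str] = None
    assay_conditions: Optional[str] = None
    source_doi: Optional[str] = None
    source_year: Optional[int] = None
    notes: str = ""

def binder_labels(kd_nM):
    """Derive the boolean binder labels from a Kd in nM.

    Values between 10 µM and 100 µM get no positive label, matching the gap
    between the is_binder and is_non_binder definitions in the schema.
    """
    return {
        "is_binder": kd_nM < 10_000,
        "is_strong_binder": kd_nM < 100,
        "is_moderate_binder": 100 <= kd_nM < 1_000,
        "is_weak_binder": 1_000 <= kd_nM < 10_000,
        "is_non_binder": kd_nM > 100_000,
    }

# Illustrative entry only; identifiers and values are placeholders.
entry = AffinityEntry(
    entry_id="lit-0001",
    peptide_sequence="CRGDC",
    cyclization_type="disulfide",
    target_protein="Target A",
    affinity_value=50.0,
    affinity_unit="nM",
    affinity_type="Kd",
    assay_method="SPR",
    source="literature",
)
labels = binder_labels(entry.affinity_value)
print(labels)
```

Entries reported as explicitly non-binding without a measured Kd cannot be labeled by this helper and would need `is_non_binder` set directly from the source.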

Files Created

| File | Description |
| --- | --- |
| plan.md | Original curation plan |
| paper_collection.json | 80 papers with metadata |
| summary.md | This file: curation summary and strategy |