title: Cyclic Peptide–Protein Binding Affinity Extraction Pipeline Plan
tags: [plan, data-scraping, pipeline, dataset]
created: 2026-05-12
updated: 2026-05-27
status: active
related:

Cyclic Peptide–Protein Binding Affinity: Extraction Pipeline Plan

Architecture

data-scraping/
├── plan.md                          # Original search plan
├── summary.md                       # Curation summary
├── paper_collection.json             # Paper metadata
├── pipeline_plan.md                  # This file
├── sources/
│   ├── _template.py                  # Template for new source scripts
│   ├── 01_ppikb/
│   │   ├── raw/                      # Downloaded PPIKB xlsx files
│   │   ├── src/                      # PPIKB extraction scripts
│   │   └── output/                   # Extracted & normalized CSVs
│   ├── 02_cyclicpepedia/
│   │   ├── raw/                      # Downloaded CyclicPepedia xlsx files
│   │   ├── src/                      # CyclicPepedia extraction scripts
│   │   └── output/                   # Extracted & normalized CSVs
│   ├── 03_cpbind/
│   │   ├── raw/
│   │   ├── src/
│   │   └── output/
│   ├── 04_pepbi/
│   │   ├── raw/
│   │   ├── src/
│   │   └── output/
│   ├── 05_paper_norman_2021/
│   │   ├── raw/
│   │   ├── src/
│   │   └── output/
│   └── ...                           # Additional paper sources

Standardized Schema (per intermediate CSV)

Every source script MUST output a CSV with these columns to {source_dir}/output/:

Column	Type	Required	Description
`source_id`	str	YES	Source script name (e.g. “01_ppikb”)
`entry_id`	str	YES	Unique ID within source
`peptide_name`	str	no	Name/identifier of peptide
`peptide_sequence`	str	YES	Sequence (one-letter if canonical, HELM for modified)
`cyclization_type`	str	YES	One of: disulfide, head-to-tail, thioether, stapled_hydrocarbon, stapled_lactam, thioether_bipyridyl, bicyclic, CPPC_DRP, other
`cyclization_detail`	str	no	Additional detail (e.g., “i,i+4 olefin stapling”)
`peptide_length`	int	YES	Number of residues
`has_noncanonical`	bool	no	Contains N-methyl, D-amino acid, etc.
`target_protein`	str	YES	Protein name
`target_uniprot`	str	no	UniProt accession
`target_pdb`	str	no	PDB ID of complex
`affinity_value`	float	YES*	Numeric affinity (set to NaN if binary only)
`affinity_unit`	str	YES*	nM, uM, mM, pM
`affinity_type`	str	YES	Kd, Ki, IC50, EC50, Ka, Kd_on, Kd_off, bind/nobind
`affinity_is_binary`	bool	YES	True if only binder/nonbinder (no numeric value)
`assay_method`	str	YES	SPR, ITC, FP, MST, BLI, AS-MS, ELISA, AlphaScreen, TR-FRET, competition, enzymatic, other
`assay_conditions`	str	no	pH, temp, buffer
`is_binder`	bool	YES	Binary: Kd/Ki/IC50 < 10 uM or explicitly stated binder
`is_strong_binder`	bool	no	Kd < 100 nM
`is_moderate_binder`	bool	no	100 nM <= Kd < 1 uM
`is_weak_binder`	bool	no	1 uM <= Kd < 10 uM
`is_non_binder`	bool	no	Explicitly non-binding or Kd > 100 uM
`affinity_is_computed`	bool	YES	True if in silico (Rosetta, Vina, etc.), False if experimental
`doi`	str	no	DOI of source paper
`pubmed_id`	str	no	PMID
`year`	int	no	Publication year
`notes`	str	no	Free text

*Required if affinity_is_binary is False.
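
The binder columns can be derived mechanically once affinities share a unit. The following is a minimal sketch, assuming the thresholds in the table apply to whatever numeric value is stored in `affinity_value` (the table names Kd for the graded categories) and that binary-only rows take `is_binder` from the source's own label. The names `UNIT_TO_NM`, `to_nanomolar`, and `binder_flags` are illustrative and not part of the existing scripts.

```python
import math

# Conversion factors to nanomolar for the units listed in the schema.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}


def to_nanomolar(value, unit):
    """Convert an affinity value to nM; missing values come back as NaN."""
    if value is None or math.isnan(value):
        return math.nan
    return value * UNIT_TO_NM[unit]


def binder_flags(value, unit):
    """Derive the binder columns from a numeric affinity.

    Assumption: the same thresholds are used regardless of affinity_type.
    Rows without a numeric value get None for every flag, so the caller
    can fill is_binder from an explicit binder/non-binder label instead.
    """
    nm = to_nanomolar(value, unit)
    if math.isnan(nm):
        return dict.fromkeys(
            ["is_binder", "is_strong_binder", "is_moderate_binder",
             "is_weak_binder", "is_non_binder"]
        )
    return {
        "is_binder": nm < 10_000,                  # < 10 uM
        "is_strong_binder": nm < 100,              # < 100 nM
        "is_moderate_binder": 100 <= nm < 1_000,   # 100 nM to < 1 uM
        "is_weak_binder": 1_000 <= nm < 10_000,    # 1 uM to < 10 uM
        "is_non_binder": nm > 100_000,             # > 100 uM
    }
```

With these thresholds, a value between 10 uM and 100 uM is flagged as neither a binder nor a non-binder; whether that gap is intended is left open here.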

Source Prioritization

Tier 1: Large structured databases (download + filter)

These give the most data per effort and are immediately downloadable.

#	Source	Expected entries	Effort	Notes
01	PPIKB	500-5000 cyclic entries	Low	Filter Main.xlsx for cyclic peptides. Has Kd/Ki/IC50.
02	CyclicPepedia	100-1000 bind entries	Low	Has target + bioassay data; needs cross-referencing
03	CPBind	~100K+ entries	Low	Computed affinity only; separate label for training
04	PEPBI	~20-50 cyclic entries	Low	ITC thermodynamic data; mostly linear but worth extracting

Tier 2: Papers with large-scale screening data (structured extraction)

These papers have supplementary tables or structured data with many peptide-protein pairs.

#	Source	Expected entries	Key data
05	Norman et al. 2021 (SARS-CoV-2 spike, mRNA display)	10-50 cyclic peptides with Kd	thioether cyclic peptides
06	Patel et al. 2020 (BET bromodomain DNA-encoded)	20-100 cyclic peptides with Kd/IC50	multiple bromodomains
08	Hacker et al. 2020 (linear/mono/bicyclic vs streptavidin)	20-50 entries	direct comparison of formats
10	Zhao et al. 2025 (DEL macrocyclization strategies)	10-50 entries	multiple cyclization chemistries
14	Linciano et al. 2024 (yeast display)	10-30 entries	multiple targets
15	Smith et al. 2023 (FGF-R selective/promiscuous)	20-50 entries	selectivity profile
27	Hacker et al. 2020 supplementary	potentially large dataset	mRNA display hit tables

Tier 3: Papers with focused studies (manual extraction from text/tables)

These have smaller datasets but high-quality, well-characterized data.

#	Source	Expected entries	Key data
07	de Araujo et al. 2022 (MDM2 stapled systematic)	10-20 entries	multiple stapling methods compared
09	Lee et al. 2025 (linearizable macrocyclic)	5-15 entries	AS-MS + SPR
11	Glas et al. 2017 (macrocycle kinetics)	5-10 entries	kon/koff/Kd from SPR
12	Schneider et al. 2021 (MDM2 CPP-bicyclic)	5-10 entries	Kd, IC50
13	Villequey et al. 2024 (FGFR3c bicyclic)	10-20 entries	Kd, IC50, thermal stability
16	Goldbach et al. 2019 (amylase macrocycle)	5-10 entries	ITC full thermodynamics
17	Li et al. 2024 (AI-designed binders)	5-10 entries	Kd via SPR
18	Rettie et al. 2024 (RFpeptides)	5-15 entries	experimental Kd
19	Gaucher et al. 2022 (VEGF cyclization ITC)	5-15 entries	full ΔG/ΔH/ΔS
20	Kruger et al. 2017 (non-natural macrocyclic PPI)	5-10 entries	crystal structure + Kd
21	EphA2 bicyclic (BCY18469)	5-10 entries	Kd from SPR/ITC
22	Li et al. 2023 (MYC bicyclic)	5-10 entries	IC50 values
24	Manschwetus et al. 2019 (stapled PKI)	5-10 entries	Kd values
31	Phage display disulfide constrained (Gao 2024)	20-50 entries	sub-uM affinities
33	Landscaping HDM2 stapled peptides	10-20 entries	Kd, helicity, stability

Tier 4: Additional papers (expand as we go)

More papers can be added using the same template. The 80 papers in paper_collection.json provide the source list.

Script Template

Each source script follows this pattern:

Download raw data (from URL, DOI, or manual upload)
Parse and extract relevant fields
Normalize to the standardized schema
Output to {source_dir}/output/{source_id}.csv
Print summary statistics
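
The steps above can be sketched as follows. This is not the contents of `_template.py`; it is a hypothetical skeleton that assumes the raw data has already been downloaded and parsed into dictionaries. The raw keys (`sequence`, `target`, `kd_nM`, `cyclization`, `assay`) and the function names are made up for illustration.

```python
import csv
from collections import Counter
from pathlib import Path

SCHEMA_COLUMNS = [
    "source_id", "entry_id", "peptide_name", "peptide_sequence",
    "cyclization_type", "cyclization_detail", "peptide_length",
    "has_noncanonical", "target_protein", "target_uniprot", "target_pdb",
    "affinity_value", "affinity_unit", "affinity_type", "affinity_is_binary",
    "assay_method", "assay_conditions", "is_binder", "is_strong_binder",
    "is_moderate_binder", "is_weak_binder", "is_non_binder",
    "affinity_is_computed", "doi", "pubmed_id", "year", "notes",
]


def normalize(raw, source_id, index):
    """Map one parsed record onto the schema (the raw keys are hypothetical)."""
    sequence = raw["sequence"].strip().upper()
    row = dict.fromkeys(SCHEMA_COLUMNS, "")  # optional columns stay empty
    row.update({
        "source_id": source_id,
        "entry_id": f"{source_id}_{index:05d}",
        "peptide_sequence": sequence,
        "peptide_length": len(sequence),  # only valid for one-letter sequences
        "cyclization_type": raw.get("cyclization", "other"),
        "target_protein": raw["target"].strip(),
        "affinity_value": raw["kd_nM"],
        "affinity_unit": "nM",
        "affinity_type": "Kd",
        "affinity_is_binary": False,
        "assay_method": raw.get("assay", "other"),
        "is_binder": float(raw["kd_nM"]) < 10_000,  # 10 uM threshold
        "affinity_is_computed": False,
    })
    return row


def write_output(rows, source_dir, source_id):
    """Write rows to {source_dir}/output/{source_id}.csv and return the path."""
    out_dir = Path(source_dir) / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{source_id}.csv"
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=SCHEMA_COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
    return out_path


def run_source(raw_records, source_dir, source_id):
    """Normalize, write, and print a short summary for one source."""
    rows = [normalize(r, source_id, i) for i, r in enumerate(raw_records, start=1)]
    out_path = write_output(rows, source_dir, source_id)
    print(f"{source_id}: wrote {len(rows)} rows to {out_path}")
    for ctype, n in Counter(r["cyclization_type"] for r in rows).most_common():
        print(f"  {ctype}: {n}")
    return out_path
```

Keeping `normalize` separate from the I/O makes the per-source mapping the only part that changes between scripts.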

The master collate.py script then:

Reads all intermediate CSVs from sources/*/output/*.csv
Deduplicates across sources (by DOI + peptide_sequence + target_protein)
Resolves conflicting affinity values (prefer experimental > computed, SPR > FP > other)
Computes derived features (binder categories)
Outputs final dataset
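
A rough sketch of the deduplication and conflict-resolution steps, assuming the intermediate CSVs sit under `sources/*/output/` as in the Architecture tree. The function names and the `ASSAY_RANK` table are illustrative, and this is not the actual `collate.py`. When two rows rank equally, the first one read is kept, which is an arbitrary choice made for this example.

```python
import csv
from pathlib import Path

# Lower rank wins: SPR > FP > other, as stated in the plan.
ASSAY_RANK = {"SPR": 0, "FP": 1}
OTHER_RANK = 2


def read_intermediate(root):
    """Load every per-source output CSV under root into a list of dicts."""
    rows = []
    for path in sorted(Path(root).glob("sources/*/output/*.csv")):
        with path.open(newline="", encoding="utf-8") as fh:
            rows.extend(csv.DictReader(fh))
    return rows


def dedup_key(row):
    """Key on DOI + peptide_sequence + target_protein, case-normalized."""
    return (
        (row.get("doi") or "").strip().lower(),
        row["peptide_sequence"].strip().upper(),
        row["target_protein"].strip().lower(),
    )


def priority(row):
    """Sort key: experimental before computed, then by assay rank."""
    computed = str(row.get("affinity_is_computed", "")).strip().lower() == "true"
    assay = ASSAY_RANK.get((row.get("assay_method") or "").strip(), OTHER_RANK)
    return (computed, assay)


def collate(rows):
    """Keep the highest-priority row for each dedup key."""
    best = {}
    for row in rows:
        key = dedup_key(row)
        if key not in best or priority(row) < priority(best[key]):
            best[key] = row
    return list(best.values())
```

Because `doi` is optional in the schema, rows without a DOI share an empty string in the key and are merged on sequence and target alone.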

Execution Order

We’ll go Tier 1 first (largest data, least effort), then Tier 2, then Tier 3:

01_ppikb.py — PPIKB (likely 500-5000 cyclic entries)
02_cyclicpepedia.py — CyclicPepedia cross-reference
03_cpbind.py — Computed affinities from CPBind
04_pepbi.py — PEPBI thermodynamic data
Then Tier 2 papers one by one…
Then Tier 3 papers one by one…
Finally: collate.py — merge everything

Each script is independently runnable and produces a verifiable intermediate CSV.
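
One way to make an intermediate CSV verifiable is a small checker run against each output file. The sketch below assumes booleans are written as the strings "True"/"False" and that a missing numeric value appears as an empty cell or "NaN". The helper names are hypothetical.

```python
import csv

REQUIRED = [
    "source_id", "entry_id", "peptide_sequence", "cyclization_type",
    "peptide_length", "target_protein", "affinity_type",
    "affinity_is_binary", "assay_method", "is_binder", "affinity_is_computed",
]
CYCLIZATION_TYPES = {
    "disulfide", "head-to-tail", "thioether", "stapled_hydrocarbon",
    "stapled_lactam", "thioether_bipyridyl", "bicyclic", "CPPC_DRP", "other",
}
AFFINITY_UNITS = {"nM", "uM", "mM", "pM"}


def text(value):
    """Return a stripped string, treating None as empty."""
    return "" if value is None else str(value).strip()


def validate_row(row):
    """Return a list of schema problems for one row (empty if valid)."""
    problems = [f"missing {col}" for col in REQUIRED if not text(row.get(col))]
    ctype = text(row.get("cyclization_type"))
    if ctype and ctype not in CYCLIZATION_TYPES:
        problems.append(f"unknown cyclization_type: {ctype}")
    # affinity_value and affinity_unit are required unless the row is binary.
    if text(row.get("affinity_is_binary")).lower() != "true":
        value = text(row.get("affinity_value"))
        if not value or value.lower() == "nan":
            problems.append("missing affinity_value")
        if text(row.get("affinity_unit")) not in AFFINITY_UNITS:
            problems.append("missing or unknown affinity_unit")
    return problems


def validate_csv(path):
    """Map 1-based data row numbers to their problems for one CSV file."""
    report = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for number, row in enumerate(csv.DictReader(fh), start=1):
            problems = validate_row(row)
            if problems:
                report[number] = problems
    return report
```

An empty report from `validate_csv` means every row satisfies the required-column rules; it does not check that the values themselves are scientifically plausible.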
