title: Lemna-Data v2 Plan
tags: [inventory, lemna-data, pantry, data-management, architecture, plan]
created: 2026-05-06
updated: 2026-05-06
status: active
related: []
Lemna-Data v2 Plan
Overview
Git-inspired data management for the Lemna monorepo. artifacts.json is the manifest (committed to git). data/ is the working tree (gitignored). R2 is the remote. Content-addressed storage means storage_key = hash. No implicit uploads.
Manifest (artifacts.json)
{
"version": 2,
"entries": {
"data/pesto/1.4.1/model.pt": {
"storage_key": "a3f2e8c1d7b0...b8c1",
"size": 42189312,
"created_at": "2026-05-04T14:30:00Z"
}
}
}
Fields:
- storage_key: R2 object key (content-addressed SHA-256). null if added but not pushed.
- size: file size in bytes
- created_at: ISO 8601 timestamp when the entry was added
No created_by. No hash (storage_key is the hash). No pattern.
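A minimal sketch of how the v2 manifest could be modeled in Python, assuming the three fields above are the complete schema. The names Entry and parse_manifest are hypothetical, not part of lemna_data.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    storage_key: Optional[str]  # SHA-256 hex digest, or None if not pushed yet
    size: int                   # file size in bytes
    created_at: str             # ISO 8601 timestamp


def parse_manifest(text: str) -> dict[str, Entry]:
    """Parse artifacts.json text into {path: Entry}, rejecting other versions."""
    raw = json.loads(text)
    if raw.get("version") != 2:
        raise ValueError(f"unsupported manifest version: {raw.get('version')!r}")
    # Entry(**fields) raises TypeError on unknown keys such as created_by or hash.
    return {path: Entry(**fields) for path, fields in raw["entries"].items()}
```

Because Entry accepts only the three v2 fields, a leftover v1 field fails at load time instead of being silently carried along.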
Data Patterns (.datapattern)
Like .gitignore, but determines what add and status track. Resides at repo root next to artifacts.json.
# Model weights
data/*/*/*.pt
# Training outputs
data/training/*.csv
data/training/*.json
# Benchmarks
data/benchmark/*.json
Rules:
- One pattern per line
- * matches anything except /
- ** matches across directories
- # comments
- ! negation (exclude a pattern)
- Blank lines ignored
- If .datapattern is missing, add and status require explicit paths
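The rules above can be sketched as a small matcher. This is an illustration under two assumptions the plan does not state: patterns are anchored to the whole repo-relative path, and (as in .gitignore) the last matching rule wins. The function names are hypothetical.

```python
import re


def compile_pattern(pattern: str) -> re.Pattern:
    """Translate one .datapattern glob into an anchored regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # ** crosses directory boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # * stops at a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


def parse_rules(text: str) -> list[tuple[bool, re.Pattern]]:
    """Return (is_negation, regex) pairs, skipping blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        rules.append((negated, compile_pattern(line[1:] if negated else line)))
    return rules


def is_tracked(path: str, rules: list[tuple[bool, re.Pattern]]) -> bool:
    """Assumption: the last matching rule decides, as in .gitignore."""
    tracked = False
    for negated, regex in rules:
        if regex.match(path):
            tracked = not negated
    return tracked
```

With these semantics, a pattern with two single-star directory segments does not match a path three directories deep; only ** spans multiple levels.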
Commands
add
lemna-data add <path> # add one file
lemna-data add data/pesto/ # add all matching files under dir
lemna-data add --all # add all untracked files matching .datapattern
Behavior:
- Hash the file (SHA-256)
- Add entry to artifacts.json with storage_key: null, size, created_at
- Does NOT upload to R2
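A sketch of the add step, assuming the manifest is held as a plain dict shaped like artifacts.json. sha256_file and add_entry are hypothetical helper names; the hash is returned to the caller rather than stored, since the entry keeps storage_key null until push.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weights never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_entry(manifest: dict, root: Path, rel_path: str) -> str:
    """Record rel_path in the manifest without uploading; returns the content hash."""
    file_path = root / rel_path
    content_hash = sha256_file(file_path)
    manifest["entries"][rel_path] = {
        "storage_key": None,  # stays null until push uploads the content
        "size": file_path.stat().st_size,
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return content_hash
```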
rm
lemna-data rm <path> # remove entry from manifest
lemna-data rm --cached <path> # remove from manifest, keep on disk
Does NOT delete from R2 (immutable). Does NOT delete from disk by default.
status
lemna-data status
synced data/pesto/1.4.1/model.pt (42MB, key: a3f2...)
synced data/training/metrics.csv (2KB, key: 7d91...)
added data/pesto/1.5.0/model.pt (42MB, not pushed)
modified data/training/config.json (disk hash ≠ manifest key)
missing data/pesto/1.3.0/model.pt (not on disk, key: b4c0...)
untracked data/new_experiment/results.csv (matches .datapattern, not in manifest)
deleted data/old/benchmark.csv (in manifest, not on disk, no .datapattern match)
Categories:
| Status | Meaning |
|---|---|
| synced | On disk + manifest + R2. Hashes match. |
| added | In manifest with storage_key: null. Not pushed yet. |
| modified | In manifest + on disk, but disk content hash ≠ storage_key. Needs re-add. |
| missing | In manifest with storage_key, not on disk. Needs pull. |
| untracked | On disk, matches .datapattern, not in manifest. Needs add. |
| deleted | In manifest, not on disk, no .datapattern match. Likely needs rm. |
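The table can be read as a decision function over four facts about a path. The sketch below is one possible reading, not the tool's logic. It assumes a non-null storage_key means the content is already on R2, and it treats an unpushed entry whose file has vanished as deleted, a case the table leaves open. classify is a hypothetical name.

```python
from typing import Optional


def classify(in_manifest: bool, storage_key: Optional[str],
             disk_hash: Optional[str], matches_pattern: bool) -> Optional[str]:
    """Map one path to a status; disk_hash is None when the file is absent."""
    on_disk = disk_hash is not None
    if not in_manifest:
        # Files outside .datapattern are invisible to status.
        return "untracked" if on_disk and matches_pattern else None
    if not on_disk:
        # Assumption: without a pattern match or a key to pull, treat as deleted.
        return "missing" if storage_key and matches_pattern else "deleted"
    if storage_key is None:
        return "added"
    return "synced" if disk_hash == storage_key else "modified"
```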
push
lemna-data push # push all added/modified entries
lemna-data push <path> # push specific entry
lemna-data push --dry-run # show what would be pushed
Behavior:
- For each entry with storage_key: null: hash file → upload to R2 → write storage_key
- For each modified entry: re-hash → re-upload (new key) → update manifest
- R2 key = SHA-256 of content (content-addressed, deduplicated)
- If content already exists on R2, skip upload (HEAD check)
- Updates artifacts.json with new keys
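The push behavior can be sketched without any network code by letting a dict stand in for the R2 bucket; that stand-in and the push function signature are assumptions for illustration only. A membership test plays the role of the HEAD check.

```python
import hashlib
from pathlib import Path


def push(manifest: dict, root: Path, remote: dict[str, bytes]) -> list[str]:
    """Upload added/modified entries; `remote` is a dict standing in for the R2 bucket."""
    uploaded = []
    for rel_path, entry in manifest["entries"].items():
        file_path = root / rel_path
        if not file_path.exists():
            continue                      # missing entries are pull's job
        data = file_path.read_bytes()     # a real tool would stream large files
        key = hashlib.sha256(data).hexdigest()
        if entry["storage_key"] == key:
            continue                      # already synced
        if key not in remote:             # stands in for the HEAD check
            remote[key] = data
            uploaded.append(key)
        entry["storage_key"] = key        # covers both null and stale keys
        entry["size"] = len(data)
    return uploaded
```

Two paths with identical bytes produce one upload and share one key, which is the deduplication that content addressing provides.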
pull
lemna-data pull # pull all missing entries for this repo
lemna-data pull <path> # pull specific file
lemna-data pull --extern <ns> # pull from external namespace
lemna-data pull --all # pull all namespaces from export.json
lemna-data pull --pattern "data/training/*" # sparse pull
lemna-data pull --dry-run # show what would be downloaded
Behavior:
- For each entry missing from disk with a storage_key: download from R2
- Verify downloaded content hash matches storage_key
- Write to disk at manifest path
- Skip if file already exists and hash matches
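A matching sketch of pull, again with a dict standing in for R2. One assumption goes beyond the list above: any file already on disk is left alone, even if its hash differs, so local modifications are never overwritten and are left for status or verify to report.

```python
import hashlib
from pathlib import Path


def pull(manifest: dict, root: Path, remote: dict[str, bytes]) -> list[str]:
    """Download missing entries; `remote` is a dict standing in for the R2 bucket."""
    fetched = []
    for rel_path, entry in manifest["entries"].items():
        key = entry["storage_key"]
        if key is None:
            continue  # never pushed, so there is nothing to download
        target = root / rel_path
        if target.exists():
            continue  # assumption: never overwrite files already on disk
        data = remote[key]
        # Verify before writing so a corrupt object never lands at the manifest path.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"hash mismatch for {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        fetched.append(rel_path)
    return fetched
```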
resolve() (Python API)
from lemna_data import resolve
# local repo data
path = resolve("data/pesto/1.4.1/model.pt")
# external repo data
path = resolve("data/pesto/1.4.1/model.pt", extern="pesto")
# with templates
path = resolve("data/{model}/{version}/model.pt",
    extern="pesto", model="pesto", version="1.4.1")
Behavior:
- Look up path in relevant artifacts.json
- If on disk: return path
- If not on disk + storage_key exists: auto-download from R2, return path
- If not on disk + no storage_key: raise KeyError
- Never mutates manifest. Pure read + lazy hydrate.
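The lookup order above might look roughly like this. It is a sketch, not the lemna_data implementation: resolve_sketch takes the manifest entries, the repo root, and a download callback explicitly, all of which are hypothetical parameters standing in for whatever the real resolve() discovers on its own (including the extern lookup).

```python
from pathlib import Path
from typing import Callable


def resolve_sketch(path_template: str, entries: dict, root: Path,
                   download: Callable[[str, Path], None], **params: str) -> Path:
    """Look up a manifest path and hydrate it on demand, never editing `entries`."""
    rel_path = path_template.format(**params)   # fill {model}, {version}, ...
    entry = entries[rel_path]                   # KeyError if the path is unknown
    local = root / rel_path
    if local.exists():
        return local
    if entry["storage_key"] is None:
        raise KeyError(f"{rel_path} is not on disk and was never pushed")
    local.parent.mkdir(parents=True, exist_ok=True)
    download(entry["storage_key"], local)       # hypothetical fetch-from-R2 hook
    return local
```

Passing the downloader in as a callback keeps the function a pure read over the manifest and makes the lazy-hydrate path easy to exercise with a fake.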
verify
lemna-data verify # check all local files match manifest
lemna-data verify <path> # check specific file
Re-hash every local file. Compare to storage_key. Report mismatches. Like git fsck.
check
lemna-data check # AST scan across all repos in export.json
Scans Python files for resolve() / add() calls. Reports:
- Dangling: resolve() references a path not in any manifest
- Unverified: resolve() with variable/f-string args (can’t verify statically)
- Untracked: files on disk matching .datapattern but not in any manifest
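The first two report categories can be sketched with the standard ast module. This is an illustration only: scan_resolve_calls is a hypothetical name, it inspects just the first positional argument, and it would report a template string such as "data/{model}/..." as dangling because it does no template expansion.

```python
import ast


def scan_resolve_calls(source: str, known_paths: set[str]) -> dict[str, list[str]]:
    """Classify resolve() calls in one Python source string as dangling or unverified."""
    report = {"dangling": [], "unverified": []}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both resolve(...) and module.resolve(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "resolve" or not node.args:
            continue
        first = node.args[0]
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            if first.value not in known_paths:
                report["dangling"].append(first.value)
        else:
            # Variables and f-strings cannot be checked without running the code.
            report["unverified"].append(ast.unparse(first))
    return report
```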
diff
lemna-data diff # diff working tree vs manifest
lemna-data diff main..v1.5 # diff manifest between git refs
lemna-data diff --extern pesto main..v1.5
Shows what entries were added/removed/modified between two states.
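Comparing two manifest states reduces to set operations over their entries. A minimal sketch, assuming both states are already loaded as "entries" mappings (for the git-ref form they could be read with git show <ref>:artifacts.json); diff_entries is a hypothetical name.

```python
def diff_entries(old: dict, new: dict) -> dict[str, list[str]]:
    """Compare two manifests' "entries" mappings by path and storage_key."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys()
            if old[path]["storage_key"] != new[path]["storage_key"]
        ),
    }
```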
log
lemna-data log # git log for artifacts.json changes
lemna-data log --oneline # compact format
lemna-data log <path> # history of a specific data entry
Parses git log for commits touching artifacts.json. Shows when data was added, what changed, by whom. Data lineage from git history.
tree
lemna-data tree
data/
├── pesto/
│ ├── 1.4.1/ (2 files, 76MB) ✓
│ └── 1.5.0/ (1 file, 42MB) ↑push
└── training/ (2 files, 4KB) ✓
✓ synced ↑added ↓missing ~modified
gc
lemna-data gc # list R2 objects not in any manifest
lemna-data gc --prune # delete orphaned R2 objects
Lists (and optionally deletes) R2 objects that no manifest references. Like garbage collection.
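Orphan detection is a set difference between the keys stored on R2 and the keys referenced by manifests. A sketch, assuming the bucket listing and the manifests are already loaded; find_orphans is a hypothetical name.

```python
def find_orphans(remote_keys: set[str], manifests: list[dict]) -> set[str]:
    """Return remote object keys that no manifest entry references."""
    referenced = {
        entry["storage_key"]
        for manifest in manifests
        for entry in manifest["entries"].values()
        if entry["storage_key"] is not None
    }
    return remote_keys - referenced
```

Note that this only sees the manifests it is given. If only the current checkout is scanned, objects referenced by older commits or other branches would look orphaned, so which manifests feed --prune is a design decision worth making explicit.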
File Layout (per repo)
models/pesto/pesto/
├── artifacts.json ← committed (manifest)
├── .datapattern ← committed (track patterns)
├── .dataignore ← committed (exclude patterns, optional)
├── src/ ← committed (code)
├── data/ ← gitignored (local materialization)
│ ├── pesto/
│ │ ├── 1.4.1/
│ │ └── 1.5.0/
│ └── training/
└── pyproject.toml
Monorepo root:
.
├── export.json ← committed (namespace → repo path map)
├── models/pesto/pesto/artifacts.json
├── models/carbonara/carbonara/artifacts.json
└── pipeline/services/endpoint-pesto/artifacts.json
.dataignore (optional)
Like .gitignore but for data. Excludes paths from add --all and status untracked detection.
*.tmp
wandb/
__pycache__/
.ipynb_checkpoints/
If absent, only .datapattern controls what gets tracked.
Full Workflow
DAY 1 — SETUP ──────────────────────────────────────────
git clone monorepo
lemna-data pull # hydrate current repo
# or:
lemna-data pull --extern pesto # only pesto data
# or: do nothing, resolve() lazy-hydrates on demand
DAY 2 — PRODUCE NEW DATA ───────────────────────────────
python train.py # writes data/pesto/1.5.0/model.pt
lemna-data status # shows "untracked" for new file
lemna-data add --all # add all .datapattern matches
lemna-data push # upload to R2, update manifest keys
git add artifacts.json
git commit -m "add pesto v1.5.0"
DAY 3 — MODIFY EXISTING DATA ───────────────────────────
python retrain.py # overwrites data/pesto/1.5.0/model.pt
lemna-data status # shows "modified" (hash mismatch)
lemna-data add data/pesto/1.5.0/model.pt # re-hash, update entry
lemna-data push # upload new version (new key)
git commit -am "update pesto v1.5.0 weights"
DAY 4 — CONSUME FROM ANOTHER REPO ──────────────────────
# in carbonara repo:
path = resolve("data/pesto/1.5.0/model.pt", extern="pesto")
# first call: auto-downloads from R2
# subsequent calls: returns cached local path
DAY 5 — DISASTER RECOVERY ─────────────────────────────
rm -rf data/ # uh oh
lemna-data pull # re-download everything from R2
# or just: python run_inference.py # resolve() auto-hydrates
DAY 6 — REVIEW ────────────────────────────────────────
lemna-data verify # integrity check
lemna-data diff main..dev # what data changed on this branch?
lemna-data log # when was this data added?
lemna-data tree # visual overview
What’s Dropped From v1
| v1 | v2 | Reason |
|---|---|---|
| create() | add | Writing data is the user’s job. Manifest tracks it after. |
| register() | add | Same. If it exists on disk, add it. |
| pattern field | .datapattern file | Patterns don’t belong in the manifest. They’re config. |
| created_by | Dropped | Git log + lemna-data log gives better provenance. |
| hash field | Dropped | storage_key IS the hash (content-addressed). |
| find_root() | Kept | Still needed. Same behavior. |
| export.json | Kept | Namespace registry for extern. |
| check() | Kept | AST scan. Simplified (no pattern matching). |
Implementation Priority
- Core: add, status, push, pull, manifest v2 model
- API: resolve() with extern + lazy auto-hydrate
- Verification: verify, check
- UX: tree, diff, log
- Maintenance: gc, .dataignore
- Advanced: sparse pull, rm --cached, --dry-run flags