title: Lemna-Data v2 Plan
tags: [inventory, lemna-data, pantry, data-management, architecture, plan]
created: 2026-05-06
updated: 2026-05-06
status: active
related: []

Lemna-Data v2 Plan

Overview

Git-inspired data management for the Lemna monorepo. artifacts.json is the manifest (committed to git). data/ is the working tree (gitignored). R2 is the remote. Storage is content-addressed: each object’s storage_key is the SHA-256 hash of its content. Nothing is uploaded implicitly; only push writes to R2.

Manifest (artifacts.json)

{
  "version": 2,
  "entries": {
    "data/pesto/1.4.1/model.pt": {
      "storage_key": "a3f2e8c1d7b0...b8c1",
      "size": 42189312,
      "created_at": "2026-05-04T14:30:00Z"
    }
  }
}

Fields:

  • storage_key — R2 object key (content-addressed SHA-256). null if added but not pushed.
  • size — file size in bytes
  • created_at — ISO timestamp when entry was added

No created_by. No hash (storage_key is the hash). No pattern.
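A minimal sketch of how a v2 manifest could be parsed into typed entries. The `Entry` dataclass and `load_manifest` helper are hypothetical names used only for illustration; the plan does not define a loader API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """One manifest entry; field names follow the list above."""
    storage_key: Optional[str]  # SHA-256 hex digest, or None if added but not pushed
    size: int                   # file size in bytes
    created_at: str             # ISO timestamp string, kept as-is


def load_manifest(text: str) -> dict:
    """Parse artifacts.json text into {path: Entry}. Hypothetical helper."""
    doc = json.loads(text)
    if doc.get("version") != 2:
        raise ValueError(f"unsupported manifest version: {doc.get('version')!r}")
    # Entry(**fields) rejects unknown keys, so v1 leftovers such as
    # created_by or hash fail with a TypeError instead of being silently kept.
    return {path: Entry(**fields) for path, fields in doc["entries"].items()}
```

Rejecting unknown keys is a design choice of this sketch, not something the plan requires.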

Data Patterns (.datapattern)

Uses .gitignore-style syntax, but with the opposite effect: the patterns select which files add and status track. Resides at repo root next to artifacts.json.

# Model weights
data/*/*/*.pt

# Training outputs
data/training/*.csv
data/training/*.json

# Benchmarks
data/benchmark/*.json

Rules:

  • One pattern per line
  • * matches anything except /
  • ** matches across directories
  • # comments
  • ! negation (exclude a pattern)
  • Blank lines ignored
  • If .datapattern is missing, add and status require explicit paths
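The rules above could be implemented roughly as follows. This is a sketch under two assumptions the plan does not state: later lines override earlier ones (as in .gitignore), and patterns must match the whole repo-relative path. The function names `parse_patterns` and `is_tracked` are hypothetical.

```python
import re


def _to_regex(pattern: str) -> re.Pattern:
    """Translate one pattern: '**' crosses '/', a single '*' does not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def parse_patterns(text: str) -> list:
    """Return [(negated, regex)] for each rule, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        rules.append((negated, _to_regex(line[1:] if negated else line)))
    return rules


def is_tracked(path: str, rules: list) -> bool:
    """Assumption: the last matching rule decides, as in .gitignore."""
    tracked = False
    for negated, regex in rules:
        if regex.fullmatch(path):
            tracked = not negated
    return tracked
```

Unlike real .gitignore matching, this simplified translation does not let `**/` match zero directories.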

Commands

add

lemna-data add <path>        # add one file
lemna-data add data/pesto/   # add all matching files under dir
lemna-data add --all         # add all untracked files matching .datapattern

Behavior:

  • Hash the file (SHA-256)
  • Add entry to artifacts.json with storage_key: null, size, created_at
  • Does NOT upload to R2
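A sketch of the add step, assuming the manifest entries are held as a plain dict. `sha256_file` and `add_entry` are hypothetical helpers. The plan says add hashes the file but stores storage_key: null, so this sketch simply returns the digest to the caller rather than persisting it.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model weights are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_entry(entries: dict, root: Path, rel_path: str) -> str:
    """Record rel_path in the manifest entries without uploading anything."""
    file_path = root / rel_path
    digest = sha256_file(file_path)
    entries[rel_path] = {
        "storage_key": None,  # filled in later by push
        "size": file_path.stat().st_size,
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return digest
```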

rm

lemna-data rm <path>         # remove entry from manifest
lemna-data rm --cached <path> # remove from manifest, keep on disk

Does NOT delete from R2 (immutable). Does NOT delete from disk by default.

status

lemna-data status

  synced    data/pesto/1.4.1/model.pt       (42MB, key: a3f2...)
  synced    data/training/metrics.csv        (2KB, key: 7d91...)
  added     data/pesto/1.5.0/model.pt       (42MB, not pushed)
  modified  data/training/config.json        (disk hash ≠ manifest key)
  missing   data/pesto/1.3.0/model.pt       (not on disk, key: b4c0...)
  untracked data/new_experiment/results.csv   (matches .datapattern, not in manifest)
  deleted   data/old/benchmark.csv           (in manifest, not on disk, no .datapattern match)

Categories:

Status     Meaning
synced     On disk + manifest + R2. Hashes match.
added      In manifest with storage_key: null. Not pushed yet.
modified   In manifest + on disk, but disk content hash ≠ storage_key. Needs re-add.
missing    In manifest with storage_key, not on disk. Needs pull.
untracked  On disk, matches .datapattern, not in manifest. Needs add.
deleted    In manifest, not on disk, no .datapattern match. Likely needs rm.
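The categories can be read as a small decision function. The sketch below uses a hypothetical `classify` helper and makes two assumptions beyond the plan: a non-null storage_key is taken to mean the object is on R2 (no remote check), and an entry that is absent from disk counts as missing only when it has a key and still matches .datapattern, otherwise as deleted.

```python
from typing import Optional


def classify(in_manifest: bool, storage_key: Optional[str],
             disk_hash: Optional[str], matches_pattern: bool) -> Optional[str]:
    """Return a status name, or None if the path is of no interest.

    disk_hash is the SHA-256 of the local file, or None when it is not on disk.
    """
    on_disk = disk_hash is not None
    if not in_manifest:
        return "untracked" if on_disk and matches_pattern else None
    if not on_disk:
        # Assumption: .datapattern membership separates "missing" from "deleted".
        return "missing" if storage_key and matches_pattern else "deleted"
    if storage_key is None:
        return "added"
    return "synced" if disk_hash == storage_key else "modified"
```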

push

lemna-data push              # push all added/modified entries
lemna-data push <path>       # push specific entry
lemna-data push --dry-run    # show what would be pushed

Behavior:

  • For each entry with storage_key: null: hash file → upload to R2 → write storage_key
  • For each modified entry: re-hash → re-upload (new key) → update manifest
  • R2 key = SHA-256 of content (content-addressed, deduplicated)
  • If content already exists on R2, skip upload (HEAD check)
  • Updates artifacts.json with new keys
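To illustrate the content-addressed upload and the deduplicating existence check, here is a sketch that replaces the working tree and the R2 bucket with in-memory dicts. The `push` function and its parameters are hypothetical; a real implementation would read files from disk and issue a HEAD request instead of a dict lookup.

```python
import hashlib


def push(entries: dict, files: dict, remote: dict) -> list:
    """Upload unpushed or modified content; return the paths whose keys changed.

    entries: manifest entries, {path: {"storage_key": str or None}}
    files:   stand-in for the working tree, {path: bytes}
    remote:  stand-in for the R2 bucket, {key: bytes}
    """
    changed = []
    for path, entry in entries.items():
        data = files.get(path)
        if data is None:
            continue  # not on disk: nothing to push
        key = hashlib.sha256(data).hexdigest()
        if entry["storage_key"] == key:
            continue  # already synced
        if key not in remote:  # stands in for the HEAD check
            remote[key] = data
        entry["storage_key"] = key
        changed.append(path)
    return changed
```

Two paths with identical content end up sharing one remote object, which is the deduplication the key scheme is meant to provide.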

pull

lemna-data pull              # pull all missing entries for this repo
lemna-data pull <path>       # pull specific file
lemna-data pull --extern <ns> # pull from external namespace
lemna-data pull --all        # pull all namespaces from export.json
lemna-data pull --pattern "data/training/*"  # sparse pull
lemna-data pull --dry-run    # show what would be downloaded

Behavior:

  • For each entry missing from disk with a storage_key: download from R2
  • Verify downloaded content hash matches storage_key
  • Write to disk at manifest path
  • Skip if file already exists and hash matches
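A sketch of the pull loop with hash verification before anything is written to disk. The `pull` function and the dict standing in for R2 are hypothetical. As a conservative assumption, files that already exist locally are left untouched even if their hash differs; such files would show up as modified in status.

```python
import hashlib
from pathlib import Path


def pull(entries: dict, root: Path, remote: dict) -> list:
    """Download entries missing from disk; return the paths that were fetched."""
    fetched = []
    for rel_path, entry in entries.items():
        key = entry["storage_key"]
        if key is None:
            continue  # added but never pushed: nothing to download
        target = root / rel_path
        if target.exists():
            continue  # assumption: never overwrite local files
        data = remote[key]  # stands in for an R2 GET
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"hash mismatch for {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        fetched.append(rel_path)
    return fetched
```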

resolve() (Python API)

from lemna_data import resolve

# local repo data
path = resolve("data/pesto/1.4.1/model.pt")

# external repo data
path = resolve("data/pesto/1.4.1/model.pt", extern="pesto")

# with templates
path = resolve("data/{model}/{version}/model.pt",
               extern="pesto", model="pesto", version="1.4.1")

Behavior:

  • Look up path in relevant artifacts.json
  • If on disk: return path
  • If not on disk + storage_key exists: auto-download from R2, return path
  • If not on disk + no storage_key: raise KeyError
  • Never mutates manifest. Pure read + lazy hydrate.
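The lookup and lazy-hydrate behavior might look like the sketch below. It is not the lemna_data implementation: `resolve_sketch`, its `entries`, `root` and `fetch` parameters are hypothetical, the extern lookup is omitted, and download verification is left out for brevity. It also assumes that a path absent from the manifest raises KeyError, which the plan does not specify.

```python
from pathlib import Path
from typing import Callable


def resolve_sketch(template: str, entries: dict, root: Path,
                   fetch: Callable[[str], bytes], **fields: str) -> Path:
    """Return a local path for a manifest entry, downloading it if needed."""
    rel_path = template.format(**fields)  # fills {model}, {version}, ...
    entry = entries[rel_path]             # KeyError if not in the manifest
    target = root / rel_path
    if target.exists():
        return target
    key = entry["storage_key"]
    if key is None:
        raise KeyError(f"{rel_path} is not on disk and has no storage_key")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(fetch(key))        # entries is never modified
    return target
```

Passing `fetch` in as a callable keeps the sketch testable without network access.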

verify

lemna-data verify             # check all local files match manifest
lemna-data verify <path>      # check specific file

Re-hash every local file. Compare to storage_key. Report mismatches. Like git fsck.
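A short sketch of that integrity check, using a hypothetical `verify` function over dict-based manifest entries. It skips entries that are not on disk or have no storage_key yet, since there is nothing to compare in either case.

```python
import hashlib
from pathlib import Path


def verify(entries: dict, root: Path) -> list:
    """Return the sorted paths whose local content hash differs from storage_key."""
    mismatches = []
    for rel_path, entry in entries.items():
        target = root / rel_path
        if entry["storage_key"] is None or not target.is_file():
            continue  # unpushed or not materialized: nothing to compare
        if hashlib.sha256(target.read_bytes()).hexdigest() != entry["storage_key"]:
            mismatches.append(rel_path)
    return sorted(mismatches)
```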

check

lemna-data check              # AST scan across all repos in export.json

Scans Python files for resolve() / add() calls. Reports:

  • Dangling: resolve() references a path not in any manifest
  • Unverified: resolve() with variable/f-string args (can’t verify statically)
  • Untracked: files on disk matching .datapattern but not in any manifest
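The dangling and unverified reports can be sketched with Python's standard ast module. `scan_resolve_calls` is a hypothetical name; the sketch only inspects the first positional argument of calls named resolve and does not handle template paths such as "data/{model}/...", which it would report as dangling.

```python
import ast


def scan_resolve_calls(source: str, known_paths: set) -> dict:
    """Classify resolve() calls in one Python source string."""
    report = {"dangling": [], "unverified": []}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both resolve(...) and module.resolve(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "resolve" or not node.args:
            continue
        arg = node.args[0]
        if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
            if arg.value not in known_paths:
                report["dangling"].append((node.lineno, arg.value))
        else:
            # Variables, f-strings and other expressions cannot be checked statically.
            report["unverified"].append(node.lineno)
    report["dangling"].sort()
    report["unverified"].sort()
    return report
```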

diff

lemna-data diff              # diff working tree vs manifest
lemna-data diff main..v1.5   # diff manifest between git refs
lemna-data diff --extern pesto main..v1.5

Shows what entries were added/removed/modified between two states.
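Once the two manifest states are loaded, the comparison itself is a set operation over entry paths. The sketch below uses a hypothetical `diff_manifests` function on dict-based entries; how each state is obtained (for example via `git show <ref>:artifacts.json`) is an assumption and not shown.

```python
def diff_manifests(old: dict, new: dict) -> dict:
    """Compare two {path: entry} mappings by path and storage_key."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys()
            if old[path]["storage_key"] != new[path]["storage_key"]
        ),
    }
```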

log

lemna-data log               # git log for artifacts.json changes
lemna-data log --oneline     # compact format
lemna-data log <path>        # history of a specific data entry

Parses git log for commits touching artifacts.json. Shows when data was added, what changed, by whom. Data lineage from git history.

tree

lemna-data tree

data/
├── pesto/
│   ├── 1.4.1/ (2 files, 76MB) ✓
│   └── 1.5.0/ (1 file, 42MB) ↑
└── training/ (2 files, 4KB) ✓

✓ synced  ↑added  ↓missing  ~modified

gc

lemna-data gc                # list R2 objects not in any manifest
lemna-data gc --prune        # delete orphaned R2 objects

Lists (and optionally deletes) R2 objects that no manifest references. Like garbage collection.
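Orphan detection reduces to a set difference between the keys present in R2 and the keys referenced by the manifests. `find_orphans` is a hypothetical helper; it only considers the manifests passed in, so which manifests (and which git refs) to include is left to the caller.

```python
def find_orphans(remote_keys, manifests) -> list:
    """Return remote keys that none of the given manifests reference.

    remote_keys: iterable of object keys listed from the bucket
    manifests:   iterable of {path: {"storage_key": str or None}} mappings
    """
    referenced = {
        entry["storage_key"]
        for entries in manifests
        for entry in entries.values()
        if entry["storage_key"] is not None
    }
    return sorted(set(remote_keys) - referenced)
```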

File Layout (per repo)

models/pesto/pesto/
├── artifacts.json     ← committed (manifest)
├── .datapattern       ← committed (track patterns)
├── .dataignore        ← committed (exclude patterns, optional)
├── src/               ← committed (code)
├── data/              ← gitignored (local materialization)
│   ├── pesto/
│   │   ├── 1.4.1/
│   │   └── 1.5.0/
│   └── training/
└── pyproject.toml

Monorepo root:

.
├── export.json        ← committed (namespace → repo path map)
├── models/pesto/pesto/artifacts.json
├── models/carbonara/carbonara/artifacts.json
└── pipeline/services/endpoint-pesto/artifacts.json

.dataignore (optional)

Like .gitignore but for data. Excludes paths from add --all and status untracked detection.

*.tmp
wandb/
__pycache__/
.ipynb_checkpoints/

If absent, only .datapattern controls what gets tracked.

Full Workflow

DAY 1 — SETUP ──────────────────────────────────────────

  git clone monorepo
  lemna-data pull               # hydrate current repo
  # or:
  lemna-data pull --extern pesto # only pesto data
  # or: do nothing, resolve() lazy-hydrates on demand


DAY 2 — PRODUCE NEW DATA ───────────────────────────────

  python train.py                # writes data/pesto/1.5.0/model.pt
  lemna-data status              # shows "untracked" for new file
  lemna-data add --all           # add all .datapattern matches
  lemna-data push                # upload to R2, update manifest keys
  git add artifacts.json
  git commit -m "add pesto v1.5.0"


DAY 3 — MODIFY EXISTING DATA ───────────────────────────

  python retrain.py              # overwrites data/pesto/1.5.0/model.pt
  lemna-data status              # shows "modified" (hash mismatch)
  lemna-data add data/pesto/1.5.0/model.pt  # re-hash, update entry
  lemna-data push                # upload new version (new key)
  git commit -am "update pesto v1.5.0 weights"


DAY 4 — CONSUME FROM ANOTHER REPO ──────────────────────

  # in carbonara repo:
  path = resolve("data/pesto/1.5.0/model.pt", extern="pesto")
  # first call: auto-downloads from R2
  # subsequent calls: returns cached local path


DAY 5 — DISASTER RECOVERY ─────────────────────────────

  rm -rf data/                   # uh oh
  lemna-data pull                # re-download everything from R2
  # or just: python run_inference.py  # resolve() auto-hydrates


DAY 6 — REVIEW ────────────────────────────────────────

  lemna-data verify              # integrity check
  lemna-data diff main..dev      # what data changed on this branch?
  lemna-data log                 # when was this data added?
  lemna-data tree                # visual overview

What’s Dropped From v1

v1             v2                 Reason
create()       add                Writing data is the user’s job. Manifest tracks it after.
register()     add                Same. If it exists on disk, add it.
pattern field  .datapattern file  Patterns don’t belong in the manifest. They’re config.
created_by     Dropped            Git log + lemna-data log gives better provenance.
hash field     Dropped            storage_key IS the hash (content-addressed).
find_root()    Kept               Still needed. Same behavior.
export.json    Kept               Namespace registry for extern.
check()        Kept               AST scan. Simplified (no pattern matching).

Implementation Priority

  1. Core: add, status, push, pull, manifest v2 model
  2. API: resolve() with extern + lazy auto-hydrate
  3. Verification: verify, check
  4. UX: tree, diff, log
  5. Maintenance: gc, .dataignore
  6. Advanced: sparse pull, rm --cached, --dry-run flags