title: Lemna-Data v2 Plan
tags: [inventory, lemna-data, pantry, data-management, architecture, plan]
created: 2026-05-06
updated: 2026-05-06
status: active
related: []

Lemna-Data v2 Plan

Overview

Git-inspired data management for the Lemna monorepo. artifacts.json is the manifest (committed to git), data/ is the working tree (gitignored), and R2 is the remote. Storage is content-addressed: each storage_key is the SHA-256 hash of the file's content. Nothing is uploaded implicitly; uploads happen only on an explicit push.

Manifest (`artifacts.json`)

{
  "version": 2,
  "entries": {
    "data/pesto/1.4.1/model.pt": {
      "storage_key": "a3f2e8c1d7b0...b8c1",
      "size": 42189312,
      "created_at": "2026-05-04T14:30:00Z"
    }
  }
}

Fields:

storage_key — R2 object key (content-addressed SHA-256). null if added but not pushed.
size — file size in bytes
created_at — ISO timestamp when entry was added

No created_by. No hash (storage_key is the hash). No pattern.
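As a rough illustration of the v2 manifest shape, the sketch below parses artifacts.json text into typed entries. The names `Entry` and `parse_manifest` are hypothetical and not part of the plan; only the three fields and the version number come from the manifest above.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Entry:
    storage_key: Optional[str]  # SHA-256 hex digest, or None if not pushed yet
    size: int                   # file size in bytes
    created_at: str             # ISO timestamp of when the entry was added


def parse_manifest(text: str) -> Dict[str, Entry]:
    """Parse artifacts.json text into {path: Entry}; reject other versions."""
    doc = json.loads(text)
    if doc.get("version") != 2:
        raise ValueError(f"unsupported manifest version: {doc.get('version')!r}")
    return {path: Entry(**fields) for path, fields in doc["entries"].items()}


# An entry that has been added but not pushed: storage_key is null.
example = """
{
  "version": 2,
  "entries": {
    "data/pesto/1.5.0/model.pt": {
      "storage_key": null,
      "size": 42189312,
      "created_at": "2026-05-04T14:30:00Z"
    }
  }
}
"""
entries = parse_manifest(example)
```

Because `Entry` accepts exactly the three documented fields, a leftover v1 field such as `created_by` would fail loudly with a `TypeError` instead of being silently carried along.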

Data Patterns (`.datapattern`)

Like .gitignore, but determines what add and status track. Resides at repo root next to artifacts.json.

# Model weights
data/*/*/*.pt

# Training outputs
data/training/*.csv
data/training/*.json

# Benchmarks
data/benchmark/*.json

Rules:

One pattern per line
* matches anything except /
** matches across directories
# comments
! negation (exclude a pattern)
Blank lines ignored
If .datapattern is missing, add and status require explicit paths
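The rules above can be sketched as a small glob-to-regex translator. This is a minimal sketch, not the planned implementation: the function names are hypothetical, and it assumes that the last matching rule wins (as in .gitignore), which the plan does not state.

```python
import re
from typing import List, Tuple


def compile_pattern(pattern: str) -> "re.Pattern[str]":
    """Translate a glob into a regex: '**' crosses '/', a single '*' does not."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def parse_rules(text: str) -> List[Tuple[bool, "re.Pattern[str]"]]:
    """Return (negated, regex) pairs, skipping blank lines and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        negated = line.startswith("!")
        rules.append((negated, compile_pattern(line[1:] if negated else line)))
    return rules


def is_tracked(path: str, rules: List[Tuple[bool, "re.Pattern[str]"]]) -> bool:
    """Assumption: the last rule that matches decides the outcome."""
    tracked = False
    for negated, regex in rules:
        if regex.fullmatch(path):
            tracked = not negated
    return tracked


rules = parse_rules(
    "# weights\n"
    "data/*/*/*.pt\n"
    "\n"
    "data/**/*.csv\n"
    "!data/scratch/*.csv\n"
)
```

One simplification to be aware of: here `data/**/x.csv` needs at least one directory between `data/` and `x.csv`, whereas git would also match `data/x.csv`.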

Commands

`add`

lemna-data add <path>        # add one file
lemna-data add data/pesto/   # add all matching files under dir
lemna-data add --all         # add all untracked files matching .datapattern

Behavior:

Hash the file (SHA-256)
Add entry to artifacts.json with storage_key: null, size, created_at
Does NOT upload to R2
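The add steps could look roughly like the following. `add_entry` and `sha256_file` are hypothetical names, and returning the hash to the caller is an assumption: the plan says the file is hashed during add but stores only `storage_key: null`, so the sketch hands the digest back rather than persisting it.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large weights never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_entry(manifest: dict, root: Path, rel_path: str) -> str:
    """Record rel_path in the manifest and return its content hash. No upload."""
    file_path = root / rel_path
    content_hash = sha256_file(file_path)
    manifest["entries"][rel_path] = {
        "storage_key": None,  # stays null until push writes the key
        "size": file_path.stat().st_size,
        "created_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return content_hash


# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "data" / "training" / "metrics.csv"
    target.parent.mkdir(parents=True)
    target.write_bytes(b"epoch,loss\n1,0.5\n")
    manifest = {"version": 2, "entries": {}}
    content_hash = add_entry(manifest, root, "data/training/metrics.csv")
```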

`rm`

lemna-data rm <path>         # remove entry from manifest
lemna-data rm --cached <path> # remove from manifest, keep on disk

Does NOT delete from R2 (immutable). Does NOT delete from disk by default.

`status`

lemna-data status

  synced    data/pesto/1.4.1/model.pt       (42MB, key: a3f2...)
  synced    data/training/metrics.csv        (2KB, key: 7d91...)
  added     data/pesto/1.5.0/model.pt       (42MB, not pushed)
  modified  data/training/config.json        (disk hash ≠ manifest key)
  missing   data/pesto/1.3.0/model.pt       (not on disk, key: b4c0...)
  untracked data/new_experiment/results.csv   (matches .datapattern, not in manifest)
  deleted   data/old/benchmark.csv           (in manifest, not on disk, no .datapattern match)

Categories:

| Status | Meaning |
| --- | --- |
| synced | On disk + manifest + R2. Hashes match. |
| added | In manifest with `storage_key: null`. Not pushed yet. |
| modified | In manifest + on disk, but disk content hash ≠ `storage_key`. Needs re-add. |
| missing | In manifest with `storage_key`, not on disk. Needs pull. |
| untracked | On disk, matches `.datapattern`, not in manifest. Needs add. |
| deleted | In manifest, not on disk, no `.datapattern` match. Likely needs rm. |
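The categories can be expressed as one pure function over four facts about a path. This is a sketch with a hypothetical `classify` helper. It makes two assumptions the table leaves open: a non-null `storage_key` is taken to mean the object is on R2, and an entry that is absent from disk with no `storage_key` is reported as deleted because there is nothing to pull.

```python
from typing import Optional


def classify(in_manifest: bool, storage_key: Optional[str],
             disk_hash: Optional[str], matches_pattern: bool) -> Optional[str]:
    """Return a status label, or None when the path is of no interest.

    disk_hash is the SHA-256 of the local file, or None if it is not on disk.
    """
    on_disk = disk_hash is not None
    if not in_manifest:
        return "untracked" if on_disk and matches_pattern else None
    if on_disk:
        if storage_key is None:
            return "added"
        return "synced" if disk_hash == storage_key else "modified"
    # In the manifest but absent from disk.
    if storage_key is not None and matches_pattern:
        return "missing"
    return "deleted"
```

Keeping the function free of I/O means the hashing and pattern matching can be tested separately from the decision logic.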

`push`

lemna-data push              # push all added/modified entries
lemna-data push <path>       # push specific entry
lemna-data push --dry-run    # show what would be pushed

Behavior:

For each entry with storage_key: null: hash file → upload to R2 → write storage_key
For each modified entry: re-hash → re-upload (new key) → update manifest
R2 key = SHA-256 of content (content-addressed, deduplicated)
If content already exists on R2, skip upload (HEAD check)
Updates artifacts.json with new keys
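To make the deduplication step concrete, here is a sketch of push against an in-memory stand-in for R2. `MemoryRemote`, its `exists`/`put` methods, and the `files` dict standing in for the working tree are all hypothetical; a real client would issue a HEAD request where `exists` is called.

```python
import hashlib
from typing import Dict, List


class MemoryRemote:
    """Stand-in for R2: objects keyed by content hash (hypothetical interface)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.uploads = 0

    def exists(self, key: str) -> bool:  # plays the role of the HEAD check
        return key in self.objects

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.uploads += 1


def push(manifest: dict, files: Dict[str, bytes], remote: MemoryRemote,
         dry_run: bool = False) -> List[str]:
    """Upload added/modified entries; return the paths that needed a push."""
    pushed = []
    for path, entry in manifest["entries"].items():
        data = files.get(path)
        if data is None:
            continue  # not on disk, nothing to push
        key = hashlib.sha256(data).hexdigest()
        if entry["storage_key"] == key:
            continue  # already synced
        pushed.append(path)
        if dry_run:
            continue  # report only, leave manifest and remote untouched
        if not remote.exists(key):
            remote.put(key, data)  # identical content is uploaded once
        entry["storage_key"] = key
        entry["size"] = len(data)
    return pushed


remote = MemoryRemote()
files = {"data/a.csv": b"same", "data/b.csv": b"same"}
manifest = {"version": 2, "entries": {
    "data/a.csv": {"storage_key": None, "size": 4, "created_at": "2026-05-04T14:30:00Z"},
    "data/b.csv": {"storage_key": None, "size": 4, "created_at": "2026-05-04T14:30:00Z"},
}}
planned = push(manifest, files, remote, dry_run=True)
pushed = push(manifest, files, remote)
```

In the demo both paths hold identical bytes, so both entries receive the same key while the remote sees a single upload.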

`pull`

lemna-data pull              # pull all missing entries for this repo
lemna-data pull <path>       # pull specific file
lemna-data pull --extern <ns> # pull from external namespace
lemna-data pull --all        # pull all namespaces from export.json
lemna-data pull --pattern "data/training/*"  # sparse pull
lemna-data pull --dry-run    # show what would be downloaded

Behavior:

For each entry missing from disk with a storage_key: download from R2
Verify downloaded content hash matches storage_key
Write to disk at manifest path
Skip if file already exists and hash matches
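The pull steps might be sketched as follows, with a plain dict standing in for R2. The `pull` signature is hypothetical. The sketch also assumes that any file already present on disk is left alone, so a locally modified file is never overwritten; the plan only specifies the case where the hash matches.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def pull(manifest: dict, root: Path, remote: Dict[str, bytes]) -> List[str]:
    """Download entries that are missing from disk; return the restored paths."""
    downloaded = []
    for rel_path, entry in manifest["entries"].items():
        key = entry["storage_key"]
        if key is None:
            continue  # added but never pushed, nothing to fetch
        target = root / rel_path
        if target.exists():
            continue  # assumption: never overwrite a file that is on disk
        data = remote[key]
        # Verify before writing so a corrupt download never reaches disk.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"hash mismatch for {rel_path}")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        downloaded.append(rel_path)
    return downloaded


payload = b"model weights"
key = hashlib.sha256(payload).hexdigest()
manifest = {"version": 2, "entries": {
    "data/pesto/1.4.1/model.pt": {
        "storage_key": key, "size": len(payload),
        "created_at": "2026-05-04T14:30:00Z",
    },
}}
with tempfile.TemporaryDirectory() as tmp:
    first = pull(manifest, Path(tmp), {key: payload})
    second = pull(manifest, Path(tmp), {key: payload})
    restored = (Path(tmp) / "data/pesto/1.4.1/model.pt").read_bytes()
```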

`resolve()` (Python API)

from lemna_data import resolve
 
# local repo data
path = resolve("data/pesto/1.4.1/model.pt")
 
# external repo data
path = resolve("data/pesto/1.4.1/model.pt", extern="pesto")
 
# with templates
path = resolve("data/{model}/{version}/model.pt",
               extern="pesto", model="pesto", version="1.4.1")

Behavior:

Look up path in relevant artifacts.json
If on disk: return path
If not on disk + storage_key exists: auto-download from R2, return path
If not on disk + no storage_key: raise KeyError
Never mutates manifest. Pure read + lazy hydrate.
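The lookup order above can be sketched like this. `resolve_sketch` is deliberately not the real `resolve()` signature: the manifest, root directory, and a `fetch` callable are passed in explicitly (all hypothetical) so the example runs without R2, and hash verification of the download is left out for brevity.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def resolve_sketch(template: str, manifest: dict, root: Path,
                   fetch: Callable[[str], bytes], **params: str) -> Path:
    """Return a local path for the entry, downloading it on first use."""
    rel_path = template.format(**params)  # fill {model}, {version}, ...
    entry = manifest["entries"].get(rel_path)
    if entry is None:
        raise KeyError(f"{rel_path} is not in the manifest")
    target = root / rel_path
    if target.exists():
        return target  # already hydrated
    if entry["storage_key"] is None:
        raise KeyError(f"{rel_path} is not on disk and has no storage_key")
    data = fetch(entry["storage_key"])  # lazy hydrate; manifest is untouched
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


calls: List[str] = []


def fake_fetch(key: str) -> bytes:
    """Record each download so the demo can show that caching works."""
    calls.append(key)
    return b"weights"


# "k1" is a placeholder key for the demo, not a real SHA-256 digest.
manifest = {"version": 2, "entries": {
    "data/pesto/1.4.1/model.pt": {
        "storage_key": "k1", "size": 7, "created_at": "2026-05-04T14:30:00Z",
    },
}}
with tempfile.TemporaryDirectory() as tmp:
    p1 = resolve_sketch("data/{model}/{version}/model.pt", manifest, Path(tmp),
                        fake_fetch, model="pesto", version="1.4.1")
    p2 = resolve_sketch("data/pesto/1.4.1/model.pt", manifest, Path(tmp), fake_fetch)
    content = p1.read_bytes()
```

The second call returns the cached file, so `fake_fetch` runs only once.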

`verify`

lemna-data verify             # check all local files match manifest
lemna-data verify <path>      # check specific file

Re-hash every local file. Compare to storage_key. Report mismatches. Like git fsck.

`check`

lemna-data check              # AST scan across all repos in export.json

Scans Python files for resolve() / add() calls. Reports:

Dangling: resolve() references a path not in any manifest
Unverified: resolve() with variable/f-string args (can’t verify statically)
Untracked: files on disk matching .datapattern but not in any manifest
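The dangling and unverified cases can be detected with Python's standard `ast` module. The sketch below is hypothetical in its function name and return shape, covers only `resolve()` calls, and assumes that literal strings containing `{` are templates and therefore count as unverified.

```python
import ast
from typing import List, Set, Tuple


def scan_resolve_calls(source: str, known_paths: Set[str]
                       ) -> Tuple[List[Tuple[int, str]], List[int]]:
    """Return (dangling, unverified) findings for resolve() calls in source.

    dangling: (line, path) for literal paths missing from every manifest.
    unverified: lines whose first argument cannot be checked statically.
    """
    dangling, unverified = [], []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or not node.args:
            continue
        func = node.func
        # Accept both resolve(...) and lemna_data.resolve(...).
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "resolve":
            continue
        first = node.args[0]
        is_literal = isinstance(first, ast.Constant) and isinstance(first.value, str)
        if is_literal and "{" not in first.value:
            if first.value not in known_paths:
                dangling.append((node.lineno, first.value))
        else:
            unverified.append(node.lineno)  # variable, f-string, or template
    return sorted(dangling), sorted(unverified)


source = "\n".join([
    "from lemna_data import resolve",
    'a = resolve("data/pesto/1.4.1/model.pt")',
    'b = resolve("data/pesto/9.9.9/model.pt")',
    'c = resolve(f"data/{name}/model.pt")',
])
dangling, unverified = scan_resolve_calls(source, {"data/pesto/1.4.1/model.pt"})
```

The scanned source is only parsed, never executed, so undefined names such as `name` in the f-string are harmless.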

`diff`

lemna-data diff              # diff working tree vs manifest
lemna-data diff main..v1.5   # diff manifest between git refs
lemna-data diff --extern pesto main..v1.5

Shows what entries were added/removed/modified between two states.
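Once two manifest states are loaded, the comparison itself is a few set operations. `diff_entries` is a hypothetical helper; it assumes both arguments are the `entries` mapping of a manifest and treats a changed `storage_key` as a modification. Loading the manifest at a git ref (for example with `git show <ref>:artifacts.json`) is outside this sketch.

```python
from typing import Dict, List


def diff_entries(old: Dict[str, dict], new: Dict[str, dict]) -> Dict[str, List[str]]:
    """Compare two 'entries' mappings and group paths by kind of change."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(
        path for path in old.keys() & new.keys()
        if old[path]["storage_key"] != new[path]["storage_key"]
    )
    return {"added": added, "removed": removed, "modified": modified}


old = {
    "data/pesto/1.4.1/model.pt": {"storage_key": "aaa"},
    "data/training/metrics.csv": {"storage_key": "bbb"},
    "data/old/benchmark.csv": {"storage_key": "ccc"},
}
new = {
    "data/pesto/1.4.1/model.pt": {"storage_key": "aaa"},
    "data/training/metrics.csv": {"storage_key": "ddd"},
    "data/pesto/1.5.0/model.pt": {"storage_key": None},
}
changes = diff_entries(old, new)
```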

`log`

lemna-data log               # git log for artifacts.json changes
lemna-data log --oneline     # compact format
lemna-data log <path>        # history of a specific data entry

Parses git log for commits touching artifacts.json. Shows when data was added, what changed, by whom. Data lineage from git history.

`tree`

lemna-data tree

data/
├── pesto/
│   ├── 1.4.1/ (2 files, 76MB) ✓
│   └── 1.5.0/ (1 file, 42MB) ↑push
└── training/ (2 files, 4KB) ✓

✓ synced  ↑added  ↓missing  ~modified

`gc`

lemna-data gc                # list R2 objects not in any manifest
lemna-data gc --prune        # delete orphaned R2 objects

Lists (and optionally deletes) R2 objects that no manifest references. Like garbage collection.
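Finding orphans reduces to a set difference between the keys on R2 and the keys referenced by manifests. `find_orphans` is a hypothetical helper, and the sketch assumes the caller has already listed the remote keys and loaded every manifest named in export.json.

```python
from typing import Iterable, List


def find_orphans(remote_keys: Iterable[str], manifests: Iterable[dict]) -> List[str]:
    """Return remote keys that no manifest entry references."""
    referenced = {
        entry["storage_key"]
        for manifest in manifests
        for entry in manifest["entries"].values()
        if entry["storage_key"] is not None  # unpushed entries reference nothing
    }
    return sorted(set(remote_keys) - referenced)


manifests = [
    {"version": 2, "entries": {
        "data/pesto/1.4.1/model.pt": {"storage_key": "k1"},
        "data/pesto/1.5.0/model.pt": {"storage_key": None},
    }},
    {"version": 2, "entries": {
        "data/training/metrics.csv": {"storage_key": "k2"},
    }},
]
orphans = find_orphans(["k1", "k2", "k3"], manifests)
```

One design point worth deciding before enabling `--prune`: if only the manifests at the current checkout are consulted, keys referenced solely by older commits or other branches would look orphaned.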

File Layout (per repo)

models/pesto/pesto/
├── artifacts.json     ← committed (manifest)
├── .datapattern       ← committed (track patterns)
├── .dataignore        ← committed (exclude patterns, optional)
├── src/               ← committed (code)
├── data/              ← gitignored (local materialization)
│   ├── pesto/
│   │   ├── 1.4.1/
│   │   └── 1.5.0/
│   └── training/
└── pyproject.toml

Monorepo root:

.
├── export.json        ← committed (namespace → repo path map)
├── models/pesto/pesto/artifacts.json
├── models/carbonara/carbonara/artifacts.json
└── pipeline/services/endpoint-pesto/artifacts.json

`.dataignore` (optional)

Like .gitignore but for data. Excludes paths from add --all and status untracked detection.

*.tmp
wandb/
__pycache__/
.ipynb_checkpoints/

If absent, only .datapattern controls what gets tracked.

Full Workflow

DAY 1 — SETUP ──────────────────────────────────────────

  git clone monorepo
  lemna-data pull               # hydrate current repo
  # or:
  lemna-data pull --extern pesto # only pesto data
  # or: do nothing, resolve() lazy-hydrates on demand


DAY 2 — PRODUCE NEW DATA ───────────────────────────────

  python train.py                # writes data/pesto/1.5.0/model.pt
  lemna-data status              # shows "untracked" for new file
  lemna-data add --all           # add all .datapattern matches
  lemna-data push                # upload to R2, update manifest keys
  git add artifacts.json
  git commit -m "add pesto v1.5.0"


DAY 3 — MODIFY EXISTING DATA ───────────────────────────

  python retrain.py              # overwrites data/pesto/1.5.0/model.pt
  lemna-data status              # shows "modified" (hash mismatch)
  lemna-data add data/pesto/1.5.0/model.pt  # re-hash, update entry
  lemna-data push                # upload new version (new key)
  git commit -am "update pesto v1.5.0 weights"


DAY 4 — CONSUME FROM ANOTHER REPO ──────────────────────

  # in carbonara repo:
  path = resolve("data/pesto/1.5.0/model.pt", extern="pesto")
  # first call: auto-downloads from R2
  # subsequent calls: returns cached local path


DAY 5 — DISASTER RECOVERY ─────────────────────────────

  rm -rf data/                   # uh oh
  lemna-data pull                # re-download everything from R2
  # or just: python run_inference.py  # resolve() auto-hydrates


DAY 6 — REVIEW ────────────────────────────────────────

  lemna-data verify              # integrity check
  lemna-data diff main..dev     # what data changed on this branch?
  lemna-data log                 # when was this data added?
  lemna-data tree                # visual overview

What’s Dropped From v1

| v1 | v2 | Reason |
| --- | --- | --- |
| `create()` | `add` | Writing data is the user’s job. Manifest tracks it after. |
| `register()` | `add` | Same. If it exists on disk, add it. |
| `pattern` field | `.datapattern` file | Patterns don’t belong in the manifest. They’re config. |
| `created_by` | Dropped | Git log + `lemna-data log` gives better provenance. |
| `hash` field | Dropped | `storage_key` IS the hash (content-addressed). |
| `find_root()` | Kept | Still needed. Same behavior. |
| `export.json` | Kept | Namespace registry for extern. |
| `check()` | Kept | AST scan. Simplified (no pattern matching). |

Implementation Priority

1. Core: add, status, push, pull, manifest v2 model
2. API: resolve() with extern + lazy auto-hydrate
3. Verification: verify, check
4. UX: tree, diff, log
5. Maintenance: gc, .dataignore
6. Advanced: sparse pull, rm --cached, --dry-run flags
