notch8/openlight

Openlight

An open, Apache-2.0, Notch8-stewarded toolkit for AI-assisted descriptive metadata generation of digitized cultural-heritage works, with a human-in-the-loop improvement loop that drives quality above an institution's review threshold before scaling.

Institutions hold large backlogs of digitized works (images, documents, audio, video, maps) that need descriptive metadata before they can be discovered and used. Originating that description by hand is the bottleneck. This toolkit drafts it with an LLM, keeps a cataloguer in control of every record, carries full provenance on every value, and measures itself against the institution's own quality bar. The institution clones it, tunes it, owns it, and runs it on its own infrastructure.

It is the open implementation of the pipeline specified in Notch8's Automated Metadata Service initiative. It competes with subscription products by being owned rather than rented: transparent, repository-agnostic, and capability-transferring.

How it is shaped

A first-pass record is produced by a readable seven-stage pipeline in which the LLM is one stage, not the whole system, which is what makes the output auditable:

[1] Intake & inventory   [2] Pre-processing   [3] Classification & routing
[4] Generation (LLM)     [5] Authority resolution (deterministic)
[6] Human review (confidence-gated)            [7] Export to the repository
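
As a rough illustration of why a staged design is auditable, the seven stages can be modeled as named functions applied in order, with the pipeline recording which stages a record passed through. This is only a sketch: the stage identifiers, the `passthrough` placeholder, and the `stage_trail` key are hypothetical names chosen for the example, not the toolkit's actual interface.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def passthrough(record: Record) -> Record:
    """Placeholder body; a real stage would transform the record."""
    return record


# Stage names mirror the seven stages listed above; the bodies are stand-ins.
STAGES: List[Tuple[str, Stage]] = [
    ("intake_inventory", passthrough),
    ("pre_processing", passthrough),
    ("classification_routing", passthrough),
    ("generation_llm", passthrough),        # the LLM is one stage among seven
    ("authority_resolution", passthrough),  # deterministic, not LLM-driven
    ("human_review", passthrough),
    ("export", passthrough),
]


def run_pipeline(record: Record, stages: List[Tuple[str, Stage]] = STAGES) -> Record:
    """Apply each stage in order and record the stages that ran."""
    current = dict(record)  # copy so the caller's record is not mutated
    trail: List[str] = []
    for name, stage in stages:
        current = stage(current)
        trail.append(name)
    current["stage_trail"] = trail  # hypothetical audit field
    return current
```

Because each stage is a separate, named step, a reviewer can see where a value came from instead of treating the whole system as a single model call.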

Every value carries confidence, the basis for it, and an authority candidate where it is controlled. The LLM proposes controlled values; a separate deterministic stage confirms them against id.loc.gov, Getty, FAST, and VIAF; the cataloguer makes the final call. Every record carries provenance and a PCC-MARC-588-style AI-disclosure note.
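
One way to picture a value that carries its confidence, its basis, and an authority candidate is a small record type. The sketch below is an assumption for illustration only: the class names, field names, the 0.0 to 1.0 confidence scale, the 0.8 default threshold, and the placeholder URI are all hypothetical and do not reflect the toolkit's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AuthorityCandidate:
    source: str              # e.g. "id.loc.gov", "Getty", "FAST", "VIAF"
    uri: str
    label: str
    confirmed: bool = False  # set by the deterministic stage, never by the LLM


@dataclass
class FieldValue:
    field: str
    value: str
    confidence: float        # assumed scale: 0.0 (no confidence) to 1.0
    basis: str               # why the value was proposed
    authority: Optional[AuthorityCandidate] = None  # only for controlled fields


def needs_close_review(v: FieldValue, threshold: float = 0.8) -> bool:
    """Flag a value for closer attention during human review.

    Assumed rule: flag low-confidence values, and controlled values whose
    authority candidate has not been confirmed deterministically.
    """
    if v.confidence < threshold:
        return True
    return v.authority is not None and not v.authority.confirmed
```

In this sketch the flag only prioritizes the cataloguer's attention; the cataloguer still makes the final call on every record.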

Two parts

  • This repository is the shared, contributable commons: generalizable prompts, skills, agents, schemas, and docs that Notch8 maintains and the community improves.
  • An adjacent workspace (never committed here) holds one institution's works, application profile, local context, local overrides, runs, and output. Local overrides take precedence over the shared assets, the same layering Hyku uses with the Knapsack. See docs/adjacent-workspace.md.
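
The override layering described above amounts to a first-match lookup: check the adjacent workspace first, then fall back to the shared repository. A minimal sketch follows, assuming a hypothetical `overrides/` directory inside the workspace. The real layout is documented in docs/adjacent-workspace.md and may differ.

```python
from pathlib import Path
from typing import Optional


def resolve_asset(relative_path: str, workspace: Path, commons: Path) -> Optional[Path]:
    """Return the local override if one exists, otherwise the shared asset.

    `workspace / "overrides"` is an assumed location used only for this sketch.
    """
    for root in (workspace / "overrides", commons):
        candidate = root / relative_path
        if candidate.is_file():
            return candidate
    return None  # asset exists in neither layer
```

For example, `resolve_asset("prompts/describe.md", workspace, commons)` (a hypothetical path) would return the institution's tuned prompt when present and the shared one otherwise.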

The improvement loop

On a small sample, raise the per-record pass rate above the institution's review-policy threshold (for example Oregon State's roughly 90% needs-no-modification bar) before scaling. A frontier model orchestrates and refines prompts, cost-effective models do the volume, humans review, and each refinement is classified as a generalized improvement to contribute upstream or a domain customization that stays local. Refine on one sample, confirm on a fresh held-out sample, then scale. See docs/the-improvement-loop.md.
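
The gate before scaling can be sketched as simple arithmetic: the per-record pass rate is the share of reviewed records that needed no modification, and scaling waits until the held-out sample clears the threshold. The function names and the choice to treat "clears" as greater than or equal to are assumptions for this example; the 0.90 default echoes the Oregon State figure cited above.

```python
from typing import Iterable


def pass_rate(needed_no_modification: Iterable[bool]) -> float:
    """Fraction of reviewed records the cataloguer accepted unchanged."""
    results = list(needed_no_modification)
    if not results:
        raise ValueError("cannot compute a pass rate on an empty sample")
    return sum(results) / len(results)


def ready_to_scale(held_out_results: Iterable[bool], threshold: float = 0.90) -> bool:
    """True when the fresh held-out sample clears the review threshold.

    The comparison is on the held-out sample, not the sample the prompts
    were refined on, so the result is not inflated by tuning to that sample.
    """
    return pass_rate(held_out_results) >= threshold
```

With 20 held-out records, 18 accepted unchanged gives a pass rate of 0.90 and clears a 90% bar under this rule, while 17 gives 0.85 and calls for another refinement round.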

Quickstart

  1. Clone this repo and create an adjacent workspace from the template (docs/adjacent-workspace.md).
  2. Build your application profile (skills/build-application-profile): target schema, controlled vocabularies, house style, model choice.
  3. Run a first-pass batch on a bounded sample (skills/describe-collection); stop and sanity-check the first 25 records.
  4. Triage by confidence (skills/review-triage), resolve authorities (skills/resolve-authorities), and review.
  5. Compute the pass rate (skills/compute-pass-rate) and refine (skills/refine-prompts); repeat until the held-out sample clears your threshold.
  6. Export to your repository (skills/export-for-repository).

The worked example in examples/met-open-access/ walks the whole loop on ten public-domain Met Open Access objects.

Documentation

License and contributing

Apache 2.0, to match the Samvera ecosystem this toolkit feeds. Contributions are welcome and voluntary; see CONTRIBUTING.md for the generalizable-versus-domain test and the PR process, and GOVERNANCE.md for the maintainer model. Contributing improves the commons and never creates a dependency: an institution owns its tuned instance regardless.
