Case Study
OpenJet
Offline, terminal-based agent for self-hosted LLMs. Supports unified-memory systems (NVIDIA Jetson) and CUDA, ROCm, and Vulkan backends.
Impact
- Decode speed of 39 tok/s for Qwen3.5-27B on an RTX 3090.
- Automated local model setup via llama.cpp, with hardware profiling for optimised configurations.
- Memory management to avoid OOM errors: automatic model offloading, context-size calculation, and automatic context compression.
- Designed with edge devices and IoT in mind: users can register I/O hardware, OpenJet logs its output, and the local LLM reads the log and runs the appropriate tools.
- Supports agent tool execution, local context management, and approval gates for state-changing actions.
- Python SDK exposes hardware profiling, background agent orchestration, and model tok/s benchmark parameter sweeps.
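The offload step of the hardware profiling above can be sketched as a simple VRAM budget: fit as many transformer layers on the GPU as the card allows while reserving headroom for the KV cache. The function name, layer sizes, and headroom factor here are illustrative assumptions, not OpenJet's actual SDK.

```python
# Hypothetical sketch: estimate how many model layers fit in VRAM,
# leaving headroom for the KV cache. Numbers are illustrative.

def gpu_layers_that_fit(vram_bytes: int, n_layers: int, layer_bytes: int,
                        kv_cache_bytes: int, headroom: float = 0.9) -> int:
    """Return how many layers to offload to the GPU (llama.cpp's -ngl)."""
    budget = int(vram_bytes * headroom) - kv_cache_bytes
    if budget <= 0:
        return 0  # model must run fully on CPU
    return min(n_layers, budget // layer_bytes)

# A ~7B model quantized to ~4.5 bits: ~32 layers of ~130 MiB each.
layers = gpu_layers_that_fit(
    vram_bytes=8 * 1024**3,       # 8 GiB card
    n_layers=32,
    layer_bytes=130 * 1024**2,
    kv_cache_bytes=1 * 1024**3,   # 1 GiB reserved for the KV cache
)
```

The result maps directly onto llama.cpp's `--n-gpu-layers` flag; anything that does not fit stays in system RAM.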
Key decisions
- Most agentic coding systems still depend on cloud connectivity for orchestration or inference, which breaks in offline and restricted environments.
- Running an OS-level LLM agent directly on self-owned edge hardware enables local device control without sending code, logs, or shell state to external services.
- Security posture improves when execution, model weights, and tool outputs stay local, with explicit approval gates for mutating actions.
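The explicit approval gate for mutating actions can be sketched as a guard in front of tool dispatch. The `MUTATING` set and prompt flow below are assumptions for illustration, not OpenJet's code.

```python
# Minimal sketch of an approval gate for state-changing tool calls.
# Which tools count as mutating is a hypothetical choice here.

MUTATING = {"write_file", "delete_file", "run_shell"}

def execute_tool(name: str, run, approve=input) -> str:
    """Run a tool, requiring explicit user approval for mutating actions."""
    if name in MUTATING:
        answer = approve(f"Allow '{name}'? [y/N] ").strip().lower()
        if answer != "y":
            return "denied"
    return run()

# Read-only tools run without a prompt:
# execute_tool("read_file", lambda: open("notes.txt").read())
```

Keeping the gate at the dispatch layer means every tool, including ones added later, passes through the same check.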
Technical approach
- A setup wizard profiles target hardware, tunes GPU layer offload, and recommends model sizing.
- Inference runs through local llama-server using quantized GGUF models from local files or Ollama pulls.
- The Textual TUI orchestrates chat, slash commands, file mentions, and tool execution requests.
- Session events and resource telemetry are written as structured logs for observability and replay.