- `model_metadata(model)` — returns all GGUF key-value metadata as a named character vector. Useful for inspecting model architecture, quantization type, and the embedded chat template. Example: `model_metadata(model)["tokenizer.chat_template"]`.
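
  For example (`general.architecture` is a standard GGUF key; the exact key set varies by model):

  ```r
  meta <- model_metadata(model)
  meta["general.architecture"]     # model architecture
  meta["tokenizer.chat_template"]  # embedded Jinja2 chat template
  ```
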
- Fixed `apply_chat_template()` failing for Gemma 4 models — Gemma 4 uses a `<start_of_turn>` / `<end_of_turn>` chat format that is not recognized by `llama_chat_apply_template()` (it returns -1). The fallback now calls `common_chat_templates_apply()` from `common/chat.cpp`, which executes the Jinja2 template embedded in the model's GGUF directly. This works for any model with a valid Jinja2 template, regardless of the C API whitelist. `enable_thinking` defaults to `true`, so Gemma 4 generates thinking content naturally without a pre-closed thought block. Tool calls and multimodal content are not handled.
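
  A usage sketch, assuming a message-list interface for `apply_chat_template()` (the argument structure shown here is hypothetical; see the package documentation for the actual signature):

  ```r
  # Hypothetical message structure; apply_chat_template() renders the model's template
  messages <- list(list(role = "user", content = "Summarize GGUF in one sentence."))
  prompt <- apply_chat_template(model, messages)
  result <- generate(ctx, prompt, max_tokens = 64)
  ```
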
- Fixed stop token leaking in `generate()` and `generate_parallel()` for ChatML models (OLMo, Llama 3) — two separate issues fixed:
  - Piece-wise stop token: `<|im_end|>` can arrive as 6 separate pieces, with the last piece being `>\n` (merging `>` and the newline). The previous exact-suffix check failed because the response ended with `<|im_end|>\n` instead of exactly `<|im_end|>`. Changed to a windowed `find()` that searches within the last `stop.size() + 4` bytes and truncates at the match position. Applied to both `generate()` and `generate_parallel()` (see the sketch after this item).
  - `<|start_header_id|>` loop: Llama 3.2 3B sometimes omits `<|eot_id|>` and jumps directly to `<|start_header_id|>` to begin a new turn, causing infinite repetition. Added `<|start_header_id|>` to `text_stop_strings` in both functions.
  - `generate()`'s stop list also expanded from `{"<start_of_turn>", "<end_of_turn>"}` to `{"<start_of_turn>", "<end_of_turn>", "<|eot_id|>", "<|im_end|>", "<|start_header_id|>"}` to match `generate_parallel()`.
- Fixed `verbosity` not forwarded in `quick_llama()` — the `verbosity` parameter was accepted but silently dropped when passed through to `.generate_single()` and `.generate_multiple()`, so the backend logging level had no effect during `quick_llama()` calls. It is now correctly forwarded to `generate()` and `generate_parallel()`.
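
  For example:

  ```r
  # verbosity is now honored end-to-end on quick_llama() calls
  quick_llama("Say hi", verbosity = 0L)  # backend logging stays silent
  ```
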
- Fixed backend errors crashing R instead of being catchable — all `Rcpp::stop()` calls in `src/interface.cpp` were replaced with `Rf_error()`. `stop()` throws a C++ exception, which crosses the C boundary (the `.Call()` registration) and triggers `std::terminate()`, killing the R process. `Rf_error()` uses `longjmp`, which R's condition system can intercept, so `tryCatch()` now works correctly for all backend errors, including the OOM guard.
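
  Backend errors can now be handled like ordinary R conditions (the path below is illustrative):

  ```r
  result <- tryCatch(
    model_load("does-not-exist.gguf"),
    error = function(e) {
      message("Caught backend error: ", conditionMessage(e))
      NULL
    }
  )
  ```
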
- Fixed model-loading progress dots leaking to stderr with `verbosity = 0` — `llama_model_load_from_file()` has its own `progress_callback` that prints dots to stderr independently of the log callback system. It is now set to a no-op when `verbosity < 2` in `localllm_model_load_safe()`. Model loading is fully silent at the default generation verbosity.
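
  A quiet load now looks like this (filename illustrative; the separate R-level messages described in the known issue below are unaffected):

  ```r
  model <- model_load("model.gguf", verbosity = 0L)  # progress dots no longer printed
  ```
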
- `generate_parallel(progress)` now defaults to `interactive()` — previously it defaulted to `TRUE`, which printed carriage-return-based progress bars into log files and `R CMD check` output. The new default shows the progress bar only in interactive R sessions and suppresses it in scripts and automated checks.
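
  Scripts that want the bar can opt back in explicitly (the `prompts` vector is illustrative, and the argument order is assumed to mirror `generate()`):

  ```r
  prompts <- c("Hello", "Bonjour")
  results <- generate_parallel(ctx, prompts, progress = TRUE)
  ```
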
- `quick_llama(progress)` now defaults to `interactive()` — same rationale as above; no effect on single-prompt calls.
- Removed the `quick_llama(stream)` parameter — the `stream` argument was present in the function signature but was never passed to any downstream function (it was placeholder code with a comment "available for future use"). Removed to avoid user confusion.
- New `localllm_set_verbosity()` C API — added to the backend binary and wired through the proxy layer (`proxy.h/cpp`, `interface.cpp`, `init.cpp`). Enables per-call verbosity control at the C level (integer 0–3, negative = fully silent). Called automatically by `generate()`, `generate_parallel()`, `model_load()`, and `context_create()` before each C invocation.
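
  In practice, the `verbosity` argument on each R call now takes effect for that call only, for example:

  ```r
  model <- model_load("model.gguf", verbosity = 1L)                 # warnings visible once
  out   <- generate(ctx, "Hello", max_tokens = 10, verbosity = 0L)  # silent inside loops
  ```
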
- C-layer OOM crash guard in `localllm_model_load_safe()` — added a last-resort memory check that fires even when `check_memory = FALSE`. If the model file is larger than total physical RAM, the function now returns a clean error (`"Model file (X.X GB) exceeds total physical RAM (Y.Y GB)..."`) instead of proceeding to `llama_model_load_from_file()` and letting macOS OOM-kill the R process silently. The guard only blocks provably impossible loads (file size > total RAM) and does not interfere with the existing R-layer check. Supported on macOS (`sysctl hw.memsize`), Linux (`/proc/meminfo` MemTotal), and Windows (`GlobalMemoryStatusEx`).
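
  A rough R analogue of the guard's decision rule (Linux branch shown; the C code reads the same sources listed above):

  ```r
  # Linux: total physical RAM in bytes, from /proc/meminfo
  total_ram_bytes <- function() {
    line <- grep("^MemTotal:", readLines("/proc/meminfo"), value = TRUE)
    as.numeric(sub("^MemTotal:\\s+(\\d+) kB$", "\\1", line)) * 1024
  }

  # The guard only blocks provably impossible loads: file size > total RAM
  would_block <- function(gguf_path) file.size(gguf_path) > total_ram_bytes()
  ```
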
- Known issue: `model_load()` messages are not suppressed by `verbosity = 0` — two R-level `message()` calls in `api.R` ("Using cached model: …" and the GPU/unified-memory info line) print unconditionally regardless of `verbosity`. The `verbosity` parameter controls only the C backend log level; these R-layer informational messages are a separate code path not yet gated on verbosity. Confirmed against Gemma 4 26B-A4B (IQ2_XXS) on 2026-04-12.
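
  Because these are plain `message()` calls, base R's message handling works as an interim workaround:

  ```r
  # Silence the ungated R-layer messages until they are tied to verbosity
  model <- suppressMessages(model_load("model.gguf", verbosity = 0L))
  ```
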
- `generate()` and `generate_parallel()` roxygen entries now explain why their `verbosity` defaults to `0L` (these functions are called in loops, where per-call logs would be noisy) and cross-reference `model_load()`/`context_create()` (default `1L`; they run once per session, so warnings should be visible).
- Fixed a `generate_parallel()` performance regression introduced by llama.cpp b7825's new memory API — the `llama_memory_seq_cp()` call was dropped during the b7825 migration, causing every parallel slot to re-decode the full prompt instead of sharing the prefix. The call has been restored (with `p0 = -1, p1 = -1`), which is compatible with the new API.
- `generate()` and `generate_parallel()` now work on Intel Macs (x86_64); GPU acceleration is not available on Intel Macs, so CPU inference is used.
- Fixed a `hardware_profile()` crash on
Linux and Windows when GPU diagnostic tools (`nvidia-smi`, `rocm-smi`, `clinfo`) are not installed.
- New `vendor/cpp-httplib` dependency (required by the updated `common/` library).
- New `cmake/license.cmake`.
- The model cache now falls back to `tempdir()` during automated checks, so `R CMD check` no longer creates `~/.cache/R/localLLM` in the home directory (a CRAN policy violation).
- Updated the `hardware_profile()` example to use `\donttest{}` instead of an `if (interactive())` guard, per CRAN best practices.

Breaking changes in backend (transparent to R users):
- Migrated from the `llama_kv_self_*` API to the `llama_memory_*` API
- Supports heterogeneous model architectures:
  - Standard Transformers (LLaMA, Qwen, Mistral, etc.)
  - Mamba/RWKV (State Space Models)
  - Hybrid models (Jamba, LFM2)
  - Sliding Window Attention (Qwen2-MLA)
Key improvements:
- Better memory management and automatic defragmentation
- Enhanced support for parallel inference with shared prefixes
- Improved reproducibility of generation results
- More efficient batch processing

Batch API modernization:
- Replaced `llama_batch_get_one()` with `llama_batch_init()` + `common_batch_add()` + `llama_batch_free()`
- Each `generate()` call starts from a clean state
- New `n_threads_batch` parameter for batch processing

No changes to R-level API — all existing R code continues to work without modification:

```r
library(localLLM)
backend_init()
model <- model_load("model.gguf")
ctx <- context_create(model, n_ctx = 512)
result <- generate(ctx, "Hello", max_tokens = 10)
# All existing code works exactly the same
```

`backend/llama.cpp/build_localllm.sh`

Updated files:
- `custom_files/localllm_capi.cpp` (10 locations modified):
  - Memory API migration (8 locations)
  - Batch API modernization (2 locations)
  - Error handling improvements
  - Thread configuration updates

Unchanged:
- `custom_files/localllm_capi.h` (C API interface)
- All R layer code (`R/*.R`)
- Proxy layer (`src/proxy.cpp`)
- Test suite (`tests/testthat/*.R`)
- Documentation

Fresh install:

```r
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM() # Will download the new b7825 backend
```

Upgrading from a previous version:

```r
remove.packages("localLLM")
install.packages("localLLM_1.2.0.tar.gz", repos = NULL, type = "source")
library(localLLM)
install_localLLM(force = TRUE) # Force reinstall backend
```

New technical documentation:
- `UPGRADE_COMPLETE.md` - Complete upgrade report
- `CRITICAL_CHANGES_REQUIRED.md` - Detailed change checklist
- `MIGRATION_ANALYSIS_b5421_to_b7785.md` - Full migration analysis
- Architecture deep-dive in planning documents

Potential optimizations for future releases:
- Flash Attention support for improved performance
- Unified Buffer optimization for multi-sequence inference
- SWA (Sliding Window Attention) for ultra-long contexts (128K+)

For more information about llama.cpp, see:
- llama.cpp releases
- llama.cpp documentation

Previous release notes (if any) would go here…