- **Flash attention support**: Enabled by default in `edge_load_model()` via `flash_attn = TRUE`. Reduces memory usage and improves attention computation speed on CPU.
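A minimal sketch of opting out of the new default (the model path is illustrative; only `flash_attn` is documented above):

```r
library(edgemodelr)

# flash_attn defaults to TRUE; pass FALSE to fall back to standard
# attention if a particular model misbehaves (path is illustrative)
ctx <- edge_load_model("models/llama-3.2-1b-q4_k_m.gguf", flash_attn = FALSE)
edge_free_model(ctx)
```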
- **Full hardware thread utilization**: Removed the 4-thread cap for small contexts. `edge_load_model()` now uses all available CPU threads by default, with `n_threads_batch` set to the maximum for prompt processing.
- **User-configurable threading**: New `n_threads` parameter in `edge_load_model()` gives explicit control over the CPU thread count. Pass `NULL` (the default) for auto-detection, or an integer to cap the number of threads used.
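For example (the model path is illustrative; only the `n_threads` parameter is documented above):

```r
library(edgemodelr)

# Cap inference at 4 threads, e.g. to leave cores free for other work
ctx <- edge_load_model("models/llama-3.2-1b-q4_k_m.gguf", n_threads = 4)
edge_free_model(ctx)

# Or let the package auto-detect all available threads (the default)
ctx <- edge_load_model("models/llama-3.2-1b-q4_k_m.gguf", n_threads = NULL)
edge_free_model(ctx)
```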
- **Apple Accelerate framework (macOS)**: Automatically links the Accelerate framework on macOS builds, enabling hardware-accelerated vDSP vector operations for faster matrix math.
- **Compiler auto-vectorization**: Added `-ftree-vectorize` to the GGML compilation flags on all platforms, allowing GCC/Clang to generate SIMD instructions for eligible loops beyond the hand-tuned GGML kernels.
- **SIMD-optimized build system**: Replaced the generic scalar fallback with architecture-aware SIMD detection in both `Makevars` (Unix) and `Makevars.win` (Windows).
- **User-configurable SIMD levels**: Set the `EDGEMODELR_SIMD` environment variable before installation to select an optimization level:
  - `GENERIC`: Scalar fallback (maximum compatibility)
  - `SSE42`: SSE4.2 baseline (default on x86_64)
  - `AVX`: AVX + F16C (Intel Sandy Bridge 2011+)
  - `AVX2`: AVX2 + FMA + F16C (Intel Haswell 2013+, recommended)
  - `AVX512`: AVX-512 (Intel Skylake-X 2017+)
  - `NATIVE`: Uses `-march=native` for maximum performance on the build machine
- **`edge_simd_info()`**: New function to query compile-time SIMD status, including architecture, compiler features, and GGML optimization flags.
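A sketch of both features together (the return structure of `edge_simd_info()` is not documented here, so it is simply printed):

```r
# Select the SIMD level before installing the package from source;
# the variable is read by Makevars at compile time
Sys.setenv(EDGEMODELR_SIMD = "AVX2")
install.packages("edgemodelr", type = "source")

# After installation, inspect what the package was compiled with
library(edgemodelr)
print(edge_simd_info())
```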
- **x86 architecture-specific quantization**: Enabled optimized x86 quantization kernels (`arch/x86/quants.c`, `arch/x86/repack.cpp`) with SIMD-accelerated dot products and matrix operations.
- **Fixed donttest examples**: Changed resource-intensive examples from `\donttest{}` to `\dontrun{}` to prevent downloading multi-GB models during CRAN checks.
- **Fixed M1 Mac compiler warnings**: Added explicit `static_cast<>` for:
  - `double` to `float` conversions for temperature/top_p parameters
  - `size_type` to `int32_t` conversions for buffer size parameters
- **Fixed connection handling**: Replaced `on.exit()` with `tryCatch`/`finally` for proper connection cleanup in loops (thanks @eddelbuettel).
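The cleanup pattern referenced here, sketched with a plain file connection rather than the package's internals: `on.exit()` only fires when the enclosing *function* exits, so a connection opened on each loop iteration can leak, whereas `tryCatch(finally = )` guarantees per-iteration cleanup.

```r
# Illustrative helper (not part of the package API): read the first
# line of each file, closing every connection even if reading errors
process_files <- function(paths) {
  for (p in paths) {
    con <- file(p, open = "r")
    tryCatch({
      head(readLines(con), 1)
    }, finally = {
      close(con)  # runs on every iteration, error or not
    })
  }
  invisible(TRUE)
}
```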
- **`edge_small_model_config()`**: New function providing optimized settings for small models (1B-3B parameters).
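A usage sketch, assuming the function returns a list of settings that can be forwarded to `edge_load_model()`; the model path and the `do.call()` splicing are illustrative, not documented API:

```r
library(edgemodelr)

# Hypothetical usage: fetch tuned defaults for a 1B-parameter model
# and combine them with a local model path (path is illustrative)
cfg <- edge_small_model_config()
ctx <- do.call(edge_load_model,
               c(list("models/llama-3.2-1b-q4_k_m.gguf"), cfg))
edge_free_model(ctx)
```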
- **`edge_find_ollama_models()`**: Discover all locally available Ollama models across platforms (Windows, macOS, Linux)
- **`edge_load_ollama_model()`**: Load Ollama models using convenient SHA-256 hash prefixes instead of full file paths
- **`test_ollama_model_compatibility()`**: Built-in compatibility testing for Ollama models
- `std::filesystem` on macOS builds
- `<mach-o/dyld.h>` inclusion replaced with direct function declarations to avoid enum conflicts
- Non-portable compiler flags (`-march=native`, `-mtune=native`, etc.) removed from `Makevars` for CRAN compatibility
- `edge_clean_cache()` function
- **`edge_load_model()`**: Load GGUF model files for inference
- **`edge_completion()`**: Generate text completions
- **`edge_stream_completion()`**: Stream text generation with real-time callbacks
- **`edge_chat_stream()`**: Interactive chat session with streaming responses
- **`edge_free_model()`**: Memory management and cleanup
- **`is_valid_model()`**: Model context validation
- **`edge_list_models()`**: List pre-configured popular models
- **`edge_download_model()`**: Download models from Hugging Face Hub
- **`edge_quick_setup()`**: One-line model download and setup

This release provides a complete, production-ready solution for local large language model inference in R, enabling private, offline text generation workflows.
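The core API can be combined into a short end-to-end sketch. The model name, prompt, and the assumption that `edge_download_model()` returns a local file path are illustrative; downloading fetches a large file, so this is not something to run unattended.

```r
library(edgemodelr)

# Download a small pre-configured model, then load it for inference
# ("TinyLlama-1.1B" is an illustrative name; see edge_list_models())
path <- edge_download_model("TinyLlama-1.1B")
ctx  <- edge_load_model(path)

if (is_valid_model(ctx)) {
  # Generate a plain (non-streaming) completion
  cat(edge_completion(ctx, "R is a language for"))
}

edge_free_model(ctx)  # always release the model when done
```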