MagicAF supports deployment on resource-constrained devices where running Qdrant and a full LLM server is impractical.

Architecture

(Architecture diagram: an edge device running on-device embedding (ONNX/CoreML), the in-memory vector store, and a local LLM when available, with MagicAF orchestrating the RAG workflow.)

Component Availability

| Component | Server Deployment | Edge Deployment |
| --- | --- | --- |
| Vector Store | Qdrant (magicaf-qdrant) | InMemoryVectorStore (in magicaf-core) |
| Embeddings | llama.cpp / TEI server | On-device (ONNX, CoreML, TFLite) |
| LLM | vLLM / llama.cpp | Omitted, or remote when online |
| Persistence | Qdrant handles it | InMemoryVectorStore::save() / load() |
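
Since the rest of the stack only needs a vector store, the backend can be chosen at compile time. A minimal sketch, assuming a qdrant cargo feature gates the server backend; the QdrantVectorStore name is illustrative, not a confirmed magicaf-qdrant type:

// Hypothetical `qdrant` cargo feature gating the server backend; the
// QdrantVectorStore name is illustrative, not a confirmed magicaf-qdrant type.
#[cfg(feature = "qdrant")]
use magicaf_qdrant::QdrantVectorStore as DeployedStore;

// Edge builds fall back to the zero-dependency in-memory store.
#[cfg(not(feature = "qdrant"))]
use magicaf_core::prelude::InMemoryVectorStore as DeployedStore;

Downstream code can then refer only to DeployedStore, so the rest of the pipeline stays the same between server and edge builds.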

InMemoryVectorStore

The zero-dependency vector store that ships with magicaf-core:

use magicaf_core::prelude::*;
use std::path::Path;

// Create a new store
let store = InMemoryVectorStore::new();
store.ensure_collection("docs", 1024).await?;
store.index("docs", embeddings, payloads).await?;

let results = store.search("docs", query_vec, 5, None).await?;

// Persist to disk
store.save(Path::new("store.json"))?;

// Load on next startup
let store = InMemoryVectorStore::load(Path::new("store.json"))?;

Performance Characteristics

| Points | Search Latency | Memory (1024-dim) |
| --- | --- | --- |
| 1,000 | < 1 ms | ~10 MB |
| 10,000 | ~5 ms | ~100 MB |
| 100,000 | ~50 ms | ~1 GB |
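
The memory column is dominated by the vectors themselves: a 1024-dimensional f32 embedding is 4 KB, so 100,000 points is roughly 400 MB of raw vector data. The estimator below is a back-of-envelope sketch; the 6 KB per-point allowance for payloads and bookkeeping is an assumed average chosen to land near the table's figures, not a measured value.

/// Rough memory estimate for InMemoryVectorStore contents.
/// The per-point overhead is an assumption, not a measurement.
fn estimated_bytes(points: usize, dims: usize) -> usize {
    let vectors = points * dims * std::mem::size_of::<f32>(); // 4 KB per 1024-dim point
    let payload_and_overhead = points * 6 * 1024;             // assumed average
    vectors + payload_and_overhead
}

// estimated_bytes(100_000, 1024) comes out around 1 GB, in line with the table.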

Edge Embedding Strategies

Option A — On-Device ONNX Runtime

Implement the EmbeddingService trait using ONNX Runtime for fully offline embeddings:

use magicaf_core::prelude::*;
use async_trait::async_trait;

pub struct OnnxEmbeddingService {
    session: ort::Session,
}

#[async_trait]
impl EmbeddingService for OnnxEmbeddingService {
    async fn embed(&self, inputs: &[String]) -> Result<Vec<Vec<f32>>> {
        // Tokenize + run ONNX inference
        // Runs entirely on-device, no network required
        todo!("Implement with ort crate")
    }

    async fn embed_single(&self, input: &str) -> Result<Vec<f32>> {
        let results = self.embed(&[input.to_string()]).await?;
        results.into_iter().next()
            .ok_or_else(|| MagicError::embedding("empty"))
    }

    async fn health_check(&self) -> Result<()> {
        Ok(()) // Always healthy — local
    }
}
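
Once an embedder like this is in place, the offline query path is just embed-then-search, reusing the store APIs shown earlier. The sketch below assumes embedder is a working OnnxEmbeddingService and store is an InMemoryVectorStore with a populated "docs" collection; the query string and Debug-printing of hits are illustrative.

// Offline retrieval: embed the query on-device, then search the local store.
let query_vec = embedder.embed_single("how do I reset the device?").await?;
let hits = store.search("docs", query_vec, 5, None).await?;
for hit in hits {
    // Assumes search results implement Debug; render payload fields as needed.
    println!("{hit:?}");
}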

Option B — Apple CoreML (iOS/macOS)

Wrap a CoreML .mlmodel behind the EmbeddingService trait via FFI. The MagicAF FFI surface is designed for this pattern.
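
A minimal sketch of the Rust side of that bridge. The coreml_embed symbol, its signature, and the fixed output dimension are assumptions for illustration; the real shape depends on how the .mlmodel is exported and exposed to C.

use magicaf_core::prelude::*;
use async_trait::async_trait;
use std::ffi::CString;
use std::os::raw::c_char;

// Hypothetical C function exported from the Swift/CoreML side: writes
// `dims` floats into `out` and returns 0 on success.
extern "C" {
    fn coreml_embed(text: *const c_char, out: *mut f32, dims: usize) -> i32;
}

pub struct CoreMlEmbeddingService {
    dims: usize,
}

#[async_trait]
impl EmbeddingService for CoreMlEmbeddingService {
    async fn embed(&self, inputs: &[String]) -> Result<Vec<Vec<f32>>> {
        let mut out = Vec::with_capacity(inputs.len());
        for input in inputs {
            let text = CString::new(input.as_str())
                .map_err(|_| MagicError::embedding("input contains NUL byte"))?;
            let mut buf = vec![0f32; self.dims];
            // SAFETY: relies on the CoreML side writing exactly `dims` floats.
            let rc = unsafe { coreml_embed(text.as_ptr(), buf.as_mut_ptr(), self.dims) };
            if rc != 0 {
                return Err(MagicError::embedding("CoreML inference failed"));
            }
            out.push(buf);
        }
        Ok(out)
    }

    async fn embed_single(&self, input: &str) -> Result<Vec<f32>> {
        let results = self.embed(&[input.to_string()]).await?;
        results.into_iter().next()
            .ok_or_else(|| MagicError::embedding("empty"))
    }

    async fn health_check(&self) -> Result<()> {
        Ok(()) // Local model, no network dependency
    }
}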

Option C — Local llama.cpp

For devices with 4GB+ RAM, run llama.cpp as a subprocess for embeddings only.
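
A sketch of the subprocess approach: start llama.cpp's HTTP server in embeddings mode and point MagicAF's server-backed embedding client at it. The binary name and flags follow recent llama.cpp builds; verify the exact flag spelling against your version, since it has changed between releases.

use std::process::{Child, Command};

// Spawn llama.cpp's server for embeddings only on localhost.
// Flag names may differ between llama.cpp releases; check `llama-server --help`.
fn spawn_embedding_server(model_path: &str) -> std::io::Result<Child> {
    Command::new("llama-server")
        .args(["-m", model_path, "--embedding", "--port", "8080"])
        .spawn()
}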

Offline Workflow

  1. Pre-index on a server — run the full MagicAF stack and index your documents
  2. Export the store — InMemoryVectorStore::save("portable_store.json")
  3. Bundle with app — ship the JSON file with your mobile app
  4. Load on device — InMemoryVectorStore::load("portable_store.json") (device-side sketch below)
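
A sketch of the device-side half of this workflow. The asset path is illustrative only; where the bundled file actually lives is platform-specific (app bundle on iOS, assets or internal storage on Android).

use std::path::Path;

// Load the pre-built index shipped with the app, or start empty if this
// build has no bundled store. The path is a placeholder, not a convention.
let bundled = Path::new("assets/portable_store.json");
let store = if bundled.exists() {
    InMemoryVectorStore::load(bundled)?
} else {
    InMemoryVectorStore::new()
};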

Edge RAG Without LLM

On mobile, you may not need (or be able to run) an LLM:

  • Retrieval-only — use embedding + vector store to find relevant documents, display directly
  • Remote LLM fallback — call a remote server when the network is available (see the sketch after this list)
  • Cached responses — pre-compute LLM responses for common queries
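
One way to structure the first two options: return the retrieved documents as the answer when offline, and only call out to a remote generation endpoint when online. The RagAnswer type, the online flag, and the todo!() remote call are hypothetical glue around the APIs shown earlier, not part of magicaf-core.

use magicaf_core::prelude::*;

pub enum RagAnswer {
    // Online: text generated by a remote LLM plus the retrieved sources.
    Generated { text: String, sources: Vec<String> },
    // Offline: show the retrieved documents directly, with no generation step.
    RetrievalOnly { sources: Vec<String> },
}

pub async fn answer(
    embedder: &impl EmbeddingService,
    store: &InMemoryVectorStore,
    query: &str,
    online: bool,
) -> Result<RagAnswer> {
    let query_vec = embedder.embed_single(query).await?;
    let hits = store.search("docs", query_vec, 5, None).await?;
    // Assumes hits implement Debug; how they map to display strings depends
    // on your payload schema.
    let sources: Vec<String> = hits.iter().map(|h| format!("{h:?}")).collect();

    if online {
        // Hypothetical remote call; replace with your own HTTP client code.
        let text = todo!("POST the query and sources to a hosted LLM endpoint");
        Ok(RagAnswer::Generated { text, sources })
    } else {
        Ok(RagAnswer::RetrievalOnly { sources })
    }
}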

Platform Notes

| Platform | Embeddings | Vector Store | Notes |
| --- | --- | --- | --- |
| iOS (Swift) | CoreML .mlmodel | InMemoryVectorStore via C FFI | Bundle store in app bundle |
| Android (Kotlin) | ONNX Runtime Mobile / TFLite | InMemoryVectorStore via JNI | Store in internal storage |
| Linux Edge (RPi, Jetson) | llama.cpp on ARM64 | Qdrant or InMemoryVectorStore | Full stack possible with 8GB+ RAM |
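
For the two mobile rows, the usual pattern is to expose a small C ABI over the store from Rust and call it from Swift (through a generated header) or Kotlin (through JNI). The symbol names below are illustrative only and are not the actual MagicAF FFI surface.

use std::ffi::{CStr, c_char};
use std::path::Path;

use magicaf_core::prelude::*;

// Load a bundled store and hand an opaque pointer back to the host app.
// Returns null on failure. Symbol names here are hypothetical.
#[no_mangle]
pub extern "C" fn maf_store_load(path: *const c_char) -> *mut InMemoryVectorStore {
    let path = unsafe { CStr::from_ptr(path) }.to_string_lossy().into_owned();
    match InMemoryVectorStore::load(Path::new(&path)) {
        Ok(store) => Box::into_raw(Box::new(store)),
        Err(_) => std::ptr::null_mut(),
    }
}

// Release a store previously returned by maf_store_load.
#[no_mangle]
pub extern "C" fn maf_store_free(store: *mut InMemoryVectorStore) {
    if !store.is_null() {
        unsafe { drop(Box::from_raw(store)) };
    }
}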