MagicAF applications are stateless by design. Every component can be scaled independently.

Scaling Strategy

Component   | Strategy                                              | Notes
------------|-------------------------------------------------------|---------------------------------------
Embedding   | Horizontal: N replicas behind a load balancer         | CPU or GPU instances
Qdrant      | Distributed mode (built-in clustering)                | Sharding + replication
LLM         | Horizontal: multiple GPU nodes behind a load balancer | vLLM supports tensor parallelism
Application | Horizontal: stateless, scale freely                   | Each instance gets its own RAGWorkflow

Embedding Service Scaling

Embedding is typically the most CPU/GPU-bound operation. Scale by running multiple replicas:

services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:1.2
    deploy:
      replicas: 3
    # ...

Put a load balancer (nginx, HAProxy, or a service mesh) in front of the replicas. MagicAF connects to a single URL, so point EMBEDDING_URL at the load balancer rather than at an individual replica.
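To make the role of the load balancer concrete, here is a minimal sketch of round-robin selection over three embedding replicas. It only illustrates what nginx or HAProxy does on your behalf; the RoundRobin type and the replica addresses are hypothetical and not part of MagicAF.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin picker, standing in for the load balancer.
struct RoundRobin {
    backends: Vec<String>,
    next: AtomicUsize,
}

impl RoundRobin {
    fn new(backends: Vec<String>) -> Self {
        assert!(!backends.is_empty(), "at least one backend is required");
        Self { backends, next: AtomicUsize::new(0) }
    }

    /// Returns the next backend, wrapping around after the last one.
    fn pick(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.backends.len();
        &self.backends[i]
    }
}

fn main() {
    // Assumed replica addresses; real ones depend on your deployment.
    let lb = RoundRobin::new(vec![
        "http://embedding-1:8080".to_string(),
        "http://embedding-2:8080".to_string(),
        "http://embedding-3:8080".to_string(),
    ]);
    // The fourth request wraps around to the first replica.
    for _ in 0..4 {
        println!("{}", lb.pick());
    }
}
```

Because the application only ever sees the balancer's address, replicas can be added or removed without changing the MagicAF configuration.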

Qdrant Scaling

Qdrant supports distributed mode with built-in sharding and replication:

# Start a Qdrant cluster
docker compose up -d qdrant-node-1 qdrant-node-2 qdrant-node-3

See the Qdrant distributed deployment docs for cluster configuration.
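When sizing a cluster, it can help to work out how many shard copies each node ends up holding. The sketch below assumes a collection created with Qdrant's shard_number and replication_factor parameters and an even spread across nodes; the concrete values are illustrative, not recommendations.

```rust
/// Physical shard copies in a collection: each logical shard is stored
/// `replication_factor` times.
fn physical_shards(shard_number: u32, replication_factor: u32) -> u32 {
    shard_number * replication_factor
}

/// Largest number of shard copies a node holds if copies are spread evenly
/// (an assumption; actual placement is decided by the cluster).
fn max_shards_per_node(shard_number: u32, replication_factor: u32, nodes: u32) -> u32 {
    assert!(nodes > 0, "a cluster needs at least one node");
    let total = physical_shards(shard_number, replication_factor);
    (total + nodes - 1) / nodes // ceiling division
}

fn main() {
    // Assumed values for a three-node cluster like the one started above.
    let (shards, replicas, nodes) = (6, 2, 3);
    println!("physical shards: {}", physical_shards(shards, replicas));
    println!("per node: {}", max_shards_per_node(shards, replicas, nodes));
}
```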

LLM Scaling

For high-throughput deployments:

  • vLLM: Supports tensor parallelism across multiple GPUs on a single node
  • Multiple nodes: Run separate vLLM instances and load balance
  • Batching: vLLM automatically batches concurrent requests for efficiency

A Compose service running two vLLM replicas, each reserving one GPU:

services:
  llm:
    image: vllm/vllm-openai:latest
    deploy:
      replicas: 2
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
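Replication and tensor parallelism multiply: each replica needs as many GPUs as its tensor-parallel size (the value passed to vLLM's --tensor-parallel-size flag). The following sketch does that arithmetic; the LlmDeployment type is hypothetical and exists only for illustration.

```rust
/// Hypothetical description of a vLLM deployment.
struct LlmDeployment {
    replicas: u32,
    tensor_parallel_size: u32,
}

impl LlmDeployment {
    /// Total GPUs reserved across all replicas.
    fn total_gpus(&self) -> u32 {
        self.replicas * self.tensor_parallel_size
    }

    /// Single-node tensor parallelism needs all of a replica's GPUs on one node.
    fn fits_on_node(&self, gpus_per_node: u32) -> bool {
        self.tensor_parallel_size <= gpus_per_node
    }
}

fn main() {
    // Mirrors the Compose example above: 2 replicas, 1 GPU each.
    let d = LlmDeployment { replicas: 2, tensor_parallel_size: 1 };
    println!("total GPUs: {}", d.total_gpus());
    println!("fits on a 1-GPU node: {}", d.fits_on_node(1));
}
```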

Application Scaling

MagicAF applications are stateless. Each instance constructs its own RAGWorkflow from the same configuration:

// Each application instance creates its own workflow
let workflow = RAGWorkflow::builder()
    .embedding_service(LocalEmbeddingService::new(config)?)
    .vector_store(QdrantVectorStore::new(vector_config).await?)
    .llm_service(LocalLlmService::new(llm_config)?)
    // ...
    .build()?;
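One way to keep every instance on the same configuration is to read the service endpoints from environment variables set by the orchestrator. The sketch below assumes that approach. EMBEDDING_URL comes from the section above, while QDRANT_URL, LLM_URL, and the ServiceUrls type are hypothetical names, not part of the MagicAF API.

```rust
use std::env;

/// Hypothetical settings shared by all instances; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct ServiceUrls {
    embedding_url: String,
    qdrant_url: String,
    llm_url: String,
}

impl ServiceUrls {
    /// Builds the settings from a lookup function, so the logic can be
    /// exercised without touching the process environment.
    fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        let get = |key: &str| {
            lookup(key).ok_or_else(|| format!("missing environment variable {key}"))
        };
        Ok(Self {
            embedding_url: get("EMBEDDING_URL")?,
            qdrant_url: get("QDRANT_URL")?, // assumed variable name
            llm_url: get("LLM_URL")?,       // assumed variable name
        })
    }

    /// Reads the same keys from the real process environment.
    fn from_env() -> Result<Self, String> {
        Self::from_lookup(|key| env::var(key).ok())
    }
}

fn main() {
    match ServiceUrls::from_env() {
        Ok(urls) => println!("{urls:?}"),
        Err(e) => eprintln!("{e}"),
    }
}
```

Failing fast on a missing variable means a misconfigured replica stops at startup instead of serving requests against the wrong backend.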

Scale horizontally with your orchestrator of choice (Kubernetes, Docker Swarm, systemd, etc.).