MagicAF applications are stateless by design. Every component can be scaled independently.
## Scaling Strategy
| Component | Strategy | Notes |
|---|---|---|
| Embedding | Horizontal: N replicas behind a load balancer | CPU or GPU instances |
| Qdrant | Distributed mode (built-in clustering) | Sharding + replication |
| LLM | Horizontal: multiple GPU nodes behind a load balancer | vLLM supports tensor parallelism |
| Application | Horizontal: stateless, scale freely | Each instance gets its own RAGWorkflow |
## Embedding Service Scaling
Embedding is typically the most CPU/GPU-bound operation. Scale by running multiple replicas:
```yaml
services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:1.2
    deploy:
      replicas: 3
    # ...
```
Use a load balancer (nginx, HAProxy, or service mesh) in front. MagicAF connects to a single URL, so point `EMBEDDING_URL` at the load balancer.
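As a rough illustration, here is a Compose sketch with an nginx reverse proxy in front of the replicas. The `embedding-lb` service, its port, and the mounted `nginx.conf` (assumed to define an upstream over the `embedding` service) are illustrative assumptions, not part of MagicAF:

```yaml
services:
  embedding:
    image: ghcr.io/huggingface/text-embeddings-inference:1.2
    deploy:
      replicas: 3

  # Hypothetical reverse proxy that balances across the embedding replicas.
  embedding-lb:
    image: nginx:1.27
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # assumed config with an upstream over "embedding"
    ports:
      - "8080:8080"

  app:
    image: your-org/magicaf-app:latest          # placeholder MagicAF application image
    environment:
      EMBEDDING_URL: http://embedding-lb:8080   # point MagicAF at the load balancer
```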
## Qdrant Scaling
Qdrant supports distributed mode with built-in sharding and replication:
```bash
# Start a Qdrant cluster
docker compose up -d qdrant-node-1 qdrant-node-2 qdrant-node-3
```
See the Qdrant distributed deployment docs for cluster configuration.
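A minimal sketch of what such a cluster can look like in Compose, following the pattern described in those docs; node names, ports, and the image tag are placeholders:

```yaml
services:
  qdrant-node-1:
    image: qdrant/qdrant:latest
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    # The first node announces its own peer URI.
    command: ./qdrant --uri http://qdrant-node-1:6335

  qdrant-node-2:
    image: qdrant/qdrant:latest
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    # Additional nodes bootstrap against the first node.
    command: ./qdrant --bootstrap http://qdrant-node-1:6335

  qdrant-node-3:
    image: qdrant/qdrant:latest
    environment:
      QDRANT__CLUSTER__ENABLED: "true"
    command: ./qdrant --bootstrap http://qdrant-node-1:6335
```

Sharding and replication are then configured per collection (shard count and replication factor).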
## LLM Scaling
For high-throughput deployments:
- vLLM: Supports tensor parallelism across multiple GPUs on a single node
- Multiple nodes: Run separate vLLM instances and load balance
- Batching: vLLM automatically batches concurrent requests for efficiency
```yaml
services:
  llm:
    image: vllm/vllm-openai:latest
    deploy:
      replicas: 2
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
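If a single model instance should instead span multiple GPUs on one node, vLLM's `--tensor-parallel-size` flag splits the model across them. A sketch, where the model name and GPU count are placeholders:

```yaml
services:
  llm:
    image: vllm/vllm-openai:latest
    # Arguments are passed to vLLM's OpenAI-compatible server.
    command: ["--model", "meta-llama/Llama-3.1-8B-Instruct", "--tensor-parallel-size", "2"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2                # reserve as many GPUs as --tensor-parallel-size
              capabilities: [gpu]
```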
## Application Scaling
MagicAF applications are stateless; each instance constructs its own `RAGWorkflow` with the same configuration:
```rust
// Each application instance creates its own workflow
let workflow = RAGWorkflow::builder()
    .embedding_service(LocalEmbeddingService::new(config)?)
    .vector_store(QdrantVectorStore::new(vector_config).await?)
    .llm_service(LocalLlmService::new(llm_config)?)
    // ...
    .build()?;
```
Scale horizontally with your orchestrator of choice (Kubernetes, Docker Swarm, systemd, etc.).
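For example, with Docker Compose the application service itself can simply be replicated; the service name and image below are placeholders:

```yaml
services:
  app:
    image: your-org/magicaf-app:latest          # placeholder MagicAF application image
    environment:
      EMBEDDING_URL: http://embedding-lb:8080   # same configuration on every instance
    deploy:
      replicas: 4                               # stateless, so replicas can be added or removed freely
```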