Alpha Release - Model Deployment is currently in active development. Features and APIs may change. Not recommended for production use.

What is Model Deployment?

Model Deployment enables you to deploy Small Language Models (SLMs) and specialized finetuned models directly alongside your AI agents. By co-locating models with agents and data, you can achieve lower inference latency while eliminating external API dependencies and costs.

Why Co-Located Models?

Latency Reduction
Deploying models where your agents run eliminates network round trips to external APIs; inference happens locally, so latency is bounded by the model's own compute rather than by remote calls.
Cost Efficiency
Avoid per-token API pricing by running your own optimized models. For high-volume workloads, self-hosted models can significantly reduce inference costs.
Data Locality
Models deployed near your data sources access context faster and more securely. No data leaves your infrastructure, improving both performance and compliance.
Model Specialization
Deploy domain-specific finetuned models optimized for your use cases. Smaller, specialized models often outperform larger general-purpose models for specific tasks.
When to Use Co-Located Models:
  • High-volume inference workloads
  • Latency-sensitive applications (voice agents, real-time chat)
  • Domain-specific tasks where SLMs perform well
  • Data privacy and compliance requirements
  • Cost optimization for predictable workloads

Supported Model Types

Small Language Models (SLMs)
  • Phi-3, Llama-3 8B, Mistral 7B, and other compact models
  • Domain-specific finetuned variants
  • Quantized models for efficient inference
  • Multi-lingual and task-specific models
Specialized Models
  • Custom finetuned models from your training pipelines
  • Embedding models for vector search and retrieval
  • Classification and sentiment analysis models
  • Named entity recognition and extraction models
Model Sources
  • HuggingFace model hub
  • Custom model registries
  • Private model repositories
  • Local model artifacts
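Each of these sources is referenced through the model.source field used in the deployment configs below. A minimal sketch, assuming the private-registry and local artifact path formats shown here (both are hypothetical placeholders):

model:
  source: microsoft/Phi-3-mini-4k-instruct      # public HuggingFace model ID
  # source: private-registry/legal-llama-7b     # private or custom registry (hypothetical path format)
  # source: /models/support-classifier-v2       # local model artifact mounted into the pod (hypothetical)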

Deployment Architecture

Agent + Model Co-Location
├── Kubernetes Pod
│   ├── Agent Container
│   │   ├── Business Logic
│   │   ├── Tool Calling
│   │   └── Data Access Layer
│   ├── Model Container
│   │   ├── SLM Inference Engine
│   │   ├── Model Weights (8GB)
│   │   └── Local Cache
│   └── Shared Resources
│       ├── Unix Domain Sockets
│       ├── Shared Memory
│       └── Local Network (localhost)
└── Data Sources
    ├── PostgreSQL (same AZ)
    ├── Vector Store (co-located)
    └── S3 Cache (regional)
Co-location benefits:
  • Reduced network overhead through localhost communication
  • No egress costs as data and inference stay within the same pod/node
  • Efficient resource sharing between agents and models
  • Simplified networking without complex service mesh configuration
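To make this layout concrete, the sketch below shows a plain Kubernetes Pod with an agent container and a model container sharing an emptyDir volume for a Unix domain socket. This is an illustration only: image names, the socket path, and resource values are assumptions, and the actual manifests are generated by the deployment system from the configuration shown in the next section.

apiVersion: v1
kind: Pod
metadata:
  name: support-agent
spec:
  containers:
    - name: agent
      image: registry.example.com/support-agent:latest    # hypothetical agent image
      volumeMounts:
        - name: ipc
          mountPath: /var/run/inference                    # shared directory for the Unix socket
    - name: model
      image: registry.example.com/phi3-vllm:latest         # hypothetical SLM inference image
      resources:
        limits:
          memory: 16Gi
          nvidia.com/gpu: 1
      volumeMounts:
        - name: ipc
          mountPath: /var/run/inference
  volumes:
    - name: ipc
      emptyDir: {}                                         # in-pod scratch volume, no network hop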

Deployment Configuration

Basic Model Deployment

model_deployment:
  name: customer-support-phi3
  model:
    source: microsoft/Phi-3-mini-4k-instruct    # HuggingFace model ID
    quantization: int8                          # 8-bit weights for a smaller memory footprint
    max_tokens: 4096

  resources:
    gpu: nvidia-t4
    memory: 16Gi
    replicas: 2                                 # number of model replicas

  co_location:
    agent: support-agent                        # agent deployment to run alongside
    prefer_same_pod: true
    prefer_same_node: true

Advanced Configuration

model_deployment:
  name: legal-document-analyzer
  model:
    source: private-registry/legal-llama-7b     # custom finetuned model from a private registry
    optimization: onnx                          # ONNX-optimized model format
    batch_size: 8                               # batch concurrent requests for throughput

  scaling:
    min_replicas: 1
    max_replicas: 10

  inference:
    framework: vllm                             # inference engine; see Model Optimization below
    tensor_parallel: 2                          # shard the model across 2 GPUs
    kv_cache_size: 8Gi

Model Optimization

Quantization
Deploy models with 8-bit or 4-bit quantization to reduce memory footprint with minimal accuracy loss. Ideal for SLMs running on CPU or smaller GPUs.
Inference Frameworks
Choose from optimized inference engines such as vLLM (high-throughput batched serving), TGI (Hugging Face's Text Generation Inference), or Ollama (developer-friendly local serving). Each framework offers different performance tradeoffs.
Resource Allocation
SLMs typically require 8-16GB memory and 4-8 CPU cores. GPU acceleration (T4, A10) provides faster inference for high-throughput workloads.
Caching Strategies
Reuse the KV cache across requests that share a common prefix (prompt or prefix caching). For agent workflows with a consistent system prompt, this can significantly reduce time to first token.
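These optimization knobs map onto the deployment schema shown earlier. A combined sketch, assuming an int4 quantization value alongside the int8 shown above (values are illustrative, not tuned recommendations):

model_deployment:
  name: support-agent-slm                # hypothetical deployment name
  model:
    source: microsoft/Phi-3-mini-4k-instruct
    quantization: int4                   # 4-bit weights for CPU or small-GPU serving
    batch_size: 8                        # batch concurrent requests for throughput
    max_tokens: 4096

  resources:
    gpu: nvidia-t4
    memory: 16Gi

  inference:
    framework: vllm                      # vLLM shown; TGI and Ollama are alternative engines
    kv_cache_size: 8Gi                   # memory reserved for KV and prefix caching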

Integration with Agents

Models deployed alongside agents communicate via:
  • HTTP/REST: Standard inference API
  • gRPC: Efficient binary protocol
  • Unix Domain Sockets: Fastest option for same-pod communication
  • Shared Memory: Direct memory access for maximum performance
The agent deployment system automatically configures the optimal communication pattern based on co-location settings.
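Selection is automatic; purely as an illustration of what an explicit override could look like, the hypothetical sketch below adds a communication block under co_location. These keys are not part of the configuration schema shown on this page, and the socket path is an assumption matching the pod sketch above:

model_deployment:
  co_location:
    agent: support-agent
    prefer_same_pod: true
    communication:                                  # hypothetical override block, not in the current schema
      transport: unix_socket                        # http | grpc | unix_socket | shared_memory
      socket_path: /var/run/inference/model.sock    # assumed path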

Future: Continuous Model Learning

Coming Soon: Support for continuous model finetuning with real-time data streams is under development. This will enable models to learn from production data and improve over time without full offline retraining cycles. The model deployment architecture is designed to support future continuous learning capabilities:
  • Real-time data ingestion from agent interactions
  • Incremental model updates without downtime
  • A/B testing between model versions
  • Automated evaluation and rollback
This positions VantEdge to support the next generation of adaptive AI systems that continuously improve from production data.
Model Deployment brings inference closer to your agents and data, enabling the performance required for production AI applications.