Alpha Release - Model Deployment is currently in active development. Features and APIs may change. Not recommended for production use.

What is Model Deployment?

Model Deployment enables you to deploy Small Language Models (SLMs) and specialized finetuned models directly alongside your AI agents. By co-locating models with agents and data, you can achieve lower inference latency while eliminating external API dependencies and costs.

Why Co-Located Models?

Latency Reduction
Deploying models where your agents run eliminates network round trips to external APIs; inference happens locally, so latency is bounded by the model's own compute rather than by remote calls.
Cost Efficiency
Avoid per-token API pricing by running your own optimized models. For high-volume workloads, self-hosted models can significantly reduce inference costs.
Data Locality
Models deployed near your data sources access context faster and more securely. No data leaves your infrastructure, improving both performance and compliance.
Model Specialization
Deploy domain-specific finetuned models optimized for your use cases. Smaller, specialized models often outperform larger general-purpose models for specific tasks.
When to Use Co-Located Models:
  • High-volume inference workloads
  • Latency-sensitive applications (voice agents, real-time chat)
  • Domain-specific tasks where SLMs perform well
  • Data privacy and compliance requirements
  • Cost optimization for predictable workloads

Supported Model Types

Small Language Models (SLMs)
  • Phi-3, Llama-3 8B, Mistral 7B, and other compact models
  • Domain-specific finetuned variants
  • Quantized models for efficient inference
  • Multi-lingual and task-specific models
Specialized Models
  • Custom finetuned models from your training pipelines
  • Embedding models for vector search and retrieval
  • Classification and sentiment analysis models
  • Named entity recognition and extraction models
Model Sources
  • HuggingFace model hub
  • Custom model registries
  • Private model repositories
  • Local model artifacts
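Each of these sources is referenced through the model.source field used in the deployment configs below. A minimal sketch, assuming the private-registry and local artifact path formats shown here (both are hypothetical placeholders):

model:
  source: microsoft/Phi-3-mini-4k-instruct      # public HuggingFace model ID
  # source: private-registry/legal-llama-7b     # private or custom registry (hypothetical path format)
  # source: /models/support-classifier-v2       # local model artifact mounted into the pod (hypothetical)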

Deployment Architecture

Agent + Model Co-Location
├── Kubernetes Pod
│   ├── Agent Container
│   │   ├── Business Logic
│   │   ├── Tool Calling
│   │   └── Data Access Layer
│   ├── Model Container
│   │   ├── SLM Inference Engine
│   │   ├── Model Weights (8GB)
│   │   └── Local Cache
│   └── Shared Resources
│       ├── Unix Domain Sockets
│       ├── Shared Memory
│       └── Local Network (localhost)
└── Data Sources
    ├── PostgreSQL (same AZ)
    ├── Vector Store (co-located)
    └── S3 Cache (regional)
Co-location benefits:
  • Reduced network overhead through localhost communication
  • No egress costs as data and inference stay within the same pod/node
  • Efficient resource sharing between agents and models
  • Simplified networking without complex service mesh configuration
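To make this layout concrete, the sketch below shows a plain Kubernetes Pod with an agent container and a model container sharing an emptyDir volume for a Unix domain socket. This is an illustration only: image names, the socket path, and resource values are assumptions, and the actual manifests are generated by the deployment system from the configuration shown in the next section.

apiVersion: v1
kind: Pod
metadata:
  name: support-agent
spec:
  containers:
    - name: agent
      image: registry.example.com/support-agent:latest    # hypothetical agent image
      volumeMounts:
        - name: ipc
          mountPath: /var/run/inference                    # shared directory for the Unix socket
    - name: model
      image: registry.example.com/phi3-vllm:latest         # hypothetical SLM inference image
      resources:
        limits:
          memory: 16Gi
          nvidia.com/gpu: 1
      volumeMounts:
        - name: ipc
          mountPath: /var/run/inference
  volumes:
    - name: ipc
      emptyDir: {}                                         # in-pod scratch volume, no network hop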

Deployment Configuration

Basic Model Deployment

model_deployment:
  name: customer-support-phi3
  model:
    source: microsoft/Phi-3-mini-4k-instruct    # HuggingFace model ID
    quantization: int8                          # 8-bit weights for a smaller memory footprint
    max_tokens: 4096

  resources:
    gpu: nvidia-t4
    memory: 16Gi
    replicas: 2                                 # number of model replicas

  co_location:
    agent: support-agent                        # agent deployment to run alongside
    prefer_same_pod: true
    prefer_same_node: true

Advanced Configuration

model_deployment:
  name: legal-document-analyzer
  model:
    source: private-registry/legal-llama-7b     # custom finetuned model from a private registry
    optimization: onnx                          # ONNX-optimized model format
    batch_size: 8                               # batch concurrent requests for throughput

  scaling:
    min_replicas: 1
    max_replicas: 10

  inference:
    framework: vllm                             # inference engine; see Model Optimization below
    tensor_parallel: 2                          # shard the model across 2 GPUs
    kv_cache_size: 8Gi

Model Optimization

Quantization
Deploy models with 8-bit or 4-bit quantization to reduce memory footprint with minimal accuracy loss. Ideal for SLMs running on CPU or smaller GPUs.
Inference Frameworks
Choose from optimized inference engines such as vLLM (high-throughput batched serving), TGI (Hugging Face's Text Generation Inference), or Ollama (developer-friendly local serving). Each framework offers different performance tradeoffs.
Resource Allocation
SLMs typically require 8-16GB memory and 4-8 CPU cores. GPU acceleration (T4, A10) provides faster inference for high-throughput workloads.
Caching Strategies
Reuse the KV cache across requests that share a common prefix (prompt or prefix caching). For agent workflows with a consistent system prompt, this can significantly reduce time to first token.
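These optimization knobs map onto the deployment schema shown earlier. A combined sketch, assuming an int4 quantization value alongside the int8 shown above (values are illustrative, not tuned recommendations):

model_deployment:
  name: support-agent-slm                # hypothetical deployment name
  model:
    source: microsoft/Phi-3-mini-4k-instruct
    quantization: int4                   # 4-bit weights for CPU or small-GPU serving
    batch_size: 8                        # batch concurrent requests for throughput
    max_tokens: 4096

  resources:
    gpu: nvidia-t4
    memory: 16Gi

  inference:
    framework: vllm                      # vLLM shown; TGI and Ollama are alternative engines
    kv_cache_size: 8Gi                   # memory reserved for KV and prefix caching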

Integration with Agents

Models deployed alongside agents communicate via:
  • HTTP/REST: Standard inference API
  • gRPC: Efficient binary protocol
  • Unix Domain Sockets: Fastest option for same-pod communication
  • Shared Memory: Direct memory access for maximum performance
The agent deployment system automatically configures the optimal communication pattern based on co-location settings.
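Selection is automatic; purely as an illustration of what an explicit override could look like, the hypothetical sketch below adds a communication block under co_location. These keys are not part of the configuration schema shown on this page, and the socket path is an assumption matching the pod sketch above:

model_deployment:
  co_location:
    agent: support-agent
    prefer_same_pod: true
    communication:                                  # hypothetical override block, not in the current schema
      transport: unix_socket                        # http | grpc | unix_socket | shared_memory
      socket_path: /var/run/inference/model.sock    # assumed path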

Future: Continuous Model Learning

Coming Soon: Support for continuous model finetuning with real-time data streams is under development. This will enable models to learn from production data and improve over time without full offline retraining cycles. The model deployment architecture is designed to support future continuous learning capabilities:
  • Real-time data ingestion from agent interactions
  • Incremental model updates without downtime
  • A/B testing between model versions
  • Automated evaluation and rollback
This positions VantEdge to support the next generation of adaptive AI systems that continuously improve from production data.
Model Deployment brings inference closer to your agents and data, enabling the performance required for production AI applications.