Alpha Release - Model Deployment is currently in active development. Features and APIs may change. Not recommended for production use.
What is Model Deployment?
Model Deployment enables you to deploy Small Language Models (SLMs) and specialized finetuned models directly alongside your AI agents. By co-locating models with agents and data, you can achieve lower inference latency while eliminating external API dependencies and costs.
Why Co-Located Models?
Latency Reduction
Deploying models where your agents run eliminates the network round trips and overhead of external API calls.
Cost Efficiency
Avoid per-token API pricing by running your own optimized models. For high-volume workloads, self-hosted models can significantly reduce inference costs.
Data Locality
Models deployed near your data sources access context faster and more securely. No data leaves your infrastructure, improving both performance and compliance.
Model Specialization
Deploy domain-specific finetuned models optimized for your use cases. Smaller, specialized models often outperform larger general-purpose models on specific tasks.
When to Use Co-Located Models:
- High-volume inference workloads
- Latency-sensitive applications (voice agents, real-time chat)
- Domain-specific tasks where SLMs perform well
- Data privacy and compliance requirements
- Cost optimization for predictable workloads
Supported Model Types
Small Language Models (SLMs)
- Phi-3, Llama-3 8B, Mistral 7B, and other compact models
- Domain-specific finetuned variants
- Quantized models for efficient inference
- Multi-lingual and task-specific models
Specialized and Finetuned Models
- Custom finetuned models from your training pipelines
- Embedding models for vector search and retrieval
- Classification and sentiment analysis models
- Named entity recognition and extraction models
Model Sources
Models can be pulled from any of the following sources (a short fetch example follows the list):
- HuggingFace model hub
- Custom model registries
- Private model repositories
- Local model artifacts
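As an illustration of pulling a model artifact locally before deployment, the snippet below uses `huggingface_hub.snapshot_download`; the specific model (`microsoft/Phi-3-mini-4k-instruct`) and target directory are assumptions, not a required workflow.

```python
from huggingface_hub import snapshot_download

# Download the model weights to a local directory so they can be
# referenced as a local model artifact at deploy time.
local_path = snapshot_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct",  # any compact SLM from the hub
    local_dir="./models/phi-3-mini",
)
print(f"Model artifacts available at {local_path}")
```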
Deployment Architecture
Models are deployed in the same pod or node as the agents that consume them (a minimal Kubernetes sketch follows the list below). This co-location provides:
- Reduced network overhead through localhost communication
- No egress costs as data and inference stay within the same pod/node
- Efficient resource sharing between agents and models
- Simplified networking without complex service mesh configuration
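To make co-location concrete, here is a minimal sketch of a two-container Kubernetes pod in which the agent and the model server share localhost; the container images, model name, and resource values are assumptions.

```python
from kubernetes import client

# Sidecar-style pod: the agent and the model server share one pod,
# so the agent reaches the model over localhost with no egress.
model_container = client.V1Container(
    name="slm-server",
    image="vllm/vllm-openai:latest",  # assumption: vLLM's OpenAI-compatible server image
    args=["--model", "microsoft/Phi-3-mini-4k-instruct", "--port", "8000"],
    ports=[client.V1ContainerPort(container_port=8000)],
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi"},
        limits={"cpu": "8", "memory": "16Gi"},
    ),
)

agent_container = client.V1Container(
    name="agent",
    image="my-registry/my-agent:latest",  # hypothetical agent image
    env=[client.V1EnvVar(name="MODEL_ENDPOINT", value="http://localhost:8000/v1")],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="agent-with-colocated-model"),
    spec=client.V1PodSpec(containers=[agent_container, model_container]),
)

# Submit the pod (assumes a kubeconfig is available):
# from kubernetes import config
# config.load_kube_config()
# client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```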
Deployment Configuration
Basic Model Deployment
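The alpha release does not yet document a finalized deployment API, so the sketch below uses a hypothetical spec with illustrative field names (`source`, `framework`, `resources`) to show the kind of information a basic co-located deployment needs.

```python
# Hypothetical deployment spec -- field names are illustrative, not a finalized API.
basic_deployment = {
    "name": "phi-3-mini",
    "source": "huggingface://microsoft/Phi-3-mini-4k-instruct",  # or a local artifact path
    "framework": "vllm",        # vllm | tgi | ollama
    "resources": {
        "cpu": "4",
        "memory": "16Gi",
    },
    "port": 8000,
}

# Once an SDK/CLI surface is finalized, a deploy call might look like
# (hypothetical): client.deploy_model(**basic_deployment)
```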
Advanced Configuration
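Again purely as a sketch with hypothetical field names, an advanced spec might layer on the optimization options discussed in the Model Optimization section below: quantization, GPU acceleration, and caching.

```python
# Hypothetical advanced spec -- field names are illustrative, not a finalized API.
advanced_deployment = {
    "name": "phi-3-mini",
    "source": "huggingface://microsoft/Phi-3-mini-4k-instruct",
    "framework": "vllm",        # vllm | tgi | ollama
    "quantization": "int4",     # 8-bit or 4-bit quantization to shrink the memory footprint
    "resources": {
        "gpu": "nvidia-t4",     # T4/A10 acceleration for high-throughput workloads
        "cpu": "8",
        "memory": "16Gi",
    },
    "caching": {
        "kv_cache": True,       # reuse attention state within a generation
        "prompt_cache": True,   # cache common prefixes such as shared system prompts
    },
    "port": 8000,
}
```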
Model Optimization
Quantization
Deploy models with 8-bit or 4-bit quantization to reduce memory footprint with minimal accuracy loss. Ideal for SLMs running on CPU or smaller GPUs.
Inference Frameworks
Choose from optimized inference engines such as vLLM (high throughput), TGI (tensor parallelism), or Ollama (developer-friendly). Each framework offers different performance tradeoffs.
Resource Allocation
SLMs typically require 8-16 GB of memory and 4-8 CPU cores. GPU acceleration (T4, A10) provides faster inference for high-throughput workloads.
Caching Strategies
Implement KV-caching for repeated prompts and prompt caching for common prefixes. For agent workflows with consistent system prompts, caching can significantly reduce latency.
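As one concrete, non-normative example, vLLM's offline Python API can load a quantized SLM and run batched generation; the quantized checkpoint name and sampling settings below are assumptions.

```python
from vllm import LLM, SamplingParams

# Load a compact model with 4-bit AWQ quantization (assumes an AWQ-quantized
# checkpoint is available for the chosen model).
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example quantized checkpoint
    quantization="awq",
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.2, max_tokens=128)
outputs = llm.generate(
    ["Summarize the ticket: customer cannot reset their password."],
    params,
)
print(outputs[0].outputs[0].text)
```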
Integration with Agents
Models deployed alongside agents communicate via one of the following transports (a localhost example follows the list):
- HTTP/REST: Standard inference API
- gRPC: Efficient binary protocol
- Unix Domain Sockets: Fastest option for same-pod communication
- Shared Memory: Direct memory access for maximum performance
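For example, if the model server exposes an OpenAI-compatible HTTP endpoint on localhost (as vLLM and several other inference servers do), an agent in the same pod could call it as below; the port, path, and deployed model name are assumptions.

```python
import requests

# Call the co-located model over localhost -- no external network hop.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed OpenAI-compatible endpoint
    json={
        "model": "phi-3-mini",  # name under which the model was deployed
        "messages": [
            {"role": "user", "content": "Classify this ticket: 'refund not received'"}
        ],
        "max_tokens": 64,
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```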
Future: Continuous Model Learning
Coming Soon: Support for continuous model finetuning with real-time data streams is under development. This will enable models to learn from production data and improve over time without retraining cycles. The model deployment architecture is designed to support future continuous learning capabilities:
- Real-time data ingestion from agent interactions
- Incremental model updates without downtime
- A/B testing between model versions
- Automated evaluation and rollback
Model Deployment brings inference closer to your agents and data, enabling the performance required for production AI applications.