Create a Deployment when you want to serve your trained LoRA through a dedicated serverless endpoint (stable deployment_id, GPU selection, autoscaling).
If you don’t need a deployment (you just want on-demand inference and per-request billing), use the Model API instead.
See: Choose a LoRA inference API
Prerequisites
You need a LoRA Asset in RunComfy Trainer:
- Train a LoRA in Trainer (it appears under LoRA Assets), or
- Import a LoRA (.safetensors). If available, also provide the training config so RunComfy can keep the base model + defaults consistent.
LoRA Assets live here: Trainer > LoRA Assets
Create a Deployment
You can create a deployment from either place:
- From LoRA Assets, click Deploy / Deploy Endpoint (recommended — the LoRA is preselected)
- From Deployments, click Create a deployment
Both routes open the same configuration screen.
1) Name the deployment
Give it a human-readable name for dashboards and logs. API calls use the generated deployment_id.
2) Select a LoRA (base model is pinned)
Select the LoRA you want to serve. The deployment is pinned to the base model that LoRA was trained with.
3) Choose hardware
Pick a GPU tier based on your latency target and VRAM needs. Exact GPU models can vary by region/capacity, but the VRAM tier is what matters.
Autoscaling controls how many instances can run in parallel and whether you keep warm capacity.
Common knobs (see the sketch after this list):
- Minimum instances (0–3): the baseline number of instances kept running. Set to 0 to minimize cost (idle deployments can stay at $0/hr), but the first request after idle may cold start. Set to 1+ to keep capacity warm and reduce cold starts, with ongoing runtime cost.
- Maximum instances (1–10): the upper bound on instances that can run in parallel. This caps cost and effectively defines your concurrency ceiling; requests beyond capacity wait in the queue until an instance is available.
- Queue size (≥ 1): how many requests can remain pending while the deployment scales up (up to the maximum instances limit). Lower values prioritize latency (excess requests fail sooner); higher values prioritize cost (buffer bursts and scale up gradually).
- Keep warm duration (seconds): how long an instance stays up after finishing a request before scaling down. Shorter windows reduce idle cost; longer windows improve responsiveness for bursty traffic by avoiding frequent cold starts.
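These knobs are set in the deployment configuration screen, not via code. The sketch below is only an illustrative model of how they relate: the class name, field names, and defaults are assumptions made for this example, not an official RunComfy SDK or API schema, but the ranges mirror the limits documented above.
```python
# Illustrative only: field names mirror the dashboard knobs described above;
# this is not an official RunComfy SDK or API schema.
from dataclasses import dataclass


@dataclass
class AutoscalingConfig:
    min_instances: int = 0        # 0-3: baseline warm capacity (0 = scale to zero)
    max_instances: int = 1        # 1-10: concurrency ceiling and cost cap
    queue_size: int = 10          # >= 1: pending requests allowed while scaling up
    keep_warm_seconds: int = 60   # idle time before an instance scales down

    def validate(self) -> None:
        """Check the documented ranges before applying the settings."""
        if not 0 <= self.min_instances <= 3:
            raise ValueError("min_instances must be between 0 and 3")
        if not 1 <= self.max_instances <= 10:
            raise ValueError("max_instances must be between 1 and 10")
        if self.min_instances > self.max_instances:
            raise ValueError("min_instances cannot exceed max_instances")
        if self.queue_size < 1:
            raise ValueError("queue_size must be at least 1")


# Example: a low-cost profile that scales to zero when idle and accepts
# some cold-start latency on the first request after a quiet period.
low_cost = AutoscalingConfig(min_instances=0, max_instances=2,
                             queue_size=20, keep_warm_seconds=120)
low_cost.validate()
```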
Deploy
Click Deploy. When it finishes, copy the deployment_id from the deployment details page.
Next: Submit requests via the Async Queue Endpoints.
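Below is a minimal sketch of how a client might queue a request against the deployment and poll for the result. The base URL, request paths, payload fields, and auth header are placeholders invented for this example; use the exact paths and fields documented in the Async Queue Endpoints reference.
```python
# Sketch only: API_BASE, paths, and payload/response fields are placeholders,
# not the real RunComfy endpoints. Substitute the values from the
# Async Queue Endpoints reference.
import os
import time

import requests

API_BASE = "https://api.example.com"           # placeholder, not the real host
API_KEY = os.environ["RUNCOMFY_API_KEY"]       # assumed auth scheme
DEPLOYMENT_ID = os.environ["DEPLOYMENT_ID"]    # copied from the deployment details page

headers = {"Authorization": f"Bearer {API_KEY}"}

# 1) Queue a request against the deployment.
submit = requests.post(
    f"{API_BASE}/deployments/{DEPLOYMENT_ID}/queue",   # hypothetical path
    json={"prompt": "a photo of my trained subject"},  # hypothetical payload
    headers=headers,
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["request_id"]               # hypothetical field

# 2) Poll until the queued request completes. Cold starts can add latency
#    when Minimum instances is 0.
while True:
    status = requests.get(
        f"{API_BASE}/requests/{request_id}",           # hypothetical path
        headers=headers,
        timeout=30,
    )
    status.raise_for_status()
    body = status.json()
    if body.get("status") in ("completed", "failed"):
        print(body)
        break
    time.sleep(2)
```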