Create a Deployment when you want to serve your trained LoRA through a dedicated serverless endpoint (stable deployment_id, GPU selection, autoscaling).
If you don’t need a deployment (you just want on-demand inference and per-request billing), use the Model API instead.
See: Choose a LoRA inference API

Prerequisites

You need a LoRA Asset in RunComfy Trainer:
  • Train a LoRA in Trainer (it appears under LoRA Assets), or
  • Import a LoRA (.safetensors). If available, also provide the training config so RunComfy can keep the base model and default settings consistent.
LoRA Assets live here: Trainer > LoRA Assets

Create a Deployment

You can create a deployment from either place:
  • From LoRA Assets, click Deploy / Deploy Endpoint (recommended — the LoRA is preselected)
  • From Deployments, click Create a deployment
Both routes open the same configuration screen.

1) Name the deployment

Give it a human-readable name for dashboards and logs. API calls use the generated deployment_id.

2) Select a LoRA (base model is pinned)

Select the LoRA you want to serve. The deployment is pinned to the base model the LoRA was trained with.

3) Choose hardware

Pick a GPU tier based on your latency target and VRAM needs. Exact GPU models can vary by region/capacity, but the VRAM tier is what matters.

4) Configure autoscaling

Autoscaling controls how many instances can run in parallel and whether you keep warm capacity. Common knobs (see the sketch after this list):
  • Minimum instances (0–3): the baseline number of instances kept running. Set to 0 to minimize cost (idle deployments can stay at $0/hr), but the first request after idle may cold start. Set to 1+ to keep capacity warm and reduce cold starts, with ongoing runtime cost.
  • Maximum instances (1–10): the upper bound on instances that can run in parallel. This caps cost and effectively defines your concurrency ceiling; requests beyond capacity will wait in the queue until an instance is available.
  • Queue size (≥ 1): how many requests can remain pending while the deployment scales up (up to the maximum instances limit). Lower values favor latency (excess requests are rejected or back-pressured sooner); higher values favor cost (bursts are buffered while instances scale up gradually).
  • Keep warm duration (seconds): how long an instance stays up after finishing a request before scaling down. Shorter windows reduce idle cost; longer windows improve responsiveness for bursty traffic by avoiding frequent cold starts.
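
As a rough guide, here are two illustrative autoscaling profiles expressed as plain Python dicts. The field names are hypothetical shorthand for the knobs above (not RunComfy API parameters), and the values simply stay within the documented ranges; you set these in the deployment form, not via code.

```python
# Illustrative profiles only; field names are hypothetical shorthand for the knobs
# above, not RunComfy API parameters. Values stay within the documented ranges.

cost_optimized = {
    "min_instances": 0,       # scale to zero when idle ($0/hr), accept cold starts
    "max_instances": 2,       # low concurrency ceiling caps spend
    "queue_size": 20,         # buffer bursts instead of scaling out aggressively
    "keep_warm_seconds": 60,  # scale down quickly once traffic stops
}

latency_optimized = {
    "min_instances": 1,        # keep one warm instance so first requests skip cold start
    "max_instances": 10,       # scale out quickly under load
    "queue_size": 5,           # keep the queue short so requests are not left waiting
    "keep_warm_seconds": 600,  # ride out gaps between bursts without cold starting
}
```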

Deploy

Click Deploy. When it finishes, copy the deployment_id from the deployment details page. Next: Submit requests via the Async Queue Endpoints.
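
To make that handoff concrete, here is a minimal sketch of the async-queue flow in Python. The base URL, endpoint paths, payload fields, auth header, and status values shown are assumptions for illustration only; use the request and response shapes documented under Async Queue Endpoints.

```python
# Illustrative sketch only: endpoint paths, payload fields, and auth header are
# assumptions, not the documented RunComfy API. See "Async Queue Endpoints" for
# the actual request/response shapes.
import time
import requests

API_BASE = "https://api.example.com"   # assumption: replace with the documented base URL
DEPLOYMENT_ID = "dep_xxxxxxxx"         # copied from the deployment details page
API_KEY = "YOUR_API_KEY"

headers = {"Authorization": f"Bearer {API_KEY}"}

# 1) Queue a request against the deployment (hypothetical path and payload).
resp = requests.post(
    f"{API_BASE}/deployments/{DEPLOYMENT_ID}/requests",
    headers=headers,
    json={"prompt": "a watercolor fox, soft lighting"},
)
resp.raise_for_status()
request_id = resp.json()["request_id"]  # assumption: field name may differ

# 2) Poll until the queued request completes (hypothetical status endpoint).
while True:
    status = requests.get(
        f"{API_BASE}/deployments/{DEPLOYMENT_ID}/requests/{request_id}",
        headers=headers,
    ).json()
    if status.get("status") in ("succeeded", "failed"):
        break
    time.sleep(2)

print(status)
```

If min instances is 0, expect the first request after an idle period to wait through a cold start before the queue drains.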