Create a Deployment when you want to serve your trained LoRA through a dedicated serverless endpoint (stable deployment_id, GPU selection, autoscaling).
If you don’t need a deployment (you just want on-demand inference and per-request billing), use the Model API instead.
See: Choose a LoRA inference API
Prerequisites
You need a LoRA Asset in RunComfy Trainer:
- Train a LoRA in Trainer (it appears under LoRA Assets), or
- Import a LoRA (.safetensors). If available, also provide the training config so RunComfy can keep the base model + defaults consistent.
LoRA Assets live here: Trainer > LoRA Assets
Create a Deployment
You can create a deployment from either place:
- From LoRA Assets, click Deploy / Deploy Endpoint (recommended — the LoRA is preselected)
- From Deployments, click Create a deployment
Both routes open the same configuration screen.
1) Name the deployment
Give it a human-readable name for dashboards and logs. API calls use the generated deployment_id.
2) Select a LoRA (base model is pinned)
Select the LoRA you want to serve. The deployment is pinned to the base model that LoRA was trained with.
3) Choose hardware
Pick a GPU tier based on your latency target and VRAM needs. Exact GPU models can vary by region/capacity, but the VRAM tier is what matters.
Autoscaling controls how many instances can run in parallel and whether you keep warm capacity.
Common knobs (see the sketch after this list):
- Minimum instances (0–3): the baseline number of instances kept running. Set to 0 to minimize cost (idle deployments can stay at $0/hr), but the first request after idle may cold start. Set to 1+ to keep capacity warm and reduce cold starts, with ongoing runtime cost.
- Maximum instances (1–10): the upper bound on instances that can run in parallel. This caps cost and effectively defines your concurrency ceiling; requests beyond capacity wait in the queue until an instance is available.
- Queue size (≥ 1): how many requests can remain pending while the deployment scales up (up to the maximum instances limit). Lower values prioritize latency (excess requests fail sooner); higher values prioritize cost (buffer bursts and scale up gradually).
- Keep warm duration (seconds): how long an instance stays up after finishing a request before scaling down. Shorter windows reduce idle cost; longer windows improve responsiveness for bursty traffic by avoiding frequent cold starts.
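These knobs are set in the deployment configuration screen, not via code. The sketch below is only an illustrative model of how they relate: the class name, field names, and defaults are assumptions made for this example, not an official RunComfy SDK or API schema, but the ranges mirror the limits documented above.
```python
# Illustrative only: field names mirror the dashboard knobs described above;
# this is not an official RunComfy SDK or API schema.
from dataclasses import dataclass


@dataclass
class AutoscalingConfig:
    min_instances: int = 0        # 0-3: baseline warm capacity (0 = scale to zero)
    max_instances: int = 1        # 1-10: concurrency ceiling and cost cap
    queue_size: int = 10          # >= 1: pending requests allowed while scaling up
    keep_warm_seconds: int = 60   # idle time before an instance scales down

    def validate(self) -> None:
        """Check the documented ranges before applying the settings."""
        if not 0 <= self.min_instances <= 3:
            raise ValueError("min_instances must be between 0 and 3")
        if not 1 <= self.max_instances <= 10:
            raise ValueError("max_instances must be between 1 and 10")
        if self.min_instances > self.max_instances:
            raise ValueError("min_instances cannot exceed max_instances")
        if self.queue_size < 1:
            raise ValueError("queue_size must be at least 1")


# Example: a low-cost profile that scales to zero when idle and accepts
# some cold-start latency on the first request after a quiet period.
low_cost = AutoscalingConfig(min_instances=0, max_instances=2,
                             queue_size=20, keep_warm_seconds=120)
low_cost.validate()
```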
Deploy
Click Deploy. When it finishes, copy the deployment_id from the deployment details page.
Next: Submit requests via the Async Queue Endpoints.
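Below is a minimal sketch of how a client might queue a request against the deployment and poll for the result. The base URL, request paths, payload fields, and auth header are placeholders invented for this example; use the exact paths and fields documented in the Async Queue Endpoints reference.
```python
# Sketch only: API_BASE, paths, and payload/response fields are placeholders,
# not the real RunComfy endpoints. Substitute the values from the
# Async Queue Endpoints reference.
import os
import time

import requests

API_BASE = "https://api.example.com"           # placeholder, not the real host
API_KEY = os.environ["RUNCOMFY_API_KEY"]       # assumed auth scheme
DEPLOYMENT_ID = os.environ["DEPLOYMENT_ID"]    # copied from the deployment details page

headers = {"Authorization": f"Bearer {API_KEY}"}

# 1) Queue a request against the deployment.
submit = requests.post(
    f"{API_BASE}/deployments/{DEPLOYMENT_ID}/queue",   # hypothetical path
    json={"prompt": "a photo of my trained subject"},  # hypothetical payload
    headers=headers,
    timeout=30,
)
submit.raise_for_status()
request_id = submit.json()["request_id"]               # hypothetical field

# 2) Poll until the queued request completes. Cold starts can add latency
#    when Minimum instances is 0.
while True:
    status = requests.get(
        f"{API_BASE}/requests/{request_id}",           # hypothetical path
        headers=headers,
        timeout=30,
    )
    status.raise_for_status()
    body = status.json()
    if body.get("status") in ("completed", "failed"):
        print(body)
        break
    time.sleep(2)
```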