deployment_id). Use deployments when you want a stable endpoint with configurable hardware and autoscaling.
RunComfy handles containerization, GPU allocation, and scaling; your application just calls the endpoint.
Note: If you don’t want to use the web UI, you can also create deployments via API. See Deployment Endpoints.
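For orientation, here is a minimal sketch of what an API-based creation call might look like, assuming a REST endpoint with bearer-token auth. The actual URL, field names, and auth scheme are defined in Deployment Endpoints, so treat everything below as placeholders:

```python
# Hypothetical sketch only: the real endpoint, field names, and auth scheme
# are documented in Deployment Endpoints; everything here is an assumption.
import requests

API_BASE = "https://api.runcomfy.example"  # placeholder; use the base URL from the docs
API_TOKEN = "YOUR_API_TOKEN"               # assumed bearer-token auth

resp = requests.post(
    f"{API_BASE}/deployments",  # assumed path
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "workflow_id": "YOUR_WORKFLOW_ID",  # the workflow to deploy
        "hardware": "24GB",                 # see the VRAM tiers below
        "min_instances": 0,                 # autoscaling knobs, step 3
        "max_instances": 1,
        "queue_size": 1,
        "keep_warm_seconds": 60,
    },
)
resp.raise_for_status()
deployment_id = resp.json()["deployment_id"]  # assumed response field
```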
Create a deployment (web UI)
You can create a deployment from:
- Deployments: Deployments → Deploy workflow as API
- My Workflows: My Workflows → select a workflow → Deploy as API
- Explore: Explore → select a workflow → Deploy as API
1) Select a workflow
To deploy as an API, choose either:
- a custom workflow from My Workflows (built/modified by you and Cloud Saved with dependencies), or
- a community workflow from Explore (pre-saved and ready to deploy)
2) Configure hardware
Choose GPU hardware based on your workflow’s VRAM requirements and performance needs. Test the workflow in a ComfyUI session first to estimate usage and avoid runtime errors; a sketch for matching an estimate to a tier follows the list below. Typical VRAM tiers include:
- 16GB: T4 or A4000
- 24GB: A10G or A5000
- 48GB: A6000
- 48GB Plus: L40S or L40
- 80GB: A100
- 80GB Plus: H100
- 141GB: H200
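To make the tier matching concrete, here is an illustrative helper that maps an estimated peak VRAM figure (plus headroom) to the smallest tier that covers it. The thresholds mirror the list above; the function itself is ours, not part of any RunComfy SDK:

```python
# Illustrative helper for picking a GPU tier from an estimated peak VRAM
# figure; tier labels mirror the list above, but this is not a RunComfy API.
TIERS = [
    (16, "16GB (T4 / A4000)"),
    (24, "24GB (A10G / A5000)"),
    (48, "48GB (A6000)"),
    (48, "48GB Plus (L40S / L40)"),
    (80, "80GB (A100)"),
    (80, "80GB Plus (H100)"),
    (141, "141GB (H200)"),
]

def pick_tier(estimated_vram_gb: float, headroom: float = 1.2) -> str:
    """Return the smallest tier whose VRAM covers the estimate plus headroom."""
    needed = estimated_vram_gb * headroom  # leave room for usage spikes
    for vram, label in TIERS:
        if vram >= needed:
            return label
    raise ValueError("Workflow exceeds the largest available tier")

print(pick_tier(18))  # -> "24GB (A10G / A5000)"
```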
3) Configure autoscaling
Autoscaling controls how many instances (running containerized copies of your workflow) can run in parallel and how quickly the deployment scales up and down. Common knobs (a sketch of how they interact follows this list):
- Minimum instances (0–30): Baseline warm capacity. Setting this to 1 keeps one instance always warm (avoids cold starts but incurs ongoing cost). Use 0 to allow scale-to-zero (lowest idle cost), but the first request after idle may take a few minutes to start.
- Maximum instances (1–60): Upper bound for parallel instances. Requests above capacity will queue. This caps cost and defines your concurrency ceiling. Need higher limits? Contact us.
- Queue size (≥ 1): How many pending requests are allowed before the deployment tries to add capacity (up to max instances). Lower values prioritize latency; higher values buffer spikes.
- Keep warm (seconds): How long an idle instance stays active after its last job before shutting down. Longer windows reduce cold-start frequency for bursty traffic, but increase idle cost.
Tip: A reasonable starting point for many apps is minimum 0, maximum 1, queue size 1, keep warm 60. Then tune based on real traffic.
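As a mental model of how these knobs interact, here is a minimal sketch in Python. RunComfy’s real scheduler is internal, so this only illustrates the documented behavior (queue-triggered scale-up, keep-warm shutdown, min/max clamping), not the actual algorithm:

```python
# Mental-model sketch only: RunComfy's actual scheduler is internal.
from dataclasses import dataclass

@dataclass
class Autoscaler:
    min_instances: int = 0   # baseline warm capacity
    max_instances: int = 1   # concurrency ceiling; extra requests queue
    queue_size: int = 1      # pending requests tolerated before scaling up
    keep_warm_s: int = 60    # idle time before an instance shuts down

    def desired_instances(self, busy: int, queued: int, idle_for_s: float) -> int:
        target = busy
        if queued >= self.queue_size:                     # backlog: add capacity
            target = busy + 1
        elif busy == 0 and idle_for_s > self.keep_warm_s:  # keep-warm expired
            target = 0                                     # scale to zero
        return max(self.min_instances, min(target, self.max_instances))

scaler = Autoscaler()  # the tip's starting point: 0 / 1 / 1 / 60
print(scaler.desired_instances(busy=1, queued=2, idle_for_s=0))    # 1 (capped at max)
print(scaler.desired_instances(busy=0, queued=0, idle_for_s=120))  # 0 (scaled to zero)
```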
4) Deploy
Review your workflow, hardware, and scaling settings, then click Deploy. After creation, the deployment details page shows your deployment_id; use it in all inference calls.
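Here is a minimal sketch of an inference call keyed by deployment_id, assuming a REST-style path and bearer-token auth. The real request shape is specified in the Async Queue Endpoints docs, so the URL, payload fields, and response below are placeholders:

```python
# Hypothetical inference call; the real request shape is defined by the
# Async Queue Endpoints docs. URL, fields, and auth here are assumptions.
import requests

API_BASE = "https://api.runcomfy.example"  # placeholder; use the base URL from the docs
DEPLOYMENT_ID = "YOUR_DEPLOYMENT_ID"       # shown on the deployment details page

resp = requests.post(
    f"{API_BASE}/deployments/{DEPLOYMENT_ID}/inference",  # assumed path
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},   # assumed auth scheme
    json={"inputs": {"prompt": "a watercolor fox"}},      # assumed payload shape
)
resp.raise_for_status()
print(resp.json())  # e.g., a job id to poll via the queue endpoints
```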
Next: Async Queue Endpoints.