A Deployment turns a cloud-saved ComfyUI workflow into a serverless API endpoint (identified by deployment_id). Use deployments when you want a stable endpoint with configurable hardware and autoscaling. RunComfy handles containerization, GPU allocation, and scaling—your application just calls the endpoint.
Note: If you don’t want to use the web UI, you can also create deployments via API. See Deployment Endpoints.

Create a deployment (web UI)

You can create a deployment from:
  • Deployments: Deployments → Deploy workflow as API
  • My Workflows: My Workflows → select a workflow → Deploy as API
  • Explore: Explore → select a workflow → Deploy as API

1) Select a workflow

To deploy as an API, choose either:
  • a custom workflow from My Workflows (built/modified by you and Cloud Saved with dependencies), or
  • a community workflow from Explore (pre-saved and ready to deploy)
For guidance on building your own workflow, see Custom Workflows.

2) Configure hardware

Choose GPU hardware based on your workflow’s VRAM requirements and performance needs. Test the workflow in a ComfyUI session first to estimate usage and avoid runtime errors. Typical VRAM tiers include:
  • 16GB: T4 or A4000
  • 24GB: A10G or A5000
  • 48GB: A6000
  • 48GB Plus: L40S or L40
  • 80GB: A100
  • 80GB Plus: H100
  • 141GB: H200
Exact GPU models and availability can vary. Refer to Pricing for the latest tiers and rates.
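The tier list above amounts to a simple "smallest tier that fits" lookup. The helper below is purely illustrative (the tier names are copied from this page, but `pick_tier` is not part of any RunComfy API); always confirm current tiers and models against the Pricing page.

```python
# Illustrative helper (not a RunComfy API): map an estimated peak VRAM
# requirement, in GB, to the smallest tier from the list above.
TIERS = [
    (16, "16GB (T4 / A4000)"),
    (24, "24GB (A10G / A5000)"),
    (48, "48GB (A6000)"),
    (48, "48GB Plus (L40S / L40)"),
    (80, "80GB (A100)"),
    (80, "80GB Plus (H100)"),
    (141, "141GB (H200)"),
]

def pick_tier(vram_gb_needed: float) -> str:
    """Return the first tier whose VRAM covers the estimate."""
    for vram, name in TIERS:
        if vram >= vram_gb_needed:
            return name
    raise ValueError("No listed tier covers this estimate; contact support.")
```

Remember that this only automates the arithmetic; the page's advice still applies: test the workflow in a ComfyUI session first so the VRAM estimate you feed in is realistic.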

3) Configure autoscaling

Autoscaling controls how many instances (running containerized copies of your workflow) can run in parallel and how quickly the deployment scales up/down. Common knobs:
  • Minimum instances (0–30)
    Baseline warm capacity. Setting this to 1 keeps one instance always warm (avoids cold starts but incurs ongoing cost). Use 0 to allow scale-to-zero (lowest idle cost), but the first request after idle may take a few minutes to start.
  • Maximum instances (1–60)
    Upper bound for parallel instances. Requests above capacity will queue. This caps cost and defines your concurrency ceiling. Need higher limits? Contact us.
  • Queue size (≥ 1)
    How many pending requests are allowed before the deployment tries to add capacity (up to max instances). Lower values prioritize latency; higher values buffer spikes.
  • Keep warm (seconds)
    How long an idle instance stays active after its last job before shutting down. Longer windows reduce cold-start frequency for bursty traffic, but increase idle cost.
Tip: A reasonable starting point for many apps is minimum 0, maximum 1, queue size 1, keep warm 60. Then tune based on real traffic.
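The suggested starting point can be written down as a settings object. The field names below are hypothetical placeholders for the values you enter in the web UI, not a documented API schema; the numeric ranges mirror the limits stated above.

```python
# Hypothetical field names; in practice these values are set in the web UI.
starting_autoscaling = {
    "min_instances": 0,       # scale to zero when idle (lowest idle cost)
    "max_instances": 1,       # one instance; excess requests queue
    "queue_size": 1,          # prioritize latency over buffering spikes
    "keep_warm_seconds": 60,  # idle window before an instance shuts down
}

# Sanity checks mirroring the documented ranges.
assert 0 <= starting_autoscaling["min_instances"] <= 30
assert 1 <= starting_autoscaling["max_instances"] <= 60
assert starting_autoscaling["queue_size"] >= 1
```

From this baseline, raise `max_instances` when requests queue too long under real traffic, and raise `min_instances` or `keep_warm_seconds` when cold starts dominate latency.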

4) Deploy

Review your workflow, hardware, and scaling settings, then click Deploy. After creation, the deployment details page shows your deployment_id—use it in all inference calls. Next: Async Queue Endpoints.
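Once you have a deployment_id, your application calls the endpoint over HTTP. The sketch below only shows how such a request might be assembled; the URL path, auth header, and payload shape are assumptions, not the real schema — see the Async Queue Endpoints reference for the actual request format. Only the role of deployment_id comes from this page.

```python
import json
import urllib.request

DEPLOYMENT_ID = "your-deployment-id"  # from the deployment details page
API_TOKEN = "your-api-token"

def build_inference_request(inputs: dict) -> urllib.request.Request:
    """Construct (but do not send) a queue-submission request.

    The URL path, header names, and body shape below are hypothetical;
    consult the Async Queue Endpoints reference for the real schema.
    """
    url = f"https://api.runcomfy.net/deployments/{DEPLOYMENT_ID}/inference"
    body = json.dumps({"inputs": inputs}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {API_TOKEN}",  # auth scheme assumed
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_inference_request({"prompt": "a red bicycle"})
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) would enqueue the job asynchronously; polling for the result is covered under Async Queue Endpoints.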