Deployments in RunComfy transform your cloud-saved ComfyUI workflows into scalable, serverless API endpoints. Create a deployment to turn your workflow into a callable service, while RunComfy manages containerization, GPU allocation, and autoscaling tailored to your needs. Set up a deployment by clicking Deploy workflow as API on the Deployments page, or by selecting a workflow on the My Workflows or Explore pages and clicking Deploy as API. From there, configure the settings to create your deployment and make it a ready-to-use production API.

Select a Workflow

To deploy as an API, choose either a custom workflow from your My Workflows page (built or modified by you and cloud-saved with all dependencies included) or a community workflow from the Explore page (pre-saved and ready to use). For guidance on creating custom workflows, see the Custom Workflows section.

Configure Hardware

Choose GPU hardware based on your workflow’s VRAM requirements and performance demands. Test in a ComfyUI session beforehand to estimate usage and prevent runtime errors. Options include:
  • 16GB: T4 or A4000
  • 24GB: A10G or A5000
  • 48GB: A6000
  • 48GB Plus: L40S or L40
  • 80GB: A100
  • 80GB Plus: H100
For full pricing on GPU usage and plans, visit Pricing.
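To make the tier selection concrete, the lookup below sketches "pick the smallest tier that covers your estimated VRAM." The tier names and GPUs mirror the list above; the helper itself is purely illustrative and not part of any RunComfy API.

```python
# Hypothetical helper: choose the smallest GPU tier (from the list
# above) that satisfies a workflow's estimated VRAM requirement.
GPU_TIERS = [
    (16, "16GB", ["T4", "A4000"]),
    (24, "24GB", ["A10G", "A5000"]),
    (48, "48GB", ["A6000"]),
    (48, "48GB Plus", ["L40S", "L40"]),
    (80, "80GB", ["A100"]),
    (80, "80GB Plus", ["H100"]),
]

def pick_tier(required_vram_gb: float) -> str:
    """Return the first tier with enough VRAM for the estimate."""
    for vram, name, _gpus in GPU_TIERS:
        if vram >= required_vram_gb:
            return name
    raise ValueError("Requirement exceeds 80GB; contact support for custom hardware")

print(pick_tier(20))  # -> 24GB
```

Estimate `required_vram_gb` by testing the workflow in a ComfyUI session first, as noted above; undersizing the tier is a common cause of runtime errors.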

Autoscaling

Configure autoscaling to automatically scale the number of instances (running containerized versions of your workflow) up or down based on the volume of incoming requests. This helps maintain low latency during traffic spikes while optimizing costs for quieter periods—ideal for apps with variable workloads like user-generated AI content creation.
  • Minimum instances (0–3):
    The baseline number of instances that remain active at all times, even with no requests. Setting it to 1 keeps one instance always warm, avoiding cold-start delays (3–5 minutes to boot), but incurs ongoing costs. For infrequent traffic, use 0 to save costs, though users may experience initial waits.
  • Maximum instances (1–10):
    The upper limit on simultaneous instances during peak demand. For example, 3 means no more than three instances will run at once—excess requests queue instead. This caps costs while handling moderate bursts. For heavier loads (>10 concurrent jobs), contact hi@runcomfy.com to discuss custom limits.
  • Queue size (1+):
    The number of requests allowed to wait in the queue before a new instance spins up. If set to 1, a second request (one pending while another is being processed) triggers an additional instance to reduce wait times. Ideal for latency-sensitive apps, though it may increase costs during short spikes.
  • Keep warm (seconds):
    How long an idle instance stays active after its last job before shutting down. For example, 60 keeps it available for one minute, ready for reuse without a cold start. This adds a small cost if no new requests arrive during that window.
Tip: Start with defaults (minimum 0, maximum 1, queue size 1, keep warm 60) for most setups. Then monitor traffic patterns via request and billing data to fine-tune.
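The interplay of these four settings can be sketched as a simple decision function. This is a minimal illustration of the semantics described above, not RunComfy's actual scheduler; all names and the exact trigger condition are assumptions.

```python
def desired_instances(pending_requests: int, active_instances: int,
                      min_instances: int = 0, max_instances: int = 1,
                      queue_size: int = 1) -> int:
    """Hypothetical sketch of the autoscaling rules described above."""
    target = active_instances
    # Spin up one more instance once the queue of pending requests
    # reaches the configured queue size.
    if pending_requests >= queue_size:
        target = active_instances + 1
    # Never exceed the maximum; excess requests keep queuing.
    target = min(target, max_instances)
    # Never drop below the always-warm baseline.
    target = max(target, min_instances)
    return target

# With the defaults (min 0, max 1, queue size 1): the first pending
# request brings up the single allowed instance.
print(desired_instances(pending_requests=1, active_instances=0))  # -> 1
# A burst beyond the running count adds an instance, capped at max;
# requests above the cap wait in the queue.
print(desired_instances(pending_requests=5, active_instances=1,
                        max_instances=3))  # -> 2
```

The keep-warm timer is the complementary scale-down side: an instance that has been idle longer than the configured number of seconds shuts down, subject to the same `min_instances` floor.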

Deploy

Review your selections for workflow, hardware, and scaling, then click Deploy. You’ll be taken to the deployment details page, where your deployment_id is displayed—use this in all API calls to the provided endpoints.
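As a sketch of how the deployment_id fits into a call, the snippet below builds a request against a placeholder endpoint. The URL path, payload fields, and auth header are illustrative assumptions, not RunComfy's documented schema; use the actual endpoints shown on your deployment details page.

```python
import json
import urllib.request

# Illustrative values only -- substitute the real deployment_id and
# endpoint from your deployment details page.
DEPLOYMENT_ID = "your-deployment-id"
ENDPOINT = f"https://api.runcomfy.example/deployments/{DEPLOYMENT_ID}/inference"

payload = {"inputs": {"prompt": "a watercolor fox"}}  # hypothetical input schema
request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": "Bearer YOUR_API_TOKEN",  # placeholder token
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would submit the job. With a cold
# deployment (minimum instances 0), expect the 3-5 minute boot delay
# described in the Autoscaling section before the first result.
print(request.full_url)
```

Note that the deployment_id is part of every call, so store it alongside your API credentials rather than hard-coding it per request.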