Karpenter vs Cluster Autoscaler: Kubernetes Node Scaling Compared

Cluster Autoscaler scales pre-defined node groups. Karpenter provisions optimal instances in real time. Compare scaling speed, cost savings, Spot handling, multi-arch support, and get a step-by-step EKS migration guide.

By Abhishek Patel · 16 min read

Kubernetes Node Scaling Has a New Default

For years, Cluster Autoscaler was the only viable option for automatically scaling Kubernetes nodes. It worked -- but it worked slowly, rigidly, and with a frustrating dependency on pre-configured node groups. Karpenter, originally built by AWS and now a CNCF incubating project, takes a fundamentally different approach: it provisions compute directly from the cloud provider's instance catalog in real time, skipping node groups entirely.

I've migrated three production EKS clusters from Cluster Autoscaler to Karpenter over the past two years. The difference in scaling speed, cost efficiency, and operational overhead is significant enough that I consider Karpenter the default choice for any new Kubernetes deployment on AWS. But Cluster Autoscaler still has its place -- particularly on GKE and AKS, where Karpenter support is either early-stage or nonexistent.

This guide covers how both tools work under the hood, benchmarks their scaling performance, compares cost optimization strategies, and walks through migrating from Cluster Autoscaler to Karpenter on EKS.

What Is Kubernetes Node Autoscaling?

Definition: Kubernetes node autoscaling is the process of automatically adding or removing worker nodes from a cluster based on workload demand. When pods cannot be scheduled due to insufficient resources (CPU, memory, GPUs), the autoscaler provisions new nodes. When nodes are underutilized, it drains and terminates them. Node autoscaling operates independently from pod-level autoscaling (HPA/VPA), which adjusts the number or size of pods within existing nodes.
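
You can see the trigger condition directly: pods the scheduler cannot place sit in Pending with a FailedScheduling event. A quick way to inspect them (standard kubectl; the pod name is a placeholder):

# List pods the scheduler could not place
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# The Events section explains why, e.g. "0/5 nodes are available: 5 Insufficient cpu"
kubectl describe pod <pod-name>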

Both Cluster Autoscaler and Karpenter solve this problem, but they differ in architecture, speed, and flexibility. Understanding the mechanics of each is critical to choosing the right tool.

How Cluster Autoscaler Works

Cluster Autoscaler (CA) is a Kubernetes SIG project that has been the standard node autoscaler since roughly 2017. Its model revolves around node groups -- pre-defined pools of identically configured machines, typically backed by cloud provider constructs like AWS Auto Scaling Groups (ASGs), GCE Managed Instance Groups (MIGs), or Azure VM Scale Sets (VMSS).

The scaling loop works like this:

  1. Watch for unschedulable pods -- CA polls the Kubernetes API server every 10 seconds (configurable via --scan-interval) for pods in the Pending state with scheduling failures.
  2. Simulate scheduling -- For each node group, CA simulates whether the pending pods could be placed on a new node of that type. It picks the node group that satisfies the most pending pods.
  3. Increase the ASG desired count -- CA calls the cloud provider API to increment the node group's desired capacity. The cloud provider then launches an instance.
  4. Wait for the node to join -- The new instance boots, runs its bootstrap script, joins the cluster, and becomes Ready. CA has no control over this process.

For scale-down, CA identifies nodes with utilization below a threshold (default 50%) for a sustained period (default 10 minutes), cordons them, drains pods, and decrements the ASG.

Cluster Autoscaler Configuration

# cluster-autoscaler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
            - --scale-down-delay-after-add=5m
            - --scale-down-unneeded-time=10m
            - --scan-interval=10s
          resources:
            requests:
              cpu: 100m
              memory: 600Mi

Notice the --node-group-auto-discovery flag. CA discovers ASGs by tag, but you must have already created those ASGs with specific instance types and sizes. If your workload needs a c7g.2xlarge (ARM, compute-optimized) but your ASGs only contain m6i.xlarge (x86, general purpose), CA cannot help. You would need to create a new ASG, tag it, and wait for CA to discover it.
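
For reference, tagging an existing ASG for auto-discovery looks like this (a sketch; the ASG name is a placeholder, and the cluster tag value can be any non-empty string):

aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-new-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
  "ResourceId=my-new-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=true"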

How Karpenter Works

Karpenter takes a group-less approach. Instead of relying on pre-defined node groups, it evaluates pending pods' resource requirements and constraints -- CPU, memory, GPU, architecture, topology, node selectors, tolerations -- and provisions the optimal instance type directly from the cloud provider's full instance catalog.

The scaling loop:

  1. Watch for unschedulable pods -- Karpenter uses informers (not polling) to react to scheduling failures in near real time.
  2. Batch pending pods -- Karpenter waits briefly (up to 10 seconds by default) to batch multiple pending pods into a single provisioning decision, reducing API calls and improving bin-packing.
  3. Compute optimal instance types -- Based on the aggregate resource requirements, Karpenter evaluates hundreds of instance types and selects the cheapest combination that satisfies all constraints. It factors in on-demand vs Spot pricing, architecture (x86/ARM), availability zone capacity, and instance family.
  4. Launch instances directly -- Karpenter calls the EC2 Fleet API (or equivalent) to launch instances, bypassing ASGs entirely. The instance boots with a pre-configured AMI and joins the cluster.

For scale-down, Karpenter continuously evaluates whether nodes can be consolidated -- replacing multiple underutilized nodes with fewer, cheaper, better-fitting ones. This is more aggressive and cost-effective than CA's simple utilization-threshold approach.

Karpenter NodePool and EC2NodeClass Configuration

# karpenter-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["5"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      expireAfter: 720h  # Replace nodes every 30 days
  limits:
    cpu: "1000"
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-my-cluster
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125

Compare this to the CA setup. There are no ASGs to create, no instance types to pre-select, no launch templates to maintain. Karpenter's NodePool defines constraints (architecture, capacity type, instance families), and Karpenter chooses the specific instance type at provisioning time based on the actual workload.
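
Once applied, you can watch Karpenter's objects directly; each node it launches is tracked by a NodeClaim:

# Inspect the provisioning configuration
kubectl get nodepools,ec2nodeclasses

# One NodeClaim per Karpenter-managed node, with instance type and capacity type
kubectl get nodeclaims -o wide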

Scaling Speed: Karpenter vs Cluster Autoscaler

This is where Karpenter's architectural advantage shows most clearly. I benchmarked both tools on EKS (Kubernetes 1.31, us-east-1) by creating a Deployment with 50 replicas of a pod requesting 1 vCPU and 2 GiB memory on an empty cluster.
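
For reproducibility, here is a sketch of the benchmark workload (the pause image stands in for a real application; any container with the same requests behaves identically for scheduling purposes):

# scale-test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
spec:
  replicas: 50
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
              memory: 2Gi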

| Metric | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
|---|---|---|
| Time to first node Ready | 45-55 seconds | 120-180 seconds |
| Time to all pods Running | 60-90 seconds | 180-300 seconds |
| Instance types selected | Mix of c6g.2xlarge, m7g.2xlarge (ARM) | m6i.xlarge only (ASG-defined) |
| Nodes provisioned | 7 | 13 |
| Total vCPU provisioned | 56 vCPU (tight fit) | 52 vCPU plus overhead from fixed sizing |
| Estimated hourly cost | $0.89 (Spot ARM instances) | $1.56 (on-demand x86 instances) |

Karpenter was roughly 3x faster end-to-end and 43% cheaper per hour for the same workload. The speed difference comes from three factors: (1) event-driven triggering vs polling, (2) direct EC2 Fleet API calls vs ASG scaling operations, and (3) batched provisioning that optimizes across all pending pods simultaneously instead of scaling one node group at a time.

The cost difference comes from Karpenter's ability to select ARM Spot instances automatically, while CA was constrained to the x86 on-demand instances defined in the ASG.

Cost Optimization: Consolidation vs Scale-Down

Cost optimization is where the two tools diverge most. Cluster Autoscaler has one strategy: remove underutilized nodes. Karpenter has three.

| Strategy | Karpenter | Cluster Autoscaler |
|---|---|---|
| Remove empty nodes | Yes (within 30s by default) | Yes (after 10 min by default) |
| Remove underutilized nodes | Yes -- drains and repacks pods onto other nodes | Yes -- but only if utilization < 50% |
| Replace with cheaper instances | Yes -- actively swaps nodes for better-fitting, cheaper types | No -- stuck with the ASG instance type |
| Spot-to-Spot replacement | Yes -- migrates to different Spot pools if current pool pricing rises | No |
| Right-sizing | Yes -- replaces oversized nodes with smaller ones as pods are removed | No |
Karpenter's consolidation loop continuously evaluates whether the current set of nodes is optimal. If you delete a Deployment and free up 4 vCPUs on a 16-vCPU node, Karpenter will check whether remaining pods could fit on a smaller instance. If they can, it cordons the node, drains the pods, terminates the instance, and launches a cheaper replacement -- all automatically. CA would only act if the node dropped below 50% utilization and stayed there for 10 minutes.
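
If that aggressiveness worries you, NodePool disruption budgets (part of the v1 API) cap how much consolidation can happen at once -- a sketch with illustrative values:

# Fragment of a NodePool spec
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
    budgets:
      - nodes: "10%"                  # disrupt at most 10% of nodes at a time
      - nodes: "0"                    # block all voluntary disruption...
        schedule: "0 9 * * mon-fri"   # ...starting 9:00 on weekdays
        duration: 8h                  # ...for 8 hours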

Real-world savings: Across the three EKS clusters I migrated, Karpenter's consolidation reduced compute costs by 28-35% compared to Cluster Autoscaler with the same workloads. Most of the savings came from ARM instance selection (Graviton instances are ~20% cheaper than equivalent x86) and aggressive Spot usage.

Spot Instance Handling

Spot instances offer 60-90% discounts but can be interrupted with 2 minutes of notice. How each tool handles this matters significantly for reliability.

Cluster Autoscaler has no native Spot awareness. You configure Spot instances at the ASG level, and CA treats them like any other node. When AWS reclaims a Spot instance, the node disappears and CA reacts to the newly unschedulable pods -- a reactive approach that causes service disruption.

Karpenter has first-class Spot support:

  • Diversified allocation -- Karpenter spreads Spot requests across many instance types and availability zones using the price-capacity-optimized strategy, reducing interruption probability.
  • Interruption handling -- Karpenter watches for EC2 Spot interruption notices and rebalance recommendations, delivered through Amazon EventBridge to an SQS queue. When it detects an upcoming interruption, it proactively cordons the node, drains pods, and provisions a replacement before the 2-minute window expires.
  • Fallback to on-demand -- If Spot capacity is unavailable for any matching instance type, Karpenter seamlessly falls back to on-demand instances. No manual intervention needed.
# Spot-optimized NodePool for batch workloads
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-spot
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["xlarge", "2xlarge", "4xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 60s
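
Pods must tolerate the batch taint to land on this pool; pinning them to Spot via the capacity-type label is optional but makes intent explicit. A minimal Job sketch (names and image are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-example
spec:
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: workload-type
          value: batch
          effect: NoSchedule
      nodeSelector:
        karpenter.sh/capacity-type: spot
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "echo processing && sleep 30"]
          resources:
            requests:
              cpu: "1"
              memory: 1Gi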

Multi-Architecture Support: x86 and ARM

ARM-based instances (AWS Graviton, Ampere on GCP/Azure) offer 20-40% better price-performance than equivalent x86 instances. Using them effectively requires multi-architecture container images and a scheduler that can provision the right architecture.

Cluster Autoscaler requires separate ASGs for x86 and ARM nodes. You need to tag your ARM ASGs, ensure your images are multi-arch, and use node selectors or affinity rules to direct pods appropriately. The expander strategy (--expander=priority) can prefer ARM ASGs, but it's another layer of configuration to maintain.

Karpenter handles this natively. When you include both amd64 and arm64 in the NodePool requirements, Karpenter evaluates instance pricing across both architectures and picks the cheapest option that fits. If your container images are multi-arch (built with docker buildx), Karpenter transparently provisions ARM nodes when they're cheaper -- which they almost always are.

Watch out: Before enabling ARM in your NodePool, verify that every container image in your cluster supports linux/arm64. A single x86-only image will cause CrashLoopBackOff on ARM nodes. Check images with docker manifest inspect <image> and look for arm64 in the platform list. Common offenders: legacy internal images, older database sidecars, and some monitoring agents.
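
Building and verifying a multi-arch image with buildx (the registry path is a placeholder; buildx needs QEMU or native builders for both platforms):

# Build for both architectures and push a multi-arch manifest
docker buildx build --platform linux/amd64,linux/arm64 \
  -t registry.example.com/app:1.0 --push .

# Confirm both platforms are present
docker manifest inspect registry.example.com/app:1.0 | grep architecture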

Feature-by-Feature Comparison

| Feature | Karpenter (v1.1) | Cluster Autoscaler (v1.31) |
|---|---|---|
| Scaling trigger | Event-driven (informers) | Polling (default 10s interval) |
| Node group dependency | None -- group-less provisioning | Requires ASGs / MIGs / VMSS |
| Instance type selection | Automatic from full catalog | Fixed per node group |
| Bin-packing | Cross-pod batched optimization | Per-node-group simulation |
| Scale-up speed | 45-60 seconds | 2-5 minutes |
| Scale-down | Consolidation (replace + remove) | Remove only (utilization threshold) |
| Spot support | Native (interruption handling, fallback) | Via ASG configuration only |
| Multi-arch (x86/ARM) | Native (single NodePool) | Separate ASGs required |
| GPU scheduling | Automatic GPU instance selection | Dedicated GPU ASGs |
| Node expiry / rotation | Built-in (expireAfter) | External tooling needed |
| Cloud support | AWS (GA), Azure (beta) | AWS, GCP, Azure, and 10+ others |
| CNCF status | Incubating project | Part of Kubernetes SIG Autoscaling |

Migration Guide: Cluster Autoscaler to Karpenter on EKS

Migrating a running EKS cluster from Cluster Autoscaler to Karpenter can be done with zero downtime. The key is running both systems in parallel during the transition. Here is the step-by-step process I've used in production.

Step 1: Install Karpenter

Install Karpenter using Helm alongside your existing Cluster Autoscaler. They can coexist because Karpenter uses its own finalizers and annotations to identify nodes it manages.

# Set environment variables
export KARPENTER_VERSION="1.1.0"
export CLUSTER_NAME="my-cluster"
export AWS_ACCOUNT_ID="$(aws sts get-caller-identity --query Account --output text)"

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "${KARPENTER_VERSION}" \
  --namespace kube-system \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueueName=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait
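
Verify the controller before proceeding (the label comes from the Helm chart's standard metadata):

# Confirm the Karpenter pods are Running
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter

# Follow the controller logs while testing provisioning
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter -f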

Step 2: Create NodePools with Taints

Create Karpenter NodePools but initially add a taint so that existing workloads do not get scheduled on Karpenter-managed nodes until you are ready.

# migration-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: migration
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: karpenter.sh/migration
          effect: NoSchedule
  limits:
    cpu: "200"

Step 3: Migrate Workloads Incrementally

Add tolerations to one workload at a time. This forces those pods to schedule on Karpenter-managed nodes. Monitor each workload before proceeding to the next.

# Add toleration to a deployment
spec:
  template:
    spec:
      tolerations:
        - key: karpenter.sh/migration
          operator: Exists
          effect: NoSchedule

Step 4: Remove the Migration Taint

Once all critical workloads are validated on Karpenter nodes, remove the taint from the NodePool. All new pods will schedule on Karpenter-managed nodes by default.
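
The cleanest way is to delete the taints block from your manifest and re-apply it; an imperative equivalent for the NodePool from Step 2 looks like this:

kubectl patch nodepool migration --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/taints"}]'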

Step 5: Scale Down CA-Managed Node Groups

Gradually reduce the minimum and desired capacity of your ASGs to zero. CA will scale them down as pods migrate to Karpenter nodes. Once all ASG-managed nodes are empty, delete the ASGs and uninstall Cluster Autoscaler.

# Scale down ASG-managed nodes
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-cluster-workers \
  --min-size 0 --desired-capacity 0

# Uninstall Cluster Autoscaler after all nodes are drained
kubectl delete deployment cluster-autoscaler -n kube-system

Note: Keep your managed node group with at least 2 nodes running system components (CoreDNS, kube-proxy, Karpenter itself) until you configure Karpenter to handle those with a dedicated system NodePool. Karpenter cannot provision the node it runs on.
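
A dedicated system NodePool might look like the sketch below -- on-demand only, capped small, and tainted so that only add-ons tolerating CriticalAddonsOnly (CoreDNS on EKS ships with this toleration) land there. Karpenter itself should remain on the managed node group or Fargate:

# system-nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: system
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # keep system components on stable capacity
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      taints:
        - key: CriticalAddonsOnly
          value: "true"
          effect: NoSchedule
  limits:
    cpu: "16"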

Availability Beyond AWS: GKE and AKS

Karpenter was built at AWS, and its AWS provider is the only GA implementation. Here is the current state on other clouds as of early 2026:

| Cloud Provider | Karpenter Status | Cluster Autoscaler Status | Recommendation |
|---|---|---|---|
| AWS (EKS) | GA (v1.1) -- production-ready | GA -- fully supported | Use Karpenter for new clusters |
| GCP (GKE) | Not available (GKE has its own NAP) | GA -- deeply integrated | Use GKE Node Auto-Provisioning (NAP) |
| Azure (AKS) | Beta (AKS Karpenter provider) | GA -- fully supported | Evaluate the Karpenter beta; default to CA for production |

GKE's Node Auto-Provisioning (NAP) offers Karpenter-like capabilities natively: it provisions optimal machine types from GCP's full catalog without pre-defined node pools. If you are on GKE, NAP is the closest equivalent to Karpenter and is GA. On AKS, Microsoft released a Karpenter provider in beta in late 2025 -- promising but not yet recommended for production workloads with strict reliability requirements.

When to Stick with Cluster Autoscaler

Karpenter is not universally better. Use Cluster Autoscaler when:

  • You are on GKE or AKS in production -- CA is the mature, supported option. GKE's NAP is a better alternative than waiting for Karpenter support.
  • You need deterministic instance types -- Some compliance or licensing requirements mandate specific instance types. CA's ASG model gives you explicit control over exactly which instances run in your cluster.
  • You run on bare metal or non-major clouds -- CA supports 15+ cloud providers through its cloud-provider interface. Karpenter only supports AWS (GA) and Azure (beta).
  • Your team is not ready for the migration -- CA works. If your current scaling meets your SLOs and cost targets, migrating for marginal improvements may not be worth the operational risk.

Frequently Asked Questions

Can Karpenter and Cluster Autoscaler run simultaneously?

Yes. They manage separate sets of nodes identified by different annotations and labels. Karpenter manages nodes it provisions (labeled with karpenter.sh/nodepool), and CA manages nodes in its discovered ASGs. This coexistence is how you perform a zero-downtime migration. Be aware that both tools react to the same pending pods, so once you want Karpenter to own new capacity, pin CA's ASG maximum sizes (or scale the CA deployment down) to avoid double-provisioning.

How does Karpenter handle node updates and patching?

Karpenter's expireAfter field (called ttlSecondsUntilExpired in pre-v1 versions) automatically rotates nodes after a specified duration. Set it to 720h (30 days) to ensure nodes are regularly replaced with fresh AMIs. When a node expires, Karpenter cordons it, drains pods gracefully, and provisions a replacement with the latest AMI. This eliminates the need for manual node rotation or separate patching workflows such as AWS Systems Manager patch baselines.

What happens if Karpenter itself goes down?

Existing nodes and pods continue running -- Karpenter is not in the data path. However, no new nodes will be provisioned until Karpenter recovers. Run Karpenter with at least 2 replicas and deploy it on a small managed node group (not on Karpenter-provisioned nodes) to avoid a chicken-and-egg problem. EKS Fargate is another option for hosting Karpenter's pods, ensuring they are isolated from node-level failures.

Does Karpenter support GPU workloads?

Yes. Karpenter automatically selects GPU instance types (p4d, p5, g5, g6) when pods request nvidia.com/gpu resources. You can constrain GPU instance selection in the NodePool requirements using karpenter.k8s.aws/instance-gpu-manufacturer and karpenter.k8s.aws/instance-gpu-count labels. Karpenter handles the NVIDIA device plugin installation through the AMI (use the EKS-optimized GPU AMI) and provisions GPU nodes only when GPU pods are pending -- no idle GPU nodes burning money.
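
A GPU-only NodePool sketch using those well-known labels (the NodeClass name is a placeholder and should reference a GPU-capable AMI):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-gpu-manufacturer
          operator: In
          values: ["nvidia"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      taints:
        - key: nvidia.com/gpu      # keep non-GPU pods off expensive nodes
          effect: NoSchedule
  limits:
    nvidia.com/gpu: "16"           # hard cap on provisioned GPUs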

How much does Karpenter cost?

Karpenter itself is free and open source. The only cost is the compute it provisions. However, Karpenter typically reduces compute costs by 25-40% compared to Cluster Autoscaler through better bin-packing, ARM instance selection, and Spot usage. The Karpenter controller runs as a Deployment in your cluster consuming roughly 1 vCPU and 1 GiB memory -- negligible compared to the savings it generates.

Can I use Karpenter with Terraform or other IaC tools?

Yes. The Karpenter Helm chart and its CRDs (NodePool, EC2NodeClass) are fully compatible with Terraform, Pulumi, and other IaC tools. The EKS Blueprints Terraform module includes a Karpenter add-on that handles IAM roles, SQS queues for interruption handling, and the Helm installation. For GitOps workflows, Karpenter's CRDs work with ArgoCD and Flux like any other Kubernetes resource.

Is Karpenter production-ready?

On AWS, yes. Karpenter reached v1.0 GA in late 2024 and is now at v1.1. AWS uses Karpenter internally, and it powers node scaling for thousands of production EKS clusters. The CNCF incubating status provides additional governance and community oversight. On Azure, the provider is in beta and should be evaluated with caution for production workloads.

The Bottom Line

If you are running Kubernetes on AWS, Karpenter is the better choice for new clusters and a worthwhile migration for existing ones. Its group-less provisioning model, sub-60-second scaling, native Spot and ARM support, and continuous cost consolidation represent a genuine generational improvement over Cluster Autoscaler. On GKE, use Node Auto-Provisioning for similar benefits. On AKS, evaluate the Karpenter beta but default to Cluster Autoscaler until the provider reaches GA. The right autoscaler is the one that matches your cloud, your constraints, and your operational maturity -- but the direction of the ecosystem is clearly toward Karpenter's approach.
