How We Architected Karpenter NodePools for a Multi-Workload EKS Cluster
A practical guide to designing NodePool tiers, weights, taints, and disruption policies based on what actually worked for us in production.
Why I Am Writing This
Most Karpenter content online falls into two buckets. One is the AWS documentation, which is accurate but does not help you make architectural decisions. The other is the "Hello World" blog post that shows a single NodePool with one instance category and stops there.
Neither helped us when we had to run a cluster with mixed workloads, where some pods needed memory-heavy nodes for Elasticsearch and JVM apps, some needed compute-heavy nodes for batch processing, some were latency-sensitive web services, and a few were GPU workloads that we did not want anywhere near the rest of the cluster.
So we ended up designing a five-tier NodePool architecture. It has been running in production for a while now and has survived enough incidents that I trust it. This post walks through the design, why each decision was made, where it broke, and what I would tell someone starting fresh.
If you are running Karpenter on a single NodePool with default settings and your cluster is growing, this post is for you.
The Core Problem We Were Solving
When you run a single NodePool, Karpenter does its job well. It picks the cheapest instance that fits your pod requests, launches it, schedules the pod, and consolidates nodes when usage drops. For a single-purpose cluster, that is enough.
The trouble starts when your cluster has workloads with conflicting requirements. A few examples we hit:
- ArgoCD and our monitoring stack would occasionally end up co-located with a memory-hungry data processing job, and a noisy neighbour situation would degrade the control plane tooling.
- Karpenter would happily put a JVM app with a 16 GB heap on a c family instance because it technically fit, but the lack of memory headroom caused GC pauses.
- Spot interruptions on our admin tier (where ArgoCD lived) caused a longer recovery than we wanted because everything had to be re-scheduled from scratch.
- Cost reporting was a mess. We could not say "the data crunchers cost X per month" because they were mixed in with everything else.
The root cause was that we were treating compute as a single pool. In reality we had at least four or five distinct workload classes with different characteristics. So we split them.
The Five NodePool Architecture
Here is the high level structure. I will go into each one in detail below.
| NodePool | Weight | Selection mechanism | Workloads |
|---|---|---|---|
| Admin services | 100 | Taints + tolerations | ArgoCD, monitoring agents, cert-manager, ingress controllers |
| Memory optimized | 75 | NodeSelector | Elasticsearch, Redis, JVM apps with large heaps |
| Compute optimized | 75 | NodeSelector | Batch processing, video transcoders, data crunchers |
| GPU | 75 | Taints + tolerations + NodeSelector | ML inference, model training |
| General purpose | 10 | Default (no selector) | Web APIs, microservices, anything else |
The weights are doing important work here, and I will explain why we picked these specific numbers later. First, let me walk through each tier.
Tier 1: Admin Services Pool
This is where everything that keeps the cluster itself running lives. ArgoCD, Prometheus and the monitoring agents, cert-manager, the cluster autoscaler bits, the ingress controller, anything that, if it goes down, means we cannot easily recover.
We protect this pool with a taint:
taints:
- key: workload-class
value: admin
effect: NoSchedule
Only pods with a matching toleration land here. Application workloads cannot accidentally end up on these nodes, which is the whole point. We learned this the hard way when a misconfigured deployment with high resource requests took down a node that was also running Prometheus, and our alerting went silent for fifteen minutes during an incident. After that, the admin tier got walled off.
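For reference, here is the pod-side half of that contract, the toleration that matches the taint above. A minimal sketch; note that we deliberately do not add a nodeSelector here, because routing to the admin pool is handled by the pool weight, as explained later.

tolerations:
  - key: workload-class
    operator: Equal
    value: admin
    effect: NoSchedule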
The other reason for keeping this separate is disruption tolerance. We tune disruption settings on this pool to be much more conservative than the rest. We do not want Karpenter consolidating an ArgoCD node at 3 AM just to save a few dollars. The cost savings are not worth the risk.
We also use a smaller, more boring set of instance types here. No bleeding edge stuff. Mostly m6i and m6a mid-sized instances. Predictability matters more than cost optimization for this tier.
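Putting the pieces together, here is a sketch of what the admin NodePool looks like in the Karpenter v1 API. The pool name, limit values, and EC2NodeClass name are illustrative rather than our exact production config, but the structure and the instance families are the real thing:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: admin
spec:
  weight: 100
  limits:
    cpu: 64          # small on purpose: admin workloads are stable and predictable
    memory: 256Gi
  disruption:
    consolidationPolicy: WhenEmpty   # only reclaim fully empty nodes
    budgets:
      - nodes: "1"                   # one node at a time
  template:
    spec:
      taints:
        - key: workload-class
          value: admin
          effect: NoSchedule
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6i", "m6a"]     # boring, predictable instances
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]      # no Spot on the admin tier
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default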
Tier 2: Memory Optimized Pool
This is for workloads where memory is the binding constraint. Elasticsearch is the obvious one. Redis with large datasets. Any JVM application where the heap is large enough that running it on a balanced instance means the CPU sits idle while memory pressure builds.
The selection mechanism is a nodeSelector:
nodeSelector:
workload-class: memory-optimized
Karpenter is configured to only launch r family instances for this pool, with a preference for the latest generation. We restrict the instance categories explicitly:
requirements:
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["r"]
- key: karpenter.k8s.aws/instance-generation
operator: Gt
values: ["5"]
One thing worth calling out. We initially tried using both r and x family instances on this pool to give Karpenter flexibility. That was a mistake. The x family is significantly more expensive per GB, and Karpenter would sometimes pick an x instance when an r would have been fine, because the pod requests happened to fit slightly better. We removed x from the pool and made teams justify the cost if they actually needed that ratio.
Tier 3: Compute Optimized Pool
Same idea as memory optimized but for the inverse problem. Workloads that are CPU bound and do not need much memory. Batch jobs, video transcoders, anything doing heavy numerical work.
nodeSelector:
workload-class: compute-optimized
Restricted to c family instances. We are aggressive about using ARM64 here, more on that below.
These workloads tend to be more tolerant of disruption than the admin tier (most are batch and can be retried) so we run them mostly on Spot. The cost savings on a compute-heavy workload running on Spot are substantial.
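The Spot preference is expressed as a capacity-type requirement on the pool. A sketch: with both values allowed, Karpenter favours Spot because it is cheaper, and falls back to On-Demand when Spot capacity is not available.

requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]   # Spot preferred, On-Demand as fallback
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c"]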
Tier 4: GPU Pool
GPU instances are expensive and you do not want anything that does not need a GPU running on them. We use both a taint and a nodeSelector here, because belt and suspenders is the right call when an instance type costs ten times more than your default.
taints:
- key: workload-class
value: gpu
effect: NoSchedule
nodeSelector:
workload-class: gpu
The taint stops random pods from accidentally tolerating their way onto a GPU node. The nodeSelector means even pods with the right toleration have to explicitly opt in. This redundancy has saved us from at least one incident where a deployment had a stale toleration left over from testing.
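On the workload side, a GPU pod has to present all three things: the toleration, the nodeSelector, and a GPU resource request. A minimal sketch, with a hypothetical pod name and placeholder image:

apiVersion: v1
kind: Pod
metadata:
  name: inference-worker            # hypothetical name
spec:
  tolerations:
    - key: workload-class
      operator: Equal
      value: gpu
      effect: NoSchedule
  nodeSelector:
    workload-class: gpu
  containers:
    - name: inference
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1         # needs the NVIDIA device plugin on the node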
Tier 5: General Purpose Pool
This is the catch-all. No taints, no required nodeSelector. If a pod has no opinion about where it goes, it lands here. Web APIs, microservices, internal tools, anything that does not fit the specialized tiers.
We give this pool a low weight (10) deliberately. The other pools win when their selectors match. This pool only wins when nothing else does. That is exactly what we want.
The instance type mix here is broad. Mostly m family but we also allow some c and r if Karpenter finds them cheaper for a given pod shape.
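A sketch of the key fields on the general purpose pool under those rules (values illustrative):

spec:
  weight: 10   # loses to every specialized pool that also qualifies
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["m", "c", "r"]   # broad mix, mostly m in practice
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]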
How Karpenter Actually Picks a Pool
This is where most people get confused, including me when I started. Karpenter does not pick a NodePool first and then pick an instance. It does the opposite.
For a given pending pod, Karpenter goes through this rough sequence:
1. Find all NodePools where the pod could possibly schedule. To qualify, the pod must tolerate all of the pool's taints, the pod's nodeSelector must match the pool's labels, and the pool's allowed instance types must be capable of providing the resources the pod requests.
2. Of the qualifying pools, pick the one with the highest weight.
3. Within that pool, pick the cheapest instance type that fits the pod (and any other pending pods that could batch with it).
4. Launch the node. Schedule the pod.
The order matters. Weight is a tiebreaker only among pools that already qualify. If a pod has nodeSelector: workload-class: memory-optimized, it will never schedule on the admin pool, regardless of how high the admin pool's weight is. The selector filters first, weight breaks ties.
This becomes important when you start thinking about overlap.
The Weight Game
Weights are the most misunderstood part of NodePool design. Let me try to make this clear.
Suppose you have two pools, both of which a pod could schedule on. Pool A has weight 100, pool B has weight 50. The pod will go to pool A. Always. Weight is not a probability or a load balancing thing. It is a strict priority.
So why do we have multiple pools at the same weight (memory optimized, compute optimized, and GPU all at 75)?
Because they do not overlap. A pod cannot simultaneously match the memory optimized selector and the compute optimized selector. Their weights only matter relative to pools they actually compete with. Since they never compete with each other, their weights can be the same.
The admin pool is at 100 because it has a taint, and pods with the right toleration could in principle also be eligible for the general purpose pool. We want them to land on admin, not general. The weight forces this.
The general purpose pool is at 10. Low. The reason is defensive. If someone misconfigures a workload and forgets the nodeSelector, we want it to land on general, not accidentally end up on the more specialized (and often more expensive) pools.
The Common Weight Mistakes I See
A few mistakes I have seen in other teams' configurations.
Setting all pools to the same weight. This works fine until two pools both qualify for a pod, and then Karpenter's tiebreaker is implementation defined. You do not want that. Always have intentional weight ordering.
Using weight as a load balancer. Weight is not a percentage. Setting one pool to 60 and another to 40 does not send 60% of pods to the first. It sends 100% of qualifying pods to the first. If you want load balancing, use multiple NodePools with different selectors and let the workload selectors do the routing.
Forgetting that weight only breaks ties among qualifying pools. Weight is a no-op if only one pool qualifies. People sometimes set weights expecting them to do something, when in reality the selector is already doing all the work.
Why We Chose ARM64 Instance Types
We default to ARM64 (Graviton) on the compute optimized and general purpose pools, and we are gradually migrating memory optimized too.
The reason is simple. Graviton instances are roughly 20 percent cheaper than the equivalent x86 instances, and for most of our workloads the performance is equal or better. The only blockers are workloads with native dependencies that do not have ARM64 builds. That list shrinks every quarter.
Our NodePool requirements look like this for ARM-friendly tiers:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["arm64", "amd64"]
- key: karpenter.k8s.aws/instance-category
operator: In
values: ["c"]
We allow both, but Karpenter will pick ARM64 most of the time because it is cheaper. For the few workloads that need x86, we set kubernetes.io/arch: amd64 in the pod spec and Karpenter respects it.
A practical tip. Before flipping a workload to ARM64, build the container image as a multi-arch image. Otherwise Karpenter will launch an ARM64 node and your x86-only image will fail to start, and you will spend an hour debugging it. Use docker buildx and push both architectures to the same tag. The container runtime on each node then pulls the right architecture from the manifest list automatically.
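The invocation looks like this (image name and tag are placeholders):

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/my-app:1.2.3 \
  --push .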
Guardrails: Limits on CPU and Memory
Every NodePool has a limits section that caps total CPU and memory the pool can scale to.
limits:
cpu: 1000
memory: 4000Gi
This is not optional. Without limits, a runaway workload (or a memory leak, or a misconfigured HPA) can scale a NodePool indefinitely until it hits AWS account limits, and then you have a really bad day.
We size the limits based on a reasonable peak for each tier, plus headroom. The general purpose pool has the largest limits because it is the catch-all. The admin pool has small limits because the workloads on it are stable and predictable.
When a pool hits its limits, Karpenter stops scaling it and pending pods stay pending. That triggers our alerts and we can decide whether to raise the limit or whether something is genuinely wrong. This is the behaviour you want. The alternative (no limits) is silently spending money until someone notices the bill.
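A quick way to investigate when that alert fires; the pod's events will say why Karpenter could not provision capacity:

kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pod-name> -n <namespace>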
Disruption: The Most Dangerous Part of Karpenter
This is where I see the most production incidents. Karpenter's disruption controller is powerful and aggressive by default, and if you do not configure it properly it will absolutely move your pods around in ways you did not expect.
There are three disruption mechanisms to know about.
Empty node consolidation. Karpenter detects nodes that have no non-daemonset pods and removes them. This is almost always safe and you should leave it on.
Underutilized node consolidation. Karpenter detects nodes where the pods could fit on other existing nodes (or on a smaller new node) and replaces them. This is where things get spicy.
Drift. Karpenter detects when a node's configuration no longer matches the NodePool spec (for example, the AMI was updated) and replaces the node. Useful for security updates, but you want to control the timing.
Our default disruption config looks roughly like this:
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 5m
budgets:
- nodes: "10%"
The 5 minute delay matters. Without it, Karpenter will consolidate aggressively the moment a node looks underutilized, and you can end up in a flapping situation where nodes are constantly being added and removed.
The 10 percent budget means at most 10 percent of nodes in the pool can be disrupted at once. For larger pools this gives you parallelism, for smaller pools it acts as a safety net.
For the admin pool we are more conservative:
disruption:
consolidationPolicy: WhenEmpty
budgets:
- nodes: "1"
Only consolidate when fully empty, and only one node at a time. We do not want Karpenter playing Tetris with our control plane tooling.
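One more budget trick worth knowing, if your Karpenter version supports it: a budget can carry a schedule and a duration, which lets you block voluntary disruption entirely during a window you choose. A sketch, with an illustrative cron expression for weekday business hours:

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  budgets:
    - nodes: "0"                  # zero disruptions allowed...
      schedule: "0 9 * * mon-fri" # ...starting 09:00 on weekdays...
      duration: 8h                # ...for eight hours
    - nodes: "10%"                # default outside that window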
How To Tell Karpenter Not To Move A Pod
If you have a workload that genuinely should not be moved (a long-running batch job, a stateful application, anything where eviction is expensive), there are a few mechanisms.
PodDisruptionBudgets. This is the Kubernetes-native way. Karpenter respects PDBs the same way kubectl drain does. If draining a node would violate a PDB, Karpenter will not drain it.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: my-app
spec:
minAvailable: 2
selector:
matchLabels:
app: my-app
The do-not-disrupt annotation. For workloads where you want to opt out entirely:
metadata:
annotations:
karpenter.sh/do-not-disrupt: "true"
Use this sparingly. A pod with this annotation will pin its node forever, which means consolidation cannot happen on that node. If you do this everywhere, you have effectively disabled consolidation, and you are paying for it.
Accurate resource requests. This is the underrated one. A lot of consolidation churn happens because pods request way more than they use. Karpenter sees a node at 30 percent CPU usage but the requests show it at 90 percent, and it cannot consolidate because the requests do not fit elsewhere. Right-sizing requests is the single biggest improvement you can make to your Karpenter behaviour.
The golden rule I tell my team. If you do not want Karpenter to move your pod, tell it why. Use a PDB or an annotation. Do not assume Karpenter will figure it out.
Taints Are Keep-Out Signs, Not Suggestions
I want to spend a moment on taints because they are misunderstood.
A taint on a node means "do not schedule pods here unless they explicitly tolerate this taint." It is a wall, not a hint. There is no "soft taint" that Karpenter will consider but ignore if convenient.
This is exactly the property you want for the admin pool. Even if the general purpose pool is full and the admin pool has free capacity, application pods will not spill over onto admin nodes. They will stay pending, and Karpenter will scale the general pool. That is correct behaviour.
The reverse direction (admin pods cannot land on general nodes) is enforced by the toleration plus weight combination. Admin pods tolerate the admin taint and have weight 100 routing them to admin. They could in principle land on general (which has no taint), but the higher weight pulls them to admin first. As long as admin has capacity or can scale, they go there.
Worked Example: An Elasticsearch Pod
Let me walk through what happens when we deploy a new Elasticsearch data node.
The pod spec includes:
nodeSelector:
workload-class: memory-optimized
resources:
requests:
memory: 32Gi
cpu: 4
Karpenter sees the pending pod and goes through its decision tree.
First, which pools could this land on? The nodeSelector requires workload-class: memory-optimized, which only the memory pool has. The admin pool has a taint the pod does not tolerate. The compute, GPU, and general pools have different labels. So only the memory pool qualifies.
Second, weight does not matter here because only one pool is in the running.
Third, Karpenter picks the cheapest r family instance that can fit a 32 GB / 4 CPU pod with some headroom. Probably an r6g.2xlarge or r7g.2xlarge depending on availability and Spot pricing.
Fourth, it launches the node, the pod schedules, and life is good.
Now suppose another team deploys a Redis cluster with the same workload-class: memory-optimized selector. Karpenter will batch this with the Elasticsearch pod if they happen to land at the same time and might pick a larger instance that fits both. This batching behaviour is one of Karpenter's nicer features and is the reason you generally want to let it manage instance type selection rather than pinning specific types.
Common Questions I Get
How many NodePools is too many?
We run five and it feels right for our workload mix. I have seen teams run ten or more, and it usually becomes a maintenance burden without clear benefit. The right number is the number where each pool serves a workload class that genuinely has different requirements. If two of your pools differ only in instance generation, merge them.
Should I use one big pool or several smaller ones?
Several smaller ones, almost always. The main argument for one big pool is simplicity, but you give up workload isolation, cost attribution, and disruption control. The cost is a few hundred lines of additional YAML, which is not much for what you get back.
What about Spot?
We use Spot heavily on compute optimized and general purpose, mixed Spot and On-Demand on memory optimized, and pure On-Demand on admin. The pattern is: more critical or harder to reschedule workloads get On-Demand, more tolerant workloads get Spot. Karpenter handles the Spot interruption notices and drains nodes gracefully, so the operational overhead is low.
How do you handle stateful workloads?
Carefully. For Elasticsearch and other stateful systems, we run On-Demand and use PDBs aggressively. We also pin to specific availability zones to avoid PV mounting issues during rescheduling. Karpenter handles AZ pinning via topology spread constraints if you set them, but it is worth being explicit.
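One way to express that pinning, as a sketch, is a zone requirement on the stateful pool (or an equivalent nodeSelector on the pod), so replacement nodes come up in the same zone as the volumes. The zone value here is illustrative:

requirements:
  - key: topology.kubernetes.io/zone
    operator: In
    values: ["eu-west-1a"]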
Does Karpenter work with Cluster Autoscaler?
You can run both, but you should not. Pick one. We removed Cluster Autoscaler entirely after Karpenter was working well. They have overlapping responsibilities and running both leads to weird scaling fights.
What is the migration path from a single NodePool?
Add new NodePools alongside the existing one. Migrate workloads gradually by adding selectors. Once a workload is fully on its new pool, remove its eligibility for the old pool. Once nothing is left on the old pool, drain and delete it. This took us a few weeks for the full migration but never required a maintenance window.
Glossary
A few terms I have used that are worth defining clearly.
NodePool. A Karpenter resource that defines a class of nodes, what instance types are allowed, what taints they get, how they consolidate, what their limits are.
Taint. A key/value property set on a node that prevents pods from scheduling there unless they have a matching toleration. Unlike a label, it repels pods rather than selecting them.
Toleration. A pod-level setting that says "I am okay scheduling on a node with this taint."
NodeSelector. A pod-level setting that says "I will only schedule on nodes with these labels."
Weight. A NodePool field that breaks ties between pools that both qualify for a pod. Higher weight wins.
Consolidation. Karpenter's process of removing or replacing nodes to reduce cost when usage drops.
PodDisruptionBudget. A Kubernetes resource that limits how many pods of a workload can be disrupted at once. Karpenter respects these.
Drift. The state where a node's actual configuration no longer matches its NodePool's desired configuration. Karpenter detects drift and replaces drifted nodes.
Closing Thought
The thing I would tell anyone starting on Karpenter is this. Spend time on the architecture before you spend time on the YAML. The default configuration works for a single-purpose cluster, and it will keep working as you grow, until one day it does not. The transition from a single pool to a tiered architecture is much easier to plan in advance than to do under pressure during an incident.
The five-tier model is not the only valid design. Some teams do four, some do seven. What matters is that you have explicitly thought about which workloads belong together, which need isolation, and what the disruption story looks like for each class. Once you have that, the YAML writes itself.
If you are running Karpenter and have hit weird behaviour I did not cover here, I would genuinely like to hear about it. Most of what I learned came from incidents, and there is always more to learn.