[{"content":"It started with a postmortem I never wanted to write.\nA merchant flash sale launched 20 minutes ahead of schedule. Traffic to the payment authorization services doubled in under a minute. Kubernetes HPA did exactly what it was configured to do — it detected the CPU spike and requested scale-out across over 150 checkout-path services simultaneously. Most new pods fit on existing nodes, but dozens of services exhausted their node pool headroom and triggered provisioner requests in a burst. The node provisioner stalled under the queue pressure. New capacity came up four minutes later.\nSix minutes is an eternity at scale.\nBy the time the payment services stabilized, the fraud detection domain — physically sharing node pools with payment processing — had been starved of resources it needed to run inline fraud scoring for a completely unrelated merchant cohort. Two independent SLO breaches. One correlated traffic event. Zero individual components that had misbehaved.\nThat postmortem was the beginning of HiRL-Scale.\nThe Scale Problem That Changes Everything The cluster that motivated this design ran approximately 100,000 pods across roughly 1,550 applications, grouped into five broad domains: Payment Processing \u0026amp; Checkout, Fraud Detection \u0026amp; Risk, Merchant Services \u0026amp; APIs, Data Pipelines \u0026amp; Analytics, and Internal Platform \u0026amp; Tooling. If you\u0026rsquo;ve operated at this scale, you already know the uncomfortable truth: the tools designed for 100 services don\u0026rsquo;t behave the same way at 1,500.\nHPA was designed for a world where services are relatively independent. A service spikes, its CPU goes up, HPA fires, replicas increase. Clean, simple, correct. 
The problem is that independence is an assumption that fails completely once you have shared node pools, shared quotas, and correlated traffic patterns across thousands of services.\nThree failure modes emerge at this scale that simply don\u0026rsquo;t exist at smaller ones.\nIn a fintech environment, these aren\u0026rsquo;t abstract availability concerns. A cascade event during peak checkout means failed payment authorizations — immediate, measurable revenue loss. A priority inversion that starves fraud scoring means either transactions process without fraud checks (regulatory exposure) or auth latency spikes while fraud scoring queues (checkout conversion drops). The business cost of cluster-level coordination failures is denominated in dollars and compliance risk, not just SLO breach counts.\nCascade propagation. A spike in Domain A consumes node pool capacity. Domain B\u0026rsquo;s pods, which share that pool, can\u0026rsquo;t schedule. Domain B\u0026rsquo;s latency climbs. Domain B fires alerts. Domain B\u0026rsquo;s HPA fires. Now you have two domains competing for the same pool simultaneously.\nThe provisioner stampede. Hundreds of HPAs across two or three domains firing in the same 30-second window isn\u0026rsquo;t hundreds of independent scaling decisions. It\u0026rsquo;s one massive, uncoordinated resource request that creates backpressure on the node provisioner and delays scheduling for pending pods across the cluster. The aggregate behavior is worse than any individual decision.\nDomain priority inversion. Without any coordination mechanism, there\u0026rsquo;s nothing stopping a low-priority internal service from consuming node capacity that a P0 user-facing service needs. 
Priority is a concept that exists only in your runbook, not in your autoscaler.\nThe answer to these problems isn\u0026rsquo;t \u0026ldquo;tune HPA more carefully.\u0026rdquo; It\u0026rsquo;s to build a system that understands the cluster as a shared resource and coordinates scaling decisions across all 1,550 applications simultaneously.\nThe Insight: Autoscaling Is a Planning Problem, Not a Control Problem\nTraditional autoscalers are control systems. They observe current state, compare to a setpoint (target CPU%), and act. This is reactive by design. The controller can only respond to signals that are already present in the system — which means, at minimum, you\u0026rsquo;re always one full metric collection cycle behind reality.\nAt 100k pods, one collection cycle behind reality is the difference between a graceful scale-out and a cascade.\nWhat I wanted was a system that behaves more like an experienced SRE team: one that looks at traffic patterns, understands what\u0026rsquo;s coming based on historical rhythms, and starts preparing capacity before the spike arrives. One that knows which domains are P0 and protects them even during resource contention. One that treats the cluster as a whole, not as 1,500 independent services.\nThis is a planning problem. The right tool for planning under uncertainty, with delayed rewards and complex multi-agent interactions, is reinforcement learning.\nSpecifically, it\u0026rsquo;s hierarchical reinforcement learning — and the rest of this series explains exactly how I designed it.\nIntroducing HiRL-Scale\nHiRL-Scale is a two-level hierarchical actor-critic RL system. The design mirrors how experienced operations teams actually handle large-scale incidents.\nThe Commander operates at the cluster level. Every 60 seconds, it looks at the state of all five domains — aggregate CPU and memory utilization, latency, SLO headroom, pending pods — and decides how to allocate cluster resources across domains. 
Critically, it doesn\u0026rsquo;t decide replica counts. It decides budgets: how much of the cluster\u0026rsquo;s capacity each domain is entitled to use.\nThe Soldiers (one per domain) operate at the application level. Every 10 seconds, each Soldier looks at the state of its 150–500 applications and decides which ones to scale and by how much — within the budget the Commander just granted it.\nNeither level can override the other without consequence. The Commander can\u0026rsquo;t set replica counts directly. The Soldiers can\u0026rsquo;t exceed their budget without paying a penalty in their reward function. The hierarchy creates coordination without central planning becoming a bottleneck.\nThe Forecaster: Seeing 30 Minutes Ahead\nWhat separates HiRL-Scale from \u0026ldquo;reactive RL that happens to be faster than HPA\u0026rdquo; is the Temporal Fusion Transformer embedded in the Commander\u0026rsquo;s observation.\nBefore the Commander decides how to allocate resources, it receives traffic forecasts for each domain: where is RPS headed in 5 minutes? 15 minutes? 30 minutes? The forecasts come with confidence intervals — wide intervals mean uncertainty, and the Commander learns to request more pre-emptive headroom when it\u0026rsquo;s uncertain.\nThis is the shift from reactive to proactive. When a promotional campaign goes live, the Commander doesn\u0026rsquo;t wait for CPU to spike. It sees traffic arriving in the forecast 10–15 minutes before it lands and starts shifting budget to the payment processing domain in advance. By the time the spike hits, replicas are already up.\nThe forecaster is a separate model, trained offline on 6 months of historical traffic data and frozen during RL training. This decoupling is deliberate — I\u0026rsquo;ll get into why in Post 4.\nWhat\u0026rsquo;s Coming in This Series\nThe next posts cover why HPA fails structurally at this scale and how the Commander/Soldier decomposition addresses it. 
From there I go deep on the forecaster and the two agent designs — that\u0026rsquo;s where most of the interesting failures are. Then the training curriculum, the platform engineering that makes it safe to operate, and the retrospective.\nOne thing I\u0026rsquo;ll flag now: I don\u0026rsquo;t know yet whether two levels of hierarchy is the right number. There\u0026rsquo;s a reasonable argument for three — a meta-Commander coordinating across clusters, a Commander per cluster, Soldiers per domain. Every time I\u0026rsquo;ve sketched that design, it\u0026rsquo;s felt like the right next step. But I haven\u0026rsquo;t built it, and I\u0026rsquo;d be cautious about anyone who claims hierarchical depth is easy to get right.\nA related caveat: many organizations at this scale run multiple smaller clusters rather than one large one. This design assumes a single large cluster. The multi-cluster case is a different problem with different constraints — I\u0026rsquo;ll discuss it later in the series as an open problem.\nSee you next week.\n","permalink":"https://aashish-sheshadri.github.io/posts/the-cluster-that-learned-to-plan-ahead/","summary":"\u003cp\u003eIt started with a postmortem I never wanted to write.\u003c/p\u003e\n\u003cp\u003eA merchant flash sale launched 20 minutes ahead of schedule. Traffic to the payment authorization services doubled in under a minute. Kubernetes HPA did exactly what it was configured to do — it detected the CPU spike and requested scale-out across over 150 checkout-path services simultaneously. Most new pods fit on existing nodes, but dozens of services exhausted their node pool headroom and triggered provisioner requests in a burst. The node provisioner stalled under the queue pressure. 
New capacity came up four minutes later.\u003c/p\u003e","title":"The Cluster That Learned to Plan Ahead"},{"content":"Three ways that conventional per-service autoscaling breaks down at 100,000 pods, ~1,550 applications, and five shared domains, and why no amount of HPA tuning makes them go away.\nFailure Mode 1: The Provisioner Stampede\nHPA is designed to be autonomous. Each deployment has its own HPA object, its own target utilization, its own scale-out logic. This is great for isolation. Changes to one service\u0026rsquo;s autoscaling config don\u0026rsquo;t affect others. It breaks down under coordination pressure.\nWhen a correlated traffic event hits (a campaign launch, a viral moment, a deployment triggering retry storms), it doesn\u0026rsquo;t hit one service. It hits dozens or hundreds of services simultaneously across a domain. Every HPA fires within the same 15–30 second window. Every one of them requests new pods. Every one of those pods, if it can\u0026rsquo;t fit on existing nodes, triggers a node provisioner request.\nThe node provisioner handles bursts — it batches and consolidates requests. But it doesn\u0026rsquo;t prioritize between them. A batch of provisioner requests from a P3 internal tooling domain is processed with the same urgency as a batch from P0 payment authorization. Under a correlated spike, dozens of services land in Pending simultaneously, the cluster autoscaler groups them and fires provisioning requests, and provisioning latency climbs from seconds to minutes. Meanwhile, the traffic that triggered the scale-out is being served by undersized deployments.\nI\u0026rsquo;ve measured this at scale: a correlated spike across a single domain can push end-to-end scale-out time (node provisioning plus pod startup plus readiness checks) to 3–4 minutes. That\u0026rsquo;s not a tuning problem. That\u0026rsquo;s an architectural problem. 
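The burst shape is easy to see in a toy simulation — every number here is invented for illustration:

```python
import random
from collections import Counter

random.seed(7)  # deterministic toy run

# Toy model: 200 HPAs see the same correlated traffic event and each fires
# independently at some point in the same 30-second window. Nothing
# coordinates or smooths the arrivals.
fire_times = [random.uniform(0, 30) for _ in range(200)]

# Suppose the provisioner consolidates pending-pod requests every 10 seconds.
batch_sizes = Counter(int(t // 10) for t in fire_times)
for batch, n in sorted(batch_sizes.items()):
    print(f"provisioner batch {batch}: {n} near-simultaneous scale-out requests")
```

Every consolidation window the provisioner pulls is packed with correlated requests; no per-HPA setting changes that arrival curve.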
The scaling chain — HPA patches replica counts, the scheduler tries to place pods, unplaceable pods trigger the cluster autoscaler, which calls the cloud provider — has no coordination layer. No step in that chain knows that 200 services just asked for capacity simultaneously, or that the P0 payment service should get its nodes before the P3 batch job.\nWhat makes it worse: The requests are not evenly distributed over time. They arrive in a burst, because all the HPAs fire at roughly the same time, because the traffic event hit all of them at roughly the same time. You get a thundering herd of autoscaling requests hitting a system that was designed to handle a steady trickle.\nIn a payments context, every minute of provisioner queue depth is a minute where payment authorization services are running at capacity without backup. Transaction failures during this window are not retryable for many payment flows. The customer sees a declined transaction and abandons the checkout. The provisioner stampede translates directly to lost revenue.\nFailure Mode 2: Cascade Propagation Through Shared Node Pools\nAt 100,000 pods, you almost certainly have some node pool sharing across services, whether explicit (dedicated pools for specific domains) or implicit (spillover to shared capacity when dedicated pools fill). This sharing is efficient in steady state. It\u0026rsquo;s dangerous during spikes.\nThe cascade works like this: Domain A\u0026rsquo;s spike consumes the shared node pool. Domain B (no traffic event, just a bystander) can\u0026rsquo;t schedule pending pods. Domain B\u0026rsquo;s latency climbs. Domain B\u0026rsquo;s HPA fires. Now two domains are fighting for the same capacity simultaneously. New nodes come online 4–8 minutes later. By then Domain A\u0026rsquo;s spike has usually passed, so the scale-up that triggered the cascade was partially unnecessary.\nDomain B never had a traffic event. It was a bystander. 
But it got degraded anyway because its node capacity was consumed by Domain A\u0026rsquo;s autoscaling response.\nThis is cascade propagation: a scaling event in one part of the cluster causes degradation in a structurally unrelated part. At ~1,550 applications across five domains, the opportunities for cascade are everywhere. And HPA has no model of it whatsoever. Each HPA knows only about its own deployment.\nIn a fintech environment, the cascade from payment processing to fraud detection is particularly dangerous. Fraud scoring runs inline with payment authorization at this company. If fraud scoring pods can\u0026rsquo;t schedule because payment processing consumed the shared pool, the choice is between processing transactions without fraud checks (regulatory exposure) or queueing auth requests until fraud capacity recovers (latency spike, checkout abandonment). Neither option is acceptable.\nWhere cascade risk lives: In a cluster with five domains sharing node pools, the higher your utilization, the more likely any spike bleeds into other domains, and it gets worse fast. At 70% average utilization, most spikes are absorbed by headroom. At 85%, nearly every spike of meaningful size triggers some cascade effect.\nA cluster at this scale typically runs at 65–70% average CPU utilization (actual usage, not request-based; request-based utilization runs higher, since most services over-request). That sounds like it should leave plenty of headroom, until you realize that headroom isn\u0026rsquo;t uniformly distributed across node pools. Cascade events were not edge cases.\nFailure Mode 3: Domain Priority Inversion\nYour P0 services have SLO budgets. Violating them has consequences: revenue impact, contractual obligations, user trust. Your P3 internal tooling does not. The distinction matters enormously during resource contention.\nHPA does not know which of your services is P0 and which is P3. 
It doesn\u0026rsquo;t know that the nightly batch job for internal reporting should yield capacity to the payment authorization flow during a peak traffic event. It will scale both, treat both as equally urgent, and let the cluster sort it out, which means \u0026ldquo;whoever scheduled first wins.\u0026rdquo;\nPriority inversion happens when a low-priority service out-competes a high-priority service for node capacity, not because the low-priority service is more important, but because it happened to request resources first, or its HPA fired slightly earlier, or its pod priority class was configured incorrectly by an engineer two years ago.\nAt ~1,550 applications, maintaining correct relative priority across all HPA configurations is not realistic in practice. Configurations drift. Teams change. Priority classes get copy-pasted without adjustment. The pod priority that was correct for a service at launch may be wrong today.\nIn financial services, priority inversion has a regulatory dimension. If a data pipeline batch job out-competes fraud scoring for node capacity during a compliance-critical window, the consequence isn\u0026rsquo;t just degraded internal tooling — it\u0026rsquo;s a potential gap in transaction monitoring that auditors will ask about. Domain priority in fintech isn\u0026rsquo;t just an operational preference; it\u0026rsquo;s a compliance requirement.\nThe failure signature: During resource contention, your monitoring shows Domain 5 (internal) consuming 18% of cluster capacity while Domain 1 (Payment Processing) has pods stuck in Pending. No individual configuration is wrong. The aggregate behavior is exactly backwards from your intent.\nWhy Tuning HPA Doesn\u0026rsquo;t Fix This\nThe natural response to these failure modes is to reach for the dials: tighten stabilization windows, adjust target utilization, add scale-down delays, tune cooldown periods. 
VPA and resource request rightsizing help with steady-state efficiency (I already had that in place). This is what most platform teams spend months doing.\nIt doesn\u0026rsquo;t work. Here\u0026rsquo;s why:\nThe stampede problem is coordination-shaped, not parameter-shaped. No combination of per-HPA stabilization windows prevents hundreds of HPAs from firing simultaneously when hundreds of services experience a correlated load increase. The root cause is that HPAs make independent decisions. Making them make slower independent decisions helps at the margins; it doesn\u0026rsquo;t solve the coordination problem.\nCascades can\u0026rsquo;t be configured away. HPA doesn\u0026rsquo;t know about shared node pools. Adding pod anti-affinity rules to avoid pool contention is operationally untenable at ~1,550 applications. No single HPA has visibility into what its scaling decisions do to the rest of the cluster.\nPriority is a policy problem, not a threshold problem. Pod priority classes help. Preemption will evict lower-priority pods when a high-priority pod is stuck in Pending. But it\u0026rsquo;s reactive and pod-scoped: it only kicks in after scheduling has already failed, and it has no concept of domains. When Domain 1 is under load and Domain 5 has excess capacity, you want to proactively reallocate, not wait for Domain 1 pods to fail scheduling first.\nThe pattern is consistent: HPA\u0026rsquo;s failure modes at scale are structural, not parametric. You can\u0026rsquo;t tune your way out of an architectural mismatch.\nTo be fair: combining HPA with PriorityClasses, node autoscaler batching, and a priority-weighted rules engine gets you most of the way — likely 60–70% of the target improvement. For many organizations, that\u0026rsquo;s enough. The remaining gap is the difference between four nines and five nines of availability. The structural failures described above — stampedes, cascades, priority inversions — are tail events. They might happen a handful of times per year. 
But at five nines, your entire annual error budget is about five minutes. A single four-minute provisioner stampede during peak checkout consumes most of it. That\u0026rsquo;s the gap where learned, proactive coordination earns its complexity cost.\nWhat the Right System Needs\nThe failures above define the requirements. Working backwards from each one:\nStampedes need a coordination layer that batches and prioritizes scaling requests before they hit the provisioner. Rather than hundreds of simultaneous \u0026ldquo;I need more capacity\u0026rdquo; signals, the system should produce one coordinated \u0026ldquo;cluster needs N nodes distributed as follows\u0026rdquo; signal.\nCascades need cluster-wide resource awareness. The system needs to know, at the moment it makes a scaling decision for Domain A, what impact that decision has on available capacity for Domains B through E.\nPriority inversion needs a first-class representation of domain priority that is continuously enforced, not configured once and forgotten. During contention, the system should actively protect P0 domains, pulling budget from P3 domains if necessary.\nAll three need proactivity — a system that acts on predicted future state rather than observed current state. By the time a spike is visible in CPU metrics, it\u0026rsquo;s already late. A proactive system sees traffic arriving in the forecast and prepares capacity before the spike lands.\nThere\u0026rsquo;s a less obvious requirement: any system with this much coordinated control also needs coordinated safety. Circuit breakers, rollback triggers, human override. A system that can reallocate budget across five domains can also cause cluster-wide damage if it gets it wrong. I\u0026rsquo;ll cover the safety design in a later post.\nThese are the requirements that drove the HiRL-Scale design. 
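The first of those requirements — one coordinated signal instead of hundreds — can be sketched as a toy aggregation step; the domain names, node size, and demand figures are all invented:

```python
import math
from collections import defaultdict

NODE_CORES = 16.0  # assumed node size, illustration only

# (domain, priority, cores requested this tick) — invented numbers
demand = [
    ("payments", 0, 120.0),
    ("fraud", 0, 60.0),
    ("merchant-api", 1, 40.0),
    ("pipelines", 2, 200.0),
    ("internal", 3, 35.0),
]

def coordinated_signal(demand):
    """Collapse per-service bursts into one prioritized
    'cluster needs N nodes, distributed as follows' request."""
    per_domain = defaultdict(float)
    priority = {}
    for domain, prio, cores in demand:
        per_domain[domain] += cores
        priority[domain] = min(prio, priority.get(domain, prio))
    plan = sorted(
        (priority[d], d, math.ceil(c / NODE_CORES)) for d, c in per_domain.items()
    )
    total = sum(n for _, _, n in plan)
    return total, [(d, n) for _, d, n in plan]

total_nodes, ordered = coordinated_signal(demand)
print(total_nodes)  # one aggregate node count
print(ordered)      # provisioning order, P0 domains first
```

The provisioner then sees a single bounded request with an explicit priority ordering, rather than a thundering herd of interchangeable ones.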
The next post explains the architectural pattern that addresses them: hierarchical reinforcement learning with a Commander and Soldiers.\nA caveat worth stating upfront: proactive systems only win when the forecast is good. When the forecast is wrong (predicting a spike that doesn\u0026rsquo;t materialize, or missing one that does), a proactive system can make worse decisions than a reactive one, because it moved budget before seeing the evidence. I\u0026rsquo;ll get into how I handle that when I cover the forecaster. It\u0026rsquo;s not fully solved.\n","permalink":"https://aashish-sheshadri.github.io/posts/why-1500-hpas-is-not-an-autoscaling-strategy/","summary":"\u003cp\u003eThree ways that conventional per-service autoscaling breaks down at 100,000 pods, ~1,550 applications, and five shared domains, and why no amount of HPA tuning makes them go away.\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"failure-mode-1-the-provisioner-stampede\"\u003eFailure Mode 1: The Provisioner Stampede\u003c/h2\u003e\n\u003cp\u003e\u003cimg alt=\"Provisioner Stampede — correlated HPAs flooding a single provisioner queue\" loading=\"lazy\" src=\"/posts/why-1500-hpas-is-not-an-autoscaling-strategy/diagrams/svg/02_provisioner_stampede.svg\"\u003e\u003c/p\u003e\n\u003cp\u003eHPA is designed to be autonomous. Each deployment has its own HPA object, its own target utilization, its own scale-out logic. This is great for isolation. Changes to one service\u0026rsquo;s autoscaling config don\u0026rsquo;t affect others. It breaks down under coordination pressure.\u003c/p\u003e","title":"Why 1,500 HPAs Is Not an Autoscaling Strategy"},{"content":"Commander and Soldiers: Decomposing the Scaling Problem\nPart 3 · Series: Teaching Kubernetes to Think Ahead\nBy Aashish Sheshadri — Platform Architecture\nThe design space for RL-based autoscaling has three obvious options. 
Two of them do not hold up under scrutiny.\nOption 1: One Big Agent\nThe most natural first thought: replace all 1,500 HPAs with a single RL agent that observes the entire cluster and outputs replica counts for every service.\nThe observation space is 1,500 services × (however many features per service). At even a modest 20 features per service, that is a 30,000-dimensional input. The action space is 1,500 replica targets. The policy network needs to learn cross-service dependencies — that scaling down a checkout gateway affects a downstream fraud scorer\u0026rsquo;s latency — but the signal is buried in a sparse observation where 90% of apps are healthy at any given moment.\nThis does not work for three compounding reasons:\nSparsity and sample efficiency. The relevant signals (cross-service interactions, cascade effects) are sparse in the observation and temporally distant from the actions that caused them. Most of the 30,000 dimensions carry no useful signal at any given time. You would need enormous amounts of simulated experience to train a policy that reliably identifies the interactions that matter from the noise that does not.\nCredit assignment. When the reward signal arrives (domain latency went up), which of the 1,500 simultaneous actions is responsible? The credit assignment problem is the fundamental difficulty of RL; a 1,500-dimensional action space with sparse, delayed feedback makes it extremely difficult to solve reliably.\nOperational opacity. When the agent makes a bad decision, how do you debug it? \u0026ldquo;The policy decided to scale the fraud scorer to 47 replicas based on a 30,000-dimensional input vector\u0026rdquo; is not a root cause analysis. 
Platform teams need to understand why the system did what it did.\nA monolithic agent fails on sparsity, credit assignment, and interpretability simultaneously.\nOption 2: 1,500 Independent Agents\nThe opposite end of the spectrum: each service gets its own RL agent, observing only its own metrics, outputting only its own replica count.\nThis immediately reintroduces the coordination problems from the previous post. You have replaced 1,500 HPAs with 1,500 RL agents, but the fundamental issue (that they make independent decisions without awareness of shared resources) is unchanged. You have added training complexity without solving the structural problem.\nIndependent agents also have a sparse reward problem. A single service\u0026rsquo;s SLO breach is a rare event in steady state. An agent that rarely receives meaningful reward signals converges slowly and unpredictably. This is manageable if you have a few agents; it is operationally unacceptable at 1,500.\nOption 3: Hierarchical RL (The One That Works)\nWhen a traffic event hits a cluster with ~1,550 applications (the same services, counted as Kubernetes Deployments), a senior SRE does not immediately start deciding individual replica counts. They make high-level resource allocation decisions first:\n\u0026ldquo;Payment processing is P0, protect it at all costs. Fraud scoring is P0/P1, keep it inline with auth. Data pipelines can shed load temporarily. Internal services yield to everyone.\u0026rdquo;\nOnly after those policy-level decisions are made does the team execute at the service level, and they execute within the constraints those decisions established.\nAbstract policy at the top, concrete execution at the bottom. That is what hierarchical RL does.\nThere are intermediate architectures I considered and rejected. 
CTDE (centralized training, decentralized execution) uses a shared critic during training but independent policies at inference — the problem is that 1,500 independent execution policies still produce uncoordinated scaling actions, just with better-trained weights. Graph neural network policies over the service dependency graph are appealing in theory but require maintaining an accurate, real-time dependency graph at scale, which is its own infrastructure problem. MPC at the cluster level with RL only at the domain level splits the problem differently but puts the optimization burden on a model-predictive controller that needs an accurate dynamics model of cluster-wide resource contention — and that model does not exist.\nThe Commander-Soldier decomposition was the simplest design that addressed all three failure modes — stampedes, cascades, and priority inversion.\nThe Commander-Soldier Architecture\nThe Commander-Soldier decomposition is not novel in RL theory. The idea of a manager issuing high-level directives to workers goes back to feudal RL (Dayan and Hinton 1993), with modern implementations like FeUdal Networks (Vezhnevets et al. 2017) and goal-conditioned hierarchical RL like HIRO (Nachum et al. 2018). The options framework (Sutton, Precup, and Singh 1999) formalizes the notion of temporally extended actions — the Commander\u0026rsquo;s budget directive functions like one, executed by the Soldiers over six inner ticks, but with a fixed duration rather than a learned termination condition.\nI chose not to use these frameworks directly. The budget-as-action abstraction is simpler to implement and debug than learned option termination conditions or goal embeddings in a continuous space. If you have read the feudal RL papers, you will see the family resemblance.\nThe Commander operates at 60-second intervals on five-domain aggregates. It knows nothing about individual services. 
It sees only domain-level CPU and memory utilization, latency, SLO headroom, and traffic forecasts. Its action space is a budget allocation: what fraction of cluster capacity each domain is entitled to use, plus an urgency signal and a headroom request. The budget percentages come from a (D+1)-dimensional softmax, where the extra dimension is a reserve pool the Commander can hold unallocated. The five domain shares plus the reserve sum to exactly 1.0, so the Commander can express \u0026ldquo;I don\u0026rsquo;t know where the next spike is coming from, hold 15% back\u0026rdquo; without being forced to distribute everything. Five domains × a few scalars. Tractable.\nOne limitation: the Commander does not see distributional information within domains. If 10% of a domain\u0026rsquo;s apps are at 95% CPU and 90% are at 20%, the p75 shows a healthy domain. I partially compensate with pod restart rates and SLO headroom (which surface CrashLoopBackOff events and leading SLO indicators at the domain level), but a more principled approach would include variance or quantile features in the domain aggregation.\nThe Soldiers (one per domain) operate at 10-second intervals on their domain\u0026rsquo;s application-level metrics. Each Soldier knows about its 150–500 applications, organized into service-tier groups. Rather than outputting replica decisions for every app independently, it allocates budget across groups and focuses detailed scaling decisions on the most at-risk applications — keeping the action space to roughly 50 dimensions regardless of domain size. The Soldier never considers cross-domain resource contention — that is the Commander\u0026rsquo;s problem, constrained by the budget envelope before the Soldier acts.\nThis handles all three coordination failures:\nStampede coordination: The Commander sets coordinated budget envelopes that constrain how much each Soldier can scale. 
Rather than hundreds of HPAs independently requesting capacity from the node provisioner, aggregate demand is bounded before it reaches the scheduler.\nCascade awareness: The Commander\u0026rsquo;s observation includes all domains simultaneously, plus explicit cross-domain dependency signals — an inter-domain call rate matrix and a per-domain latency exposure score derived from service mesh telemetry. When Payment latency climbs because Fraud scoring (called inline) is slow, the Commander can diagnose that the bottleneck is in Fraud and redirect budget there, rather than wastefully scaling Payment.\nPriority enforcement: Domain priority is encoded at multiple levels: tighter SLO thresholds for P0 domains trigger reward penalties sooner, graduated safety triggers give P0 domains shorter alert windows, and minimum resource floors guarantee baseline capacity. The reward function penalizes the worst-performing domain, not the average — so a single P0 degradation dominates the signal. It is continuously enforced, not configured once and drifted.\nWhy Actor-Critic for Both Levels\nBoth the Commander and the Soldiers use an actor-critic architecture. This is worth explaining because it is not the only option.\nThe actor is the policy: given the current observation, what action should I take? The critic is the value function: given the current state, what is the expected cumulative reward from here?\nThe critic is what enables stable learning in complex environments. Without a value function, the policy gradient estimate is extremely noisy: you are using actual episode returns to estimate expected returns, which has high variance. The critic provides a lower-variance baseline. 
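That variance claim can be checked with a toy calculation; the scores stand in for the ∇ log π terms of the policy gradient, and every number is invented:

```python
# Toy check of the baseline argument: subtracting a critic-style value
# estimate shrinks the variance of policy-gradient samples.
scores = [1.0, -1.0, 1.0, -1.0]    # stand-ins for grad-log-prob of sampled actions
returns = [12.0, 3.0, 9.0, 1.0]    # returns observed after those actions
baseline = sum(returns) / len(returns)   # what a trained critic V(s) would predict

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

raw = [s * r for s, r in zip(scores, returns)]               # no baseline
adv = [s * (r - baseline) for s, r in zip(scores, returns)]  # advantage form
print(variance(raw), variance(adv))  # the advantage form is far lower
```

Both estimators target the same gradient in expectation; the baseline only removes variance, which is exactly the job the critic does here.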
At the timescales I care about (60-second Commander ticks, 10-second Soldier ticks), the environment changes fast enough that high-variance learning is genuinely destabilizing.\nFor the Commander specifically, the critic\u0026rsquo;s value function serves another role: it encodes the long-horizon consequences of budget allocations. A Commander that allocates too much budget to Domain 1 during a quiet period wastes capacity. The effects of that waste are felt 10–20 minutes later when a spike hits Domain 2 and the budget isn\u0026rsquo;t there. The critic learns to trace these delayed consequences back to the allocation decisions that caused them. The Commander\u0026rsquo;s discount factor is γ=0.995 at its 60-second tick, giving an effective planning horizon of roughly 3.3 hours (200 ticks) — long enough to learn to pre-position for traffic ramps, short enough to avoid vanishing gradients over full daily cycles. The Soldiers use a lower discount factor at their faster 10-second tick, keeping their effective horizon shorter and focused on immediate scaling response rather than long-range planning.\nI used PPO (Proximal Policy Optimization) for both levels rather than off-policy alternatives like SAC — SAC\u0026rsquo;s off-policy nature complicates hierarchical coupling because Soldier training data would include transitions under stale Commander policies. Both levels have separate KL penalty coefficients. The Commander\u0026rsquo;s KL coefficient is higher, meaning its policy changes more slowly per update, and that is deliberate: the Commander needs to be a stable \u0026ldquo;policy environment\u0026rdquo; for the Soldiers to adapt to. Constantly shifting budget allocations make it impossible for Soldiers to learn reliable behavior.\nThe Coupling Problem: Making Two Levels Cooperate\nThe most interesting design challenge in hierarchical RL is getting the two levels to cooperate rather than conflict. 
An adversarial dynamic is easy to produce by accident: the Commander sets a budget, the Soldier ignores it because its local reward for scaling up everything is higher, the Commander raises the budget to compensate, the Soldier still ignores it, and you end up with unconstrained scaling driven entirely by Soldier-level rewards.\nThe design uses two mechanisms to prevent this:\nHard constraint: The Soldier\u0026rsquo;s action space is subject to a projection that enforces the Commander\u0026rsquo;s budget as a near-hard cap — Soldiers cannot exceed the Commander\u0026rsquo;s allocation by more than 10%. The projection layer normalizes the replica deltas so their total stays within this bound. This is differentiable — gradients flow through the projection during training.\nCoupling reward: The Soldier receives a bonus in its reward function for aligning its actual resource distribution with the Commander\u0026rsquo;s intent. If the Commander signals high urgency for scale-up and the Soldier uses most of its budget on already-healthy applications, it pays a penalty. If the Soldier responds to the urgency signal and concentrates resources where they\u0026rsquo;re most needed, it gets a bonus.\nThe magnitude of the coupling reward requires careful tuning. Too high and Soldiers blindly follow Commander directives even when local evidence says otherwise; too low and Soldiers ignore the Commander entirely. I will get into the tuning methodology when I cover the reward design.\nThere is a credit assignment gap I have not fully solved. When a domain\u0026rsquo;s SLO breaches, was it the Commander\u0026rsquo;s bad budget, the Soldier\u0026rsquo;s bad allocation within a good budget, or the forecaster being wrong? The system has no mechanism to distinguish these three cases — both Commander and Soldier get penalized, and both adjust. Over many episodes this averages out, but it slows convergence and means early training is noisier than it needs to be. 
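Mechanically, the cap can be sketched as a simple scaling step. This standalone `project_to_budget` helper is hypothetical, a stand-in for the differentiable layer inside the policy graph, but the arithmetic is the same: positive replica deltas are scaled down so their sum never exceeds the budget plus 10% slack.

```python
def project_to_budget(deltas, budget, slack=0.10):
    """Scale positive replica deltas so their total stays within
    (1 + slack) * budget. Scale-downs (negative deltas) pass through.

    Sketch of the projection described in the post; the real layer is
    differentiable, so gradients flow through it during training.
    """
    cap = (1.0 + slack) * budget
    total_up = sum(d for d in deltas if d > 0)
    if total_up <= cap:
        return list(deltas)
    scale = cap / total_up
    return [d * scale if d > 0 else d for d in deltas]

# Three services want +10, +5, -2 replicas against a budget of 10:
projected = project_to_budget([10, 5, -2], budget=10)
# Scale-ups are normalized so their sum lands at the cap of 11
# (budget plus 10% slack); the scale-down is untouched.
```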
HIRO-style off-policy corrections could help here, but I have not implemented them.\nShared Weights Across Soldiers One more non-obvious decision: all five Soldiers share the same network weights.\nFive Soldiers, five domains, five separate reward streams. If you train five independent networks, each one sees at most 20% of the experience — and the reward signal for a single domain is sparse because most applications are healthy most of the time. Convergence is slow, and the policies diverge from each other in ways that are hard to predict.\nShared weights solve the data efficiency problem. All five domains contribute to the same gradient update. The Soldiers are differentiated not by their weights but by their observations — each Soldier receives a domain embedding that encodes the traffic character, priority level, and SLO targets of its specific domain. The same underlying policy learns to behave appropriately in each domain by conditioning on that embedding.\nThis is similar to how multi-task learning works: shared representations, task-specific conditioning. The tradeoff is that you can\u0026rsquo;t have completely different policies per domain. In practice, the domains share enough structural similarity (Kubernetes services with CPU/memory/latency signals and replica-based scaling) that a single policy conditioned on domain embeddings handled the behavioral differences. 
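Concretely, the differentiation-by-observation looks like this. A toy sketch, with domain names and embedding values that are placeholders rather than the real learned vectors:

```python
# One shared policy, five domains: only the input conditioning differs.
# Embedding values below are made-up placeholders, not learned vectors.

DOMAIN_EMBEDDINGS = {
    "payments":  [1.0, 0.0, 0.9],   # [priority, batch-like, slo-tightness]
    "fraud":     [1.0, 0.0, 0.8],
    "merchant":  [0.6, 0.0, 0.5],
    "pipelines": [0.2, 1.0, 0.2],
    "platform":  [0.1, 0.0, 0.1],
}

def soldier_input(domain, observation):
    """Every Soldier calls the same policy network; the domain embedding
    appended here is the only thing that differentiates them."""
    return observation + DOMAIN_EMBEDDINGS[domain]

obs = [0.7, 0.3]  # e.g. normalized CPU utilization and p99 latency
x = soldier_input("payments", obs)
# The shared network consumes x; gradients from all five domains
# update the same weights.
```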
The domains do have different traffic patterns and SLO requirements, but the domain embedding encodes those differences well enough that the shared policy learned appropriate per-domain behavior.\nWhat This Architecture Buys You Against the three failure modes — stampedes, cascades, and priority inversion — the decomposition addresses all of them structurally: the Commander constrains aggregate resource demand rather than letting hundreds of HPAs fire independently; it observes all domains simultaneously rather than each in isolation; and it enforces priority through a reward function rather than a configuration that drifts.\nThe next post goes deep on the forecasting layer (the Temporal Fusion Transformer that gives the Commander its forward-looking character) and why the design choices there matter as much as the RL design.\nOne thing I am still not sure about: whether shared Soldier weights is the right default. It solved the data efficiency problem cleanly, and the domain embedding was enough to differentiate behavior in practice. But I gave up the ability to have genuinely different policies per domain. For the five domains I worked with (similar enough in structure) that tradeoff felt fine. Whether it holds for domains with more heterogeneous scaling dynamics, I do not know.\n","permalink":"https://aashish-sheshadri.github.io/posts/commander-and-soldiers/","summary":"\u003ch1 id=\"commander-and-soldiers-decomposing-the-scaling-problem\"\u003eCommander and Soldiers: Decomposing the Scaling Problem\u003c/h1\u003e\n\u003cp\u003e\u003cem\u003ePart 3 · Series: Teaching Kubernetes to Think Ahead\u003c/em\u003e\n\u003cem\u003eBy Aashish Sheshadri — Platform Architecture\u003c/em\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003eThe design space for RL-based autoscaling has three obvious options. 
Two of them do not hold up under scrutiny.\u003c/p\u003e\n\u003cp\u003e\u003cimg alt=\"Three design options for RL-based autoscaling\" loading=\"lazy\" src=\"/posts/commander-and-soldiers/diagrams/svg/03_three_options.svg\"\u003e\u003c/p\u003e\n\u003chr\u003e\n\u003ch2 id=\"option-1-one-big-agent\"\u003eOption 1: One Big Agent\u003c/h2\u003e\n\u003cp\u003eThe most natural first thought: replace all 1,500 HPAs with a single RL agent that observes the entire cluster and outputs replica counts for every service.\u003c/p\u003e","title":"Commander and Soldiers: Decomposing the Scaling Problem"},{"content":"I\u0026rsquo;m an ML architect at PayPal, 12 years. ~1,500 services across cloud, data, and developer productivity — demand forecasting, autoscaling, anomaly detection, model governance.\nI came up through research: crowdsourcing and planetary robotics in grad school (NSF, NASA NIAC funded), then cryptographic systems with Doug Crockford, then building PayPal\u0026rsquo;s enterprise ML platform and shipping applied ML. 6 peer-reviewed papers, 337 citations. AAAI HCOMP Test of Time Award.\nI write about what actually works (and what doesn\u0026rsquo;t) when you apply ML to real infrastructure at scale. This blog focuses on demand forecasting, reinforcement learning for autoscaling, and the engineering between a model and production.\nFind me on LinkedIn or Google Scholar.\n","permalink":"https://aashish-sheshadri.github.io/about/","summary":"About Aashish Sheshadri","title":"About"}]