Commander and Soldiers: Decomposing the Scaling Problem

Part 3 · Series: Teaching Kubernetes to Think Ahead
By Aashish Sheshadri — Platform Architecture


The design space for RL-based autoscaling has three obvious options. Two of them do not hold up under scrutiny.

Three design options for RL-based autoscaling


Option 1: One Big Agent

The most natural first thought: replace all 1,500 HPAs with a single RL agent that observes the entire cluster and outputs replica counts for every service.

The observation space is 1,500 services × (however many features per service). At even a modest 20 features per service, that is a 30,000-dimensional input. The action space is 1,500 replica targets. The policy network needs to learn cross-service dependencies — that scaling down a checkout gateway affects a downstream fraud scorer’s latency — but the signal is buried in a sparse observation where 90% of apps are healthy at any given moment.

This does not work for three compounding reasons:

Sparsity and sample efficiency. The relevant signals (cross-service interactions, cascade effects) are sparse in the observation and temporally distant from the actions that caused them. Most of the 30,000 dimensions carry no useful signal at any given time. You would need enormous amounts of simulated experience to train a policy that reliably identifies the interactions that matter from the noise that does not.

Credit assignment. When the reward signal arrives (domain latency went up), which of the 1,500 simultaneous actions is responsible? The credit assignment problem is the fundamental difficulty of RL; a 1,500-dimensional action space with sparse, delayed feedback makes it extremely difficult to solve reliably.

Operational opacity. When the agent makes a bad decision, how do you debug it? “The policy decided to scale the fraud scorer to 47 replicas based on a 30,000-dimensional input vector” is not a root cause analysis. Platform teams need to understand why the system did what it did.

A monolithic agent fails on sparsity, credit assignment, and interpretability simultaneously.


Option 2: 1,500 Independent Agents

The opposite end of the spectrum: each service gets its own RL agent, observing only its own metrics, outputting only its own replica count.

This immediately reintroduces the coordination problems from the previous post. You have replaced 1,500 HPAs with 1,500 RL agents, but the fundamental issue (that they make independent decisions without awareness of shared resources) is unchanged. You have added training complexity without solving the structural problem.

Independent agents also have a sparse reward problem. A single service’s SLO breach is a rare event in steady state. An agent that rarely receives meaningful reward signals converges slowly and unpredictably. This is manageable if you have a few agents; it is operationally unacceptable at 1,500.


Option 3: Hierarchical RL (The One That Works)

When a traffic event hits a cluster with ~1,550 applications (the same services, counted as Kubernetes Deployments), a senior SRE does not immediately start deciding individual replica counts. They make high-level resource allocation decisions first:

“Payment processing is P0, protect it at all costs. Fraud scoring is P0/P1, keep it inline with auth. Data pipelines can shed load temporarily. Internal services yield to everyone.”

Only after those policy-level decisions are made does the team execute at the service level, and they execute within the constraints those decisions established.

Abstract policy at the top, concrete execution at the bottom. That is what hierarchical RL does.

There are intermediate architectures I considered and rejected:

  • CTDE (centralized training, decentralized execution) uses a shared critic during training but independent policies at inference. The problem is that 1,500 independent execution policies still produce uncoordinated scaling actions, just with better-trained weights.
  • Graph neural network policies over the service dependency graph are appealing in theory but require maintaining an accurate, real-time dependency graph at scale, which is its own infrastructure problem.
  • MPC at the cluster level with RL only at the domain level splits the problem differently but puts the optimization burden on a model-predictive controller that needs an accurate dynamics model of cluster-wide resource contention — and that model does not exist.

The Commander-Soldier decomposition was the simplest design that addressed all three failure modes — stampedes, cascades, and priority inversion.


The Commander-Soldier Architecture

[Figure: Commander-Soldier architecture]

The Commander-Soldier decomposition is not novel in RL theory. The idea of a manager issuing high-level directives to workers goes back to feudal RL (Dayan and Hinton 1993), with modern implementations like FeUdal Networks (Vezhnevets et al. 2017) and goal-conditioned hierarchical RL like HIRO (Nachum et al. 2018). The options framework (Sutton, Precup, and Singh 1999) formalizes the notion of temporally extended actions — the Commander’s budget directive functions like one, executed by the Soldiers over six inner ticks, but with a fixed duration rather than a learned termination condition.

I chose not to use these frameworks directly. The budget-as-action abstraction is simpler to implement and debug than learned option termination conditions or goal embeddings in a continuous space. If you have read the feudal RL papers, you will see the family resemblance.

The Commander operates at 60-second intervals on five-domain aggregates. It knows nothing about individual services. It sees only domain-level CPU and memory utilization, latency, SLO headroom, and traffic forecasts. Its action space is a budget allocation: what fraction of cluster capacity each domain is entitled to use, plus an urgency signal and a headroom request. The budget percentages come from a (D+1)-dimensional softmax, where the extra dimension is a reserve pool the Commander can hold unallocated. The five domain shares plus the reserve sum to exactly 1.0, so the Commander can express “I don’t know where the next spike is coming from, hold 15% back” without being forced to distribute everything. Five domains × a few scalars. Tractable.
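The (D+1)-dimensional softmax head can be sketched in a few lines (a numpy sketch; the logit values are hypothetical):

```python
import numpy as np

def commander_budget(logits):
    """Softmax over D domains plus one reserve slot; shares sum to 1.0."""
    z = logits - logits.max()            # stabilize the exponent
    e = np.exp(z)
    shares = e / e.sum()
    return shares[:-1], shares[-1]       # (per-domain budgets, reserve pool)

# Hypothetical logits: 5 domains + 1 reserve dimension
logits = np.array([1.2, 0.8, 0.5, 0.1, -0.3, 0.4])
budgets, reserve = commander_budget(logits)
```

Holding probability mass on the reserve slot is how the policy expresses "hold some capacity back" without distorting the relative domain shares.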

One limitation: the Commander does not see distributional information within domains. If 10% of a domain’s apps are at 95% CPU and 90% are at 20%, the p75 shows a healthy domain. I partially compensate with pod restart rates and SLO headroom (which surface CrashLoopBackOff events and leading SLO indicators at the domain level), but a more principled approach would include variance or quantile features in the domain aggregation.
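A quick numpy illustration of the masking effect described above (the 10%/90% split is the example from the text):

```python
import numpy as np

# The bimodal domain from the text: 10% of apps pinned at 95% CPU,
# 90% idle at 20%
cpu = np.array([0.95] * 10 + [0.20] * 90)

p75 = np.percentile(cpu, 75)   # 0.20 -- the domain looks healthy
p95 = np.percentile(cpu, 95)   # 0.95 -- the hot tail shows up
spread = cpu.std()             # a variance-style feature catches it too
```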

The Soldiers (one per domain) operate at 10-second intervals on their domain’s application-level metrics. Each Soldier knows about its 150–500 applications, organized into service-tier groups. Rather than outputting replica decisions for every app independently, it allocates budget across groups and focuses detailed scaling decisions on the most at-risk applications — keeping the action space to roughly 50 dimensions regardless of domain size. The Soldier never considers cross-domain resource contention — that is the Commander’s problem, constrained by the budget envelope before the Soldier acts.
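A sketch of how the fixed ~50-dimensional action space might be assembled. The tier count, focus-set size, and SLO-headroom ranking rule are my assumptions; the post does not specify the exact selection mechanism:

```python
import numpy as np

N_TIERS, N_FOCUS = 5, 45   # assumed split; 5 + 45 = ~50 action dims

def focus_set(slo_headroom, n_focus=N_FOCUS):
    """Pick the n_focus apps with the least SLO headroom for detailed
    per-app decisions; the rest scale with their tier's group budget."""
    return np.argsort(slo_headroom)[:n_focus]

rng = np.random.default_rng(0)
headroom = rng.uniform(0.0, 1.0, size=400)   # a hypothetical 400-app domain
focus = focus_set(headroom)
action_dim = N_TIERS + N_FOCUS               # fixed at 50 for any domain size
```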

This handles all three coordination failures:

  • Stampede coordination: The Commander sets coordinated budget envelopes that constrain how much each Soldier can scale. Rather than hundreds of HPAs independently requesting capacity from the node provisioner, aggregate demand is bounded before it reaches the scheduler.
  • Cascade awareness: The Commander’s observation includes all domains simultaneously, plus explicit cross-domain dependency signals — an inter-domain call rate matrix and a per-domain latency exposure score derived from service mesh telemetry. When Payment latency climbs because Fraud scoring (called inline) is slow, the Commander can diagnose that the bottleneck is in Fraud and redirect budget there, rather than wastefully scaling Payment.
  • Priority enforcement: Domain priority is encoded at multiple levels: tighter SLO thresholds for P0 domains trigger reward penalties sooner, graduated safety triggers give P0 domains shorter alert windows, and minimum resource floors guarantee baseline capacity. The reward function penalizes the worst-performing domain, not the average — so a single P0 degradation dominates the signal. It is continuously enforced, not configured once and drifted.
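The worst-domain reward shaping can be sketched as follows (the headroom values, priority weights, and exact functional form are illustrative assumptions, not the production reward):

```python
def priority_reward(slo_headroom, weights):
    """Penalize the worst priority-weighted domain, not the average,
    so a single P0 degradation dominates the signal."""
    return min(w * h for w, h in zip(weights, slo_headroom))

# Hypothetical values: headroom in [-1, 1], heavier weights for P0 domains
headroom = [0.4, -0.1, 0.6, 0.5, 0.7]   # domain 1 (a P0) is breaching
weights  = [3.0, 3.0, 2.0, 1.0, 1.0]    # P0, P0, P1, P2, P2
r = priority_reward(headroom, weights)  # ~ -0.3: the P0 breach dominates
```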

Why Actor-Critic for Both Levels

Both the Commander and the Soldiers use an actor-critic architecture. This is worth explaining because it is not the only option.

The actor is the policy: given the current observation, what action should I take? The critic is the value function: given the current state, what is the expected cumulative reward from here?

The critic is what enables stable learning in complex environments. Without a value function, the policy gradient estimate is extremely noisy: you are using actual episode returns to estimate expected returns, which has high variance. The critic provides a lower-variance baseline. At the timescales I care about (60-second Commander ticks, 10-second Soldier ticks), the environment changes fast enough that high-variance learning is genuinely destabilizing.
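A small numerical illustration of why the baseline helps: the policy gradient scales log-probability gradients by this signal, so the signal's second moment drives estimator variance, and subtracting a baseline near the mean return shrinks it (simulated returns, not production data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated returns from one state: true value 10, noise sigma 3
returns = 10.0 + 3.0 * rng.standard_normal(10_000)
baseline = returns.mean()              # stand-in for the critic's V(s)
advantage = returns - baseline

# Second moment of the weighting signal, with and without the baseline
m2_raw = float((returns ** 2).mean())       # ~109 (100 + 9)
m2_adv = float((advantage ** 2).mean())     # ~9
```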

For the Commander specifically, the critic’s value function serves another role: it encodes the long-horizon consequences of budget allocations. A Commander that allocates too much budget to Domain 1 during a quiet period wastes capacity. The effects of that waste are felt 10–20 minutes later when a spike hits Domain 2 and the budget isn’t there. The critic learns to trace these delayed consequences back to the allocation decisions that caused them. The Commander’s discount factor is γ=0.995 at its 60-second tick, giving an effective planning horizon of roughly 3.3 hours (200 ticks) — long enough to learn to pre-position for traffic ramps, short enough to avoid vanishing gradients over full daily cycles. The Soldiers use a lower discount factor at their faster 10-second tick, keeping their effective horizon shorter and focused on immediate scaling response rather than long-range planning.
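The horizon arithmetic is the standard 1/(1−γ) rule of thumb (the Soldier's γ is not stated in the post; 0.97 below is a placeholder):

```python
def effective_horizon_hours(gamma, tick_seconds):
    """Rule-of-thumb planning horizon: 1 / (1 - gamma) ticks."""
    return (1.0 / (1.0 - gamma)) * tick_seconds / 3600.0

commander_h = effective_horizon_hours(0.995, 60)   # 200 ticks -> ~3.3 hours
soldier_h = effective_horizon_hours(0.97, 10)      # assumed gamma; ~5.6 minutes
```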

I used PPO (Proximal Policy Optimization) for both levels rather than off-policy alternatives like SAC — SAC’s off-policy nature complicates hierarchical coupling because Soldier training data would include transitions under stale Commander policies. Both levels have separate KL penalty coefficients. The Commander’s KL coefficient is higher, meaning its policy changes more slowly per update, and that is deliberate: the Commander needs to be a stable “policy environment” for the Soldiers to adapt to. Constantly shifting budget allocations make it impossible for Soldiers to learn reliable behavior.
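A minimal sketch of the KL-penalized PPO objective with per-level coefficients (the β values and batch numbers are illustrative, not the production settings):

```python
import numpy as np

def ppo_kl_loss(ratio, advantage, kl, beta):
    """Penalized PPO surrogate: maximize ratio * advantage while a KL
    penalty keeps the new policy close to the behavior policy."""
    return float(-(ratio * advantage).mean() + beta * kl.mean())

# Toy batch of importance ratios, advantages, and per-sample KL estimates
ratio = np.array([1.10, 0.90, 1.05])
adv = np.array([0.5, -0.2, 0.1])
kl = np.array([0.02, 0.01, 0.03])

commander_loss = ppo_kl_loss(ratio, adv, kl, beta=3.0)  # higher beta: slower drift
soldier_loss = ppo_kl_loss(ratio, adv, kl, beta=0.5)
```

Same data, stronger penalty: the Commander's update is pulled harder toward the old policy, which is exactly the "stable policy environment" property described above.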


The Coupling Problem: Making Two Levels Cooperate

[Figure: Coupling mechanism between Commander and Soldiers]

The most interesting design challenge in hierarchical RL is getting the two levels to cooperate rather than conflict. An adversarial dynamic is easy to produce by accident: the Commander sets a budget, the Soldier ignores it because its local reward for scaling up everything is higher, the Commander raises the budget to compensate, the Soldier still ignores it, and you end up with unconstrained scaling driven entirely by Soldier-level rewards.

The design uses two mechanisms to prevent this:

Hard constraint: The Soldier’s action space is subject to a projection that enforces the Commander’s budget as a near-hard cap — Soldiers cannot exceed the Commander’s allocation by more than 10%. The projection layer normalizes the replica deltas so their total stays within this bound. This is differentiable — gradients flow through the projection during training.
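A forward-pass sketch of the projection (numpy, so not literally differentiable here; in training the same math runs inside the policy graph). The per-replica cost vector and the rescale-scale-ups-only rule are my assumptions:

```python
import numpy as np

def project_to_budget(replica_deltas, replica_costs, budget, slack=0.10):
    """Shrink proposed scale-ups so total requested capacity stays within
    (1 + slack) * the Commander's budget; scale-downs pass through."""
    cap = (1.0 + slack) * budget
    ups = np.maximum(replica_deltas, 0.0)
    demand = float(ups @ replica_costs)
    if demand <= cap:
        return replica_deltas
    return ups * (cap / demand) + np.minimum(replica_deltas, 0.0)

deltas = np.array([4.0, 2.0, -1.0])   # proposed replica changes
costs = np.array([1.0, 1.0, 1.0])     # capacity cost per replica (assumed)
capped = project_to_budget(deltas, costs, budget=4.0)
```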

Coupling reward: The Soldier receives a bonus in its reward function for aligning its actual resource distribution with the Commander’s intent. If the Commander signals high urgency for scale-up and the Soldier uses most of its budget on already-healthy applications, it pays a penalty. If the Soldier responds to the urgency signal and concentrates resources where they’re most needed, it gets a bonus.
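The post does not give the coupling reward's functional form; one plausible sketch scores the cosine similarity between the Soldier's actual allocation and the Commander's intended one, scaled by urgency:

```python
import numpy as np

def coupling_bonus(soldier_alloc, commander_intent, urgency, weight=0.2):
    """Bonus for aligning the Soldier's actual budget distribution with
    the Commander's intended one, scaled by the urgency signal."""
    sim = float(soldier_alloc @ commander_intent) / (
        np.linalg.norm(soldier_alloc) * np.linalg.norm(commander_intent))
    return weight * urgency * (2.0 * sim - 1.0)   # sign flips when misaligned

intent = np.array([0.6, 0.3, 0.1])   # hypothetical Commander intent
aligned = coupling_bonus(np.array([0.6, 0.3, 0.1]), intent, urgency=1.0)
misaligned = coupling_bonus(np.array([0.1, 0.1, 0.8]), intent, urgency=1.0)
```

Under this form, an allocation matching the intent earns the full bonus and one that funnels budget elsewhere pays a penalty, with urgency amplifying both.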

The magnitude of the coupling reward requires careful tuning. Too high and Soldiers blindly follow Commander directives even when local evidence says otherwise; too low and Soldiers ignore the Commander entirely. I will get into the tuning methodology when I cover the reward design.

There is a credit assignment gap I have not fully solved. When a domain’s SLO breaches, was it the Commander’s bad budget, the Soldier’s bad allocation within a good budget, or the forecaster being wrong? The system has no mechanism to distinguish these three cases — both Commander and Soldier get penalized, and both adjust. Over many episodes this averages out, but it slows convergence and means early training is noisier than it needs to be. HIRO-style off-policy corrections could help here, but I have not implemented them.


Shared Weights Across Soldiers

One more non-obvious decision: all five Soldiers share the same network weights.

Five Soldiers, five domains, five separate reward streams. If you train five independent networks, each one sees at most 20% of the experience — and the reward signal for a single domain is sparse because most applications are healthy most of the time. Convergence is slow, and the policies diverge from each other in ways that are hard to predict.

Shared weights solve the data efficiency problem. All five domains contribute to the same gradient update. The Soldiers are differentiated not by their weights but by their observations — each Soldier receives a domain embedding that encodes the traffic character, priority level, and SLO targets of its specific domain. The same underlying policy learns to behave appropriately in each domain by conditioning on that embedding.
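A toy sketch of the conditioning mechanism (the shapes and the two-layer policy are placeholders; the real networks are larger):

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, EMB_DIM, HIDDEN, ACT_DIM = 64, 8, 32, 50

# One shared weight set serves all five Soldiers
W1 = rng.standard_normal((OBS_DIM + EMB_DIM, HIDDEN)) * 0.1
W2 = rng.standard_normal((HIDDEN, ACT_DIM)) * 0.1

# Per-domain embeddings encode traffic character, priority, SLO targets
domain_embeddings = rng.standard_normal((5, EMB_DIM))

def soldier_policy(obs, domain_id):
    """Same weights for every domain; behavior differs only through the
    domain embedding concatenated to the observation."""
    x = np.concatenate([obs, domain_embeddings[domain_id]])
    h = np.tanh(x @ W1)
    return h @ W2

obs = rng.standard_normal(OBS_DIM)
a_payment = soldier_policy(obs, domain_id=0)
a_batch = soldier_policy(obs, domain_id=3)
```

Every domain's transitions update W1 and W2, so each Soldier effectively trains on five times the experience it would see alone.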

This is similar to how multi-task learning works: shared representations, task-specific conditioning. The tradeoff is that you can't have completely different policies per domain. In practice, the domains share enough structural similarity (Kubernetes services with CPU/memory/latency signals and replica-based scaling) that a single policy conditioned on domain embeddings handled the behavioral differences; the embedding captured each domain's distinct traffic patterns and SLO requirements well enough.


What This Architecture Buys You

Against the three failure modes — stampedes, cascades, and priority inversion — the decomposition addresses all of them structurally: the Commander constrains aggregate resource demand rather than letting hundreds of HPAs fire independently; it observes all domains simultaneously rather than each in isolation; and it enforces priority through a reward function rather than a configuration that drifts.

The next post goes deep on the forecasting layer (the Temporal Fusion Transformer that gives the Commander its forward-looking character) and why the design choices there matter as much as the RL design.

One thing I am still not sure about: whether shared Soldier weights is the right default. It solved the data efficiency problem cleanly, and the domain embedding was enough to differentiate behavior in practice. But I gave up the ability to have genuinely different policies per domain. For the five domains I worked with (similar enough in structure) that tradeoff felt fine. Whether it holds for domains with more heterogeneous scaling dynamics, I do not know.