Reinforcement-Learning

It started with a postmortem I never wanted to write. A merchant flash sale launched 20 minutes ahead of schedule. Traffic to the payment authorization services doubled in under a minute. Kubernetes HPA did exactly what it was configured to do — it detected the CPU spike and requested scale-out across over 150 checkout-path services simultaneously. Most new pods fit on existing nodes, but dozens of services exhausted their node pool headroom and triggered provisioner requests in a burst. The node provisioner stalled under the queue pressure. New capacity came up four minutes later. ...

Reinforcement-Learning

The Cluster That Learned to Plan Ahead

Commander and Soldiers: Decomposing the Scaling Problem