Kubernetes' Horizontal Pod Autoscaler (HPA) works fine until your services start fighting each other for capacity. This series is about what I designed to replace it: a hierarchical reinforcement learning (RL) autoscaler for a cluster running ~100k pods across ~1,500 services.

It starts with the postmortem that kicked the whole thing off, goes deep on the architecture and the parts that were hard to get right, and ends with an honest look at what I’d do differently.