How to manage on-call as an engineering manager?

2 points by frugal10 3 days ago

As a relatively new engineering manager, I oversee a team handling a moderate volume of on-call issues (typically 4-5 per week). In addition to managing production incidents, our on-call responsibilities extend to monitoring application and infrastructure alerts.

The challenge I’m currently facing is that our on-call engineers don't have sufficient time to focus on system improvements, particularly improving the operational experience (Opex). The on-call engineers are often pulled into working on production features or long-term fixes from previous issues, leaving little bandwidth for proactive system improvements.

I am looking for a framework that will allow me to:

- Clearly define on-call priorities, balancing immediate production needs with Opex improvements.
- Manage long-term fixes related to past on-call issues without overwhelming current on-call engineers.
- Create a structured approach that ensures ongoing focus on improving operational experience over time.

tthflssy 20 hours ago

Without knowing your context, it is hard to give advice that is ready to apply. As a manager, you will need to collect and produce data about what is really happening and what the root cause is.

First, clarify your team's charter: what should your team own? Do you have to do everything you are doing today? Can you say no to production feature development for some time? Who do you need to convince: your team, your manager or the whole company?

Figure out how to measure / assign value to opex improvements, e.g. you will have only 1-2 on-call issues per week instead of 4-5, and that is a saving in engineering time, measurable in reliability (SLA/SLO as mentioned in another comment). Then you will understand how much time it is worth spending on those fixes and which opex ideas are worth pursuing.
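To make the arithmetic above concrete, here is a rough sketch of how the saving could be estimated. The hours-per-incident figure and the cost of the fix are made-up placeholders, not numbers from this thread; plug in whatever your own incident data shows.

```python
def weekly_hours_saved(incidents_before: float,
                       incidents_after: float,
                       hours_per_incident: float) -> float:
    """Engineering hours freed up per week if the incident rate drops."""
    return (incidents_before - incidents_after) * hours_per_incident


def payback_weeks(fix_cost_hours: float, hours_saved_per_week: float) -> float:
    """Weeks until the time spent on an opex fix is recovered."""
    if hours_saved_per_week <= 0:
        return float("inf")  # the fix never pays for itself in time alone
    return fix_cost_hours / hours_saved_per_week


# Assumed inputs: 4.5 -> 1.5 incidents per week, 3 hours per incident,
# and an opex fix that costs 36 engineering hours to build.
saved = weekly_hours_saved(4.5, 1.5, 3.0)
print(saved)                     # hours saved per week
print(payback_weeks(36, saved))  # weeks until the fix pays back
```

This only counts engineering time; reliability gains against an SLO would have to be valued separately.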

Look at the efficiency of your team: are they making the right decisions and taking on the right initiatives / tickets?

Argue for headcount, and you will have more bandwidth after some time. Or split 2 people off to work only on opex improvements, which gives these initiatives priority administratively (if the rest of the team can handle on-call).

matt_s 2 days ago

Think of on-call like medical triage. On-call should triage outage (partial/full) level scenarios and respond to alerts, take immediate actions to remedy the situation (restart services, scale up, etc.) and then create follow-on tickets to address root causes that go into the pool of work the entire team works. Like an ER team stabilizing a patient and identifying next steps or sending the patient off to a different team to take time in solving their longer term issue.

The team needs to collectively handle project work _and_ the opex work coming from on-call. On-call should be a rotation through the team. Runbooks describing how to deal with each scenario should be created and iterated on to keep them up to date.

Project work and opex work are related. If you have a separate team dealing with on-call from project work, then there isn't a sense of ownership of the product, since it's like throwing things over a wall to another team to deal with cleaning up a mess.

shoo 20 hours ago

if this is just a workload vs capacity thing -- where the workload exceeds capacity, is there a way to add some back-pressure to reduce the frequency of on-call issues that your team is faced with?

are you / your team empowered to push back & decline being responsible for certain services that haven't cleared some minimum bar of stability? e.g. "if you want to put it into prod right away, we won't block you deploying it, but you'll be carrying the pager for it"

__s 3 days ago

Without knowing the scale of the company you're at, it's hard to give advice.

At Microsoft I headed Incident Count Reduction on my team, where opex could be top priority & the rotating on-call would have a common thread between shifts through me (i.e., I would know which issues were related or not, what fixes were in the pipe, etc.)

I'm guessing the above isn't an option for you, but you can try to drive an understanding that while someone is on call there is no expectation for them to work on anything else. That means subtracting on-call head count during project planning.
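As a rough illustration of subtracting on-call head count during planning, the sketch below computes project capacity in engineer-weeks. The team size of 7 comes from the reply below; one person on call at a time and a 6-week planning window are assumptions for the example.

```python
def project_capacity(team_size: int, on_call_at_once: int, weeks: int) -> int:
    """Engineer-weeks available for project work after removing on-call."""
    available = team_size - on_call_at_once
    if available < 0:
        raise ValueError("more people on call than on the team")
    return available * weeks


# Assumed: 7 engineers, 1 on call at any time, 6-week planning window.
print(project_capacity(7, 1, 6))  # plan against this number
print(project_capacity(7, 0, 6))  # what you'd get by ignoring on-call
```

Planning against the first number instead of the second is what keeps the on-call engineer from being quietly double-booked.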

  • frugal10 2 days ago

    The team size is 7 people. The organization is medium in size, with around 3k employees. The business unit that I work in is at a relatively early 0-to-1 stage, so there is some amount of chaos and ad hoc requirements coming in every now and then.

uaas a day ago

Have you looked into SLO/SLA/SLIs?

brudgers 3 days ago

The best way to manage on-call is to not have on-call. On-call means the organization is understaffed. Hiring for new positions to handle off-hours will solve the problem. Good luck.

  • tkiolp4 a day ago

    This. Even burger joints have shifts so that they are operational 24/7.

  • akulbe a day ago

    I came here to say this but for a different reason.

    Have a mature enough development process and pipeline that production deployments are repeatable and predictable at any time.

    Bake testing into the procedure.