Alert-Driven Monitoring

(simpleobservability.com)

24 points | by khazit 1 hour ago

3 comments

  • Yokohiii 1 minute ago
    In my opinion the best method to reduce alerts is to work hard to get rid of the underlying problems or turn them into non-problems. If you do a good job, most errors are 3rd-party driven, which can indeed be hard to solve given company politics. But at that point you can always tell your boss how it can be solved and that you won't go on pager duty for stuff that is out of your control.
  • stingraycharles 1 hour ago
    Good metrics and alerting systems are designed from the top down, not from the bottom up.

    Lots of metrics are typically available, but almost all of them are noise.

    Start with the business: what is important to the business? What kind of failures are existential threats?

    Then work your way down and design your metrics and alerts, instead of just throwing stuff at the wall.

    I’ve had to push back so many times with teams whose manager at one point said “we need better monitoring / alerting” and they interpreted that to mean more metrics / alerts.

    This is rarely the case.

    I personally am really fond of just using a few alerts. The important thing is to know that something went wrong. Not necessarily where / why / how something went wrong.

    And yes, inertia is real, and false / valueless alerts need to be killed immediately, without remorse. They are SRE’s cancer.

    • dandellion 36 minutes ago
      I agree that alerts should just be the vital ones. But in terms of monitoring and metrics, more is generally better. I joined a company where something broke and the only way to figure out what was wrong was to ssh in and hop through several services. It was a massive waste of time for something that would have been trivial to narrow down with just a basic otel setup.
    • b112 57 minutes ago
      If you receive too many emails, alerts, warnings, and so on, you are only training yourself and the team to ignore them.

      As you say, few is better. And a well-chosen few.

    • alansaber 49 minutes ago
      Very few alerts, implemented around core business logic, incorporating as many edge cases as possible. This is the way.
  • analogpixel 41 minutes ago
    > Alerts should be actionable. If no action can or should be taken, then the alert is not needed.

    Also, the best alerts come from looking at actual failures you had and not trying to make up "good alerts" from thin air. After you have an outage, figure out what alerts would have caught it, and implement those.
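
To make the "alerts should be actionable" point concrete, here is a rough sketch of an alert audit. It is not from the article or from any commenter: the `AlertRule` fields, the `min_action_ratio` threshold, and the example rule names are all hypothetical, and real alerting systems store this information differently. The idea is only that a rule with no defined action, or one that fires often but rarely leads to anyone doing anything, is a candidate for deletion.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AlertRule:
    # All fields are illustrative assumptions, not a real tool's schema.
    name: str
    runbook: Optional[str]  # the action a responder is expected to take
    fired: int = 0          # times the alert fired in the review window
    acted_on: int = 0       # times a responder actually took action


def is_actionable(rule: AlertRule, min_action_ratio: float = 0.5) -> bool:
    """Return True if the rule has a defined action and is usually acted on."""
    if not rule.runbook:
        return False  # no action can be taken, so the alert is not needed
    if rule.fired == 0:
        return True   # no evidence against it yet; keep it for now
    return rule.acted_on / rule.fired >= min_action_ratio


def audit(rules: List[AlertRule]) -> Tuple[List[AlertRule], List[AlertRule]]:
    """Split rules into (keep, drop) based on actionability."""
    keep = [r for r in rules if is_actionable(r)]
    drop = [r for r in rules if not is_actionable(r)]
    return keep, drop


if __name__ == "__main__":
    rules = [
        AlertRule("checkout_error_rate_high", "roll back the last deploy", 4, 4),
        AlertRule("cpu_above_80", None, 120, 0),
        AlertRule("disk_latency_warn", "check the disk", 50, 2),
    ]
    keep, drop = audit(rules)
    print("keep:", [r.name for r in keep])
    print("drop:", [r.name for r in drop])
```

The 0.5 threshold is arbitrary; the useful part is reviewing the "drop" list after each outage or on-call rotation rather than letting ignored alerts accumulate.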