MDD Monitoring Driven Development

From CitconWiki

From the Rubber Chicken to MDD

@jtf's "presentation"

  1. James Shore's Rubber Chicken
    • a physical token you had to hold before you could commit (push) to main (it was svn back then); you ran the build/tests before the commit
    • the build had to run on a separate physical machine (solving the 'It works on my machine' problem)
  2. CI
    • can run more stuff now (fast tests, slow tests) - but there is a separate build for deploy
  3. pipelines with artifact passing
  4. promoting to test/prod
  5. CD - blue/green deploy - rolling back based on KPIs: CI + monitoring now controls production
    • if any step fails, the change is automatically reverted
    • if it made it to prod, but business metrics are down
      • not reverting code
      • take it out of the production cluster to investigate
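
A minimal sketch of the decision described in the last step, assuming a single numeric KPI and a 10% tolerance. The names `RolloutState`, `decide`, and `max_drop` are hypothetical and not from the session; a real pipeline would likely watch several metrics over a time window.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROMOTE = "promote"
    REVERT = "revert"          # a pipeline step failed: undo the change
    QUARANTINE = "quarantine"  # pull from the prod cluster, keep it for investigation


@dataclass
class RolloutState:
    pipeline_steps_passed: bool  # build, tests, deploy to test, ...
    reached_prod: bool
    kpi_baseline: float          # hypothetical KPI, e.g. orders per minute before the change
    kpi_current: float           # the same KPI measured after the change


def decide(state: RolloutState, max_drop: float = 0.10) -> Action:
    """Pick the next action for a change moving through the pipeline."""
    if not state.pipeline_steps_passed:
        return Action.REVERT
    if state.reached_prod and state.kpi_baseline > 0:
        drop = (state.kpi_baseline - state.kpi_current) / state.kpi_baseline
        if drop > max_drop:
            # Business metric is down: do not revert the code,
            # take the new version out of rotation instead.
            return Action.QUARANTINE
    return Action.PROMOTE
```

Keeping the quarantined version around (instead of reverting) is what lets the team investigate the exact artifact that caused the drop.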

State of the monitoring (first)

  • metrics used in monitoring are not specific (high level business metric down, must have been this change)
  • just like adding tests after writing the code is hard, so is adding monitoring/metrics
  • who tried monitoring first?
    • zsoldosp - checklist item in the issue template, but there were too many issues it didn't apply to, so it kinda got ignored on that project after a while
    • PJ/intent media
      • monitoring can stop deploy/rollout
      • stopped doing acceptance tests in favor of monitoring
    • aparker / TIM - failure analysis: we built it, and now that we know how it works, let's figure out
      • how could it fail
      • what impact it would have
      • how would we know (from customers?)
      • is it worth adding it? (metric, alert)

alerting

  • how many alerts should we create
    • high level? e.g.: number of failed API requests?
    • more specific - e.g.: after debugging, we know it failed because the middleware failed. Should we monitor the middleware?
  • metrics vs. monitoring
    • monitoring triggers someone to look at it
    • metrics - kinda like classic ops - collect data, don't attach alerts to it, just eyeball "looks to be an unusual shape, let's investigate"
  • who should we call (e.g.: if only high level metrics, who should the alerts wake up?)
  • (pagerduty.com)

"Failure Friday" practice

  • during work hours!
  • we think this should be redundant, so let's shut this off and see the team recover
  • important: do it when you expect the exercise to be successful


Feature validation / AB testing

not the same as monitoring


Alert thresholds

  • it's not always binary (on/off)
  • normal is not the same as yesterday/last week / last year
    • seasonality - e.g.: black friday, but can be different for each industry. And you kinda know it "Mondays are usually about this many pageloads"
    • event driven - e.g.: if you publish tips, it depends on what happens in the world
  • factor in
    • what can we measure
    • what should we alert on (i.e.: wake people up). Some things can wait till the next business day - use different channels
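
One way to sketch "normal is not the same as yesterday": compare the current value with the same weekday/hour slot from previous weeks, then pick a channel by urgency. The function names, the 3-sigma tolerance, and the channel names are assumptions for illustration, not something agreed in the session.

```python
from statistics import mean, stdev


def is_unusual(current: float, same_slot_history: list[float], tolerance: float = 3.0) -> bool:
    """Compare against the same weekday/hour in earlier weeks, not against yesterday."""
    if len(same_slot_history) < 2:
        return False  # not enough history to say what "normal" is
    baseline = mean(same_slot_history)
    spread = stdev(same_slot_history)
    # max() guards against a zero spread when every historic value is identical
    return abs(current - baseline) > tolerance * max(spread, 1e-9)


def route(unusual: bool, urgent: bool) -> str:
    """Only urgent anomalies wake people up; the rest wait for business hours."""
    if not unusual:
        return "none"
    return "page" if urgent else "ticket"
```

Known events (Black Friday, a published tip going viral) would still need a manual override or an event calendar on top of this; the sketch only covers the weekly pattern.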

Improving Alerts

  • make them actionable
    • link to wiki of runbook how to fix
    • write it for your future self who gets alerted at 2am at a party, not for your present self who knows the context of the feature you just implemented
  • metrics you don't use are inventory, thus not useful

(question: any logging frameworks that would only flush logs on exceptions? but then on DEBUG level?)
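
As one possible answer to the question above: Python's standard library has `logging.handlers.MemoryHandler`, which buffers records and forwards them to a target handler once a record at `flushLevel` or above arrives. The sketch below assumes that is the behaviour being asked for; the logger name and the `make_logger`/`demo` helpers are made up. Note that the handler also flushes when the buffer reaches `capacity`, so it is not a strict "only on exceptions" filter.

```python
import io
import logging
import logging.handlers


def make_logger(stream, name: str = "mdd_demo") -> logging.Logger:
    """Logger that holds DEBUG/INFO records in memory until an ERROR shows up."""
    logger = logging.getLogger(name)
    logger.handlers.clear()          # keep repeated calls from stacking handlers
    logger.setLevel(logging.DEBUG)
    logger.propagate = False

    target = logging.StreamHandler(stream)
    target.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    buffered = logging.handlers.MemoryHandler(
        capacity=1000,               # also flushes when this many records are buffered
        flushLevel=logging.ERROR,    # an ERROR (or worse) flushes everything buffered so far
        target=target,
        flushOnClose=False,          # drop leftover debug records if nothing went wrong
    )
    logger.addHandler(buffered)
    return logger


def demo() -> tuple[str, str]:
    """Return the stream contents before and after an exception is logged."""
    stream = io.StringIO()
    logger = make_logger(stream)
    logger.debug("loaded config")
    before = stream.getvalue()       # still empty: the record is only buffered
    try:
        1 / 0
    except ZeroDivisionError:
        logger.exception("request failed")  # logged at ERROR, triggers the flush
    return before, stream.getvalue()
```
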

  • should we alert on causes (disk full) or symptoms (user can't login) (symptoms more useful? some tools allow dependencies, i.e.: if this is down, these others will be down too, don't alert on those)
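
The dependency idea in the parenthesis can be sketched as a small filter: given the checks that are currently failing and a map of what each check depends on, alert only on failing checks that have no failing dependency underneath them. The check names and the `root_alerts` helper are hypothetical; real tools that support dependencies have their own configuration for this.

```python
def root_alerts(failing: set[str], depends_on: dict[str, set[str]]) -> set[str]:
    """Keep only failing checks whose (transitive) dependencies are all healthy."""

    def has_failing_dependency(check: str, seen: set[str]) -> bool:
        for dep in depends_on.get(check, set()):
            if dep in seen:
                continue  # already visited, also protects against cycles
            seen.add(dep)
            if dep in failing or has_failing_dependency(dep, seen):
                return True
        return False

    return {check for check in failing if not has_failing_dependency(check, set())}


# Hypothetical dependency map: symptom on the left, what it relies on on the right.
DEPENDS_ON = {
    "user_login": {"auth_service"},
    "auth_service": {"database"},
    "database": {"disk"},
}
```

With `DEPENDS_ON` as above, a full disk that also breaks the auth service and login produces a single alert for `disk`, while a login failure with healthy dependencies still alerts on the symptom itself.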

Workshop on MDD - 2 minutes to dropped jaws

Story: Given that our support lines are currently overwhelmed, if we added an FAQ about it, support calls would drop back to manageable levels

what can we measure?

  • number of FAQ views
  • number of support calls
  • ask support reps to ask if caller read the FAQ & feed that back to the system?
  • instead of "was this helpful" "yes/no" maybe we could have "yes/Call support (link/phone number)" (talk to UX before doing this at home :-))
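
The measurements listed above could be combined into two numbers that say whether the story's hypothesis holds. This is only a sketch: the `FaqStats` fields and both helper functions are invented for illustration, and it assumes the before/after periods are equally long.

```python
from dataclasses import dataclass


@dataclass
class FaqStats:
    faq_views: int
    call_support_clicks: int  # clicks on the hypothetical "Call support" answer
    calls_before: int         # support calls in a period before the FAQ existed
    calls_after: int          # support calls in an equally long period after it


def call_reduction(stats: FaqStats) -> float:
    """Fraction by which support calls dropped after the FAQ was published."""
    if stats.calls_before == 0:
        return 0.0
    return (stats.calls_before - stats.calls_after) / stats.calls_before


def faq_escalation_rate(stats: FaqStats) -> float:
    """Share of FAQ views that still ended in a click on 'Call support'."""
    if stats.faq_views == 0:
        return 0.0
    return stats.call_support_clicks / stats.faq_views
```

A falling call count alone does not prove the FAQ helped; the escalation rate ties the drop back to people who actually saw the FAQ.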

=> the way you think of validation/measuring changes the product

Monitoring Embedded into Business

  • SRE handbook only focuses on the tech
  • if decision makers use monitoring data, it's important for the business, thus no need to justify why monitoring

Links

good questions to ask:

  • what does this data mean?
  • If we are not watching it -> delete it?
  • Should we try "Failure Friday"?
  • Should we use "Daily Red"?
  • Is this indicator fast enough (leading or lagging indicator) to react?