MDD Monitoring Driven Development

== From the Rubber Chicken to MDD ==

@jtf's "presentation"

# James Shore's Rubber Chicken
#* a physical token you had to hold before you could commit (push) to main (it was svn back then), and you ran the build/tests before committing
#* you had to use a separate physical machine (solving the "it works on my machine" problem)
# CI
#* can run more stuff now (fast tests, slow tests) - but a separate build for deploy
# pipelines with artifact passing
# promoting to test/prod
# CD - blue/green deploys, rolling back based on KPIs: '''CI + monitoring now controls production''' (sketch below)
#* if any step fails, the change is automatically reverted
#* if it made it to prod but business metrics are down:
#** don't revert the code
#** take it out of the production cluster to investigate
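A minimal sketch of that last step, with hypothetical <code>get_business_kpi</code> and <code>route_traffic</code> hooks standing in for your monitoring and load balancer: after switching traffic to the green cluster, a KPI drop routes traffic back to blue but keeps green out of rotation for investigation, rather than reverting the code.

<syntaxhighlight lang="python">
import time

def get_business_kpi() -> float:
    """Hypothetical hook into monitoring, e.g. checkout conversions per minute."""
    return 100.0

def route_traffic(target: str) -> None:
    """Hypothetical load-balancer call; target is "blue" or "green"."""
    print(f"routing production traffic to {target}")

def deploy_green_and_watch(baseline_kpi: float, watch_minutes: int = 10,
                           max_drop: float = 0.2) -> None:
    route_traffic("green")
    for _ in range(watch_minutes):
        time.sleep(60)                      # sample the KPI once a minute
        if get_business_kpi() < baseline_kpi * (1 - max_drop):
            # Monitoring, not a person, makes the call: traffic goes back to
            # blue, and green stays up but out of rotation for investigation.
            route_traffic("blue")
            print("green kept out of rotation for investigation")
            return
    print("green promoted: KPIs held for the whole watch window")
</syntaxhighlight>
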
== State of the monitoring (first) ==

* metrics used in monitoring are often not specific (a high-level business metric is down, "it must have been this change")
* just like adding tests after writing the code is hard, so is adding monitoring/metrics afterwards (sketch after this list)
* who tried monitoring first?
** zsoldosp - checklist item in the issue template, but it didn't apply to too many of the issues, so it kind of got ignored on that project after a while
** PJ / Intent Media
*** monitoring can stop a deploy/rollout
*** stopped doing acceptance tests in favor of monitoring
** aparker / TIM - failure analysis: we built it, and now that we know how it works, let's figure out
*** how could it fail?
*** what impact would it have?
*** how would we know? (from customers?)
*** is it worth adding something for it? (metric, alert)
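A minimal sketch of the monitoring-first idea, assuming a hypothetical <code>emit_metric</code> helper and a made-up <code>export_report</code> feature: the success/failure metrics are wired in before the feature body is written, just as a test-first test exists before the code.

<syntaxhighlight lang="python">
from datetime import datetime, timezone

def emit_metric(name: str, value: float = 1.0, **tags: str) -> None:
    """Hypothetical stand-in for a real metrics client (StatsD, Prometheus, ...)."""
    print(f"{datetime.now(timezone.utc).isoformat()} {name}={value} {tags}")

def do_export(account_id: str) -> None:
    """The feature itself; written after the metrics and alerts exist."""
    pass

def export_report(account_id: str) -> None:
    emit_metric("report_export.requested", account=account_id)
    try:
        do_export(account_id)
    except Exception:
        # Failure analysis done up front: this is how we would know it broke,
        # so the dashboard/alert exists on day one instead of post-incident.
        emit_metric("report_export.failed", account=account_id)
        raise
    emit_metric("report_export.succeeded", account=account_id)

export_report("acct-42")
</syntaxhighlight>
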
== Alerting ==

* how many alerts should we create?
** high level? e.g. number of failed API requests?
** more specific? e.g. after debugging we know it failed because the middleware failed - should we monitor the middleware?
* metrics vs. monitoring (sketch after this list)
** monitoring triggers someone to look at it
** metrics - kind of like classic ops - collect data, don't attach alerts, just eyeball them: "that looks like an unusual shape, let's investigate"
* who should we call? (e.g. if we only have high-level metrics, who should the alerts wake up?)
* (pagerduty.com)
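A minimal sketch of that split, with hypothetical metric names and thresholds: the coarse, user-facing failure ratio pages someone (e.g. via PagerDuty), while the narrow middleware counter is only recorded for later eyeballing.

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Rule:
    metric: str
    threshold: float
    page: bool   # True: wake someone up; False: dashboard/eyeball only

# Hypothetical rules: one coarse user-facing signal, one narrow internal one.
RULES = [
    Rule(metric="api.requests.failed_ratio", threshold=0.05, page=True),
    Rule(metric="middleware.retries_per_min", threshold=100.0, page=False),
]

def evaluate(samples: dict[str, float]) -> None:
    for rule in RULES:
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            if rule.page:
                print(f"PAGE the on-call: {rule.metric}={value}")
            else:
                print(f"record for review: {rule.metric}={value}")

evaluate({"api.requests.failed_ratio": 0.12, "middleware.retries_per_min": 40.0})
</syntaxhighlight>
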
== "Failure Friday" practice ==
 +
* during work hours!
 +
* we think this should be redundant, so let's shut this off and see the team recover
 +
* important: do it when you expect the exercise to be successful
 +
 +
 +
== Feature validation / A/B testing ==

Not the same as monitoring.
== Alert thresholds ==

* it's not always binary (on/off)
* "normal" is not the same as yesterday / last week / last year
** seasonality - e.g. Black Friday, but it can be different for each industry. And you kind of know it: "Mondays usually get about this many pageloads"
** event driven - e.g. if you publish tips, it depends on what happens in the world
* factor this into (sketch after this list):
** what we can measure
** what we should alert on (i.e. wake people up) - some things can wait till the next business day, so use different channels
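A minimal sketch of a seasonality-aware threshold, with made-up numbers: compare against the same weekday/hour last week rather than a fixed value, and route non-urgent anomalies to a next-business-day channel instead of paging.

<syntaxhighlight lang="python">
def is_anomalous(current: float, same_time_last_week: float,
                 tolerance: float = 0.3) -> bool:
    """Flag a deviation of more than `tolerance` from the seasonal baseline."""
    if same_time_last_week == 0:
        return current > 0
    return abs(current - same_time_last_week) / same_time_last_week > tolerance

def route(metric: str, current: float, baseline: float, wakes_people_up: bool) -> None:
    if not is_anomalous(current, baseline):
        return
    if wakes_people_up:
        print(f"page the on-call: {metric} is {current}, baseline {baseline}")
    else:
        print(f"ticket for the next business day: {metric} is {current}, baseline {baseline}")

# Hypothetical pageload counts: this Monday 10:00 vs. the same hour last Monday.
route("pageloads", current=4200, baseline=9800, wakes_people_up=False)
</syntaxhighlight>
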
== Improving Alerts ==

* make them actionable
** link to a wiki page / runbook on how to fix it
** write it for your future self who gets alerted at 2am at a party, not with your present knowledge of the context of the feature you just implemented
* metrics you don't use are inventory, thus not useful

(question: are there any logging frameworks that would only flush logs on exceptions, but then at DEBUG level?)
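One answer, as a minimal sketch: Python's standard library can do this with <code>logging.handlers.MemoryHandler</code>, which buffers records in memory and only forwards them to a target handler when a record at or above <code>flushLevel</code> arrives (or when the buffer fills up).

<syntaxhighlight lang="python">
import logging
import logging.handlers
import sys

# Final destination; it only sees records when the buffer is flushed.
console = logging.StreamHandler(sys.stderr)
console.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

# Buffer DEBUG records in memory; flush them to the console only when an
# ERROR-or-above record arrives (or when the large buffer fills up).
buffering = logging.handlers.MemoryHandler(
    capacity=10000,
    flushLevel=logging.ERROR,
    target=console,
)

log = logging.getLogger("checkout")
log.setLevel(logging.DEBUG)
log.addHandler(buffering)

log.debug("loaded cart %s", "cart-42")           # buffered, nothing written yet
log.debug("applying discount code %s", "SPRING")
try:
    raise ValueError("price became negative")
except ValueError:
    # The ERROR flushes the buffer, so the DEBUG context above shows up
    # together with the exception, without constant DEBUG noise in the logs.
    log.exception("checkout failed")
</syntaxhighlight>
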
* should we alert on causes (disk full) or symptoms (users can't log in)? (symptoms are often more useful; some tools understand dependencies, i.e. if this is down, those others will be down too, so don't alert on those - sketch below)
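A minimal sketch of that dependency idea, with a made-up service map: when the cause (auth-db) is already firing, the symptom alerts behind it are suppressed so only one page goes out.

<syntaxhighlight lang="python">
# Hypothetical dependency map: key depends on value.
DEPENDS_ON = {
    "login-page": "auth-db",
    "checkout": "auth-db",
}

def alerts_to_send(firing: set[str]) -> set[str]:
    """Drop alerts whose upstream dependency is already firing."""
    suppressed = {service for service, parent in DEPENDS_ON.items()
                  if service in firing and parent in firing}
    return firing - suppressed

# auth-db (cause) and login-page (symptom) fire together; only the cause pages.
print(alerts_to_send({"auth-db", "login-page"}))   # {'auth-db'}
</syntaxhighlight>
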
== Workshop on MDD - 2 minutes to dropped jaws ==

Story: Given that currently our support lines are overwhelmed, if we added an FAQ about the issue, support calls would drop back to manageable levels.

What can we measure? (sketch after this list)

* number of FAQ views
* number of support calls
* ask support reps to ask whether the caller read the FAQ & feed that back into the system?
* instead of "was this helpful?" "yes/no", maybe we could have "yes / Call support (link/phone number)" (talk to UX before trying this at home :-))

=> the way you think about validation/measuring changes the product
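A minimal sketch of how those measurements could feed a single "deflection" number, with hypothetical event names: count FAQ views and support calls, and watch calls-per-view drop once the FAQ ships.

<syntaxhighlight lang="python">
from collections import Counter

events = Counter()   # hypothetical in-memory stand-in for a metrics backend

def track(event: str) -> None:
    events[event] += 1

# Instrument the measurements listed above.
track("faq.viewed")
track("faq.viewed")
track("support.call")                # reps log every call
track("support.call.had_read_faq")   # reps ask: "did you read the FAQ?"

def calls_per_faq_view() -> float:
    """The story predicts this ratio drops once the FAQ is live."""
    return events["support.call"] / max(events["faq.viewed"], 1)

print(calls_per_faq_view())
</syntaxhighlight>
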
== Monitoring Embedded into Business ==

* the SRE handbook only focuses on the tech
* if decision makers use monitoring data, it's important to the business, so there is no need to justify why you monitor
== Links ==

* My Philosophy on Alerting (based on my observations while I was a Site Reliability Engineer at Google) by Rob Ewaschuk: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit
* Patrick Debois: Codifying devops practices: https://jedi.be/blog/2012/05/12/codifying-devops-area-practices/
* Doing the impossible fifty times a day: http://timothyfitz.com/2009/02/10/continuous-deployment-at-imvu-doing-the-impossible-fifty-times-a-day/
== Good questions to ask ==

* What does this data mean?
* If we are not watching it -> delete it?
* Should we try "Failure Friday"?
* Should we use "Daily Red"?
* Is this indicator fast enough to react to (leading or lagging indicator)?
