canvas-lms/gems/canvas_errors
Isaac Moore 3e748c1c7e Upgrade Sentry SDKs
This commit replaces `sentry-raven` with the more modern Sentry SDKs.

Settings related to this change:

- `sentry_disabled`: completely disables Sentry, in case anything goes
  sideways (defaults to `false`)

refs DE-922
flag=none

test plan:
- make sure to check out the correct patchset in the
  `gems/plugins/uuid_provisioner` directory (see the tag below for ref)
- create a new project in Sentry (it may be easiest to set up Sentry to
  run locally; check out `getsentry/onpremise` on GitHub for setup
  details; it's not too onerous. note that you will need to add
  `VIRTUAL_HOST` and `VIRTUAL_PORT` to the `nginx` container in that
  `docker-compose.yml` in order to have your Sentry accessible via Dory
  at `sentry.docker`)
- configure `config/sentry.rb` with your DSN
- restart rails and job servers
- confirm that, by default:
  - any errors raised in controllers appear in the "Issues" section
  - any errors raised in jobs appear in the "Issues" section
  - user context associated with issues includes user IP and global ID
  - job issues contain context from the job under the "INST-JOBS" section
- set `sentry_disabled` setting to true and restart rails/jobs servers
- confirm that the app/jobs start fine and there are no issues
  - also confirm that Sentry is disabled (no errors received)
- remove `config/sentry.yml` and restart rails and jobs servers
- confirm that the app/jobs start fine and there are no issues

[pin-commit-uuid_provisioner=735c2102fb0020ac5aa80734a323bf08322d002d]

Change-Id: Id0fa4b4ee57ab812cd75d21d2e8ab5e21177af1a
Reviewed-on: https://gerrit.instructure.com/c/canvas-lms/+/279454
Reviewed-by: Ben Rinaca <brinaca@instructure.com>
QA-Review: Isaac Moore <isaac.moore@instructure.com>
Product-Review: Isaac Moore <isaac.moore@instructure.com>
Tested-by: Service Cloud Jenkins <svc.cloudjenkins@instructure.com>
2022-01-10 16:58:07 +00:00
..
lib RuboCop: Style/StringLiterals, Style/StringLiteralsInInterpolation 2021-11-25 14:03:06 +00:00
spec RuboCop: Style/StringLiterals, Style/StringLiteralsInInterpolation 2021-11-25 14:03:06 +00:00
Gemfile RuboCop: Style/StringLiterals, Style/StringLiteralsInInterpolation 2021-11-25 14:03:06 +00:00
README.md Upgrade Sentry SDKs 2022-01-10 16:58:07 +00:00
canvas_errors.gemspec RuboCop: Style/StringLiterals, Style/StringLiteralsInInterpolation 2021-11-25 14:03:06 +00:00
test.sh pull canvas_errors out into a gem 2021-03-05 17:00:50 +00:00

README.md

CanvasErrors

A callback-hub for taking actions when an error is reported somewhere in the app.

Usage

When things go wrong, we want to know about it. When things error out in expected ways because thats how were communicating to the user theyve done something wrong, we DONT want to know about it. This gem organizes where our exceptions go, and how you should deal with sending them there (or NOT sending them there!).

Where do we capture exceptions for analysis?

The short answer is that they get sent to sentry. Mostly we use the Sentry integration (see config/initializers/sentry.rb in canvas-lms). This means that any time an UNHANDLED exception pops all the way out of the rails process, well tell Sentry.

We also sometimes report errors that we handle if its important to report them (because its unexpected) but also not explode (because were in the middle of doing something important that is continuable). See page view logging in canvas for an example: app/controllers/application_controller.rb. Because we register sentry as a callback from Canvas Errors, things that we send there also get sent to Sentry. An error can be captured from anywhere using the capture method:

begin
  # risky thing
rescue ExpectedError => e
  Canvas::Errors.capture(e, extra_context: "foobar")
end

Sentry is not the only system that can be rigged up to the Canvas Errors system as a callback. If something else should happen as part of an error being declared, you can define a callback to make that happen:

Rails.configuration.to_prepare do
  # write a database record to our application DB capturing useful info for looking
  # at this error later
  CanvasErrors.register!(:error_report) do |exception, data, level|
    report = ErrorReport.log_exception_from_canvas_errors(exception, data)
    report.try(:global_id)
  end
end

callbacks have a name, so only one callback with the same name can be registered, but this is also useful for sending callback "responses". the return value of every block registered as a callback gets packed into a hash that gets returned from "capture" calls so that you can inspect and use some identifier from a given callback if you need to.

Which exceptions should get reported?

Actionable ones. We aspire to be in a state where any error that ends up in sentry provokes a response. Possibly to fix the code, sometimes to catch and handle an exception that is really more of an operational signal. We often send those to sentry because “we want to know if theyre happening a lot”, but sentry isnt great at surfacing that information. If something is an error that is going to happen sometimes as part of doing business, but shouldnt happen too much (think of like transient networking failures when talking to an upstream service), then we want that to get sent to datadog as a metric we can alarm on. As long as its not happening so often that we need to fire an alarm, it can just be incorporated into a dashboard. That is the kind of signal that should NOT get sent to sentry.

The CanvasErrors has a mechanism for this, the “level” parameter:

def some_action
  # ... hard important work
rescue ErrorClassA => e
  CanvasErrors.capture_exception(:important_subsystem, e, :info)
  render :plain => 'unauthorized', :status => :unauthorized
rescue ErrorClassB => e
  CanvasErrors.capture_exception(:important_subsystem, e, :warn)
  render :plain => "Service is currently unavailable. Try again later.",
          :status => :service_unavailable
rescue ErrorClassC => e
  CanvasErrors.capture_exception(:important_subsystem, e, :info)
  render :plain => 'Bad Request', :status => :bad_request
rescue ErrorClassD => e
  CanvasErrors.capture_exception(:important_subsystem, e, :error)
  render :plain => 'Unknown Error', :status => :service_unavailable
end

Above, the same action can fail in many ways, but some of them are part of doing business, and some of them are surprises. The third argument to "capture_exception" lets you declare which type of failure this is, and that parameter is available to the callback blocks you write so you can decide for a given callback whether or not to take action for a given error based on it's level.

Best practices@

  • Throw and catch SPECIFIC errors. New error classes are cheap, and can help localize problems when they occur. If everything is a RuntimeException or ArgumentError, both sentry and the stats systems loose their fidelity. Instead use things like a custom MissingGoogleDriveParameterError, which will show up as its own stat in datadog, and its own “issue” in sentry (and will be easy to localize to one spot in the code). Its perfectly acceptable to INHERIT from ArgumentError or other general exception if they do fall into those categories, but try to throw errors that are specifically named for your use case. ERRORS THAT BUBBLE UP TO ABORT THE REQUEST OR JOB STILL GET CAPTURED, BUT THEY WILL ALWAYS BE OF THE “:error” TYPE BECAUSE WE DONT WANT EXPECTED ERRORS TO RESULT IN 500s.

  • Capture your exceptions explicitly with CanvasErrors.capture (or capture_exception). This will make sure that ANY callbacks we register to keep track of our errors will get run (logs, stats, sentry, etc), even ones we add in the future.

  • Use the type parameter for errors to specify a subsystem, just like the example above. If you use “capture_exception”, this is the first parameter. This will let us capture stats (and tags in sentry) for ALL the errors that are part of a given subystem (oauth, global_lookups, local_cache, etc).

  • Use the level argument to indicate severity. The default is “:error”. This is what we want for errors that are a surprise, and we should do something about them. If youre capturing something thats going to happen from time to time (upstream service timeouts, auth failures, parsing user content errors, etc, etc), use “:warn” if its something that might need attention if it climbs above a certain level (think redis or db timeouts or connection failures, things that were resiliant to but want to watch out for spikes in), and :info if its something like user input validation that we WANT to fail in order to make business logic work.

Running Tests

This gem is tested with rspec. You can use test.sh to run it, or do it yourself with bundle exec rspec spec.