Repair capability (emergency fixes)

Rationale

We want to add a “repair” mechanism for Ubuntu Core (all-snap) devices to handle extraordinary situations that require an out-of-band emergency update because the regular snapd update mechanism is not working, for whatever reason (a previous bad update, an incompatible combination of software, or something we did not foresee).

Because of the powerful nature of this feature we need a very clear design to ensure this mechanism is secure, resilient, effective and transparent.

Running repairs

What repairs to run

snap-repair will retrieve the repairs, which are distributed as assertions, and run them in sequence, retrieving and executing one at a time.

When snap-repair runs after a device boots for the first time, the device starts from the beginning of the sequence.
In the future we don’t want devices to download and run every repair that ever existed, since many would no longer be relevant. We expect snap-repair to grow mechanisms for deciding a starting point in the sequence, combining information from the image and possibly querying a service. Because repairs are assertions that can be revisioned and updated, we can postpone detailing this starting-point mechanism: if needed, the first repair in the sequence, which will be a NOP/testing repair, can be used to control the starting point for images that ship the first iteration(s) of snap-repair.

When run, the repair must echo one of the following states to $SNAP_REPAIR_STATUS_FD: done or retry. The retry state is important because a repair may run on a system that does not need it yet. For example, if core breaks in r100 but the assertion is already downloaded while core is at r99, we must re-run the script until r100 is reached.

The snap-repair run infrastructure will expose a repair helper (likely a symlink on PATH back to snap-repair) to help with those details:

  • repair done
  • repair retry
  • repair skip ID (to control skipping over parts of the sequence)

If a repair script finishes without having emitted its state it will be assumed to be retry.
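
A minimal sketch of how a repair script might report its state, using the r99/r100 example above. The function names, passing the revisions as arguments, and writing the bare state word to the descriptor are assumptions made for illustration; in practice the repair helper is expected to take care of these details.

```shell
#!/bin/sh
# Illustrative sketch only: report_status, apply_fix and run_repair are
# hypothetical names, not part of snap-repair.

# Write a state word (done or retry) to the status file descriptor.
report_status() {
    echo "$1" >&"${SNAP_REPAIR_STATUS_FD}"
}

# Placeholder for the real fix, which must be idempotent.
apply_fix() {
    :
}

# run_repair CURRENT_REV BROKEN_REV
run_repair() {
    current_rev=$1
    broken_rev=$2
    if [ "$current_rev" -lt "$broken_rev" ]; then
        # The breakage has not arrived yet: ask to be run again later.
        report_status retry
        return 0
    fi
    apply_fix
    report_status done
}
```

With core at r99 and the breakage expected in r100, run_repair 99 100 reports retry; once core reaches r100, run_repair 100 100 applies the fix and reports done.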

We also want a mechanism such that repairs can be presented to a device using a USB stick.

When to run repairs

We will run the repair fetcher/runner every 4h+random(4h) via a systemd timer unit. All new repairs, and those in the retry state, will be run and their states updated.
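
To make the interval concrete, 4h+random(4h) means each run is scheduled between 4 and 8 hours after the previous one. The sketch below only illustrates that arithmetic; next_repair_delay is a hypothetical name, and a systemd timer unit would normally compute the randomized delay itself.

```shell
#!/bin/sh
# Illustrative sketch only: next_repair_delay is a hypothetical helper.

# Print a delay in seconds: a fixed 4h plus a random 0..4h.
next_repair_delay() {
    base=14400    # 4h in seconds
    # awk's rand() returns a value in [0, 1); scale it to 0..14400.
    jitter=$(awk 'BEGIN { srand(); print int(rand() * 14401) }')
    echo $((base + jitter))
}
```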

Ideally we also want to run repairs once early in each boot (from the initrd, if possible).

Assertion

We add a new assertion called “repair”. The primary key of the assertion is (brand-id, repair-id). The repair-id is initially defined as an increasing number starting from 1.

In order to fetch a repair assertion in the sequence, snap-repair will do a GET on an HTTP repair URL that takes the same form as the assertion endpoints.

The very first iteration of the mechanism will consider a single sequence with brand-id canonical, useful for repairing any Core device. It is easy to extend this to also consider per-brand sequences, and later possibly model-specific sequences (by extending the repair-id format and the fetch-and-run logic).

The summary header is mandatory and should concisely document what the repair addresses.

A repair assertion contains no since/until header because we cannot trust the system clock. The timestamp header is only a reference for when the repair was created.

The code run via the assertion should be as minimal as possible, just enough to make the regular snapd update mechanism work again. It also needs to be idempotent, and when the problem is not present it should typically check whether the problem could still occur later (e.g. a broken update that is likely yet to come).

The assertion also contains optional lists of targeted series, architectures and models, where an omitted list means any. The run mechanism will use these lists to decide whether the repair should be run at all for the device.

There’s also an optional disabled boolean header used to mark fully retired or known-to-be-broken repairs.

Example:

type: repair
authority-id: acme
brand-id: acme
repair-id: 42
summary: this fixes everything
architectures:
  - amd64
series:
  - 16
models:
  - acme/frobinator
  - acme/hal-10*
timestamp: 2017-06-19T09:13:05Z
body-length: 432
sign-key-sha3-384: Jv8_JiHiIzJVcO9M55pPdqSDWUvuhfDIBJUS-3VW7F_idjix7Ffn5qMxB21ZQuij

#!/bin/sh
set -e
echo "Unpack embedded binary data"
match=$(grep --text --line-number '^PAYLOAD:$' $0 | cut -d ':' -f 1)
payload_start=$((match + 1))
tail -n +$payload_start $0 | uudecode | tar -xzf -
# run embedded content
./hello
exit 0
# payload generated with, may contain binary data
#   printf '#!/bin/sh\necho hello from the inside\n' > hello
#   chmod +x hello
#   tar czvf - hello | uuencode --base64 -
PAYLOAD:
begin-base64 644 -
H4sIAJl991gAA+3SSwrCMBSF4Yy7iisuoAkxyXp8RBOoDTR1/6Y6EQQdFRH+
b3IG9wzO4KY4DEWtSzfBuSVNcPo1n3ZOGdsqPjhrW89o570SvfKuh1ud95OI
ipcyfup9u/+p7aY/5LGvqYvHVCQt7yDnqVxlTlHyWPMpdr8eCQAAAAAAAAAA
AAAAAAB4cwdxEVGzACgAAA==

AXNpZw==
====
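
The targeting rules described above (optional series, architectures and models lists where an omitted list means any, plus the disabled header) could be checked along these lines. This is a sketch, not the snap-repair implementation: the function names are hypothetical, the lists are passed as space-separated strings, and treating a trailing * in a models entry (as in acme/hal-10* from the example) as a shell-style glob is an assumption.

```shell
#!/bin/sh
# Illustrative sketch only: matches_any and repair_applies are
# hypothetical names, not part of snap-repair.

# matches_any VALUE [PATTERN...]
# Succeeds if no patterns are given (an omitted list means any) or if
# VALUE matches one of them.
matches_any() {
    value=$1
    shift
    [ "$#" -eq 0 ] && return 0
    for pattern in "$@"; do
        # $pattern is left unquoted so that * is treated as a glob.
        case "$value" in
            $pattern) return 0 ;;
        esac
    done
    return 1
}

# repair_applies DISABLED SERIES ARCH MODEL SERIES_LIST ARCH_LIST MODEL_LIST
# The first four arguments describe the repair's disabled header and the
# device; the last three are the assertion's lists (empty when omitted).
repair_applies() {
    [ "$1" = "true" ] && return 1
    set -f    # keep entries such as acme/hal-10* from expanding to file names
    matches_any "$2" $5 && matches_any "$3" $6 && matches_any "$4" $7
    rc=$?
    set +f
    return "$rc"
}
```

For the example assertion, repair_applies false 16 amd64 acme/hal-1000 "16" "amd64" "acme/frobinator acme/hal-10*" succeeds, while the same call for an armhf device fails.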

Straw-man for the implementation

There are some key properties we want to ensure:

  • secure - we ensure the security of this feature by using assertions as the mechanism to implement repairs. The use of signatures ensures that only legitimate repair assertions are accepted. Things to consider:

    • being able to revoke repair assertions via a revoke-repair assertion (alternatively we just publish a new revision of the existing assertion), so that a repair assertion with bad code cannot be used as an attack vector.

    • limiting the authority that can issue repair assertions to Canonical initially (to ensure the system is not abused for things that are not the job of the repair assertion).

  • resilient - TBD

  • effective - we use the body to include a script that is run as the repair action. The content will be written to disk, or to tmpfs in case the disk is full, and executed. This way we can ship easy shell (or perl/python) based fixes, but it also allows us to ship binaries by embedding them into the script via base64 encoding. An example will be included in the tests. We will also need to make sure that we handle big payloads, i.e. ensure that the assertion system can deal with multi-megabyte lines without choking. In addition, we should send the output and error result of the script back to a repair tracker (similar to our error tracker) so that we can detect failing repair actions and act accordingly. In phase 1 we might consider using the error tracker for this and only monitoring failing actions.

  • transparent - when a repair runs we add information about it to syslog. In addition, for each repair we create a directory /var/lib/snapd/repair/run/${BRAND_ID}/${REPAIR_ID}/ and put the following files in there:

    • r{assertion revision}.script: the repair script that was run
    • r{assertion revision}.{done|retry|skip}: the full output of the script run, with the outcome status indicated by the file extension

    On the other hand, /var/lib/snapd/repair/assertions/${BRAND_ID}/${REPAIR_ID}/r{assertion revision}.repair will contain the full repair assertion together with the auxiliary assertions as fetched in a stream.
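
The on-disk layout described above can be summarised with a few path helpers. This is a sketch of the naming scheme only; the function names are hypothetical and nothing here creates files.

```shell
#!/bin/sh
# Illustrative sketch only: these helper names are hypothetical.

# repair_run_dir BRAND_ID REPAIR_ID
repair_run_dir() {
    printf '/var/lib/snapd/repair/run/%s/%s\n' "$1" "$2"
}

# repair_script_path BRAND_ID REPAIR_ID REVISION
repair_script_path() {
    printf '%s/r%s.script\n' "$(repair_run_dir "$1" "$2")" "$3"
}

# repair_output_path BRAND_ID REPAIR_ID REVISION STATUS
# STATUS is the outcome and becomes the file extension.
repair_output_path() {
    case "$4" in
        "done"|retry|skip) ;;
        *) echo "unknown status: $4" >&2; return 1 ;;
    esac
    printf '%s/r%s.%s\n' "$(repair_run_dir "$1" "$2")" "$3" "$4"
}

# repair_assertion_path BRAND_ID REPAIR_ID REVISION
repair_assertion_path() {
    printf '/var/lib/snapd/repair/assertions/%s/%s/r%s.repair\n' "$1" "$2" "$3"
}
```

For example, repair_output_path acme 42 2 retry prints /var/lib/snapd/repair/run/acme/42/r2.retry.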
