=============================
Dynamic cloud compute scaling
=============================

Requirements
============

To support spikes in work requests, debusine needs to be able to dynamically
make use of CPU resources in clouds.  Static workers may still be useful for
various reasons, such as:

* locality
* confidentiality
* supporting an expected base load
* customers who are sensitive to making use of artifacts built in clouds
  they don't control
* support for exotic architectures
* in some cases, the ability to launch virtual machines in the worker

However, dynamic workers offer more flexibility and may be cheaper overall.

Dynamic workers typically have a cost based at least in part on their
uptime, so debusine must keep track of them and stay within resource limits
configured by the administrator and/or scope owners.

debusine must be able to provision dynamic workers automatically, including
any additional facilities such as setting up Incus; and it must be able to
tear down idle dynamic workers based on appropriate criteria.  Some
provisioning decisions may be handled outside debusine proper, such as by
providing pre-built images.

Dynamic workers will have various properties that can be considered when
scheduling work requests, such as their size.  (Worker metadata already
handles requirements such as checking whether workers support particular
executor backends.)

Static workers should have higher priority for assigning work requests, so
that the use of dynamic workers can be limited to load spikes.

The UI's list of workers needs to indicate whether workers are static or
dynamic, and allow inspecting the profiles of dynamic workers.

Dynamic worker names can be reused within certain parameters (e.g. provider
and architecture), as long as multiple active workers don't share a name.
This avoids the list of historical workers growing unreasonably long just
due to dynamic worker instances being created and destroyed.

Initially, this feature only needs to support Amazon EC2, but it must be
easy to extend it to add support for other cloud providers.

Expected changes
================

Assets
------

To support :ref:`debusine:cloud-provider-account assets
<asset-cloud-provider-account>`, ``Asset.workspace`` needs to become
nullable, and only required for certain asset categories (currently
``debusine:signing-key``).  There will be some associated refactoring for
this: for instance, ``Asset.can_create``, ``Asset.__str__``,
``AssetSerializer``, the client code for creating assets, and
``Playground.create_asset`` currently assume that every asset has a
workspace.
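
For illustration, the category-conditional requirement might end up looking
something like this (a minimal sketch; the ``WORKSPACE_REQUIRED_CATEGORIES``
constant and the use of ``clean`` are assumptions, not the existing
implementation):

.. code-block:: python

    from django.core.exceptions import ValidationError
    from django.db import models


    class Asset(models.Model):
        category = models.CharField(max_length=255)
        # Becomes nullable: only some categories require a workspace.
        workspace = models.ForeignKey(
            "Workspace", null=True, blank=True, on_delete=models.PROTECT
        )

        # Hypothetical: the categories that must be scoped to a workspace.
        WORKSPACE_REQUIRED_CATEGORIES = {"debusine:signing-key"}

        def clean(self) -> None:
            if (
                self.category in self.WORKSPACE_REQUIRED_CATEGORIES
                and self.workspace is None
            ):
                raise ValidationError(
                    f"Assets of category {self.category} require a workspace"
                )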

Dynamic worker pools
--------------------

debusine needs configuration for each cloud provider, such as API keys.
This is conceptually similar to file stores, and it would make sense to
handle it similarly.  It also needs to keep track of pools of workers that
are configured similarly, potentially including inactive ones.  There may be
multiple pools for a single cloud provider, using different accounts; for
example, there may be pools on the same provider for different scopes with
different billing arrangements.

Add a new ``WorkerPool`` model, including at least the following fields (all
JSON objects are modelled using Pydantic):

* ``name`` (string): the name of the pool
* ``provider_account`` (foreign key to ``Asset``): a
  :ref:`debusine:cloud-provider-account asset
  <asset-cloud-provider-account>` with details of the provider account to
  use for this pool
* ``enabled`` (boolean, defaults to True): if True, this pool is available
  for creating instances
* ``architectures`` (array of strings): the task architectures supported by
  workers in this pool
* ``tags`` (array of strings): the `worker tags
  <https://salsa.debian.org/freexian-team/debusine/-/issues/326>`__
  supported by workers in this pool (note that implementing worker pools
  does not require worker tags to be fully implemented yet)
* ``specifications`` (JSON): public information indicating the type of
  instances to create, in a provider-dependent format; for some providers
  this may simply be instance type and image names, whereas for others it
  may include a collection of minimum values for parameters such as number
  of cores and RAM size
* ``instance_wide`` (boolean, defaults to True): if True, this pool may be
  used by any scope on this debusine instance; if False, it may only be used
  by a single scope (i.e. there is a unique constraint on
  ``Scope``/``WorkerPool`` relations where ``WorkerPool.instance_wide`` is
  False)
* ``ephemeral`` (boolean, defaults to False): if True, configure the worker
  to shut down and require reprovisioning after running a single work
  request
* ``limits`` (JSON): instance limits, as follows:

  * ``max_active_instances`` (integer, optional): the maximum number of
    active instances in this pool
  * ``target_max_seconds_per_month`` (integer, optional): the maximum number
    of instance-seconds that should be used in this pool per month (this is
    a target maximum rather than a hard maximum, as debusine does not
    destroy instances that are running a task; it may be None if there is no
    need to impose such a limit)
  * ``max_idle_seconds`` (integer, defaults to 3600): destroy instances that
    have been idle for this long
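
For concreteness, the pool and its limits might be modelled along these
lines (a minimal sketch; the exact field types are assumptions where the
list above leaves them open):

.. code-block:: python

    from typing import Optional

    import pydantic
    from django.db import models


    class WorkerPoolLimits(pydantic.BaseModel):
        """Sketch of ``WorkerPool.limits``, stored as JSON."""

        max_active_instances: Optional[int] = None
        target_max_seconds_per_month: Optional[int] = None
        max_idle_seconds: int = 3600


    class WorkerPool(models.Model):
        """Sketch of the ``WorkerPool`` model described above."""

        name = models.CharField(max_length=255)
        provider_account = models.ForeignKey(
            "Asset", on_delete=models.PROTECT
        )
        enabled = models.BooleanField(default=True)
        architectures = models.JSONField(default=list)
        tags = models.JSONField(default=list)
        specifications = models.JSONField(default=dict)
        instance_wide = models.BooleanField(default=True)
        ephemeral = models.BooleanField(default=False)
        limits = models.JSONField(default=dict)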

.. note::

    Public cloud providers typically have billing cycles corresponding to
    calendar months, which are not all the same length.  debusine keeps
    track of instance run-times for each month in the Gregorian calendar in
    an attempt to approximate this.  It will not always produce an exactly
    accurate prediction of run-time for the purpose of provider charges,
    since providers vary in terms of how they account for things like
    instances that run for less than an hour.

    It is the administrator's responsibility to calculate appropriate limits
    based on the provider's advertised pricing for the configured instance
    type.

.. note::

    To avoid wasting resources, debusine does not destroy instances that are
    actively running a task; this may cause it to overrun
    ``target_max_seconds_per_month``.  As a result, to keep resource usage
    under control even if some tasks take a very long time, administrators
    should normally also set ``max_active_instances``, and should
    independently set up billing alerts with their cloud providers.

.. note::

    ``max_idle_seconds`` has a conservative default to minimize accidental
    billing.  Administrators should tune it in production taking the
    observed provisioning time into account, so that an acceptable fraction
    of instance run-time is spent actually running tasks.

.. note::

    Although it isn't initially required, there may be a "static" provider
    corresponding to manually-provisioned workers, allowing them to have the
    same kinds of flexible prioritization.

Worker names are constructed as ``f"{pool.name}-{instance.number:03d}"``,
where instance numbers are allocated sequentially within the pool, and the
lowest available instance number is used when creating a new instance.
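
A sketch of that allocation rule (the helpers are hypothetical; in practice
the set of used numbers would come from a query over the pool's ``Worker``
rows, under a lock to avoid races):

.. code-block:: python

    def next_instance_number(used_numbers: set[int]) -> int:
        """Return the lowest positive instance number not currently in use."""
        number = 1
        while number in used_numbers:
            number += 1
        return number


    def worker_name(pool_name: str, used_numbers: set[int]) -> str:
        return f"{pool_name}-{next_instance_number(used_numbers):03d}"


    # With instances 1, 2, and 4 active, the gap is reused first.
    assert worker_name("ec2-amd64", {1, 2, 4}) == "ec2-amd64-003"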

Add an optional ``Worker.worker_pool`` foreign key.

Scope-level controls
--------------------

Add a many-to-many ``Scope.worker_pools`` relationship pointing to
``WorkerPool``, with extra data on the relationship as follows:

* ``priority`` (integer): The priority of this worker pool for the purpose
  of scheduling work requests and creating dynamic workers to handle load
  spikes; pools with a higher priority will be selected in preference to
  pools with a lower priority.  Workers that do not have a pool implicitly
  have a higher priority than any workers that have a pool.

* ``limits`` (JSON): scope-level limits, as follows:

  * ``target_max_seconds_per_month`` (integer, optional): the maximum number
    of instance-seconds that should be used by work requests in this scope
    per month (this is a target maximum rather than a hard maximum, as
    debusine does not destroy instances that are running a task; it may be
    None if there is no need to impose such a limit; note that idle worker
    time is not accounted to any scope)
  * ``target_latency_seconds`` (integer, optional): a target for the number
    of seconds before the last pending work request in the relevant subset
    of the work queue is dispatched to a worker, which the provisioning
    service uses as a best-effort hint when scaling dynamic workers
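
The extra data could live on an explicit through model (a sketch; the
``ScopeWorkerPool`` name matches its use later in this document, while the
exact field types are assumptions):

.. code-block:: python

    from typing import Optional

    import pydantic
    from django.db import models


    class ScopeWorkerPoolLimits(pydantic.BaseModel):
        """Sketch of ``ScopeWorkerPool.limits``, stored as JSON."""

        target_max_seconds_per_month: Optional[int] = None
        target_latency_seconds: Optional[int] = None


    class ScopeWorkerPool(models.Model):
        """Sketch of a through model for ``Scope.worker_pools``."""

        scope = models.ForeignKey("Scope", on_delete=models.CASCADE)
        worker_pool = models.ForeignKey(
            "WorkerPool", on_delete=models.CASCADE
        )
        priority = models.IntegerField()
        limits = models.JSONField(default=dict)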

Worker-level accounting
-----------------------

Add a ``Worker.instance_created_at`` field, which is set for dynamic workers
when their instance is created.  This is similar to
``Worker.registered_at``, but that field indicates when the ``Worker`` row
was created and remains constant across multiple create/destroy cycles,
while ``Worker.instance_created_at`` can be used to determine the current
runtime of an instance by subtracting it from the current time.

While :ref:`RuntimeStatistics <runtime-statistics>` contains the runtime
duration of each task and thus allows calculating how many seconds have been
spent on behalf of each scope by each worker, doing so on the fly would
involve a complex database query.  To avoid that, add a
``Worker.durations_by_scope`` many-to-many field, with the accumulated
duration for that worker and scope as extra data on the relationship.  When
the server is notified of a completed work request, it adds the work
request's duration to that field.  When a dynamic worker is (re-)created,
the accumulated durations for that worker are reset to zero.
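
Accumulation then becomes a cheap update when a work request completes (a
sketch; the through model and helper names are hypothetical):

.. code-block:: python

    from django.db import models
    from django.db.models import F


    class WorkerScopeDuration(models.Model):
        """Hypothetical through model for ``Worker.durations_by_scope``."""

        worker = models.ForeignKey("Worker", on_delete=models.CASCADE)
        scope = models.ForeignKey("Scope", on_delete=models.CASCADE)
        seconds = models.BigIntegerField(default=0)


    def add_completed_duration(worker, scope, duration_seconds: int) -> None:
        """Add a completed work request's duration to the per-scope total."""
        row, _ = WorkerScopeDuration.objects.get_or_create(
            worker=worker, scope=scope
        )
        # F() makes the addition atomic in the database, avoiding a
        # read-modify-write race between concurrent completions.
        row.seconds = F("seconds") + duration_seconds
        row.save(update_fields=["seconds"])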

Workers do not currently notify the server when a task is aborted.  They
will need to start doing so, at least in order to send runtime statistics.

Image building
--------------

debusine workers on cloud providers need a base image.  While this could be
a generic image plus some dynamic provisioning code, it's faster and more
flexible to have pre-built images that already contain the worker code and
only need to be given a token and a debusine server API URL.

The process of building and publishing these images should eventually be a
debusine task, but to start with it can be an ad-hoc script.  However, the
code should still be in the debusine repository so that we can develop it
along with the rest of the code.

Image builds will need at least the following options (which might be
command-line options rather than this JSON-style design):

* ``source`` (string, optional): a deb822-style APT source to add; for
  example, this would allow using the `latest debusine-worker development
  build <https://freexian-team.pages.debian.net/debusine/repository/>`__
  rather than the version of ``debusine-worker`` in the base image's default
  repositories
* ``enable_backends`` (list of strings, defaults to ``["unshare"]``):
  install and configure packages needed for the given executors
* ``enable_tasks`` (list of strings, defaults to ``["autopkgtest",
  "sbuild"]``): install packages needed for the given tasks; most tasks do
  not need explicit support during provisioning, but ``autopkgtest``,
  ``mmdebstrap``, ``sbuild``, and ``simplesystemimagebuild`` do
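
If the JSON-style design is kept, the options might be modelled as follows
(a sketch; only the option names and defaults come from the list above):

.. code-block:: python

    from typing import Optional

    import pydantic


    class ImageBuildOptions(pydantic.BaseModel):
        """Sketch of the image build options listed above."""

        # deb822-style APT source to add, if any.
        source: Optional[str] = None
        # Executor backends to install and configure support for.
        enable_backends: list[str] = ["unshare"]
        # Tasks that need extra packages installed into the image.
        enable_tasks: list[str] = ["autopkgtest", "sbuild"]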

The initial image building code can be derived from Freexian's current
Ansible setup for the ``debusine_worker`` role.

Provisioning
------------

For each provider, debusine must have a backend that knows how to provision
a new instance based on ``WorkerPool.specifications``.

Provisioning must be non-interactive, so the provisioning code must provide
enabled tokens to workers.  It must be careful that tasks running on the
worker cannot access the token (e.g. via cloud metadata endpoints) once the
instance is up.

A new Celery service controls the provisioning process.  (While this is
somewhat related to the scheduler, it has very different performance
characteristics - even in the best case, provisioning is typically much
slower than scheduling work requests - and so it's better to keep it
separate.)  That service periodically monitors the number of pending work
requests per scope and decides whether to create new dynamic workers or
destroy idle dynamic workers to meet demand.  When doing so, it only
considers the subset of pending work requests that would require the dynamic
workers in question, taking into account `worker tags
<https://salsa.debian.org/freexian-team/debusine/-/issues/326>`__ and any
restrictions declared by work requests: for example, if there are idle
dynamic workers with a given tag and no pending work requests that require
that tag, those workers can be destroyed.

The provisioning service must not create workers in a pool if
``WorkerPool.enabled`` is False, or if doing so would take it over any of
the limits specified in ``WorkerPool.limits`` (considering
``Worker.instance_created_at``) or ``ScopeWorkerPool.limits`` (considering
``Worker.durations_by_scope``).  It must destroy workers if they have been
idle for longer than ``WorkerPool.limits.max_idle_seconds``, or if they are
idle and exceed any of the other limits in ``WorkerPool.limits``.
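
The pool-level part of that check might look like this (a sketch; the two
counts are assumed to be computed elsewhere from the pool's ``Worker`` rows,
using ``Worker.instance_created_at``):

.. code-block:: python

    def may_create_instance(
        pool, active_instances: int, seconds_this_month: int
    ) -> bool:
        """Return True if one more instance would respect ``pool.limits``."""
        limits = pool.limits  # a WorkerPoolLimits-like object
        if not pool.enabled:
            return False
        if (
            limits.max_active_instances is not None
            and active_instances >= limits.max_active_instances
        ):
            return False
        if (
            limits.target_max_seconds_per_month is not None
            and seconds_this_month >= limits.target_max_seconds_per_month
        ):
            return False
        return True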

``ScopeWorkerPool.limits.target_latency_seconds`` acts as a best-effort hint
for how aggressively to scale up workers.  The provisioning service should
aim to scale up the number of workers until its estimate of the time before
the last pending work request is dispatched to a worker reaches that target
for each scope where it is set, while not exceeding other limits.  debusine
does not guarantee to meet or even necessarily approach this target, but it
allows administrators to tune how hard it should try.  :ref:`Task statistics
<runtime-statistics>` will be required for good time estimates, but a first
draft can use rough estimates such as the observed mean of work request
durations regardless of subject or context.
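
As a first draft, the scale-up target might be estimated as crudely as this
(a sketch: with ``n`` workers, the last of ``pending`` requests is
dispatched after roughly ``pending * mean_duration / n`` seconds):

.. code-block:: python

    import math


    def workers_needed(
        pending: int, mean_duration: float, target_latency: float
    ) -> int:
        """Estimate workers needed to dispatch the last request in time.

        Deliberately rough: it ignores per-request variation and assumes
        requests are spread evenly across workers.
        """
        if pending == 0:
            return 0
        return max(1, math.ceil(pending * mean_duration / target_latency))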

The highest-priority pool may be unavailable for various reasons: there
might be an outage, or the highest-priority pool might be a discounted
provider option with low availability guarantees such as spot instances.
The provisioning service should fall back to lower-priority pools as needed
to satisfy the constraints above.  If lower-priority pools are more
expensive, then administrators can assign them a lower
``target_latency_seconds`` value so that debusine will not scale up workers
as aggressively in those pools.

.. todo::

    Because workers are only destroyed when idle, the provisioning service
    is in practice only able to scale down worker pools once the relevant
    part of the work queue has been exhausted.  This means that the
    provisioning service should normally avoid being too aggressive when
    creating new dynamic workers.

.. note::

    Some of this functionality overlaps with auto-scaling features that
    already exist in some cloud providers, and in principle it would be
    possible for debusine to provide metrics to those providers that would
    allow creating auto-scaling policies.  However, we handle scaling
    ourselves because this allows us to use providers that don't have that
    feature and to scale across multiple providers.

Scheduler
---------

The scheduler currently checks worker metadata on a per-worker basis to
decide whether a worker can run a given work request.  This was already a
potential optimization target, and supporting decisions about whether to
provision dynamic workers makes optimizing it unavoidable, since those
decisions must be made in bulk for many workers at once.  Instead of the
current
``can_run_on`` hook that runs for a single work request, tasks will need
some way to provide Django query conditions selecting the workers that can
run them, relying on worker tags.
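
For example, a task's worker requirements might be expressed as a queryset
filter rather than a per-worker predicate (a sketch; the ``tags`` relation
and tag names are assumptions, since worker tags are not yet implemented):

.. code-block:: python

    from django.db.models import QuerySet


    def capable_workers(
        workers: QuerySet, host_architecture: str
    ) -> QuerySet:
        """Hypothetical bulk replacement for the ``can_run_on`` hook.

        Each chained filter() call joins the assumed tags relation
        separately, so a worker must carry both tags to match.
        """
        return workers.filter(tags__name="executor:unshare").filter(
            tags__name=f"architecture:{host_architecture}"
        )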

Once we have task statistics, we are likely to want to select workers (or
worker pools) that have a certain minimum amount of disk space or memory.
It may be sufficient to have a small/medium/big classification, but a
clearer approach would be for some tags to have a numerical value so that
work requests can indicate the minimum value they need.  These would be a
natural fit for worker pools: ``WorkerPool.specifications`` will usually
already specify instance sizes in some way, and therefore
``WorkerPool.tags`` can also communicate those sizes to the scheduler.
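
With numeric values on tags, a minimum requirement then becomes a range
condition (a sketch, assuming a hypothetical ``value`` column on the tag
relation):

.. code-block:: python

    from django.db.models import QuerySet


    def workers_with_min_memory(workers: QuerySet, min_mib: int) -> QuerySet:
        """Select workers whose hypothetical "memory_mib" tag is big enough.

        Conditions in a single filter() call apply to the same tag row.
        """
        return workers.filter(
            tags__name="memory_mib", tags__value__gte=min_mib
        )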

User interface
--------------

Add an indication to ``/-/status/workers/`` showing each worker's pool.

Make each worker listed on ``/-/status/workers/`` a link to a view of that
single worker.  Where available, that view includes information from the
corresponding ``WorkerPool``, excluding secret details of the provider
account.

Exclude inactive dynamic workers from ``/-/status/workers/``, to avoid
flooding users with irrelevant information.
