.. _debusine-concepts:

=================
Debusine concepts
=================

.. _explanation-artifacts:

Artifacts
=========

Artifacts are at the heart of Debusine: they are both inputs
(submitted by users) and outputs (generated by tasks). An artifact
combines:

* an arbitrary set of files
* arbitrary key-value data (stored as a JSON-encoded dictionary)
* a category

The category is a string identifier used to recognize artifacts sharing
the same structure. You can create and use categories as you see fit, but
we have defined a basic :ref:`ontology <artifacts>` suited to Debian-based
distributions.
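
For illustration, here is a minimal sketch of the three parts an artifact
combines. The field names and values are indicative only, not Debusine's
actual schema:

.. code-block:: python

   # Illustrative sketch of an artifact: a category, arbitrary key-value
   # data, and a set of files. Field names are hypothetical.
   artifact = {
       "category": "debian:source-package",
       "data": {  # arbitrary key-value data (JSON-encoded dictionary)
           "name": "hello",
           "version": "2.10-3",
       },
       "files": {  # arbitrary set of files, identified by their hashes
           "hello_2.10-3.dsc": {"sha256": "..."},
           "hello_2.10-3.debian.tar.xz": {"sha256": "..."},
       },
   }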

Artifacts can have relations with other artifacts:

* *built-using*: indicates that the build of the artifact used the target
  artifact (e.g. "binary-packages" artifacts are built using
  "source-package" artifacts)
* *extends*: indicates that the artifact extends the target artifact in
  some way (e.g. a "source-upload" artifact extends a "source-package"
  artifact with target distribution information)
* *relates-to*: indicates that the artifact relates to another one in some
  way (e.g. a "binary-upload" artifact relates to a "binary-package", or a
  "package-build-log" artifact relates to a "binary-package")

Artifacts are not deleted:

* as long as they are referenced by another artifact (through one of the
  above relationships)
* as long as their expiration date has not passed
* as long as they have not been manually deleted (if they don't have an
  expiration date)
* as long as they are referenced by items of a collection

Artifacts can have additional properties:

* immutable: when set to True, the artifact can no longer be modified
  through the API
* creation timestamp: records when the artifact was created
* last updated timestamp: records when the artifact was last
  modified/updated

The following operations are possible on artifacts:

* create a new artifact
* upload the content of one of its files
* set key-value data
* attach/remove a file
* add/remove a relationship
* delete an artifact

Files in artifacts are content-addressed (stored by hash) in the
database, so a single file can be referenced in multiple places without
unnecessary data duplication.
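
As a rough sketch of the content-addressing idea (not Debusine's actual
storage code), a file's digest serves as its storage key, so identical
content is only stored once:

.. code-block:: python

   import hashlib

   # Minimal sketch of content-addressed storage: files are keyed by the
   # hash of their content, so the same bytes are stored only once even
   # when referenced from several artifacts.
   store: dict[str, bytes] = {}

   def add_file(content: bytes) -> str:
       key = hashlib.sha256(content).hexdigest()
       store.setdefault(key, content)  # no duplication for identical content
       return key                      # artifacts reference files by this key

   key_a = add_file(b"identical payload")
   key_b = add_file(b"identical payload")
   assert key_a == key_b and len(store) == 1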

.. _explanation-collections:

Collections
===========

A Collection is a set of artifacts or other collections that are intended to
be used together. The following are some example use cases:

* A suite in the Debian archive (e.g. "Debian bookworm")
* A Debian archive (a.k.a. repository) containing multiple suites
* For a source package name, the latest version in each suite in Debian
  (compare ``https://tracker.debian.org/pkg/foo``)
* Results of a QA scan across all packages in unstable and experimental
* Buildd-suitable ``debian:system-tarball`` artifacts for all Debian suites
* Extracted ``.desktop`` files for each package name in a suite

.. todo::

   Another possible idea is to use collections for the output of each task,
   either automatically or via a parameter to the task.

Collections have the following properties:

* ``category``: a string identifier indicating the structure of additional
  data; see the :ref:`ontology <collections>`
* ``name``: the name of the collection
* ``workspace``: defines access control and file storage for this collection; at
  present, all artifacts in the collection must be in the same workspace
* ``full_history_retention_period``, ``metadata_only_retention_period``:
  optional time intervals to configure the retention of items in the
  collection after removal; see :ref:`explanation-collection-item-retention`
  for details

Collections are unique by category and name.  They may be looked up by
category and name, providing starting points for further lookups within
collections.
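
As a minimal sketch of this uniqueness rule (illustrative only, using a
plain dictionary rather than the real data model), collections can be
thought of as keyed by the ``(category, name)`` pair:

.. code-block:: python

   # Illustrative sketch: collections are unique by (category, name), so
   # a mapping keyed by that pair models how a lookup starts.
   collections = {
       ("debian:suite", "bookworm"): {"workspace": "debian"},
       ("debian:suite", "trixie"): {"workspace": "debian"},
   }

   def lookup(category: str, name: str) -> dict:
       return collections[(category, name)]  # starting point for further lookups

   suite = lookup("debian:suite", "bookworm")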

Each item in a collection is a combination of some metadata and an optional
reference to an artifact or another collection. The permitted categories for
the artifact or collection are limited depending on the category of the
containing collection. The metadata is as follows:

* ``category``: the category of the referenced artifact or collection,
  copied for ease of lookup and to preserve history
* ``name``: a name identifying the item, which will normally be derived
  automatically from some of its properties; only one item with a given
  name and an unset removal timestamp (i.e. an active item) may exist in any
  given collection
* key-value data indicating additional properties of the item in the
  collection, stored as a JSON-encoded dictionary with a structure
  :ref:`depending on the category of the collection <collections>`; this
  data can:

  * provide additional data related to the item itself
  * provide additional data related to the associated artifact in the
    context of the collection (e.g. overrides for packages in suites)
  * override some artifact metadata in the context of the collection (e.g.
    vendor/codename of system tarballs)
  * duplicate some artifact metadata, to make querying easier and to
    preserve it as history even after the associated artifact has been
    expired (e.g. architecture of system tarballs)

* audit log fields for changes in the item's state:

  * timestamp, user, and workflow for when it was created
  * timestamp, user, and workflow for when it was removed

This metadata may be retained even after a linked artifact has been expired
(see :ref:`explanation-collection-item-retention`). This means that it is
sometimes useful to design collection items to copy some basic information,
such as package names and versions, from their linked artifacts for use when
inspecting history.
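
As an illustration (the field names are indicative, not the exact database
schema), a collection item for a binary package in a suite might look like
the following, with the package name and version copied into the item so
that they survive the expiry of the artifact itself:

.. code-block:: python

   # Illustrative collection item; the copied package/version values
   # remain available as history even after the artifact referenced by
   # "artifact_id" has been expired.
   item = {
       "category": "debian:binary-package",
       "name": "hello_2.10-3_amd64",
       "artifact_id": 1234,       # link that may later be dropped on expiry
       "data": {
           "package": "hello",    # copied from the artifact for queries/history
           "version": "2.10-3",
           "component": "main",   # per-collection data (e.g. archive overrides)
           "section": "devel",
       },
       "created_at": "2024-05-01T12:00:00Z",
       "created_by_user": "alice",
       "removed_at": None,        # unset removal timestamp: the item is active
   }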

The same artifact or collection may be present more than once in the same
containing collection, with different properties. For example, this is
useful when debusine needs to use the same artifact in more than one similar
situation, such as a single system tarball that should be used for builds
for more than one suite.

A collection may impose additional constraints on the items it contains,
depending on its category. Some constraints may apply only to active items,
while some may apply to all items. If a collection contains another
collection, all relevant constraints are applied recursively.

Collections can be compared: for example, a collection of outputs of QA
tasks can be compared with the collection of inputs to those tasks, making
it easy to see which new tasks need to be scheduled to stay up to date.

.. _explanation-collection-item-retention:

Retention of collection items
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Collection items and the artifacts they refer to may be retained in
debusine's database for some time after the item is removed from the
collection, depending on the values of ``full_history_retention_period`` and
``metadata_only_retention_period``.  The sequence of events is as follows:

* item is removed from collection: metadata and artifact are both still
  present
* after ``full_history_retention_period``, the link between the collection
  item and the artifact is removed: metadata is still present, but the
  artifact may be expired if nothing else prevents that from happening
* after ``full_history_retention_period`` +
  ``metadata_only_retention_period``, the collection item itself is deleted
  from the database: metadata is no longer present, so the history of the
  collection no longer records that the item in question was ever in the
  collection

If ``full_history_retention_period`` is not set, then artifacts in the
collection and the files they contain are never expired.  If
``metadata_only_retention_period`` is not set, then metadata-level history
of items in the collection is never expired.
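
A small, illustrative timeline may help (using Python's ``datetime`` just
for the arithmetic; the retention values are examples and both periods are
assumed to be set):

.. code-block:: python

   from datetime import datetime, timedelta

   # Illustrative timeline for a collection item removed on 2024-01-01.
   removed_at = datetime(2024, 1, 1)
   full_history_retention_period = timedelta(days=30)
   metadata_only_retention_period = timedelta(days=335)

   # Until this point the item still links to its artifact:
   artifact_link_dropped = removed_at + full_history_retention_period
   # After this point even the metadata-only history is gone:
   item_deleted = artifact_link_dropped + metadata_only_retention_period

   print(artifact_link_dropped)  # 2024-01-31: the artifact may now be expired
   print(item_deleted)           # 2024-12-31: the collection item is deleted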

.. _explanation-workspaces:

Workspaces
==========

A Workspace is a concept tying together a set of Artifacts and
a set of Users. Since Artifacts have to be stored somewhere, Workspaces
also tie together a set of FileStores where files can be stored.

Workspaces have the following important properties:

* public: a boolean indicating whether the Artifacts are publicly
  accessible or restricted to the users belonging to the workspace
* default_expiration_delay: the minimum time (in days) that a new
  artifact is kept in the workspace before being expired. This value
  can be overridden on the artifact afterwards. If this value is 0,
  then Artifacts are never expired until they are manually removed.
* default_file_store: the default FileStore where newly uploaded files
  are stored.

Workspaces are also the entities upon which access control rules
are built. Each workspace has a set of users that can have four different
levels of access:

* read-only access: can access all objects within the workspace but not
  make any change
* upload-only access: same as read-only but can also create new artifacts
  and modify their own artifacts
* read-write access: can access all objects within the workspace and
  modify them, even those created by others
* admin access: same as read-write but can also add or remove workspace
  users, and change generic properties of the workspace itself
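
A hedged sketch of these properties and access levels (the class and
attribute names are illustrative, not the actual Django models):

.. code-block:: python

   from dataclasses import dataclass, field

   ACCESS_LEVELS = ("read-only", "upload-only", "read-write", "admin")

   @dataclass
   class Workspace:
       """Illustrative workspace with its default settings and user roles."""

       name: str
       public: bool = False
       default_expiration_delay: int = 30  # days; 0 means never expire automatically
       default_file_store: str = "default"
       users: dict[str, str] = field(default_factory=dict)  # user -> access level

   ws = Workspace(name="debian-lts", public=True)
   ws.users["alice"] = "admin"
   ws.users["bob"] = "upload-only"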

.. _explanation-tasks:

Tasks
=====

Tasks are time-consuming operations that are typically offloaded to
dedicated workers. They consume artifacts as input and generate artifacts
as output. The generated artifacts automatically have *built-using*
relationships linking to the artifacts used as input.

Tasks can require specific features from the workers on which they will
run. This will be used to ensure things like:

* architecture selection (when managing builders on different
  architectures)
* required memory amount
* required free disk space amount
* availability of a specific build chroot
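
As an illustration of this matching (the feature keys below are
hypothetical, not Debusine's actual worker metadata):

.. code-block:: python

   # Illustrative matching of a task's requirements against the features
   # advertised by a worker; the keys shown are hypothetical.
   worker_features = {
       "architecture": "amd64",
       "memory_mb": 16384,
       "free_disk_gb": 200,
       "chroots": {"debian/bookworm", "debian/sid"},
   }

   task_requirements = {
       "architecture": "amd64",
       "memory_mb": 4096,
       "free_disk_gb": 20,
       "chroot": "debian/sid",
   }

   def can_run(worker: dict, task: dict) -> bool:
       return (
           worker["architecture"] == task["architecture"]
           and worker["memory_mb"] >= task["memory_mb"]
           and worker["free_disk_gb"] >= task["free_disk_gb"]
           and task["chroot"] in worker["chroots"]
       )

   assert can_run(worker_features, task_requirements)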

Each category of task specifies whether it should run on a
``debusine-worker`` instance or on a shared server-side Celery worker. The
latter must be used only for tasks that do not execute any user-supplied
code, and it provides direct access to the Debusine database.

Tasks that run on ``debusine-worker`` instances are required to use the
public API to interact with artifacts. They are passed a dedicated token
that has the proper permissions to retrieve the required artifacts and to
upload the generated artifacts.

Executor Backends
~~~~~~~~~~~~~~~~~

Debusine supports multiple virtualisation backends for executing tasks,
from lightweight containers (e.g. ``unshare``) to virtual machines (e.g.
``incus-vm``).

When tasks are executed in an executor backend, one of the task inputs is
an environment: an artifact containing a system image in which the task is
executed.
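
For example, a task's data might select a backend and an environment
roughly as follows (the keys and values are indicative only, not an exact
task schema):

.. code-block:: python

   # Illustrative task data selecting an executor backend and the
   # environment (system image) artifact to run in; indicative only.
   task_data = {
       "backend": "unshare",              # lightweight container backend
       "environment": "debian/bookworm",  # reference to a system-image artifact
       "host_architecture": "amd64",
   }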

.. _explanation-work-requests:

Work Requests
=============

Work Requests are the way Debusine schedules tasks to workers and monitors
their progress and success.

A Work Request contains the task information, the processing status, the
processing result, and, once the task has been scheduled, the worker that
has been allocated to run it.

Work requests may be part of a :ref:`workflow instance
<explanation-workflows>`, in which case they have a reference to that
workflow instance.

Some work requests run on a Celery worker with direct access to the Debusine
database, rather than on a less-privileged external worker.

.. _explanation-workflows:

Workflows
=========

Workflows are advanced server-side logic entirely driven by code. They can
trigger tasks, analyze their results, and use the API to create/modify
artifacts. They often have an associated web interface that lets users
inspect the results and/or provide input at some of the steps.

Workflows can be started by users or external events, through the web
interface or through the API.

Workflows can only be started if they have been registered by an admin
in a workspace. This process:

* grants a unique name to the workflow so that it can be easily identified
  and started by users
* defines all the input parameters that will not change between runs of
  the registered workflow

The input parameters that are not set during registration are called
run-time parameters, and they have to be provided by the user who starts
the workflow. These parameters are stored in a WorkflowInstance model that
is used for the whole duration of the process controlled by the workflow.

Workflow instances have the following properties:

* status: running / aborted / completed
* result: success / failure / error / neutral
* parameters: dictionary of run-time parameters
* secrets: dictionary of secret run-time parameters; these are accessible to
  running tasks, but their values are not shown in the web interface
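
As an illustration (field names and values are indicative only), a running
workflow instance might look like this:

.. code-block:: python

   # Illustrative snapshot of a workflow instance.
   workflow_instance = {
       "status": "running",  # running / aborted / completed
       "result": None,       # success / failure / error / neutral, set on completion
       "parameters": {
           "suite": "bookworm",
           "architectures": ["amd64", "arm64"],
       },
       "secrets": {
           # accessible to running tasks, but not shown in the web interface
           "signing-key-passphrase": "********",
       },
   }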

They additionally have a directed acyclic graph, each of whose nodes is a
work request. When executed, each work request has access to a snapshot of
the state of the graph at the time it started, as well as to all artifacts
generated by all previously-completed work requests in the same workflow.
This allows tasks to make use of the output of earlier tasks, and allows
tasks that create other work requests to avoid creating duplicates.

Work requests in workflow instances have a list of dependencies on other
work requests in the same workflow instance; they must not be scheduled for
execution until all of their dependencies have completed successfully.

The graph of work requests in a workflow instance is not static. Server-side
tasks may add additional work requests to a workflow instance, perhaps after
analyzing the results of previously-completed work requests.

Once a workflow instance has completed, its remaining lifetime is
controlled by its expiration date and/or tied to the existence of some
associated artifact.

To reduce the risk of accidental disclosure, Debusine should make a best
effort to redact the values of secrets from log files produced as outputs
from tasks.

To begin with, workflows may only be registered based on templates that
specify their initial structure and that are hardcoded in Debusine. We
expect to add a more flexible method for registering workflows once we have
more experience with them.

An example use case of workflows is as follows:

* Package build: source upload → sbuild → { binary upload, lintian, blhc,
  autopkgtest, autopkgtests of reverse-dependencies, piuparts, reprotest }.
  The reverse-dependencies whose autopkgtests should be run cannot be
  identified until the sbuild task has completed, so this would be
  implemented using a server-side task that analyzes the output of sbuild
  and adds corresponding additional work requests that depend on it.
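
A rough sketch of how such a graph could be assembled (the data structures,
helper, and package names are illustrative, not Debusine's scheduler):

.. code-block:: python

   # Illustrative work-request graph for the package build example; each
   # work request records the requests it depends on.
   work_requests: dict[str, dict] = {}

   def add_work_request(name: str, task: str, depends_on: tuple = ()) -> None:
       work_requests[name] = {"task": task, "depends_on": list(depends_on)}

   add_work_request("sbuild", task="sbuild")
   for qa_task in ("lintian", "blhc", "autopkgtest", "piuparts", "reprotest"):
       add_work_request(qa_task, task=qa_task, depends_on=("sbuild",))

   # A server-side task running after sbuild could inspect its output and
   # extend the graph with autopkgtests of reverse-dependencies:
   def add_reverse_dependency_tests(reverse_dependencies: list) -> None:
       for package in reverse_dependencies:
           add_work_request(
               f"autopkgtest-{package}", task="autopkgtest", depends_on=("sbuild",)
           )

   add_reverse_dependency_tests(["gnome-hello", "hello-traditional"])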
