Backend settings#

See Backends for a top-level overview. Backend settings are controlled via the opts.backend section of the configuration namespace, and can be augmented on a per-recipe, per-cab and per-step basis, by defining a separate backend section therein with the required subset of settings.

The backend section defines a separate sub-section for each backend, described below, as well as a few top-level options:

  • select: a string or a list of strings specifying the backend(s) to use. The first available backend listed will be used (note that the native backend is always available, while Singularity and Kubernetes are contingent on the respective packages being installed).
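
    For example, to prefer Singularity, falling back to the native backend if Singularity is not installed:

    backend:
        select: [singularity, native]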

  • default_registry: the registry to use if no registry is specified in a cab’s image definition.
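
    For example, assuming your images are hosted under quay.io/stimela2:

    backend:
        default_registry: quay.io/stimela2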

  • override_registries can be used to replace one image registry with another. This is useful if you have a local registry that functions as a pull-through cache. For example:

    backend:
        override_registries:
            quay.io/stimela2: 800133935729.dkr.ecr.af-south-1.amazonaws.com/quay/stimela2
    
  • verbose: increasing this number produces more log messages from the backends – useful for debugging.

  • rlimits can be used to set various resource limits during the run. E.g. to increase the max number of open files, use:

    backend:
        rlimits:
            NOFILE: 10000
    

    See https://docs.python.org/3/library/resource.html for details: all of the symbols starting with RLIMIT_ are recognized and applied. Note that rlimits only apply to the native and Singularity backends running locally – Kubernetes and Slurm have their own resource management options.

See also comments in the source code for more information.

Native backend settings#

The native backend has only a couple of settings:

backend:
    native:
        enable: true
        virtual_env: ~/venvs/my_venv

It is enabled by default. The optional virtual_env setting activates a Python virtual environment before running commands. This can be useful to tweak on a per-cab basis, when playing with experimental cabs.
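
For example, a per-cab override (the cab name and path here are purely illustrative) might look like this:

cabs:
    my-experimental-cab:
        backend:
            native:
                virtual_env: ~/venvs/experimental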

Singularity backend settings#

The Singularity backend has the following settings:

backend:
    singularity:
        enable: true
        image_dir: ~/.singularity
        auto_build: true
        rebuild: false
        executable: path
        remote_only: false

The backend is enabled by default, if the singularity executable (or whatever is specified by the executable setting) is found in the path. Set enable to false to disable.

Singularity works with local copies of application images (in SIF format) that can be built from Docker-format images served by a remote Docker registry. The image_dir setting determines where these SIF images are cached. If auto_build is set, Stimela will attempt to build any missing Singularity images on-demand. If rebuild is set, it will rebuild images anew even if they are already present.

Note that in a cluster environment, it may be useful to disable auto_build and work with prebuilt images only. The stimela build command can be used to pre-build images.
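
For example, a cluster configuration that relies on pre-built images only might use:

backend:
    singularity:
        auto_build: false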

Finally, remote_only tells Stimela to not bother checking for a local install of Singularity. This can be useful in combination with Slurm, if the login node (or whatever node Stimela is executed on) does not support Singularity, but the compute nodes on which jobs are scheduled do.
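
A minimal sketch of such a setup, combining remote_only with the Slurm wrapper described in the next section:

backend:
    select: singularity
    singularity:
        remote_only: true
    slurm:
        enable: true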

Slurm wrapper settings#

Slurm is a wrapper, not a backend per se. It can be used in combination with the native and Singularity backends to schedule steps as Slurm jobs (using srun). Enabling it can be as simple as setting enable to true:

backend:
    slurm:
        enable: true
        srun_path:              # optional path to srun executable
        srun_opts: {}           # extra srun options
        srun_opts_build: {}     # extra srun options for build commands
        build_local: true

provided you’re running in a cluster environment where Slurm is configured. Instead of running a step locally, Stimela then invokes srun to pass the job off to Slurm, and waits for srun to finish.

A typical usage scenario is running Stimela on the cluster login (head) node, in a persistent console session (using tmux or screen). The Stimela process itself is pretty lightweight and can be executed on the login node, while every step of the workflow is passed off to Slurm.

The srun command has a veritable cornucopia of options controlling all aspects of job and resource management. Any of these can be configured here: Stimela will blindly pass through the contents of the srun_opts mapping (prepending a double-dash to each mapping key). An example of using this feature to tweak CPU and RAM allocation is discussed here.
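
For instance, to request 16 CPUs and 64G of RAM per task (these are standard srun options; the values are illustrative):

backend:
    slurm:
        srun_opts:
            cpus-per-task: 16
            mem: 64G

Stimela passes these through as --cpus-per-task 16 --mem 64G on the srun command line.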

If Singularity images need to be built, Stimela will schedule the singularity build command via srun as well, unless build_local is set to true (the default), in which case singularity build executes on the same node that Stimela is running on. If builds are done via srun, you can control their options via the srun_opts_build mapping; if this is not provided, srun_opts is used instead.
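
For example, to schedule builds via Slurm with more modest resources than the compute jobs (values illustrative):

backend:
    slurm:
        build_local: false
        srun_opts_build:
            mem: 8G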

Kubernetes backend settings#

The Kubernetes backend can be pretty arcane to configure, and is still under active development at the time of writing. The best reference for its options is the comments in the source code. Here are some settings from a working example:

opts:
    backend:
        kube:
            context: osmirnov-rarg-test-eks-cluster         # k8s context to run in, this determines which cluster to connect to etc.

            debug:  # options useful during debugging
                verbose: 0
                log_events: 1                               # logs all k8s events to Stimela
                save_spec: "kube.{info.fqname}.spec.yml"    # saves pod manifests for inspection

            dir: /mnt/data/stimela-test                     # directory in which the workflow runs

            volumes:   # this defines filesystem volumes of each pod
                rarg-test-compute-efs-pvc:                  # this is a k8s PersistentVolumeClaim
                    mount: /mnt/data                        # ...which is mounted here in the pod
                    at_start: must_exist

            provisioning_timeout: 0                         # timeout (secs) to start a pod before giving up, 0 waits forever
            connection_timeout: 5                           # timeout (secs) to restore lost connection

            # this is the UID/GID that the pod will run as
            user:
                uid: 1000
                gid: 1000

            # RAM limit -- should be tweaked per-cab and per-step, really
            memory:
                limit: 16Gi

            # some predefined pod specs. Keys are labels -- content is determined by the k8s cluster administrator
            predefined_pod_specs:
                admin:
                    nodeSelector:
                        rarg/node-class: admin
                thin:
                    nodeSelector:
                        rarg/node-class: compute
                        rarg/instance-type: m5.large
                medium:
                    nodeSelector:
                        rarg/node-class: compute
                        rarg/instance-type: m5.4xlarge
                fat:
                    nodeSelector:
                        rarg/node-class: compute
                        rarg/instance-type: m6i.4xlarge

            # default pod type to use -- must be in predefined_pod_specs
            job_pod:
                type: admin

            # start a dask cluster along with the pod, if enabled
            dask_cluster:
                enable: false
                num_workers: 4
                name: qc-test-cluster
                threads_per_worker: 4
                worker_pod:
                    type: thin
                scheduler_pod:
                    type: admin


## some cab-specific backend tweaks
cabs:
    breizorro:
        backend:
            kube:
                job_pod:               # don't need a big pod for breizorro
                    type: thin
                memory:
                    limit: 3Gi
    wsclean:
        backend:
            kube:
                job_pod:               # wsclean could do with a big pod
                    type: fat
                memory:
                    limit: 64Gi
    quartical:
        backend:
            kube:
                dask_cluster:           # enable Dask cluster for QuartiCal
                    enable: true
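
Per-step overrides follow the same pattern. As a hypothetical sketch (recipe, step and parameter values are illustrative), a single heavy imaging step could be given a larger memory limit:

my-recipe:
    steps:
        make-image:
            cab: wsclean
            backend:
                kube:
                    memory:
                        limit: 128Gi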

Bat country! Backend settings and substitutions#

Backend settings are amenable to substitutions and formula evaluations, in a somewhat limited way. Only string-type settings support substitutions and formulas. (Note also that at image build time, only the info namespace is available.)

Like everything else in the Stimela config namespace, the global backend settings may be manipulated via assign-sections. For example:

my-recipe:
    inputs:
        ncpu: int = 16
    assign:
        config.opts.backend.slurm.srun_opts.cpus-per-task: =recipe.ncpu

We can only recommend this feature to ninja-level users hacking on development or experimental workflows. Use with great caution, as great confusion may ensue! Also, this hardly promotes reproducible and portable recipes.