Cab definition reference#
A cab represents an atomic task that can be executed by Stimela. Cabs come in a number of flavours: executables, Python functions, direct Python code snippets, and CASA tasks. Cabs have inputs and outputs defined by their schema.
Basics of cab definitions#
A cab is defined by an entry in the cabs
section of the global namespace. Alternatively, a cab definition may be given
directly inside’s a step’s cab
section. For example, the following two are equivalent:
cabs:
cp:
command: /bin/cp
my-recipe:
steps:
copy:
cab: cp
and:
my-recipe:
steps:
copy:
cab:
command: /bin/cp
…though the former is probably more useful if you plan to invoke the cab more than once.
Here is a typical cab definition from the cult-cargo package – this one is for the breizorro
masking tool:
cabs:
breizorro:
command: breizorro
image:
_use: vars.cult-cargo.images
name: breizorro
policies:
replace: {'_': '-'}
inputs:
restored-image:
dtype: File
mask-image:
dtype: File
merge:
dtype: Union[str, List[str]]
subtract:
dtype: Union[str, List[str]]
threshold:
dtype: float
default: 6.5
dilate:
dtype: int
number-islands:
dtype: bool
...
outputs:
mask:
dtype: File
nom_de_guerre: outfile
required: true
This illustrates a number of imporant points:
all cabs have a
command
attribute. In this case, it specifies the command that must be invoked;the
image
mapping specifies an image (as name, version and registry – the latter two attributes being reused from a global cult-cargo variable). The image setting takes effect if a containerized backend such as Singularity or Kubernetes is in use;the
policies
section defines how names of inputs and outputs are mapped into command-line arguments of the breizorro command. There are many rich options for this, seehe Parameter policies reference for details. In this case, we are merely telling Stimela to turn underscores into dashes.the
inputs
andoutputs
sections specify the IOs of the cab. These use the standard schema language.
Other cab properties that you may come across are:
a
parameter_passing
property determines how inputs are passed to the cab. The default isargs
, i.e. they are mapped to command-line arguments using specified policies. (The rather exotic alternative isyaml
, which passes inputs as a YaML string via the first command-line parameter. This is not used anywhere at time of writing, but is retained for historical reasons.)a
backend
section allows you to specify a non-default backend for the cab, or to tweak backend options. See Backend settings for details.a
management
section, explained below.
Advanced cab features#
The management
section can specify some interesting cab behaviours. Here is a real-life example from cult-cargo:
casa.flagsummary:
info: Uses CASA flagdata to obtain a flag summary
command: flagdata
flavour: casa-task
image:
_use: vars.cult-cargo.images
name: casa
inputs:
ms:
dtype: MS
required: true
nom_de_guerre: vis
spw:
dtype: str
default: ""
mode:
implicit: 'summary'
outputs:
percentage:
dtype: float
management:
wranglers:
'Total Flagged: .* Total Counts: .* \((?P<percentage>[\d.]+)%\)':
- PARSE_OUTPUT:percentage:float
- HIGHLIGHT:bold green
This demonstrates the use of output wranglers. Wranglers tell Stimela to trigger certain actions based on seeing certain patterns of text in the cab’s console output (i.e. stdout/stderr). This can be a very powerful way to wrangle (pun intended) information out of third-party packages, or even just to prettify their console output.
Output wranglers#
The (entirely optional) management.wranglers
section consists of a mapping. The keys of the mapping are regular expressions (often containing named groups – via the (?P<name>)
construct – see Python re
module for documentation). These are matched to every line of the cab’s console output. The values of the mapping are lists of wrangler actions, which are applied to each matching line one by one. The following actions are currently implemented:
PARSE_OUTPUT[:name]:groupname:type
converts the text matched by the named group to the given type, and returns it as the named output of the cab. In the example above, we use this to extract the flag percentage out of the CASA task’s output.HIGHLIGHT:style
applies a Rich text style when dispaying the matching line (e.g. to draw the user’s attention). The above example highlights the output line that is reporting the flag percentage.REPLACE:text
replaces the entire text matching the regex by the specified replacement text (which can reference named groups from the regex – see Pythonre.sub()
for details).SEVERIY:level
issues the output line to the logger at a given severity level (i.e.warning
,error
), as opposed to the default level (which is normallyinfo
).SUPPRESS
suppressed the matching line from the output entirely.WARNING:message
notes a warning message, which will be displayed by Stimela when the cab is finished running.ERROR[:message]
declares an error condition. The cab’s run will be marked as a failure, even if its exit code indicates success. An optional error message can be supplied.DECLARE_SUCCESS
declares the cab’s run a success, even if a non-zero exit code is returned.
Two other actions can be used to parse out output values in a specific way (Stimela uses these internally to pass information out of some specific cab flavours – see below – but they’re also available to all user-defined cabs):
PARSE_JSON_OUTPUTS
parses the text matching each named group in the regex as JSON, and associates the resulting value with an output of the same name.PARSE_JSON_OUTPUT_DICT
parses the text matching the first ()-group in the regex as JSON. The result is expected to be adict
, whose keys are assigned to outputs of matching names.
Other management features#
The optional management.environment
section can be used to tell Stimela to set up some specific environment variables before invoking a cab.
The management.cleanup
section can be used to specify a list of filename patterns that need cleaning up after the cab has been run. Use this if the underlying tool generates some junk output files you don’t want to keep (the cleanup feature is currently not implemented as of 2.0, but will be implemented in a future version).
Cab flavours#
A cab can also correspond to a Python function or a CASA task. This is specified via the flavour
attribute – we saw an example of this just above with the casa.flagsummary
cab. Its definition tells Stimela that the cab is implemented by invoking a CASA task underneath. Other flavours are python
(for Python functions) and python-code
(for inline Python code). The default flavour, corresponding to a binary command, is called binary
.
Specifying flavour options#
An alternative way to specify flavours is to make flavour
a sub-section, and use its kind
attribute to specify the flavour. This then allows for some flavour-related options to be specified:
cabs:
casa.flagman:
info: "Uses CASA flagmanager to save/restore/list flagversions"
command: flagmanager
flavour:
kind: casa-task
path: /usr/local/bin/casa
opts: [--nologger]
log_full_command: true
The above tells Stimela to use a non-default CASA intepreter, and to pass it specific extra options on the command line. The log_full_command: true
setting will cause the complete command line, including the encoded inputs (as opposed to just the command name) to be logged when the cab is invoked, which can be useful for debugging.
Note that the CASA path and option settings can also be defined globally via Stimela configuration.
Callable flavours#
The casa-task
and python
flavours are very similar, in that they both invoke an external interpreter, and call a function within. The command
field of the cab then names a CASA task, or a Python callable (using the normal package.module.function
Python naming). In the latter case, package.module
will be imported using the normal Python mechanisms: this can refer to a standard Python module, or your own code (in which case it must be installed appropriately so the import statement can find it.)
Arguments to the function or task are described using the normal can inputs/outputs schema; Stimela will convert these appropriately and invoke the function or task. The return value of the function can be treated as a cab output and propagated out to Stimela. Here is a notional example:
cabs:
get-load-avg:
info: "returns the system load averages using Python's os.getloadavg() function"
flavour:
kind: python
output: load
command: os.getloadavg
outputs:
load: Tuple[float, float, float]
The flavour.output
option here _names_ an output (“load”), and the outputs schema decribes what data type to expect.
What if you would like to provide some Python code returning several outputs? This can be done by having your function return a dict
, setting the flavour.output_dict
option to true, and providing an outputs schema. In this case, the returned dict is expected to contain a key for every output named in the schema.
Inline Python code#
The python-code
flavour allows for snippets of Python code to be specified directly in the cab:
cabs:
simple-addition:
info: "returns c=a+b"
flavour: python-code
command: |
c = a + b
inputs:
a: float *
b: float *
outputs:
c: float
Note how we use abbreviated schemas here for succinctness, and the ": |"
feature of YaML, which starts a multiple-line string, and uses indentaton to detect where the string ends.
The operation of the python-code
flavour is quite intuitive. All inputs are converted into Python variables with the corresponding name, the Python code specified by command
is invoked, and any outputs are collected from Python variables of the corresponding name.
A few flavour attributes can be used to tweak this behaviour. If you would prefer to pass the inputs in as a dict
(keyed by input name), set input_dict: true
. If you define cab outputs but don’t want them to be picked up from Python variables for some reason (perhaps because you’re using output wranglers instead?), you can set output_vars: false
. Finally, unlike the other flavours, the command
field is by default not subject to {}-substitution (as this usually adds nothing but hassle to inline code), but this can be changed by setting subst: true
.
Bat country! Dynamic schemas#
Some cabs need to support variadic interfaces, in the sense that their set of available parameters can change depending on the settings of other parameters. Two specific examples are:
QuartiCal has a solver.terms input that determines the set of active Jones matrices. If this is set to, e.g.,
[G,B]
, a whole slew of options associated with G and B (G.type
,B.type
, etc.) becomes available.The structure of WSClean’s file outputs changes substantially depending on whether polarization imaging, MFS imaging, etc. is enabled or not.
Stimela supports these scenarios via the concept dynamic schemas. The dynamic_schema
attribute of the cab definition can be set to the name of a callable Python function (using the standard package.module.function
syntax). Here is an example from cult-cargo, and here is the corresponding function itself. The function takes three arguments: params
, inputs
and outputs
, and returns a tuple of inputs, outputs
that have been modified based on the contents of params
.
Dynamic schemas should be deployed with great care – the complexity can get quite confusing. Consider simpler alternatives. For example, if the underlying tool has clearly distinct “modes of operation” (i.e., a mode setting, and different subsets of parameters applicable to different modes), it can be much simpler to provide a separate cab definition for each mode, using an implicit mode parameter within each. Here is an example.