.. _my-reference-label:

Basic concepts
==============

TaskBlaster is a Python-based workflow utility with a customized workflow syntax.
Here we provide a short reference for the different concepts you will encounter when running the tutorial and when you develop your own workflows.


Tasks
------
Tasks are the smallest building block of the workflow and are represented by Python functions.
A task ideally contains a single computation of some quantity, such as a ground state calculation or structural relaxation.
Below is a simple example of a ground state task where the ground state energy is calculated using GPAW.

.. literalinclude:: gs.py

There are a few things that one can notice in the example above. First of
all, all input that is required for the task is provided as input to the Python function.
Similarly, all information needed to retrieve the information that was
computed by the task (in this case the Path to the gpw-file) is returned as output from the function.
This is different from workflows written using e.g. myqueue where the input/output
is handled by reading and writing to files whose paths are hardcoded in the
workflow. As we will see below, having explicit inputs/outputs for the tasks
makes it possible for TaskBlaster to automatically keep track of the dependencies
between different tasks and make sure that they are executed in the correct order.
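
The link between explicit inputs/outputs and execution order can be illustrated
without TaskBlaster. The following is a minimal sketch using only the Python
standard library. The task names and the ``dependencies`` mapping are
hypothetical, and this is not meant to describe how TaskBlaster is implemented
internally.

.. code-block:: python

   from graphlib import TopologicalSorter

   # Hypothetical tasks: each name maps to the set of tasks whose
   # output it takes as input.
   dependencies = {
       "relax": set(),
       "groundstate": {"relax"},
       "postprocess": {"groundstate"},
   }

   # A topological sort gives an order in which every task runs
   # after the tasks it depends on.
   order = list(TopologicalSorter(dependencies).static_order())
   print(order)  # ['relax', 'groundstate', 'postprocess']

Because each task declares what it consumes, the order follows from the data
alone and no file paths have to be hardcoded.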

Secondly, one can notice the decorator ``@tb.mpi``. By providing this decorator
you make sure that the correct communicator is used by the function (``mpi.comm``).
This is important when using subworkers, whose ``world`` is different from
the ``world`` of the main worker. More on this later.

Tasks can be stored in a file ``tasks.py`` in the main working directory or
can be imported from external packages. TaskBlaster can find these external
packages when you initialize the repository with the ``tb init <package_name>``
command.

Workflows
---------
A workflow is represented by a Python class with the decorator ``@tb.workflow``.
The different tasks are methods on the class. Below is an example of a workflow
for making a structural relaxation followed by a subsequent ground-state calculation.

.. literalinclude:: workflow.py
   :end-before: literalinclude-marker

Here all tasks are assumed to be located in the ``tasks.py`` file in the main
working directory; however, tasks can also be imported from external libraries (see tutorial).
The groundstate task is exactly the groundstate task above, while the relax
task is a Python function that takes an initial atoms object as input together with
parameters for the structural relaxation (``optimizer_params``) as well as input
parameters for the calculator used to compute the energy and forces (``calc_params_relax``).
The output of the ``relax`` task is the relaxed atoms.

There are a few things that one can notice here:

1. The decorator ``@tb.workflow``, which is needed for TaskBlaster to interpret
   the class as a workflow class.

2. Input arguments to the workflow are provided using the function ``tb.var()``.

3. Tasks in the workflow are methods with the decorator ``@tb.task``.

4. Tasks are methods on the class that return TaskBlaster nodes (``tb.node``).
   A node contains a reference to a Python function as a string as its first
   argument, followed by the input arguments to the function.

5. The groundstate task takes the output from the relax task as input
   (``atoms=self.relax``). This defines the dependencies between the tasks
   in an intuitive and implicit manner.

6. The mpi argument to the groundstate task is not explicitly given. This
   argument is provided automatically by using the ``@tb.mpi`` decorator in the
   task definition.
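
To make points 4 and 5 more concrete, here is a toy sketch of what "a node holds
a function name plus its inputs, and an input may itself be another node" can
look like. The ``Node`` class, the ``REGISTRY`` table and the ``evaluate``
helper are hypothetical and written only for this illustration. They are not
part of the TaskBlaster API.

.. code-block:: python

   from dataclasses import dataclass, field


   def relax(atoms):
       # Stand-in for a real relaxation: just tag the input.
       return f"relaxed({atoms})"


   def groundstate(atoms):
       # Stand-in for a real ground-state calculation.
       return f"gs({atoms})"


   # Hypothetical lookup table from function name to function.
   REGISTRY = {"relax": relax, "groundstate": groundstate}


   @dataclass
   class Node:
       target: str  # function name, stored as a string
       kwargs: dict = field(default_factory=dict)


   def evaluate(node):
       # Resolve the inputs first: an input that is itself a Node
       # is a dependency and has to be evaluated before this node.
       resolved = {
           name: evaluate(value) if isinstance(value, Node) else value
           for name, value in node.kwargs.items()
       }
       return REGISTRY[node.target](**resolved)


   relax_node = Node("relax", {"atoms": "H2"})
   gs_node = Node("groundstate", {"atoms": relax_node})
   result = evaluate(gs_node)
   print(result)  # gs(relaxed(H2))

Passing ``relax_node`` as an input is all that is needed to express that the
ground-state step depends on the relaxation.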

Subworkflows
------------
Apart from containing any number of tasks, a workflow can also contain
subworkflows. Let's say you want to write a workflow that performs the following steps:

1. Performs a structural relaxation

2. Performs a ground state calculation

3. Calculates the band structure (or performs some other post-processing)

One can then write a workflow (here simply called ``MyWorkflow``) that uses the
``RelaxGsWorkflow`` defined above as a subworkflow and has an additional postprocess task.

.. literalinclude:: workflow.py
   :start-after: literalinclude-marker

In the example above it was assumed that ``RelaxGsWf`` was in the same file as
``MyWorkflow``, but it can also be imported as a regular Python package.

Notice how the input to the ``postprocess`` task is defined as the output from
the ``groundstate`` task of the ``RelaxGsWf``.


Workers
-------
Tasks can be run directly (``tb run <tree/path/to/tasks>``) or submitted to
HPC resources. When submitting jobs to HPC resources, what you actually
submit are TaskBlaster *workers*. Each worker can pick up any number of tasks.
Because the dependencies between the tasks are explicit, TaskBlaster ensures that
the tasks are executed in the correct order. To ensure optimal use of resources
for small tasks, each worker can be further divided into *subworkers*, where each
subworker picks up a single task at a time. This makes it possible to submit a worker
(for a full node for e.g. 24h) which is divided into e.g. four subworkers, where
each subworker will pick up and execute tasks until the worker times out.
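
The division of one worker into subworkers can be pictured with a small toy
model. This is only a sketch: the task names, the number of subworkers and the
round-robin loop are assumptions made for the illustration, and time-outs are
not modelled.

.. code-block:: python

   from collections import deque

   # Hypothetical queue of small tasks waiting to be picked up.
   queue = deque(f"task{i}" for i in range(6))

   # One worker divided into two subworkers. Each subworker handles
   # a single task at a time and then picks up the next one.
   subworkers = {"subworker0": [], "subworker1": []}

   while queue:
       for name, finished in subworkers.items():
           if not queue:
               break
           # The subworker takes the next task and "executes" it.
           finished.append(queue.popleft())

   print(subworkers)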

States
------
How to run and submit tasks is explained in the tutorial. Here we provide a
quick reference to the different states that a task can be in.

new (n): A task that has been added to the tree.

queue (q): A task that is in the TaskBlaster queue. This task can be picked up by a
worker once its dependencies are met.

running (r): A task that has been picked up by a worker and is running.

done (d): A task that completed successfully is automatically given the
state done.

fail (F): A task that failed upon execution.

cancel (C): A task with failed parents.

A failed task can be restored to the ``new`` state by unrunning the task
(``tb unrun <task name>``). This will also remove the output from the task.
One can then resubmit the task and submit workers with more resources (if the
reason for the failure was a time-out or running out of memory). Alternatively, one can
change the input parameters to the workflow and rerun the task.
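
The states and the effect of a failure followed by an unrun can be summarized
in a small toy model. The one-letter codes are the ones listed above.
Everything else (the two-task chain and the ``fail`` and ``unrun`` helpers) is
hypothetical, and resetting the cancelled child to ``new`` on unrun is an
assumption of this sketch.

.. code-block:: python

   from enum import Enum


   class State(Enum):
       NEW = "n"
       QUEUE = "q"
       RUNNING = "r"
       DONE = "d"
       FAIL = "F"
       CANCEL = "C"


   # Hypothetical two-task chain: groundstate depends on relax.
   parent_of = {"relax": None, "groundstate": "relax"}
   states = {"relax": State.RUNNING, "groundstate": State.QUEUE}


   def fail(task):
       """Mark a task as failed and cancel the tasks that depend on it."""
       states[task] = State.FAIL
       for child, parent in parent_of.items():
           if parent == task:
               states[child] = State.CANCEL


   def unrun(task):
       """Reset a task to new (assumption: direct children are reset too)."""
       states[task] = State.NEW
       for child, parent in parent_of.items():
           if parent == task:
               states[child] = State.NEW


   fail("relax")
   after_fail = {name: state.value for name, state in states.items()}
   unrun("relax")
   after_unrun = {name: state.value for name, state in states.items()}

   print(after_fail)   # {'relax': 'F', 'groundstate': 'C'}
   print(after_unrun)  # {'relax': 'n', 'groundstate': 'n'}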

The graph below shows the most important states and how different commands
may change the state of a task.

.. graphviz:: tb-states.dot
   :align: center
   :caption: Different task states and how commands may affect those states.

Conflicts
---------
In an ideal world one would never change the input parameters for the workflow
during the calculations. However, during a high-throughput study it is quite
likely that one encounters unexpected situations which require some input
parameters to be changed, e.g. to improve the convergence for some materials. In
TaskBlaster this is handled by introducing an additional kind of state, the conflict state.
Tasks whose input parameters are changed are marked with the conflict state 'conflict'.
To allow for full flexibility this state is only provided as information to the user.
It means that the task was executed with a different input than what is
provided in the current workflow. The state of the task is still done and one
can continue to do calculations for the child tasks. However, if it is an
essential change the user can choose to unrun the task, which will recursively
remove the output from the task and all its descendants. If the user knows
that the conflict is acceptable it can be marked as 'resolved', which will change
the conflict state to resolved so that it is easy to distinguish from new conflicts.
See the :ref:`tutorial <conflict_tutorial>` for an explicit example.
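
The idea of a conflict as information that is kept separate from the task state
can be sketched as follows. Comparing digests of the inputs is an assumption
made for this illustration, as are the parameter names ``kpts`` and ``ecut``.
TaskBlaster's actual bookkeeping may work differently.

.. code-block:: python

   import hashlib
   import json


   def digest(inputs):
       # Serialize with sorted keys so that equal inputs give equal digests.
       text = json.dumps(inputs, sort_keys=True)
       return hashlib.sha256(text.encode()).hexdigest()


   # Inputs the (hypothetical) task was actually executed with.
   stored = {"state": "done", "digest": digest({"kpts": 4, "ecut": 400})}

   # Inputs that the edited workflow now specifies for the same task.
   current_inputs = {"kpts": 6, "ecut": 400}

   # The task state stays "done"; the conflict is tracked separately.
   has_conflict = digest(current_inputs) != stored["digest"]
   conflict_state = "conflict" if has_conflict else "none"
   print(stored["state"], conflict_state)  # done conflict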
