Resources

TaskBlaster provides several management tools so you can control when, where, and how your jobs run, handle failures, and manage the different resource requirements of your tasks.

In this tutorial, we will cover how to submit jobs to the TaskBlaster queue, how to tag tasks, and finally the different ways you can execute tasks, e.g. interactively or by submitting workers to the SLURM queue.

Handling states

TaskBlaster tasks have their own set of states to help you manage your workflow. You can view all available states using the tb stat command.

$ tb init
Created repository using module "taskblaster.repository" in "/home/myuser/tmprepo".
$ tb stat
new       0
queue     0
run       0
done      0
fail      0
partial   0
cancel    0

Tasks in the new state can have their inputs updated. Tasks in the queue state can be picked up by workers. You can place tasks into the queue with the following command:

$ tb submit
state    deps  tags        folder
────────────────────────────────────────────────────────

Submitted 0 tasks

Once a task has been picked up by a worker, the state is updated to run. If the job fails, TaskBlaster automatically updates the state to fail, and all tasks that depend on it are updated to the cancel state. If the task is successful, the state is updated to done. The partial state is reserved for dynamical tasks which have started and partially completed but have stopped before reaching the “finalizer” stage.

Unrunning a task in the fail state will put it back into the new state, along with any tasks in the cancel state that depend on it.
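
For example, assuming a failed task lives at tree/Ag/relax (a hypothetical path), you could reset it together with its cancelled dependents using the unrun command:

tb unrun tree/Ag/relax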

Submitting

The first task management tool is the TaskBlaster “queue”. In order for a job to be picked up by a worker, the task must be in the “queue” state. You can see the state of a task with tb ls <tree/or/tree/path>. Changing the state of a task from new to queue is achieved by executing the following command:

tb submit <path>

from within the TaskBlaster repository. Now the task is in the state queue. Once a TaskBlaster worker is submitted and starts, the worker will be able to see the task and pick it up.
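
For example, to queue every new task under a particular subtree (the path below is hypothetical) and then confirm its state:

tb submit tree/Ag
tb ls tree/Ag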

Note

The TaskBlaster queue is completely separate from the SLURM queue. Furthermore, unless tasks are tagged and a worker is submitted with a specific tag, tasks picked up by the workers are not associated with any specific SLURM job. Tasks are only assigned to a worker once the SLURM job starts and the worker picks up a task.

Tagging tasks

Tagging is a versatile way of pairing specific tasks, or sets of tasks, with specific workers when submitting a worker to the SLURM queue.

Managing failures

If you have a specific type of error related to resource allocation, you can target that failure string as follows:

tb tag -F FAILURE_STRING <tree/> --add TAG_FAILURE_NAME

and executing

tb tag | grep TAG_FAILURE_NAME

will list all the tasks with that tag.
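
For example, assuming some tasks failed with out-of-memory errors whose failure message contains MemoryError (a hypothetical failure string), you could tag and then list them with:

tb tag -F MemoryError tree/ --add memoryerror
tb tag | grep memoryerror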

To remove a tag from a task, you can specify the --untag argument:

tb tag -F FAILURE_STRING <tree/> --untag TAG_FAILURE_NAME

Targeting specific tasks

You can target specific tasks to tag by specifying the path to the task’s directory:

tb tag <tree/**/target_taskname> --add TAG_LABEL

where ** matches any number of intermediate directories under tree/, locating all target_taskname tasks in the tree.
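
For instance, assuming the tree contains relaxation tasks named relax (a hypothetical task name), you could tag all of them at once:

tb tag tree/**/relax --add relax_jobs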

Another way to utilize tags to manage resources is discussed below in the workers section.

Executing and submitting tasks

There are two ways to run a task in TaskBlaster.

  1. tb run <path>, which launches a worker interactively to execute queued tasks.

  2. tb workers submit, which interfaces with MyQueue to submit SLURM jobs.

1.1 Running tasks interactively

The first run mode can be broken down into two distinct modes: running interactively from the terminal (this section) and running inside a manually written SLURM script (the next section). In interactive mode, from within the TaskBlaster repo, you simply run tb run <path> and all queued tasks on that path will be picked up. The logging information is printed to the screen. This mode is useful for very fast tasks or if you are trying to debug code using ‘breakpoint()’.
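
For example, to interactively pick up all queued tasks under a single subtree (the path below is hypothetical):

tb run tree/Ag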

1.2 Submitting workers via SLURM scripts

The most familiar, albeit manual, way to submit a TaskBlaster worker is to create a SLURM script that loads the Python environment, changes to the top-level tree directory of an existing TaskBlaster repository, and then starts a worker with srun tb run. All tasks in the queue state are picked up and executed on the HPC resource by that worker. An example SLURM script is provided below.

#!/bin/bash -l
#SBATCH -J <job_name>
#SBATCH --output=<job_output>-%j.out
#SBATCH --time=0-01:00:00
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --partition=partition_name

# Load the Python environment in which TaskBlaster is installed
source /active/python/env/command

# Change to the top-level tree directory of the TaskBlaster repository
cd /path/to/root/tree

# Start a worker that picks up and executes queued tasks
srun tb run .
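
Assuming the script above is saved as worker.sh (a hypothetical file name), it is submitted with the standard SLURM command:

sbatch worker.sh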

This process is time consuming, as one would have to manually edit the submission script every time the resources, worker specification, or directory needs to change. A more streamlined way to submit jobs is to use TaskBlaster’s MyQueue interface, described below.

2.1 Setting up MyQueue

TaskBlaster is integrated with MyQueue, and with a bit of configuration you can efficiently use TaskBlaster’s MyQueue interface to handle the submission of workers to various HPC resource queues.

After installing MyQueue, execute the command:

mq config

which detects and prints a configuration file for your HPC resources to the terminal. You should copy and paste the text into .myqueue/config.py. MyQueue provides detailed documentation on configuration.
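
A minimal sketch of what .myqueue/config.py might contain is shown below; the scheduler name, node name, and core count are placeholders, and you should use the output of mq config for your own cluster instead:

config = {
    'scheduler': 'slurm',
    'nodes': [
        ('xeon24', {'cores': 24}),
    ],
}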

This allows you to specify resource requirements on the TaskBlaster command line with the -R flag, which is forwarded to MyQueue.

Now, when submitting a job with TaskBlaster, it will use MyQueue, and tb ls will display convenient runtime information obtained from MyQueue to report job statuses and failures, such as the information listed under the worker and time columns below:

state    deps  tags        worker        time     folder
───────────────────────────────────────────────────────────────────────────────
done     0/0               7654463-3/4   00:00:00 tree/Ag/material
done     1/1               7654902-0/4   00:00:00 tree/Ag/niggli_normalize
done     1/1               7654902-0/4   00:00:01 tree/Ag/reduce_to_primitive

2.2 Submitting workers

With MyQueue set up, workers can be submitted by specifying the resource flag -R followed by a MyQueue resource string; the example below requests 8 cores on a xeon24 node for 10 hours:

tb workers submit -R 8:xeon24:10h

An additional configuration file can be specified for associating workers with resources. You can think of this as a persistent resource-tag allocation, giving TaskBlaster a way to associate tags with resources at submission time.

To set this up you can execute the command:

tb workers config

which will display the location of the resources.py file used to define worker configurations. An example configuration file is shown below:

resources = {
    'alberich1': {'tags': {'hello'}},
    'alberich2': {'required_tags': {'eggs', 'spam'}},
    'mq_alberich1': {'tags': {'hello'}, 'resources': '1:alberich1:1m'},
    'superworker': {'tags': {'hello', 'highmem', 'gpu'}},
    'parallel_hello': {'resources': '40:hello:1h'},
    'parallel': {'resources': '40:1h'},
    'lowmem': {'tags': {'memoryerror'}, 'resources': '1:alberich1:1m'},
    'highmem': {},
}

And you can submit workers with resources specified in the resources.py file as follows:

tb workers submit parallel_hello
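
Continuing the failure-tagging example from earlier, you could likewise submit a worker from the lowmem entry, which is defined with the memoryerror tag and its own MyQueue resources:

tb workers submit lowmem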

Subworkers

You can run multiple tasks within a single SLURM job using TaskBlaster’s subworker commands. These options allow you to partition available resources by either specifying the number of CPUs per subworker with --subworker-size=num or setting the number of concurrent subworkers with --subworker-count=num.

Using subworkers requires a special communicator setup, and care must be taken when calling specific commands, as nested srun commands are not allowed. You can specify the number of subworkers, or the resources per subworker, in the following way:

tb workers submit -R 8:xeon24:10h --subworker-size=4

or:

tb workers submit -R 8:xeon24:10h --subworker-count=2

These two commands are equivalent: with 8 cores requested in total, both specify 2 subworkers with 4 processors each within a single “worker”, i.e. one SLURM job.