.. include:: system.rst

.. _het_modular_jobs:

Heterogeneous and Cross-Module Jobs
===================================

.. _het_modular_jobs_overview:

Overview
--------

.. _het_modular_jobs_slurm:

Slurm Support for Heterogeneous Jobs
------------------------------------
For detailed information about Slurm, please take a look at the :ref:`Quick Introduction <quickintro>` and :ref:`Batch system <batchsystem>` pages.

With Slurm 17.11, support for heterogeneous jobs was introduced. This makes it possible to spawn a job across multiple partitions of a cluster and across different Modules of our supercomputers. See the official Slurm documentation (SlurmHetJob_) for additional information on this feature.

.. _SlurmHetJob: https://slurm.schedmd.com/heterogeneous_jobs.html

**salloc/srun**

.. code-block:: none

  salloc -A <budget account> -p <batch, ...> : -p <booster, ...> [ : -p <booster, ...> ]
  srun ./prog1 : ./prog2 [ : ./progN ]
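
For example, a minimal interactive sketch; the account name, partition names, and node/task counts below are placeholders:

.. code-block:: none

  # account, partitions, and sizes are illustrative
  salloc -A budget01 -p batch -N 1 : -p booster -N 2
  srun -n 24 ./prog1 : -n 96 ./prog2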

**sbatch**

.. code-block:: none

  #!/bin/bash
  #SBATCH -A <budget account>
  #SBATCH -p <batch, ...>
  #SBATCH packjob
  #SBATCH -p <booster, ...>

  srun ./prog1 : ./prog2
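
Each pack job component can carry its own resource requests; a sketch with placeholder node counts, submitted with ``sbatch`` like any other batch script:

.. code-block:: none

  #!/bin/bash
  #SBATCH -A <budget account>
  #SBATCH -p <batch, ...>
  #SBATCH -N 2
  #SBATCH packjob
  #SBATCH -p <booster, ...>
  #SBATCH -N 4

  srun ./prog1 : ./prog2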


.. _het_modular_jobs_software:

Loading Software in a Heterogeneous Environment
-----------------------------------------------
Executing applications in a modular environment, especially when different Modules have different architectures or the dependencies of programs are not uniform, can be a challenging task.

**Uniform Architecture and Dependencies**

  As long as the architecture of the given Modules is uniform and there are no mutually exclusive dependencies for the binaries that are going to be executed, one can rely on the ``module`` command. Take a look at the :ref:`Quick Introduction <quickintro>` if ``module`` is new to you.

  .. code-block:: none

     #!/bin/bash -x
     #SBATCH ...
     module load [...]
     srun ./prog1 : ./prog2
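
  As a sketch, assuming the ``intel-para`` toolchain used in the example further below (module names differ between systems and stages):

  .. code-block:: none

     #!/bin/bash -x
     #SBATCH -A <budget account>
     #SBATCH -p <batch, ...>
     #SBATCH packjob
     #SBATCH -p <booster, ...>

     module load intel-para
     srun ./prog1 : ./prog2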

**Non-Uniform Architectures and Mutually Exclusive Dependencies**

  A tool called ``xenv`` was implemented to ease the task of loading modules for heterogeneous jobs. For details on supported command-line arguments, execute ``xenv -h`` on the given system.

  .. code-block:: none

      srun --account=<budget account> --partition=<batch, ...> xenv -L intel-para IMB-1 : --partition=<knl, ...> xenv -L Architecture/KNL intel-para IMB-1
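
  The same can be written as a batch script; a sketch reusing the modules from the command above:

  .. code-block:: none

      #!/bin/bash
      #SBATCH --account=<budget account>
      #SBATCH --partition=<batch, ...>
      #SBATCH packjob
      #SBATCH --partition=<knl, ...>

      srun xenv -L intel-para IMB-1 : xenv -L Architecture/KNL intel-para IMB-1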

.. ifconfig:: system_name == 'jureca'

.. _het_modular_jobs_mpi_bridges:

MPI Traffic Across Modules
--------------------------
When the nodes of a job belong to different interconnects and MPI communication is used, bridging has to take place. To support this workflow, e.g. running a job on a Cluster with InfiniBand and a Booster with Omni-Path, a Gateway Daemon (psgwd, ParaStation Gateway Daemon) was implemented that takes care of moving packets across fabrics.

To request gateway nodes for a job, the mandatory option ``--gw_num`` has to be specified at submission/allocation time. In addition, communication with the psgwd has to be ensured by loading the software module **pscom-gateway**, either via ``xenv`` or the ``module`` command.

**April 2019:** For the time being, prefixing binaries with ``msa_fix_ld`` is necessary due to a libmpi version that does not support the psgwd. We hope this requirement will go away soon.

To start an interactive pack job using two gateway nodes, the following command must be used:

.. code-block:: none

  srun -A <budget account> -p <batch, ...> --gw_num=2 xenv -L pscom-gateway msa_fix_ld ./prog1 : -p <booster, ...> xenv -L pscom-gateway msa_fix_ld ./prog2
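
As mentioned above, **pscom-gateway** can also be loaded with the ``module`` command instead of ``xenv``; a sketch, assuming the module is available in the environment of both job components:

.. code-block:: none

  module load pscom-gateway
  srun -A <budget account> -p <batch, ...> --gw_num=2 msa_fix_ld ./prog1 : -p <booster, ...> msa_fix_ld ./prog2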

When submitting a batch job that will run later, the number of gateways has to be specified at submission time:

.. code-block:: none

  sbatch --gw_num=2 ./submit-script.sbatch

.. code-block:: none

  #!/bin/bash
  #SBATCH -A <budget account>
  #SBATCH -p <batch, ...>
  #SBATCH packjob
  #SBATCH -p <booster, ...>

  srun xenv -L pscom-gateway msa_fix_ld ./prog1 : xenv -L pscom-gateway msa_fix_ld ./prog2

Debugging
~~~~~~~~~

For debugging purposes, and to make sure the gateways are actually used, you can set:

.. code-block:: none

  export PSP_DEBUG=3

You should see output like:

.. code-block:: none

  <PSP:r0000003:CONNECT (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>
  <PSP:r0000004:ACCEPT  (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>

JuRoPA3
~~~~~~~
Because JuRoPA3 has only one high-speed interconnect, using the ``psgwd`` is only possible with ``PSP_GATEWAY=2``. Exporting this variable boosts the priority of the gateway protocol over the default interconnect.

.. code-block:: none

  export PSP_GATEWAY=2
  srun -A <budget account> -p <cluster, ...> --gw_num=2 xenv -L pscom-gateway ./prog1 : -p <booster, ...> xenv -L pscom-gateway ./prog2

PSGWD
~~~~~
The psgw plugin for the ParaStation management daemon extends the Slurm commands ``salloc``, ``srun``, and ``sbatch`` with the following options:

.. code-block:: none

  --gw_file=path      Path to the gateway routing file
  --gw_plugin=string  Name of the route plugin
  --gw_num=number     Number of gateway nodes

A routing file will be generated in ``$HOME/psgw-route-$JOBID``. The routing file is
automatically removed when the allocation is revoked. With the option ``--gw_file``, an
alternative location for the routing file can be specified using an absolute path:

.. code-block:: none

  srun --gw_file=/home-fs/rauh/route-file --gw_num=2 -N 1 hostname : -N 2 hostname

The route plugin can be changed using the ``--gw_plugin`` option. Currently, only the
default plugin ``plugin01`` is available.

.. code-block:: none

  srun --gw_plugin=plugin01 --gw_num=2 -N 1 hostname : -N 2 hostname

If more gateways are requested than are available, the slurmctld prologue will fail for
interactive jobs:

.. code-block:: none

  srun --gw_num=3 -N 1 hostname : -N 2 hostname
  srun: psgw: requesting 3 gateway nodes
  srun: job 158553 queued and waiting for resources
  srun: job 158553 has been allocated resources
  srun: PrologSlurmctld failed, job killed
  srun: Force Terminated job 158553
  srun: error: Job allocation 158553 has been revoked

If batch jobs run out of gateway resources, they will be re-queued and have to wait
10 minutes before becoming eligible to start again.