.. include:: system.rst

.. _het_modular_jobs:

Heterogeneous and Cross-Module Jobs
===================================

.. _het_modular_jobs_overview:

Overview
--------

.. _het_modular_jobs_slurm:

Slurm Support for Heterogeneous Jobs
------------------------------------
For detailed information about Slurm, please take a look at the :ref:`Quick Introduction <quickintro>` and the :ref:`Batch system <batchsystem>` pages.

With Slurm 17.11, support for Heterogeneous Jobs was introduced. It allows a single job to span multiple partitions of a cluster, and thus different Modules of our supercomputers. See the official Slurm documentation (SlurmHetJob_) for additional information on this feature.

.. _SlurmHetJob: https://slurm.schedmd.com/heterogeneous_jobs.html

**salloc/srun**

.. code-block:: none

  salloc -A <budget account> -p <batch, ...> : -p <booster, ...> [ : -p <booster, ...> ]
  srun ./prog1 : ./prog2 [ : ./progN ]

**sbatch**

.. code-block:: none

  #!/bin/bash
  #SBATCH -A <budget account>
  #SBATCH -p <batch, ...>
  #SBATCH packjob
  #SBATCH -p <booster, ...>

  srun ./prog1 : ./prog2
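
Resource requests such as node counts are specified per job component, separated by ``#SBATCH packjob``. A minimal sketch with hypothetical node counts:

.. code-block:: none

  #!/bin/bash
  #SBATCH -A <budget account>
  #SBATCH -p <batch, ...>
  #SBATCH -N 1
  #SBATCH packjob
  #SBATCH -p <booster, ...>
  #SBATCH -N 2

  srun ./prog1 : ./prog2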


.. _het_modular_jobs_software:

Loading Software in a Heterogeneous Environment
-----------------------------------------------
Executing applications in a modular environment can be a challenging task, especially when different Modules have different architectures or the dependencies of the programs are not uniform.

**Uniform Architecture and Dependencies**

  As long as the architecture of the given Modules is uniform and there are no mutually exclusive dependencies for the binaries that are going to be executed, one can rely on the ``module`` command. Take a look at the :ref:`Quick Introduction <quickintro>` if ``module`` is new to you.

  .. code-block:: none

     #!/bin/bash -x
     #SBATCH ...
     module load [...]
     srun ./prog1 : ./prog2

**Non-Uniform Architectures and Mutually Exclusive Dependencies**

  A tool called ``xenv`` was implemented to ease the task of loading environment modules for heterogeneous jobs. For details on the supported command line arguments, execute ``xenv -h`` on the given system.

  .. code-block:: none

      srun --account=<budget account> --partition=<batch, ...> xenv -L intel-para IMB-1 : --partition=<knl, ...> xenv -L Architecture/KNL intel-para IMB-1

.. ifconfig:: system_name == 'jureca'

.. _het_modular_jobs_mpi_bridges:

MPI Traffic Across Modules
--------------------------
When the nodes of a job belong to different interconnects and MPI communication is used, bridging has to take place. To support this workflow, e.g. running a job on a Cluster with InfiniBand and a Booster with Omni-Path, a gateway daemon (psgwd, ParaStation Gateway Daemon) was implemented that takes care of forwarding packets across the fabrics.

Loading MPI
~~~~~~~~~~~

**JURECA Cluster**

  To enable communication with the psgwd, the software module **pscom-gateway** has to be loaded, either via ``xenv`` or via the ``module`` command.
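
  A minimal sketch of both variants (partition names are placeholders):

  .. code-block:: none

     # variant 1: load the module in the job script before calling srun
     module load pscom-gateway
     srun ./prog1 : ...

     # variant 2: load it only for the Cluster part of the pack job via xenv
     srun -p <batch, ...> xenv -L pscom-gateway ./prog1 : ...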

**JURECA Booster, Current MPI Workaround (April/May/... 2019)**

  For the time being, prefixing JURECA **Booster** binaries with ``msa_fix_ld`` is necessary, because the installed libmpi version does not support the psgwd. We hope this workaround will go away soon.
  ``msa_fix_ld`` modifies the environment, so it might influence the modules you load. The wrapper essentially does the following:


.. code-block:: none

  #!/bin/bash
  export PSP_PSM=1
  export LD_LIBRARY_PATH="/usr/local/jsc/msa_parastation_mpi/lib:/usr/local/jsc/msa_parastation_mpi/lib/mpi-hpl-gcc/:${LD_LIBRARY_PATH}"
  $*

Requesting Gateways
~~~~~~~~~~~~~~~~~~~

To request gateway nodes for a job, the mandatory option ``gw_num`` has to be specified at submit/allocation time (see the example after the following list).

- In total, 198 Gateways are available.
- The Gateways are exclusive resources; they are not shared across user jobs. This may change in the future.
- There is currently no enforced maximum on the number of Gateways per job, apart from the total number of Gateways. This may change in the future.
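
For example, an interactive allocation with two gateway nodes could be requested like this (a sketch; partition names are placeholders):

.. code-block:: none

  salloc -A <budget account> -p <batch, ...> --gw_num=2 : -p <booster, ...>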

Submitting Jobs
~~~~~~~~~~~~~~~
To start an interactive pack job using two gateway nodes, the following command can be used:

.. code-block:: none

  srun -A <budget account> -p <batch, ...> --gw_num=2 xenv [-L ...] -L pscom-gateway ./prog1 : -p <booster, ...> xenv [-L ...] msa_fix_ld ./prog2

When submitting a job that will run later, you have to specify the number of gateways at submit time:

.. code-block:: none

  sbatch --gw_num=2 ./submit-script.sbatch

where ``./submit-script.sbatch`` may look like this:

.. code-block:: none

  #!/bin/bash
  #SBATCH -A <budget account>
  #SBATCH -p <batch, ...>
  #SBATCH packjob
  #SBATCH -p <booster, ...>

  srun xenv [-L ...] -L pscom-gateway ./prog1 : xenv [-L ...] msa_fix_ld ./prog2


PSGWD
~~~~~

PSGWD Slurm Extension
+++++++++++++++++++++

The psgw plugin for the ParaStation management daemon extends the Slurm commands salloc, srun and sbatch with the following options:

.. code-block:: none

  --gw_num=number     Number of gateway nodes
  --gw_file=path      Path of the routing file
  --gw_plugin=string  Name of the route plugin

By default, a routing file is generated as ``$HOME/psgw-route-$JOBID``. With the option ``gw_file``, a user-defined absolute path for the routing file can be specified:

.. code-block:: none

  srun --gw_file=custom-path-to-routing-file --gw_num=2 -N 1 -n 1 hostname : -N 2 -n 2 hostname

PSGWD Routing
+++++++++++++

The routing of MPI traffic across the Gateway nodes is performed by the ParaStation Gateway daemon on a per-node-pair basis.
When a certain number of gateway nodes is requested, an instance of psgwd is launched on each gateway.
By default, given the list of Cluster and Booster nodes obtained at allocation time, the system assigns each Cluster node - Booster node pair to one of the previously launched psgwd instances.
This mapping between Cluster and Booster nodes is saved into the routing file and used for the routing of the MPI traffic across the gateway nodes.

**Currently not available, will be available again with the next update:**
Since creating a routing file by hand requires knowledge of the list of nodes prior to the allocation, it is usually more convenient to modify the logic with which the node pairs are assigned to the gateway daemons.
This can be done via the ``gw_plugin`` option:

.. code-block:: none

  srun --gw_plugin=$HOME/custom-route-plugin --gw_num=2 -N 1 hostname : -N 2 hostname

The ``gw_plugin`` option accepts either a label for a plugin already installed on the system or a path to a user-defined plugin.

Currently two plugins are available on the JURECA system:

* ``plugin01`` is the default plugin (used when the ``gw_plugin`` option is not specified).
* ``plugin02`` is better suited for applications that use point-to-point communication between the same pairs of processes between Cluster and Booster, especially when the number of gateway nodes used is low.

The plugin file must include the functions that associate a gateway node with a Cluster node - Booster node pair.
As an example, the code for ``plugin01`` is shown here:

.. code-block:: python

  # Route function: Given the numerical IDs of nodes in partition A and B, the function
  # returns a tuple (error, numeral of gateway)
  def routeConnectionS(sizePartA, sizePartB, numGwd, numeralNodeA, numeralNodeB):
      numeralGw = (numeralNodeA + numeralNodeB) % numGwd

      return None, numeralGw

  # Route function (extended interface): Make decision based on names of nodes to
  # take topology into account
  # def routeConnectionX(nodeListPartA, nodeListPartB, gwList, nodeA, nodeB):
  #     return Exception("Not implemented"), gwList[0]
  routeConnectionX = None


In the case of 2 Cluster nodes, 2 Booster nodes and 2 Gateway nodes, this function results in the following mapping:

============  ============  ============
Cluster node  Booster node  Gateway node
============  ============  ============
           0             0             0
           1             0             1
           0             1             1
           1             1             0
============  ============  ============
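
The mapping above can be reproduced by evaluating the route function of ``plugin01`` for each node pair (a small standalone Python sketch for illustration only, not part of the plugin):

.. code-block:: python

  # Reproduce the table above: 2 Cluster nodes, 2 Booster nodes, 2 gateway nodes
  def routeConnectionS(sizePartA, sizePartB, numGwd, numeralNodeA, numeralNodeB):
      return None, (numeralNodeA + numeralNodeB) % numGwd

  for nodeA in range(2):
      for nodeB in range(2):
          err, gw = routeConnectionS(2, 2, 2, nodeA, nodeB)
          print(nodeA, nodeB, gw)   # prints: 0 0 0, 0 1 1, 1 0 1, 1 1 0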


PSGWD Gateway Assignment
++++++++++++++++++++++++

If more gateways are requested than are available, the slurmctld prologue will fail for interactive jobs:

.. code-block:: none

  srun --gw_num=3 -N 1 hostname : -N 2 hostname
  srun: psgw: requesting 3 gateway nodes
  srun: job 158553 queued and waiting for resources
  srun: job 158553 has been allocated resources
  srun: PrologSlurmctld failed, job killed
  srun: Force Terminated job 158553
  srun: error: Job allocation 158553 has been revoked

If batch jobs run out of gateway resources, they will be re-queued and have to wait for 10 minutes before becoming eligible to start again.

Debugging
~~~~~~~~~

For debugging purposes, and to make sure that the gateways are actually used, you can set

.. code-block:: none

  export PSP_DEBUG=3

You should then see output like the following:

.. code-block:: none

  <PSP:r0000003:CONNECT (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>
  <PSP:r0000004:ACCEPT  (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>

JuRoPA3
~~~~~~~
Because JUROPA3 has only one high-speed interconnect, using the ``psgwd`` is only possible with ``PSP_GATEWAY=2``. Exporting this variable boosts the priority of the gateway protocol over the default interconnect.

.. code-block:: none

  export PSP_GATEWAY=2
  srun -A <budget account> -p <cluster, ...> --gw_num=2 xenv -L pscom-gateway ./prog1 : -p <booster, ...> xenv -L pscom-gateway ./prog2

JuRoPA3 has 4 Gateways available.