Like ``salloc`` and ``sbatch``, ``srun`` can be used to specify different options for each job component:
.. code-block:: none

   $ srun <options and command 0> : <options and command 1> [ : <options and command 2> ]
For example, in a heterogeneous job with two components, ``srun`` accepts up to two blocks of arguments and commands:
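A minimal sketch of such a two-component call could look like this (partition names, node counts and binaries are illustrative placeholders):

.. code-block:: none

   srun --account=<budget account> \
        --partition=<batch, ...> -N 2 ./prog_cluster : \
        --partition=<knl, ...> -N 4 ./prog_knl
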
For details on supported command line arguments, execute ``xenv -h`` on the given system.

.. code-block:: none

   srun --account=<budget account> \
        --partition=<batch, ...> xenv -L intel-para IMB-1 : \
        --partition=<knl, ...> xenv -L Architecture/KNL -L intel-para IMB-1
.. _het_modular_jobs_mpi_bridges:
MPI Traffic Across Modules
--------------------------
When the nodes of a job belong to different interconnects and MPI communication is used, bridging has to take place.
To support this workflow, e.g. running a job on a cluster with InfiniBand and a booster with Omni-Path, a gateway daemon (``psgwd``, the ParaStation Gateway Daemon) was implemented that takes care of forwarding packets across the fabrics.
Loading MPI
~~~~~~~~~~~
JURECA Cluster
++++++++++++++
To enable communication with the ``psgwd``, the software module **pscom-gateway** has to be loaded, either via ``xenv`` or via the ``module`` command.
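A minimal sketch of the two alternatives (assuming the module is visible in the currently loaded stage; the binary name is a placeholder) could look like this:

.. code-block:: none

   # inside a job script or an interactive session
   module load pscom-gateway

   # or as part of the srun command line
   srun ... xenv -L pscom-gateway ./prog1 : ...
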
JURECA Booster, Current MPI Workaround (July/... 2019)
++++++++++++++++++++++++++++++++++++++++++++++++++++++
For the time being, a specific version of ParaStationMPI from the Devel-2019a stage must be used.
This is due to the fact that the default ParaStationMPI version for the booster does not support the gateway protocol.
We hope this will go away soon.
To load the corresponding ParaStationMPI, use the code snippet below for the **booster** pack of the job.
.. code-block:: none

   srun ... : xenv -P -U /usr/local/software/jureca/OtherStages \
              -L Stages/Devel-2019a [-L ...] -L ParaStationMPI/5.2.2-1 <binary> [ : ... ]
Requesting Gateways
~~~~~~~~~~~~~~~~~~~
To request gateway nodes for a job, the mandatory option ``gw_num`` has to be specified at submission/allocation time (see the example after the list below).
- There are in total 198 gateways available.
- The gateways are exclusive resources; they are not shared across user jobs.
  This may change in the future.
- There is currently no enforced maximum on the number of gateways per job, other than the total number of available gateways.
  This may change in the future.
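For example, an allocation for a pack job with two gateway nodes could be requested as sketched below; partition names and node counts are placeholders, and ``salloc`` is assumed to accept the option in the same way as ``srun`` and ``sbatch`` (see the option list further down):

.. code-block:: none

   salloc --gw_num=2 -A <budget account> \
          -p <batch, ...> -N 1 : \
          -p <booster, ...> -N 2
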
Submitting Jobs
~~~~~~~~~~~~~~~
To start an interactive pack job using two gateway nodes, the following command has to be used:
.. code-block:: none

   srun --gw_num=2 -A <budget account> \
        -p <batch, ...> xenv [-L ...] -L pscom-gateway ./prog1 : \
        -p <booster, ...> xenv -P -U /usr/local/software/jureca/OtherStages \
        -L Stages/Devel-2019a [-L ...] -L ParaStationMPI/5.2.2-1 ./prog2
When submitting a job that will run later, you have to specify the number of gateways at submit time:
.. code-block:: none

   sbatch --gw_num=2 ./submit-script.sbatch
The corresponding submit script could look like this:

.. code-block:: none

   #!/bin/bash
   #SBATCH -A <budget account>
   #SBATCH -p <batch, ...>
   #SBATCH packjob
   #SBATCH -p <booster, ...>

   srun xenv [-L ...] -L pscom-gateway ./prog1 : xenv -P \
        -U /usr/local/software/jureca/OtherStages \
        -L Stages/Devel-2019a [-L ...] -L ParaStationMPI/5.2.2-1 ./prog2
PSGWD
~~~~~
PSGWD Slurm Extension
+++++++++++++++++++++
The psgw plugin for the ParaStation management daemon extends the Slurm commands ``salloc``, ``srun`` and ``sbatch`` with the following options:
**--gw_num=#**
   Number of gateway nodes that have to be allocated.

**--gw_file=path**
   Path of the routing file.

**--gw_plugin=string**
   Name of the route plugin.
Unless a different path is specified, the routing file is generated in the directory from which the job was submitted/started.
With the ``gw_file`` option, a user-defined absolute path for the routing file can be specified.
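For instance, to have the routing file written to a specific location, a call could look like this (the file path is an arbitrary example):

.. code-block:: none

   srun --gw_num=2 --gw_file=$HOME/my-routing-file -N 1 hostname : -N 2 hostname
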
PSGWD Routing
+++++++++++++
The routing of MPI traffic across the gateway nodes is performed by the ParaStation Gateway Daemon on a per-node-pair basis.
When a certain number of gateway nodes is requested, an instance of ``psgwd`` is launched on each gateway.
By default, given the list of cluster and booster nodes obtained at allocation time, the system assigns each cluster node - booster node pair to one of the previously launched ``psgwd`` instances.
This mapping between cluster and booster nodes is saved into a routing file and used for the routing of the MPI traffic across the gateway nodes.
The routing can be influenced via the ``gw_plugin`` option:
.. code-block:: none

   srun --gw_plugin=$HOME/custom-route-plugin --gw_num=2 -N 1 hostname : -N 2 hostname
The ``gw_plugin`` option accepts either a label for a plugin already installed on the system, or the path to a user-defined plugin.
Currently two plugins are available on the JURECA system:
* ``plugin01`` is the default plugin (used when the ``gw_plugin`` option is not specified).
* ``plugin02`` is better suited for applications that use point-to-point communication between the same pairs of cluster and booster processes, especially when the number of gateway nodes used is low.
The plugin file must include the functions that assign a gateway node to each cluster node - booster node pair.
As an example, the code for ``plugin01`` is reported here:
.. code-block:: python

   # Route function: Given the numerical IDs of nodes in partition A and B, the function
   # returns a tuple (error, numeral of gateway)
   def routeConnectionS(sizePartA, sizePartB, numGwd, numeralNodeA, numeralNodeB):
       numeralGw = (numeralNodeA + numeralNodeB) % numGwd
       return None, numeralGw

   # Route function (extended interface): Make the decision based on the names of the nodes
   # to take the topology into account
   # def routeConnectionX(nodeListPartA, nodeListPartB, gwList, nodeA, nodeB):
   #     return Exception("Not implemented"), gwList[0]
   routeConnectionX = None
In the case of 2 cluster nodes, 2 booster nodes and 2 gateway nodes, this function results in the following mapping:
============ ============ ============
Cluster node Booster node Gateway node
============ ============ ============
0 0 0
1 0 1
0 1 1
1 1 0
============ ============ ============
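The table can be reproduced with a small standalone script that evaluates the route function from ``plugin01`` for every node pair; this is only an illustrative sketch and not part of the plugin interface:

.. code-block:: python

   # Illustrative sketch: reproduce the mapping table above with plugin01's route function.
   def routeConnectionS(sizePartA, sizePartB, numGwd, numeralNodeA, numeralNodeB):
       # Spread the cluster node - booster node pairs round-robin over the gateways.
       numeralGw = (numeralNodeA + numeralNodeB) % numGwd
       return None, numeralGw

   num_cluster, num_booster, num_gateways = 2, 2, 2
   print("Cluster node  Booster node  Gateway node")
   for booster in range(num_booster):
       for cluster in range(num_cluster):
           err, gw = routeConnectionS(num_cluster, num_booster, num_gateways, cluster, booster)
           print(f"{cluster:12d}  {booster:12d}  {gw:12d}")
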
PSGWD Gateway Assignment
++++++++++++++++++++++++
If more gateways are requested than are available, the ``slurmctld`` prologue will fail for interactive jobs:
.. code-block:: none

   srun --gw_num=3 -N 1 hostname : -N 2 hostname
   srun: psgw: requesting 3 gateway nodes
   srun: job 158553 queued and waiting for resources
   srun: job 158553 has been allocated resources
   srun: PrologSlurmctld failed, job killed
   srun: Force Terminated job 158553
   srun: error: Job allocation 158553 has been revoked
If batch jobs run out of gateway resources, they will be requeued and have to wait for 10 minutes before becoming eligible to be scheduled again.
Debugging
~~~~~~~~~
For debugging purposes, and to make sure the gateways are used, you might use:

.. code-block:: none

   export PSP_DEBUG=3
You should see output like:

.. code-block:: none

   <PSP:r0000003:CONNECT (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>
   <PSP:r0000004:ACCEPT (192.168.12.34,26708,0x2,r0000003) to (192.168.12.41,29538,0x2,r0000004) via gw>
JUROPA3
~~~~~~~
Because JUROPA3 has only one high-speed interconnect, using the ``psgwd`` is only possible with ``PSP_GATEWAY=2``.
Exporting this environment variable raises the priority of the gateway protocol above that of the default interconnect.
.. code-block:: none

   export PSP_GATEWAY=2
   srun -A <budget account> \
        -p <cluster, ...> --gw_num=2 xenv -L pscom-gateway ./prog1 : \
        -p <booster, ...> xenv -L pscom-gateway ./prog2
JUROPA3 has 4 gateways available.