next iteration

parent 070c34d0
@@ -82,7 +82,7 @@ Loading MPI
**JURECA Booster, Current MPI Workaround (April/May/... 2019)**
For the time being, prefixing JURECA **Booster** binaries via ``msa_fix_ld`` is necessary. This is due to the fact that the installed libmpi version does not support the psgwd. We hope this will go away soon.
For the time being, prefixing JURECA **Booster** binaries with ``msa_fix_ld`` is necessary. This is due to the fact that the installed libmpi version does not support the psgwd. We hope this will go away soon.
``msa_fix_ld`` modifies the environment, so it may affect the modules you load.
@@ -96,7 +96,7 @@ Loading MPI
Requesting Gateways
~~~~~~~~~~~~~~~~~~~
To request gateway nodes for a job, the mandatory option ``gw_num`` has to be specified at submit/allocation time.
To request Gateway nodes for a job, the mandatory option ``gw_num`` has to be specified at submit/allocation time.
- A total of 198 Gateways are available.
- The Gateways are exclusive resources; they are not shared across user jobs. This may change in the future.
@@ -104,13 +104,13 @@ To request gateway nodes for a job, the mandatory option ``gw_num`` has to be sp
Submitting Jobs
~~~~~~~~~~~~~~~
To start an interactive pack job using two gateway nodes the following command must be used:
To start an interactive pack job using two Gateway nodes the following command must be used:
.. code-block:: none
srun -A <budget account> -p <batch, ...> --gw_num=2 xenv [-L ...] -L pscom-gateway ./prog1 : -p <booster, ...> xenv [-L ...] msa_fix_ld ./prog2
When submitting a job that will run later, you have to specify the number of gateways at submit time:
When submitting a job that will run later, you have to specify the number of Gateways at submit time:
.. code-block:: none
@@ -135,42 +135,40 @@ PSGWD Slurm Extension
The psgw plugin for the ParaStation management daemon extends the Slurm commands salloc, srun and sbatch with the following options:
.. code-block:: none
--gw_num=number Number of gateway nodes
--gw_file=path of the routing file
--gw_plugin=string Name of the route plugin
**--gw_num=#**
Number of Gateway nodes that have to be allocated.
A routing file will be generated in $HOME/psgw-route-$JOBID. With the option ``gw_file`` a user-defined absolute path for the generation of the routing file can be specified:
**--gw_file=path**
Path of the routing file.
.. code-block:: none
**--gw_plugin=string**
Name of the route plugin.
srun --gw_file=custom-path-to-routing-file --gw_num=2 -N 1 -n 1 hostname : -N 2 -n 2 hostname
As long as no other path is specified, the routing file will be generated in $HOME/psgw-route-$JOBID for every job.
With the option ``gw_file`` a user-defined absolute path for the generation of the routing file can be specified:
PSGWD Routing
+++++++++++++
The routing of MPI traffic across the Gateway nodes is performed by the ParaStation Gateway daemon on a per-node-pair basis.
When a certain number of gateway nodes is requested, an instance of psgwd is launched on each gateway.
When a certain number of Gateway nodes is requested, an instance of psgwd is launched on each Gateway.
By default, given the list of Cluster and Booster nodes obtained at allocation time, the system assigns each Cluster node - Booster node pair to one of the previously launched psgwd instances.
This mapping between Cluster and Booster nodes is saved into the routing file and used for the routing of the MPI traffic across the gateway nodes.
This mapping between Cluster and Booster nodes is saved into a routing file and used for the routing of the MPI traffic across the Gateway nodes.
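The idea of assigning node pairs to Gateway daemons can be illustrated with a minimal sketch. It assumes a plain round-robin distribution and uses made-up node names (``c1``, ``b1``, ``gw1``, ...) and a hypothetical function ``assign_round_robin``. It is not the logic of ``plugin01`` or ``plugin02``, and it does not reproduce the format of the generated routing file.

```python
from itertools import product


def assign_round_robin(cluster_nodes, booster_nodes, gateways):
    """Map every (cluster, booster) node pair to one gateway, round-robin.

    Hypothetical illustration only; the real assignment is done by the
    route plugin selected on the system.
    """
    if not gateways:
        raise ValueError("at least one gateway is required")
    routes = {}
    # Enumerate all Cluster node / Booster node pairs in a fixed order
    # and hand them out to the gateways one after another.
    for index, pair in enumerate(product(cluster_nodes, booster_nodes)):
        routes[pair] = gateways[index % len(gateways)]
    return routes


# Made-up node names, for illustration only.
routes = assign_round_robin(["c1", "c2"], ["b1", "b2"], ["gw1", "gw2"])
for (cluster, booster), gateway in routes.items():
    print(cluster, booster, gateway)
```

Each node pair ends up on exactly one Gateway, which matches the per-node-pair routing described above. How evenly real traffic is spread depends on the plugin actually in use.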
**Currently not available, will be available again with the next update:**
Since creating a routing file requires knowledge of the list of nodes prior to their allocation, it is more convenient to modify the logic with which the node pairs are assigned to the gateway daemons.
This can be done via the ``gw_plugin`` option:
The routing can be influenced via the ``gw_plugin`` option:
.. code-block:: none
srun --gw_plugin=$HOME/custom-route-plugin --gw_num=2 -N 1 hostname : -N 2 hostname
The ``gw_plugin`` option accepts either a label for a plugin already installed on the system, either a path to a user-defined plugin.
The ``gw_plugin`` option accepts either a label for a plugin already installed on the system, or the path to a user-defined plugin.
Currently two plugins are available on the JURECA system:
* ``plugin01`` is the default plugin (used when the ``gw_file`` is not used).
* ``plugin02`` is better suited for applications that use point-to-point communication between the same pairs of processes between Cluster and Booster, especially when the number of gateway nodes used is low.
* ``plugin02`` is better suited for applications that use point-to-point communication between the same pairs of processes between Cluster and Booster, especially when the number of Gateway nodes used is low.
The plugin file must include the functions associating a gateway node to a cluster node - booster node pair.
The plugin file must include the functions associating a Gateway node to a cluster node - booster node pair.
As an example, the code for ``plugin01`` is reported here:
.. code-block:: python
@@ -204,7 +202,7 @@ Cluster node Booster node Gateway node
PSGWD Gateway Assignment
++++++++++++++++++++++++
If more gateways were requested than available the slurmctld prologue will fail for a interactive jobs
If more Gateways are requested than are available, the slurmctld prologue will fail for interactive jobs:
.. code-block:: none
@@ -216,12 +214,12 @@ If more gateways were requested than available the slurmctld prologue will fail
srun: Force Terminated job 158553
srun: error: Job allocation 158553 has been revoked
If batch jobs run out of gateway resources they will be re-queued and have to wait for 10 minutes before becoming eligible to start again.
If batch jobs run out of Gateway resources they will be re-queued and have to wait for 10 minutes before becoming eligible to start again.
Debugging
~~~~~~~~~
For debugging purposes, and to make sure the gateways are used, you might use
For debugging purposes, and to make sure the Gateways are used, you might use
.. code-block:: none