Reorganizing/adding some psgwd blocks.

parent a3baed3e
@@ -73,7 +73,7 @@ MPI Traffic Across Modules
--------------------------
When the nodes of a job belong to different interconnects and MPI communication is used, bridging has to take place. To support this workflow, for example running a job on a Cluster with InfiniBand and a Booster with Omni-Path, a gateway daemon (psgwd, ParaStation Gateway Daemon) was implemented that takes care of moving packets across fabrics.

To request gateway nodes for a job, the mandatory option ``--gw_num`` has to be specified at submit/allocation time. In addition, communication with the psgwd has to be enabled by loading the software module **pscom-gateway**, either via ``xenv`` or the ``module`` command.

In total, 198 gateways are available.
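As a minimal sketch, a submit wrapper could sanity-check the requested gateway count against the 198 gateways mentioned above before calling Slurm. The function name ``check_gw_num`` and the variable ``MAX_GATEWAYS`` are hypothetical and not part of psgwd or Slurm. The check only catches obviously invalid requests, since the number of gateways free at a given moment may be lower than the total.

```shell
# Hypothetical pre-submit check; not part of psgwd or Slurm.
MAX_GATEWAYS=198   # total number of gateways stated in this section

check_gw_num() {
    # Reject empty input, non-digits, zero and numbers with leading zeros.
    case "$1" in
        ''|*[!0-9]*|0*)
            echo "gw_num must be a positive integer, got '$1'" >&2
            return 1 ;;
    esac
    # Reject requests larger than the total number of gateways.
    if [ "$1" -gt "$MAX_GATEWAYS" ]; then
        echo "gw_num=$1 exceeds the $MAX_GATEWAYS gateways in the system" >&2
        return 1
    fi
    return 0
}

# Possible usage, commented out so that sourcing this file has no side effects:
# module load pscom-gateway
# check_gw_num 2 && srun --gw_num=2 -N 1 hostname : -N 2 hostname
```

Keeping the limit in a single variable makes it easy to adjust if the number of gateways changes.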
@@ -114,6 +114,10 @@ For the time being, prefixing binaries via ``msa_fix_ld`` is necessary. This is
PSGWD
~~~~~
PSGWD Slurm Extension
+++++++++++++++++++++
The psgw plugin for the ParaStation management daemon extends the Slurm commands ``salloc``, ``srun`` and ``sbatch`` with the following options:

.. code-block:: none
@@ -122,23 +126,27 @@ The psgw plugin for the ParaStation management daemon extends the Slurm commands
    --gw_plugin=string  Name of the route plugin
    --gw_num=number     Number of gateway nodes
A routing file will be generated in ``$HOME/psgw-route-$JOBID``. The routing file is automatically removed when the allocation is revoked.

PSGWD Routing Plugins
+++++++++++++++++++++

With the option ``--gw_file``, an alternative location for the routing file can be specified as an absolute path:

.. code-block:: none

    srun --gw_file=/home-fs/rauh/route-file --gw_num=2 -N 1 hostname : -N 2 hostname
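To locate the routing file from within a script, the default location and the absolute-path requirement of ``--gw_file`` can be captured in a small helper. This is only a sketch under assumptions: the function name ``psgw_route_path`` is hypothetical, and ``$JOBID`` in the default file name is assumed to be the Slurm job ID (as exported in ``SLURM_JOB_ID``).

```shell
# Hypothetical helper; not part of psgwd or Slurm.
# Usage: psgw_route_path JOBID [CUSTOM_FILE]
psgw_route_path() {
    _psgw_jobid="${1:-}"
    _psgw_custom="${2:-}"
    if [ -n "$_psgw_custom" ]; then
        # A custom routing file must be given as an absolute path.
        case "$_psgw_custom" in
            /*) printf '%s\n' "$_psgw_custom" ;;
            *)  echo "--gw_file needs an absolute path, got '$_psgw_custom'" >&2
                return 1 ;;
        esac
    else
        # Default location described above: $HOME/psgw-route-$JOBID
        # (assumption: JOBID is the Slurm job ID).
        printf '%s/psgw-route-%s\n' "$HOME" "$_psgw_jobid"
    fi
}
```

Inside a running allocation, something like ``cat "$(psgw_route_path "$SLURM_JOB_ID")"`` could then be used to inspect the generated routes, provided the assumption about the job ID holds.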
The route plugin can be changed using the ``--gw_plugin`` option. Currently only the default plugin ``plugin01`` is available.

.. code-block:: none

    srun --gw_plugin=plugin01 --gw_num=2 -N 1 hostname : -N 2 hostname
PSGWD Gateway Assignment
++++++++++++++++++++++++

If more gateways are requested than are available, the slurmctld prologue will fail for interactive jobs:

.. code-block:: none
@@ -150,8 +158,7 @@ interactive jobs
    srun: Force Terminated job 158553
    srun: error: Job allocation 158553 has been revoked
If batch jobs run out of gateway resources, they will be re-queued and have to wait 10 minutes before becoming eligible to start again.

Debugging
~~~~~~~~~