1) Code executed before communication operators are initialized, or after they are finalized, is implementation dependent; 2) command-line arguments are queried before operator initialization; 3) this can lead to many duplicate error messages.
I have decided to bundle these three issues together, since solving issues 1 and 2 (which are strongly related) makes issue 3 easy to solve. Note that issues 1 and 2 will probably not affect most Linux distributions.
- Code is executed before MPI_Init. This is not strictly supported and may not work with, e.g., FG-MPI or other fork-based MPI implementations (I am not really knowledgeable on this, though). This is an inherent problem for all of the communication methods, which seem to have undefined behavior for code executed before initialization and after finalization. Executing code there seems to work, but the results seem to be implementation defined, i.e. not portable.
Potential Solution: Users should use a communication-operator implementation that allows processes to execute code before initialization and after finalization. This is the user's responsibility.
- The command line is queried to determine the virtual-cluster implementation, which is required for communication operations to be initialized. The virtual-cluster implementation is determined in one of two ways: 1) the option --mode or -m is looked up, or 2) the suffix of the linktest executable is queried. The C++ standard, however, does not guarantee that argc is greater than zero. It only guarantees that, if argc is greater than zero, the first element of argv is either a pointer to the executable name, including the system-specific path used to call it, or a null pointer. Method 1) is guaranteed to fail if --mode or -m is not given, which is always the case if argc <= 1. Method 2) is guaranteed to fail if argc == 0 or argv[0] == NULL.
An alternative for determining the virtual-cluster implementation is to use an environment variable. Environment variables, however, like the current implementation, are operating-system specific.
Potential Solution: Users should only execute linktest in such a way that the command-line arguments are passed to all processes. This is the user's responsibility. If this is not feasible, a compliant environment-variable system should be used.
- This leads to the next problem: since a single-instruction multiple-data execution model is used, a number of processes will execute the same instructions before communication initialization, and only after that point can ranks be determined. If retrieving the virtual-cluster implementation fails, each of these processes will generate an error, potentially clogging up logging and causing race conditions. This is unavoidable if only one executable is desired.
Potential Solution: Initialize an emergency MPI environment and have only rank 0 print out the unique encountered error messages and the associated ranks. This only works if Have_MPI != 0. Otherwise we need to think of something else, maybe implement alternatives based on compile-time options.
Alternative Solution: Supply a wrapper for srun/mpiexec etc. that ensures a valid virtual-cluster implementation is always used. This way the error messages are guaranteed never to be generated.
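The wrapper idea could be sketched as a small shell function: if the user did not supply --mode or -m, a default is appended before the launcher runs, so the virtual-cluster lookup can never fail. The launcher argument and the default mode "mpi" are illustrative assumptions.

```shell
# Hypothetical wrapper logic for srun/mpiexec: run the given launcher,
# appending a default --mode only when the user did not supply one.
run_linktest() {
    launcher="$1"
    shift
    for arg in "$@"; do
        case "$arg" in
            --mode|--mode=*|-m)
                # User supplied a mode; pass arguments through unchanged.
                "$launcher" "$@"
                return $?
                ;;
        esac
    done
    # No mode given: append a safe default.
    "$launcher" "$@" --mode "${DEFAULT_MODE:-mpi}"
}
```

Users would then invoke, e.g., run_linktest srun linktest ... instead of calling the launcher directly.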
Alternative Solution: Have the function return a non-zero error code and document what these codes mean; this might be a bit old-school, though.