In our HPC application we use MPI.NET as our communication platform.
For reasons we have not been able to pin down, the application crashes in one of the following ways:
1. An MPI exception (MPI_other_error, related to shared-memory allocation)
2. A received message that arrives empty
3. A hang inside one of the MPI calls
We suspect the problem is related to the size of the messages we send (up to 100 MB) and the high rate at which they are sent.
We are considering replacing the communication layer with a different type of communication method. The main issue is the volume of data we transfer and the messaging rate: there are quite a few large messages per second, to and from different processes.
To clarify the architecture: each of our ranks loops on Iprobe with any tag and any source, and whenever a status is returned it calls the matching receive.
Sends to and receives from different ranks can happen simultaneously, i.e. a rank can be receiving more than one message at the same time.
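Given the 100 MB messages described above, one mitigation worth sketching is to split each large payload into bounded chunks before sending, so that no single MPI message exceeds a fixed threshold, and to reassemble the chunks by index on the receiving side. The helper names below (`split_payload`, `reassemble`) are hypothetical, and the sketch is plain Python so the pattern reads independently of MPI.NET:

```python
def split_payload(payload: bytes, chunk_size: int = 4 * 1024 * 1024):
    """Yield (index, total, chunk) triples for a large payload.

    chunk_size bounds the size of each individual message; carrying the
    index and total with every chunk lets the receiver reassemble them
    even if they arrive out of order.
    """
    total = (len(payload) + chunk_size - 1) // chunk_size or 1
    for i in range(total):
        yield i, total, payload[i * chunk_size:(i + 1) * chunk_size]


def reassemble(triples):
    """Rebuild the original payload from (index, total, chunk) triples,
    regardless of arrival order."""
    ordered = sorted(triples)  # sorts by chunk index (first tuple element)
    return b"".join(chunk for _, _, chunk in ordered)
```

Each triple would be sent as its own message; because the index travels with the chunk, reassembly still works when chunks from one transfer are interleaved with messages from other ranks (in practice the receiver would key its in-progress buffers by source rank).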
Are you familiar with these problems?
Is there anything that can fix them?
If you need any clarification regarding our architecture, we will be happy to provide it.
LAND SYSTEMS & C4I - TADIRAN