[Mauiusers] torque 2.3.6 maui 3.2.6p19 and mpich2-1.1.1

Alfred Wagner wagner at rz.uni-kiel.de
Thu Sep 3 08:01:46 MDT 2009


Hi,

I have a question about suspending MPI jobs.
We are running Torque with Maui on a cluster.
It is running fine; the problem we have is with MPI jobs.
Suspension does not work on all processes:

Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
791.rzcluster2       xxxxx    short    test2             26698     1  --     -- 02:30 S 01:33
   rzcl124/7+rzcl124/6+rzcl124/5+rzcl124/4+rzcl124/3+rzcl124/2+rzcl124/1+rzcl124/0
795.rzcluster2       xxxxx    batch    test              28098     1  --     --    --


[root@rzcl124 tmp]# cat JJJ
4 T xxxxx 26698 16450  0  77   0 - 21014 -      14:20 ?        00:00:00 -bash
0 T xxxxx 26761 26698  0  77   0 - 15965 -      14:20 ?        00:00:00 /bin/bash /var/spool/pbs/mom_priv/jobs/791.rzcluster2.SC
0 T xxxxx 26773 26761  0  75   0 - 35915 -      14:20 ?        00:00:00 python2.4 /data/xxxxx/Dtesten/Drzcluster2/Dmpich2/mpich2-1.1.1/bin/mpirun -np 16 ./m1
0 S xxxxx 26790     1 51  75   0 -  5290 -      14:20 ?        00:00:30 ./m1
0 S xxxxx 26791     1 38  76   0 -  4265 -      14:20 ?        00:00:23 ./m1
0 S xxxxx 26792     1 38  75   0 -  4266 -      14:20 ?        00:00:23 ./m1
0 S xxxxx 26793     1 25  76   0 -  4265 -      14:20 ?        00:00:15 ./m1
0 S xxxxx 26794     1 25  76   0 -  4266 -      14:20 ?        00:00:15 ./m1
0 S xxxxx 26795     1 26  75   0 -  4266 -      14:20 ?        00:00:15 ./m1
0 S xxxxx 26796     1 25  75   0 -  4267 -      14:20 ?        00:00:15 ./m1
0 S xxxxx 26797     1 12  75   0 -  5290 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26798     1 12  75   0 -  5289 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26799     1 12  75   0 -  5290 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26800     1 13  75   0 -  5290 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26801     1 12  75   0 -  5289 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26802     1 13  75   0 -  5290 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26803     1 12  75   0 -  5289 -      14:20 ?        00:00:07 ./m1
0 S xxxxx 26804     1 12  75   0 -  5289 -      14:20 ?        00:00:07 ./m1
0 R xxxxx 26805     1 52  75   0 -  4320 -      14:20 ?        00:00:31 ./m1
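
To make the listing above easier to read at a glance, here is a minimal sketch that tallies the process states and lists the PIDs that are not stopped. It assumes the lines come from `ps -elf`-style output, where the second column is the state, the fourth the PID and the fifth the parent PID; the function name `summarize_states` and the three sample lines (copied from the listing) are only for illustration. In the full listing, the processes that are not in state T are the ./m1 ranks, and all of them show a parent PID of 1 rather than the PID of mpirun.

```python
from collections import Counter


def summarize_states(ps_lines):
    """Count process states and collect processes that are not stopped ('T').

    Assumes `ps -elf`-style columns: F S UID PID PPID ...
    """
    states = Counter()
    not_stopped = []
    for line in ps_lines:
        fields = line.split()
        # Skip headers and wrapped continuation lines that lack PID/PPID columns.
        if len(fields) < 5 or not fields[3].isdigit() or not fields[4].isdigit():
            continue
        state, pid, ppid = fields[1], int(fields[3]), int(fields[4])
        states[state] += 1
        if state != "T":
            not_stopped.append((pid, ppid, state))
    return states, not_stopped


# Three lines taken from the listing above.
sample = [
    "4 T xxxxx 26698 16450  0  77   0 - 21014 -      14:20 ?        00:00:00 -bash",
    "0 S xxxxx 26790     1 51  75   0 -  5290 -      14:20 ?        00:00:30 ./m1",
    "0 R xxxxx 26805     1 52  75   0 -  4320 -      14:20 ?        00:00:31 ./m1",
]

states, not_stopped = summarize_states(sample)
print(dict(states))
for pid, ppid, state in not_stopped:
    print(f"PID {pid} (PPID {ppid}) is in state {state}, not T")
```

Feeding the whole file to it (for example `summarize_states(open("JJJ"))`) should give the same picture for all 16 ranks.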

After resuming the job, the processes also do not start running again.
Is this a known problem?




Thank you




      Alfred Wagner
      Rechenzentrum der Universität Kiel
      Ludewig-Meyn-Straße 4
      24118 Kiel
      0431-8804494  FAX:0431-8801523


