[torqueusers] Trouble with multinode mpi on two new nodes

Daniel Davidson danield at igb.uiuc.edu
Thu Dec 13 19:25:24 MST 2012


We recently received four new systems to add to our cluster. They have a 
pure UEFI BIOS and do not PXE boot, so I had to install them manually 
instead of just using Rocks to build the nodes.  We have run into a 
problem when we try to use more than one of these new nodes to run 
OpenMPI jobs.  We compiled our own OpenMPI with Torque support and have 
used it on the existing hardware for a while.

What does work:
serial jobs submitted to each new node
parallel jobs that fit on a single node
a parallel program launched from the command line on either node (mpirun 
-host compute-2-0,compute-2-1 -np 2 hostname)
parallel jobs run on the other (existing) nodes

But when a multinode job is submitted through the queue, it is 
dispatched, yet the output files are never created.

FYI, we use Moab, if that matters.
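For reference, the test script (the mpitest.sh that shows up in the logs 
below) is trivial; it is something along these lines, though the resource 
request and walltime here are from memory rather than an exact copy:

```shell
#!/bin/bash
#PBS -N mpitest
#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:05:00
# Just print the hostname from each rank; with Torque support compiled
# in, mpirun picks the node list up from the TM interface / $PBS_NODEFILE.
cd "$PBS_O_WORKDIR"
mpirun -np 2 hostname
```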

I have ensured:
mpi/torque libraries are loaded
torque versions are the same (3.0.5-1, the version the Rocks install 
supplies on the other nodes)
/etc/hosts is the same on all systems
ldap is working correctly on all systems
universal file system home directories are mounted and writable
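Roughly how I checked each item (these commands are illustrative, not 
exact transcripts):

```shell
# torque version, as packaged by Rocks on the existing nodes
rpm -q torque

# /etc/hosts identical everywhere: compare checksums across nodes
md5sum /etc/hosts

# ldap: name lookups resolve on every node
getent passwd danield

# home directory mounted and writable
touch /home/a-m/danield/.write-test && rm /home/a-m/danield/.write-test
```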

Please help; I am at a loss as to what is happening.

Dan



Pertinent log information

compute-2-0
12/13/2012 19:46:02;0008;   pbs_mom;Job;do_rpp;got an internal task 
manager request in do_rpp
12/13/2012 19:46:02;0002;   pbs_mom;Svr;im_request;connect from 
10.1.255.226:1023
12/13/2012 19:46:02;0008; 
pbs_mom;Job;296009.server.name.edu;im_request:received request 
'ABORT_JOB' (10) for job 296009.server.name.edu from 10.1.255.226:1023
12/13/2012 19:46:02;0008; pbs_mom;Job;296009.server.name.edu;ERROR:    
received request 'ABORT_JOB' from 10.1.255.226:1023 for job 
'296009.server.name.edu' (job does not exist locally)
12/13/2012 19:46:02;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file 
descriptor (9) in do_rpp, cannot get protocol End of File
12/13/2012 19:46:02;0002;   pbs_mom;Svr;im_eof;End of File from addr 
10.1.255.226:1023

compute-2-1
12/13/2012 19:48:06;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Bad file 
descriptor (9) in do_rpp, cannot get protocol Premature end of message
12/13/2012 19:48:06;0002;   pbs_mom;Svr;im_eof;Premature end of message 
from addr 10.1.255.227:15003
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::node_bailout, 
296009.server.name.edu join_job failed from node compute-2-0 1 - 
recovery attempted)
12/13/2012 19:48:06;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::sister could 
not communicate (15061) in 296009.server.name.edu, job_start_error from 
node compute-2-0 in job_start_error
12/13/2012 19:48:06;0008;   pbs_mom;Req;send_sisters;sending command 
ABORT_JOB for job 296009.server.name.edu (10)
12/13/2012 19:48:06;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
sisters for job 296009.server.name.edu
12/13/2012 19:48:06;0001; 
pbs_mom;Job;296009.server.name.edu;send_sisters:  sister #1 
(compute-2-0) is not ok (1099)
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::node_bailout, 
node_bailout: received KILL/ABORT request for job 296009.server.name.edu 
from node compute-2-0
12/13/2012 19:48:06;0080;   pbs_mom;Svr;scan_for_exiting;searching for 
exiting jobs
12/13/2012 19:48:06;0008;   pbs_mom;Job;kill_job;scan_for_exiting: 
sending signal 9, "KILL" to job 296009.server.name.edu, reason: local 
task termination detected
12/13/2012 19:48:06;0008; pbs_mom;Job;296009.server.name.edu;kill_job 
done (killed 0 processes)
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;sending 
preobit jobstat
12/13/2012 19:48:06;0008;   pbs_mom;Job;do_rpp;got an internal task 
manager request in do_rpp
12/13/2012 19:48:06;0002;   pbs_mom;Svr;im_request;connect from 
10.1.255.227:15003
12/13/2012 19:48:06;0008; 
pbs_mom;Job;296009.server.name.edu;im_request:received request 'ERROR' 
(99) for job 296009.server.name.edu from 10.1.255.227:15003
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, 
event 4 taskid 0 not found
12/13/2012 19:48:06;0001; pbs_mom;Svr;pbs_mom;LOG_ERROR::im_request, 
error sending command 99 to job 296009.server.name.edu
12/13/2012 19:48:06;0002;   pbs_mom;Svr;im_eof;No error from addr 
10.1.255.227:15003
12/13/2012 19:48:06;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
12/13/2012 19:48:06;0080; 
pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top 
of while loop
12/13/2012 19:48:06;0080;   pbs_mom;Svr;preobit_reply;in while loop, no 
error from job stat
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;performing 
job clean-up in preobit_reply()
12/13/2012 19:48:06;0080; pbs_mom;Job;296009.server.name.edu;epilog 
subtask created with pid 72123 - substate set to JOB_SUBSTATE_OBIT - 
registered post_epilogue

Head node:
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;Job Run 
at request of root at server.name.edu
12/13/2012 19:47:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 296009.server.name.edu state from QUEUED-QUEUED to 
RUNNING-PRERUN (4-40)
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;forking 
in send_job
12/13/2012 19:47:22;0004;PBS_Server;Svr;svr_connect;attempting connect 
to host 10.1.255.226 port 15002
12/13/2012 19:47:22;0008;PBS_Server;Job;296009.server.name.edu;entering 
post_sendmom
12/13/2012 19:47:22;0002;PBS_Server;Job;296009.server.name.edu;child 
reported success for job after 0 seconds (dest=compute-2-1), rc=0
12/13/2012 19:47:22;0008;PBS_Server;Job;reply_send;Reply sent for 
request type RunJob on socket 13
12/13/2012 19:47:22;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 296009.server.name.edu state from RUNNING-PRERUN to 
RUNNING-RUNNING (4-42)
12/13/2012 19:47:22;0004;PBS_Server;Svr;svr_connect;attempting connect 
to host 10.1.255.226 port 15002
12/13/2012 19:47:27;0004;PBS_Server;Svr;svr_connect;attempting connect 
to host 10.1.255.226 port 15002
.......
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;obit 
received - updating final job usage info
12/13/2012 19:50:30;0008;PBS_Server;Job;296009.server.name.edu;attr 
resources_used modified
12/13/2012 19:50:30;0008;PBS_Server;Job;296009.server.name.edu;attr 
Error_Path modified
12/13/2012 19:50:30;0008;PBS_Server;Job;reply_send;Reply sent for 
request type JobObituary on socket 14
12/13/2012 19:50:30;0009;PBS_Server;Job;296009.server.name.edu;job exit 
status -3 handled
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 296009.server.name.edu state from RUNNING-RERUN1 to 
EXITING-RERUN1 (5-61)
12/13/2012 
19:50:30;0009;PBS_Server;Job;296009.server.name.edu;on_job_rerun task 
assigned to job
12/13/2012 
19:50:30;0009;PBS_Server;Job;296009.server.name.edu;req_jobobit completed
12/13/2012 19:50:30;0004;PBS_Server;Svr;svr_connect;attempting connect 
to host 10.1.255.226 port 15002
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 296009.server.name.edu state from EXITING-RERUN1 to 
EXITING-RERUN2 (5-62)
12/13/2012 19:50:30;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: 
setting job 296009.server.name.edu state from EXITING-RERUN2 to 
EXITING-RERUN3 (5-63)
.......
and this keeps repeating
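One more thing I can check is whether the MOMs on the two new nodes can 
actually reach each other on the ports shown in the log lines above 
(15002/15003 are Torque's default MOM service and manager ports), 
something like:

```shell
# run from compute-2-1; target host and ports taken from the logs above
nc -z -w 2 compute-2-0 15002 && echo "15002 reachable"
nc -z -w 2 compute-2-0 15003 && echo "15003 reachable"

# query the remote mom's diagnostics directly
momctl -d 3 -h compute-2-0
```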

We also get logs like:
Unable to copy file 
/var/spool/torque/spool/296008.biocluster.igb.illinois.edu.OU to 
/home/a-m/danield/mpitest.sh.o296008 *** error from copy /bin/cp: cannot 
stat `/var/spool/torque/spool/296008.biocluster.igb.illinois.edu.OU': No 
such file or directory *** end error output

But I am pretty sure that happens because the MPI job never actually 
starts on the second node after the first node receives the job, so 
there is no spool file to copy back.


