[torqueusers] job with -l nodes=2 marked as running when it has long stopped
Jerry
jerry.mersel at weizmann.ac.il
Thu Apr 10 01:44:29 MDT 2008
Hi:
I am setting up a cluster using torque 2.3.0 and maui 3.2.6p19.
Two nodes are up at the moment.
Whenever I submit a job like this:
    qsub -l nodes=2:ppn=1 t3.sh
or even just
    qsub -l nodes=2 t3.sh
qstat shows the job as running, even though it is not.
The script is just:
#!/bin/tcsh
<that's all>
Here is the pbs_mom log from one of the machines:
20080410:04/10/2008 09:53:16;0008; pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node4.wcl
20080410:04/10/2008 09:56:24;0001; pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 215.node4, job_start_error from node node3 in job_start_error
20080410:04/10/2008 09:56:24;0001; pbs_mom;Job;215.node4;send_sisters: sister #1 (node3) is not ok (1099)
20080410:04/10/2008 09:56:24;0080; pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 09:56:25;0008; pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node4.wcl
20080410:04/10/2008 09:59:33;0001; pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 215.node4, job_start_error from node node3 in job_start_error
20080410:04/10/2008 09:59:33;0001; pbs_mom;Job;215.node4;send_sisters: sister #1 (node3) is not ok (1099)
20080410:04/10/2008 09:59:33;0080; pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 09:59:34;0008; pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node4.wcl
20080410:04/10/2008 10:01:53;0001; pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 215.node4, job_start_error from node node3 in job_start_error
20080410:04/10/2008 10:01:53;0001; pbs_mom;Job;215.node4;send_sisters: sister #1 (node3) is not ok (1099)
20080410:04/10/2008 10:01:53;0080; pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 10:01:54;0008; pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node4.wcl
20080410:04/10/2008 10:02:07;0001; pbs_mom;Svr;pbs_mom;job_recov, warning: tmsockets not recovered
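The entries above repeat the same few messages, so it may help to count them per event code. The following is only a minimal sketch: it assumes each pbs_mom log line has six semicolon-separated fields (timestamp, event code, daemon, object class, object name, message) and that the leading "20080410:" is just the log file name. The helpers parse_line and summarize are hypothetical names, not part of TORQUE.

```python
from collections import Counter

# Three lines copied from the log above, used as sample input.
SAMPLE = """\
20080410:04/10/2008 09:56:24;0001; pbs_mom;Job;215.node4;send_sisters: sister #1 (node3) is not ok (1099)
20080410:04/10/2008 09:56:24;0080; pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 09:59:33;0001; pbs_mom;Job;215.node4;send_sisters: sister #1 (node3) is not ok (1099)
"""


def parse_line(line):
    """Split one log line into its six fields; return None if it does not fit."""
    # maxsplit=5 keeps any semicolons inside the message text intact.
    parts = [p.strip() for p in line.strip().split(";", 5)]
    if len(parts) != 6:
        return None
    keys = ("timestamp", "code", "daemon", "obj_class", "obj_name", "message")
    return dict(zip(keys, parts))


def summarize(text):
    """Count how often each (event code, message) pair occurs."""
    counts = Counter()
    for line in text.splitlines():
        record = parse_line(line)
        if record is not None:
            counts[(record["code"], record["message"])] += 1
    return counts


if __name__ == "__main__":
    for (code, message), n in summarize(SAMPLE).most_common():
        print(n, code, message)
```

Run against the full log file, this would show at a glance that the only error repeated is the failure to reach node3.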
Please help.
I've seen this issue in the archives, but no solutions were posted.
Regards,
Jerry