[torqueusers] qsub -lnodes=2 crashes

ג'רי (Jerry) jerry.mersel at weizmann.ac.il
Fri Apr 11 00:04:14 MDT 2008


Hi:


    Apologies if this is a duplicate; I didn't see my first
    message about this problem appear on the list.

    I built a 64-bit Torque/Maui system but am having problems.


  I am setting up a cluster using Torque 2.3.0 and Maui 3.2.6p19,
  currently with 2 running nodes.
  Whenever I submit a job like this:

   qsub -l nodes=2:ppn=1 t3.sh

  or even just

   qsub -l nodes=2 t3.sh

  qstat shows the job as running, but it isn't.
  The job just crashes.

  The script is just:

  #!/bin/tcsh
  <that's all>

  Here is the log from one of the machines:

20080410:04/10/2008 09:53:16;0008;   pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node 4.wcl
20080410:04/10/2008 09:56:24;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 215. node4, job_start_error from node node3 in job_start_error
20080410:04/10/2008 09:56:24;0001;   pbs_mom;Job;215.node4;send_sisters:  sister #1 (node3) is not ok  (1099)
20080410:04/10/2008 09:56:24;0080;   pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 09:56:25;0008;   pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node 4.wcl
20080410:04/10/2008 09:59:33;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 215. node4, job_start_error from node node3 in job_start_error
20080410:04/10/2008 09:59:33;0001;   pbs_mom;Job;215.node4;send_sisters:  sister #1 (node3) is not ok  (1099)
20080410:04/10/2008 09:59:33;0080;   pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 09:59:34;0008;   pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node 4.wcl
20080410:04/10/2008 10:01:53;0001;   pbs_mom;Svr;pbs_mom;sister could not communicate (15059) in 215. node4, job_start_error from node node3 in job_start_error
20080410:04/10/2008 10:01:53;0001;   pbs_mom;Job;215.node4;send_sisters:  sister #1 (node3) is not ok  (1099)
20080410:04/10/2008 10:01:53;0080;   pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 10:01:54;0008;   pbs_mom;Job;215.node4;Job Modified at request of PBS_Server at node 4.wcl
20080410:04/10/2008 10:02:07;0001;   pbs_mom;Svr;pbs_mom;job_recov, warning: tmsockets not recovered
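
The log lines above are semicolon-separated: timestamp, event code, daemon, record type, object name, message. To make the repeating pattern easier to see, here is a minimal Python sketch that groups such lines by event code and message. It assumes every line has exactly those six fields and that the leading "20080410:" is the log file name added by grep rather than part of the record. The names parse_line and count_messages are made up for this illustration, and the sample is three lines copied from the log above.

```python
from collections import Counter

# Three lines copied from the pbs_mom log above.
SAMPLE = """\
20080410:04/10/2008 09:56:24;0001;   pbs_mom;Job;215.node4;send_sisters:  sister #1 (node3) is not ok  (1099)
20080410:04/10/2008 09:56:24;0080;   pbs_mom;Job;215.node4;obit sent to server
20080410:04/10/2008 09:59:33;0001;   pbs_mom;Job;215.node4;send_sisters:  sister #1 (node3) is not ok  (1099)
"""


def parse_line(line):
    """Split one log line into its six fields; return None if it does not fit."""
    parts = line.strip().split(";", 5)  # the message itself may contain ';'
    if len(parts) != 6:
        return None
    stamp, code, daemon, kind, obj, message = (p.strip() for p in parts)
    # Assumption: an all-digit prefix before the first ':' is the log file
    # name added by grep, not part of the timestamp.
    prefix, sep, rest = stamp.partition(":")
    if sep and prefix.isdigit():
        stamp = rest
    return {
        "time": stamp,
        "code": code,
        "daemon": daemon,
        "kind": kind,
        "object": obj,
        "message": message,
    }


def count_messages(text):
    """Count how often each (event code, message) pair occurs."""
    counts = Counter()
    for line in text.splitlines():
        record = parse_line(line)
        if record is not None:
            counts[(record["code"], record["message"])] += 1
    return counts


if __name__ == "__main__":
    for (code, message), n in count_messages(SAMPLE).most_common():
        print(n, code, message)
```

Run against the full log, this kind of tally shows the same "sister could not communicate" and "send_sisters" errors repeating each time the job is restarted.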



Please help.

I've seen this issue reported in the archives, but without any solution.


                   Regards,
                    Jerry



