[torqueusers] torque errors can't understand

Garrick Staples garrick at usc.edu
Wed Mar 29 17:48:13 MST 2006


On Tue, Mar 28, 2006 at 05:11:06PM -0800, Jacqueline Scoggins alleged:
> I am running torque-1.1.0p2 (I know it is old but ignore this).  I am

Why ignore this?  If you suspect a problem with TORQUE, then the first
suggestion is to reproduce the problem with the latest stable release.


> getting a strange message and I don't know exactly what is causing it:
> 
> In mom_logs I see:
> 
> 03/28/2006 18:01:27;0008;   pbs_mom;Job;5540 ;Started, pid = 2449
> 03/28/2006 18:01:33;0080;   pbs_mom;Job;5540;scan_for_terminated: task 1
> terminated, sid 2449
> 03/28/2006 18:01:33;0008;   pbs_mom;Job;5540;Terminated
> 03/28/2006 18:01:33;0080;   pbs_mom;Job;5540;Obit sent

The job started, and then exited about 6 seconds later.


> 03/28/2006 18:01:33;0080;   pbs_mom;Req;req_reject;Reject reply
> code=15035( REJHOST=node0013), aux=0, type=54, from PBS_Server at jackie
> 
> When I look at the tracejob I see the following:
> 03/28/2006 16:59:12  S    Post job file processing error
> 03/28/2006 16:59:12  S    dequeuing from parallel, state 5

The job exited so quickly that pbs_server thinks the initial job launch
failed.



> When I look on the node and read the *.ER file I see the following
> message:
> 
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
> 
> PID 2332 failed on node n1 (192.168.2.76) with exit status 1.

The MPI process on n1 failed.  Run it under a debugger, turn up the
verbosity, etc.
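
If it helps, here is a minimal sketch of a wrapper you could put in the
job script to see exactly which command fails and with what status.  The
function name run_and_report and the ./my_app program are placeholders of
mine, not anything from your script, and the right mpirun options depend
on your MPI implementation.

```shell
#!/bin/bash
# Run a command, report its exit status on stderr, and pass the status through.
# run_and_report is a hypothetical helper name, not part of TORQUE or MPI.
run_and_report() {
    "$@"
    local rc=$?
    echo "command '$*' exited with status $rc" >&2
    return "$rc"
}

# Example use inside the job script (./my_app is a placeholder):
#   run_and_report mpirun -np 14 ./my_app
```

The status line goes to stderr, so it should end up in the job's error
file next to the mpirun message.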


> #!/bin/bash
> #PBS -l nodes=7:shared
> #PBS -l ncpus=14

That doesn't look right.  ncpus=14 on top of nodes=7 suggests you want 2
processors on each of 7 nodes, so I think you wanted
'nodes=7:ppn=2:shared'.
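
As a rough sketch, the top of the script could then look like the
following.  The check at the end is my own addition: it assumes pbs_mom
sets $PBS_NODEFILE with one line per allocated processor, so a 7 x 2
request should report 14 slots.  count_slots is a made-up helper name.

```shell
#!/bin/bash
#PBS -l nodes=7:ppn=2:shared

# Print the number of non-empty lines in a node file ($1).
# count_slots is a hypothetical helper, not a TORQUE command.
count_slots() {
    grep -c . "$1"
}

# Only report when running under the batch system, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "slots allocated: $(count_slots "$PBS_NODEFILE")"
    echo "distinct nodes:  $(sort -u "$PBS_NODEFILE" | grep -c .)"
fi
```

If the numbers printed are not what you asked for, the resource request
is the first place to look.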


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California