[torqueusers] torque errors can't understand
Garrick Staples
garrick at usc.edu
Wed Mar 29 17:48:13 MST 2006
On Tue, Mar 28, 2006 at 05:11:06PM -0800, Jacqueline Scoggins alleged:
> I am running torque-1.1.0p2 (I know it is old but ignore this). I am
Why ignore this? If you suspect a problem with TORQUE, then the first
suggestion is to reproduce the problem with the latest stable release.
> getting a strange message and I don't know exactly what is causing it:
>
> In mom_logs I see:
>
> 03/28/2006 18:01:27;0008; pbs_mom;Job;5540 ;Started, pid = 2449
> 03/28/2006 18:01:33;0080; pbs_mom;Job;5540;scan_for_terminated: task 1
> terminated, sid 2449
> 03/28/2006 18:01:33;0008; pbs_mom;Job;5540;Terminated
> 03/28/2006 18:01:33;0080; pbs_mom;Job;5540;Obit sent
The job started, and then exited about 6 seconds later.
> 03/28/2006 18:01:33;0080; pbs_mom;Req;req_reject;Reject reply
> code=15035( REJHOST=node0013), aux=0, type=54, from PBS_Server at jackie
>
> When I look at the tracejob I see the following:
> 03/28/2006 16:59:12 S Post job file processing error
> 03/28/2006 16:59:12 S dequeuing from parallel, state 5
The job exited so quickly that pbs_server thinks the initial job launch
failed.
> When I look on the node and read the *.ER file I see the following
> message:
>
> One of the processes started by mpirun has exited with a nonzero exit
> code. This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.
>
> PID 2332 failed on node n1 (192.168.2.76) with exit status 1.
The MPI process on n1 failed with exit status 1. Run it under a debugger,
turn on more verbose output, etc.
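As a rough sketch of the "more verbose output" idea, a small wrapper in
the job script can record which command failed, on which host, and with
what exit status. The function name run_and_report and the program name
my_mpi_app are made up for illustration; nothing here is specific to
TORQUE or to any particular MPI implementation.

```shell
#!/bin/bash
# Run a command, then report its exit status and the host it ran on.
# The message goes to stderr, so it lands in the job's error file.
run_and_report() {
    "$@"
    local status=$?
    local host="${HOSTNAME:-$(uname -n)}"
    echo "command '$*' on $host exited with status $status" >&2
    return "$status"
}

# Hypothetical usage inside the job script (my_mpi_app is a placeholder):
#   run_and_report mpirun ./my_mpi_app
```

The wrapper passes the original exit status through, so the job still
fails the same way it did before; it only adds one line of context.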
> #!/bin/bash
> #PBS -l nodes=7:shared
> #PBS -l ncpus=14
That doesn't look right. To get 14 processors across 7 nodes, I think
you wanted 'nodes=7:ppn=2:shared'.
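A minimal sketch of the job script header with that request, plus a
quick check of what was actually allocated. It assumes "shared" is a
node property defined at your site (as in your original script) and
relies on $PBS_NODEFILE, which TORQUE sets inside a running job to a
file listing one hostname per allocated processor. The helper name
count_slots is made up for this example.

```shell
#!/bin/bash
#PBS -l nodes=7:ppn=2:shared

# Count the lines in a node file; each line is one allocated processor.
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Only meaningful inside a running job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "allocated slots: $(count_slots "$PBS_NODEFILE")"
    # Show how many slots were assigned on each host.
    sort "$PBS_NODEFILE" | uniq -c
fi
```

If the request is honored as intended, the slot count should come out
to 14, with each of the 7 hosts listed twice.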
--
Garrick Staples, Linux/HPCC Administrator
University of Southern California