[torqueusers] torque errors can't understand

Jacqueline Scoggins jscoggins at lbl.gov
Tue Mar 28 18:11:06 MST 2006


I am running torque-1.1.0p2 (I know it is old, but please ignore that).
I am getting a strange message and I don't know exactly what is causing it:

In mom_logs I see:

03/28/2006 18:01:27;0008;   pbs_mom;Job;5540 ;Started, pid = 2449
03/28/2006 18:01:33;0080;   pbs_mom;Job;5540;scan_for_terminated: task 1 terminated, sid 2449
03/28/2006 18:01:33;0008;   pbs_mom;Job;5540;Terminated
03/28/2006 18:01:33;0080;   pbs_mom;Job;5540;Obit sent
03/28/2006 18:01:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15035( REJHOST=node0013), aux=0, type=54, from PBS_Server at jackie
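
The mom_logs records above are semicolon-separated, so the entries for one job can be pulled out with awk. This is a minimal sketch, assuming each record sits on a single line in the real log file and has the six fields seen in the sample (timestamp, event code, daemon, class, name, message). The function name job_records is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the timestamp and message of every "Job" record
# whose name field matches the given job id.
# Spaces are stripped from field 5 before comparing, because the sample
# shows the id written as "5540 " in one record.
# Only the sixth field is printed, so a message containing ';' would be cut.
job_records() {
    local jobid=$1 logfile=$2
    awk -F';' -v id="$jobid" '
        { name = $5; gsub(/ /, "", name) }
        $4 == "Job" && name == id { print $1 " | " $6 }
    ' "$logfile"
}
```

It would be called as `job_records 5540 <path to the mom log>`, where the path is whatever the local installation uses.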

When I look at the tracejob I see the following:
03/28/2006 16:59:12  S    Post job file processing error
03/28/2006 16:59:12  S    dequeuing from parallel, state 5


When I look on the node and read the *.ER file I see the following
message:

One of the processes started by mpirun has exited with a nonzero exit
code.  This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.

PID 2332 failed on node n1 (192.168.2.76) with exit status 1.


I have traced the lamboot and all the nodes seem to be connecting fine.  I
ran lamboot with the -d and -v options and I don't see anything out of
the ordinary.

The queue parallel is set up as follows:

Queue parallel
        queue_type = Execution
        total_jobs = 1
        state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
        resources_min.nodes = 2:ppn=2
        resources_default.nodes = 2
        resources_assigned.ncpus = 14
        resources_assigned.nodect = 7
        enabled = True
        started = True

The script is:

#!/bin/bash
#PBS -l nodes=7:shared
#PBS -l ncpus=14

/usr/bin/lamboot -d -v $PBS_NODEFILE
cd $HOME
/usr/bin/mpirun -np 14 ./a.out
/usr/bin/lamhalt $PBS_NODEFILE
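
One way to see which step of a script like this fails first is to record each command's exit status in the job's error file. This is a minimal sketch, assuming a hypothetical wrapper named run_step. It is not part of the script above, and the commands below are stand-ins for the lamboot, mpirun, and lamhalt lines.

```shell
#!/bin/bash
# Hypothetical helper: run a command, report its exit status on stderr,
# and return that same status to the caller.
run_step() {
    "$@"
    local status=$?
    echo "step '$*' exited with status $status" >&2
    return "$status"
}

# Illustrative usage with stand-in commands.
run_step true
run_step false || echo "a step failed"
```

In the job script each line would be prefixed with run_step, so the *.ER file shows the status of every step and not only the mpirun summary.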


I can't figure out what is going on.  The real script does not use $HOME;
it has a full pathname there, which I replaced because I don't like
passing that information over in email.

Anyway, any help would be appreciated.

Thanks

Jackie




