[torqueusers] torque errors can't understand
Jacqueline Scoggins
jscoggins at lbl.gov
Tue Mar 28 18:11:06 MST 2006
I am running torque-1.1.0p2 (I know it is old, but please ignore that). I am
getting a strange message and I don't know exactly what is causing it.
In mom_logs I see:
03/28/2006 18:01:27;0008; pbs_mom;Job;5540 ;Started, pid = 2449
03/28/2006 18:01:33;0080; pbs_mom;Job;5540;scan_for_terminated: task 1 terminated, sid 2449
03/28/2006 18:01:33;0008; pbs_mom;Job;5540;Terminated
03/28/2006 18:01:33;0080; pbs_mom;Job;5540;Obit sent
03/28/2006 18:01:33;0080; pbs_mom;Req;req_reject;Reject reply code=15035( REJHOST=node0013), aux=0, type=54, from PBS_Server at jackie
When I look at the tracejob output I see the following:
03/28/2006 16:59:12 S Post job file processing error
03/28/2006 16:59:12 S dequeuing from parallel, state 5
When I look on the node and read the *.ER file I see the following
message:
One of the processes started by mpirun has exited with a nonzero exit
code. This typically indicates that the process finished in error.
If your process did not finish in error, be sure to include a "return
0" or "exit(0)" in your C code before exiting the application.
PID 2332 failed on node n1 (192.168.2.76) with exit status 1.
I have traced lamboot, and all the nodes seem to connect fine. I ran
lamboot with the -d and -v options and I don't see anything out of the
ordinary.
The queue parallel is set up as follows:
Queue parallel
queue_type = Execution
total_jobs = 1
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
resources_min.nodes = 2:ppn=2
resources_default.nodes = 2
resources_assigned.ncpus = 14
resources_assigned.nodect = 7
enabled = True
started = True
The script is:
#!/bin/bash
#PBS -l nodes=7:shared
#PBS -l ncpus=14
/usr/bin/lamboot -d -v $PBS_NODEFILE
cd $HOME
/usr/bin/mpirun -np 14 ./a.out
/usr/bin/lamhalt $PBS_NODEFILE
I can't figure out what is going on. $HOME is not what the real script
uses; it has a full pathname there, which I would rather not send over
email.
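One way to narrow down which step of a script like this fails is to log each command's exit status to stderr, so that it ends up in the *.ER file next to the mpirun message. The following is only a minimal sketch and not part of the original script: run_step is a hypothetical helper name, and true and false stand in for the lamboot, mpirun and lamhalt lines so that the sketch runs anywhere.

```shell
#!/bin/bash
# Hypothetical helper (not part of torque or LAM): run one command,
# report its exit status on stderr, and pass that status through.
run_step() {
    local label=$1
    shift
    "$@"
    local status=$?
    echo "step=${label} exit_status=${status}" >&2
    return "$status"
}

# Stand-ins for the real commands. In the job script these would wrap
# the existing lines, for example:
#   run_step lamboot /usr/bin/lamboot -d -v $PBS_NODEFILE
#   run_step mpirun  /usr/bin/mpirun -np 14 ./a.out
run_step boot true
boot_status=$?

run_step compute false
compute_status=$?
```

Keeping the status of each step separate shows whether the nonzero exit comes from the MPI program itself or from the surrounding setup and teardown.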
Anyway, any help would be appreciated.
Thanks
Jackie