[torqueusers] Problem with job starts on Linux
Chad Vizino
vizino at psc.edu
Thu Jun 28 09:49:53 MDT 2007
Hi,
It's been a while since I built and ran Torque, so I downloaded 2.1.8
to try out on a couple of Linux systems (Ubuntu and Suse). I wanted to
confirm basic operation, so I did the customary configure, make, make
install sequence and started the server, mom and sched daemons on the
same host, adding the host to /var/spool/torque/server_priv/nodes
("blueracer:ts"). The mom host shows up as free and timeshared in
"pbsnodes -a" output.
/var/spool/torque/mom_priv/config is:
$logevent 0x1ff
$loglevel 7
$usecp *:/home /home
However, I can't get a job to start under 2.1.8. I have tried a simple
job start test under versions 2.0.0p11, 2.1.1, 2.1.2, 2.1.3, 2.1.6, and
2.1.8. The following failure occurs under every version starting
with 2.1.2:
$ qsub -j eo
echo touching /tmp/foo
touch /tmp/foo
6.heidi
$ ls -l STDIN.e6
-rw------- 1 vizino vizino 0 2007-06-28 11:39 STDIN.e6
$ ls -l /tmp/foo
ls: /tmp/foo: No such file or directory
From the mom log:
06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;starting job execution
06/28/2007 11:39:22;0001; pbs_mom;Job;job_nodes;0: heidi/0
06/28/2007 11:39:22;0001; pbs_mom;Job;job_nodes;job: 6.heidi numnodes=1 numvnod=1
06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;evaluating limits for job
06/28/2007 11:39:22;0002; pbs_mom;n/a;mom_close_poll;entered
06/28/2007 11:39:22;0001; pbs_mom;Job;6.heidi;phase 2 of job launch successfully completed
06/28/2007 11:39:22;0001; pbs_mom;Job;TMomFinalizeJob3;read start return code=-2 session=26752
06/28/2007 11:39:22;0001; pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
06/28/2007 11:39:22;0008; pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 6.heidi (10)
06/28/2007 11:39:22;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;job execution started
06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;start failed on unknown node
06/28/2007 11:39:22;0080; pbs_mom;Job;6.heidi;local task termination detected. killing job
06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;kill_job
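For anyone comparing mom logs across versions, here is a minimal Python sketch that pulls the "read start return code" entries out of a log like the one above. It assumes each record has six semicolon-separated fields (timestamp, hex event code, daemon, object class, object name, message), which is inferred only from this excerpt, and that wrapped lines belong to the previous record. The names MomLogRecord, parse_mom_log and start_return_codes are hypothetical helpers, not part of Torque.

```python
import re
from typing import List, NamedTuple, Tuple


class MomLogRecord(NamedTuple):
    timestamp: str
    event: int        # hex event code from the second field, e.g. 0x0008
    daemon: str
    obj_class: str
    obj_name: str
    message: str


# A new record starts with "MM/DD/YYYY HH:MM:SS;" (assumed from the excerpt).
_RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
_START_RC = re.compile(r"read start return code=(-?\d+) session=(\d+)")


def parse_mom_log(text: str) -> List[MomLogRecord]:
    """Parse log text, folding wrapped continuation lines into the prior record."""
    records: List[MomLogRecord] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split(";", 5)
        if _RECORD_START.match(line) and len(parts) == 6:
            ts, event, daemon, cls, name, msg = (p.strip() for p in parts)
            records.append(MomLogRecord(ts, int(event, 16), daemon, cls, name, msg))
        elif records:
            # Continuation of a wrapped message: append to the previous record.
            prev = records[-1]
            records[-1] = prev._replace(message=prev.message + " " + line)
    return records


def start_return_codes(records: List[MomLogRecord]) -> List[Tuple[int, int]]:
    """Return (return_code, session) pairs from 'read start' messages."""
    found = []
    for rec in records:
        m = _START_RC.search(rec.message)
        if m:
            found.append((int(m.group(1)), int(m.group(2))))
    return found


SAMPLE = """06/28/2007 11:39:22;0001; pbs_mom;Job;TMomFinalizeJob3;read start
return code=-2 session=26752
06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;start failed on unknown node"""

if __name__ == "__main__":
    for code, session in start_return_codes(parse_mom_log(SAMPLE)):
        print(f"start return code {code} (session {session})")
```

Running the same extraction against logs from a working and a failing version should make it easy to see whether the return code differs between them.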
This has to be due to some obvious problem I'm missing. The test has
been run on Ubuntu (7.04) and Suse (SLES 8) systems with similar
results. syslog shows nothing.
Any advice to correct a hopefully simple oversight on my part would be
appreciated.
-Chad
Chad Vizino
Pittsburgh Supercomputing Center