Problem with job starts on Linux

Chad Vizino vizino at psc.edu
Thu Jun 28 09:49:53 MDT 2007


It's been a while since I have built and run Torque, so I downloaded 2.1.8 
to try out on a couple of Linux systems (Ubuntu and Suse).  I wanted to 
confirm basic operation, so I did the customary configure-make-make install 
sequence and started the server, mom and sched daemons on the same host, 
adding the host to /var/spool/torque/server_priv/nodes ("blueracer:ts"). 
The mom host shows up as free and timeshared in the "pbsnodes -a" output. 
/var/spool/torque/mom_priv/config is:

$logevent 0x1ff
$loglevel 7
$usecp *:/home /home

However, I can't get a job to start under 2.1.8.  I have tried a simple 
job start test under versions 2.0.0p11, 2.1.1, 2.1.2, 2.1.3, 2.1.6, and
2.1.8.  The following failure occurs under every version starting 
with 2.1.2:

$ qsub -j eo
echo touching /tmp/foo
touch /tmp/foo
$ ls -l STDIN.e6
-rw------- 1 vizino vizino 0 2007-06-28 11:39 STDIN.e6
$ ls -l /tmp/foo
ls: /tmp/foo: No such file or directory

From the mom log:

06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;starting job execution
06/28/2007 11:39:22;0001;   pbs_mom;Job;job_nodes;0: heidi/0
06/28/2007 11:39:22;0001;   pbs_mom;Job;job_nodes;job: 6.heidi numnodes=1 numvnod=1
06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;evaluating limits for job
06/28/2007 11:39:22;0002;   pbs_mom;n/a;mom_close_poll;entered
06/28/2007 11:39:22;0001;   pbs_mom;Job;6.heidi;phase 2 of job launch successfully completed
06/28/2007 11:39:22;0001;   pbs_mom;Job;TMomFinalizeJob3;read start return code=-2 session=26752
06/28/2007 11:39:22;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
06/28/2007 11:39:22;0008;   pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 6.heidi (10)
06/28/2007 11:39:22;0008;   pbs_mom;Req;send_sisters;sending ABORT to 
06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;job execution started
06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;start failed on unknown node
06/28/2007 11:39:22;0080;   pbs_mom;Job;6.heidi;local task termination detected.  killing job
06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;kill_job
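
For anyone wanting to pull the same records out of a longer mom log, here 
is a minimal sketch.  It assumes the ';'-separated field layout visible in 
the excerpt above (timestamp;event code;daemon;class;object;message), which 
is inferred from these lines rather than taken from the Torque 
documentation.  The function name filter_mom_log is made up for 
illustration.

```shell
# Print only the mom log records whose object field is the given job ID
# or TMomFinalizeJob3 (the function that reports the start failure above).
# Assumed field layout, inferred from the excerpt:
#   timestamp;event code;daemon;class;object;message
filter_mom_log() {
    awk -F';' -v job="$1" '
        $5 == job || $5 == "TMomFinalizeJob3" {
            msg = $6
            # Re-join the message in case it contains ";" itself.
            for (i = 7; i <= NF; i++) msg = msg ";" $i
            printf "%s [%s] %s: %s\n", $1, $2, $5, msg
        }'
}
```

It reads the log on standard input, for example 
"filter_mom_log 6.heidi < logfile", where logfile is whichever file under 
the mom log directory covers the day in question (the exact path depends 
on the installation).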

This has to be due to some obvious problem I'm missing.  The test has 
been run on Ubuntu (7.04) and Suse (SLES 8) systems with similar 
results.  syslog shows nothing.

Any advice to correct a hopefully simple oversight on my part would be 
appreciated.

Chad Vizino
Pittsburgh Supercomputing Center

