[torqueusers] Re: Problem with job starts on Linux
Chad Vizino
vizino at psc.edu
Thu Jun 28 11:44:17 MDT 2007
Typo in my previous post: the nodes file entry mentioned below should be
"heidi:ts" for the machine that I was testing on. I was also working on
another machine (blueracer) and mistyped the node name.
Sorry for the confusion.
-Chad
On 06/28/2007 11:49 AM, Chad Vizino wrote:
> Hi,
>
> It's been a while since I have built and run Torque, so I downloaded
> 2.1.8 to try out on a couple of Linux systems (Ubuntu and Suse). I
> wanted to confirm basic operation, so I did the customary
> configure-make-make install sequence and started the server, mom and
> sched daemons on the same host, adding the host to
> /var/spool/torque/server_priv/nodes ("blueracer:ts").
> The mom host shows up as free and timeshared on "pbsnodes -a" output.
> /var/spool/torque/mom_priv/config is:
>
> $logevent 0x1ff
> $loglevel 7
> $usecp *:/home /home
>
> However, I can't get a job to start under 2.1.8. I have tried a simple
> job start test under versions 2.0.0p11, 2.1.1, 2.1.2, 2.1.3, 2.1.6, and
> 2.1.8. The following failure occurs under every version starting
> with 2.1.2:
>
> $ qsub -j eo
> echo touching /tmp/foo
> touch /tmp/foo
> 6.heidi
> $ ls -l STDIN.e6
> -rw------- 1 vizino vizino 0 2007-06-28 11:39 STDIN.e6
> $ ls -l /tmp/foo
> ls: /tmp/foo: No such file or directory
>
> From the mom log:
>
> 06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;starting job execution
> 06/28/2007 11:39:22;0001; pbs_mom;Job;job_nodes;0: heidi/0
> 06/28/2007 11:39:22;0001; pbs_mom;Job;job_nodes;job: 6.heidi
> numnodes=1 numvnod=1
> 06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;evaluating limits for job
> 06/28/2007 11:39:22;0002; pbs_mom;n/a;mom_close_poll;entered
> 06/28/2007 11:39:22;0001; pbs_mom;Job;6.heidi;phase 2 of job launch
> successfully completed
> 06/28/2007 11:39:22;0001; pbs_mom;Job;TMomFinalizeJob3;read start
> return code=-2 session=26752
> 06/28/2007 11:39:22;0001; pbs_mom;Job;TMomFinalizeJob3;job not
> started, Failure job exec failure, after files staged, no retry
> 06/28/2007 11:39:22;0008; pbs_mom;Req;send_sisters;sending command
> ABORT_JOB for job 6.heidi (10)
> 06/28/2007 11:39:22;0008; pbs_mom;Req;send_sisters;sending ABORT to
> sisters
> 06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;job execution started
> 06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;start failed on unknown
> node
> 06/28/2007 11:39:22;0080; pbs_mom;Job;6.heidi;local task termination
> detected. killing job
> 06/28/2007 11:39:22;0008; pbs_mom;Job;6.heidi;kill_job
>
> This has to be due to some obvious problem I'm missing. The test has
> been run on Ubuntu (7.04) and Suse (SLES 8) systems with similar
> results. syslog shows nothing.
>
> Any advice to correct a hopefully simple oversight on my part would be
> appreciated.
>
> -Chad
>
> Chad Vizino
> Pittsburgh Supercomputing Center
>
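For anyone comparing versions the same way, the one record in the mom log
quoted above that differs between a good and a bad start is the
TMomFinalizeJob3 "read start return code=... session=..." line. A minimal
sketch for pulling that value out of a mom log follows. The function name
mom_start_rc is made up for illustration, and it assumes each record sits
on a single line in the on-disk log (the mail client wrapped them above)
and that the message format matches what 2.1.8 printed here.

```shell
# Hypothetical helper: read a pbs_mom log on stdin and print
# "<return code> <session id>" for every start-return record found.
# Assumes the record is on one line, e.g.
#   ...;pbs_mom;Job;TMomFinalizeJob3;read start return code=-2 session=26752
mom_start_rc() {
    # -n suppresses default output; the trailing p prints only lines
    # where the substitution matched. The optional "-" allows negative values.
    sed -n 's/.*read start return code=\(-\{0,1\}[0-9][0-9]*\) session=\(-\{0,1\}[0-9][0-9]*\).*/\1 \2/p'
}

# Example use (log path is an assumption based on the default spool layout):
#   mom_start_rc < /var/spool/torque/mom_logs/20070628
```

Running it against the logs from each version tested gives a quick way to
see at which release the return code stops indicating a successful start.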