[torqueusers] Re: Problem with job starts on Linux

Chad Vizino vizino at psc.edu
Thu Jun 28 11:44:17 MDT 2007


Typo in my previous post: the nodes file entry mentioned below should be 
"heidi:ts" for the machine that I was testing on. I was also working on 
another machine (blueracer) and mistyped the node name.

Sorry for the confusion.

   -Chad

On 06/28/2007 11:49 AM, Chad Vizino wrote:
> Hi,
> 
> It's been a while since I have built and run Torque, so I downloaded 
> 2.1.8 to try out on a couple of Linux systems (Ubuntu and Suse).  I 
> wanted to confirm basic operation, so I did the customary configure, 
> make, make install sequence and started the server, mom and sched 
> daemons on the same host, adding the host to 
> /var/spool/torque/server_priv/nodes ("blueracer:ts"). 
> The mom host shows up as free and timeshared on "pbsnodes -a" output. 
> /var/spool/torque/mom_priv/config is:
> 
> $logevent 0x1ff
> $loglevel 7
> $usecp *:/home /home
> 
> However, I can't get a job to start under 2.1.8.  I have tried a simple 
> job start test under versions 2.0.0p11, 2.1.1, 2.1.2, 2.1.3, 2.1.6, and
> 2.1.8.  The following failure occurs under every version starting 
> with 2.1.2:
> 
> $ qsub -j eo
> echo touching /tmp/foo
> touch /tmp/foo
> 6.heidi
> $ ls -l STDIN.e6
> -rw------- 1 vizino vizino 0 2007-06-28 11:39 STDIN.e6
> $ ls -l /tmp/foo
> ls: /tmp/foo: No such file or directory
> 
> From the mom log:
> 
> 06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;starting job execution
> 06/28/2007 11:39:22;0001;   pbs_mom;Job;job_nodes;0: heidi/0
> 06/28/2007 11:39:22;0001;   pbs_mom;Job;job_nodes;job: 6.heidi numnodes=1 numvnod=1
> 06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;evaluating limits for job
> 06/28/2007 11:39:22;0002;   pbs_mom;n/a;mom_close_poll;entered
> 06/28/2007 11:39:22;0001;   pbs_mom;Job;6.heidi;phase 2 of job launch successfully completed
> 06/28/2007 11:39:22;0001;   pbs_mom;Job;TMomFinalizeJob3;read start return code=-2 session=26752
> 06/28/2007 11:39:22;0001;   pbs_mom;Job;TMomFinalizeJob3;job not started, Failure job exec failure, after files staged, no retry
> 06/28/2007 11:39:22;0008;   pbs_mom;Req;send_sisters;sending command ABORT_JOB for job 6.heidi (10)
> 06/28/2007 11:39:22;0008;   pbs_mom;Req;send_sisters;sending ABORT to sisters
> 06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;job execution started
> 06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;start failed on unknown node
> 06/28/2007 11:39:22;0080;   pbs_mom;Job;6.heidi;local task termination detected.  killing job
> 06/28/2007 11:39:22;0008;   pbs_mom;Job;6.heidi;kill_job
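
A rough sketch of how the failure records could be pulled out of a mom log
like this one. It is not part of Torque. The function names
(join_mom_records, failed_records) are made up for illustration, and the
field layout (timestamp;event;daemon;class;object;message) is inferred only
from the excerpt in this message. It tolerates "> " quote prefixes and
records that a mail client has wrapped across lines, and it assumes the
message text itself contains no semicolons.

```shell
#!/bin/sh
# Hypothetical helpers for reading a pbs_mom log excerpt; not Torque tools.

# Strip a leading "> " quote marker and rejoin wrapped records.
# Assumption: every record starts with an MM/DD/YYYY timestamp; any other
# non-empty line is a continuation of the previous record.
join_mom_records() {
    sed 's/^> \{0,1\}//' | awk '
        /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9] / {
            if (rec != "") print rec   # flush the previous record
            rec = $0
            next
        }
        rec != "" && NF > 0 {
            sub(/[ \t]+$/, "", rec)    # drop trailing blanks before joining
            rec = rec " " $0
        }
        END { if (rec != "") print rec }
    '
}

# Print timestamp, object and message for records whose message mentions
# a failure or a negative return code.
failed_records() {
    join_mom_records |
        awk -F';' 'tolower($6) ~ /fail|return code=-/ { print $1 " | " $5 " | " $6 }'
}
```

Usage would be along the lines of "failed_records < mom_logs/20070628",
where the log file name is only an example.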
> 
> This has to be due to some obvious problem I'm missing.  The test has 
> been run on Ubuntu (7.04) and Suse (SLES 8) systems with similar 
> results.  syslog shows nothing.
> 
> Any advice to correct a hopefully simple oversight on my part would be 
> appreciated.
> 
>   -Chad
> 
> Chad Vizino
> Pittsburgh Supercomputing Center
> 

