[torqueusers] Post job file processing bug in 1.1.0 (patch 1-3)

Roy Dragseth Roy.Dragseth at cc.uit.no
Wed Oct 27 03:12:33 MDT 2004


On Thursday 21 October 2004 19:04, Dave Jackson wrote:
>   The reply code 15035 indicates that an invalid home directory was
> specified when the mom was attempting to fork to the user.  Torque
> 1.1.0p3 and higher fixed a bug where the mom would attempt to determine
> the home directory based on a NULL job.  This is the only change which
> would have affected it.  It appears that this bug masked another.
>
>   The latest torque-1.1.0p4 snapshot contains a re-organized mom-level
> fork routine which will log environment errors better and will also
> sanity check the home directory.  If you can regularly reproduce this
> failure, please test with the 1.1.0p4 snapshot and send us the logs.
> You should only need to upgrade the moms on the nodes where the test job
> is being run.  For maximum value, please export the env variable
> PBSLOGLEVEL=3 on the compute nodes before starting the mom.
>
>   With this info, we should be able to rectify this problem quickly.

Hi, I've installed torque-1.1.0p4-snap.1098376627.tar.gz and get the 
following in the mom logfile when I submit a trivial job:

10/27/2004 10:53:58;0002;   pbs_mom;Svr;Log;Log opened
10/27/2004 10:53:58;0002;   pbs_mom;Svr;restricted;paiute.local
10/27/2004 10:53:58;0002;   pbs_mom;Svr;usecp;paiute.cc.uit.no:/home /home
10/27/2004 10:53:58;0002;   pbs_mom;n/a;initialize;independent
10/27/2004 10:53:58;0002;   pbs_mom;Svr;pbs_mom;Is up
10/27/2004 10:53:58;0002;   pbs_mom;n/a;is_update_stat;is_update_stat: added connection to paiute.local
10/27/2004 10:53:58;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to server
10/27/2004 10:53:58;0002;   pbs_mom;n/a;is_update_stat;hello sent to server
10/27/2004 10:54:29;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to server
10/27/2004 10:54:49;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to server
10/27/2004 10:55:19;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to server
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type queuejob request received from PBS_Server at paiute.local, sock=10
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type jobscript request received from PBS_Server at paiute.local, sock=10
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type readytocommit request received from PBS_Server at paiute.local, sock=10
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type commit request received from PBS_Server at paiute.local, sock=10
10/27/2004 10:55:30;0008;   pbs_mom;Job;3.paiute.cc.uit.no;Started, pid = 2400
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type statusjob request received from PBS_Server at paiute.local, sock=11
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type modifyjob request received from PBS_Server at paiute.local, sock=10
10/27/2004 10:55:30;0008;   pbs_mom;Job;3.paiute.cc.uit.no;Job Modified at request ofPBS_Server at paiute.local
10/27/2004 10:55:30;0080;   pbs_mom;Job;3.paiute.cc.uit.no;scan_for_terminated: task 1 terminated, sid 2400
10/27/2004 10:55:30;0008;   pbs_mom;Job;3.paiute.cc.uit.no;Terminated
10/27/2004 10:55:30;0008;   pbs_mom;Job;3.paiute.cc.uit.no;kill_job
10/27/2004 10:55:30;0080;   pbs_mom;Job;3.paiute.cc.uit.no;Obit sent
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type deletefiles request received from PBS_Server at paiute.local, sock=11
10/27/2004 10:55:30;0004;   pbs_mom;Fil;3.paiute.cc.uit.no;forking to user, uid: 50008  gid: 50008  homedir: '/home/royd'

10/27/2004 10:55:30;0080;   pbs_mom;Req;req_reject;Reject reply code=15035
( MSG=), aux=0, type=54, from PBS_Server at paiute.local
10/27/2004 10:55:30;0100;   pbs_mom;Req;;Type deletejob request received from PBS_Server at paiute.local, sock=11
10/27/2004 10:55:40;0002;   pbs_mom;n/a;is_update_stat;status update successfully sent to server
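
For anyone who wants to pick the rejected request out of a longer mom log, 
here is a rough Python sketch. It is only an illustration: the field labels 
(timestamp, event, daemon, cls, obj, message) are my own names for the six 
semicolon-separated fields seen in the log above, not official TORQUE 
terminology, and join_wrapped, parse_mom_log and find_rejects are hypothetical 
helper names. It also re-joins lines that were wrapped by the mail client.

```python
import re

# Assumed labels for the six ';'-separated fields seen in the log above.
FIELDS = ("timestamp", "event", "daemon", "cls", "obj", "message")

# A new record starts with a "MM/DD/YYYY HH:MM:SS;" timestamp.
STAMP = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2};")
REJECT = re.compile(r"Reject reply code=(\d+)")


def join_wrapped(lines):
    """Yield one string per record, re-joining wrapped continuation lines."""
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if STAMP.match(line):
            if current is not None:
                yield current
            current = line
        elif current is not None and line.strip():
            # Continuation of the previous record (blank lines are skipped).
            current = current.rstrip() + " " + line.strip()
    if current is not None:
        yield current


def parse_mom_log(lines):
    """Yield a dict per record; records without six fields are skipped."""
    for record in join_wrapped(lines):
        parts = [p.strip() for p in record.split(";", 5)]
        if len(parts) == 6:
            yield dict(zip(FIELDS, parts))


def find_rejects(records):
    """Return (timestamp, reply_code) for every 'Reject reply code' record."""
    found = []
    for rec in records:
        match = REJECT.search(rec["message"])
        if match:
            found.append((rec["timestamp"], int(match.group(1))))
    return found
```

Something like find_rejects(parse_mom_log(open(logfile))) would then list the 
rejects, where logfile is whatever path the mom writes its log to on your 
system.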


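It seems to me that the correct homedir is used: the "forking to user" line 
shows homedir '/home/royd', yet the request is still rejected with reply code 
15035.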

Some details about the setup in Rocks:

Home dirs:
The compute nodes automount the home dirs as needed from the frontend.  This 
seems to happen correctly: the submitting user's homedir appears as mounted on 
the compute node running the job.
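
A quick way to double-check this on a compute node is a small script along 
the following lines. It is a sketch under my own assumptions about what a 
home directory sanity check might cover (the path exists, is a directory, and 
is owned by the expected uid); it is not what the mom itself does, and 
check_homedir is a hypothetical name.

```python
import os
import stat


def check_homedir(path, uid):
    """Return a list of problems found with a home directory (empty if none)."""
    try:
        # Looking up the path normally also makes the automounter mount it.
        st = os.stat(path)
    except OSError as exc:
        return ["cannot stat %s: %s" % (path, exc.strerror)]
    problems = []
    if not stat.S_ISDIR(st.st_mode):
        problems.append("%s is not a directory" % path)
    if st.st_uid != uid:
        problems.append("%s is owned by uid %d, expected %d"
                        % (path, st.st_uid, uid))
    return problems
```

With the values from the log above, the call would be 
check_homedir('/home/royd', 50008); an empty list means none of these three 
checks failed.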

Network setup:
The compute nodes are on a private network behind the frontend, which 
functions as a firewall and router for the nodes.  The frontend also runs DNS 
for the cluster.

The hostname of the frontend is its FQDN, in my case paiute.cc.uit.no, while 
the compute nodes see it as paiute.local, the name associated with the NIC on 
the private network.  This leads to the following asymmetry in the mom's 
config:

[root at compute-0-0 root]# cat /opt/torque/mom_priv/config
$restricted paiute.local
$clienthost paiute.local
$clienthost localhost.localdomain
$clienthost localhost
$usecp paiute.cc.uit.no:/home /home

The $usecp line uses paiute.cc.uit.no because the mom believes that the jobs 
are submitted from paiute.cc.uit.no, not paiute.local.
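
To make the effect of that asymmetry concrete, here is a rough Python sketch 
of how a $usecp entry maps a "host:/path" destination to a local path. It is 
a simplified model based on my reading of the config above (exact host match 
plus path prefix match), not TORQUE's actual matching code; parse_mom_config 
and local_path_for are hypothetical names, and test.o3 is a made-up output 
file name.

```python
def parse_mom_config(text):
    """Collect $clienthost names and $usecp mappings from mom config text."""
    clienthosts, usecp = [], []
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        if words[0] == "$clienthost" and len(words) == 2:
            clienthosts.append(words[1])
        elif words[0] == "$usecp" and len(words) == 3:
            # "$usecp host:/remote/prefix /local/prefix"
            host, _, prefix = words[1].partition(":")
            usecp.append((host, prefix, words[2]))
    return clienthosts, usecp


def local_path_for(dest, usecp):
    """Map 'host:/path' to a local path via $usecp, or return None."""
    host, _, path = dest.partition(":")
    for uhost, prefix, local in usecp:
        base = prefix.rstrip("/")
        # Simplified assumption: exact host name, prefix on a '/' boundary.
        if host == uhost and (path == base or path.startswith(base + "/")):
            return local.rstrip("/") + path[len(base):]
    return None


CONFIG = """\
$restricted paiute.local
$clienthost paiute.local
$clienthost localhost.localdomain
$clienthost localhost
$usecp paiute.cc.uit.no:/home /home
"""

clienthosts, usecp = parse_mom_config(CONFIG)
# Destination named by FQDN: handled as a local copy.
print(local_path_for("paiute.cc.uit.no:/home/royd/test.o3", usecp))
# Destination named by the private-network name: no $usecp match.
print(local_path_for("paiute.local:/home/royd/test.o3", usecp))
```

In this model only the FQDN form matches, which is why the $usecp line has to 
name paiute.cc.uit.no even though every other line uses paiute.local.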

Again, this setup works with torque 1.0.1p6.

Hope this helps.

Best regards,
r.
-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ, Norway.
	      phone:+47 77 64 41 07, fax:+47 77 64 41 00
     Roy Dragseth, High Performance Computing System Administrator
	 Direct call: +47 77 64 62 56. email: royd at cc.uit.no

