[torqueusers] LDAP integration?

Prakash Velayutham prakash.velayutham at cchmc.org
Sat Feb 28 20:48:37 MST 2009


This might be way off, but nscd (the name service cache daemon) on the  
nodes can sometimes cause this.

Another thing to try is restarting MOM on the node. I have seen the  
Torque server not do the right thing when the name services were  
changed after it was started.
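
If it helps with debugging, below is a minimal sketch of a standalone
lookup test. It makes the same getpwnam() call as the start_exec.c
fragment quoted below, but also reports errno, so a missing entry can be
told apart from a failed lookup. The function name check_user and its
return codes are made up for this example; they are not from the Torque
source.

```c
#include <assert.h>
#include <errno.h>
#include <pwd.h>
#include <stdio.h>
#include <string.h>

/*
 * check_user is a hypothetical helper, not part of Torque.
 * Returns 0 if the user was found, 1 if there is no entry,
 * and 2 if the lookup itself failed (errno was set).
 */
int check_user(const char *name)
{
    struct passwd *pwdp;

    assert(name != NULL);

    errno = 0;                 /* so a stale errno is not misread below */
    pwdp = getpwnam(name);

    if (pwdp == NULL) {
        if (errno == 0) {
            fprintf(stderr, "%s: no password entry\n", name);
            return 1;
        }
        fprintf(stderr, "%s: lookup failed: %s\n", name, strerror(errno));
        return 2;
    }

    /* Print the entry in the same layout as getent passwd. */
    printf("%s:%s:%lu:%lu:%s:%s:%s\n",
           pwdp->pw_name, pwdp->pw_passwd,
           (unsigned long)pwdp->pw_uid, (unsigned long)pwdp->pw_gid,
           pwdp->pw_gecos, pwdp->pw_dir, pwdp->pw_shell);
    return 0;
}
```

Call check_user("crctst01") from a small main() and run it on the node as
the same user that runs pbs_mom (usually root), ideally started the same
way the daemon is started. A long-running daemon may not see the same
name service setup as a fresh login shell, which is why the restart is
worth trying.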

Hope that helps,
Prakash

On Feb 27, 2009, at 6:38 PM, Jim Turner wrote:

> I'm trying to submit a job on a cluster where users are  
> authenticated using LDAP to a server external to the cluster. I can  
> log in and ssh (without password) to any node in the cluster. But  
> when I try to submit a job the MOM log says - cannot find user in  
> password file...
>
> 02/27/2009 18:16:48;0001; pbs_mom;Svr;pbs_mom;start_exec, no password entry for user crctst01
> 02/27/2009 18:16:48;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
> 02/27/2009 18:16:48;0001; pbs_mom;Svr;pbs_mom;exec_bail, exec_bail: sent 0 ABORT requests, should be 3
> 02/27/2009 18:16:48;0008; pbs_mom;Job;4.queuesrv1;Job Modified at request of PBS_Server@queuesrv1.hpc.louisville.edu
> 02/27/2009 18:16:48;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
> 02/27/2009 18:16:48;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> 02/27/2009 18:16:48;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
> 02/27/2009 18:16:48;0008; pbs_mom;Job;4.queuesrv1;checking job post-processing routine
> 02/27/2009 18:16:48;0080; pbs_mom;Job;4.queuesrv1;obit sent to server
> 02/27/2009 18:16:48;0001; pbs_mom;Svr;pbs_mom;Success (0) in fork_to_user, cannot find user 'crctst01' in password file
> 02/27/2009 18:16:48;0080; pbs_mom;Req;req_reject;Reject reply code=15023(Bad UID for job execution REJHOST=node312.hpc.louisville.edu MSG=cannot find user 'crctst01' in password file), aux=0, type=CopyFiles, from PBS_Server@queuesrv1.hpc.louisville.edu
>
> This is that user on the node:
>
> crctst01@node312$ getent passwd crctst01
> crctst01:*:100003:100001:crctst01:/home/crctst01:/bin/bash
> crctst01@node312$
>
> And if I read the code correctly, I think I'm getting rejected  
> by this fragment in src/resmom/start_exec.c:
>
> pwdp = getpwnam(ptr);
>
> if (pwdp == NULL)
> {
>     /* FAILURE */
>
>     sprintf(log_buffer, "no password entry for user %s",
>             ptr);
>
>     return(NULL);
> }
>
> My own test case calling getpwnam returns the correct entry on  
> that node. Does anybody have an idea how to debug this?
>
> Jim Turner
> Cluster Enablement Team (CET) Senior Engineer
> phone: 919-543-2505 / mobile: 919-381-8739
> tjim at us.ibm.com
> ibm.com/systems/services/labservices
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
