[torqueusers] Problems compiling Torque GSSAPI branch

Mike Coyne Mike.Coyne at PACCAR.com
Thu Mar 11 07:25:30 MST 2010


In 
#1  0xb7eae5d8 in ccname_for_job (jobname=0x81803b9
"214.cluster-master.cluster-test.local", prefix=0x0) at
../Libifl/pbsgss.c:839
you are getting NULL for prefix, which is a char * set to path_creds in
the calling function.
path_creds is defined in src/resmom/mom_main.c, or it should be...
Try something like this, maybe on or around line 212 ...
char        *path_aux;
char        *path_server_name;
char        *path_home = PBS_SERVER_HOME;
#ifdef GSSAPI
char           *path_creds = "/tmp";
#endif
char        *mom_home;
extern char *msg_daemonname;          /* for logs     */
extern char *msg_info_mom; /* Mom information message   */
extern int pbs_errno;
gid_t  pbsgroup;
unsigned int pbs_mom_port = 0;
unsigned int pbs_rm_port = 0;

to initialize it to /tmp
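
If you also want a safety net in the helper itself, a guard along these lines might do. This is only a sketch under assumptions: ccname_for_job_safe() and DEFAULT_CRED_DIR are hypothetical names, not the real ccname_for_job() from pbsgss.c, and the "cc_" file-name pattern is a placeholder rather than the naming scheme Torque actually uses. It only shows how to avoid passing a NULL prefix to strlen().

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fallback directory; "/tmp" mirrors the initializer suggested above. */
#define DEFAULT_CRED_DIR "/tmp"

/*
 * Hypothetical helper, NOT the real ccname_for_job(): builds
 * "<prefix>/cc_<jobname>" in a freshly malloc'd buffer.
 * The "cc_" pattern is a placeholder, not Torque's actual naming scheme.
 * Returns NULL on bad input or allocation failure; the caller frees the result.
 */
char *ccname_for_job_safe(const char *jobname, const char *prefix)
{
    size_t len;
    char *name;

    if (jobname == NULL)
        return NULL;
    if (prefix == NULL)          /* the case that crashed strlen() in the backtrace */
        prefix = DEFAULT_CRED_DIR;

    len = strlen(prefix) + strlen("/cc_") + strlen(jobname) + 1;
    name = malloc(len);
    if (name == NULL)
        return NULL;

    snprintf(name, len, "%s/cc_%s", prefix, jobname);
    return name;
}
```

With a NULL prefix the helper falls back to the default directory instead of segfaulting, so a missing path_creds definition would show up as credentials landing in /tmp rather than as a crash in the mom.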

One other thing: if you are using AFS on Linux, you will want to call lsetpag() directly in src/lib/Libifl/pbsgss.c instead of using aklog -setpag, as that doesn't actually work on Linux. lsetpag is in /usr/lib64/afs/libsys.a, which you will need to link with.

/* assumes it's running as the mom, because server doesn't need to call aklog */
int authenticate_as_job(char *ccname,
			int setpag) {
  if (setenv("KRB5CCNAME",ccname,1) != 0) {
    return -1;
  }
  if (setpag) {
    system("/usr/bin/aklog -setpag");   
  } else {
    system("/usr/bin/aklog");
  }
  return 0;
}


Maybe change the
  if (setpag) {
    system("/usr/bin/aklog -setpag");   
  } else {

call to something simple like
  if (setpag) {
    lsetpag();
    system("/usr/bin/aklog ");   
  } else {
And link with libsys.a.
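
Note that neither version checks what system() returns, so a failed aklog goes unnoticed. A variant that does check might look roughly like the following. Treat it as a sketch under assumptions: authenticate_as_job_checked(), run_checked() and the setpag_fn hook are hypothetical names, and the hook assumes lsetpag() can be called as a function taking no arguments and returning an int, which you should verify against your OpenAFS headers. Passing the command and the hook in as parameters is only there so the sketch can be tried without AFS installed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict -std= modes */
#endif
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical hook type: in a real build this would point at lsetpag()
 * from libsys.a; its prototype is an assumption here. */
typedef int (*setpag_fn)(void);

/* Run a shell command; return 0 only if it exited normally with status 0. */
static int run_checked(const char *cmd)
{
    int status = system(cmd);

    if (status == -1)
        return -1;                       /* could not start the shell */
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;                       /* killed by a signal or non-zero exit */
    return 0;
}

/*
 * Hypothetical variant of authenticate_as_job(): sets KRB5CCNAME,
 * optionally creates a new PAG through new_pag, then runs aklog_cmd
 * (e.g. "/usr/bin/aklog") and reports failure of any step as -1.
 */
int authenticate_as_job_checked(const char *ccname,
                                setpag_fn new_pag,
                                const char *aklog_cmd)
{
    if (ccname == NULL || aklog_cmd == NULL)
        return -1;
    if (setenv("KRB5CCNAME", ccname, 1) != 0)
        return -1;
    if (new_pag != NULL && new_pag() != 0)
        return -1;
    return run_checked(aklog_cmd);
}
```

In the mom this would be called with lsetpag as the hook when setpag is set and NULL otherwise, with "/usr/bin/aklog" as the command in both cases.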
Mike

-----Original Message-----
From: Peter Smith [mailto:peter.smith3882100 at gmail.com] 
Sent: Wednesday, March 10, 2010 6:10 PM
To: Mike Coyne
Cc: torqueusers at supercluster.org
Subject: Re: [torqueusers] Problems compiling Torque GSSAPI branch

Configuration files have now been thoroughly checked and no
configuration errors have been found.

"watch -n 0,1 /tmp/" has been running on the worker node at the same
time as a job is submitted, but no creds are found in the directory at
any time. However, looking through the pbs_mom logfile after the
segfault, it seems like the worker node receives a ticket from the
master, which is very strange. The permissions on /tmp are correct and
Kerberos between master and worker is functional, as SSH has been
configured to use Kerberos credentials.

This information was gathered while a user with a valid
forwardable ticket was submitting 'echo "sleep 30" | qsub':

---------------------------
gdb session:

gdb /usr/local/sbin/pbs_mom
Starting program: /usr/local/sbin/pbs_mom
[Thread debugging using libthread_db enabled]
[New Thread 0xb7c0c6d0 (LWP 1836)]
(gdb) r
MOM is up
do_rpp: got a resource monitor request
do_rpp: got a resource monitor request
Accepting user creds for 214.cluster-master.cluster-test.local

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb7c0c6d0 (LWP 1836)]
0xb7cc13b3 in strlen () from /lib/i686/cmov/libc.so.6
(gdb) bt
#0  0xb7cc13b3 in strlen () from /lib/i686/cmov/libc.so.6
#1  0xb7eae5d8 in ccname_for_job (jobname=0x81803b9
"214.cluster-master.cluster-test.local", prefix=0x0) at
../Libifl/pbsgss.c:839
#2  0x08064bd5 in req_accept_forwarded_creds (request=0x817f6e0,
socket=10, save=1) at requests.c:4141
#3  0x08078443 in dispatch_request (sfds=10, request=0x817f6e0) at
../server/process_request.c:1054
#4  0x08078843 in process_request (sfds=10) at ../server/process_request.c:730
#5  0xb7eac902 in wait_request (waittime=1, SState=0x0) at
../Libnet/net_server.c:503
#6  0x0805fbbf in main_loop () at mom_main.c:8101
#7  0x0806105d in main (argc=1, argv=0xbfffd894) at mom_main.c:8203

(gdb) frame 0
#0  0xb7cc13b3 in strlen () from /lib/i686/cmov/libc.so.6
(gdb) frame 1
#1  0xb7eae5d8 in ccname_for_job (jobname=0x81803b9
"214.cluster-master.cluster-test.local", prefix=0x0) at
../Libifl/pbsgss.c:839
839	  i = strlen(prefix) +
(gdb) frame 2
#2  0x08064bd5 in req_accept_forwarded_creds (request=0x817f6e0,
socket=10, save=1) at requests.c:4141
4141	  ccname = ccname_for_job(request->rq_ind.rq_queuejob.rq_jid,path_creds);
(gdb) frame 3
#3  0x08078443 in dispatch_request (sfds=10, request=0x817f6e0) at
../server/process_request.c:1054
1054	      if (req_accept_forwarded_creds(request,sfds,1) != 0) {
(gdb) frame 4
#4  0x08078843 in process_request (sfds=10) at ../server/process_request.c:730
730	  dispatch_request(sfds, request);
(gdb) frame 5
#5  0xb7eac902 in wait_request (waittime=1, SState=0x0) at
../Libnet/net_server.c:503
503	        svr_conn[i].cn_func(i);
(gdb) frame 6
#6  0x0805fbbf in main_loop () at mom_main.c:8101
8101	    if (wait_request(tmpTime, NULL) != 0)
(gdb) frame 7
#7  0x0806105d in main (argc=1, argv=0xbfffd894) at mom_main.c:8203
8203	  main_loop();
---------------------------
pbs_mom logfile [loglevel 7]

pbs_mom;n/a;rm_request;setting alarm in rm_request
pbs_mom;n/a;rm_request;setting alarm in rm_request
pbs_mom;n/a;rm_request;setting alarm in rm_request
pbs_mom;n/a;rm_request;setting alarm in rm_request
pbs_mom;n/a;rm_request;setting alarm in rm_request
pbs_mom;n/a;rm_request;setting alarm in rm_request
pbs_mom;Req;dis_request_read;decoding command ForwardCreds from PBS_Server
pbs_mom;Req;;Type ForwardCreds request received from
PBS_Server at cluster-master.cluster-test.local, sock=10
pbs_mom;Job;process_request;request type ForwardCreds from host
cluster-master.cluster-test.local received
pbs_mom;Job;process_request;request type ForwardCreds from host
cluster-master.cluster-test.local allowed
pbs_mom;Job;dispatch_request;dispatching request ForwardCreds on sd=10
pbs_mom;Svr;req_accept_forwarded_creds;dispatching request ForwardCreds on sd=10
---------------------------
strace output from pbs_mom is attached to this mail [pbs_mom_strace.txt]
---------------------------


On Wed, Mar 10, 2010 at 4:33 PM, Mike Coyne <Mike.Coyne at paccar.com> wrote:
> req_accept_forwarded_creds is where it's authenticating and saving the creds to /tmp/<something that looks like a job name with krb...prepended>
> You might check to see if it wrote out the creds. One thing that has been helpful for me when mom dumps is to start it up in gdb with debugging turned on to keep it from forking, let it dump and do a backtrace .... something like
> $ export PBSDEBUG=1
> $ gdb /opt/torque/sbin/pbs_mom
>> r <whatever mom options you set>
>
> And run your job on the node; when it dumps, do a backtrace
>> bt
>
> that way you can see what called the library function that died.
>
> Mike


