[torqueusers] Torque GSSAPI branch: compiling on AIX 5.3

Alessandro Federico alessandro.federico at caspur.it
Wed Jul 11 11:32:52 MDT 2007


Hi Garrick,

I installed the Torque GSSAPI branch with Moab on 4 nodes (IBM Power5,
8 cpus each, AIX 5.3).
I see this strange behavior:

- if a job requests more than one node (for example, nodes=9,
   nodes=12, nodes=32) it runs without errors

- if a job requests fewer than 8 processors it fails and pbs_mom
   logs 'Bad DIS based Request Protocol', as you can see below

#############################################################
07/11/2007 19:11:48;0001;   pbs_mom;Job;job_nodes;job: 16.pwr504.caspur.it numnodes=1 numvnod=4
07/11/2007 19:11:54;0001;   pbs_mom;Job;16.pwr504.caspur.it;phase 2 of job launch successfully completed
07/11/2007 19:11:57;0001;   pbs_mom;Job;TMomFinalizeJob3;read start return code=0 session=159864
07/11/2007 19:11:57;0001;   pbs_mom;Job;TMomFinalizeJob3;job 16.pwr504.caspur.it started, pid = 159864
07/11/2007 19:11:57;0001;   pbs_mom;Job;16.pwr504.caspur.it;job successfully started
07/11/2007 19:11:57;0008;   pbs_mom;Job;16.pwr504.caspur.it;job 16.pwr504.caspur.it reported successful start on 1 node(s)
07/11/2007 19:11:57;0080;   pbs_mom;Req;?;req header bad, dis error 7
07/11/2007 19:11:57;0080;   pbs_mom;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
07/11/2007 19:11:57;0002;   pbs_mom;Req;dis_reply_write;DIS reply failure, -1
07/11/2007 19:12:04;0080;   pbs_mom;Job;16.pwr504.caspur.it;task 1 terminated
07/11/2007 19:12:04;0008;   pbs_mom;Job;16.pwr504.caspur.it;job was terminated
07/11/2007 19:12:04;0080;   pbs_mom;Job;16.pwr504.caspur.it;local task termination detected.  killing job
07/11/2007 19:12:04;0008;   pbs_mom;Job;16.pwr504.caspur.it;kill_job
07/11/2007 19:12:04;0080;   pbs_mom;Job;16.pwr504.caspur.it;sending preobit jobstat
07/11/2007 19:12:04;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
07/11/2007 19:12:04;0080;   pbs_mom;Svr;preobit_reply;decode_DIS_Status worked, top of while loop
07/11/2007 19:12:04;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
07/11/2007 19:12:04;0080;   pbs_mom;Job;16.pwr504.caspur.it;performing job clean-up
07/11/2007 19:12:11;0080;   pbs_mom;Job;16.pwr504.caspur.it;Obit sent
07/11/2007 19:12:11;0001;   pbs_mom;Job;16.pwr504.caspur.it;server rejected job obit - unexpected job state
07/11/2007 19:12:11;0080;   pbs_mom;Job;16.pwr504.caspur.it;deleting job 16.pwr504.caspur.it in state OBIT
#############################################################

Can you help me, please?

Thanks
Ale

Garrick Staples wrote:
> On Mon, Jul 09, 2007 at 06:34:58PM +0200, Alessandro Federico alleged:
>> Hi,
>>
>> I found two errors compiling Torque GSSAPI branch on AIX 5.3.
>>
>> 1) src/lib/Libnet/get_hostnamefromaddr.c must include sys/socket.h
>>    to define the macro AF_INET, because on AIX the header
>>    netinet/in.h doesn't include sys/socket.h as it does on Linux.
>>
>> 2) in src/server/job_func.c (line 535) h_errno must be declared only
>>    if H_ERRNO_DECLARED is not defined.
> 
> I just checked in your patch.  Thanks for sending it in!
> 
> 
> 
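
For reference, the two compile fixes quoted above could look roughly like
the sketch below. This is not the patch that was checked in; it only
illustrates the pattern. The helper names init_ipv4_any and
last_resolver_error are hypothetical, and the extra !defined(h_errno)
check is an assumption added so the sketch also builds on systems where
netdb.h provides h_errno as a macro.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>   /* defines AF_INET; on AIX netinet/in.h does not pull this in */
#include <netinet/in.h>
#include <netdb.h>

/*
 * Declare h_errno only when the platform headers have not already done so.
 * H_ERRNO_DECLARED is the guard named in the report; the !defined(h_errno)
 * test is an added assumption for platforms where h_errno is a macro.
 */
#if !defined(H_ERRNO_DECLARED) && !defined(h_errno)
extern int h_errno;
#endif

/* Hypothetical helper: zero a sockaddr_in and mark it as IPv4 (needs AF_INET). */
void init_ipv4_any(struct sockaddr_in *sa)
{
    memset(sa, 0, sizeof *sa);
    sa->sin_family = AF_INET;
}

/* Hypothetical helper: return the current resolver error indicator. */
int last_resolver_error(void)
{
    return h_errno;
}
```

Including sys/socket.h explicitly is harmless on Linux, where
netinet/in.h already brings it in, so the same source should build on
both platforms.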

-- 
  Alessandro Federico
  CASPUR     http://www.caspur.it/
  e-mail:    alessandro.federico at caspur.it
  phone:     +39 06 44486708
  fax:       +39 06 4957083
------------------------------------------
  Military intelligence is a contradiction
  in terms.                 (Groucho Marx)
------------------------------------------

