[torqueusers] Torque GSSAPI branch: compiling on AIX 5.3
Alessandro Federico
alessandro.federico at caspur.it
Wed Jul 11 11:32:52 MDT 2007
Hi Garrick,
I installed the Torque GSSAPI branch with Moab on 4 nodes (IBM Power5,
8 CPUs, AIX 5.3).
I get this strange behavior:
- if a job requests more than one node (for example, nodes=9,
nodes=12, nodes=32) it runs without errors
- if a job requests fewer than 8 processors it fails and pbs_mom
logs 'Bad DIS based Request Protocol', as you can see below:
#############################################################
07/11/2007 19:11:48;0001; pbs_mom;Job;job_nodes;job:
16.pwr504.caspur.it numnodes=1 numvnod=4
07/11/2007 19:11:54;0001; pbs_mom;Job;16.pwr504.caspur.it;phase 2 of
job launch successfully completed
07/11/2007 19:11:57;0001; pbs_mom;Job;TMomFinalizeJob3;read start
return code=0 session=159864
07/11/2007 19:11:57;0001; pbs_mom;Job;TMomFinalizeJob3;job
16.pwr504.caspur.it started, pid = 159864
07/11/2007 19:11:57;0001; pbs_mom;Job;16.pwr504.caspur.it;job
successfully started
07/11/2007 19:11:57;0008; pbs_mom;Job;16.pwr504.caspur.it;job
16.pwr504.caspur.it reported successful start on 1 node(s)
07/11/2007 19:11:57;0080; pbs_mom;Req;?;req header bad, dis error 7
07/11/2007 19:11:57;0080; pbs_mom;Req;req_reject;Reject reply
code=15056(Bad DIS based Request Protocol MSG=cannot decode
message), aux=0, type=Connect, from @
07/11/2007 19:11:57;0002; pbs_mom;Req;dis_reply_write;DIS reply
failure, -1
07/11/2007 19:12:04;0080; pbs_mom;Job;16.pwr504.caspur.it;task 1
terminated
07/11/2007 19:12:04;0008; pbs_mom;Job;16.pwr504.caspur.it;job was
terminated
07/11/2007 19:12:04;0080; pbs_mom;Job;16.pwr504.caspur.it;local task
termination detected. killing job
07/11/2007 19:12:04;0008; pbs_mom;Job;16.pwr504.caspur.it;kill_job
07/11/2007 19:12:04;0080; pbs_mom;Job;16.pwr504.caspur.it;sending
preobit jobstat
07/11/2007 19:12:04;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
07/11/2007 19:12:04;0080; pbs_mom;Svr;preobit_reply;decode_DIS_Status
worked, top of while loop
07/11/2007 19:12:04;0080; pbs_mom;Svr;preobit_reply;in while loop, no
error from job stat
07/11/2007 19:12:04;0080; pbs_mom;Job;16.pwr504.caspur.it;performing
job clean-up
07/11/2007 19:12:11;0080; pbs_mom;Job;16.pwr504.caspur.it;Obit sent
07/11/2007 19:12:11;0001; pbs_mom;Job;16.pwr504.caspur.it;server
rejected job obit - unexpected job state
07/11/2007 19:12:11;0080; pbs_mom;Job;16.pwr504.caspur.it;deleting job
16.pwr504.caspur.it in state OBIT
#############################################################
Can you help me, please?
Thanks
Ale
Garrick Staples wrote:
> On Mon, Jul 09, 2007 at 06:34:58PM +0200, Alessandro Federico alleged:
>> Hi,
>>
>> I found two errors compiling Torque GSSAPI branch on AIX 5.3.
>>
>> 1) src/lib/Libnet/get_hostnamefromaddr.c must include sys/socket.h
>> in order to define the macro AF_INET. This is because on AIX
>> the header netinet/in.h doesn't include sys/socket.h as it does on Linux.
>>
>> 2) in src/server/job_func.c (line 535) h_errno must be defined only
>> if H_ERRNO_DECLARED is not.
>
> I just checked in your patch. Thanks for sending it in!
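
For anyone hitting the same build errors, the two fixes quoted above can be
sketched roughly as follows. This is only an illustration, not the patch that
was checked in: the helper is_ipv4_family() is hypothetical, and the extra
!defined(h_errno) check is my own assumption to cope with platforms where
h_errno is a macro rather than a plain variable.

```c
/* Sketch of the two portability fixes; names and layout are illustrative. */
#include <sys/types.h>
#include <sys/socket.h>  /* Fix 1: include explicitly, since AIX's
                            netinet/in.h does not pull it in (AF_INET) */
#include <netinet/in.h>
#include <netdb.h>       /* may already declare h_errno */

/* Fix 2: declare h_errno ourselves only when the build has not flagged it
 * as already declared. The !defined(h_errno) test is an added assumption
 * for C libraries that implement h_errno as a macro. */
#if !defined(H_ERRNO_DECLARED) && !defined(h_errno)
extern int h_errno;
#endif

/* Hypothetical helper: returns 1 if the address family is IPv4, else 0.
 * It exists only to show that AF_INET is visible after the include above. */
int is_ipv4_family(int family)
{
    return family == AF_INET;
}
```

With sys/socket.h included directly, the file no longer depends on what
netinet/in.h happens to include on a given platform.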
--
Alessandro Federico
CASPUR http://www.caspur.it/
e-mail: alessandro.federico at caspur.it
phone: +39 06 44486708
fax: +39 06 4957083
------------------------------------------
Military intelligence is a contradiction
in terms. (Groucho Marx)
------------------------------------------