[torqueusers] Question about the difference between a node where pbs_server is run and a compute node

Bas van der Vlies basv at sara.nl
Wed Apr 28 11:53:37 MDT 2010


On 28 apr 2010, at 19:23, Garrick Staples wrote:

> On Wed, Apr 28, 2010 at 07:15:42PM +0200, Bas van der Vlies alleged:
>> 
>> On 28 apr 2010, at 19:01, Bas van der Vlies wrote:
>> 
>> I am not at work, but here is an strace of qstat that hangs for a few seconds in the poll call.
>> {{{
>> socket(PF_FILE, SOCK_STREAM, 0)         = 3
>> connect(3, {sa_family=AF_FILE, path="/tmp/.torque-unix"...}, 19) = 0
>> getuid32()                              = 31000
>> getgid32()                              = 31010
>> getpid()                                = 22919
>> sendmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"m"..., 1}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=22919, uid=31000, gid=31010}}, msg_flags=0}, 0) = 1
>> mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40275000
>> mmap2(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x402b6000
>> write(3, "+2+12+21+3bas+0+0+0"..., 19)  = 19
>> poll([{fd=3, events=POLLIN|POLLHUP}], 1, 10800000
>> }}}
>> 
>> Regards
> 
> qstat sent a request and is waiting for the response. Perfectly normal.
> 
That is true, but before the pbs_connect call on the server, the request is handled in a split second:
{{{
real	0m0.042s
user	0m0.008s
sys	0m0.004s
}}}

after the pbs_connect call:
{{{
real	0m56.078s
user	0m0.012s
sys	0m0.004s
}}}


> What is the server doing?


Everything is stopped, so there are no pbs_server logs. After the pbs_connect call times out:
{{{
4/28/2010 19:41:48;0080;PBS_Server;Req;dis_request_read;req header bad, dis error 7 (Premature end of message), type=Connect
04/28/2010 19:41:48;0080;PBS_Server;Req;req_reject;Reject reply code=15056(Bad DIS based Request Protocol MSG=cannot decode message), aux=0, type=Connect, from @
}}}

strace on the pbs_server after pbs_connect:
{{{
select(4096, [6 7 8 9 10 12 13 14 15 16 17 18 19 20 21 190], NULL, NULL, {1, 0}) = 9 (in [12 13 15 16 17 18 19 20 21], left {0, 999994})
time(NULL)                              = 1272477000
time(NULL)                              = 1272477000
time(NULL)                              = 1272477000
setsockopt(12, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
recvmsg(12, {msg_name(0)=NULL, msg_iov(1)=[{"m"..., 1}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=24617, uid=0, gid=0}}, msg_flags=0}, 0) = 1
open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 22
_llseek(22, 0, [0], SEEK_CUR)           = 0
fstat64(22, {st_mode=S_IFREG|0644, st_size=1552, ...}) = 0
mmap2(NULL, 1552, PROT_READ, MAP_SHARED, 22, 0) = 0x40020000
_llseek(22, 1552, [1552], SEEK_SET)     = 0
munmap(0x40020000, 1552)                = 0
close(22)                               = 0
poll([{fd=12, events=POLLIN|POLLHUP}], 1, 60000
}}}

It waits for nearly a minute in the poll function.
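
To illustrate what I think the trace shows, here is a minimal sketch in Python. It is not TORQUE code. It assumes the server has read the single credential byte ("m") and is then waiting in poll for the rest of a request that never arrives. The function name wait_for_rest and the short timeout are made up for the example; the real server uses 60000 ms.

```python
import select
import socket
import time


def wait_for_rest(timeout_ms):
    """Send one byte over a Unix socket pair, then poll for more data.

    Returns (first_byte, events, elapsed_seconds). The peer never sends
    anything else, so poll() should block until the timeout expires.
    """
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # The client sends only the one byte seen in the sendmsg/recvmsg lines.
        client.sendall(b"m")
        first = server.recv(1)

        # Same event mask as in the strace: POLLIN|POLLHUP on one descriptor.
        poller = select.poll()
        poller.register(server, select.POLLIN | select.POLLHUP)

        start = time.monotonic()
        events = poller.poll(timeout_ms)  # blocks, nothing more is coming
        elapsed = time.monotonic() - start
        return first, events, elapsed
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    first, events, elapsed = wait_for_rest(200)
    print(first, events, round(elapsed, 2))
```

With an empty event list coming back only after the timeout, the caller has a partial message and nothing else to read, which would fit the "Premature end of message" entry in the server log.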
--
Bas van der Vlies
basv at sara.nl




