[torqueusers] torque getting stuck

Adrian Sevcenco Adrian.Sevcenco at cern.ch
Mon May 11 07:35:45 MDT 2009


Hi! I have a problem with torque-server-2.3.0: 5 to 10 minutes after a
restart, the server gets stuck. For a qstat command I get:
[root@grid01 ~]# time -p qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----

.....
real 23.60
user 0.00
sys 0.00

and when maui tries to contact pbs_server I get:
[root@grid01 ~]# time -p diagnose -n
ERROR:    lost connection to server
ERROR:    cannot request service (status)
real 29.99
user 0.00
sys 0.00

I ran strace on pbs_server for a few hours and it showed this:

[root@grid01 ~]# cat pbs_server.trace
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 35.87    2.424344          73     33334         8 select
 15.92    1.075911           6    177583           time
 11.57    0.782018          29     27419        26 write
  9.49    0.641369          25     25969         2 poll
  3.83    0.258976           4     69645     61244 recvfrom
  3.45    0.233453          16     14926      9574 connect
  2.84    0.192112           8     23314           close
  2.75    0.185961           4     41387      9504 read
  2.11    0.142857           3     56436           fcntl64
  2.11    0.142558          15      9527           brk
  1.89    0.127642           9     14926           socket
  1.40    0.094394          17      5417           send
  1.20    0.081420           9      9485           shutdown
  0.91    0.061722           6      9514           getsockopt
  0.70    0.047012           5      9585        69 bind
  0.61    0.041050           3     11990           setsockopt
  0.56    0.037660           9      4184           sendto
  0.49    0.033069           6      5885           open
  0.46    0.031291           5      5700           munmap
  0.43    0.029329           5      5394           gettimeofday
  0.31    0.021006           8      2506           accept
  0.29    0.019565           3      5720           mmap2
  0.23    0.015729           4      4242        25 ioctl
  0.22    0.014798           6      2474           recvmsg
  0.20    0.013414           2      5700           fstat64
  0.08    0.005256         584         9           clone
  0.05    0.003105          22       140        22 unlink
  0.01    0.000587          12        48           link
  0.01    0.000341           5        75           stat64
  0.00    0.000073           4        18         8 waitpid
  0.00    0.000044           2        18           rt_sigprocmask
  0.00    0.000020           2        10         9 sigreturn
------ ----------- ----------- --------- --------- ----------------
100.00    6.758086                582580     80491 total
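
One way to read the summary above is to rank the syscalls by the fraction
of calls that returned an error. The Python sketch below does that for a
subset of the rows, copied verbatim from the table. The function names
(parse_row, error_ratios) are hypothetical helpers, not part of torque or
strace. Note that strace -c counts every error return, including benign
ones such as EAGAIN on non-blocking sockets, so a high ratio is only a
hint about where to look, not a diagnosis.

```python
# Rows copied from the strace -c summary above (a subset, for illustration).
SUMMARY = """\
 35.87    2.424344          73     33334         8 select
 15.92    1.075911           6    177583           time
 11.57    0.782018          29     27419        26 write
  3.83    0.258976           4     69645     61244 recvfrom
  3.45    0.233453          16     14926      9574 connect
  2.75    0.185961           4     41387      9504 read
  0.70    0.047012           5      9585        69 bind
"""


def parse_row(line):
    """Parse one summary row into (syscall, calls, errors).

    The errors column is blank when a syscall never failed, so a row
    has either 5 or 6 whitespace-separated fields.
    """
    fields = line.split()
    if len(fields) == 6:
        _, _, _, calls, errors, name = fields
    elif len(fields) == 5:
        _, _, _, calls, name = fields
        errors = "0"
    else:
        raise ValueError("unexpected row: %r" % line)
    return name, int(calls), int(errors)


def error_ratios(text):
    """Return [(syscall, calls, errors, ratio)] sorted by ratio, highest first."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, calls, errors = parse_row(line)
        rows.append((name, calls, errors, errors / calls))
    return sorted(rows, key=lambda r: r[3], reverse=True)


if __name__ == "__main__":
    for name, calls, errors, ratio in error_ratios(SUMMARY):
        print("%-10s calls=%-7d errors=%-6d failed=%5.1f%%"
              % (name, calls, errors, 100 * ratio))
```

On these rows, recvfrom and connect come out on top, so the failing
connect calls (9574 of 14926) are a reasonable first thing to trace in
detail, for example by checking which hosts and ports they target.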

Can a torque expert spot any problems here?
Thank you,
Adrian