[torqueusers] Processes Running but state "queued" on pbs_server

Simon Robbins robbins at physik.uni-wuppertal.de
Mon Aug 8 06:25:36 MDT 2005


Hello,

We're running a 512 node cluster using Maui 3.2.6p13 
and Torque 1.2.0p4.  Torque was configure without 
any special flags to ./configure.

I occasionally observe jobs where the pbs_server has 
allocated a node to a job but thinks that the job is 
still in the queued state.  When I look on the node 
the process is running and the pbs_mom is aware of 
this.  There is an inconsistency between the mom and 
server information.  

I attach my server configuration, command output and 
snipets of log files.  I've tried `momctl -C` and 
similar with no effect.

I guess this effect occurs due to a short break in 
the network beteen the mom and the server.  Maybe I 
should adjust some of the time-outs on the pbs_mom?

Cheers,

Dr. Simon Robbins                  robbins at physik.uni-wuppertal.de
==================================================================
BU Wuppertal, FB-C,                                 (Room. F11.08)
Gaussstrasse, 20,                      Phone : +49 (0)202 439 3750
D-42119 Wuppertal                      Fax   : +49 (0)202 439 2662


**************************************************
qmgr -c 'p s'
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server acl_hosts = ...
set server managers = ...
set server operators = ...
set server default_queue = default
set server log_events = 255
set server mail_from = ...
set server query_other_jobs = True
set server scheduler_iteration = 99999
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server node_pack = True
set server job_stat_rate = 90
set server poll_jobs = True

**************************************************
maui.cfg:
...
NODEACCESSPOLICY      SHARED

RMPOLLINTERVAL        00:01:30

SERVERPORT            42559
SERVERMODE            NORMAL

LOGFILE               maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              3

CLIENTTIMEOUT         00:01:00

# Node configuration

JOBNODEMATCHPOLICY EXACTNODE
NODEALLOCATIONPOLICY        FIRSTAVAILABLE

# Priority Weights

FSWEIGHT              1
SERVICEWEIGHT         1 # For Xfactor (queue time based) boost
CREDWEIGHT            1 # For idle queue weights
RESOURCEWEIGHT        1 # For parallel jobs

QUEUETIMEWEIGHT       0
XFACTORWEIGHT         100
XFXAP                 10000

# Credential Priorities (used to supress the idle queue)
CLASSWEIGHT            1
CLASSCFG[DEFAULT] PRIORITY=0

# Resources priorities (for parallel jobs)
PROCWEIGHT        100

# Give idle a low priority:
CLASSCFG[idle]    MAXJOB=400 MAXJOBPERUSER=300 PRIORITY=-1000000
CLASSCFG[medium]  MAXJOB=450 MAXJOBPERUSER=400
CLASSCFG[large]   MAXJOB=450 MAXJOBPERUSER=300
CLASSCFG[huge]    MAXJOB=50  MAXJOBPERUSER=40 PRIORITY=-500000

# Standing reservation (small jobs during office hours) - deprecated: to be replaced by idle reservation
SRCFG[quick]    PERIOD=DAY DAYS=MON,TUE,WED,THU,FRI
SRCFG[quick]    STARTTIME=10:00:00 ENDTIME=17:00:00
SRCFG[quick]    MAXTIME=1:01:00
SRCFG[quick]    TASKCOUNT=2 RESOURCES=PROCS:2   # reserve 2 complete (2 processor) nodes
SRCFG[quick]    DEPTH=2

# FairShare

FSPOLICY              DEDICATEDPS
FSDEPTH               14
FSINTERVAL            24:00:00
FSDECAY               0.90
FSGROUPWEIGHT         100
FSUSERWEIGHT          0
FSACCOUNTWEIGHT       0
FSQOSWEIGHT           0
FSCLASSWEIGHT         0

# JOB Deferal settings

DEFERTIME 00:05:00
DEFERCOUNT 1000

# Policies

REJECTNEGPRIOJOBS              FALSE
ENABLENEGJOBPRIORITY           TRUE
ENABLEMULTINODEJOBS            TRUE

BACKFILLPOLICY              FIRSTFIT

RESERVATIONPOLICY           CURRENTHIGHEST # re-evaluate reservation priorities constantly
reservation priorities
RESERVATIONDEPTH            20 # how many advanced system reservations to make

MAXJOBPERUSERPOLICY         ON
MAXJOBPERUSERCOUNT          99999

MAXPROCPERUSERPOLICY        OFF
MAXPROCPERUSERCOUNT         256

MAXPROCSECONDPERUSERPOLICY  OFF
MAXPROCSECONDPERUSERCOUNT   36864000

MAXJOBQUEUEDPERUSERPOLICY   OFF
MAXJOBQUEUEDPERUSERCOUNT    2

MAXPROCPERGROUPPOLICY       OFF
SMAXPROCPERGROUPCOUNT       128
MAXPROCPERGROUPCOUNT        160

JOBMAXOVERRUN               01:00:00  # Allow jobs to overrun wall time by 1 hour

...

**************************************************
node-438:~ # /opt/pbs/sbin/momctl -d 3

Host: node-438/node-438.alicenext   Server: 
alicenext   Version: torque_1.2.0p4
HomeDirectory:          
/opt/torque_1.2.0p4/opt/mom_priv
MOM active:             367280 seconds
Last Msg From Server:   11 seconds (StatusJob)
Last Msg To Server:     10 seconds
Server Update Interval: 20 seconds
Init Msgs Received:     0 hellos/5 cluster-addrs
Init Msgs Sent:         1 hellos
LOGLEVEL:               0 (use SIGUSR1/SIGUSR2 to 
adjust)
Communication Model:    RPP
TCP Timeout:            20 seconds
Prolog Alarm Time:      300 seconds
Alarm Time:             0 of 10 seconds
Trusted Client List:    
10.50.129.182,10.50.129.181,10.50.129.180,10.50.129.179,10.50.129.178,10.50.129.177,10.50.129.176,10.50.129.175,10.50.129.174,10.50.129.173,10.50.129.172,10.50.129.171,10.50.129.170,10.50.129.169,10.50.129.168,10.50.129.167,10.50.129.166,10.50.129.165,10.50.129.164,10.50.129.163,10.50.129.162,10.50.129.161,10.50.129.160,10.50.129.159,10.50.129.158,10.50.129.157,10.50.129.156,10.50.129.155,10.50.129.154,10.50.129.153,10.50.129.152,10.50.129.151,10.50.129.150,10.50.129.149,10.50.129.148,10.50.129.147,10.50.129.146,10.50.129.145,10.50.129.144,10.50.129.143,10.50.129.142,10.50.129.141,10.50.129.140,10.50.129.139,10.50.129.138,10.50.129.137,10.50.129.136,10.50.129.135,10.50.129.134,10.50.129.133,10.50.129.132,10.50.129.131,10.50.129.130,10.50.129.129,10.50.129.128,10.50.129.127,10.50.129.126,10.50.129.125,10.50.129.124,10.50.129.123,10.50.129.122,10.50.129.121,10.50.129.120,10.50.129.119,10.50.129.118,10.50.129.117,10.50.129.116,10.50.129.115,10.50.129.114,10.50.129.113,10.50.129.112,10.50.129.111,10.50.129.110
Job[61555.alicenext.alicenext]  State=RUNNING
Job[61556.alicenext.alicenext]  State=RUNNING
Job[61557.alicenext.alicenext]  State=RUNNING
Job[61558.alicenext.alicenext]  State=RUNNING

diagnostics complete

**************************************************
08/01/2005 11:24:24;0002;   pbs_mom;Svr;Log;Log opened
08/01/2005 11:24:24;0001;   pbs_mom;Job;61555.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 11:24:37;0001;   pbs_mom;Job;61556.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 11:24:52;0001;   pbs_mom;Job;61557.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 11:25:28;0001;   pbs_mom;Job;61558.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 11:25:28;0002;   pbs_mom;Req;dis_reply_write;DIS reply failure, -1
08/01/2005 11:25:40;0002;   pbs_mom;Req;dis_reply_write;DIS reply failure, -1
08/01/2005 11:27:33;0001;   pbs_mom;Job;TMomFinalizeJob3;job 61558.alicenext.alicenext started, pid = 5699
08/01/2005 11:27:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job 61555.alicenext.alicenext started, pid = 5688
08/01/2005 11:27:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job 61556.alicenext.alicenext started, pid = 5693
08/01/2005 11:27:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job 61557.alicenext.alicenext started, pid = 5696
08/01/2005 16:02:49;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 16:14:16;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 16:19:24;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 16:24:15;0001;   pbs_mom;Job;84646.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 16:25:04;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84646.alicenext.alicenext started, pid = 6425
08/01/2005 16:29:51;0008;   pbs_mom;Job;84646.alicenext.alicenext;kill_task: killing pid 6425 task 1 with sig 15
08/01/2005 16:29:51;0008;   pbs_mom;Job;84646.alicenext.alicenext;kill_task: killing pid 6450 task 1 with sig 15
08/01/2005 16:29:51;0008;   pbs_mom;Job;84646.alicenext.alicenext;kill_task: killing pid 6459 task 1 with sig 15
08/01/2005 16:29:52;0008;   pbs_mom;Job;84646.alicenext.alicenext;kill_task: killing pid 6459 task 1 with sig 9
08/01/2005 16:29:52;0080;   pbs_mom;Job;84646.alicenext.alicenext;scan_for_terminated: job 84646.alicenext.alicenext task 1 terminated, sid 6425
08/01/2005 16:29:52;0008;   pbs_mom;Job;84646.alicenext.alicenext;Terminated
08/01/2005 16:33:26;0002;   pbs_mom;Req;dis_reply_write;DIS reply failure, -1
08/01/2005 16:35:52;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 16:36:00;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84649.alicenext.alicenext started, pid = 6487
08/01/2005 16:36:36;0080;   pbs_mom;Job;84649.alicenext.alicenext;scan_for_terminated: job 84649.alicenext.alicenext task 1 terminated, sid 6487
08/01/2005 16:36:36;0008;   pbs_mom;Job;84649.alicenext.alicenext;Terminated
08/01/2005 16:37:41;0001;   pbs_mom;Job;84650.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 16:37:41;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84650.alicenext.alicenext started, pid = 6533
08/01/2005 16:38:08;0080;   pbs_mom;Job;84650.alicenext.alicenext;scan_for_terminated: job 84650.alicenext.alicenext task 1 terminated, sid 6533
08/01/2005 16:38:08;0008;   pbs_mom;Job;84650.alicenext.alicenext;Terminated
08/01/2005 16:39:43;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84648.alicenext.alicenext started, pid = 6579
08/01/2005 16:40:16;0080;   pbs_mom;Job;84648.alicenext.alicenext;scan_for_terminated: job 84648.alicenext.alicenext task 1 terminated, sid 6579
08/01/2005 16:40:16;0008;   pbs_mom;Job;84648.alicenext.alicenext;Terminated
08/01/2005 16:41:51;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 16:48:05;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 17:06:05;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 17:11:18;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 17:17:33;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 17:28:52;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 17:45:27;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 17:56:32;0080;   pbs_mom;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=MoveJobFile, from PBS_Server at alicenext.alicenext
08/01/2005 18:00:30;0002;   pbs_mom;n/a;rm_request;internal diagnostics complete
08/01/2005 18:15:44;0001;   pbs_mom;Job;84651.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 18:15:54;0002;   pbs_mom;n/a;rm_request;internal diagnostics complete
08/01/2005 18:16:20;0002;   pbs_mom;n/a;rm_request;internal diagnostics complete
08/01/2005 18:16:34;0002;   pbs_mom;n/a;rm_request;reporting cycle forced
08/01/2005 18:16:38;0002;   pbs_mom;n/a;rm_request;internal diagnostics complete
08/01/2005 18:16:41;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84651.alicenext.alicenext started, pid = 6913
08/01/2005 18:17:08;0080;   pbs_mom;Job;84651.alicenext.alicenext;scan_for_terminated: job 84651.alicenext.alicenext task 1 terminated, sid 6913
08/01/2005 18:17:08;0008;   pbs_mom;Job;84651.alicenext.alicenext;Terminated
08/01/2005 18:17:42;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84652.alicenext.alicenext started, pid = 6966
08/01/2005 18:18:07;0080;   pbs_mom;Job;84652.alicenext.alicenext;scan_for_terminated: job 84652.alicenext.alicenext task 1 terminated, sid 6966
08/01/2005 18:18:07;0008;   pbs_mom;Job;84652.alicenext.alicenext;Terminated
08/01/2005 18:18:16;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84654.alicenext.alicenext started, pid = 7012
08/01/2005 18:18:45;0080;   pbs_mom;Job;84654.alicenext.alicenext;scan_for_terminated: job 84654.alicenext.alicenext task 1 terminated, sid 7012
08/01/2005 18:18:45;0008;   pbs_mom;Job;84654.alicenext.alicenext;Terminated
08/01/2005 18:18:49;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84656.alicenext.alicenext started, pid = 7058
08/01/2005 18:19:16;0080;   pbs_mom;Job;84656.alicenext.alicenext;scan_for_terminated: job 84656.alicenext.alicenext task 1 terminated, sid 7058
08/01/2005 18:19:16;0008;   pbs_mom;Job;84656.alicenext.alicenext;Terminated
08/01/2005 18:25:56;0001;   pbs_mom;Job;84657.alicenext.alicenext;job not ready after 5 second timeout, MOM will recheck
08/01/2005 18:26:11;0001;   pbs_mom;Job;TMomFinalizeJob3;job 84657.alicenext.alicenext started, pid = 7117

**************************************************
ALiCEnext:/opt/pbs/opt # diagnose -n n438
diagnosing node table (5120 slots)
Name                    State  Procs     Memory         Disk          Swap      Speed  Opsys   Arch Par   
Load Res Classes                        Network                     Features

n438                  Running   1:2     1999:1999        
1:1        3461:3461    1.00 DEFAUL  linux DEF   4.01 002 [idle_1:2][medium_2:2][large_2 [DEFAULT]                     [t10s2][mesh28]
WARNING:  node 'n438' has excessive load (state: 'Running'  load:  4.010)
-----                     ---   1:2     1999:1999        1:1        3461:3461

Total Nodes: 1  (Active: 1  Idle: 0  Down: 0)

**************************************************
ALiCEnext:/opt/pbs/opt # checkjob 61556


checking job 61556

State: Idle
Creds:  user:wiebusch  group:taup  class:idle  
qos:DEFAULT
WallTime: 00:00:00 of 5:00:00:00
SubmitTime: Mon Jul 25 11:24:31
  (Time Queued  Total: 7:07:07:22  Eligible: 7:02:25)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 9093
PartitionMask: [ALL]
Flags:       RESTARTABLE

Messages:  cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=send failed, STARTING'
PE:  1.00  StartPriority:  -1000583
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs:  13  feasible procs:   0

Rejection Reasons: [CPU          :    3][State        :  250][ReserveTime  :    3]

**************************************************
From server_log/20050801:
...
08/01/2005 11:24:24;0008;PBS_Server;Job;61556.alicenext.alicenext;Job Modified at request of root at ALiCEnext.alicenext
08/01/2005 11:24:24;0008;PBS_Server;Job;61556.alicenext.alicenext;Job Run at request of root at ALiCEnext.alicenext
08/01/2005 11:24:37;0001;PBS_Server;Svr;PBS_Server;Operation now in progress (115) in send_job, send_job commit failed, rc=15031 (Premature end of message)
08/01/2005 11:24:37;0008;PBS_Server;Job;61556.alicenext.alicenext;unable to run job, MOM rejected/rc=1
08/01/2005 11:24:37;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=send failed, STARTING), aux=0, type=RunJob, from root at ALiCEnext.alicenext
...

**************************************************
ALiCEnext:/opt/pbs/opt # tracejob -n 14 61556

Job: 61556.alicenext.alicenext

07/25/2005 11:24:31  S    Job Queued at request of wiebusch at sam1.alicenext, owner = wiebusch at sam1.alicenext, job name = ems.z-270.a55, queue = idle
07/25/2005 11:24:31  A    queue=idle
07/29/2005 12:33:55  S    Requeueing job, substate: 10 Requeued in queue: idle
08/01/2005 11:24:24  S    Job Modified at request of root at ALiCEnext.alicenext
08/01/2005 11:24:24  S    Job Run at request of root at ALiCEnext.alicenext
08/01/2005 11:24:37  S    unable to run job, MOM rejected/rc=1


More information about the torqueusers mailing list