[torquedev] [Bug 96] handle failed stagein jobs properly

bugzilla-daemon at supercluster.org
Wed Nov 17 04:27:17 MST 2010


http://www.clusterresources.com/bugzilla/show_bug.cgi?id=96

--- Comment #4 from ramon.bastiaans at sara.nl 2010-11-17 04:27:17 MST ---
These jobs keep bouncing around in my batch system.

As you can see, this job has been stuck for 3 days now:

12:24 ce.my.fqdn.xxx:/var/tmp/wax 
root# grep 8318989 /var/spool/torque/server_logs/20101116
11/16/2010 09:34:33;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into medium, state 1 hop 1
11/16/2010 09:34:33;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job, substate: 11 Requeued in queue: medium
11/16/2010 11:08:54;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into medium, state 1 hop 1
11/16/2010 11:08:54;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job, substate: 11 Requeued in queue: medium
11/16/2010 16:56:02;0100;PBS_Server;Job;8318989.ce.my.fqdn.xxx;enqueuing into medium, state 1 hop 1
11/16/2010 16:56:02;0086;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Requeueing job, substate: 11 Requeued in queue: medium
11/16/2010 18:25:16;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 18:25:16;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 18:25:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 18:55:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 18:55:53;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 18:55:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 18:55:57;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 19:25:57;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 19:26:28;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 19:26:28;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 19:26:32;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 19:56:32;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 19:56:49;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 19:56:49;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 19:56:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 20:26:53;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 20:27:14;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 20:27:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 20:27:19;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 20:57:19;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 20:57:47;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 20:57:47;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 20:57:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 21:27:52;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 21:28:50;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 21:28:50;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 21:28:55;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 21:58:55;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 21:59:37;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 21:59:37;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 21:59:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 22:29:41;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 22:30:09;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 22:30:09;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 22:30:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 23:00:14;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 23:00:37;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 23:00:37;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 23:00:42;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)
11/16/2010 23:30:42;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from WAITING-STAGEFAIL to QUEUED-PRESTAGEIN (1-11)
11/16/2010 23:31:16;0008;PBS_Server;Job;8318989.ce.my.fqdn.xxx;Job Run at request of root at ce.my.fqdn.xxx
11/16/2010 23:31:16;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from QUEUED-PRESTAGEIN to RUNNING-STAGEGO (4-15)
11/16/2010 23:31:20;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 8318989.ce.my.fqdn.xxx state from RUNNING-STAGEGO to WAITING-STAGEFAIL (3-37)

12:24 ce.my.fqdn.xxx:/var/tmp/wax 
root# showq | grep Hold
8318989              lhb038       Hold     1  1:12:00:00  Sun Nov 14 20:12:11

12:25 ce.my.fqdn.xxx:/var/tmp/wax 
root# qstat -ns 8318989

ce.my.fqdn.xxx: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
8318989.ce.gina.     lhb038   medium   crm01_533024086     --      1   1    --  36:00 W   --
   v35-13/5
    --

12:25 ce.my.fqdn.xxx:/var/tmp/wax 
root#
