[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

David Beer dbeer@adaptivecomputing.com
Thu Dec 13 17:40:19 MST 2012


Actually Lech, do you think you could just email in the logs for the mother
superior and the server for one of these jobs for the day that it gets
deleted? If they are too big to email, I'm sending you guest credentials for
the scp server separately.

David

On Thu, Dec 13, 2012 at 6:17 AM, Lech Nieroda <nieroda.lech@uni-koeln.de> wrote:

> Dear David,
>
> thanks - let's hope that this error will be fixed.
>
> As to the jobs that are getting killed, there doesn't seem to be any
> discernible pattern, e.g. take a look at the last 5 killed jobs:
>
> 724272: killed after 43min runtime, single node program with 8 threads
> (according to cputime, ca. 4% serial execution)
> 724650[13]: killed after 148min runtime, single node jobarray instance
> with a single core serial program (40 instances total)
> 724843[32]: killed after 12min runtime, single node jobarray instance
> with a single core serial program (42 instances total)
> 723747: killed after 909min runtime, single node program with 12 threads
> (according to cputime, ca. 14% serial execution)
> 724044: killed after 667min runtime, MPI program with 6 nodes, 12 cores
> each (according to cputime, ca. 83% serial execution)
>
> There are single-node jobs with good/average/minimal thread
> parallelization, classic serial jobs, various MPI jobs, and instances of
> job arrays. They can share a node or use it exclusively. They can use
> different parallel filesystems, like Panasas or Lustre. They are
> submitted by different users. They are killed after varying amounts of
> elapsed runtime - it can happen after a few minutes, hours or days.
> There doesn't seem to be a common denominator for these aborted jobs.
>
> I've taken a look at the mom_logs at loglevel 10, as well as the
> server_logs at loglevel 7.
>
> On the server side, when looking at the thread which performs the
> state change to "QUEUED-SUBSTATE55", the previous message mentions
> "locking complete" on the node on which the job runs, e.g. (the job is
> 699693[33]):
>
> [snip]
> 11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking
> complete cheops21314 in method next_node
> 11/30/2012
> 19:03:24;0008;PBS_Server.28147;Job;svr_setjobstate;svr_setjobstate:
> setting job 699693[33].cheops10 state from RUNNING-RUNNING to
> QUEUED-SUBSTATE55 (1-55)
> [snap]
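>
> (For reference, a small throwaway filter along these lines can pull out a
> single server thread's messages around the event; the thread id, log path
> and time window below are just this example's values:)
>
> [snip]
> import re
> from datetime import datetime, timedelta
>
> THREAD = "PBS_Server.28147"   # the thread that set SUBSTATE55 above
> EVENT = datetime.strptime("11/30/2012 19:03:24", "%m/%d/%Y %H:%M:%S")
> WINDOW = timedelta(seconds=5)
>
> # matches the "MM/DD/YYYY HH:MM:SS" stamp, with or without a grep -n prefix
> STAMP = re.compile(r"(\d{2}/\d{2}/\d{4}) (\d{2}:\d{2}:\d{2})")
>
> with open("server_logs/20121130") as log:   # assumed log path
>     for line in log:
>         m = STAMP.search(line)
>         if THREAD in line and m:
>             t = datetime.strptime(" ".join(m.groups()), "%m/%d/%Y %H:%M:%S")
>             if abs(t - EVENT) <= WINDOW:
>                 print(line, end="")
> [snap]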
>
> When grepping the server_logs for the executing host
> (cheops21314/172.18.3.132), there is a "svr_is_request" message which
> comes from the executing host before the state change, followed by a few
> "locking complete" messages:
> [snip]
> 52243:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method svr_is_request
> 52244:11/30/2012 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;message STATUS (4) received from mom on host cheops21314 (172.18.3.132:273) (sock 13)
> 52245:11/30/2012 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;IS_STATUS received from cheops21314
> 52246:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;is_stat_get;received status from node cheops21314
> 52247:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
> 52248:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
> 52276:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;update_node_state;adjusting state for node cheops21314 - state=0, newstate=0
> 52277:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
> 52310:11/30/2012 19:03:24;0080;PBS_Server.28137;node;lock_node;locking complete cheops21314 in method find_nodebyname
> 52635:11/30/2012 19:03:24;0080;PBS_Server.28096;node;lock_node;locking complete cheops21314 in method next_node
> 54181:11/30/2012 19:03:24;0080;PBS_Server.28113;node;lock_node;locking complete cheops21314 in method next_node
> 54596:11/30/2012 19:03:24;0080;PBS_Server.28057;node;lock_node;locking complete cheops21314 in method next_node
> 54967:11/30/2012 19:03:24;0080;PBS_Server.28031;node;lock_node;locking complete cheops21314 in method next_node
> 55401:11/30/2012 19:03:24;0080;PBS_Server.28007;node;lock_node;locking complete cheops21314 in method next_node
> 56029:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method next_node
> 56058:11/30/2012 19:03:24;0080;PBS_Server.28035;node;lock_node;locking complete cheops21314 in method next_node
> 56570:11/30/2012 19:03:24;0080;PBS_Server.28141;node;lock_node;locking complete cheops21314 in method next_node
> 56760:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method find_nodebyname
> 56761:11/30/2012 19:03:24;0040;PBS_Server.28147;Req;remove_job_from_node;freeing node cheops21314/3 from job 699693[33].cheops10 (nsnfree=1)
> [snap]
>
> There is no obvious reference to a job polling request - what should I
> be looking for?
>
> The client node that ran this particular job has been redeployed and its
> logs are no longer available. However, here is loglevel 10 output for
> job 724272, grepped by jobid:
>
> [snip]
> 12/13/2012 00:00:27;0008;   pbs_mom.27123;Req;send_sisters;sending command POLL_JOB for job 724272.cheops10 (7)
> 12/13/2012 00:00:59;0002;   pbs_mom.27123;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=717556.cheops10 724272.cheops10 724285.cheops10"
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;724272.cheops10;signaling job with signal SIGKILL
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 724272.cheops10, reason: killing job
> [snap]
>
> There is a POLL_JOB request from "send_sisters" (strange, since it is a
> single-node job) and the pbs_mom sends a mom_server_update_stat with a
> job list that includes job 724272. Right afterwards comes the SignalJob
> to kill; here is the entire log snippet between the mom_server_update_stat
> and the kill:
>
> [snip]
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;read_tcp_reply;protocol: 4  version: 3  command:4  sock:8
> 12/13/2012 00:00:59;0002;   pbs_mom.27123;n/a;mom_server_update_stat;status update successfully sent to cheops10
> 12/13/2012 00:00:59;0080;   pbs_mom.27123;Req;dis_request_read;decoding command SignalJob from PBS_Server
> 12/13/2012 00:00:59;0100;   pbs_mom.27123;Req;;Type SignalJob request received from PBS_Server@cheops10, sock=8
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;mom_process_request;request type SignalJob from host cheops10 received
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;mom_process_request;request type SignalJob from host cheops10 allowed
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;mom_dispatch_request;dispatching request SignalJob on sd=8
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;724272.cheops10;signaling job with signal SIGKILL
> 12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 724272.cheops10, reason: killing job
> [snap]
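>
> (To check this across a whole day's log without eyeballing it, a rough
> sketch like the following could confirm whether a job was still present in
> the mom's last "jobs=" status update before each kill; the log path and
> jobid are just example values:)
>
> [snip]
> jobid = "724272.cheops10"            # example job from the snippet above
> last_update = None
>
> with open("mom_logs/20121213") as log:   # assumed mom log path
>     for line in log:
>         if 'mom_server_update_stat: sending to server "jobs=' in line:
>             last_update = line
>         elif "req_signaljob" in line and jobid in line:
>             if last_update and jobid in last_update:
>                 print("job was still in the last status update before the kill")
>             else:
>                 print("job was missing from the last status update")
> [snap]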
>
> What other info do you need? Are there specific messages to look for?
>
> Regards,
> Lech
>
>
>
> On 12.12.2012 18:16, David Beer wrote:
> > Lech,
> >
> > From your log snippet, it appears that the server sent a poll request
> > to the mom (for the job) and the mom replied that it didn't know
> > about that job, causing the job to be cancelled on the server's side. I'm
> > going to do a little more investigation to see if there's some way that
> > the poll request to the mom can time out or otherwise fail and
> > erroneously create this condition. If you can provide any more details
> > about reproducing this scenario, that'd be great. Helpful information
> > would include the lengths of the jobs that are running, whether they are
> > parallel or serial (or an estimate of the ratio of parallel to serial) -
> > things like that might help us figure it out.
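> >
> > To make the suspected failure mode concrete, here is a toy model (plain
> > Python, not TORQUE code) of how a timed-out poll reply could end up
> > looking like the mom not knowing the job:
> >
> > [snip]
> > import queue, threading, time
> >
> > def poll_job(mom_delay, poll_timeout):
> >     # toy model: the server polls, the mom replies after a variable delay
> >     replies = queue.Queue()
> >     def mom():
> >         time.sleep(mom_delay)              # mom busy gathering job status
> >         replies.put("724272 is running")
> >     threading.Thread(target=mom).start()
> >     try:
> >         print("poll ok:", replies.get(timeout=poll_timeout))
> >     except queue.Empty:
> >         # if a timed-out poll were treated like "mom does not know the
> >         # job", the server would purge the job and send SIGKILL
> >         print("no reply in time -> job presumed unknown -> killed")
> >
> > poll_job(mom_delay=0.2, poll_timeout=0.1)  # slow reply: job wrongly killed
> > poll_job(mom_delay=0.2, poll_timeout=1.0)  # generous timeout: poll succeeds
> > [snap]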
> >
> > David
> >
> > On Wed, Dec 12, 2012 at 9:57 AM, Lech Nieroda
> > <nieroda.lech@uni-koeln.de> wrote:
> >
> >     Dear list,
> >
> >     the problem continues to persist: ca. 1 in every 1000-2000 jobs is
> >     killed. We've noticed that it almost disappears when the pbs_server
> >     loglevel is raised to 7 (ca. one occurrence of the error per week).
> >     Considering that higher loglevels simply generate more output and
> >     may thus prolong the execution of some tasks by a minuscule amount
> >     of time, this may turn out to be a race condition.
> >     We can't permanently set the loglevel to 7, though, because of two
> >     problems: the daily logfile size is ca. 30GB and pbs_server crashes
> >     every 2-3 days.
> >
> >     The error: initially the server sends an abort for no apparent
> >     reason; however, it mails the user "Job does not exist on node".
> >     Even though the pbs_mom sends the correct job list to the
> >     pbs_server, it receives a SIGKILL for the job (here with jobid
> >     721711, no dependencies, etc.) without an error message of its own:
> >
> >     [pbs_server snip]
> >     12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting connect to host 172.18.1.220 port 15002
> >     12/12/2012 02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate: setting job 721711.cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
> >     [...]
> >     12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing to send 'a' mail for job 721711.cheops10 to fernandl@cheops0 (Job does not exist on node)
> >     [pbs_server snap]
> >
> >     [pbs_mom snip]
> >     12/12/2012 02:49:50;0002;   pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10 722458.cheops10"
> >     12/12/2012 02:49:50;0002;   pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "varattr= "
> >     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;read_tcp_reply;protocol: 4  version: 3  command:4  sock:8
> >     12/12/2012 02:49:50;0002;   pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent to cheops10
> >     12/12/2012 02:49:50;0080;   pbs_mom.31686;Req;dis_request_read;decoding command SignalJob from PBS_Server
> >     12/12/2012 02:49:50;0100;   pbs_mom.31686;Req;;Type SignalJob request received from PBS_Server@cheops10, sock=8
> >     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;mom_process_request;request type SignalJob from host cheops10 received
> >     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;mom_process_request;request type SignalJob from host cheops10 allowed
> >     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob on sd=8
> >     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;721711.cheops10;signaling job with signal SIGKILL
> >     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 721711.cheops10, reason: killing job
> >     [...]
> >     12/12/2012 02:50:35;0080;   pbs_mom.31686;Svr;preobit_reply;top of preobit_reply
> >     12/12/2012 02:50:35;0080;   pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
> >     12/12/2012 02:50:35;0001;   pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server, deleting locally
> >     [pbs_mom snap]
> >
> >     Ideas? Fixes?
> >
> >     Regards,
> >     Lech Nieroda
> >
> >
> >
> >     In case it helps to analyze the bug:
> >
> >     On 26.11.2012 14:02, Lech Nieroda wrote:
> >      > Dear list,
> >      >
> >      > I've raised the loglevel on one of the clients to loglevel=7 in
> >      > order to collect more information on the event.
> >      >
> >      > maui just assumes a successful completion of the job.
> >      >
> >      > pbs_server log regarding the job 686938[4] that suddenly becomes
> >      > unknown at 00:15:21:
> >      > [snip]
> >      > /var/spool/torque/server_logs/20121125:11/25/2012 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at request of maui@localhost.localdomain
> >      > /var/spool/torque/server_logs/20121125:11/25/2012 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: setting job 686938[4].cheops10 state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
> >      > /var/spool/torque/server_logs/20121125:11/25/2012 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported success for job after 1 seconds (dest=???), rc=0
> >      > /var/spool/torque/server_logs/20121125:11/25/2012 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate: setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to RUNNING-RUNNING (4-42)
> >      > /var/spool/torque/server_logs/20121125:11/25/2012 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending email: User does not want mail of this type.
> >      > /var/spool/torque/server_logs/20121125:11/25/2012 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: setting job 686938[4].cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
> >      > /var/spool/torque/server_logs/20121126:11/26/2012 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate: setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to EXITING-SUBSTATE55 (5-55)
> >      > /var/spool/torque/server_logs/20121126:11/26/2012 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from smp, state EXITING
> >      > /var/spool/torque/server_logs/20121126:11/26/2012 00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray job 686938[4].cheops10 found on cheops11801
> >      > [snap]
> >      >
> >      > pbs_mom log at loglevel 7, shortly before the kill:
> >      > [snip]
> >      > 11/26/2012 00:14:26;0008;   pbs_mom.26562;Req;send_sisters;sending command POLL_JOB for job 686938[4].cheops10 (7)
> >      > 11/26/2012 00:15:02;0002;   pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=686659[1].cheops10 686694[8].cheops10 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
> >      > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:15:11;0008;   pbs_mom.26562;Req;send_sisters;sending command POLL_JOB for job 686938[4].cheops10 (7)
> >      > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:15:56;0008;   pbs_mom.26562;Req;send_sisters;sending command POLL_JOB for job 686938[4].cheops10 (7)
> >      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:00;0008;   pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in non-exiting substate RUNNING, no obit sent at this time
> >      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array loop start - jobid = 686938[4].cheops10
> >      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0 (child pid=4214)
> >      > 11/26/2012 00:16:02;0002;   pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=686659[1].cheops10 686694[8].cheops10 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10 687103[1].cheops10 687224.cheops10 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
> >      > 11/26/2012 00:16:02;0008;   pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal SIGKILL
> >      > 11/26/2012 00:16:02;0008;   pbs_mom.26562;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 686938[4].cheops10, reason: killing job
> >      > [snap]
> >      >
> >      > As we can see, the pbs_mom sends complete job lists to the
> >      > pbs_server right before (00:15:02) and after (00:16:02) the
> >      > SUBSTATE55 setting (00:15:21), and job 686938[4].cheops10 is
> >      > included in both. However, the server claims that the job is
> >      > unknown/doesn't exist and sends a kill command, which the
> >      > pbs_mom then executes (00:16:02).
> >      >
> >      > I hope this helps in tracking down the bug.
> >      >
> >      > Regards,
> >      > Lech Nieroda
> >      >
> >      >
> >      > On 22.11.2012 16:46, Lech Nieroda wrote:
> >      >> Dear list,
> >      >>
> >      >> we have another serious problem since our upgrade to Torque
> >      >> 4.1.3. We are using it with Maui 3.3.1. The problem in a
> >      >> nutshell: a few random jobs suddenly become "unknown" to the
> >      >> server; it changes their status to EXITING-SUBSTATE55 and
> >      >> requests a silent kill on the compute nodes. The job then dies,
> >      >> the processes are killed on the node, there is no "Exit_status"
> >      >> in the server log, no entry in maui/stats, and no stdout/stderr
> >      >> files. The users are, understandably, not amused.
> >      >>
> >      >> It doesn't seem to be user- or application-specific. Even a
> >      >> single instance from a job array can get blown away in this way
> >      >> while all other instances end normally.
> >      >>
> >      >> Here are some logs of such a job (681684[35]):
> >      >>
> >      >> maui just assumes a successful completion:
> >      >> [snip]
> >      >> 11/21 19:24:49 MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
> >      >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is 1.000000, 1.000000, 1
> >      >> 11/21 19:25:55 INFO:     active PBS job 681684[35] has been removed from the queue.  assuming successful completion
> >      >> 11/21 19:25:55 MJobProcessCompleted(681684[35])
> >      >> 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T: 10903  PS: 10903  A: 0.063096
> >      >> 11/21 19:25:55 MJobSendFB(681684[35])
> >      >> 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
> >      >> 11/21 19:25:55 MJobRemove(681684[35])
> >      >> 11/21 19:25:55 MJobDestroy(681684[35])
> >      >> [snap]
> >      >>
> >      >> pbs_server decides at 19:25:11, after 3 hours of runtime, that
> >      >> the job is unknown (grepped by jobid from the server logs):
> >      >> [snip]
> >      >> 11/21/2012 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to RUNNING-RUNNING (4-42)
> >      >> 11/21/2012 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
> >      >> 11/21/2012 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate: setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to EXITING-SUBSTATE55 (5-55)
> >      >> 11/21/2012 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing from smp, state EXITING
> >      >> 11/21/2012 19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom, stray job 681684[35].cheops10 found on cheops21316
> >      >> [snap]
> >      >>
> >      >> the pbs_mom just kills the processes:
> >      >> [snip]
> >      >> 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job 681684[35].cheops10 started, pid = 17452
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17452 task 1 gracefully with sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17452/state=R) after sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17452/state=Z) after sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17692 task 1 gracefully with sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17692/state=R) after sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17703 task 1 gracefully with sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17703/state=R) after sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid 17731 task 1 gracefully with sig 15
> >      >> 11/21/2012 19:25:14;0008;   pbs_mom.32254;Job;681684[35].cheops10;kill_task: process (pid=17731/state=R) after sig 15
> >      >> 11/21/2012 19:25:15;0080;   pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job 681684[35].cheops10 task 1 terminated, sid=17452
> >      >> 11/21/2012 19:25:15;0008;   pbs_mom.32254;Job;681684[35].cheops10;job was terminated
> >      >> 11/21/2012 19:25:50;0001;   pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on server, deleting locally
> >      >> 11/21/2012 19:25:50;0080;   pbs_mom.32254;Job;681684[35].cheops10;removed job script
> >      >> [snap]
> >      >>
> >      >> Sometimes, the pbs_mom logs include this message before the
> >      >> killing starts:
> >      >> [snip]
> >      >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0, type=StatusJob, from PBS_Server@cheops10
> >      >> [snap]
> >      >>
> >      >> And finally, some job information passed to the epilogue:
> >      >> [snip]
> >      >> Nov 21 19:25:15 s_sys@cheops21316 epilogue.shared: 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
> >      >> Nov 21 19:25:15 s_sys@cheops21316 epilogue.shared: 681684[35].cheops10,hthiele0,cheops21316,Job Information: userid=hthiele0, resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00', resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34', queue=smp, account=ccg-ngs, exitcode=271
> >      >> [snap]
> >      >>
> >      >> (The exitcode of 271 follows the usual PBS convention of 256 +
> >      >> signal number, i.e. SIGTERM (15), which matches the kill_task
> >      >> messages above.)
> >      >>
> >      >> This happens rarely (about 1 in 3000 jobs). However, silent
> >      >> deletions of random jobs aren't exactly a trifling matter.
> >      >> I could try to disable the mom_job_sync option (e.g. via qmgr -c
> >      >> 'set server mom_job_sync = False'), which could perhaps prevent
> >      >> the killing of processes of unknown jobs, but it would also
> >      >> leave corrupt/pre-execution jobs alive.
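> >      >>
> >      >> In the meantime, a rough count of affected jobs can be taken from
> >      >> the kill_job_on_mom stray-job errors, which should appear even at
> >      >> normal loglevels since they are logged as errors; a sketch (log
> >      >> paths assumed):
> >      >>
> >      >> [snip]
> >      >> import collections, glob, re
> >      >>
> >      >> # counts "kill_job_on_mom ... stray job ... found on ..." errors
> >      >> stray = re.compile(r"kill_job_on_mom.*stray job (\S+) found on (\S+)")
> >      >> per_day = collections.Counter()
> >      >>
> >      >> for path in glob.glob("/var/spool/torque/server_logs/*"):
> >      >>     with open(path) as log:
> >      >>         for line in log:
> >      >>             if stray.search(line):
> >      >>                 per_day[path.rsplit("/", 1)[-1]] += 1
> >      >>
> >      >> for day, count in sorted(per_day.items()):
> >      >>     print(day, count)
> >      >> [snap]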
> >      >>
> >      >> Can this be fixed?
> >      >>
> >      >> On a side note, here are some further minor bugs I've noticed
> >      >> in the Torque 4.1.3 version:
> >      >> - the epilogue script is usually invoked twice and sometimes
> >      >> even several times
> >      >> - when explicit node lists are used, e.g.
> >      >> nodes=node1:ppn=2+node2:ppn=2, the number of "tasks" as seen by
> >      >> qstat is zero
> >      >> - there have been some API changes between Torque 2.x and Torque
> >      >> 4.x, so two maui calls had to be altered in order to build
> >      >> against Torque 4.x (get_svrport, openrm).
> >      >>
> >      >>
> >      >> Regards,
> >      >> Lech Nieroda
> >      >>
> >      >
> >      >
> >
> >
> >     --
> >     Dipl.-Wirt.-Inf. Lech Nieroda
> >     Regionales Rechenzentrum der Universität zu Köln (RRZK)
> >     Universität zu Köln
> >     Weyertal 121
> >     Room 309 (3rd floor)
> >     D-50931 Köln
> >     Germany
> >
> >     Tel.: +49 (221) 470-89606
> >     E-Mail: nieroda.lech@uni-koeln.de
> >
> >
> >
> >
> > --
> > David Beer | Senior Software Engineer
> > Adaptive Computing
> >
> >
> >
> >
>
>
> --
> Dipl.-Wirt.-Inf. Lech Nieroda
> Regionales Rechenzentrum der Universität zu Köln (RRZK)
> Universität zu Köln
> Weyertal 121
> Room 309 (3rd floor)
> D-50931 Köln
> Germany
>
> Tel.: +49 (221) 470-89606
> E-Mail: nieroda.lech@uni-koeln.de
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing