[torqueusers] Running Jobs suddenly "Unknown" and killed on Torque 4.1.3

Lech Nieroda nieroda.lech at uni-koeln.de
Thu Dec 13 06:17:51 MST 2012


Dear David,

thanks - let's hope that this error will be fixed.

As to the jobs that are getting killed, there doesn't seem to be any
discernible pattern; take, for example, the last five killed jobs:

724272: killed after 43 min of runtime; single-node program with 8
threads (according to cputime, ca. 4% serial execution)
724650[13]: killed after 148 min of runtime; single-node job array
instance running a single-core serial program (40 instances total)
724843[32]: killed after 12 min of runtime; single-node job array
instance running a single-core serial program (42 instances total)
723747: killed after 909 min of runtime; single-node program with 12
threads (according to cputime, ca. 14% serial execution)
724044: killed after 667 min of runtime; MPI program with 6 nodes, 12
cores each (according to cputime, ca. 83% serial execution)

There are single-node jobs with good/average/minimal thread
parallelization, classic serial jobs, various MPI jobs, and job array
instances. They can share a node or use it exclusively. They can use
different parallel file systems, such as Panasas or Lustre. They are
submitted by different users. They are killed after varying amounts of
elapsed runtime; it can happen after a few minutes, hours, or days.
There doesn't seem to be a common denominator among the aborted jobs.

I've taken a look at the mom_logs at log level 10, as well as the
server_logs at log level 7.
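
For reference, the log levels were raised roughly as follows (the paths
match our installation under /var/spool/torque; adjust as needed):

[snip]
# server side, takes effect immediately:
qmgr -c 'set server log_level = 7'
# mom side, appended to /var/spool/torque/mom_priv/config,
# followed by a pbs_mom restart:
$loglevel 10
[snap]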

On the server side, when looking at the thread that performs the state
change to "QUEUED-SUBSTATE55", the preceding message mentions "locking
complete" on the node on which the job runs, e.g. for job 699693[33]:

[snip]
11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method next_node
11/30/2012 19:03:24;0008;PBS_Server.28147;Job;svr_setjobstate;svr_setjobstate: setting job 699693[33].cheops10 state from RUNNING-RUNNING to QUEUED-SUBSTATE55 (1-55)
[snap]

When grepping the server_logs for the executing host
(cheops21314/172.18.3.132), there is an "svr_is_request" message from
that host before the state change, followed later by a few "locking
complete" entries:
[snip]
52243:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method svr_is_request
52244:11/30/2012 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;message STATUS (4) received from mom on host cheops21314 (172.18.3.132:273) (sock 13)
52245:11/30/2012 19:03:24;0004;PBS_Server.28116;Svr;svr_is_request;IS_STATUS received from cheops21314
52246:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;is_stat_get;received status from node cheops21314
52247:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
52248:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
52276:11/30/2012 19:03:24;0040;PBS_Server.28116;Req;update_node_state;adjusting state for node cheops21314 - state=0, newstate=0
52277:11/30/2012 19:03:24;0080;PBS_Server.28116;node;lock_node;locking complete cheops21314 in method find_nodebyname
52310:11/30/2012 19:03:24;0080;PBS_Server.28137;node;lock_node;locking complete cheops21314 in method find_nodebyname
52635:11/30/2012 19:03:24;0080;PBS_Server.28096;node;lock_node;locking complete cheops21314 in method next_node
54181:11/30/2012 19:03:24;0080;PBS_Server.28113;node;lock_node;locking complete cheops21314 in method next_node
54596:11/30/2012 19:03:24;0080;PBS_Server.28057;node;lock_node;locking complete cheops21314 in method next_node
54967:11/30/2012 19:03:24;0080;PBS_Server.28031;node;lock_node;locking complete cheops21314 in method next_node
55401:11/30/2012 19:03:24;0080;PBS_Server.28007;node;lock_node;locking complete cheops21314 in method next_node
56029:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method next_node
56058:11/30/2012 19:03:24;0080;PBS_Server.28035;node;lock_node;locking complete cheops21314 in method next_node
56570:11/30/2012 19:03:24;0080;PBS_Server.28141;node;lock_node;locking complete cheops21314 in method next_node
56760:11/30/2012 19:03:24;0080;PBS_Server.28147;node;lock_node;locking complete cheops21314 in method find_nodebyname
56761:11/30/2012 19:03:24;0040;PBS_Server.28147;Req;remove_job_from_node;freeing node cheops21314/3 from job 699693[33].cheops10 (nsnfree=1)
[snap]
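
For completeness, the excerpt above was produced with something along
these lines, hence the leading line numbers from grep -n:

[snip]
grep -n -e 'cheops21314' -e '172\.18\.3\.132' /var/spool/torque/server_logs/20121130
[snap]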

There is no obvious reference to a job polling request. What should I
be looking for?

The client node of this particular job has been redeployed, so its logs
are no longer available. However, here is log level 10 output for job
724272, grepped by job ID:

[snip]
12/13/2012 00:00:27;0008;   pbs_mom.27123;Req;send_sisters;sending command POLL_JOB for job 724272.cheops10 (7)
12/13/2012 00:00:59;0002;   pbs_mom.27123;n/a;mom_server_update_stat;mom_server_update_stat: sending to server "jobs=717556.cheops10 724272.cheops10 724285.cheops10"
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;724272.cheops10;signaling job with signal SIGKILL
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 724272.cheops10, reason: killing job
[snap]
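
That excerpt came from something like:

[snip]
grep '724272\.cheops10' /var/spool/torque/mom_logs/20121213
[snap]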

There is a POLL_JOB request sent via "send_sisters" (strange, since
this is a single-node job), and the pbs_mom sends a
mom_server_update_stat with a job list that includes job 724272. Right
afterwards comes the SignalJob to kill it. Here is the entire log
snippet between the mom_server_update_stat and the kill:

[snip]
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;read_tcp_reply;protocol: 4  version: 3  command:4  sock:8
12/13/2012 00:00:59;0002;   pbs_mom.27123;n/a;mom_server_update_stat;status update successfully sent to cheops10
12/13/2012 00:00:59;0080;   pbs_mom.27123;Req;dis_request_read;decoding command SignalJob from PBS_Server
12/13/2012 00:00:59;0100;   pbs_mom.27123;Req;;Type SignalJob request received from PBS_Server at cheops10, sock=8
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;mom_process_request;request type SignalJob from host cheops10 received
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;mom_process_request;request type SignalJob from host cheops10 allowed
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;mom_dispatch_request;dispatching request SignalJob on sd=8
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;724272.cheops10;signaling job with signal SIGKILL
12/13/2012 00:00:59;0008;   pbs_mom.27123;Job;kill_job;req_signaljob: sending signal 9, "KILL" to job 724272.cheops10, reason: killing job
[snap]
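
In case it helps to spot further instances, here is a quick-and-dirty
sketch (assuming the usual one-entry-per-line log format and the
same-second timing seen above) that flags SIGKILL entries arriving
within the same second as a status update:

[snip]
# print SIGKILL log entries whose timestamp matches that of the
# most recent mom_server_update_stat entry
awk -F';' '/mom_server_update_stat: sending to server/ {ts=$1}
           /signaling job with signal SIGKILL/ && $1==ts {print}' /var/spool/torque/mom_logs/20121213
[snap]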

What other info do you need? Are there specific messages to look for?

Regards,
Lech



On 12.12.2012 18:16, David Beer wrote:
> Lech,
>
>  From your log snippet, it appears that the server sent a poll request
> to the mom (for the job) and the mom replied back that it didn't know
> about that job, getting the job cancelled on the server's side. I'm
> going to do a little more investigation to see if there's some way that
> the poll request to the mom can time out or otherwise fail to
> erroneously create this condition. If you can provide any more details
> about creating this scenario, that'd be great. Helpful information would
> be describing the lengths of jobs that are running, if they are parallel
> or serial, or estimate the ratio of parallel to serial, things like that
> might help us figure it out.
>
> David
>
> On Wed, Dec 12, 2012 at 9:57 AM, Lech Nieroda <nieroda.lech at uni-koeln.de
> <mailto:nieroda.lech at uni-koeln.de>> wrote:
>
>     Dear list,
>
>     the problem continues to persist: roughly 1 in every 1000-2000 jobs is killed.
>     We've noticed that it almost disappears when the pbs_server loglevel is
>     raised to 7 (ca. one occurrence of the error per week). Considering that
>     higher loglevels simply generate more output and may thus prolong the
>     execution of some tasks for a minuscule amount of time, this may turn
>     out to be a race condition issue.
>     We can't permanently set the loglevel to 7 though, because of two
>     problems: the daily logfile size is ca. 30GB and pbs_server crashes
>     every 2-3 days.
>
>     The error: initially, the server sends an abort for no apparent reason,
>     but mails the user "Job does not exist on node". Even though the
>     pbs_mom sends the correct job list to the pbs_server, it receives a
>     SIGKILL for the job (here job id 721711, no dependencies, etc.)
>     without an error message of its own:
>
>     [pbs_server snip]
>     12/12/2012 02:49:46;0004;PBS_Server.27962;Svr;svr_connect;attempting
>     connect to host 172.18.1.220 port 15002
>     12/12/2012
>     02:49:46;0008;PBS_Server.27962;Job;svr_setjobstate;svr_setjobstate:
>     setting job 721711.cheops10 state from RUNNING-RUNNING to
>     QUEUED-SUBSTATE55 (1-55)
>     [...]
>     12/12/2012 02:49:46;000d;PBS_Server.27962;Job;721711.cheops10;preparing
>     to send 'a' mail for job 721711.cheops10 to fernandl at cheops0 (Job does
>     not exist on node)
>     [pbs_server snap]
>
>     [pbs_mom snip]
>     12/12/2012 02:49:50;0002;
>     pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending
>     to server "jobs=722456.cheops10 721711.cheops10 722457.cheops10
>     722458.cheops10"
>     12/12/2012 02:49:50;0002;
>     pbs_mom.31686;n/a;mom_server_update_stat;mom_server_update_stat: sending
>     to server "varattr= "
>     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;read_tcp_reply;protocol: 4
>        version: 3  command:4  sock:8
>     12/12/2012 02:49:50;0002;
>     pbs_mom.31686;n/a;mom_server_update_stat;status update successfully sent
>     to cheops10
>     12/12/2012 02:49:50;0080;   pbs_mom.31686;Req;dis_request_read;decoding
>     command SignalJob from PBS_Server
>     12/12/2012 02:49:50;0100;   pbs_mom.31686;Req;;Type SignalJob request
>     received from PBS_Server at cheops10, sock=8
>     12/12/2012 02:49:50;0008;
>     pbs_mom.31686;Job;mom_process_request;request type SignalJob from host
>     cheops10 received
>     12/12/2012 02:49:50;0008;
>     pbs_mom.31686;Job;mom_process_request;request type SignalJob from host
>     cheops10 allowed
>     12/12/2012 02:49:50;0008;
>     pbs_mom.31686;Job;mom_dispatch_request;dispatching request SignalJob
>     on sd=8
>     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;721711.cheops10;signaling
>     job with signal SIGKILL
>     12/12/2012 02:49:50;0008;   pbs_mom.31686;Job;kill_job;req_signaljob:
>     sending signal 9, "KILL" to job 721711.cheops10, reason: killing job
>     [...]
>     12/12/2012 02:50:35;0080;   pbs_mom.31686;Svr;preobit_reply;top of
>     preobit_reply
>     12/12/2012 02:50:35;0080;
>     pbs_mom.31686;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr
>     worked, top of while loop
>     12/12/2012 02:50:35;0001;
>     pbs_mom.31686;Job;721711.cheops10;preobit_reply, unknown on server,
>     deleting locally
>     [pbs_mom snap]
>
>     Ideas? Fixes?
>
>     Regards,
>     Lech Nieroda
>
>
>
>     In case it helps to analyze the bug:
>
>     On 26.11.2012 14:02, Lech Nieroda wrote:
>      > Dear list,
>      >
>      > I've raised the loglevel on one of the clients to loglevel=7 in
>     order to
>      > collect more information on the event.
>      >
>      > maui just assumes a successful completion of the job.
>      >
>      > pbs_server log regarding the job 686938[4] that suddenly becomes
>     unknown
>      > at 00:15:21:
>      > [snip]
>      > /var/spool/torque/server_logs/20121125:11/25/2012
>      > 10:52:15;0008;PBS_Server.12742;Job;686938[4].cheops10;Job Run at
>     request
>      > of maui at localhost.localdomain
>      > /var/spool/torque/server_logs/20121125:11/25/2012
>      > 10:52:15;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate:
>      > setting job 686938[4].cheops10 state from QUEUED-QUEUED to
>      > RUNNING-PRERUN (4-40)
>      > /var/spool/torque/server_logs/20121125:11/25/2012
>      > 10:52:16;0002;PBS_Server.12742;Job;686938[4].cheops10;child reported
>      > success for job after 1 seconds (dest=???), rc=0
>      > /var/spool/torque/server_logs/20121125:11/25/2012
>      > 10:52:16;0008;PBS_Server.12742;Job;svr_setjobstate;svr_setjobstate:
>      > setting job 686938[4].cheops10 state from RUNNING-TRNOUTCM to
>      > RUNNING-RUNNING (4-42)
>      > /var/spool/torque/server_logs/20121125:11/25/2012
>      > 10:52:16;000d;PBS_Server.12742;Job;686938[4].cheops10;Not sending
>     email:
>      > User does not want mail of this type.
>      > /var/spool/torque/server_logs/20121125:11/25/2012
>      > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate:
>      > setting job 686938[4].cheops10 state from RUNNING-RUNNING to
>      > QUEUED-SUBSTATE55 (1-55)
>      > /var/spool/torque/server_logs/20121126:11/26/2012
>      > 00:15:21;0008;PBS_Server.12702;Job;svr_setjobstate;svr_setjobstate:
>      > setting job 686938[4].cheops10 state from QUEUED-SUBSTATE55 to
>      > EXITING-SUBSTATE55 (5-55)
>      > /var/spool/torque/server_logs/20121126:11/26/2012
>      > 00:15:21;0100;PBS_Server.12702;Job;686938[4].cheops10;dequeuing from
>      > smp, state EXITING
>      > /var/spool/torque/server_logs/20121126:11/26/2012
>      >
>     00:16:02;0001;PBS_Server.12640;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom,
>     stray
>      > job 686938[4].cheops10 found on cheops11801
>      > [snap]
>      >
>      > pbs_mom log at loglevel 7, shortly before the kill:
>      > [snip]
>      > 11/26/2012 00:14:26;0008;   pbs_mom.26562;Req;send_sisters;sending
>      > command POLL_JOB for job 686938[4].cheops10 (7)
>      > 11/26/2012 00:15:02;0002;
>      > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat:
>     sending
>      > to server "jobs=686659[1].cheops10 686694[8].cheops10
>      > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10
>      > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10
>      > 686941[130].cheops10 686941[132].cheops10 686941[133].cheops10
>      > 686941[150].cheops10 686941[151].cheops10 686941[153].cheops10
>      > 686941[186].cheops10 686941[187].cheops10 686941[188].cheops10
>      > 686941[212].cheops10 687103[1].cheops10 687224.cheops10
>      > 687252[1].cheops10 687279[1].cheops10 687279[2].cheops10
>      > 687279[3].cheops10 687334[1].cheops10 687334[2].cheops10
>      > 687367[1].cheops10 687367[2].cheops10 687366.cheops10"
>      > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;cput_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:15:11;0080;   pbs_mom.26562;n/a;resi_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:15:11;0008;   pbs_mom.26562;Req;send_sisters;sending
>      > command POLL_JOB for job 686938[4].cheops10 (7)
>      > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;cput_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:15:56;0080;   pbs_mom.26562;n/a;resi_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:15:56;0008;   pbs_mom.26562;Req;send_sisters;sending
>      > command POLL_JOB for job 686938[4].cheops10 (7)
>      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:00;0008;
>      > pbs_mom.26562;Job;686938[4].cheops10;scan_for_exiting:job is in
>      > non-exiting substate RUNNING, no obit sent at this time
>      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;cput_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:00;0080;   pbs_mom.26562;n/a;resi_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;cput_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;mem_sum;proc_array loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;   pbs_mom.26562;n/a;resi_sum;proc_array
>     loop
>      > start - jobid = 686938[4].cheops10
>      > 11/26/2012 00:16:01;0080;
>      > pbs_mom.26562;Job;686938[4].cheops10;checking job w/subtask pid=0
>     (child
>      > pid=4214)
>      > 11/26/2012 00:16:02;0002;
>      > pbs_mom.26562;n/a;mom_server_update_stat;mom_server_update_stat:
>     sending
>      > to server "jobs=686659[1].cheops10 686694[8].cheops10
>      > 686717[10].cheops10 686938[2].cheops10 686938[3].cheops10
>      > 686938[4].cheops10 686941[2].cheops10 686941[128].cheops10
>      > 686941[130].cheops10 686941[132].cheops10 686941[150].cheops10
>      > 686941[151].cheops10 686941[153].cheops10 686941[186].cheops10
>      > 686941[187].cheops10 686941[188].cheops10 686941[212].cheops10
>      > 687103[1].cheops10 687224.cheops10 687252[1].cheops10
>     687279[1].cheops10
>      > 687279[2].cheops10 687279[3].cheops10 687334[1].cheops10
>      > 687334[2].cheops10 687367[1].cheops10 687367[2].cheops10
>     687366.cheops10"
>      > 11/26/2012 00:16:02;0008;
>      > pbs_mom.26562;Job;686938[4].cheops10;signaling job with signal
>     SIGKILL
>      > 11/26/2012 00:16:02;0008;   pbs_mom.26562;Job;kill_job;req_signaljob:
>      > sending signal 9, "KILL" to job 686938[4].cheops10, reason:
>     killing job
>      > [snap]
>      >
>      > As we can see, the pbs_mom sends complete job lists, including job
>      > 686938[4].cheops10, to the pbs_server right before (00:15:02) and
>      > after (00:16:02) the SUBSTATE55 setting (00:15:21). However, the
>      > server claims that the job is unknown/doesn't exist and sends a kill
>      > command, which the pbs_mom then executes (00:16:02).
>      >
>      > I hope this helps in tracking down the bug.
>      >
>      > Regards,
>      > Lech Nieroda
>      >
>      >
>      > On 22.11.2012 16:46, Lech Nieroda wrote:
>      >> Dear list,
>      >>
>      >> we have another serious problem since our upgrade to Torque
>     4.1.3. We
>      >> are using it with Maui 3.3.1. The problem in a nutshell: some few,
>      >> random jobs are suddenly "unknown" to the server, it changes their
>      >> status to EXITING-SUBSTATE55 and requests a silent kill on the
>     compute
>      >> nodes. The job then dies, the processes are killed on the node,
>     there is
>      >> no "Exit_status" in the server-log, no entry in maui/stats, no
>      >> stdout/stderr files. The users are, understandably, not amused.
>      >>
>      >> It doesn't seem to be user or application specific. Even a single
>      >> instance from a job array can get blown away in this way while
>     all other
>      >> instances end normally.
>      >>
>      >> Here are some logs of such a job (681684[35]):
>      >>
>      >> maui just assumes a successful completion:
>      >> [snip]
>      >> 11/21 19:24:49
>     MPBSJobUpdate(681684[35],681684[35].cheops10,TaskList,0)
>      >> 11/21 19:24:49 INFO: Average nodespeed for Job 681684[35] is
>       1.000000,
>      >> 1.000000, 1
>      >> 11/21 19:25:55 INFO:     active PBS job 681684[35] has been
>     removed from
>      >> the queue.  assuming successful completion
>      >> 11/21 19:25:55 MJobProcessCompleted(681684[35])
>      >> 11/21 19:25:55 INFO:     job '681684[35]' completed  X: 0.063356  T:
>      >> 10903  PS: 10903  A: 0.063096
>      >> 11/21 19:25:55 MJobSendFB(681684[35])
>      >> 11/21 19:25:55 INFO:     job usage sent for job '681684[35]'
>      >> 11/21 19:25:55 MJobRemove(681684[35])
>      >> 11/21 19:25:55 MJobDestroy(681684[35])
>      >> [snap]
>      >>
>      >> pbs_server decides at 19:25:11, after 3 hours runtime, that the
>     job is
>      >> unknown (grepped by JobID from the server logs):
>      >> [snip]
>      >> 11/21/2012
>      >> 16:23:43;0008;PBS_Server.26038;Job;svr_setjobstate;svr_setjobstate:
>      >> setting job 681684[35].cheops10 state from RUNNING-TRNOUTCM to
>      >> RUNNING-RUNNING (4-42)
>      >> 11/21/2012
>      >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
>      >> setting job 681684[35].cheops10 state from RUNNING-RUNNING to
>      >> QUEUED-SUBSTATE55 (1-55)
>      >> 11/21/2012
>      >> 19:25:11;0008;PBS_Server.26097;Job;svr_setjobstate;svr_setjobstate:
>      >> setting job 681684[35].cheops10 state from QUEUED-SUBSTATE55 to
>      >> EXITING-SUBSTATE55 (5-55)
>      >> 11/21/2012
>      >> 19:25:11;0100;PBS_Server.26097;Job;681684[35].cheops10;dequeuing
>     from
>      >> smp, state EXITING
>      >> 11/21/2012
>      >>
>     19:25:14;0001;PBS_Server.26122;Svr;PBS_Server;LOG_ERROR::kill_job_on_mom,
>     stray
>      >> job 681684[35].cheops10 found on cheops21316
>      >> [snap]
>      >>
>      >> the pbs_mom on the client node just kills the processes:
>      >> [snip]
>      >> 11/21/2012 16:23:43;0001;   pbs_mom.32254;Job;TMomFinalizeJob3;job
>      >> 681684[35].cheops10 started, pid = 17452
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid
>     17452 task
>      >> 1 gracefully with sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>      >> (pid=17452/state=R) after sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>      >> (pid=17452/state=Z) after sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid
>     17692 task
>      >> 1 gracefully with sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>      >> (pid=17692/state=R) after sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid
>     17703 task
>      >> 1 gracefully with sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>      >> (pid=17703/state=R) after sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: killing pid
>     17731 task
>      >> 1 gracefully with sig 15
>      >> 11/21/2012 19:25:14;0008;
>      >> pbs_mom.32254;Job;681684[35].cheops10;kill_task: process
>      >> (pid=17731/state=R) after sig 15
>      >> 11/21/2012 19:25:15;0080;
>      >> pbs_mom.32254;Job;681684[35].cheops10;scan_for_terminated: job
>      >> 681684[35].cheops10 task 1 terminated, sid=17452
>      >> 11/21/2012 19:25:15;0008;
>     pbs_mom.32254;Job;681684[35].cheops10;job
>      >> was terminated
>      >> 11/21/2012 19:25:50;0001;
>      >> pbs_mom.32254;Job;681684[35].cheops10;preobit_reply, unknown on
>     server,
>      >> deleting locally
>      >> 11/21/2012 19:25:50;0080;
>      >> pbs_mom.32254;Job;681684[35].cheops10;removed job script
>      >> [snap]
>      >>
>      >> Sometimes, the pbs_mom logs include this message before the
>     killing starts:
>      >> [snip]
>      >> Req;req_reject;Reject reply code=15001(Unknown Job Id Error), aux=0,
>      >> type=StatusJob, from PBS_Server at cheops10
>      >> [snap]
>      >>
>      >> And finally, some job information given to the epilogue:
>      >> [snip]
>      >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
>      >> 681684[35].cheops10,hthiele0,cheops21316,Starting shared epilogue
>      >> Nov 21 19:25:15 s_sys at cheops21316 epilogue.shared:
>      >> 681684[35].cheops10,hthiele0,cheops21316,Job Information:
>      >> userid=hthiele0,
>      >>
>     resourcelist='mem=5gb,ncpus=1,neednodes=1:ppn=1,nodes=1:ppn=1,walltime=48:00:00',
>      >>
>     resourcesused='cput=03:00:46,mem=945160kb,vmem=1368548kb,walltime=03:01:34',
>      >> queue=smp, account=ccg-ngs, exitcode=271
>      >> [snap]
>      >>
>      >> This happens rarely (about 1 in 3000). However, silent deletions of
>      >> random jobs aren't exactly a trifling matter.
>      >> I could try to disable the mom_job_sync option, which could perhaps
>      >> prevent the process killing of unknown jobs, but it would also leave
>      >> corrupt/pre-execution jobs alive.
>      >>
>      >> Can this be fixed?
>      >>
>      >> On a side note, here are some further minor bugs I've noticed
>      >> in Torque 4.1.3:
>      >> - the epilogue script is usually invoked twice and sometimes even
>      >> several times
>      >> - when explicit node lists are used, e.g.
>     nodes=node1:ppn=2+node2:ppn=2,
>      >> then the number of "tasks" as seen by qstat is zero
>      >> - there have been some API changes between Torque 2.x and Torque
>     4.x, so
>      >> that two maui calls had to be altered in order to build against
>     Torque
>      >> 4.x (get_svrport, openrm).
>      >>
>      >>
>      >> Regards,
>      >> Lech Nieroda
>      >>
>      >
>      >
>
>
>     --
>     Dipl.-Wirt.-Inf. Lech Nieroda
>     Regionales Rechenzentrum der Universität zu Köln (RRZK)
>     Universität zu Köln
>     Weyertal 121
>     Raum 309 (3. Etage)
>     D-50931 Köln
>     Deutschland
>
>     Tel.: +49 (221) 470-89606
>     E-Mail: nieroda.lech at uni-koeln.de <mailto:nieroda.lech at uni-koeln.de>
>     _______________________________________________
>     torqueusers mailing list
>     torqueusers at supercluster.org <mailto:torqueusers at supercluster.org>
>     http://www.supercluster.org/mailman/listinfo/torqueusers
>
>
>
>
> --
> David Beer | Senior Software Engineer
> Adaptive Computing
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>


-- 
Dipl.-Wirt.-Inf. Lech Nieroda
Regionales Rechenzentrum der Universität zu Köln (RRZK)
Universität zu Köln
Weyertal 121
Raum 309 (3. Etage)
D-50931 Köln
Deutschland

Tel.: +49 (221) 470-89606
E-Mail: nieroda.lech at uni-koeln.de

