[torqueusers] pbs_server becoming unresponsive while processing job array

Hutcheson, Mike Mike_Hutcheson at baylor.edu
Fri Apr 26 16:00:45 MDT 2013


Hi.  When a user submits a job array, pbs_server begins processing it, but after working through fewer than 100 sub-jobs, client commands time out and the moms can no longer communicate with the server.

We're running Torque 4.2.2 and Maui 3.3.1 on the server, which is running CentOS 6.4 on x86_64.  This problem also occurred with Torque 4.1.5.1, which is why I wanted to try 4.2.2.  The moms (Torque 4.2.2) are running on CentOS 5.3 on x86_64.  We have 128 compute nodes and a dedicated Torque/Maui server.  We have two submit hosts (Torque 4.2.2), one running CentOS 5.3 on x86_64 and the other CentOS 6.3.

The compute nodes have one active gigE network and one IB network.  The moms use the gigE network.  The server has two gigE networks, private and campus.  pbs_server uses the private network.

In the server log, there were some "Too many open files (24) in job_server, open for full save" messages that occurred a couple of hours after the snippets from the logs below.  ulimit showed the maximum number of open files to be 1024, so I added "ulimit -n 16384" to /etc/init.d/pbs_server and haven't seen any more of those messages.
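
A ulimit set in an init script only applies to a daemon started through that script, so it can be worth confirming the limit on the running process itself.  Below is a minimal Python sketch, assuming a Linux host where /proc/<pid>/limits and /proc/<pid>/fd are readable; the function names are illustrative and not part of Torque.

```python
import os


def parse_max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are ints, or the literal string (e.g. 'unlimited') when not numeric.
    Returns None if the line is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: Max open files <soft> <hard> files
            fields = line.split()
            soft, hard = fields[3], fields[4]
            return tuple(int(v) if v.isdigit() else v for v in (soft, hard))
    return None


def limits_for_pid(pid):
    """Read the open-file limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as fh:
        return parse_max_open_files(fh.read())


def count_open_fds(pid):
    """Count descriptors currently open in the process (needs permission)."""
    return len(os.listdir("/proc/%d/fd" % pid))
```

Run as root with the pid of pbs_server (for example from "pgrep pbs_server"), limits_for_pid(pid) shows whether the 16384 limit took effect, and calling count_open_fds(pid) repeatedly shows whether the descriptor count keeps growing.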

Here's a snippet from the mom on node n128:

04/26/2013 00:02:07;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
04/26/2013 00:02:07;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
04/26/2013 00:02:07;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
04/26/2013 00:02:07;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update
04/26/2013 00:03:17;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
04/26/2013 00:03:17;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
04/26/2013 00:03:17;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
04/26/2013 00:03:17;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update
04/26/2013 00:04:27;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
04/26/2013 00:04:27;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
04/26/2013 00:04:27;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
04/26/2013 00:04:27;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update
04/26/2013 00:04:37;0002;   pbs_mom.8915;Svr;pbs_mom;Torque Mom Version = 4.1.5.1, loglevel = 0
04/26/2013 00:05:37;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
04/26/2013 00:05:37;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
04/26/2013 00:05:37;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
04/26/2013 00:05:37;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update

The messages repeat every 70 seconds.
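
The 70-second spacing can be checked directly from the timestamps.  This is a small Python sketch, assuming the "MM/DD/YYYY HH:MM:SS;" prefix seen in the mom log above; message_intervals is a hypothetical helper name, and the sample lines are copied from the snippet.

```python
from datetime import datetime

# Four of the repeating error lines from the n128 mom log above.
SAMPLE = [
    "04/26/2013 00:02:07;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update",
    "04/26/2013 00:03:17;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update",
    "04/26/2013 00:04:27;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update",
    "04/26/2013 00:05:37;0001;   pbs_mom.8915;Svr;pbs_mom;LOG_ERROR::mom_server_all_update_stat, Could not contact any of the servers to send an update",
]


def message_intervals(log_lines, needle):
    """Return the gaps in seconds between successive lines containing needle."""
    times = []
    for line in log_lines:
        if needle in line:
            # The timestamp is everything before the first ';'.
            stamp = line.split(";", 1)[0]
            times.append(datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"))
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


print(message_intervals(SAMPLE, "mom_server_all_update_stat"))  # [70, 70, 70]
```

Feeding it the full mom log (for example the lines of the day's file under the mom_logs directory) would show whether the interval stays constant over the whole outage.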

In that same time frame, here's what I see in the pbs_server log:

04/26/2013 00:02:07;0100;PBS_Server.12288;Req;;Type StatusJob request received from pbs_mom at n082, sock=75
04/26/2013 00:02:09;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:02:09;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:02:09;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:02:09;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n081, sock=77
04/26/2013 00:02:09;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n081, sock=210
04/26/2013 00:02:09;0100;PBS_Server.12289;Req;;Type StatusJob request received from pbs_mom at n095, sock=211
04/26/2013 00:02:09;0100;PBS_Server.12288;Req;;Type StatusJob request received from pbs_mom at n080, sock=212
04/26/2013 00:02:40;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:02:40;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:02:40;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:03:06;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[225].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:03:06;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[225].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:03:06;0008;PBS_Server.12283;Job;svr_setjobstate;svr_setjobstate: setting job 19311[216].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:03:06;0008;PBS_Server.12283;Job;svr_setjobstate;svr_setjobstate: setting job 19311[216].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:03:06;0008;PBS_Server.12296;Job;svr_setjobstate;svr_setjobstate: setting job 19311[208].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:03:06;0008;PBS_Server.12296;Job;svr_setjobstate;svr_setjobstate: setting job 19311[208].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:03:07;0040;PBS_Server.12291;Req;free_nodes;freeing nodes for job 19311[160].n131.localdomain
04/26/2013 00:03:07;0008;PBS_Server.12291;Job;svr_setjobstate;svr_setjobstate: setting job 19311[160].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:03:07;0008;PBS_Server.12291;Job;19311[180].n131.localdomain;on_job_exit valid pjob: 19311[180].n131.localdomain (substate=50)
04/26/2013 00:03:07;0008;PBS_Server.12291;Job;handle_exiting_or_abort_substate;19311[180].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:03:07;0008;PBS_Server.12291;Job;svr_setjobstate;svr_setjobstate: setting job 19311[180].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:03:07;0008;PBS_Server.12291;Job;svr_setjobstate;svr_setjobstate: setting job 19311[180].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:03:07;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[200].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:03:07;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[200].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:03:07;0008;PBS_Server.12297;Job;svr_setjobstate;svr_setjobstate: setting job 19311[150].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:03:07;0008;PBS_Server.12297;Job;svr_setjobstate;svr_setjobstate: setting job 19311[150].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:03:07;0008;PBS_Server.12285;Job;svr_setjobstate;svr_setjobstate: setting job 19311[163].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:03:07;0008;PBS_Server.12285;Job;svr_setjobstate;svr_setjobstate: setting job 19311[163].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:03:09;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n078, sock=213
04/26/2013 00:03:09;0100;PBS_Server.12288;Req;;Type StatusJob request received from pbs_mom at n077, sock=78
04/26/2013 00:03:09;0100;PBS_Server.12289;Req;;Type StatusJob request received from pbs_mom at n079, sock=214
04/26/2013 00:03:11;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:03:11;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:03:11;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:03:19;0008;PBS_Server.12289;Job;19311[176].n131.localdomain;on_job_exit valid pjob: 19311[176].n131.localdomain (substate=50)
04/26/2013 00:03:19;0008;PBS_Server.12289;Job;handle_exiting_or_abort_substate;19311[176].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:03:19;0008;PBS_Server.12289;Job;svr_setjobstate;svr_setjobstate: setting job 19311[176].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:03:19;0008;PBS_Server.12289;Job;svr_setjobstate;svr_setjobstate: setting job 19311[176].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:03:19;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n083, sock=215
04/26/2013 00:03:19;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n095, sock=74
04/26/2013 00:03:20;0008;PBS_Server.12288;Job;19311[175].n131.localdomain;on_job_exit valid pjob: 19311[175].n131.localdomain (substate=50)
04/26/2013 00:03:20;0008;PBS_Server.12288;Job;handle_exiting_or_abort_substate;19311[175].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:03:20;0008;PBS_Server.12288;Job;svr_setjobstate;svr_setjobstate: setting job 19311[175].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:03:20;0008;PBS_Server.12288;Job;svr_setjobstate;svr_setjobstate: setting job 19311[175].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:03:42;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:03:42;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:03:42;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:04:06;0008;PBS_Server.12299;Job;svr_setjobstate;svr_setjobstate: setting job 19311[212].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:06;0008;PBS_Server.12299;Job;svr_setjobstate;svr_setjobstate: setting job 19311[212].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:07;0008;PBS_Server.12298;Job;svr_setjobstate;svr_setjobstate: setting job 19311[153].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:07;0008;PBS_Server.12298;Job;svr_setjobstate;svr_setjobstate: setting job 19311[153].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:07;0008;PBS_Server.12287;Job;svr_setjobstate;svr_setjobstate: setting job 19311[197].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:07;0008;PBS_Server.12287;Job;svr_setjobstate;svr_setjobstate: setting job 19311[197].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:07;0008;PBS_Server.12286;Job;svr_setjobstate;svr_setjobstate: setting job 19311[152].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:07;0008;PBS_Server.12286;Job;svr_setjobstate;svr_setjobstate: setting job 19311[152].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:08;0008;PBS_Server.12290;Job;svr_setjobstate;svr_setjobstate: setting job 19311[164].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:08;0008;PBS_Server.12290;Job;svr_setjobstate;svr_setjobstate: setting job 19311[164].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:09;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n074, sock=216
04/26/2013 00:04:13;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:04:13;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:04:13;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:04:17;0008;PBS_Server.12291;Job;svr_setjobstate;svr_setjobstate: setting job 19311[180].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:17;0008;PBS_Server.12291;Job;svr_setjobstate;svr_setjobstate: setting job 19311[180].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:19;0008;PBS_Server.12289;Job;svr_setjobstate;svr_setjobstate: setting job 19311[176].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:04:19;0008;PBS_Server.12289;Job;svr_setjobstate;svr_setjobstate: setting job 19311[176].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:04:29;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n075, sock=15
04/26/2013 00:04:29;0100;PBS_Server.12293;Req;;Type StatusJob request received from pbs_mom at n080, sock=217
04/26/2013 00:04:44;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:04:44;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:04:44;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:05:06;0040;PBS_Server.12284;Req;free_nodes;freeing nodes for job 19311[225].n131.localdomain
04/26/2013 00:05:06;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[225].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:05:06;0100;PBS_Server.12284;Req;;Type StatusJob request received from pbs_mom at n095, sock=218
04/26/2013 00:05:06;0040;PBS_Server.12283;Req;free_nodes;freeing nodes for job 19311[216].n131.localdomain
04/26/2013 00:05:06;0008;PBS_Server.12283;Job;svr_setjobstate;svr_setjobstate: setting job 19311[216].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:05:06;0008;PBS_Server.12283;Job;19311[182].n131.localdomain;on_job_exit valid pjob: 19311[182].n131.localdomain (substate=50)
04/26/2013 00:05:06;0008;PBS_Server.12283;Job;handle_exiting_or_abort_substate;19311[182].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:05:06;0008;PBS_Server.12283;Job;svr_setjobstate;svr_setjobstate: setting job 19311[182].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:05:06;0008;PBS_Server.12283;Job;svr_setjobstate;svr_setjobstate: setting job 19311[182].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:05:06;0040;PBS_Server.12296;Req;free_nodes;freeing nodes for job 19311[208].n131.localdomain
04/26/2013 00:05:06;0008;PBS_Server.12296;Job;svr_setjobstate;svr_setjobstate: setting job 19311[208].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:05:06;0100;PBS_Server.12296;Req;;Type StatusJob request received from pbs_mom at n081, sock=219
04/26/2013 00:05:07;0040;PBS_Server.12292;Req;free_nodes;freeing nodes for job 19311[200].n131.localdomain
04/26/2013 00:05:07;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[200].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:05:07;0040;PBS_Server.12297;Req;free_nodes;freeing nodes for job 19311[150].n131.localdomain
04/26/2013 00:05:07;0008;PBS_Server.12297;Job;svr_setjobstate;svr_setjobstate: setting job 19311[150].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:05:07;0100;PBS_Server.12297;Req;;Type StatusJob request received from pbs_mom at n083, sock=220
04/26/2013 00:05:07;0040;PBS_Server.12285;Req;free_nodes;freeing nodes for job 19311[163].n131.localdomain
04/26/2013 00:05:07;0008;PBS_Server.12285;Job;svr_setjobstate;svr_setjobstate: setting job 19311[163].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)
04/26/2013 00:05:07;0008;PBS_Server.12285;Job;19311[168].n131.localdomain;on_job_exit valid pjob: 19311[168].n131.localdomain (substate=50)
04/26/2013 00:05:07;0008;PBS_Server.12285;Job;handle_exiting_or_abort_substate;19311[168].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:05:07;0008;PBS_Server.12285;Job;svr_setjobstate;svr_setjobstate: setting job 19311[168].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:05:07;0008;PBS_Server.12285;Job;svr_setjobstate;svr_setjobstate: setting job 19311[168].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:05:08;0100;PBS_Server.12292;Req;;Type StatusJob request received from pbs_mom at n079, sock=12
04/26/2013 00:05:09;0008;PBS_Server.12284;Job;19311[172].n131.localdomain;on_job_exit valid pjob: 19311[172].n131.localdomain (substate=50)
04/26/2013 00:05:09;0008;PBS_Server.12284;Job;handle_exiting_or_abort_substate;19311[172].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:05:09;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[172].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:05:09;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[172].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:05:15;0100;PBS_Server.12294;Req;;Type StatusNode request received from root at n131.localdomain, sock=11
04/26/2013 00:05:15;0100;PBS_Server.12294;Req;;Type StatusQueue request received from root at n131.localdomain, sock=11
04/26/2013 00:05:15;0100;PBS_Server.12294;Req;;Type StatusJob request received from root at n131.localdomain, sock=11
04/26/2013 00:05:19;0100;PBS_Server.12296;Req;;Type StatusJob request received from pbs_mom at n076, sock=79
04/26/2013 00:05:19;0100;PBS_Server.12293;Req;;Type JobObituary request received from pbs_mom at n100, sock=80
04/26/2013 00:05:19;0009;PBS_Server.12293;Job;19311[228].n131.localdomain;obit received - updating final job usage info
04/26/2013 00:05:19;0009;PBS_Server.12293;Job;19311[228].n131.localdomain;job exit status 0 handled
04/26/2013 00:05:19;0008;PBS_Server.12293;Job;svr_setjobstate;svr_setjobstate: setting job 19311[228].n131.localdomain state from RUNNING-RUNNING to EXITING-EXITING (5-50)
04/26/2013 00:05:19;000d;PBS_Server.12293;Job;19311[228].n131.localdomain;preparing to send 'e' mail for job 19311[228].n131.localdomain to hutches at n130.localdomain (Exit_status=0
04/26/2013 00:05:19;0010;PBS_Server.12293;Job;19311[228].n131.localdomain;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=2060kb resources_used.vmem=132076kb resources_used.walltime=00:00:17
04/26/2013 00:05:19;0008;PBS_Server.12288;Job;svr_setjobstate;svr_setjobstate: setting job 19311[175].n131.localdomain state from EXITING-STAGEDEL to EXITING-EXITED (5-53)
04/26/2013 00:05:19;0008;PBS_Server.12288;Job;svr_setjobstate;svr_setjobstate: setting job 19311[175].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)
04/26/2013 00:05:29;0008;PBS_Server.12292;Job;19311[165].n131.localdomain;on_job_exit valid pjob: 19311[165].n131.localdomain (substate=50)
04/26/2013 00:05:29;0008;PBS_Server.12292;Job;handle_exiting_or_abort_substate;19311[165].n131.localdomain; JOB_SUBSTATE_EXITING
04/26/2013 00:05:29;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[165].n131.localdomain state from EXITING-EXITING to EXITING-RETURNSTD (5-70)
04/26/2013 00:05:29;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[165].n131.localdomain state from EXITING-RETURNSTD to EXITING-STAGEDEL (5-52)
04/26/2013 00:05:30;0100;PBS_Server.12297;Req;;Type StatusJob request received from pbs_mom at n080, sock=111
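
To see how long each sub-job lingers in a given state, the svr_setjobstate lines can be paired up per job.  The following is a rough Python sketch, assuming the log format shown above; time_in_state and the regular expression are my own illustration rather than a Torque tool, and the sample lines are copied from the excerpt.

```python
import re
from datetime import datetime

# Matches e.g. "setting job 19311[225].n131.localdomain state from A to B (5-54)"
TRANSITION = re.compile(r"setting job (\S+) state from (\S+) to (\S+) \(")

SAMPLE = [
    "04/26/2013 00:03:06;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[225].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)",
    "04/26/2013 00:03:07;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[200].n131.localdomain state from EXITING-EXITED to EXITING-ABORT (5-54)",
    "04/26/2013 00:05:06;0008;PBS_Server.12284;Job;svr_setjobstate;svr_setjobstate: setting job 19311[225].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)",
    "04/26/2013 00:05:07;0008;PBS_Server.12292;Job;svr_setjobstate;svr_setjobstate: setting job 19311[200].n131.localdomain state from EXITING-ABORT to COMPLETE-COMPLETE (6-59)",
]


def time_in_state(log_lines, state):
    """Map job id -> seconds between entering and leaving the given state."""
    entered = {}
    durations = {}
    for line in log_lines:
        match = TRANSITION.search(line)
        if not match:
            continue
        job, old, new = match.groups()
        when = datetime.strptime(line.split(";", 1)[0], "%m/%d/%Y %H:%M:%S")
        if new == state:
            entered[job] = when
        elif old == state and job in entered:
            durations[job] = int((when - entered.pop(job)).total_seconds())
    return durations


print(time_in_state(SAMPLE, "EXITING-ABORT"))
```

For the two sub-jobs sampled here, the result is 120 seconds each in EXITING-ABORT; running the same function over the whole server log with other states (EXITING-STAGEDEL, for instance) may help show where the exit processing is spending its time.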

Here's what our server and queue configuration looks like:

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 5000:00:00
set queue batch keep_completed = 300
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = n131
set server managers = root at n131.localdomain
set server managers += root at n130.localdomain
set server operators = root at n131.localdomain
set server default_queue = batch
set server log_events = 511
set server mail_from = adm at n131.localdomain
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 120
set server poll_jobs = True
set server log_level = 3
set server mom_job_sync = True
set server keep_completed = 300
set server allow_node_submit = True
set server next_job_number = 19425
set server clone_batch_delay = 30
set server job_force_cancel_time = 120
set server record_job_info = True
set server min_threads = 8
set server max_threads = 16
set server moab_array_compatible = True

Any help with diagnosing this problem would be greatly appreciated!

Thanks,

Mike Hutcheson (mikeUNDERSCOREhutchesonATbaylorDOTedu)
Systems Manager

