[torqueusers] some help using torque

Fabrizio Salvatore p.salvatore at rhul.ac.uk
Wed Mar 9 08:40:35 MST 2005


Hi,

We have just moved our batch farm to SL303 and installed torque to replace 
the old PBS batch system. Everything worked fine for the month following the 
installation of the new OS and batch system, until last week, when we hit 
the following problem: all jobs sent to the batch queue get stuck and are 
never routed to any of the 'execution' queues. As far as the pbs_server is 
concerned, everything looks fine in qmgr: all queues are enabled and all the 
server attributes seem OK:

> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_hosts = *.pp.rhul.ac.uk
> set server managers = babarmc@*.pp.rhul.ac.uk
> set server managers += dhs@*.pp.rhul.ac.uk
> set server managers += george@*.pp.rhul.ac.uk
> set server managers += root@*.pp.rhul.ac.uk
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server resources_default.neednodes = 1
> set server resources_default.nodect = 1
> set server scheduler_iteration = 600
> set server node_ping_rate = 300
> set server node_check_rate = 600
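
For reference, I assume these are the sensible commands to cross-check the
queue state (the queue and job names below are only placeholders):

  # dump the full server and queue configuration
  qmgr -c 'print server'

  # summary of all queues: enabled/started state, queued/running counts
  qstat -q

  # full attributes of the routing queue
  qmgr -c 'list queue <routing_queue>'

  # detailed state of one of the stuck jobs
  qstat -f <job_id>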
 
However, looking in /var/spool/pbs/sched_priv/sched_out, I see the following
lines, which make me worry that something is going wrong:

 simpleget: Premature end of message
 startcom: diswsi error Protocol failure in commit
 alarm call

but I have no idea what these mean.
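
They look like communication errors between the scheduler and the server, so
my (possibly naive) assumption is that the thing to check next is whether
pbs_server is still answering while the scheduler hangs, along these lines
(I believe 15001 and 15004 are the default pbs_server and pbs_sched ports,
but that is an assumption on my part):

  # is the server still answering status requests?
  qstat -B

  # are pbs_server and pbs_sched still listening on their ports?
  netstat -ltnp | grep ':1500[14]'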
 
One more thing I noticed: in /var/spool/pbs/sched_priv/ and 
/var/spool/pbs/server_priv/ there are two lock files (sched.lock and 
server.lock, respectively) that are created every time I start pbs_sched 
and/or pbs_server. These files contain the PID of the scheduler/server 
process and are renewed each time I start a new process. In fact, if I 
restart the server and scheduler, a few of the queued jobs are run, and then 
everything stops again. If I look at the scheduler log file 
(/var/spool/pbs/sched_log/<number>) I see the following:
 
03/08/2005 15:34:55;0002; pbs_sched;Svr;Log;Log opened
03/08/2005 15:34:55;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 27631
03/08/2005 15:35:49;0040; pbs_sched;Job;1464.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:50;0040; pbs_sched;Job;1465.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:50;0040; pbs_sched;Job;1466.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:51;0040; pbs_sched;Job;1467.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:51;0040; pbs_sched;Job;1468.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:52;0040; pbs_sched;Job;1469.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:52;0040; pbs_sched;Job;1470.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:53;0040; pbs_sched;Job;1471.bfa.pp.rhul.ac.uk;Job Run
03/08/2005 15:35:54;0080; pbs_sched;Svr;main;brk point 134774784
03/08/2005 15:47:08;0002; pbs_sched;Svr;toolong;alarm call
03/08/2005 15:47:08;0002; pbs_sched;Svr;Log;Log closed
03/08/2005 15:47:08;0002; pbs_sched;Svr;toolong;restart dir /var/spool/pbs/server_priv object /usr/sbin/pbs_sched
03/08/2005 15:47:08;0002; pbs_sched;Svr;Log;Log opened
03/08/2005 15:47:08;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 27696
03/08/2005 15:51:37;0002; pbs_sched;Svr;toolong;alarm call
03/08/2005 15:51:37;0002; pbs_sched;Svr;Log;Log closed
03/08/2005 15:51:37;0002; pbs_sched;Svr;toolong;restart dir /var/spool/pbs/server_priv object /usr/sbin/pbs_sched
03/08/2005 15:51:37;0002; pbs_sched;Svr;Log;Log opened
03/08/2005 15:51:37;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 27713

So it looks as if, after the 'pbs_sched;Svr;main;brk point 134774784' 
message and the 'alarm call', the scheduler was restarted (automatically? 
I didn't do it!):
 
 pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 27696
 
and then there was another 'alarm call' and it was restarted again 
(pid 27713).
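
From the 'toolong' tag my guess (and it is only a guess) is that this is the
scheduler's own watchdog: if a scheduling cycle takes longer than its alarm
time, it gives up and re-execs itself. If that is right, I suppose the alarm
could be lengthened when the scheduler is started, something like the
following (assuming the fifo scheduler's -a option really does set the alarm
time in seconds; the value 600 is arbitrary):

  # start the scheduler with a longer cycle alarm
  /usr/sbin/pbs_sched -a 600

although that would presumably only hide whatever is making the cycles take
so long in the first place.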
 
The .lock files that I had deleted before restarting the server and 
scheduler have been created again (their only content is the last PID of the 
scheduler/server process), and they have very weird dates:

-rw-r--r--    1 root     root            7 Mar  8 15:51 sched_priv/sched.lock
-rw-------    1 root     root            6 Mar  8 15:35 server_priv/server.lock
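
In case it is relevant, I assume the PIDs recorded in the lock files can be
cross-checked against the running daemons with plain shell (nothing
PBS-specific here):

  # the lock files only hold the daemon PIDs
  cat /var/spool/pbs/server_priv/server.lock /var/spool/pbs/sched_priv/sched.lock

  # confirm that those PIDs are still alive
  ps -p $(cat /var/spool/pbs/server_priv/server.lock)
  ps -p $(cat /var/spool/pbs/sched_priv/sched.lock)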

If you have had the same experience with torque, or know someone who could 
help me with this, I'd be really grateful.

Thank you very much in advance.
 
Regards,

			Fabrizio
 
 
************************************************************************
Dr P.-Fabrizio SALVATORE            * Royal Holloway, London University
E-mail: P.Salvatore at rhul.ac.uk      * Egham, Surrey, TW20 0EX, UK
http://www.pp.rhul.ac.uk/~salvator/ * Office phone: +44 (0)1784 443479
                                    * Office fax:   +44 (0)1784 472794
************************************************************************




