[torqueusers] Torque's scheduling problem
Zvika Galant
zvika at Camero-Tech.com
Sun Oct 16 07:09:27 MDT 2005
Hi,
I'm running Torque ver. 1.2.0p6 on our cluster of 8 nodes named wild1..8.
The server's name is creambo.
After submitting a job with "echo sleep 600 | qsub -r y" and then
calling qrerun on it, the job is put back in the queue but is not rerun.
I see the same behavior when using the Maui scheduler (ver.
3.2.6p11).
1)
The server log for job #2592 follows:
10/16/2005 12:44:40;0100;PBS_Server;Job;2592.creambo;enqueuing into batch, state 1 hop 1
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Queued at request of zv
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Modified at request of Scheduler at creambo
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler at creambo
10/16/2005 12:44:50;0008;PBS_Server;Job;2592.creambo;Job Rerun at request of root at creambo
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler at creambo
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;Job Modified at request of Scheduler at creambo
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;Job Modified at request of Scheduler at creambo
10/16/2005 12:54:56;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at creambo, sock=10
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler at creambo
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1
10/16/2005 12:54:56;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=send failed, STARTING), aux=0, type=RunJob, from Scheduler at creambo
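To make the retry timing in the server log easier to read, here is a minimal Python sketch that splits each record on its semicolon-separated fields and measures the gaps between consecutive "Job Run" attempts. The field names and the helper names (`parse_record`, `run_attempt_gaps`) are my own labels, not Torque terminology, and the sample records are copied from the log above with the wrapped lines rejoined.

```python
from datetime import datetime

# Records copied from the server log above (wrapped lines rejoined).
SERVER_LOG = """\
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler at creambo
10/16/2005 12:44:50;0008;PBS_Server;Job;2592.creambo;Job Rerun at request of root at creambo
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler at creambo
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler at creambo
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1
"""


def parse_record(line):
    """Split one record into (time, event code, daemon, class, object, message).

    The field names are illustrative labels chosen for this sketch.
    """
    stamp, code, daemon, cls, obj, msg = line.split(";", 5)
    when = datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")
    return when, code, daemon, cls, obj, msg


def run_attempt_gaps(log_text, job_id):
    """Return the seconds between consecutive 'Job Run' records for job_id."""
    runs = []
    for line in log_text.splitlines():
        when, _code, _daemon, _cls, obj, msg = parse_record(line)
        # "Job Rerun ..." does not match this prefix, so only run attempts count.
        if obj == job_id and msg.startswith("Job Run "):
            runs.append(when)
    return [int((b - a).total_seconds()) for a, b in zip(runs, runs[1:])]


gaps = run_attempt_gaps(SERVER_LOG, "2592.creambo")
print(gaps)
```

On these records the second gap is 600 seconds, the same value as scheduler_iteration in the qmgr output in section 3. That may only mean the rejected job is retried on the next scheduling cycle; it does not by itself explain why the MOM rejects the run request.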
2)
The MOM log from the relevant node for job #2592 follows:
10/16/2005 12:44:57;0001; pbs_mom;Job;job_nodes;job: 2592.creambo numnodes=1 numvnod=1
10/16/2005 12:44:57;0008; pbs_mom;Job;2592.creambo;evaluating limits for job
10/16/2005 12:44:57;0001; pbs_mom;Job;2592.creambo;phase 2 of job launch successfully completed
10/16/2005 12:44:57;0001; pbs_mom;Job;2592.creambo;saving task (TMomFinalizeJob3)
10/16/2005 12:44:57;0008; pbs_mom;Job;task_save;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:44:57;0001; pbs_mom;Job;TMomFinalizeJob3;job 2592.creambo started, pid = 28903
10/16/2005 12:44:57;0001; pbs_mom;Job;2592.creambo;job successfully started
10/16/2005 12:44:57;0008; pbs_mom;Job;2592.creambo;job 2592.creambo reported successful start on 1 node(s)
10/16/2005 12:45:07;0008; pbs_mom;Job;2592.creambo;kill_job
10/16/2005 12:45:07;0002; pbs_mom;n/a;run_pelog;userepilog script '/usr/spool/PBS/mom_priv/epilogue.precancel' does not exist (cwd: /usr/spool/PBS/mom_priv)
10/16/2005 12:45:07;0008; pbs_mom;Job;2592.creambo;kill_job found a task to kill
10/16/2005 12:45:07;0008; pbs_mom;Job;2592.creambo;sending signal 9 to task
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;kill_task: killing pid 28903 task 1 with sig 9
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;kill_task: killing pid 28918 task 1 with sig 9
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;kill_task: killing pid 28919 task 1 with sig 9
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;kill_job done
10/16/2005 12:45:12;0002; pbs_mom;n/a;cput_sum;cput_sum: session=28903 pid=28903 cputime=0 (cputfactor=1.000000)
10/16/2005 12:45:12;0008; pbs_mom;Job;scan_for_terminated;for job 2592.creambo, task 1, pid=28903, exitcode=265
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;sending signal 9 to task
10/16/2005 12:45:12;0008; pbs_mom;Job;task_save;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:45:12;0080; pbs_mom;Job;2592.creambo;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;Terminated
10/16/2005 12:45:12;0008; pbs_mom;Req;send_sisters;sending command KILL_JOB (2)
10/16/2005 12:45:12;0008; pbs_mom;Job;task_save;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:45:12;0080; pbs_mom;Job;2592.creambo;local task termination detected. killing job
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;kill_job
10/16/2005 12:45:12;0002; pbs_mom;n/a;run_pelog;userepilog script '/usr/spool/PBS/mom_priv/epilogue.precancel' does not exist (cwd: /usr/spool/PBS/mom_priv)
10/16/2005 12:45:12;0008; pbs_mom;Job;2592.creambo;kill_job done
10/16/2005 12:45:12;0080; pbs_mom;Job;2592.creambo;performing job clean-up
10/16/2005 12:45:12;0080; pbs_mom;Job;2592.creambo;deleting job 2592.creambo in state EXITED
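The exitcode=265 reported by scan_for_terminated can be cross-checked against the "sending signal 9" records. The sketch below assumes, and this is an assumption about this Torque version rather than something the log states, that the MOM reports a signal-terminated task as 256 plus the signal number. The helper name `describe_exit` and the constant `EXIT_SIGNAL_OFFSET` are hypothetical.

```python
import re
import signal

# Assumption: exit codes above this offset mean "killed by signal (code - offset)".
EXIT_SIGNAL_OFFSET = 256

# Record copied from the MOM log above (wrapped line rejoined).
MOM_LINE = (
    "10/16/2005 12:45:12;0008; pbs_mom;Job;scan_for_terminated;for job "
    "2592.creambo, task 1, pid=28903, exitcode=265"
)


def describe_exit(line):
    """Return (exit_code, signal_number, signal_name) for a record with exitcode=N.

    signal_number and signal_name are None when the code is not above the
    assumed offset; the whole result is None when no exitcode field is present.
    """
    match = re.search(r"exitcode=(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    if code <= EXIT_SIGNAL_OFFSET:
        return code, None, None
    signum = code - EXIT_SIGNAL_OFFSET
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The signal number is not defined on the platform running this sketch.
        name = "unknown"
    return code, signum, name


result = describe_exit(MOM_LINE)
print(result)
```

Under that assumption, 265 decodes to signal 9, which agrees with the kill_task records: the task was killed by the rerun request rather than failing on its own.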
3)
Output of qmgr -c "print server":
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long max_running = 16
set queue long resources_default.nodes = 1
set queue long resources_default.walltime = 01:00:00
set queue long enabled = True
set queue long started = True
#
# Create and define queue short
#
create queue short
set queue short queue_type = Execution
set queue short max_running = 16
set queue short resources_default.nodes = 1
set queue short resources_default.walltime = 01:00:00
set queue short enabled = True
set queue short started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = root at creambo
set server operators = root at creambo
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 150
set server tcp_timeout = 6
set server job_stat_rate = 30
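For comparing these settings against the log timings, a small Python sketch that reads the "set server" and "set queue" lines into dictionaries may help. The function name `parse_qmgr` is hypothetical, and the sample is a subset of the output above.

```python
# A subset of the "print server" output above.
QMGR_OUTPUT = """\
#
# Set server attributes.
#
set queue batch resources_default.walltime = 01:00:00
set queue long max_running = 16
set server scheduling = True
set server default_queue = batch
set server log_events = 511
set server scheduler_iteration = 600
"""


def parse_qmgr(text):
    """Return (server_attrs, queue_attrs) parsed from 'set ...' lines.

    server_attrs maps attribute -> value; queue_attrs maps
    queue name -> {attribute -> value}. All values stay as strings.
    """
    server = {}
    queues = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip comments, blank lines, and "create queue ..." lines.
        if not line.startswith("set "):
            continue
        target, _, value = line[len("set "):].partition(" = ")
        parts = target.split()
        if len(parts) == 2 and parts[0] == "server":
            server[parts[1]] = value
        elif len(parts) == 3 and parts[0] == "queue":
            queues.setdefault(parts[1], {})[parts[2]] = value
    return server, queues


server, queues = parse_qmgr(QMGR_OUTPUT)
print(server["scheduler_iteration"], bin(int(server["log_events"])))
```

Two things stand out when the values are read this way: scheduler_iteration is 600 seconds, and log_events = 511 is 0x1FF, meaning all nine low bits are set, which fits the 0100-coded records appearing in the server log.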
4)
Output of momctl -d 0 -h wild1,...,wild8:
Host: wild1.camero-tech.com/wild1 Server: creambo Version: 1.2.0p6
PID: 2766
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18309 seconds
Last Msg From Server: 421 seconds (ReadyToCommit)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild2.camero-tech.com/wild2 Server: creambo Version: 1.2.0p6
PID: 2722
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18280 seconds
Last Msg From Server: 8254 seconds (CLUSTER_ADDRS)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild3.camero-tech.com/wild3 Server: creambo Version: 1.2.0p6
PID: 2738
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18284 seconds
Last Msg From Server: 8253 seconds (CLUSTER_ADDRS)
Last Msg To Server: 19 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild4.camero-tech.com/wild4 Server: creambo Version: 1.2.0p6
PID: 2767
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18292 seconds
Last Msg From Server: 8253 seconds (CLUSTER_ADDRS)
Last Msg To Server: 19 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild5.camero-tech.com/wild5 Server: creambo Version: 1.2.0p6
PID: 2757
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18266 seconds
Last Msg From Server: 8254 seconds (CLUSTER_ADDRS)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild6.camero-tech.com/wild6 Server: creambo Version: 1.2.0p6
PID: 2704
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18264 seconds
Last Msg From Server: 8253 seconds (CLUSTER_ADDRS)
Last Msg To Server: 19 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild7.camero-tech.com/wild7 Server: creambo Version: 1.2.0p6
PID: 2758
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18265 seconds
Last Msg From Server: 8254 seconds (CLUSTER_ADDRS)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild8.camero-tech.com/wild8 Server: creambo Version: 1.2.0p6
PID: 2720
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18270 seconds
Last Msg From Server: 8254 seconds (CLUSTER_ADDRS)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
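Since the eight momctl reports differ in only a few fields, a short Python sketch can pull out the "Last Msg From Server" field per host so the odd one out is easy to spot. The function name `last_server_messages` is hypothetical, and the sample holds only the first two hosts from the output above.

```python
import re

# The first two host reports from the momctl output above.
MOMCTL_OUTPUT = """\
Host: wild1.camero-tech.com/wild1 Server: creambo Version: 1.2.0p6
PID: 2766
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18309 seconds
Last Msg From Server: 421 seconds (ReadyToCommit)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
Host: wild2.camero-tech.com/wild2 Server: creambo Version: 1.2.0p6
PID: 2722
HomeDirectory: /usr/spool/PBS/mom_priv
MOM active: 18280 seconds
Last Msg From Server: 8254 seconds (CLUSTER_ADDRS)
Last Msg To Server: 20 seconds
LOGLEVEL: 9 (use SIGUSR1/SIGUSR2 to adjust)
JobList: NONE
diagnostics complete
"""

HOST_RE = re.compile(r"^Host:\s+(\S+)")
LAST_FROM_RE = re.compile(r"^Last Msg From Server:\s+(\d+) seconds \((\w+)\)")


def last_server_messages(text):
    """Map short host name -> (seconds since last server message, message type)."""
    result = {}
    host = None
    for line in text.splitlines():
        host_match = HOST_RE.match(line)
        if host_match:
            # "wild1.camero-tech.com/wild1" -> "wild1"
            host = host_match.group(1).split("/")[-1]
            continue
        msg_match = LAST_FROM_RE.match(line)
        if msg_match and host is not None:
            result[host] = (int(msg_match.group(1)), msg_match.group(2))
    return result


summary = last_server_messages(MOMCTL_OUTPUT)
for name, (age, kind) in sorted(summary.items()):
    print(name, age, kind)
```

In the full output, wild1 is the only node whose last server message is ReadyToCommit (421 seconds earlier); the other seven all show CLUSTER_ADDRS from roughly 8250 seconds earlier. That may point to wild1 as the node the job was dispatched to, though the output alone does not confirm it.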
5)
The mom_priv/config file on a compute node:
$clienthost creambo
$logevent 255
$loglevel 9
$usecp *:/home /home
$clienthost wild1.camero-tech.com
$clienthost wild2.camero-tech.com
$clienthost wild3.camero-tech.com
$clienthost wild4.camero-tech.com
$clienthost wild5.camero-tech.com
$clienthost wild6.camero-tech.com
$clienthost wild7.camero-tech.com
$clienthost wild8.camero-tech.com
I hope this is enough to give an initial indication of the
problem.
Thanks,
Zvika