[torqueusers] Torque's scheduling problem

Zvika Galant zvika at Camero-Tech.com
Sun Oct 16 07:09:27 MDT 2005


Hi,

I'm using Torque 1.2.0p6 on our cluster of 8 nodes, named wild1..wild8.
The server is named creambo.

After submitting the job "echo sleep 600 | qsub -r y" and then calling
qrerun on it, the job returns to the queued state but is never run again.

I see the same behaviour when using the Maui scheduler (version 3.2.6p11).

 

1)

The server log for job #2592 follows:

 

10/16/2005 12:44:40;0100;PBS_Server;Job;2592.creambo;enqueuing into batch, state 1 hop 1
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Queued at request of zv
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Modified at request of Scheduler@creambo
10/16/2005 12:44:40;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler@creambo
10/16/2005 12:44:50;0008;PBS_Server;Job;2592.creambo;Job Rerun at request of root@creambo
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler@creambo
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1
10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;Job Modified at request of Scheduler@creambo
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;Job Modified at request of Scheduler@creambo
10/16/2005 12:54:56;0100;PBS_Server;Req;;Type RunJob request received from Scheduler@creambo, sock=10
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;Job Run at request of Scheduler@creambo
10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1
10/16/2005 12:54:56;0080;PBS_Server;Req;req_reject;Reject reply code=15041(Execution server rejected request MSG=send failed, STARTING), aux=0, type=RunJob, from Scheduler@creambo
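To make the timing in the server log easier to read, here is a minimal Python sketch that splits a record on ';' and measures the gap between the two "MOM rejected" entries. The field names (timestamp, event, daemon, obj_type, obj_name, message) are labels chosen for this example, not official Torque terminology, and reading the second field as hexadecimal is an assumption. The two sample lines are copied from the server log above.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: datetime
    event: int       # second field, assumed to be a hex event-class code
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_record(line: str) -> LogRecord:
    """Split one semicolon-delimited log line into its six fields."""
    # Split on the first five ';' only, so the message may contain ';' itself.
    stamp, event, daemon, obj_type, obj_name, message = line.split(";", 5)
    return LogRecord(
        timestamp=datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S"),
        event=int(event, 16),
        daemon=daemon.strip(),
        obj_type=obj_type,
        obj_name=obj_name,
        message=message.strip(),
    )


# Two records copied from the server log for job 2592.
SAMPLE = [
    "10/16/2005 12:44:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1",
    "10/16/2005 12:54:56;0008;PBS_Server;Job;2592.creambo;unable to run job, MOM rejected/rc=1",
]

records = [parse_record(line) for line in SAMPLE]
rejections = [r for r in records if "MOM rejected" in r.message]
gap_seconds = (rejections[1].timestamp - rejections[0].timestamp).total_seconds()
print(gap_seconds)
```

The gap comes out at 600 seconds, the same value as the scheduler_iteration setting in the qmgr output in section 3, so the second attempt looks like the next regular scheduling cycle rather than an immediate retry.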

 

2)

The mom log from the relevant node for job #2592 follows:

 

10/16/2005 12:44:57;0001;   pbs_mom;Job;job_nodes;job: 2592.creambo numnodes=1 numvnod=1
10/16/2005 12:44:57;0008;   pbs_mom;Job;2592.creambo;evaluating limits for job
10/16/2005 12:44:57;0001;   pbs_mom;Job;2592.creambo;phase 2 of job launch successfully completed
10/16/2005 12:44:57;0001;   pbs_mom;Job;2592.creambo;saving task (TMomFinalizeJob3)
10/16/2005 12:44:57;0008;   pbs_mom;Job;task_save;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:44:57;0001;   pbs_mom;Job;TMomFinalizeJob3;job 2592.creambo started, pid = 28903
10/16/2005 12:44:57;0001;   pbs_mom;Job;2592.creambo;job successfully started
10/16/2005 12:44:57;0008;   pbs_mom;Job;2592.creambo;job 2592.creambo reported successful start on 1 node(s)
10/16/2005 12:45:07;0008;   pbs_mom;Job;2592.creambo;kill_job
10/16/2005 12:45:07;0002;   pbs_mom;n/a;run_pelog;userepilog script '/usr/spool/PBS/mom_priv/epilogue.precancel' does not exist (cwd: /usr/spool/PBS/mom_priv)
10/16/2005 12:45:07;0008;   pbs_mom;Job;2592.creambo;kill_job found a task to kill
10/16/2005 12:45:07;0008;   pbs_mom;Job;2592.creambo;sending signal 9 to task
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;kill_task: killing pid 28903 task 1 with sig 9
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;kill_task: killing pid 28918 task 1 with sig 9
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;kill_task: killing pid 28919 task 1 with sig 9
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;kill_job done
10/16/2005 12:45:12;0002;   pbs_mom;n/a;cput_sum;cput_sum: session=28903 pid=28903 cputime=0 (cputfactor=1.000000)
10/16/2005 12:45:12;0008;   pbs_mom;Job;scan_for_terminated;for job 2592.creambo, task 1, pid=28903, exitcode=265
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;sending signal 9 to task
10/16/2005 12:45:12;0008;   pbs_mom;Job;task_save;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:45:12;0080;   pbs_mom;Job;2592.creambo;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;Terminated
10/16/2005 12:45:12;0008;   pbs_mom;Req;send_sisters;sending command KILL_JOB (2)
10/16/2005 12:45:12;0008;   pbs_mom;Job;task_save;saving task in /usr/spool/PBS/mom_priv/jobs/2592.creamb.TK/0000000001
10/16/2005 12:45:12;0080;   pbs_mom;Job;2592.creambo;local task termination detected.  killing job
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;kill_job
10/16/2005 12:45:12;0002;   pbs_mom;n/a;run_pelog;userepilog script '/usr/spool/PBS/mom_priv/epilogue.precancel' does not exist (cwd: /usr/spool/PBS/mom_priv)
10/16/2005 12:45:12;0008;   pbs_mom;Job;2592.creambo;kill_job done
10/16/2005 12:45:12;0080;   pbs_mom;Job;2592.creambo;performing job clean-up
10/16/2005 12:45:12;0080;   pbs_mom;Job;2592.creambo;deleting job 2592.creambo in state EXITED
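When reading the server log and the mom log side by side, it may help to check how far apart their timestamps are for what appear to be the same events. The sketch below does that arithmetic in Python. The pairing of events (server "Job Run" with mom "job started", server "Job Rerun" with mom "kill_job") is an assumption made for illustration; the timestamps themselves are copied from the two logs above.

```python
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"

# (label, server-log timestamp, mom-log timestamp).
# The pairing is an assumption; the timestamps come from the posted logs.
PAIRS = [
    ("Job Run / job started", "10/16/2005 12:44:40", "10/16/2005 12:44:57"),
    ("Job Rerun / kill_job", "10/16/2005 12:44:50", "10/16/2005 12:45:07"),
]


def offset_seconds(server_stamp: str, mom_stamp: str) -> float:
    """Return mom-log time minus server-log time, in seconds."""
    delta = datetime.strptime(mom_stamp, FMT) - datetime.strptime(server_stamp, FMT)
    return delta.total_seconds()


offsets = {label: offset_seconds(srv, mom) for label, srv, mom in PAIRS}
for label, seconds in offsets.items():
    print(f"{label}: mom log is {seconds:+.0f} s relative to server log")
```

Both pairs differ by the same 17 seconds. That could be clock skew between creambo and the node or a constant delivery delay; the logs alone do not say which, but the offset is worth keeping in mind when ordering events across the two files.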

 

3)

Output of qmgr -c "print server":

 

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long max_running = 16
set queue long resources_default.nodes = 1
set queue long resources_default.walltime = 01:00:00
set queue long enabled = True
set queue long started = True
#
# Create and define queue short
#
create queue short
set queue short queue_type = Execution
set queue short max_running = 16
set queue short resources_default.nodes = 1
set queue short resources_default.walltime = 01:00:00
set queue short enabled = True
set queue short started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = root@creambo
set server operators = root@creambo
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_ping_rate = 300
set server node_check_rate = 150
set server tcp_timeout = 6
set server job_stat_rate = 30
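The log_events value is a bitmask, and the second field of each log line appears to be the matching event-class bit in hexadecimal. The Python sketch below decodes such a mask. The class names and bit values are quoted from memory of the PBS/Torque log header and should be treated as an assumption to verify against the installed version.

```python
# Event-class bits as recalled from the PBS/Torque log header.
# Assumption: verify against the log.h shipped with your Torque release.
EVENT_CLASSES = {
    0x001: "error",
    0x002: "system",
    0x004: "admin",
    0x008: "job",
    0x010: "job_usage",
    0x020: "security",
    0x040: "scheduler",
    0x080: "debug",
    0x100: "debug2",
}


def decode(mask: int) -> list:
    """Return the names of all event classes enabled in the mask."""
    return [name for bit, name in EVENT_CLASSES.items() if mask & bit]


print(decode(511))  # server log_events setting
print(decode(255))  # $logevent value from mom_priv/config
```

Under that assumption, 511 enables all nine classes on the server, while the $logevent 255 in the mom_priv/config file in section 5 enables everything except the highest debug class.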

 

4)

Output of momctl -d 0 -h wild1,...,wild8:

 

Host: wild1.camero-tech.com/wild1   Server: creambo   Version: 1.2.0p6
PID:                    2766
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18309 seconds
Last Msg From Server:   421 seconds (ReadyToCommit)
Last Msg To Server:     20 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild2.camero-tech.com/wild2   Server: creambo   Version: 1.2.0p6
PID:                    2722
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18280 seconds
Last Msg From Server:   8254 seconds (CLUSTER_ADDRS)
Last Msg To Server:     20 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild3.camero-tech.com/wild3   Server: creambo   Version: 1.2.0p6
PID:                    2738
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18284 seconds
Last Msg From Server:   8253 seconds (CLUSTER_ADDRS)
Last Msg To Server:     19 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild4.camero-tech.com/wild4   Server: creambo   Version: 1.2.0p6
PID:                    2767
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18292 seconds
Last Msg From Server:   8253 seconds (CLUSTER_ADDRS)
Last Msg To Server:     19 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild5.camero-tech.com/wild5   Server: creambo   Version: 1.2.0p6
PID:                    2757
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18266 seconds
Last Msg From Server:   8254 seconds (CLUSTER_ADDRS)
Last Msg To Server:     20 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild6.camero-tech.com/wild6   Server: creambo   Version: 1.2.0p6
PID:                    2704
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18264 seconds
Last Msg From Server:   8253 seconds (CLUSTER_ADDRS)
Last Msg To Server:     19 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild7.camero-tech.com/wild7   Server: creambo   Version: 1.2.0p6
PID:                    2758
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18265 seconds
Last Msg From Server:   8254 seconds (CLUSTER_ADDRS)
Last Msg To Server:     20 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete

Host: wild8.camero-tech.com/wild8   Server: creambo   Version: 1.2.0p6
PID:                    2720
HomeDirectory:          /usr/spool/PBS/mom_priv
MOM active:             18270 seconds
Last Msg From Server:   8254 seconds (CLUSTER_ADDRS)
Last Msg To Server:     20 seconds
LOGLEVEL:               9 (use SIGUSR1/SIGUSR2 to adjust)
JobList:                NONE

diagnostics complete
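With eight nearly identical blocks, the one field that differs is easy to miss. The Python sketch below pulls the "Last Msg From Server" value out of each host block and reports the host that heard from the server most recently. The sample text is an abbreviated copy of the wild1 and wild2 blocks above; the regular expressions are written for this output format only and are not part of any Torque tooling.

```python
import re

# Abbreviated copy of two host blocks from the momctl output.
SAMPLE = """\
Host: wild1.camero-tech.com/wild1   Server: creambo   Version: 1.2.0p6
Last Msg From Server:   421 seconds (ReadyToCommit)
Host: wild2.camero-tech.com/wild2   Server: creambo   Version: 1.2.0p6
Last Msg From Server:   8254 seconds (CLUSTER_ADDRS)
"""

HOST_RE = re.compile(r"^Host:\s+(\S+)")
MSG_RE = re.compile(r"^Last Msg From Server:\s+(\d+) seconds \((\w+)\)")


def last_server_messages(text: str) -> dict:
    """Map each host to (seconds since last server message, message type)."""
    result = {}
    host = None
    for line in text.splitlines():
        host_match = HOST_RE.match(line)
        if host_match:
            host = host_match.group(1)
            continue
        msg_match = MSG_RE.match(line)
        if msg_match and host is not None:
            result[host] = (int(msg_match.group(1)), msg_match.group(2))
    return result


summary = last_server_messages(SAMPLE)
most_recent = min(summary, key=lambda h: summary[h][0])
print(most_recent, summary[most_recent])
```

In the full output only wild1 shows a recent message, of type ReadyToCommit; the other seven nodes last received CLUSTER_ADDRS about 8250 seconds earlier. That suggests wild1 is the node the job was sent to, though the momctl output alone does not confirm it.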

 

5)

The mom_priv/config file on a compute node:

 

$clienthost     creambo
$logevent       255
$loglevel       9
$usecp *:/home /home
$clienthost wild1.camero-tech.com
$clienthost wild2.camero-tech.com
$clienthost wild3.camero-tech.com
$clienthost wild4.camero-tech.com
$clienthost wild5.camero-tech.com
$clienthost wild6.camero-tech.com
$clienthost wild7.camero-tech.com
$clienthost wild8.camero-tech.com

 

 

I hope this is enough to give an initial indication of what the
problem is.

 

Thanks,

Zvika
