[torqueusers] pbs_sched does not run jobs unless forced with qrun

David Ceresuela david.ceresuela at gmail.com
Thu Jun 7 12:02:07 MDT 2012


Hi all,

I have installed torque-4.0.2 on Ubuntu 10.04 and I cannot get the
scheduler to run jobs. If I force a job with qrun it executes, but
otherwise it just sits in the queue.
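For reference, a typical session looks roughly like this (the job in
the logs below was submitted from STDIN, as the "job name = STDIN"
entries show, and forced as root; the sleep command is just a stand-in
for my actual test job):

  david at lucid-tor1:~$ echo "sleep 30" | qsub
  73.lucid-tor1.cps.cloud
  david at lucid-tor1:~$ qstat       # job 73 stays in state Q indefinitely
  root at lucid-tor1:~# qrun 73      # only then does it run on lucid-tor2
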
I have made sure that:
- trqauthd, pbs_server and pbs_sched are running on the head node
- pbs_mom is running on the compute node
- The user I submit the job as exists on both machines with the same
UID and group
- iptables is not interfering, and there is no other firewall in the way
- DNS is working properly, both forward and reverse. The DNS server
runs on a third machine (neither the head nor the compute node)
- pbsnodes shows the compute node as free (full output below)
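
These are roughly the checks behind that list (all standard
TORQUE/Linux commands):

  # on lucid-tor1 (head node)
  ps aux | grep -E 'trqauthd|pbs_server|pbs_sched'
  # on lucid-tor2 (compute node)
  ps aux | grep pbs_mom
  # same UID/GID on both machines
  id david
  # forward and reverse DNS
  host lucid-tor2
  host <ip-of-lucid-tor2>
  # no blocking firewall rules
  iptables -L -n
  # node state as seen by the server
  pbsnodes -a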

This is the setup:

head node
  - hostname: lucid-tor1
  - users: root, david
  - executing: /etc/init.d/trqauthd, pbs_server, pbs_sched (the simplest
scheduler)

compute node
  - hostname: lucid-tor2
  - users: root, david
  - executing: pbs_mom
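
For reference, the mom on lucid-tor2 only needs to be told the server
name; assuming the default TORQUE_HOME, a minimal mom config is just:

  # /var/spool/torque/mom_priv/config on lucid-tor2
  $pbsserver lucid-tor1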

This is the output from qmgr -c 'p s':
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_max.nodes = 2
set queue batch resources_min.nodes = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = lucid-tor1.cps.cloud
set server acl_hosts += lucid-tor1
set server managers = david at lucid-tor1
set server managers += root at lucid-tor1
set server operators = david at lucid-tor1
set server operators += root at lucid-tor1
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 300
set server job_stat_rate = 45
set server poll_jobs = True
set server log_level = 3
set server mom_job_sync = True
set server keep_completed = 10
set server next_job_number = 74
set server moab_array_compatible = True
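
(As an aside, this printout comes from qmgr's "print server" command
and can be saved and replayed to recreate the exact same server
configuration:)

  qmgr -c 'print server' > server_config
  # later, on a freshly initialized server:
  qmgr < server_config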

And this is the output from pbsnodes:
lucid-tor2
     state = free
     np = 1
     ntype = cluster
     status =
rectime=1339089060,varattr=,jobs=,state=free,netload=724555,gres=,loadave=0.00,ncpus=2,physmem=1023292kb,availmem=1458140kb,totmem=1521972kb,idletime=379,nusers=0,nsessions=0,uname=Linux
lucid-tor2 2.6.32-33-server #70-Ubuntu SMP Thu Jul 7 22:28:30 UTC 2011
x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
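
The np = 1 above comes from the server's nodes file; assuming the
default TORQUE_HOME, the entry behind it would look like this (note
that the mom itself reports ncpus=2):

  # /var/spool/torque/server_priv/nodes on lucid-tor1
  lucid-tor2 np=1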


In the logs, I find no errors:

### sched log
06/07/2012 18:58:44;0002; pbs_sched;Svr;Log;Log opened
06/07/2012 18:58:44;0002; pbs_sched;Svr;TokenAct;Account file
/var/spool/torque/sched_priv/accounting/20120607 opened
06/07/2012 18:58:44;0002; pbs_sched;Svr;main;pbs_sched startup pid 1203

### server log
06/07/2012 18:59:28;0100;PBS_Server;Req;;Type AuthenticateUser request
received from david at lucid-tor1.cps.cloud, sock=9
06/07/2012 18:59:28;0100;PBS_Server;Req;;Type QueueJob request received
from david at lucid-tor1.cps.cloud, sock=8
06/07/2012 18:59:28;0100;PBS_Server;Req;;Type JobScript request received
from david at lucid-tor1.cps.cloud, sock=8
06/07/2012 18:59:28;0100;PBS_Server;Req;;Type ReadyToCommit request
received from david at lucid-tor1.cps.cloud, sock=8
06/07/2012 18:59:28;0100;PBS_Server;Req;;Type Commit request received from
david at lucid-tor1.cps.cloud, sock=8
06/07/2012 18:59:28;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from TRANSIT-TRANSICM to QUEUED-QUEUED
(1-10)
06/07/2012 18:59:28;0100;PBS_Server;Job;73.lucid-tor1.cps.cloud;enqueuing
into batch, state 1 hop 1
06/07/2012 18:59:28;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;Job Queued
at request of david at lucid-tor1.cps.cloud, owner = david at lucid-tor1.cps.cloud,
job name = STDIN, queue = batch
06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:219: mom_port 15002  - rm_port 15003
06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:219) (sock
10)
06/07/2012 18:59:45;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 18:59:45;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 18:59:45;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0
06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:340: mom_port 15002  - rm_port 15003
06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:340) (sock
8)
06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:00:30;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:00:30;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0

Then I run qrun:
root at lucid-tor1: qrun 73


And these are the logs after qrun:

### sched log (unchanged - the scheduler logged nothing after startup)
06/07/2012 18:58:44;0002; pbs_sched;Svr;Log;Log opened
06/07/2012 18:58:44;0002; pbs_sched;Svr;TokenAct;Account file
/var/spool/torque/sched_priv/accounting/20120607 opened
06/07/2012 18:58:44;0002; pbs_sched;Svr;main;pbs_sched startup pid 1203

### server log
...
06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:340: mom_port 15002  - rm_port 15003
06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:340) (sock
8)
06/07/2012 19:00:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:00:30;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:00:30;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0
06/07/2012 19:01:14;0004;PBS_Server;Svr;check_nodes_work;node
lucid-tor2.cps.cloud not detected in 1339088474 seconds, marking node down
06/07/2012 19:01:14;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2.cps.cloud - state=512, newstate=2
06/07/2012 19:01:14;0040;PBS_Server;Req;update_node_state;node
lucid-tor2.cps.cloud marked down
06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:279: mom_port 15002  - rm_port 15003
06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:279) (sock
9)
06/07/2012 19:01:15;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:01:15;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:01:15;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0
06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:251: mom_port 15002  - rm_port 15003
06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:251) (sock
8)
06/07/2012 19:02:00;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:02:00;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:02:00;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0
06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:448: mom_port 15002  - rm_port 15003
06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:448) (sock
9)
06/07/2012 19:02:45;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:02:45;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:02:45;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0
06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:650: mom_port 15002  - rm_port 15003
06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:650) (sock
8)
06/07/2012 19:03:30;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:03:30;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:03:30;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=512, newstate=0
06/07/2012 19:03:59;0100;PBS_Server;Req;;Type AuthenticateUser request
received from root at lucid-tor1.cps.cloud, sock=8
06/07/2012 19:03:59;0100;PBS_Server;Req;;Type RunJob request received from
root at lucid-tor1.cps.cloud, sock=9
06/07/2012 19:03:59;0040;PBS_Server;Req;set_nodes;allocating nodes for job
73.lucid-tor1.cps.cloud with node expression '1'
06/07/2012 19:03:59;0040;PBS_Server;Req;set_nodes;job
73.lucid-tor1.cps.cloud allocated 1 nodes (nodelist=lucid-tor2/0)
06/07/2012 19:03:59;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;Job Run at
request of root at lucid-tor1.cps.cloud
06/07/2012 19:03:59;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from QUEUED-QUEUED to RUNNING-PRERUN
(4-40)
06/07/2012 19:03:59;0002;PBS_Server;Job;73.lucid-tor1.cps.cloud;child
reported success for job after 0 seconds (dest=???), rc=0
06/07/2012 19:03:59;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from RUNNING-TRNOUTCM to RUNNING-RUNNING
(4-42)
06/07/2012 19:03:59;0002;PBS_Server;Svr;PBS_Server;Torque Server Version =
4.0.2, loglevel = 3
06/07/2012 19:04:03;0100;PBS_Server;Req;;Type StatusJob request received
from pbs_mom at lucid-tor2.cps.cloud, sock=10
06/07/2012 19:04:03;0100;PBS_Server;Req;;Type JobObituary request received
from pbs_mom at lucid-tor2.cps.cloud, sock=8
06/07/2012 19:04:03;0009;PBS_Server;Job;73.lucid-tor1.cps.cloud;obit
received - updating final job usage info
06/07/2012 19:04:03;0009;PBS_Server;Job;73.lucid-tor1.cps.cloud;job exit
status 0 handled
06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from RUNNING-RUNNING to EXITING-EXITING
(5-50)
06/07/2012 19:04:03;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;preparing
to send 'e' mail for job 73.lucid-tor1.cps.cloud to
david at lucid-tor1.cps.cloud (Exit_status=0)
06/07/2012 19:04:03;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Not sending
email: User does not want mail of this type.
06/07/2012 19:04:03;0010;PBS_Server;Job;73.lucid-tor1.cps.cloud;Exit_status=0
06/07/2012 19:04:03;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;on_job_exit
valid pjob: 0x2493f30 (substate=50)
06/07/2012 19:04:03;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;JOB_SUBSTATE_EXITING
06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from EXITING-EXITING to EXITING-RETURNSTD
(5-70)
06/07/2012 19:04:03;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from EXITING-RETURNSTD to
EXITING-STAGEDEL (5-52)
06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Post job
file processing error; job 73.lucid-tor1.cps.cloud on host lucid-tor2/0
06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;request to
copy stageout files failed on node 'lucid-tor2/0' for job
73.lucid-tor1.cps.cloud
06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;preparing
to send 'o' mail for job 73.lucid-tor1.cps.cloud to
david at lucid-tor1.cps.cloud (request to copy stageout files failed on node
'lucid-tor2/0' for)
06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from EXITING-STAGEDEL to EXITING-EXITED
(5-53)
06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from EXITING-EXITED to EXITING-ABORT
(5-54)
06/07/2012 19:04:12;000d;PBS_Server;Job;73.lucid-tor1.cps.cloud;Email 'o'
to david at lucid-tor1.cps.cloud failed: Child process 'sendmail -f adm
david at lucid-tor1.cps.cloud' returned 127 (errno 0:Success)
06/07/2012 19:04:12;0040;PBS_Server;Req;free_nodes;freeing nodes for job
73.lucid-tor1.cps.cloud
06/07/2012 19:04:12;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting
job 73.lucid-tor1.cps.cloud state from EXITING-ABORT to COMPLETE-COMPLETE
(6-59)
06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;message received
from addr 155.210.155.xx7:175: mom_port 15002  - rm_port 15003
06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;message STATUS (4)
received from mom on host lucid-tor2.cps.cloud (155.210.155.xx7:175) (sock
9)
06/07/2012 19:04:15;0004;PBS_Server;Svr;svr_is_request;IS_STATUS received
from lucid-tor2.cps.cloud
06/07/2012 19:04:15;0040;PBS_Server;Req;is_stat_get;received status from
node lucid-tor2.cps.cloud
06/07/2012 19:04:15;0040;PBS_Server;Req;update_node_state;adjusting state
for node lucid-tor2 - state=0, newstate=0
06/07/2012 19:04:27;0008;PBS_Server;Job;73.lucid-tor1.cps.cloud;on_job_exit
valid pjob: 0x2493f30 (substate=59)
06/07/2012 19:04:27;0100;PBS_Server;Job;73.lucid-tor1.cps.cloud;dequeuing
from batch, state COMPLETE

### mom log
06/07/2012 18:59:00;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 18:59:45;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 19:00:30;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 19:00:36;0002;   pbs_mom;Svr;pbs_mom;Torque Mom Version = 4.0.2,
loglevel = 3
06/07/2012 19:01:15;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 19:02:00;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 19:02:45;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 19:03:30;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1
06/07/2012 19:03:59;0008;   pbs_mom;Job;mom_process_request;request type
QueueJob from host lucid-tor1.cps.cloud allowed
06/07/2012 19:03:59;0008;   pbs_mom;Job;mom_process_request;request type
JobScript from host lucid-tor1.cps.cloud allowed
06/07/2012 19:03:59;0008;   pbs_mom;Job;mom_process_request;request type
ReadyToCommit from host lucid-tor1.cps.cloud allowed
06/07/2012 19:03:59;0008;   pbs_mom;Job;mom_process_request;request type
Commit from host lucid-tor1.cps.cloud allowed
06/07/2012 19:03:59;0001;   pbs_mom;Job;job_nodes;job:
73.lucid-tor1.cps.cloud numnodes=1 numvnod=1
06/07/2012 19:03:59;0001;   pbs_mom;Job;73.lucid-tor1.cps.cloud;phase 2 of
job launch successfully completed
06/07/2012 19:03:59;0001;   pbs_mom;Job;TMomFinalizeJob3;Job
73.lucid-tor1.cps.cloud read start return code=0 session=959
06/07/2012 19:03:59;0001;   pbs_mom;Job;TMomFinalizeJob3;job
73.lucid-tor1.cps.cloud started, pid = 959
06/07/2012 19:03:59;0001;   pbs_mom;Job;73.lucid-tor1.cps.cloud;exec_job_on_ms:job successfully started
06/07/2012 19:04:03;0008;   pbs_mom;Job;scan_for_terminated;pid 959
harvested for job 73.lucid-tor1.cps.cloud, task 1, exitcode=0
06/07/2012 19:04:03;0080;   pbs_mom;Job;73.lucid-tor1.cps.cloud;scan_for_terminated: job 73.lucid-tor1.cps.cloud task 1 terminated, sid=959
06/07/2012 19:04:03;0080;   pbs_mom;Svr;scan_for_exiting;searching for
exiting jobs
06/07/2012 19:04:03;0008;   pbs_mom;Job;kill_job;scan_for_exiting: sending
signal 9, "KILL" to job 73.lucid-tor1.cps.cloud, reason: local task
termination detected
06/07/2012 19:04:03;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
06/07/2012 19:04:03;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
06/07/2012 19:04:03;0080;   pbs_mom;Svr;preobit_reply;in while loop, no
error from job stat
06/07/2012 19:04:03;0080;   pbs_mom;Job;73.lucid-tor1.cps.cloud;performing
job clean-up in preobit_reply()
06/07/2012 19:04:03;0080;   pbs_mom;Job;73.lucid-tor1.cps.cloud;epilog
subtask created with pid 962 - substate set to JOB_SUBSTATE_OBIT -
registered post_epilogue
06/07/2012 19:04:03;0080;   pbs_mom;Req;post_epilogue;preparing obit
message for job 73.lucid-tor1.cps.cloud
06/07/2012 19:04:03;0080;   pbs_mom;Job;73.lucid-tor1.cps.cloud;obit sent
to server
06/07/2012 19:04:03;0008;   pbs_mom;Job;mom_process_request;request type
CopyFiles from host lucid-tor1.cps.cloud allowed
06/07/2012 19:04:03;0008;   pbs_mom;Job;73.lucid-tor1.cps.cloud;forking to
user, uid: 1000  gid: 1000  homedir: '/home/david'
06/07/2012 19:04:12;0008;   pbs_mom;Job;scan_for_terminated;pid 963 not
tracked, statloc=0, exitval=0
06/07/2012 19:04:12;0008;   pbs_mom;Job;mom_process_request;request type
DeleteJob from host lucid-tor1.cps.cloud allowed
06/07/2012 19:04:12;0008;   pbs_mom;Job;73.lucid-tor1.cps.cloud;deleting job
06/07/2012 19:04:12;0080;   pbs_mom;Job;73.lucid-tor1.cps.cloud;deleting
job 73.lucid-tor1.cps.cloud in state EXITED
06/07/2012 19:04:12;0080;   pbs_mom;Job;73.lucid-tor1.cps.cloud;removed job
script
06/07/2012 19:04:15;0002;   pbs_mom;n/a;mom_server_update_stat;status
update successfully sent to lucid-tor1

So the job gets executed on the compute node, but only if I force it with
qrun.
I have already found these related threads:
- http://serverfault.com/questions/258195/torque-jobs-does-not-enter-e-state-unless-qrun
- http://www.clusterresources.com/pipermail/torqueusers/2004-October/000871.html
- http://www.supercluster.org/pipermail/torqueusers/2011-April/012609.html
but their answers don't help much.

Did I miss anything?
If not, do you have any idea what could be going wrong?
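
If it helps, I can also post the output of any of these:

  qstat -f 73                # full job attributes, incl. any comment
  momctl -d 3 -h lucid-tor2  # mom diagnostics from the head node
  pbsnodes -a                # full node report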

Thank you very much.