[torqueusers] Mom Terminates Job before it is run

Adam Emerich aemerich at us.ibm.com
Tue Jul 10 09:03:16 MDT 2007


I am trying to integrate TORQUE into our cluster, but I am having problems
starting jobs on the mom.  The cluster is set up as follows:

Management Node (Torque Server) -> Service Node -> Compute Nodes

In our case the management node has a direct connection to the compute
nodes via a VLAN; however, the compute nodes must have their requests
forwarded by the Ethernet switch to the management node on a separate VLAN.
For this reason the compute nodes use "mnc" as the hostname for the
management node, but the true hostname of the management node is "mn".
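
Because the compute nodes know the management node as "mnc" while the server
calls itself "mn", it may be worth checking how each name resolves from a
compute node. The following is a minimal sketch, not a TORQUE tool: the
function names are hypothetical, and it only reports what the local resolver
returns. Whether name resolution is actually the cause of the failure is an
assumption, not something the logs confirm.

```python
import socket


def resolve_names(names, resolver=socket.gethostbyname):
    """Map each hostname to its resolved address, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolver(name)
        except OSError:
            # socket.gaierror and socket.herror are both subclasses of OSError
            results[name] = None
    return results


def report(results):
    """Turn the lookup results into printable lines."""
    lines = []
    for name, addr in results.items():
        lines.append(f"{name}: {addr if addr else 'does not resolve'}")
    return lines


if __name__ == "__main__":
    # "mn" and "mnc" are the two names used for the management node above.
    for line in report(resolve_names(["mn", "mnc"])):
        print(line)
```

Run on a compute node, this shows whether "mn" (the name embedded in the job
ID 19.mn and in the server configuration) resolves there at all, and whether
it points to the same address as "mnc".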

My jobs seem to get scheduled and sent off to the mom on the compute
nodes; however, the mom client terminates the job.  In this example I am
trying to run an interactive job.

Here are the pertinent logs:

Mom Client (Tracejob)
Job: 19.mn

07/10/2007 04:31:59  M    ready to commit job
07/10/2007 04:31:59  M    ready to commit job completed
07/10/2007 04:31:59  M    committing job
07/10/2007 04:31:59  M    starting job execution
07/10/2007 04:31:59  M    evaluating limits for job
07/10/2007 04:31:59  M    phase 2 of job launch successfully completed
07/10/2007 04:31:59  M    job execution started
07/10/2007 04:31:59  M    start failed on unknown node
07/10/2007 04:31:59  M    local task termination detected.  killing job
07/10/2007 04:31:59  M    kill_job
07/10/2007 04:31:59  M    kill_job done
07/10/2007 04:31:59  M    performing job clean-up
07/10/2007 04:31:59  M    modifying attribute resource of job (value: '???')
07/10/2007 04:31:59  M    Job Modified at request of PBS_Server at roadrunner.cluster.net
07/10/2007 04:32:05  M    deleting job 19.mn in state EXITED
07/10/2007 04:32:05  M    removing job
07/10/2007 04:32:05  M    removed job file

/var/log/messages
Jul 10 09:30:37 n01-001-0 pbs_mom: Success (0) in TMomFinalizeChild, cannot open qsub sock
Jul 10 09:30:42 n01-001-0 logger: killing n01-001-0 4172
Jul 10 09:30:42 n01-001-0 logger: killing n01-001-0 4173
Jul 10 09:30:42 n01-001-0 logger: KILLING n01-001-0 4173
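
The "cannot open qsub sock" message suggests that the mom could not open a
TCP connection back to the qsub process, which is how an interactive job is
attached to the submitting terminal. A rough way to test whether the compute
node can open a TCP connection to a given host and port is sketched below.
The helper name is hypothetical, and the host and port passed to it are
placeholders: the port qsub listens on is chosen when the job is submitted.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling can_connect("mn", port) and can_connect("mnc", port) from the compute
node, against a test listener on the management node, would show whether the
connection fails only for one of the two names.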

Server (Tracejob)
Job: 19.mn

07/10/2007 09:30:36  S    Job Queued at request of aemerich at mn, owner = aemerich at mn, job name = STDIN, queue = dque
07/10/2007 09:30:36  A    queue=dque
07/10/2007 09:30:37  S    Job Modified at request of root at mnc
07/10/2007 09:30:37  S    Job Run at request of root at mnc
07/10/2007 09:30:37  S    Job Modified at request of root at mnc
07/10/2007 09:30:37  A    user=aemerich group=users jobname=STDIN queue=dque ctime=1184077836
                          qtime=1184077836 etime=1184077836 start=1184077837 exec_host=n01-001-0/0
                          Resource_List.neednodes=n01-001-0 Resource_List.nodect=1
                          Resource_List.nodes=n01-001-0:ppn=1 Resource_List.walltime=00:01:00
07/10/2007 09:30:42  S    Exit_status=-1
07/10/2007 09:30:42  A    user=aemerich group=users jobname=STDIN queue=dque ctime=1184077836
                          qtime=1184077836 etime=1184077836 start=1184077837 exec_host=n01-001-0/0
                          Resource_List.neednodes=n01-001-0:ppn=1 Resource_List.nodect=1
                          Resource_List.nodes=n01-001-0:ppn=1 Resource_List.walltime=00:01:00
                          session=0 end=1184077842 Exit_status=-1

Server (config)
Server mn
        server_state = Active
        scheduling = True
        total_jobs = 0
        state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0
        managers = root at mn,root at mnc
        default_queue = dque
        log_events = 127
        mail_from = adm
        query_other_jobs = True
        resources_default.walltime = 00:01:00
        resources_assigned.nodect = 0
        scheduler_iteration = 60
        node_check_rate = 150
        tcp_timeout = 6
        node_pack = False
        pbs_version = 2.1.8
        keep_completed = 300
        submit_hosts = mn,mnc
        server_name = mn

Any help would be greatly appreciated.

Adam Emerich
IBM Corporation - Rochester, MN


