[torqueusers] dead job with Obituary
michael young
mhyoung at valdosta.edu
Fri Mar 9 11:33:27 MST 2007
Hi all,
Background:
We have a cluster of Sun servers.
1 master and 12 slave nodes.
AMD Opteron Processor 248 2.2 GHz, 4 GB RAM, 74 GB SCSI HD
It runs Spartan '04 on Red Hat Enterprise Linux AS release 4 (Nahant
Update 1).
Master node's name: cluster
Slave nodes' names: he1 - he12
Problem:
When I submit a job from the terminal with 'echo "sleep 30" | qsub',
everything works fine.
I can even submit this multiple times and it works.
If I submit a job through Spartan, the job dies or is killed.
Below are the logs.
I know the time stamps don't match. The clock on he12 is off a bit.
Can anyone see anything amiss?
Is any other info needed?
Thank you,
Michael
Server logs on master node
/var/spool/PBS/server_logs/20070309
#############Start#####################
03/09/2007 13:12:20;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan at cluster.chemistry.valdosta.edu, sock=11
03/09/2007 13:12:20;0100;PBS_Server;Req;;Type StatusJob request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan at cluster.chemistry.valdosta.edu, sock=11
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type QueueJob request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type JobScript request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type ReadyToCommit request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type Commit request received from spartan at cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;enqueuing into default, state 1 hop 1
03/09/2007 13:12:31;0002;PBS_Server;Svr;Act;Account file /var/spool/PBS/server_priv/accounting/20070309 opened
03/09/2007 13:12:31;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Queued at request of spartan at cluster.chemistry.valdosta.edu, owner = spartan at cluster.chemistry.valdosta.edu, job name = qsub_script, queue = default
03/09/2007 13:12:31;0040;PBS_Server;Svr;cluster.chemistry.valdosta.edu;Scheduler sent command new
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type StatusNode request received from root at cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type StatusQueue request received from root at cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type StatusJob request received from root at cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type ModifyJob request received from root at cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Modified at request of root at cluster.chemistry.valdosta.edu
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type RunJob request received from root at cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Run at request of root at cluster.chemistry.valdosta.edu
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type ModifyJob request received from root at cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Modified at request of root at cluster.chemistry.valdosta.edu
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at he12.cluster.chemistry.valdosta.edu, sock=12
03/09/2007 13:12:33;0010;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:01
03/09/2007 13:12:56;000d;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Post job file processing error; job 279.cluster.chemistry.valdosta.edu on host he12/0
03/09/2007 13:12:56;0100;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;dequeuing from default, state EXITING
03/09/2007 13:12:56;0040;PBS_Server;Svr;cluster.chemistry.valdosta.edu;Scheduler sent command term
##############END####################
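For anyone who wants to pull the exit status out of a server log like the one above, here is a minimal Python sketch. It assumes each record, once joined onto a single line, has six semicolon-separated fields (timestamp, event code, daemon, object class, object name, message); that layout is inferred from this excerpt only. The names LogRecord, parse_record and exit_status are hypothetical helpers, not part of TORQUE.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    """One log record; the field layout is an assumption based on the excerpt."""
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_record(line: str) -> LogRecord:
    # Split on the first five semicolons only, because the message
    # itself may contain semicolons ("Post job file processing error; job ...").
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"unexpected record layout: {line!r}")
    return LogRecord(*(p.strip() for p in parts))


def exit_status(record: LogRecord) -> Optional[int]:
    # Return the integer after "Exit_status=", or None if the record has none.
    match = re.search(r"Exit_status=(-?\d+)", record.message)
    return int(match.group(1)) if match else None


# The Exit_status record for job 279, joined onto one line.
sample = (
    "03/09/2007 13:12:33;0010;PBS_Server;Job;"
    "279.cluster.chemistry.valdosta.edu;Exit_status=1 "
    "resources_used.cput=00:00:00 resources_used.mem=0kb "
    "resources_used.vmem=0kb resources_used.walltime=00:00:01"
)

rec = parse_record(sample)
print(rec.obj_name, exit_status(rec))
```

Here the job reports Exit_status=1 with a walltime of one second, so the script handed to qsub by Spartan appears to fail almost immediately on he12.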
mom logs on he12
/var/spool/PBS/mom_logs/20070309
#############Start#####################
03/09/2007 12:49:01;0002; pbs_mom;Svr;Log;Log opened
03/09/2007 12:49:01;0001; pbs_mom;Job;TMomFinalizeJob3;job 279.cluster.chemistry.valdosta.edu started, pid = 22230
03/09/2007 12:49:01;0080; pbs_mom;Job;279.cluster.chemistry.valdosta.edu;scan_for_terminated: job 279.cluster.chemistry.valdosta.edu task 1 terminated, sid 22230
03/09/2007 12:49:01;0008; pbs_mom;Job;279.cluster.chemistry.valdosta.edu;Terminated
03/09/2007 12:49:01;0008; pbs_mom;Job;279.cluster.chemistry.valdosta.edu;Job Modified at request of PBS_Server at cluster.chemistry.valdosta.edu
###################End#####################
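To line the two logs up despite the clock difference, the offset can be estimated from one event that both sides record. The sketch below is only an estimate: it assumes the server's JobObituary record (13:12:33) and the mom's Terminated record (12:49:01) refer to the same moment, and it ignores network delay. The function name clock_offset_seconds is hypothetical.

```python
from datetime import datetime

# Timestamp format as it appears in both logs.
FMT = "%m/%d/%Y %H:%M:%S"


def clock_offset_seconds(server_ts: str, mom_ts: str) -> float:
    """Seconds by which the server clock is ahead of the mom clock."""
    delta = datetime.strptime(server_ts, FMT) - datetime.strptime(mom_ts, FMT)
    return delta.total_seconds()


# Server: JobObituary received; mom on he12: job Terminated.
offset = clock_offset_seconds("03/09/2007 13:12:33", "03/09/2007 12:49:01")
print(offset)  # 1412.0
```

Under those assumptions he12 is roughly 23 minutes 32 seconds behind the master node.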