[torqueusers] dead job with Obituary

michael young mhyoung at valdosta.edu
Fri Mar 9 11:33:27 MST 2007


Hi all,

Background:
We have a cluster of Sun servers.
1 master and 12 slave nodes.
AMD Opteron Processor 248 2.2 GHz, 4 GB RAM, 74 GB SCSI HD
It runs Spartan '04 on Red Hat Enterprise Linux AS release 4 (Nahant Update 1).
master node's name: cluster
slave nodes' names: he1 - he12

Problem:
When I submit a job from the terminal with 'echo "sleep 30" | qsub', 
everything works fine.
I can even submit this multiple times and it works.
If I submit a job through Spartan, the job dies, or is killed.
Below are the logs.
I know the time stamps don't match. The clock on he12 is off a bit.
Can anyone see anything amiss?
Is any other info needed?

thank you,
Michael

Server logs on master node
/var/spool/PBS/server_logs/20070309
#############Start#####################
03/09/2007 13:12:20;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan@cluster.chemistry.valdosta.edu, sock=11
03/09/2007 13:12:20;0100;PBS_Server;Req;;Type StatusJob request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type AuthenticateUser request received from spartan@cluster.chemistry.valdosta.edu, sock=11
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type QueueJob request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type JobScript request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type ReadyToCommit request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Req;;Type Commit request received from spartan@cluster.chemistry.valdosta.edu, sock=10
03/09/2007 13:12:31;0100;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;enqueuing into default, state 1 hop 1
03/09/2007 13:12:31;0002;PBS_Server;Svr;Act;Account file /var/spool/PBS/server_priv/accounting/20070309 opened
03/09/2007 13:12:31;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Queued at request of spartan@cluster.chemistry.valdosta.edu, owner = spartan@cluster.chemistry.valdosta.edu, job name = qsub_script, queue = default
03/09/2007 13:12:31;0040;PBS_Server;Svr;cluster.chemistry.valdosta.edu;Scheduler sent command new
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type StatusNode request received from root@cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type StatusQueue request received from root@cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type StatusJob request received from root@cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type ModifyJob request received from root@cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Modified at request of root@cluster.chemistry.valdosta.edu
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type RunJob request received from root@cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Run at request of root@cluster.chemistry.valdosta.edu
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type ModifyJob request received from root@cluster.chemistry.valdosta.edu, sock=9
03/09/2007 13:12:33;0008;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Job Modified at request of root@cluster.chemistry.valdosta.edu
03/09/2007 13:12:33;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom@he12.cluster.chemistry.valdosta.edu, sock=12
03/09/2007 13:12:33;0010;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Exit_status=1 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:01
03/09/2007 13:12:56;000d;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;Post job file processing error; job 279.cluster.chemistry.valdosta.edu on host he12/0
03/09/2007 13:12:56;0100;PBS_Server;Job;279.cluster.chemistry.valdosta.edu;dequeuing from default, state EXITING
03/09/2007 13:12:56;0040;PBS_Server;Svr;cluster.chemistry.valdosta.edu;Scheduler sent command term
##############END####################
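
For anyone skimming the server log above, the records appear to be semicolon-separated, with the free-text message last. The sketch below is only an illustration, not part of TORQUE: it splits one record into six fields and pulls out the Exit_status value. The field names (timestamp, event_code, daemon, obj_class, obj_name, message) are labels assumed here for readability, and the helper names parse_record and exit_status are hypothetical. The sample lines are copied from the logs in this post.

```python
import re
from typing import NamedTuple, Optional


class PbsRecord(NamedTuple):
    # Field names are assumed labels, not official TORQUE terminology.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_record(line: str) -> PbsRecord:
    """Split one unwrapped log record into its six fields."""
    # Split on the first five semicolons only: the message itself
    # may contain ';' (see the "Post job file processing error" line).
    parts = line.split(";", 5)
    if len(parts) != 6:
        raise ValueError(f"not a complete log record: {line!r}")
    return PbsRecord(*(p.strip() for p in parts))


def exit_status(record: PbsRecord) -> Optional[int]:
    """Return the Exit_status value from the message, or None if absent."""
    match = re.search(r"Exit_status=(-?\d+)", record.message)
    return int(match.group(1)) if match else None


# Sample records copied from the logs in this post (wrapped lines rejoined).
EXIT_LINE = (
    "03/09/2007 13:12:33;0010;PBS_Server;Job;"
    "279.cluster.chemistry.valdosta.edu;Exit_status=1 "
    "resources_used.cput=00:00:00 resources_used.mem=0kb "
    "resources_used.vmem=0kb resources_used.walltime=00:00:01"
)
POST_LINE = (
    "03/09/2007 13:12:56;000d;PBS_Server;Job;"
    "279.cluster.chemistry.valdosta.edu;Post job file processing error; "
    "job 279.cluster.chemistry.valdosta.edu on host he12/0"
)

if __name__ == "__main__":
    for line in (EXIT_LINE, POST_LINE):
        rec = parse_record(line)
        print(rec.timestamp, rec.event_code, exit_status(rec), rec.message)
```

Limiting the split to five semicolons keeps the "Post job file processing error; job ..." message in one piece. Records that the mail archive wrapped across several lines would need to be rejoined before being passed to parse_record.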

mom logs on he12
/var/spool/PBS/mom_logs/20070309
#############Start#####################
03/09/2007 12:49:01;0002;   pbs_mom;Svr;Log;Log opened
03/09/2007 12:49:01;0001;   pbs_mom;Job;TMomFinalizeJob3;job 279.cluster.chemistry.valdosta.edu started, pid = 22230
03/09/2007 12:49:01;0080;   pbs_mom;Job;279.cluster.chemistry.valdosta.edu;scan_for_terminated: job 279.cluster.chemistry.valdosta.edu task 1 terminated, sid 22230
03/09/2007 12:49:01;0008;   pbs_mom;Job;279.cluster.chemistry.valdosta.edu;Terminated
03/09/2007 12:49:01;0008;   pbs_mom;Job;279.cluster.chemistry.valdosta.edu;Terminated
03/09/2007 12:49:01;0008;   pbs_mom;Job;279.cluster.chemistry.valdosta.edu;Job Modified at request of PBS_Server@cluster.chemistry.valdosta.edu
###################End#####################
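
To line the two logs up despite the clock difference on he12, one rough approach is to estimate the offset from a pair of events that should be nearly simultaneous. The sketch below assumes that the server's "Job Run" record (13:12:33) and the mom's "job started" record (12:49:01) refer to about the same real instant; that pairing is an assumption, and the helper name clock_offset is hypothetical. The timestamps are copied from the logs above.

```python
from datetime import datetime, timedelta

# Timestamp layout used in both logs: MM/DD/YYYY HH:MM:SS
LOG_TIME_FORMAT = "%m/%d/%Y %H:%M:%S"


def clock_offset(server_stamp: str, mom_stamp: str) -> timedelta:
    """Return server time minus mom time for two matching events."""
    server_time = datetime.strptime(server_stamp, LOG_TIME_FORMAT)
    mom_time = datetime.strptime(mom_stamp, LOG_TIME_FORMAT)
    return server_time - mom_time


# Assumed matching pair: "Job Run" on the master, "job started" on he12.
SERVER_JOB_RUN = "03/09/2007 13:12:33"
MOM_JOB_STARTED = "03/09/2007 12:49:01"

if __name__ == "__main__":
    offset = clock_offset(SERVER_JOB_RUN, MOM_JOB_STARTED)
    print(offset)  # 0:23:32
```

Under that assumption he12's clock is roughly 23.5 minutes behind the master, so adding the offset to each mom timestamp puts both logs on the master's timeline.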

