[torqueusers] Re: can't hold the job using qhold

garrick at clusterresources.com garrick at clusterresources.com
Mon Oct 23 09:17:31 MDT 2006


On Mon, Oct 23, 2006 at 03:09:21PM +0100, Vadivelan Ranjith alleged:
> Hi,
> I am using torque-2.1.0. I tried to hold a job (id
> 10439), but the job keeps running; it is not held. I checked
> the server_logs. They show a request to delete job 10362
> (I don't know where that came from), but no job with id 10362
> is running. I checked the mom_logs on the compute node, and the
> jobs are running fine. I just want to hold some jobs. This
> worked earlier, and I don't know what I changed. I can't
> figure it out. Can you help me fix it?

Is the pbs_mom daemon down?  It looks like pbs_server can't contact it.
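
One quick way to test that from the server host is to try opening a TCP connection to the MOM on each compute node. The sketch below is only a rough check and rests on assumptions: that pbs_mom is listening on its default service port (15002), and the 6-second timeout simply mirrors the tcp_timeout shown in your qmgr output. The function name mom_reachable is made up for this example. A successful connect only shows that something is listening on the port, not that the MOM is healthy.

```python
import socket
import sys

# Assumption: pbs_mom uses its default service port; adjust if your site overrides it.
MOM_PORT = 15002


def mom_reachable(host, port=MOM_PORT, timeout=6.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure, and timeout.
        return False


if __name__ == "__main__":
    # Usage: python check_mom.py node1 node2 ...
    for node in sys.argv[1:]:
        state = "reachable" if mom_reachable(node) else "NOT reachable"
        print(f"{node}: pbs_mom port {MOM_PORT} {state}")
```

If a node shows up as not reachable, check on that node whether the pbs_mom process is actually running and whether a firewall sits between it and the server.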

 
> Thanks
> Velan
> 
> ----------------------------------------------------------------
> server_logs
> ----------------------------------------------------------------
> 10/23/2006 19:24:15;0100;PBS_Server;Req;;Type AuthenticateUser request received from root at galaxy.aero.iitb.ac.in, sock=77
> 10/23/2006 19:24:15;0100;PBS_Server;Req;;Type HoldJob request received from root at galaxy.aero.iitb.ac.in, sock=11
> 10/23/2006 19:24:15;0008;PBS_Server;Job;10439.galaxy.aero.iitb.ac.in;Holds u set at request of root at galaxy.aero.iitb.ac.in
> 10/23/2006 19:24:37;0100;PBS_Server;Req;;Type StatusNode request received from root at galaxy.aero.iitb.ac.in, sock=75
> 10/23/2006 19:24:37;0100;PBS_Server;Req;;Type StatusQueue request received from root at galaxy.aero.iitb.ac.in, sock=75
> 10/23/2006 19:24:37;0100;PBS_Server;Req;;Type StatusJob request received from root at galaxy.aero.iitb.ac.in, sock=75
> 10/23/2006 19:24:37;0100;PBS_Server;Req;;Type DeleteJob request received from root at galaxy.aero.iitb.ac.in, sock=75
> 10/23/2006 19:24:37;0008;PBS_Server;Job;10362.galaxy.aero.iitb.ac.in;Job deleted at request of root at galaxy.aero.iitb.ac.in
> 10/23/2006 19:24:37;0001;PBS_Server;Req;;Server could not connect to MOM
> 10/23/2006 19:24:37;0080;PBS_Server;Req;req_reject;Reject reply code=15070(Server could not connect to MOM), aux=0, type=DeleteJob, from root at galaxy.aero.iitb.ac.in
> 10/23/2006 19:24:37;0008;PBS_Server;Job;10362.galaxy.aero.iitb.ac.in;Job sent signal SIGTERM on delete
> 
> 
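
The "Reject reply" record in the server log above is the telling one. If the log is long, something like the following sketch can pull out every rejected request. It assumes the usual semicolon-separated log layout with one record per line (as in the log file itself, not as wrapped by the mail client). The helper name find_rejects and the regular expression are illustrative, not part of TORQUE.

```python
import re

# Matches e.g. "Reject reply code=15070(Server could not connect to MOM), aux=0, type=DeleteJob, ..."
REJECT_RE = re.compile(r"Reject reply code=(\d+)\((.*?)\).*?type=(\w+)")


def find_rejects(lines):
    """Yield (timestamp, code, reason, request_type) for each rejected request."""
    for line in lines:
        # Fields: timestamp; event code; daemon; object class; object name; message
        fields = line.strip().split(";", 5)
        if len(fields) < 6:
            continue  # not a complete log record
        match = REJECT_RE.search(fields[5])
        if match:
            yield fields[0], int(match.group(1)), match.group(2), match.group(3)


# Two records taken from the server log quoted above, unwrapped to one line each.
SAMPLE = [
    "10/23/2006 19:24:37;0001;PBS_Server;Req;;Server could not connect to MOM",
    "10/23/2006 19:24:37;0080;PBS_Server;Req;req_reject;Reject reply "
    "code=15070(Server could not connect to MOM), aux=0, type=DeleteJob, "
    "from root at galaxy.aero.iitb.ac.in",
]

for record in find_rejects(SAMPLE):
    print(record)
```

On the two sample records this reports a single rejection: code 15070, "Server could not connect to MOM", for the DeleteJob request.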
> ---------------------------------------------------------------
> mom_logs
> ---------------------------------------------------------------
> 10/23/2006 15:10:59;0002;   pbs_mom;Svr;Log;Log opened
> 10/23/2006 15:10:59;0080;   pbs_mom;Job;10413.galaxy.aero.iitb.ac.in;scan_for_terminated: job 10413.galaxy.aero.iitb.ac.in task 1 terminated, sid 2576
> 10/23/2006 15:10:59;0008;   pbs_mom;Job;10413.galaxy.aero.iitb.ac.in;Terminated
> 10/23/2006 16:10:29;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;kill_task: killing pid 2565 task 1 with sig 15
> 10/23/2006 16:10:29;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;kill_task: killing pid 2603 task 1 with sig 15
> 10/23/2006 16:10:29;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;kill_task: killing pid 2606 task 1 with sig 15
> 10/23/2006 16:10:29;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;kill_task: killing pid 2607 task 1 with sig 15
> 10/23/2006 16:10:29;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;kill_task: not killing pid 0 with sig 9
> 10/23/2006 16:10:30;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;kill_task: killing pid 2606 task 1 with sig 9
> 10/23/2006 16:10:30;0080;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;scan_for_terminated: job 10412.galaxy.aero.iitb.ac.in task 1 terminated, sid 2565
> 10/23/2006 16:10:30;0008;   pbs_mom;Job;10412.galaxy.aero.iitb.ac.in;Terminated
> 10/23/2006 16:10:57;0001;   pbs_mom;Job;TMomFinalizeJob3;job 10444.galaxy.aero.iitb.ac.in started, pid = 24856
> 10/23/2006 16:10:57;0008;   pbs_mom;Job;10444.galaxy.aero.iitb.ac.in;Job Modified at request of PBS_Server at master1.cluster2.iitb.ac.in
> 10/23/2006 16:33:10;0008;   pbs_mom;Job;10439.galaxy.aero.iitb.ac.in;JOIN JOB as node 1
> ----------------------------------------------------------
> 
> qmgr -c 'p s'
> ----------------------------------------------------------
> 
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue batch
> #
> create queue batch
> set queue batch queue_type = Execution
> set queue batch resources_max.nodes = 4
> set queue batch resources_max.walltime = 120:00:00
> set queue batch resources_default.nodes = 0
> set queue batch resources_default.walltime = 01:00:00
> set queue batch enabled = True
> set queue batch started = True
> #
> # Create and define queue short
> #
> create queue short
> set queue short queue_type = Execution
> set queue short resources_max.nodes = 1
> set queue short resources_max.walltime = 24:00:00
> set queue short enabled = True
> set queue short started = True
> #
> # Set server attributes.
> #
> set server scheduling = False
> set server managers = root at galaxy.aero.iitb.ac.in
> set server operators = root at galaxy.aero.iitb.ac.in
> set server default_queue = batch
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 600
> set server node_ping_rate = 300
> set server node_check_rate = 600
> set server tcp_timeout = 6
> set server job_stat_rate = 30
> set server pbs_version = 2.1.0p0
> 
> 

