[torqueusers] KILL/ABORT request in pbs_mom log

Lorenzo Campo lorenzo118 at interfree.it
Fri Sep 2 02:33:52 MDT 2005


Hi,
I installed torque 1.2.p04 on my Linux cluster. Single-processor jobs work 
fine, but multi-processor jobs exit without executing and do not produce 
any output files. I checked the mom_logs and found the following messages:


08/29/2005 19:14:52;0002; pbs_mom;Svr;Log;Log opened
08/29/2005 19:14:52;0008; pbs_mom;Job;37.medusa.dicea.unifi.it;JOIN JOB as 
node 1
08/29/2005 19:14:52;0001; pbs_mom;Svr;pbs_mom;im_request, im_request: 
received KILL/ABORT request for job 37.medusa.dicea.unifi.it from node 
192.168.65.70:1023


In the server_logs there are the following messages (only those regarding job 37):



08/29/2005 19:14:44;0100;PBS_Server;Job;37.medusa.dicea.unifi.it;enqueuing 
into batch, state 1 hop 1
08/29/2005 19:14:44;0002;PBS_Server;Svr;Act;Account file 
/usr/spool/PBS/server_priv/accounting/20050829 opened
08/29/2005 19:14:44;0008;PBS_Server;Job;37.medusa.dicea.unifi.it;Job Queued 
at request of lcampo at medusa000.dicea.unifi.it, owner = 
lcampo at medusa000.dicea.unifi.it, job name = script, queue = batch
08/29/2005 19:14:44;0040;PBS_Server;Svr;medusa.dicea.unifi.it;Scheduler 
sent command new
08/29/2005 19:14:44;0100;PBS_Server;Req;;Type StatusServer request received 
from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:44;0100;PBS_Server;Req;;Type StatusNode request received 
from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:51;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from lcampo at medusa000.dicea.unifi.it, sock=11
08/29/2005 19:14:51;0100;PBS_Server;Req;;Type StatusJob request received 
from lcampo at medusa000.dicea.unifi.it, sock=9
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type StatusQueue request received 
from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type SelStat request received from 
Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type ResourceQuery request 
received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type ModifyJob request received 
from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0008;PBS_Server;Job;37.medusa.dicea.unifi.it;Job 
Modified at request of Scheduler at medusa.dicea.unifi.it
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type RunJob request received from 
Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0008;PBS_Server;Job;37.medusa.dicea.unifi.it;Job Run at 
request of Scheduler at medusa.dicea.unifi.it
08/29/2005 19:14:52;0040;PBS_Server;Svr;medusa.dicea.unifi.it;Scheduler 
sent command recyc
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type JobObituary request received 
from pbs_mom at medusa005.dicea.unifi.it, sock=9
08/29/2005 19:14:52;0010;PBS_Server;Job;37.medusa.dicea.unifi.it;Exit_status=-2 
resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb 
resources_used.walltime=00:00:00
08/29/2005 19:14:52;000d;PBS_Server;Job;37.medusa.dicea.unifi.it;Post job 
file processing error; job 37.medusa.dicea.unifi.it on host 
medusa005.dicea.unifi.it/0+medusa000.dicea.unifi.it/0
08/29/2005 19:14:52;0100;PBS_Server;Job;37.medusa.dicea.unifi.it;dequeuing 
from batch, state EXITING
08/29/2005 19:14:52;0040;PBS_Server;Svr;medusa.dicea.unifi.it;Scheduler 
sent command term
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type StatusServer request received 
from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type StatusNode request received 
from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:14:53;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from lcampo at medusa000.dicea.unifi.it, sock=11
08/29/2005 19:14:53;0100;PBS_Server;Req;;Type StatusJob request received 
from lcampo at medusa000.dicea.unifi.it, sock=10
08/29/2005 19:15:00;0100;PBS_Server;Req;;Type StatusQueue request received 
from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:15:00;0100;PBS_Server;Req;;Type SelStat request received from 
Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:16:47;0100;PBS_Server;Req;;Type AuthenticateUser request 
received from lcampo at medusa000.dicea.unifi.it, sock=10
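
A filter along these lines can narrow a server or mom log down to the records for one job. It is only a minimal sketch: it assumes each record sits on a single line in the log file (the wrapping above comes from the mail client) and that the job id appears as a whole token delimited by semicolons or whitespace, as in the excerpts above. The function name job_records is hypothetical and not part of TORQUE.

```shell
# job_records JOBID [FILE...]
# Print every log record that contains JOBID as a complete token.
# Tokens are split on ';', spaces and tabs, so "37.medusa..." does not
# match "137.medusa...". Reads standard input when no FILE is given.
# Hypothetical helper, not a TORQUE command.
job_records() {
    jobid=$1
    shift
    awk -v id="$jobid" '{
        n = split($0, tok, /[; \t]+/)
        for (i = 1; i <= n; i++) {
            if (tok[i] == id) { print; next }
        }
    }' "$@"
}
```

For example, `job_records 37.medusa.dicea.unifi.it /usr/spool/PBS/server_logs/20050829` would print the job's records from that day's server log, assuming the server log directory sits next to the server_priv directory shown in the excerpt.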



Job 37 was launched with this script:


#PBS -l nodes=2,walltime=00:05:00
#PBS -e script.err
#PBS -o script.out
#PBS -V
mpirun -np 2 ./hello


Everything is launched from the /home directory of a non-root user, which is 
shared over NFS, so it is visible to all nodes.
What can cause the KILL/ABORT request in the pbs_mom log?

Lorenzo Campo



