[torqueusers] can't execute multi-processors run

lorenzo118 at interfree.it lorenzo118 at interfree.it
Mon Aug 29 14:07:19 MDT 2005


Hi, 
I solved the problem "-bash: line 1: /usr/spool/PBS/mom_priv/jobs/34.medusa.d.SC: No such file or directory:" simply by submitting a script (for a single-processor job) instead of using qsub directly from the prompt. Now it works and produces an empty error file and the correct output file.
The problem with multi-processor jobs remains: the job exits without executing and doesn't produce any files. I checked mom_logs and found the following messages:


08/29/2005 19:14:52;0002;   pbs_mom;Svr;Log;Log opened
08/29/2005 19:14:52;0008;   pbs_mom;Job;37.medusa.dicea.unifi.it;JOIN JOB as node 1
08/29/2005 19:14:52;0001;   pbs_mom;Svr;pbs_mom;im_request, im_request: received KILL/ABORT request for job 37.medusa.dicea.unifi.it from node 192.168.65.70:1023


In server_logs there are the following messages (only those regarding job 37):



08/29/2005 19:14:44;0100;PBS_Server;Job;37.medusa.dicea.unifi.it;enqueuing into batch, state 1 hop 1
08/29/2005 19:14:44;0002;PBS_Server;Svr;Act;Account file /usr/spool/PBS/server_priv/accounting/20050829 opened
08/29/2005 19:14:44;0008;PBS_Server;Job;37.medusa.dicea.unifi.it;Job Queued at request of lcampo at medusa000.dicea.unifi.it, owner = lcampo at medusa000.dicea.unifi.it, job name = script, queue = batch
08/29/2005 19:14:44;0040;PBS_Server;Svr;medusa.dicea.unifi.it;Scheduler sent command new
08/29/2005 19:14:44;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:44;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:51;0100;PBS_Server;Req;;Type AuthenticateUser request received from lcampo at medusa000.dicea.unifi.it, sock=11
08/29/2005 19:14:51;0100;PBS_Server;Req;;Type StatusJob request received from lcampo at medusa000.dicea.unifi.it, sock=9
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type ResourceQuery request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type ModifyJob request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0008;PBS_Server;Job;37.medusa.dicea.unifi.it;Job Modified at request of Scheduler at medusa.dicea.unifi.it
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type RunJob request received from Scheduler at medusa.dicea.unifi.it, sock=10
08/29/2005 19:14:52;0008;PBS_Server;Job;37.medusa.dicea.unifi.it;Job Run at request of Scheduler at medusa.dicea.unifi.it
08/29/2005 19:14:52;0040;PBS_Server;Svr;medusa.dicea.unifi.it;Scheduler sent command recyc
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type JobObituary request received from pbs_mom at medusa005.dicea.unifi.it, sock=9
08/29/2005 19:14:52;0010;PBS_Server;Job;37.medusa.dicea.unifi.it;Exit_status=-2 resources_used.cput=00:00:00 resources_used.mem=0kb resources_used.vmem=0kb resources_used.walltime=00:00:00
08/29/2005 19:14:52;000d;PBS_Server;Job;37.medusa.dicea.unifi.it;Post job file processing error; job 37.medusa.dicea.unifi.it on host medusa005.dicea.unifi.it/0+medusa000.dicea.unifi.it/0
08/29/2005 19:14:52;0100;PBS_Server;Job;37.medusa.dicea.unifi.it;dequeuing from batch, state EXITING
08/29/2005 19:14:52;0040;PBS_Server;Svr;medusa.dicea.unifi.it;Scheduler sent command term
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type StatusServer request received from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:14:52;0100;PBS_Server;Req;;Type StatusNode request received from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:14:53;0100;PBS_Server;Req;;Type AuthenticateUser request received from lcampo at medusa000.dicea.unifi.it, sock=11
08/29/2005 19:14:53;0100;PBS_Server;Req;;Type StatusJob request received from lcampo at medusa000.dicea.unifi.it, sock=10
08/29/2005 19:15:00;0100;PBS_Server;Req;;Type StatusQueue request received from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:15:00;0100;PBS_Server;Req;;Type SelStat request received from Scheduler at medusa.dicea.unifi.it, sock=9
08/29/2005 19:16:47;0100;PBS_Server;Req;;Type AuthenticateUser request received from lcampo at medusa000.dicea.unifi.it, sock=10



Job 37 was launched with this script:


#PBS -l nodes=2,walltime=00:05:00
#PBS -e script.err
#PBS -o script.out
#PBS -V
mpirun -np 2 ./hello


Everything is launched from the /home directory of a non-root user, which is shared over NFS, so it is visible to all nodes.
What can cause the KILL/ABORT request in the pbs_mom log?

Lorenzo



At 17.53 29/08/2005, you wrote:

On Sat, Aug 27, 2005 at 09:00:30PM +0000, lorenzo118 at interfree.it alleged:
> 
> Hi,
> I installed Torque 1.2.p04 on my Linux cluster of 16 Pentium 4
> processors (with Fedora Core 3), on the master and on some nodes. It
> compiled with no problems, I started all daemons, and it marks all
> nodes as "free". The problem is that when I launch a one-processor
> job, I obtain only the error file (.e), which contains:
> 
> 
> -bash: line 1: /usr/spool/PBS/mom_priv/jobs/34.medusa.d.SC: No such file or directory:

This is usually a broken #! line at the top of the script.  Either the
path is wrong or it has a DOS EOL character.  Another common cause is a
full /tmp or /usr that is preventing the script file from being copied
correctly.
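
A minimal sketch of how the first two causes could be checked from the shell before submitting. The function name check_script is hypothetical, and the check only looks at the interpreter path and at carriage returns; it is not part of Torque.

```shell
#!/bin/sh
# check_script FILE (hypothetical helper): report an interpreter path that
# does not exist and DOS line endings. Returns non-zero if a problem is found.
check_script() {
    script=$1
    status=0
    # Read the first line, ignoring a trailing carriage return if present.
    first=$(head -n 1 "$script" | tr -d '\r')
    case $first in
        '#!'*)
            # Strip "#!" and any following spaces, keep only the interpreter path.
            interp=$(printf '%s\n' "$first" | sed 's/^#![[:space:]]*//' | cut -d' ' -f1)
            if [ ! -x "$interp" ]; then
                echo "interpreter not found: $interp"
                status=1
            fi
            ;;
        *)
            # Only a note, not counted as a failure.
            echo "no #! line"
            ;;
    esac
    # A carriage return anywhere in the file indicates DOS line endings.
    if grep -q "$(printf '\r')" "$script"; then
        echo "DOS line endings found"
        status=1
    fi
    return $status
}
```

For the third cause, something like `df -P /tmp /usr` on the execution node should show whether either filesystem is full.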


> Moreover, if I try to launch a two-processor job (with qsub -l
> nodes=2 ./hello), it exits after a few seconds without writing
> anything (not even empty .o and .e files). While it's in the queue,
> if I type qstat -f, I obtain this result:

Are the output files stuck in the node's undelivered directory?  Are
there any errors in mom's log?
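
A rough sketch of how both questions could be checked on a compute node. It assumes the undelivered and mom_logs directories sit directly under the PBS home (for example /usr/spool/PBS, as in the path from the error message) and that undelivered files are named after the job's sequence number; the function name find_job_traces is hypothetical.

```shell
#!/bin/sh
# find_job_traces PBS_HOME JOBID (hypothetical helper): print undelivered
# output files and pbs_mom log lines that belong to one job.
find_job_traces() {
    pbs_home=$1
    jobid=$2
    # Sequence number is the part of the job id before the first dot.
    seq=${jobid%%.*}
    # Assumed layout: output that could not be copied back is left in
    # PBS_HOME/undelivered under a name starting with the sequence number.
    ls "$pbs_home/undelivered" 2>/dev/null | grep "^$seq\."
    # Every mom log line that mentions the full job id.
    grep -h -F "$jobid" "$pbs_home"/mom_logs/* 2>/dev/null
    return 0
}
```

It would be run on each allocated node, for example `find_job_traces /usr/spool/PBS 37.medusa.dicea.unifi.it`.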


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers
