[torqueusers] Question about qsub file with argument

Abraham Zamudio abraham.zamudio at gmail.com
Thu Sep 30 10:12:08 MDT 2010


Thanks for your comments.

Troy, Torque now runs my script with your modifications, but the output
files (mpidata.$PBS_JOBID.$FILE) contain the following error:

mpiexec: Warning: task 0 died with signal 11 (Segmentation fault).
mpiexec: Warning: tasks 1-11 died with signal 15 (Terminated).
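
Signal 11 in task 0 is raised inside the program itself; the other tasks are then terminated. One possibility, which the logs here neither confirm nor rule out: $FILE comes from `ls $FOLDER`, so it is a bare file name, and Torque starts a job in the home directory unless told otherwise, so the program may fail to open its input. A minimal sketch of a cola.qsub that checks this before calling mpiexec follows. The names input_ok and run_job are hypothetical, and the PBS_JOBID guard is only there so the script does nothing outside a job.

```shell
#!/bin/bash
#PBS -S /bin/bash
#PBS -N proof
#PBS -q queue_2
#PBS -l nodes=Four_processors:ppn=4+Eight_processors:ppn=8
#PBS -j oe

# Hypothetical helper: succeed only if the argument names a readable regular file.
input_ok() {
    [ -n "$1" ] && [ -f "$1" ] && [ -r "$1" ]
}

# Hypothetical wrapper around the original mpiexec call.
run_job() {
    # Go back to the directory qsub was run from (jobs start in $HOME by default).
    cd "${PBS_O_WORKDIR:-.}" || return 1
    if ! input_ok "$FILE"; then
        echo "input file '$FILE' not readable from $(pwd) on $(hostname)" >&2
        return 1
    fi
    # Ask for a core file in case the program still crashes.
    ulimit -c unlimited
    mpiexec /PATH/TO/MPI_SOFTWARE/program "$FILE"
}

# Only run inside a Torque job, where PBS_JOBID is set.
if [ -n "${PBS_JOBID:-}" ]; then
    run_job
fi
```

If the file check fails, the message in the joined output file names the directory and host where the lookup failed. The ulimit call only affects processes started from this shell, and only if the hard limit permits it.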


The pbs_mom logs from my nodes:

cat /var/spool/torque/mom_logs/20100929 | grep 1040.master
09/29/2010 18:02:17;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1040.master
started, pid = 29017
09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 2, sid 29065, cmd /bin/sh
09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 3, sid 29066, cmd /bin/sh
09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 4, sid 29067, cmd /bin/sh
09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 5, sid 29068, cmd /bin/sh
09/29/2010 18:02:18;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 2 terminated, sid=29065
09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 3 signal 9
09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
29066 task 3 gracefully with sig 15
09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=29066/state=Z) with sig 9
09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 4 signal 9
09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
29067 task 4 gracefully with sig 15
09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=29067/state=Z) with sig 9
09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 5 signal 9
09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
29068 task 5 gracefully with sig 15
09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=29068/state=Z) with sig 9
09/29/2010 18:02:38;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 3 terminated, sid=29066
09/29/2010 18:02:38;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 4 terminated, sid=29067
09/29/2010 18:02:38;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 5 terminated, sid=29068
09/29/2010 18:03:02;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
29018 task 1 gracefully with sig 15
09/29/2010 18:03:02;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 1 terminated, sid=29017
09/29/2010 18:03:02;0008;   pbs_mom;Job;1040.master;job was terminated
09/29/2010 18:03:02;0080;   pbs_mom;Job;1040.master;obit sent to server



[mpiX at quad4 ~]$ cat /var/spool/torque/mom_logs/20100929 | grep 1040.master
09/29/2010 18:02:07;0008;   pbs_mom;Job;1040.master;JOIN JOB as node 1
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 6, sid 9232, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 7, sid 9233, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 8, sid 9234, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 9, sid 9235, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 10, sid 9236, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 11, sid 9237, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 12, sid 9238, cmd /bin/sh
09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
started, tid 13, sid 9239, cmd /bin/sh
09/29/2010 18:02:14;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 6 signal 9
09/29/2010 18:02:14;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9232 task 6 gracefully with sig 15
09/29/2010 18:02:19;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9232/state=Z) with sig 9
09/29/2010 18:02:19;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 7 signal 9
09/29/2010 18:02:19;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9233 task 7 gracefully with sig 15
09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9233/state=Z) with sig 9
09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 8 signal 9
09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9234 task 8 gracefully with sig 15
09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9234/state=Z) with sig 9
09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 9 signal 9
09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9235 task 9 gracefully with sig 15
09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9235/state=Z) with sig 9
09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 10 signal 9
09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9236 task 10 gracefully with sig 15
09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9236/state=Z) with sig 9
09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 11 signal 9
09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9237 task 11 gracefully with sig 15
09/29/2010 18:02:42;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9237/state=Z) with sig 9
09/29/2010 18:02:42;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 12 signal 9
09/29/2010 18:02:42;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9238 task 12 gracefully with sig 15
09/29/2010 18:02:47;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9238/state=Z) with sig 9
09/29/2010 18:02:47;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
1040.master from node 0 task 13 signal 9
09/29/2010 18:02:47;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
9239 task 13 gracefully with sig 15
09/29/2010 18:02:52;0008;   pbs_mom;Job;1040.master;kill_task: not killing
process (pid=9239/state=Z) with sig 9
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 6 terminated, sid=9232
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 7 terminated, sid=9233
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 8 terminated, sid=9234
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 9 terminated, sid=9235
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 10 terminated, sid=9236
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 11 terminated, sid=9237
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 12 terminated, sid=9238
09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated: job
1040.master task 13 terminated, sid=9239
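
Both logs show the same pattern: node 0 sends SIGNAL_TASK with signal 9, pbs_mom first tries signal 15, and a few seconds later finds the process already a zombie (state=Z). That appears consistent with mpiexec reporting tasks 1-11 as terminated after task 0 crashed. To condense such a log, here is a small sketch. mom_summary is a hypothetical helper; it assumes one record per line, as in the log file itself rather than the wrapped text in this mail, and the semicolon-separated layout shown above.

```shell
#!/bin/bash
# Hypothetical helper: count pbs_mom records per event type for one job.
# Usage: mom_summary JOBID LOGFILE
mom_summary() {
    local jobid=$1 logfile=$2
    # Keep records whose fifth field is the job ID, then count by the
    # part of the message before the first colon (e.g. "kill_task").
    grep -F ";${jobid};" "$logfile" |
        awk -F';' '{ split($6, w, ":"); n[w[1]]++ }
                   END { for (e in n) print n[e], e }' |
        sort -k2
}
```

For example, `mom_summary 1040.master /var/spool/torque/mom_logs/20100929` prints one line per event type with its count. The TMomFinalizeJob3 record is skipped because its fifth field is not the job ID.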



On Thu, Sep 30, 2010 at 8:26 AM, Glen Beane <glen.beane at gmail.com> wrote:

> On Wed, Sep 29, 2010 at 3:42 PM, Troy Baer <tbaer at utk.edu> wrote:
> > On Wed, 2010-09-29 at 14:13 -0500, Abraham Zamudio wrote:
> >> I have an MPICH2 program that takes one argument (argv[1]):
> >> ./program file_to_analyze
> >>
> >> I submit it to the Torque queue like this:
> >
> >> #####################
> >> #### run_all_files.sh ####
> >> #####################
> >> FOLDER=/path/to/files
> >> for i in $(ls $FOLDER ); do
> >>     qsub cola.qsub $i
> >> done
> >> #####################
> >
> >> #################
> >> #### cola.qsub ####
> >> #################
> >> #PBS -S /bin/bash
> >> #PBS -N proof
> >> #PBS -q queue_2
> >> #PBS -l nodes=Four_processors:ppn=4+Eight_processors:ppn=8
> >> #PBS -j oe
> >> #PBS -o cola.$PBS_JOBID.$1
> >>
> >> mpiexec /PATH/TO/MPI_SOFTWARE/program   $1
> >> #################
> >
> > That's not how qsub processes its command line arguments.  Setting an
> > environment variable that gets propagated into the jobs using the -v
> > flag to qsub might work, though:
> >
> > ########################
> > ### run_all_files.sh ###
> > ########################
> > FOLDER=/path/to/files
> > for i in $(ls $FOLDER )
> > do
> >    qsub -v FILE=$i cola.qsub
> > done
> >
> > #################
> > ### cola.qsub ###
> > #################
> > #PBS -S /bin/bash
> > #PBS -N proof
> > #PBS -q queue_2
> > #PBS -l nodes=Four_processors:ppn=4+Eight_processors:ppn=8
> > #PBS -j oe
> > #PBS -o cola.$PBS_JOBID.$FILE
> > mpiexec /PATH/TO/MPI_SOFTWARE/program $FILE
> >
> > Do environment variable macro substitutions work in the arguments to the
> > -e and -o flags?  (I was under the impression that they didn't.)
>
> torque will use wordexp to expand shell variables in the -o and -e
> arguments, so your example should work provided wordexp was found by
> ./configure
>
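
For the submission side, here is a hedged sketch of run_all_files.sh that quotes file names, passes the full path in FILE, and sets -o on the qsub command line, so the name is expanded by the submitting shell and does not depend on wordexp support in the Torque build. submit_all and the QSUB override are hypothetical names added for illustration. The job ID is not known before submission, so the output name here uses only the file name.

```shell
#!/bin/bash
# Hypothetical variant of run_all_files.sh.
# Usage: submit_all FOLDER
submit_all() {
    local folder=$1 path name
    for path in "$folder"/*; do
        # Skip subdirectories and the unmatched-glob case.
        [ -f "$path" ] || continue
        name=$(basename "$path")
        # QSUB can be set to "echo qsub" for a dry run (an assumption of
        # this sketch, not a Torque feature).
        ${QSUB:-qsub} -v FILE="$path" -o "cola.$name.out" cola.qsub
    done
}
```

Running `QSUB="echo qsub" submit_all /path/to/files` only prints the commands; plain `submit_all /path/to/files` submits them. File names containing commas would need extra care, since -v takes a comma-separated variable list.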



-- 
Abraham Zamudio Ch.