[torqueusers] Question about qsub file with argument

Abraham Zamudio abraham.zamudio at gmail.com
Thu Sep 30 10:45:33 MDT 2010


Sorry, here is the log from my master node:

cat /var/spool/torque/server_logs/20100929 | grep 1040.master
09/29/2010 17:52:04;0100;PBS_Server;Job;1040.master;enqueuing into batch,
state 1 hop 1
09/29/2010 17:52:04;0008;PBS_Server;Job;1040.master;Job Queued at request of
mpiX@master, owner = mpiX@master, job name = mpi_fitting, queue = batch
09/29/2010 18:05:56;0008;PBS_Server;Job;1040.master;Job Run at request of
root@master
09/29/2010 18:05:56;000d;PBS_Server;Job;1040.master;Not sending email: User
does not want mail of this type.
09/29/2010 18:06:41;000d;PBS_Server;Job;1040.master;Not sending email: User
does not want mail of this type.
09/29/2010 18:06:41;0010;PBS_Server;Job;1040.master;Exit_status=11
resources_used.cput=00:01:57 resources_used.mem=11196kb
resources_used.vmem=407492kb resources_used.walltime=00:00:45
09/29/2010 18:06:41;0100;PBS_Server;Job;1040.master;dequeuing from batch,
state COMPLETE
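
The line that matters here is the one with Exit_status=11. As a rough sketch (not an official TORQUE tool), the value can be pulled out of the server log with grep and sed. The function name extract_exit_status is made up for this example, and it assumes each entry sits on a single line in the real log file (the wrapping above comes from the mail client):

```shell
#!/bin/sh
# Sketch: print the Exit_status recorded for one job in a TORQUE server log.
# Assumes the semicolon-separated layout shown in the log above.

# extract_exit_status JOBID   (log text is read from stdin)
extract_exit_status() {
    # Keep only lines for this job, then print the number after "Exit_status=".
    grep -F ";$1;" | sed -n 's/.*Exit_status=\([0-9-]*\).*/\1/p'
}

# Sample entry copied from the log above, joined back onto one line.
sample='09/29/2010 18:06:41;0010;PBS_Server;Job;1040.master;Exit_status=11 resources_used.cput=00:01:57 resources_used.mem=11196kb resources_used.vmem=407492kb resources_used.walltime=00:00:45'

printf '%s\n' "$sample" | extract_exit_status 1040.master
```

Against the real file this would be called as extract_exit_status 1040.master < /var/spool/torque/server_logs/20100929.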


On Thu, Sep 30, 2010 at 11:12 AM, Abraham Zamudio <abraham.zamudio at gmail.com> wrote:

> Thanks for your comments.
>
> Troy, TORQUE now runs my script with your modifications, but the output
> files (mpidata.$PBS_JOBID.$FILE) contain the following error:
>
> mpiexec: Warning: task 0 died with signal 11 (Segmentation fault).
> mpiexec: Warning: tasks 1-11 died with signal 15 (Terminated).
>
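
For anyone reading along: the signal numbers in these warnings, and in the kill_task lines of the mom logs, can be translated into names with the shell's own kill -l. This is only a small sketch to make the logs easier to read; it says nothing about why task 0 crashed.

```shell
#!/bin/sh
# Translate the signal numbers reported by mpiexec and pbs_mom into names.
# "kill -l N" prints the name without the SIG prefix, so it is added here.
for sig in 9 11 15; do
    printf 'signal %s = SIG%s\n' "$sig" "$(kill -l "$sig")"
done
```

So task 0 received SIGSEGV, and the remaining tasks were then stopped with SIGTERM.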
>
> The logs from my nodes:
>
> cat /var/spool/torque/mom_logs/20100929 | grep 1040.master
> 09/29/2010 18:02:17;0001;   pbs_mom;Job;TMomFinalizeJob3;job 1040.master
> started, pid = 29017
> 09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 2, sid 29065, cmd /bin/sh
> 09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 3, sid 29066, cmd /bin/sh
> 09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 4, sid 29067, cmd /bin/sh
> 09/29/2010 18:02:17;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 5, sid 29068, cmd /bin/sh
> 09/29/2010 18:02:18;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 2 terminated, sid=29065
> 09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 3 signal 9
> 09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 29066 task 3 gracefully with sig 15
> 09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=29066/state=Z) with sig 9
> 09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 4 signal 9
> 09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 29067 task 4 gracefully with sig 15
> 09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=29067/state=Z) with sig 9
> 09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 5 signal 9
> 09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 29068 task 5 gracefully with sig 15
> 09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=29068/state=Z) with sig 9
> 09/29/2010 18:02:38;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 3 terminated, sid=29066
> 09/29/2010 18:02:38;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 4 terminated, sid=29067
> 09/29/2010 18:02:38;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 5 terminated, sid=29068
> 09/29/2010 18:03:02;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 29018 task 1 gracefully with sig 15
> 09/29/2010 18:03:02;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 1 terminated, sid=29017
> 09/29/2010 18:03:02;0008;   pbs_mom;Job;1040.master;job was terminated
> 09/29/2010 18:03:02;0080;   pbs_mom;Job;1040.master;obit sent to server
>
>
>
> [mpiX@quad4 ~]$ cat /var/spool/torque/mom_logs/20100929 | grep 1040.master
> 09/29/2010 18:02:07;0008;   pbs_mom;Job;1040.master;JOIN JOB as node 1
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 6, sid 9232, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 7, sid 9233, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 8, sid 9234, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 9, sid 9235, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 10, sid 9236, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 11, sid 9237, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 12, sid 9238, cmd /bin/sh
> 09/29/2010 18:02:08;0008;   pbs_mom;Job;1040.master;start_process: task
> started, tid 13, sid 9239, cmd /bin/sh
> 09/29/2010 18:02:14;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 6 signal 9
> 09/29/2010 18:02:14;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9232 task 6 gracefully with sig 15
> 09/29/2010 18:02:19;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9232/state=Z) with sig 9
> 09/29/2010 18:02:19;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 7 signal 9
> 09/29/2010 18:02:19;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9233 task 7 gracefully with sig 15
> 09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9233/state=Z) with sig 9
> 09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 8 signal 9
> 09/29/2010 18:02:23;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9234 task 8 gracefully with sig 15
> 09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9234/state=Z) with sig 9
> 09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 9 signal 9
> 09/29/2010 18:02:28;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9235 task 9 gracefully with sig 15
> 09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9235/state=Z) with sig 9
> 09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 10 signal 9
> 09/29/2010 18:02:33;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9236 task 10 gracefully with sig 15
> 09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9236/state=Z) with sig 9
> 09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 11 signal 9
> 09/29/2010 18:02:38;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9237 task 11 gracefully with sig 15
> 09/29/2010 18:02:42;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9237/state=Z) with sig 9
> 09/29/2010 18:02:42;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 12 signal 9
> 09/29/2010 18:02:42;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9238 task 12 gracefully with sig 15
> 09/29/2010 18:02:47;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9238/state=Z) with sig 9
> 09/29/2010 18:02:47;0008;   pbs_mom;Job;1040.master;im_request: SIGNAL_TASK
> 1040.master from node 0 task 13 signal 9
> 09/29/2010 18:02:47;0008;   pbs_mom;Job;1040.master;kill_task: killing pid
> 9239 task 13 gracefully with sig 15
> 09/29/2010 18:02:52;0008;   pbs_mom;Job;1040.master;kill_task: not killing
> process (pid=9239/state=Z) with sig 9
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 6 terminated, sid=9232
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 7 terminated, sid=9233
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 8 terminated, sid=9234
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 9 terminated, sid=9235
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 10 terminated, sid=9236
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 11 terminated, sid=9237
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 12 terminated, sid=9238
> 09/29/2010 18:02:52;0080;   pbs_mom;Job;1040.master;scan_for_terminated:
> job 1040.master task 13 terminated, sid=9239
>
>
>
> On Thu, Sep 30, 2010 at 8:26 AM, Glen Beane <glen.beane at gmail.com> wrote:
>
>> On Wed, Sep 29, 2010 at 3:42 PM, Troy Baer <tbaer at utk.edu> wrote:
>> > On Wed, 2010-09-29 at 14:13 -0500, Abraham Zamudio wrote:
>> >> I have an MPICH2 program that takes one argument (argv[1]):
>> >> ./program file_to_analyze
>> >>
>> >> I submit it to the TORQUE queue like this:
>> >
>> >> #####################
>> >> #### run_all_files.sh ####
>> >> #####################
>> >> FOLDER=/path/to/files
>> >> for i in $(ls $FOLDER ); do
>> >>     qsub cola.qsub $i
>> >> done
>> >> #####################
>> >
>> >> #################
>> >> #### cola.qsub ####
>> >> #################
>> >> #PBS -S /bin/bash
>> >> #PBS -N proof
>> >> #PBS -q queue_2
>> >> #PBS -l nodes=Four_processors:ppn=4+Eight_processors:ppn=8
>> >> #PBS -j oe
>> >> #PBS -o cola.$PBS_JOBID.$1
>> >>
>> >> mpiexec /PATH/TO/MPI_SOFTWARE/program   $1
>> >> #################
>> >
>> > That's not how qsub processes its command line arguments.  Setting an
>> > environment variable that gets propagated into the jobs using the -v
>> > flag to qsub might work, though:
>> >
>> > ########################
>> > ### run_all_files.sh ###
>> > ########################
>> > FOLDER=/path/to/files
>> > for i in $(ls $FOLDER )
>> > do
>> >    qsub -v FILE=$i cola.qsub
>> > done
>> >
>> > #################
>> > ### cola.qsub ###
>> > #################
>> > #PBS -S /bin/bash
>> > #PBS -N proof
>> > #PBS -q queue_2
>> > #PBS -l nodes=Four_processors:ppn=4+Eight_processors:ppn=8
>> > #PBS -j oe
>> > #PBS -o cola.$PBS_JOBID.$FILE
>> > mpiexec /PATH/TO/MPI_SOFTWARE/program $FILE
>> >
>> > Do environment variable macro substitutions work in the arguments to the
>> > -e and -o flags?  (I was under the impression that they didn't.)
>>
>> torque will use wordexp to expand shell variables in the -o and -e
>> arguments, so your example should work provided wordexp was found by
>> ./configure
>
>
>
> --
> Abraham Zamudio Ch.
>
>
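
For reference, here is a dry-run sketch of the submission loop from the quoted thread. It only prints the qsub commands instead of submitting anything, so it can be tried without a cluster. The temporary directory and the file names a.dat and b.dat are invented for the demonstration, and submit_all is just a name picked for this example. It uses the -v approach Troy suggested and passes the bare file name, as the original loop over ls did.

```shell
#!/bin/sh
# Dry-run sketch: print one qsub command per file instead of submitting.
# FOLDER and the sample files are made up for this demonstration.

FOLDER=$(mktemp -d)     # shell assignment: no leading "$", no spaces around "="
touch "$FOLDER/a.dat" "$FOLDER/b.dat"

# submit_all DIR: print a qsub command for every regular file in DIR
submit_all() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue     # skip subdirectories and an unmatched glob
        # Hand the file name to the job through the environment (qsub -v).
        echo qsub -v "FILE=$(basename "$f")" cola.qsub
    done
}

submit_all "$FOLDER"
rm -rf "$FOLDER"
```

Removing the echo would turn the dry run into real submissions; cola.qsub then reads $FILE as in Troy's version.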


-- 
Abraham Zamudio Ch.