[torqueusers] Job Checkpoint and Restart with torque 2.4.6 & BLCR
Rajiv Rajaian
rajiv.care at gmail.com
Fri Mar 26 07:42:11 MDT 2010
Hi Jazcek Braden,
I've modified the blcr_checkpoint_script and removed the "depth" variable,
then submitted the test job again, but I still can't hold the job. It
remains in the running state. The steps I've taken are as follows. Please
help me solve this issue.
[guser02 at gcluster ~]$ qsub -c enabled,periodic,shutdown,interval=1 test.sh
[guser02 at gcluster ~]$ qhold 8
[guser02 at gcluster ~]$ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
8.gcluster                test.sh          guser02                0 R workq
[guser02 at gcluster ~]$ qstat -f
Job Id: 8.gcluster.grid
Job_Name = test.sh
Job_Owner = guser02 at gcluster.grid
job_state = R
queue = workq
server = gcluster.grid
Checkpoint = enabled,periodic,shutdown,interval=1
ctime = Fri Mar 26 19:07:07 2010
Error_Path = gcluster.grid:/home/guser02/test.sh.e8
exec_host = gcluster.grid/0
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Fri Mar 26 19:07:12 2010
Output_Path = gcluster.grid:/home/guser02/test.sh.o8
Priority = 0
qtime = Fri Mar 26 19:07:07 2010
Rerunable = True
Resource_List.nodect = 1
Resource_List.nodes = 1
session_id = 20882
Variable_List = PBS_O_HOME=/home/guser02,PBS_O_LOGNAME=guser02,
PBS_O_PATH=/usr/local/firefox/:/opt/mpich-1.2.6/bin:/usr/local/jdk1.5
.0_03/bin/:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local
/tomcat-5.0.27/bin:/usr/local/ant-1.6.4/bin:/usr/local/globus-4.0.3/bi
n:/usr/local/globus-4.0.3/sbin:/bin:/usr/local/maui/bin:/usr/local/gw/
bin:/usr/local/rrdtool/bin:/opt/ganglia/bin:/usr/local/sbin:/usr/local
/bin:/usr/local/pdftk-1.41/pdftk:/home/guser02/bin,
PBS_O_MAIL=/var/spool/mail/guser02,PBS_O_SHELL=/bin/bash,
PBS_O_HOST=gcluster.grid,PBS_SERVER=gcluster.grid,
PBS_O_WORKDIR=/home/guser02,PBS_O_QUEUE=workq
comment = Usage: /var/spool/PBS/mom_priv/blcr_checkpoint_script
etime = Fri Mar 26 19:07:07 2010
submit_args = -c enabled,periodic,shutdown,interval=1 test.sh
start_time = Fri Mar 26 19:07:07 2010
start_count = 1
fault_tolerant = False
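
In the listing above, the fields that show whether the hold took effect are job_state, Hold_Types and comment. The comment now begins with "Usage:", which suggests the script compiles but exits through its usage message. A minimal sketch for pulling those fields out of `qstat -f` output, assuming a hypothetical helper named qstat_attr (it is not part of TORQUE) and a standard awk:

```shell
#!/bin/sh
# Hypothetical helper (not part of TORQUE): print the value of one
# attribute from `qstat -f` output read on stdin.
# Only the first line of a wrapped value is returned.
qstat_attr() {
    awk -v key="$1" '
        $1 == key && $2 == "=" {
            sub(/^[^=]*= /, "")   # drop "name = " and any leading indentation
            print
            exit
        }'
}

# Example, using the job id from the listing above:
#   qstat -f 8 | qstat_attr job_state
#   qstat -f 8 | qstat_attr Hold_Types
#   qstat -f 8 | qstat_attr comment
```

If the hold had been applied, Hold_Types would no longer read "n" and job_state would leave "R". Here both are unchanged.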
On Fri, Mar 26, 2010 at 5:15 PM, Jazcek Braden <jazcek at gmail.com> wrote:
> There is a typo in the script in the documentation: it uses a
> variable called depth without declaring it in the my statement a few
> lines up.
>
> -- Jazcek
>
> On Fri, Mar 26, 2010 at 6:59 AM, Rajiv Rajaian <rajiv.care at gmail.com>
> wrote:
> > Hi all,
> > I've installed torque 2.4.6 and enabled BLCR with the following
> > options while installing:
> >
> > ./configure --disable-gui --with-server-home=/var/spool/PBS
> > --with-default-server=gcluster.grid --enable-unixsockets=no --enable-blcr
> > --disable-gcc-warnings
> >
> > My mom_priv/config looks like this:
> >
> > /var/spool/PBS/mom_priv/config
> > $checkpoint_script /var/spool/PBS/mom_priv/blcr_checkpoint_script
> > $restart_script /var/spool/PBS/mom_priv/blcr_restart_script
> > $checkpoint_run_exe /usr/local/bin/cr_run
> > $pbsserver gcluster.grid
> > $loglevel 7
> >
> >
> > I've created the blcr_checkpoint_script and blcr_restart_script scripts too.
> >
> > On job submission I'm getting the following error. Please help me
> > solve it. Is there anything else that needs to be configured for this?
> >
> > [guser02 at gcluster ~]$ qsub -c enabled,periodic,shutdown,interval=1 test.sh
> > 1.gcluster.grid
> >
> > [guser02 at gcluster ~]$ qhold 1
> >
> > [guser02 at gcluster ~]$ qstat
> > Job id                    Name             User            Time Use S Queue
> > ------------------------- ---------------- --------------- -------- - -----
> > 1.gcluster                test.sh          guser02                0 R workq
> >
> > [guser02 at gcluster ~]$ qstat -f
> > Job Id: 1.gcluster.grid
> > Job_Name = test.sh
> > Job_Owner = guser02 at gcluster.grid
> > job_state = R
> > queue = workq
> > server = gcluster.grid
> > Checkpoint = enabled,periodic,shutdown,interval=1
> > ctime = Fri Mar 26 17:20:03 2010
> > Error_Path = gcluster.grid:/home/guser02/test.sh.e1
> > exec_host = gcluster.grid/0
> > Hold_Types = n
> > Join_Path = n
> > Keep_Files = n
> > Mail_Points = a
> > mtime = Fri Mar 26 17:20:05 2010
> > Output_Path = gcluster.grid:/home/guser02/test.sh.o1
> > Priority = 0
> > qtime = Fri Mar 26 17:20:03 2010
> > Rerunable = True
> > Resource_List.nodect = 1
> > Resource_List.nodes = 1
> > session_id = 19993
> > Variable_List = PBS_O_HOME=/home/guser02,PBS_O_LOGNAME=guser02,
> > PBS_O_PATH=/usr/local/firefox/:/opt/mpich-1.2.6/bin:/usr/local/jdk1.5
> > .0_03/bin/:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/local
> > /tomcat-5.0.27/bin:/usr/local/ant-1.6.4/bin:/usr/local/globus-4.0.3/bi
> > n:/usr/local/globus-4.0.3/sbin:/bin:/usr/local/maui/bin:/usr/local/gw/
> > bin:/usr/local/rrdtool/bin:/opt/ganglia/bin:/usr/local/sbin:/usr/local
> > /bin:/usr/local/pdftk-1.41/pdftk:/home/guser02/bin,
> > PBS_O_MAIL=/var/spool/mail/guser02,PBS_O_SHELL=/bin/bash,
> > PBS_O_HOST=gcluster.grid,PBS_SERVER=gcluster.grid,
> > PBS_O_WORKDIR=/home/guser02,PBS_O_QUEUE=workq
> > comment = Scalar found where operator expected at
> > /var/spool/PBS/mom_priv/blcr_checkpoint_script line 31,
> > near "$signalNum $depth"
> > (Missing operator before $depth?)
> > syntax error at /var/spool/PBS/mom_priv/blcr_checkpoint_script line 31,
> > near "$signalNum $depth"
> > Global symbol "$depth" requires explicit package name at
> > /var/spool/PBS/mom_priv/blcr_checkpoint_script line 31.
> > Execution of /var/spool/PBS/mom_priv/blcr_checkpoint_script aborted
> > due to compilation errors.
> >
> > etime = Fri Mar 26 17:20:03 2010
> > submit_args = -c enabled,periodic,shutdown,interval=1 test.sh
> > start_time = Fri Mar 26 17:20:03 2010
> > start_count = 1
> > fault_tolerant = False
> >
> >
> > Regards
> > Rajiv R
> > Project Associate,
> > CARE,MIT,
> > Anna university ,Chennai
> >
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
> >
>
>
>
> --
> Jazcek Braden
>
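
The compile errors quoted above (line 31, near "$signalNum $depth") are the kind of problem that can be caught before a job is submitted. A minimal sketch, assuming the checkpoint script is Perl (as the error text suggests) and that perl is on the PATH. check_perl_script is a hypothetical helper name, not part of TORQUE or BLCR:

```shell
#!/bin/sh
# Hypothetical helper: compile-check a Perl script.
# `perl -c` compiles the file and exits without running its main code
# (BEGIN blocks and `use` statements are still executed).
# Returns 0 if the script compiles, 2 if it cannot be read,
# and perl's non-zero status if compilation fails.
check_perl_script() {
    script="$1"
    if [ ! -r "$script" ]; then
        echo "cannot read $script" >&2
        return 2
    fi
    perl -c "$script"
}

# Example with the checkpoint script path from mom_priv/config:
#   check_perl_script /var/spool/PBS/mom_priv/blcr_checkpoint_script \
#       && echo "checkpoint script compiles"
```

This only checks that the script compiles. It does not check that the script is called with the arguments it expects.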