[torqueusers] jobs terminated half way

RB. Ezhilalan (Principal Physicist, CUH) RB.Ezhilalan at hse.ie
Tue Oct 29 09:29:54 MDT 2013


Hi

The job IDs were: 2054-2058, 2059-2065, 2066-2072. I have attached log
files of the mom and the server for the last three days; I am not sure
whether you were able to access them. I have copied a section of the
server logs below in case that is of some help.
***********************************************************
"10/27/2013 04:02:45;0002;PBS_Server;Svr;Act;Account file /var/spool/torque/server_priv/accounting/20131027 opened
10/27/2013 04:02:45;0010;PBS_Server;Job;2054.linux-01.physics;Exit_status=0 resources_used.cput=09:15:35 resources_used.mem=164028kb resources_used.vmem=670368kb resources_used.walltime=09:17:31
10/27/2013 04:02:46;0040;PBS_Server;Svr;linux-01.physics;Scheduler was sent the command term
10/27/2013 04:02:46;0008;PBS_Server;Job;2059.linux-01.physics;Job Modified at request of Scheduler at linux-01.physics
10/27/2013 04:02:46;0008;PBS_Server;Job;2059.linux-01.physics;Job Run at request of Scheduler at linux-01.physics
10/27/2013 04:02:49;0040;PBS_Server;Svr;linux-01.physics;Scheduler was sent the command recyc
10/27/2013 04:02:49;000d;PBS_Server;Job;2059.linux-01.physics;Not sending email: User does not want mail of this type.
10/27/2013 04:03:09;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.10, loglevel = 0
10/27/2013 04:07:27;0010;PBS_Server;Job;2053.linux-01.physics;Exit_status=0 resources_used.cput=09:20:22 resources_used.mem=164036kb resources_used.vmem=670368kb resources_used.walltime=09:22:14
10/27/2013 04:07:28;0040;PBS_Server;Svr;linux-01.physics;Scheduler was sent the command term
10/27/2013 04:07:28;0008;PBS_Server;Job;2060.linux-01.physics;Job Modified at request of Scheduler at linux-01.physics
10/27/2013 04:07:28;0008;PBS_Server;Job;2060.linux-01.physics;Job Run at request of Scheduler at linux-01.physics
10/27/2013 04:07:31;0040;PBS_Server;Svr;linux-01.physics;Scheduler was sent the command recyc
10/27/2013 04:07:31;000d;PBS_Server;Job;2060.linux-01.physics;Not sending email: User does not want mail of this type.
10/27/2013 04:07:46;0100;PBS_Server;Job;2054.linux-01.physics;dequeuing from long, state COMPLETE
10/27/2013 04:07:46;0040;PBS_Server;Svr;linux-01.physics;Scheduler was sent the command term
10/27/2013 04:08:09;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.10, loglevel = 0
10/27/2013 04:12:28;0100;PBS_Server;Job;2053.linux-01.physics;dequeuing from long, state COMPLETE"
***********************************************************
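In case it helps anyone reading the excerpt, here is a minimal Python sketch for pulling the job exit records out of a server log pasted into mail. It assumes each record has the layout seen above (date and time, event code, daemon, object type, object name, message, separated by semicolons). The function name job_exits is made up for illustration and is not part of Torque.

```python
import re

# Assumed record layout, inferred only from the excerpt above:
# date time;event code;daemon;object type;object name;message
RECORD = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2});"
    r"(?P<code>[0-9a-f]{4});(?P<daemon>[^;]*);(?P<kind>[^;]*);"
    r"(?P<name>[^;]*);(?P<msg>.*)"
)
KEYVAL = re.compile(r"(\S+?)=(\S+)")
STAMP = re.compile(r"(?=\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}:\d{2};)")


def job_exits(text):
    """Return {job name: {key: value}} for every Exit_status record."""
    # Mail clients wrap long records, so collapse all whitespace first
    # and then split in front of each date/time stamp.
    flat = " ".join(text.split())
    exits = {}
    for rec in STAMP.split(flat):
        m = RECORD.match(rec.strip())
        if m and m["kind"] == "Job" and "Exit_status=" in m["msg"]:
            exits[m["name"]] = dict(KEYVAL.findall(m["msg"]))
    return exits


sample = """10/27/2013
04:02:45;0010;PBS_Server;Job;2054.linux-01.physics;Exit_status=0
resources_used.cput=09:15:35 resources_used.mem=164028kb
resources_used.vmem=670368kb resources_used.walltime=09:17:31
10/27/2013 04:02:46;0008;PBS_Server;Job;2059.linux-01.physics;Job Run at
request of Scheduler at linux-01.physics"""

for job, fields in job_exits(sample).items():
    print(job, fields["Exit_status"], fields["resources_used.walltime"])
```

In the section copied above, both exit records (jobs 2053 and 2054) report Exit_status=0.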
Regards,
Ezhil
Ezhilalan Ramalingam M.Sc.,DABR.,
Principal Physicist (Radiotherapy),
Medical Physics Department,
Cork University Hospital,
Wilton, Cork
Ireland
Tel. 00353 21 4922533
Fax.00353 21 4921300
Email: rb.ezhilalan at hse.ie 
-----Original Message-----
From: torqueusers-bounces at supercluster.org
[mailto:torqueusers-bounces at supercluster.org] On Behalf Of
torqueusers-request at supercluster.org
Sent: 29 October 2013 15:01
To: torqueusers at supercluster.org
Subject: torqueusers Digest, Vol 111, Issue 37

Today's Topics:

   1. Re: FW: job terminated half way (Ricardo Román Brenes)
   2.  Problem building rpms torque-2.5.13 (Carles Acosta)
   3. Re: Problem building rpms torque-2.5.13 (Carles Acosta)


----------------------------------------------------------------------

Message: 1
Date: Tue, 29 Oct 2013 07:50:45 -0600
From: Ricardo Román Brenes <roman.ricardo at gmail.com>
Subject: Re: [torqueusers] FW: job terminated half way
To: Torque Users Mailing List <torqueusers at supercluster.org>
Message-ID: <CAG-vK_woxZ62UMsC0mt8G4EU3Sf-aPhCkP+0b42M0NLv_Cn9+w at mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Hi

What were the job IDs?
On Oct 29, 2013 7:43 AM, "RB. Ezhilalan (Principal Physicist, CUH)" <
RB.Ezhilalan at hse.ie> wrote:

> Files attached this time. Sorry!
>
> Ezhilalan Ramalingam M.Sc.,DABR.,
> Principal Physicist (Radiotherapy),
> Medical Physics Department,
> Cork University Hospital,
> Wilton, Cork
> Ireland
> Tel. 00353 21 4922533
> Fax.00353 21 4921300
> Email: rb.ezhilalan at hse.ie
>   ------------------------------
>
> From: RB. Ezhilalan (Principal Physicist, CUH)
> Sent: 29 October 2013 13:37
> To: torqueusers at supercluster.org
> Subject: job terminated half way
>
> Hi All,
>
> I have submitted three jobs to 7 processors located in 6 linux PCs (1
> dual core), but all the jobs have terminated half way for some reason.
>
> I could not interpret the log files from torque home (attached). I
> would be grateful for any help to resolve/identify the underlying
> cause for the termination of the jobs.
>
> Regards,
>
> Ezhil
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers

------------------------------

Message: 2
Date: Tue, 29 Oct 2013 15:40:04 +0100
From: Carles Acosta <cacosta at pic.es>
Subject: [torqueusers]  Problem building rpms torque-2.5.13
To: torqueusers at supercluster.org
Message-ID: <526FC8C4.9050400 at pic.es>
Content-Type: text/plain; charset="iso-8859-1"

Dear all,

I am trying to build the rpms for the new torque 2.5.13 release. After 
applying the patch fix_mom_priv_2.5.patch, I use the following options:

# rpmbuild -ta --with munge --with scp --define 'torque_home 
/var/spool/pbs' --define 'torque_server XXXXXXX' --define 'acflags 
--enable-maxdefault --with-readline --with-tcp-retry-limit=2 
--disable-spool' torque-2.5.13.tar.gz

The process fails with the error:

##############
gcc -DPBS_SERVER_HOME=\"/var/spool/pbs\" 
-DPBS_ENVIRON=\"/var/spool/pbs/pbs_environment\" -O2 -g 
-D_LARGEFILE64_SOURCE -DUSE_HA_THREADS -DSERVER_XML -DMUNGE_AUTH -o 
.libs/pbs_server accounting.o array_func.o array_upgrade.o attr_recov.o 
dis_read.o geteusernam.o issue_request.o job_attr_def.o job_func.o 
job_recov.o job_route.o node_attr_def.o node_func.o node_manager.o 
pbsd_init.o pbsd_main.o process_request.o queue_attr_def.o queue_func.o 
queue_recov.o reply_send.o req_delete.o req_deletearray.o req_getcred.o 
req_gpuctrl.o req_holdjob.o req_jobobit.o req_locate.o req_manager.o 
req_message.o req_modify.o req_movejob.o req_quejob.o req_register.o 
req_rerun.o req_rescq.o req_runjob.o req_select.o req_shutdown.o 
req_signal.o req_stat.o req_track.o resc_def_all.o run_sched.o 
stat_job.o svr_attr_def.o svr_chk_owner.o svr_connect.o svr_func.o 
svr_jobfunc.o svr_mail.o svr_movejob.o svr_recov.o svr_resccost.o 
svr_task.o req_tokens.o job_qs_upgrade.o req_holdarray.o 
svr_format_job.o  ../lib/Libattr/libattr.a ../lib/Libsite/libsite.a 
../lib/Libutils/libutils.a ../lib/Libpbs/.libs/libtorque.so -lpthread
req_getcred.o: In function `req_altauthenuser':
/root/rpmbuild/BUILD/torque-2.5.13/src/server/req_getcred.c:669: undefined reference to `unmunge_request'
../lib/Libpbs/.libs/libtorque.so: undefined reference to `PBSD_munge_authenticate'
collect2: ld returned 1 exit status
make[2]: *** [pbs_server] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/torque-2.5.13/src/server'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/torque-2.5.13/src'
make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.5oGHIR (%build)

RPM build errors:
     Bad exit status from /var/tmp/rpm-tmp.5oGHIR (%build)
##############
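The useful detail in that output is the pair of unresolved symbols. As a minimal sketch, assuming the build output has been saved to a text file or string, the symbol names can be listed like this, for example to search the source tree for where they are defined. The function name undefined_symbols is hypothetical and only for illustration.

```python
import re

# Matches GNU ld messages of the form: undefined reference to `symbol'
UNDEFINED = re.compile(r"undefined reference to\s+[`'](?P<sym>[^']+)'")


def undefined_symbols(build_log):
    """Return the unresolved symbol names, in order of first appearance."""
    # Mail clients may wrap the message before the symbol name,
    # so collapse all whitespace before matching.
    flat = " ".join(build_log.split())
    seen = []
    for m in UNDEFINED.finditer(flat):
        if m["sym"] not in seen:
            seen.append(m["sym"])
    return seen


log = """req_getcred.o: In function `req_altauthenuser':
/root/rpmbuild/BUILD/torque-2.5.13/src/server/req_getcred.c:669:
undefined reference to `unmunge_request'
../lib/Libpbs/.libs/libtorque.so: undefined reference to
`PBSD_munge_authenticate'
collect2: ld returned 1 exit status"""

print(undefined_symbols(log))
```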

It seems that the error is related to munge. However, using the same 
options for torque-2.5.12 on the same machine (with the munge, munge-libs 
and munge-devel packages installed), the rpms were built successfully.

I only changed the ".cri" nomenclature to ".munge.patch" in the 
torque.spec file (lines 66 and 68).

  66 %{expand:%%define release 0.*munge.patch*.snap.%(echo %{tarversion} | sed 's/^.*-snap\.//')}
  67 %else
  68 %define version %{tarversion}
  69 %define release *1.munge.patch*

Any ideas?

Thank you in advance.

Best regards,

Carles

-- 
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html


------------------------------

Message: 3
Date: Tue, 29 Oct 2013 15:51:12 +0100
From: Carles Acosta <cacosta at pic.es>
Subject: Re: [torqueusers] Problem building rpms torque-2.5.13
To: torqueusers at supercluster.org
Message-ID: <526FCB60.5080001 at pic.es>
Content-Type: text/plain; charset="iso-8859-1"

Hello again,

Just to update the information: when I choose "--without munge", there
are no problems. So I do not know whether I am doing something wrong or
there is a bug in building the torque-2.5.13 rpms with munge
authentication.

Regards,

Carles



-- 
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 08
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html


------------------------------

_______________________________________________
torqueusers mailing list
torqueusers at supercluster.org
http://www.supercluster.org/mailman/listinfo/torqueusers


End of torqueusers Digest, Vol 111, Issue 37
********************************************

