From prithu at zresearch.com Sat Mar 22 06:52:17 2008
From: prithu at zresearch.com (prithu at zresearch.com)
Date: Sat, 22 Mar 2008 05:52:17 -0700 (PDT)
Subject: [Ccsm-users] running tests for CCSM
In-Reply-To: <42751.59.92.133.179.1206176289.squirrel@zresearch.com>
References: <42751.59.92.133.179.1206176289.squirrel@zresearch.com>
Message-ID: <41364.59.92.133.179.1206190337.squirrel@zresearch.com>

> Hi,
> I am a newbie.
> We are trying a test run of CCSM3 on a Linux cluster:
> 2 x quad-core 2.66 GHz processors per node, 16 such nodes connected with
> InfiniBand.
> The test we built is
>
> TER.01a.T31_gx3v5.B.generic_linux
>
> using OFED, OpenMPI and the PGI compilers.
> The "run" script was modified so that each instance of each component
> runs on a different core using OpenMPI.
>
> The job is running with each instance of each component
> on a different core. The processors are not idling, and
> even the communication shows running status (for more than
> 14 hrs now).
>
> How long would such a configuration run before giving any result? Even
> a guess would be helpful.
>
> Are there any intermediate outputs generated? Where, and in what form?
> If results were expected by now, how would I know whether the run has
> gone into an infinite loop or diverged?
>
> While compiling I faced a problem: NetCDF won't compile with the PGI
> compilers (netcdf-3.6.2 and pgi-7.1-5), so I finally downloaded a
> pre-compiled NetCDF for x86-64 and used that to build the model.
> OFED and OpenMPI were compiled using PGI.
>
> A quick response would be appreciated.
> regards
> Prithu

From prithu at zresearch.com Sun Mar 23 21:13:27 2008
From: prithu at zresearch.com (prithu at zresearch.com)
Date: Sun, 23 Mar 2008 20:13:27 -0700 (PDT)
Subject: [Ccsm-users] running tests for CCSM
In-Reply-To: <41364.59.92.133.179.1206190337.squirrel@zresearch.com>
References: <42751.59.92.133.179.1206176289.squirrel@zresearch.com> <41364.59.92.133.179.1206190337.squirrel@zresearch.com>
Message-ID: <35387.59.92.173.62.1206328407.squirrel@zresearch.com>

Hi,
Here are some further inputs.
In the first run, the number of processes given to the coupler was 1. The last
entry in the log file (cpl*.log) was generated within 5 min of start, and no
further log entries were generated even though the program ran for more than
2 days (>48 hrs).

A second run was started, this time with the coupler given 8 processes. This
has also been running merrily for more than 12 hrs now. The last part of the
log messages in cpl.log shows the following:

-----------------------------------------------------------------------
(tStamp_write) cpl model date 0001-01-06 00000s wall clock 2008-03-23 13:48:42 avg dt 1s dt 2s
(restart_write) cpl_control_caseName = TER.01a.T31_gx3v5.B.generic_linux.060452
(restart_write) cpl_control_restType = initial
(restart_write) cpl_control_restCDate = 10101
(restart_write) cpl_control_restPFn = rpointer.cpl
(restart_write) cpl_control_restBFn = null
(restart_write) creating new file: TER.01a.T31_gx3v5.B.generic_linux.060452.cpl6.r.0001-01-06-00000
(restart_write) appending to file: TER.01a.T31_gx3v5.B.generic_linux.060452.cpl6.r.0001-01-06-00000, date = 00010106, 0s
(cpl_iobin_appendBun) writing data for bundle = Xa2c_a
(cpl_iobin_appendBun) writing data for bundle = Xi2c_i
(cpl_iobin_appendBun) writing data for bundle = Xl2c_l
(cpl_iobin_appendBun) writing data for bundle = Xo2c_o
(cpl_iobin_appendBun) writing data for bundle = Xr2c_r
(cpl_iobin_appendBun) writing data for bundle = Xc2o_o
(cpl_iobin_appendBun) writing data for bundle = aoflux_o
(cpl_iobin_appendBun) writing data for bundle = oalbedo_o
----------------------------------------------------------------

But this was almost 12 hrs back (the machine is running on California time).
Any comments?
regards
Prithu

From tcraig at ucar.edu Sun Mar 23 21:27:13 2008
From: tcraig at ucar.edu (tcraig)
Date: Mon, 24 Mar 2008 14:27:13 +1100
Subject: [Ccsm-users] running tests for CCSM
In-Reply-To: <35387.59.92.173.62.1206328407.squirrel@zresearch.com>
References: <42751.59.92.133.179.1206176289.squirrel@zresearch.com> <41364.59.92.133.179.1206190337.squirrel@zresearch.com> <35387.59.92.173.62.1206328407.squirrel@zresearch.com>
Message-ID: <47E71F91.9090103@ucar.edu>

hi prithu,

it's difficult to know how fast ccsm should run on your platform.
could you tell me what your processor counts are for each
component?

certainly, an hour or two seems too long. on the other hand,
your latest email suggests the model is running. if you "grep"
for tStamp_write in your cpl.log file, that should summarize
the time for each model day (or timestep), like

  (tStamp_write) cpl model date 0001-01-06 00000s wall clock 2008-03-23 13:48:42 avg dt 1s dt 2s

if this is typical of your run time, then i suspect the model
is hanging on the restart write. the TER test runs 5 days,
writes a restart at 0001-01-06 00000s, then runs 5 more days.
then the model is supposed to start up again from the restart
and run the same 5 days again bit-for-bit.

you might want to start with a simple B case using create_newcase
and then play around with turning restarts on and off. that will
also help you debug.

again, it looks like your run is going, but it's hanging on the
restart write. but that's just a guess at this point. and based
on the limited output i've seen, the job should run in less than 30
minutes (or so) in total.

tony......

> _______________________________________________
> CCSM-Users mailing list
> CCSM-Users at cgd.ucar.edu
> http://mailman.cgd.ucar.edu/mailman/listinfo/ccsm-users

From prithu at zresearch.com Mon Mar 24 09:53:24 2008
From: prithu at zresearch.com (prithu at zresearch.com)
Date: Mon, 24 Mar 2008 08:53:24 -0700 (PDT)
Subject: [Ccsm-users] running tests for CCSM
In-Reply-To: <47E71F91.9090103@ucar.edu>
References: <42751.59.92.133.179.1206176289.squirrel@zresearch.com> <41364.59.92.133.179.1206190337.squirrel@zresearch.com> <35387.59.92.173.62.1206328407.squirrel@zresearch.com> <47E71F91.9090103@ucar.edu>
Message-ID: <42062.59.92.144.101.1206374004.squirrel@zresearch.com>

Hi,
Ran the test again, and here is the process distribution I gave:
cpl - 8
csim - 8
clm - 6
pop - 24
cam - 16

The only difference between runs 1 and 2 was that earlier
we had given 1 process to cpl,
but in run 2 we gave 8 processes to cpl.

I checked the tStamp_ entries: it took about 5 min to come
to 0001-01-06, after which I finally get the log messages I posted earlier
about writing the restart file, and then it goes into a loop (all tStamps
were within those 5 min of wall clock).
The surprising thing is that the program does not exactly hang: it runs, that
is, it occupies the allotted CPUs (98-100% utilization), but without
progressing any further at all.
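
The tStamp check described here can be automated. What follows is a minimal Python sketch, not part of CCSM: it assumes the coupler log lines have the shape of the tStamp_write line quoted in this thread, and the names TSTAMP, last_tstamp and hours_since, as well as the "now" value used in the example, are illustrative.

```python
import re
from datetime import datetime

# Matches coupler log lines of the form quoted in the thread, e.g.
# (tStamp_write) cpl model date 0001-01-06 00000s wall clock 2008-03-23 13:48:42 avg dt 1s dt 2s
TSTAMP = re.compile(
    r"\(tStamp_write\)\s+cpl model date\s+(\S+)\s+(\d+)s\s+"
    r"wall clock\s+(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)

def last_tstamp(log_text):
    """Return (model_date, wall_clock) of the last tStamp_write line, or None."""
    last = None
    for m in TSTAMP.finditer(log_text):
        wall = datetime.strptime(m.group(3), "%Y-%m-%d %H:%M:%S")
        last = (m.group(1), wall)
    return last

def hours_since(wall_clock, now):
    """Wall-clock hours elapsed between the last stamp and `now`."""
    return (now - wall_clock).total_seconds() / 3600.0

# Example with the line from the thread; `now` is a made-up time for illustration.
sample = ("(tStamp_write) cpl model date 0001-01-06 00000s "
          "wall clock 2008-03-23 13:48:42 avg dt 1s dt 2s\n")
model_date, wall = last_tstamp(sample)
idle = hours_since(wall, datetime(2008, 3, 24, 1, 48, 42))
print(model_date, wall, f"{idle:.1f} h since last stamp")
```

In practice one would read the text from the cpl.log file and pass datetime.now() as `now`. A last stamp that is hours old while the CPUs stay busy matches the symptom reported in this message.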

I am going to try running the
TER.01a.T42_gx1v3.B.generic_linux
test case too.
What would be the recommended process distribution for this? And the run time?

regards
Prithu

From tcraig at ucar.edu Mon Mar 24 15:32:57 2008
From: tcraig at ucar.edu (tcraig)
Date: Tue, 25 Mar 2008 08:32:57 +1100
Subject: [Ccsm-users] running tests for CCSM
In-Reply-To: <42062.59.92.144.101.1206374004.squirrel@zresearch.com>
References: <42751.59.92.133.179.1206176289.squirrel@zresearch.com> <41364.59.92.133.179.1206190337.squirrel@zresearch.com> <35387.59.92.173.62.1206328407.squirrel@zresearch.com> <47E71F91.9090103@ucar.edu> <42062.59.92.144.101.1206374004.squirrel@zresearch.com>
Message-ID: <47E81E09.2050701@ucar.edu>

prithu,

before trying a new, higher resolution, i would try to figure
out why the model is hanging in the restart write. the
processor distribution is reasonable.

if you want to try a simple case, you could try "X" instead
of "B". X is just the coupler with "dead" components
that don't write restarts. that might be an easier case
to test and debug.

tony......
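
As a quick sanity check on the processor distribution discussed above, the per-component counts from Prithu's message can be summed and compared with the cluster described at the start of the thread (16 nodes, 2 quad-core processors each). A small Python sketch; the variable names are illustrative, and it assumes one MPI process per core:

```python
# Process counts per component, as posted in the thread.
layout = {"cpl": 8, "csim": 8, "clm": 6, "pop": 24, "cam": 16}

# Cluster size from the first message: 16 nodes, 2 quad-core CPUs per node.
nodes, sockets_per_node, cores_per_socket = 16, 2, 4

total_procs = sum(layout.values())
total_cores = nodes * sockets_per_node * cores_per_socket

print(f"processes requested: {total_procs}")
print(f"cores available:     {total_cores}")
print(f"fits one process per core: {total_procs <= total_cores}")
```

On these figures the job asks for 62 processes on 128 cores, so the layout by itself does not oversubscribe the cluster.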

From prithu at zresearch.com Mon Mar 24 21:36:23 2008
From: prithu at zresearch.com (prithu at zresearch.com)
Date: Mon, 24 Mar 2008 20:36:23 -0700 (PDT)
Subject: [Ccsm-users] running tests for CCSM
In-Reply-To: <47E81E09.2050701@ucar.edu>
References: <42751.59.92.133.179.1206176289.squirrel@zresearch.com> <41364.59.92.133.179.1206190337.squirrel@zresearch.com> <35387.59.92.173.62.1206328407.squirrel@zresearch.com> <47E71F91.9090103@ucar.edu> <42062.59.92.144.101.1206374004.squirrel@zresearch.com> <47E81E09.2050701@ucar.edu>
Message-ID: <50588.59.96.206.111.1206416183.squirrel@zresearch.com>

hi,
What are the latest PGI compilers with which CCSM3 has been tested on x86-64
platforms? I ask because NetCDF is essential for CCSM3, but it does not
build ("make check" fails) with at least pgi-7.1-5, due to
name-mangling issues.

Are there flags etc. to set if we want to switch over to the Intel compilers?

I will try the "X" case as well.
regards
Prithu

From jianma at hawaii.edu Thu Mar 27 12:57:38 2008
From: jianma at hawaii.edu (Jian Ma)
Date: Thu, 27 Mar 2008 08:57:38 -1000
Subject: [Ccsm-users] [CCSM-Users] CCSM restart error on blueice
Message-ID:

Dear All,

I am a newbie to CCSM. I am now running a 1990 control run on blueice. The
tests are all right, and it also works perfectly when I run a startup for the
1990 control. Lately, I found that I can use the restart files from the output
data set to continue a 1000-year 1990 control run.
But when I changed the model to run continuously from the 1000th-year restart
files, the following error occurred in the poe.stderr.### file.

Traceback:
  Offset 0x0000010c in procedure __abortutils_NMOD_endrun, near line 38 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/abortutils.f90
  Offset 0x000000ac in procedure handle_error, near line 39014 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/wrap_nf.f90
  Offset 0x000001b4 in procedure wrap_inq_varid, near line 8358 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/wrap_nf.f90
  Offset 0x000001b0 in procedure __history_NMOD_h_inquire, near line 2353 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/history.f90
  Offset 0x00001df8 in procedure __history_NMOD_read_restart_history, near line 1158 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/history.f90
  Offset 0x000005a0 in procedure __restart_NMOD_read_restart, near line 368 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/restart.f90
  Offset 0x00000e00 in procedure cam, near line 256 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/cam.f90
  --- End of call chain ---

Could anyone please give me a tip on the cause of this problem? Please let me
know if you need additional info.

Thank you very much!
Tony

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.cgd.ucar.edu/pipermail/ccsm-users/attachments/20080327/9e2bad42/attachment.html
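
The call chain in the message above lists the innermost frame first, so it is easiest to read in reverse. A minimal Python sketch of that reordering, assuming each frame has the "in procedure NAME, near line N in file PATH" shape shown in the traceback; the regex and the helper name call_chain are illustrative and not part of CCSM, and the sample holds only three of the frames:

```python
import re

# One frame per match: procedure name, line number, source file path.
FRAME = re.compile(r"in procedure (\S+), near line (\d+) in file (\S+)")

def call_chain(traceback_text):
    """Return (procedure, line, source file) tuples, outermost caller first."""
    frames = [
        (m.group(1), int(m.group(2)), m.group(3).rsplit("/", 1)[-1])
        for m in FRAME.finditer(traceback_text)
    ]
    return frames[::-1]  # the traceback lists the innermost frame first

SAMPLE = """\
Offset 0x0000010c in procedure __abortutils_NMOD_endrun, near line 38 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/abortutils.f90
Offset 0x000001b4 in procedure wrap_inq_varid, near line 8358 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/wrap_nf.f90
Offset 0x00000e00 in procedure cam, near line 256 in file /ptmp/jianma/CCSM/run/b30.004/atm/obj/cam.f90
"""

for proc, line, src in call_chain(SAMPLE):
    print(f"{proc} ({src}:{line})")
```

Read this way, the full chain runs from cam through read_restart and read_restart_history down to wrap_inq_varid, handle_error and endrun. Going by the procedure names alone, that suggests the abort followed a failed variable lookup while the history restart data was being read; this is an inference, not something confirmed in the thread.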