diff --git b/Changes.log a/Changes.log new file mode 100644 index 0000000..ccf61a3 --- /dev/null +++ a/Changes.log @@ -0,0 +1,428 @@ +########################################### +# Modification History of NPB3.x # +# ------------------------------ # +# NPB development team # +# NASA Ames Research Center # +# npb@nas.nasa.gov # +# http://www.nas.nasa.gov/Software/NPB/ # +########################################### + +------------------------------------------------------ +Changes in NPB3.3.1 + (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI ) +------------------------------------------------------ +[17-Feb-09] + +This is a bug fixing release of NPB3.3. + +1. All versions + + - sys/setparams.c: fixed a problem in dealing with quoted (") flags + from make.def when producing npbparams.h for C. + + - CG: ensure 'implicit none' used in all subroutines. + +2. MPI version + + - Additional timers can be used for profiling purpose, similar + to those already included in the OMP and SER versions. + + - LU: + * code clean up (suggested by Rob Van der Wijngaart) + > avoid using MPI_ANY_SOURCE in exchange_*.f, which might + alter performance in some cases. + > delete references to sethyper and 'icomm*', which are + no longer used since NPB2.2. + * change the low-bound limit on the sub-domain size in subdomain.f + from 4 to 3 in order to increase allowable process counts. + * allow number of processes other than power of two. + + - FT: fix a non-portable way of broadcasting input parameters + (pointed out by Art Lazanoff) + + - BT: include 'btio_cleanup' as part of the I/O timing + +3. OMP and SER versions + + - DC: fix access to out-of-bound array elements in adc.c + Reported by Per Larsen of Demark + + - UA: fix the use of uninitialized array 'sje' in mortar_vertex() by + adding "call nr_init[_omp](sje,4*6*nelt,0)" in the main program. + + - MG, UA: include additional timers for profiling purpose. + + - Executables now use ".x" as a name extension + + +------------------------------------------------------ +Changes in NPB3.3 + (NPB3.3-SER, NPB3.3-OMP, NPB3.3-MPI ) +------------------------------------------------------ +[02-Aug-07] + +1. New and improvements + + - The Class E problem has been introduced in seven of the benchmarks + (BT, SP, LU, CG, MG, FT, and EP) in all three implementations. + + - The Class D problem has been added to the IS benchmark in all + three implementations. It requires the compiler support of + 64-bit "long" type in C. The MPI version of IS now allows runs + up to 1024 processes. + + - The Bucket Sort option (USE_BUCKETS) has been added to + the OpenMP version of IS and made as the default. + + - Introduced the "twiddle" array in the OpenMP FT benchmark, + which has been used in the MPI and SER versions and seems + to improve performance for larger problem sizes. + + - Merged vector codes for the BT and LU benchmarks into + the release. + + - Updates to BTIO (MPI/BT with IO subtypes): + * added I/O stats (I/O timing, data size written, I/O data rate) + * added an option for interleaving reads between writes through + the inputbt.data file. Although the data file size would be + smaller as a result, the total amount of data written is still + the same. + + - Made documents more consistent throughout different versions + (README and README.install). + +2. Bug fixes + + - MPI/FT: fixed a verification failure for cases where NX/=NY + and the 2D decomposition are used. The bug occurred at least + for (Class D, NPROCS=2048) and (Class B, NPROCS=512). + + fixed an output printing format problem occurred when + the number of processes >= 1000. + + - MPI/SP: fixed a performance regression due to improper + padding of array dimensions. + + - MPI/IS: minor fix to support large processor counts (>=512). + + - OMP/UA: fixed a race condition in mason.f, avoided the use + of the LASTPRIVATE directive. + + - OMP/LU: minor fix in data flushing for pipelining. + + - DC: There are a number of fixes - + * fixed segmentation fault in both OMP and SER versions + caused by accessing zero-length array elements. + Reported by Jeff Odom . + + * fixed a race in reporting benchmark timing in the OMP version + + * fixed the use of timer in the OMP version, which limited + the number of threads to 64. The number of threads is now + lifted to a maximum of MAX_NUMBER_OF_TASKS (=256). + + * made the benchmark output consistent with other NPBs. + + - fixed a use of uninitialized variable in MPI/sys/setparams.c. + setparams in all three versions was updated to deal with + make.def that contains carriage-return character ('\r'). + + - SER/FT: added 'implicit none' to all missing places. + + - SER/IS: fixed missing variable declarations for the Bucket + Sort option (when USE_BUCKETS is defined). + +3. Others + + - The default value for collbuf_nodes in the BT I/O benchmark + is now set to 0, indicating no file hints will be used. + The setting can be changed by using the "inputbt.data" file. + + - The hyperplane version of LU (LU-HP) is no longer included + in the distribution. + + +------------------------------------------------------ +Changes in NPB3.2.1 + (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI ) +------------------------------------------------------ +[27-Jul-05] + +This is a bug fixing release of NPB3.2. + +1. MPI version + - sys/setparams.c: removed a duplicated statement for writing + FT parameters and made invalid SUBTYPE as an error condition. + The 'duplicated statement' problem was fixed in NPB3.2 (See + the note below). However, during the final updating process, + the fix was left out, even though the log file was updated. + + - BT: included SUBTYPE=EPIO in the I/O verification. + + - LU: bcast_inputs.f: fixed wrong data type (dp_type) used for + communicating integers (nx0,ny0,nz0) with the correct type + MPI_INTEGER. + + - MG: fixed a mis-calculation of parameter "nr" in globals.h + that caused run-time failure for NPROCS >= 512 + (reported by Donald Ferry of Cray). Expanded to limit to + 131072 processes and added an error checking code. + + The use of MPI_ANY_SOURCE for MPI_Irecv inside subroutine + ready() could cause MPI_Wait return a message meant for + the wrong k. The problem is fixed with nbr(axis,-dir,k) + in place of MPI_ANY_SOURCE in the call to MPI_Irecv + (reported and suggested by Hideo Saito). + +2. OpenMP version + - EP: use THREADPRIVATE for working array storage. It should not + change performance but made some compiler happier. + + - LU: add variable "v" to FLUSH to ensure solution data properly + flushed for pipeline. This change is needed according to + the OpenMP 2.5 standard. + + - IS: reorganized working buffers so that the count for key + population could be more naturally performed. This version + uses much less stack space. + + - UA: implemented atomic updates with locks in order to achieve + better scaling on those systems that have an inefficient + (or even buggy) ATOMIC implementation. + + +------------------------------------------------------ +Changes in NPB3.2 + (NPB3.2-SER, NPB3.2-OMP, NPB3.2-MPI ) +------------------------------------------------------ +[07-Jan-05] + +1. DC version in NPB3.2-SER was converted to C from C++ + (CLASSES S, W, A, B). + sys/setparams.c file was changed appropriately. + +2. OpenMP version of DC was added to NPB3.2-OMP. + +3. Data Traffic benchmark DT was added to NPB3.2-MPI. + +[24-May-04] + +All versions: + - use assumed shape "(*)" declaration in CG + - fixed the use of an uninitialized variable in EP + - avoid using integer array for assumed shape dimensions in FT + - fix in UA: + * fix the reference to file "inputua.data" + * avoid overindexing + * avoid reference to out-of-bound array elements + * change declaration "real*8" to "double precision" + +OMP version: + - explicitly added "SCHEDULE(STATIC)" to the OMP version + - use the "omp_get_wtime()" function for timer if available + - removed the call to "getenv" for portability + - change in UA: + * implemented an alternative approach for atomic update + +MPI version: + - removed a duplicated declaration in FT (from setparams.c) + - removed a duplicated declaration in BT/full_mpiio.f + - fixed a missing "NPROCS=" in sys/suite.awk + + +------------------------------------------------------ +Changes in NPB3.1 + (NPB3.1-MPI, NPB3.1-SER, NPB3.1-OMP) +------------------------------------------------------ +[22-Apr-04] NPB3.1-MPI + +Merged the NPB2.4-MPI branch into NPB3.1 with the following changes. + + - Optimized the BT memory usage. The new version is about 1/3 of + the memory used in NPB2.x. + - Fixed a bug in CG for running on a large number of processes + - Redefined the Class W size in MG so that the verification value + will not be too small. (see below for SER & OMP versions) + - Use the relative errors for verification in both CG and MG + - Fixed a race in 'make suite' + +[08-Apr-04] NPB3.1-SER and NPB3.1-OMP + +The following changes are made in both NPB3.1-SER and NPB3.1-OMP. + +1. Added the Class D problem + - verification values taken from NPB2.4-MPI + - modified variables to fit in large problem + +2. Improvements for LU and LU-HP: + - reduced the memory usage for the 'tv' variable in LU and LU-HP + - a more efficient memory access for variables "a,b,c,d" in LU-HP + - a dummy iteration added before the time step loop for consistency + with other benchmarks + +3. Improvement and fix in MG: + - verification in MG now uses the relative error + (instead of the absolute error). This will avoid incorrect + verification for small reference values. + - redefined the class size for Class W so that the verification + value will not be too small. + In version 3.0 and earlier: 64x64x64, 40 iters + New size in version 3.1 : 128x128x128, 4 iters + - fixed incorrect verification values for Classes A and C. + +4. CG: + - use relative error for verification + - clean up codes for matrix initialization (makea). + The new code uses about 1/2 memory of the previous version. + +5. Fixed makefile related issues + - fixed dependence on make.def for files in common. + - fixed a race in 'make suite' + - added 'LU-HP' as a valid benchmark option in makefiles + +The following changes are made in NPB3.1-OMP. + +1. Included a hyper-plane version of the LU benchmark: LU-HP + - based on the serial version + +2. The dummy 'omp_lib_dum' library is not longer used for compilation + without an OpenMP compiler. Conditional compilation is now used. + +3. Parallelization of the initialization part of MG. + It improves the turn-around time quite a bit for the larger + classes, such as class D. + +4. Parallelize codes for matrix initialization (makea) in CG. + The new code uses about 2/3 memory of the version in NPB3.0-OMP. + +5. Code clean up in SP so that the structure is more consistent + with the serial version. + + + +------------------------------------------------------ +Changes in NPB2.x MPI version +------------------------------------------------------ + +Changes in 2.4.1 +- fixed error in BT/Makefile (replaced "==" with "=") +- added stub function accumulate_norms in BT/btio.f +- changed type of Class B verification constants in BT/verify.f from + single to double precision + +Changes in 2.4 +- Added I/O benchmark (subtype of BT). +- Added Class D for all benchmarks except IS. +- Reduced size of tabulated exponentials in FT. +- Made minor changes to FT to prevent integer overflow for class D on + systems with 32-bit integers. FT class D will not run on small + numbers of processors anymore. + + +------------------------------------------------------ +Changes in non-MPI versions of NPB (previously PBN3.0) + (NPB3.0-SER, NPB3.0-HPF, NPB3.0-OMP, NPB3.0-JAV) +------------------------------------------------------ + +[01-Mar-99] Initial Beta Release. + +[06-Apr-99] Based on report from Charles Grassl and Ramesh Menon (SGI). + + 1. NPB-SER, FT: file auxfnct.f - + lines 74 and 75 were interchanged: + + double complex u0(d1+1,d2,d3), tmp(maxdim) + integer d1,d2,d3 + + 2. NPB-OMP: The OpenMP standards requires reduction variable be scalars, + thus, changes made to remove the use of array variable for reduction. + Relevant modifications in EP, CG, LU, SP, and BT + + 3. NPB-OMP: Remove compiler warnings of "Referenced scalar variables + use defaults" by declaring explicitly as shared. + Relevant modifications in FT, LU, and BT + + 4. NPB-OMP, README.openmp: Explicitly spell out the requirement of + the static scheduling (setenv OMP_SCHEDULE "static"). + + +[05-Oct-99] NPB3.0-non-MPI Beta Release (02) + +General change to all (NPB-SER, NPB-HPF, NPB-OMP) - + 1. Update header information for all benchmarks. + + 2. Allow continuation lines in 'make.def' (modification done + in sys/setparams.c). + +Change made in NPB-OMP - + 1. 'print_results' now prints Number-Of-Threads and Mflops/s/thread. + The printed number is the activated threads during the run, which + may not be the same as what's requested. + + 2. A initial data touch loop for array A is added in CG. + + 3. 'CRITICAL' section is used for reduction with array. + Relevant changes in EP, CG, LU, SP, and BT. + + 4. Reconfigure 'make.def' such that 'omp_lib_dum' can be activated + from the file for no directive compilation. + + 5. The "!$OMP END DO" seems needed before "!$OMP MASTER" in rhs.f + for both BT and SP for some f90 compilers. + + 6. "SCHEDULE(STATIC)" are used for the pipeline in LU to ensure + compliance with the OMP standard. + +Change made in NPB-HPF - + 1. 'print_results' now prints Number-Of-Processes and Mflops/s/process. + + 2. Use more consistent output format (via print_results). + + 3. More consistent makefiles (via config/make.def). + + +[04-Apr-00] NPB3.0-non-MPI Beta Release (03) + +Change made in NPB-OMP - + 1. The OpenMP-C version of IS has been added, including more timers. + + 2. 'cprint_results' includes Number-Of-Threads and Mflops/s/thread. + +Change made in NPB-SER - + 1. More timers included in IS. + +NPB-JAV has been included in NPB3.0-non-MPI. + + +[31-May-01] NPB3.0-non-MPI Beta Release (04) + +Change made in NPB-OMP - + 1. NPB-OMP/LU: Failure in verification for number of threads greater + than the problem size is now fixed. + + 2. If OMP_NUM_THREADS is unset, the printout will report as "unset" + instead of "1" + + 3. NPB-OMP/IS: Allocating work_buff on the stack seems to cause problem + for large problem size (CLASS C). "work_buff" is now allocated + by "malloc" on the heap for CLASS C. + + 4. NPB-OMP/IS: Reported by - potential + synchronization problem could arise due to the use of "static" + variables inside "randlc()". Declaration of these static variables + are moved out of randlc() and put in the threadprivate directive. + +General change to all (NPB-SER, NPB-HPF, NPB-OMP) - + 1. Cleanup in makefiles + + +[28-Aug-02] The Official NPB3.0 Release + +Change made in all - + 1. Fixed a bogus verification for "NaN". + + 2. Name change from "PBN3.0" to "NPB3.0". Updated all the banners. + + 3. NPB-SER/FT: use a derived version from NPB2.3-serial. + + 4. NPB-HPF/FT: use a consistent printing format. diff --git b/NPB3.3-HPF.README a/NPB3.3-HPF.README new file mode 100644 index 0000000..ff1e508 --- /dev/null +++ a/NPB3.3-HPF.README @@ -0,0 +1,4 @@ +The HPF version of NPB is not included in this distribution. +Please download it from NPB3.0 instead. + +http://www.nas.nasa.gov/Software/NPB diff --git b/NPB3.3-JAV.README a/NPB3.3-JAV.README new file mode 100644 index 0000000..b36e686 --- /dev/null +++ a/NPB3.3-JAV.README @@ -0,0 +1,4 @@ +The Java version of NPB is not included in this distribution. +Please download it from NPB3.0 instead. + +http://www.nas.nasa.gov/Software/NPB diff --git b/NPB3.3-MPI/BT/Makefile a/NPB3.3-MPI/BT/Makefile new file mode 100644 index 0000000..dd27503 --- /dev/null +++ a/NPB3.3-MPI/BT/Makefile @@ -0,0 +1,106 @@ +SHELL=/bin/sh +BENCHMARK=bt +BENCHMARKU=BT +VEC= + +include ../config/make.def + + +OBJS = bt.o make_set.o initialize.o exact_solution.o exact_rhs.o \ + set_constants.o adi.o define.o copy_faces.o rhs.o solve_subs.o \ + x_solve$(VEC).o y_solve$(VEC).o z_solve$(VEC).o add.o error.o \ + verify.o setup_mpi.o \ + ${COMMON}/print_results.o ${COMMON}/timers.o + +include ../sys/make.common + +# npbparams.h is included by header.h +# The following rule should do the trick but many make programs (not gmake) +# will do the wrong thing and rebuild the world every time (because the +# mod time on header.h is not changed. One solution would be to +# touch header.h but this might cause confusion if someone has +# accidentally deleted it. Instead, make the dependency on npbparams.h +# explicit in all the lines below (even though dependence is indirect). + +# header.h: npbparams.h + +${PROGRAM}: config + @if [ x$(VERSION) = xvec ] ; then \ + ${MAKE} VEC=_vec exec; \ + elif [ x$(VERSION) = xVEC ] ; then \ + ${MAKE} VEC=_vec exec; \ + else \ + ${MAKE} exec; \ + fi + +exec: $(OBJS) + @if [ x$(SUBTYPE) = xfull ] ; then \ + ${MAKE} bt-full; \ + elif [ x$(SUBTYPE) = xFULL ] ; then \ + ${MAKE} bt-full; \ + elif [ x$(SUBTYPE) = xsimple ] ; then \ + ${MAKE} bt-simple; \ + elif [ x$(SUBTYPE) = xSIMPLE ] ; then \ + ${MAKE} bt-simple; \ + elif [ x$(SUBTYPE) = xfortran ] ; then \ + ${MAKE} bt-fortran; \ + elif [ x$(SUBTYPE) = xFORTRAN ] ; then \ + ${MAKE} bt-fortran; \ + elif [ x$(SUBTYPE) = xepio ] ; then \ + ${MAKE} bt-epio; \ + elif [ x$(SUBTYPE) = xEPIO ] ; then \ + ${MAKE} bt-epio; \ + else \ + ${MAKE} bt-bt; \ + fi + +bt-bt: ${OBJS} btio.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} btio.o ${FMPI_LIB} + +bt-full: ${OBJS} full_mpiio.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_full ${OBJS} btio_common.o full_mpiio.o ${FMPI_LIB} + +bt-simple: ${OBJS} simple_mpiio.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.mpi_io_simple ${OBJS} btio_common.o simple_mpiio.o ${FMPI_LIB} + +bt-fortran: ${OBJS} fortran_io.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.fortran_io ${OBJS} btio_common.o fortran_io.o ${FMPI_LIB} + +bt-epio: ${OBJS} epio.o btio_common.o + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM}.ep_io ${OBJS} btio_common.o epio.o ${FMPI_LIB} + +.f.o: + ${FCOMPILE} $< + +.c.o: + ${CCOMPILE} $< + + +bt.o: bt.f header.h npbparams.h mpinpb.h +make_set.o: make_set.f header.h npbparams.h mpinpb.h +initialize.o: initialize.f header.h npbparams.h +exact_solution.o: exact_solution.f header.h npbparams.h +exact_rhs.o: exact_rhs.f header.h npbparams.h +set_constants.o: set_constants.f header.h npbparams.h +adi.o: adi.f header.h npbparams.h +define.o: define.f header.h npbparams.h +copy_faces.o: copy_faces.f header.h npbparams.h mpinpb.h +rhs.o: rhs.f header.h npbparams.h +x_solve$(VEC).o: x_solve$(VEC).f header.h work_lhs$(VEC).h npbparams.h mpinpb.h +y_solve$(VEC).o: y_solve$(VEC).f header.h work_lhs$(VEC).h npbparams.h mpinpb.h +z_solve$(VEC).o: z_solve$(VEC).f header.h work_lhs$(VEC).h npbparams.h mpinpb.h +solve_subs.o: solve_subs.f npbparams.h +add.o: add.f header.h npbparams.h +error.o: error.f header.h npbparams.h mpinpb.h +verify.o: verify.f header.h npbparams.h mpinpb.h +setup_mpi.o: setup_mpi.f mpinpb.h npbparams.h +btio.o: btio.f header.h npbparams.h +btio_common.o: btio_common.f mpinpb.h npbparams.h +fortran_io.o: fortran_io.f mpinpb.h npbparams.h +simple_mpiio.o: simple_mpiio.f mpinpb.h npbparams.h +full_mpiio.o: full_mpiio.f mpinpb.h npbparams.h +epio.o: epio.f mpinpb.h npbparams.h + +clean: + - rm -f *.o *~ mputil* + - rm -f npbparams.h core diff --git b/NPB3.3-MPI/BT/add.f a/NPB3.3-MPI/BT/add.f new file mode 100644 index 0000000..e14cde4 --- /dev/null +++ a/NPB3.3-MPI/BT/add.f @@ -0,0 +1,30 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine add + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c addition of update to the vector u +c--------------------------------------------------------------------- + + include 'header.h' + + integer c, i, j, k, m + + do c = 1, ncells + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = u(m,i,j,k,c) + rhs(m,i,j,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end diff --git b/NPB3.3-MPI/BT/adi.f a/NPB3.3-MPI/BT/adi.f new file mode 100644 index 0000000..58450c0 --- /dev/null +++ a/NPB3.3-MPI/BT/adi.f @@ -0,0 +1,21 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine adi + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + call copy_faces + + call x_solve + + call y_solve + + call z_solve + + call add + + return + end + diff --git b/NPB3.3-MPI/BT/bt.f a/NPB3.3-MPI/BT/bt.f new file mode 100644 index 0000000..c64d0a3 --- /dev/null +++ a/NPB3.3-MPI/BT/bt.f @@ -0,0 +1,328 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.3 ! +! ! +! B T ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.3 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007. ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.3. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.3, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + +c--------------------------------------------------------------------- +c +c Authors: R. F. Van der Wijngaart +c T. Harris +c M. Yarrow +c +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- + program MPBT +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i, niter, step, c, error, fstatus + double precision navg, mflops, mbytes, n3 + + external timer_read + double precision t, tmax, tiominv, tpc, timer_read + logical verified + character class, cbuff*40 + double precision t1(t_last+2), tsum(t_last+2), + > tming(t_last+2), tmaxg(t_last+2) + character t_recs(t_last+2)*8 + + integer wr_interval + + data t_recs/'total', 'i/o', 'rhs', 'xsolve', 'ysolve', 'zsolve', + > 'bpack', 'exch', 'xcomm', 'ycomm', 'zcomm', + > ' totcomp', ' totcomm'/ + + call setup_mpi + if (.not. active) goto 999 + +c--------------------------------------------------------------------- +c Root node reads input file (if it exists) else takes +c defaults from parameters +c--------------------------------------------------------------------- + if (node .eq. root) then + + write(*, 1000) + + open (unit=2,file='timer.flag',status='old',iostat=fstatus) + timeron = .false. + if (fstatus .eq. 0) then + timeron = .true. + close(2) + endif + + open (unit=2,file='inputbt.data',status='old', iostat=fstatus) +c + rd_interval = 0 + if (fstatus .eq. 0) then + write(*,233) + 233 format(' Reading from input file inputbt.data') + read (2,*) niter + read (2,*) dt + read (2,*) grid_points(1), grid_points(2), grid_points(3) + if (iotype .ne. 0) then + read (2,'(A)') cbuff + read (cbuff,*,iostat=i) wr_interval, rd_interval + if (i .ne. 0) rd_interval = 0 + if (wr_interval .le. 0) wr_interval = wr_default + endif + if (iotype .eq. 1) then + read (2,*) collbuf_nodes, collbuf_size + write(*,*) 'collbuf_nodes ', collbuf_nodes + write(*,*) 'collbuf_size ', collbuf_size + endif + close(2) + else + write(*,234) + niter = niter_default + dt = dt_default + grid_points(1) = problem_size + grid_points(2) = problem_size + grid_points(3) = problem_size + wr_interval = wr_default + if (iotype .eq. 1) then +c set number of nodes involved in collective buffering to 4, +c unless total number of nodes is smaller than that. +c set buffer size for collective buffering to 1MB per node +c collbuf_nodes = min(4,no_nodes) +c set default to No-File-Hints with a value of 0 + collbuf_nodes = 0 + collbuf_size = 1000000 + endif + endif + 234 format(' No input file inputbt.data. Using compiled defaults') + + write(*, 1001) grid_points(1), grid_points(2), grid_points(3) + write(*, 1002) niter, dt + if (no_nodes .ne. total_nodes) write(*, 1004) total_nodes + if (no_nodes .ne. maxcells*maxcells) + > write(*, 1005) maxcells*maxcells + write(*, 1003) no_nodes + + if (iotype .eq. 1) write(*, 1006) 'FULL MPI-IO', wr_interval + if (iotype .eq. 2) write(*, 1006) 'SIMPLE MPI-IO', wr_interval + if (iotype .eq. 3) write(*, 1006) 'EPIO', wr_interval + if (iotype .eq. 4) write(*, 1006) 'FORTRAN IO', wr_interval + + 1000 format(//, ' NAS Parallel Benchmarks 3.3 -- BT Benchmark ',/) + 1001 format(' Size: ', i4, 'x', i4, 'x', i4) + 1002 format(' Iterations: ', i4, ' dt: ', F11.7) + 1004 format(' Total number of processes: ', i5) + 1005 format(' WARNING: compiled for ', i5, ' processes ') + 1003 format(' Number of active processes: ', i5, /) + 1006 format(' BTIO -- ', A, ' write interval: ', i3 /) + + endif + + call mpi_bcast(niter, 1, MPI_INTEGER, + > root, comm_setup, error) + + call mpi_bcast(dt, 1, dp_type, + > root, comm_setup, error) + + call mpi_bcast(grid_points(1), 3, MPI_INTEGER, + > root, comm_setup, error) + + call mpi_bcast(wr_interval, 1, MPI_INTEGER, + > root, comm_setup, error) + + call mpi_bcast(rd_interval, 1, MPI_INTEGER, + > root, comm_setup, error) + + call mpi_bcast(timeron, 1, MPI_LOGICAL, + > root, comm_setup, error) + + call make_set + + do c = 1, maxcells + if ( (cell_size(1,c) .gt. IMAX) .or. + > (cell_size(2,c) .gt. JMAX) .or. + > (cell_size(3,c) .gt. KMAX) ) then + print *,node, c, (cell_size(i,c),i=1,3) + print *,' Problem size too big for compiled array sizes' + goto 999 + endif + end do + + do i = 1, t_last + call timer_clear(i) + end do + + call set_constants + + call initialize + + call setup_btio + idump = 0 + + call lhsinit + + call exact_rhs + + call compute_buffer_size(5) + +c--------------------------------------------------------------------- +c do one time step to touch all code, and reinitialize +c--------------------------------------------------------------------- + call adi + call initialize + +c--------------------------------------------------------------------- +c Synchronize before placing time stamp +c--------------------------------------------------------------------- + do i = 1, t_last + call timer_clear(i) + end do + call mpi_barrier(comm_setup, error) + + call timer_start(1) + + do step = 1, niter + + if (node .eq. root) then + if (mod(step, 20) .eq. 0 .or. step .eq. niter .or. + > step .eq. 1) then + write(*, 200) step + 200 format(' Time step ', i4) + endif + endif + + call adi + + if (iotype .ne. 0) then + if (mod(step, wr_interval).eq.0 .or. step .eq. niter) then + if (node .eq. root) then + print *, 'Writing data set, time step', step + endif + if (step .eq. niter .and. rd_interval .gt. 1) then + rd_interval = 1 + endif + call timer_start(2) + call output_timestep + call timer_stop(2) + idump = idump + 1 + endif + endif + end do + + call timer_start(2) + call btio_cleanup + call timer_stop(2) + + call timer_stop(1) + t = timer_read(1) + + call verify(niter, class, verified) + + call mpi_reduce(t, tmax, 1, + > dp_type, MPI_MAX, + > root, comm_setup, error) + + if (iotype .ne. 0) then + t = timer_read(2) + if (t .ne. 0.d0) t = 1.0d0 / t + call mpi_reduce(t, tiominv, 1, + > dp_type, MPI_SUM, + > root, comm_setup, error) + endif + + if( node .eq. root ) then + n3 = 1.0d0*grid_points(1)*grid_points(2)*grid_points(3) + navg = (grid_points(1)+grid_points(2)+grid_points(3))/3.0 + if( tmax .ne. 0. ) then + mflops = 1.0e-6*float(niter)* + > (3478.8*n3-17655.7*navg**2+28023.7*navg) + > / tmax + else + mflops = 0.0 + endif + + if (iotype .ne. 0) then + mbytes = n3 * 40.0 * idump * 1.0d-6 + tiominv = tiominv / no_nodes + t = 0.0 + if (tiominv .ne. 0.) t = 1.d0 / tiominv + tpc = 0.0 + if (tmax .ne. 0.) tpc = t * 100.0 / tmax + write(*,1100) t, tpc, mbytes, mbytes*tiominv + 1100 format(/' BTIO -- statistics:'/ + > ' I/O timing in seconds : ', f14.2/ + > ' I/O timing percentage : ', f14.2/ + > ' Total data written (MB) : ', f14.2/ + > ' I/O data rate (MB/sec) : ', f14.2) + endif + + call print_results('BT', class, grid_points(1), + > grid_points(2), grid_points(3), niter, maxcells*maxcells, + > total_nodes, tmax, mflops, ' floating point', + > verified, npbversion,compiletime, cs1, cs2, cs3, cs4, cs5, + > cs6, '(none)') + endif + + if (.not.timeron) goto 999 + + do i = 1, t_last + t1(i) = timer_read(i) + end do + t1(t_xsolve) = t1(t_xsolve) - t1(t_xcomm) + t1(t_ysolve) = t1(t_ysolve) - t1(t_ycomm) + t1(t_zsolve) = t1(t_zsolve) - t1(t_zcomm) + t1(t_last+2) = t1(t_xcomm)+t1(t_ycomm)+t1(t_zcomm)+t1(t_exch) + t1(t_last+1) = t1(t_total) - t1(t_last+2) + + call MPI_Reduce(t1, tsum, t_last+2, dp_type, MPI_SUM, + > 0, comm_setup, error) + call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, + > 0, comm_setup, error) + call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, + > 0, comm_setup, error) + + if (node .eq. 0) then + write(*, 800) total_nodes + do i = 1, t_last+2 + tsum(i) = tsum(i) / total_nodes + write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i) + end do + endif + 800 format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', + > 5x, 'average') + 810 format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4)) + + 999 continue + call mpi_barrier(MPI_COMM_WORLD, error) + call mpi_finalize(error) + + end + diff --git b/NPB3.3-MPI/BT/btio.f a/NPB3.3-MPI/BT/btio.f new file mode 100644 index 0000000..1fb730b --- /dev/null +++ a/NPB3.3-MPI/BT/btio.f @@ -0,0 +1,72 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup_btio + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine output_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine btio_cleanup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine btio_verify(verified) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + logical verified + + verified = .true. + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + double precision xce_acc(5) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine checksum_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + return + end diff --git b/NPB3.3-MPI/BT/btio_common.f a/NPB3.3-MPI/BT/btio_common.f new file mode 100644 index 0000000..9227a12 --- /dev/null +++ a/NPB3.3-MPI/BT/btio_common.f @@ -0,0 +1,30 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine clear_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer cio, kio, jio, ix + + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + do ix=0,cell_size(1,cio)-1 + u(1,ix, jio,kio,cio) = 0 + u(2,ix, jio,kio,cio) = 0 + u(3,ix, jio,kio,cio) = 0 + u(4,ix, jio,kio,cio) = 0 + u(5,ix, jio,kio,cio) = 0 + enddo + enddo + enddo + enddo + + return + end + diff --git b/NPB3.3-MPI/BT/copy_faces.f a/NPB3.3-MPI/BT/copy_faces.f new file mode 100644 index 0000000..5261d30 --- /dev/null +++ a/NPB3.3-MPI/BT/copy_faces.f @@ -0,0 +1,322 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine copy_faces + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c +c This function copies the face values of a variable defined on a set +c of cells to the overlap locations of the adjacent sets of cells. +c Because a set of cells interfaces in each direction with exactly one +c other set, we only need to fill six different buffers. We could try to +c overlap communication with computation, by computing +c some internal values while communicating boundary values, but this +c adds so much overhead that it's not clearly useful. +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i, j, k, c, m, requests(0:11), p0, p1, + > p2, p3, p4, p5, b_size(0:5), ss(0:5), + > sr(0:5), error, statuses(MPI_STATUS_SIZE, 0:11) + +c--------------------------------------------------------------------- +c exit immediately if there are no faces to be copied +c--------------------------------------------------------------------- + if (no_nodes .eq. 1) then + call compute_rhs + return + endif + + ss(0) = start_send_east + ss(1) = start_send_west + ss(2) = start_send_north + ss(3) = start_send_south + ss(4) = start_send_top + ss(5) = start_send_bottom + + sr(0) = start_recv_east + sr(1) = start_recv_west + sr(2) = start_recv_north + sr(3) = start_recv_south + sr(4) = start_recv_top + sr(5) = start_recv_bottom + + b_size(0) = east_size + b_size(1) = west_size + b_size(2) = north_size + b_size(3) = south_size + b_size(4) = top_size + b_size(5) = bottom_size + +c--------------------------------------------------------------------- +c because the difference stencil for the diagonalized scheme is +c orthogonal, we do not have to perform the staged copying of faces, +c but can send all face information simultaneously to the neighboring +c cells in all directions +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_bpack) + p0 = 0 + p1 = 0 + p2 = 0 + p3 = 0 + p4 = 0 + p5 = 0 + + do c = 1, ncells + +c--------------------------------------------------------------------- +c fill the buffer to be sent to eastern neighbors (i-dir) +c--------------------------------------------------------------------- + if (cell_coord(1,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = cell_size(1,c)-2, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(0)+p0) = u(m,i,j,k,c) + p0 = p0 + 1 + end do + end do + end do + end do + endif + +c--------------------------------------------------------------------- +c fill the buffer to be sent to western neighbors +c--------------------------------------------------------------------- + if (cell_coord(1,c) .ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, 1 + do m = 1, 5 + out_buffer(ss(1)+p1) = u(m,i,j,k,c) + p1 = p1 + 1 + end do + end do + end do + end do + + endif + +c--------------------------------------------------------------------- +c fill the buffer to be sent to northern neighbors (j_dir) +c--------------------------------------------------------------------- + if (cell_coord(2,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = cell_size(2,c)-2, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(2)+p2) = u(m,i,j,k,c) + p2 = p2 + 1 + end do + end do + end do + end do + endif + +c--------------------------------------------------------------------- +c fill the buffer to be sent to southern neighbors +c--------------------------------------------------------------------- + if (cell_coord(2,c).ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = 0, 1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(3)+p3) = u(m,i,j,k,c) + p3 = p3 + 1 + end do + end do + end do + end do + endif + +c--------------------------------------------------------------------- +c fill the buffer to be sent to top neighbors (k-dir) +c--------------------------------------------------------------------- + if (cell_coord(3,c) .ne. ncells) then + do k = cell_size(3,c)-2, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(4)+p4) = u(m,i,j,k,c) + p4 = p4 + 1 + end do + end do + end do + end do + endif + +c--------------------------------------------------------------------- +c fill the buffer to be sent to bottom neighbors +c--------------------------------------------------------------------- + if (cell_coord(3,c).ne. 1) then + do k=0, 1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + out_buffer(ss(5)+p5) = u(m,i,j,k,c) + p5 = p5 + 1 + end do + end do + end do + end do + endif + +c--------------------------------------------------------------------- +c cell loop +c--------------------------------------------------------------------- + end do + if (timeron) call timer_stop(t_bpack) + + if (timeron) call timer_start(t_exch) + call mpi_irecv(in_buffer(sr(0)), b_size(0), + > dp_type, successor(1), WEST, + > comm_rhs, requests(0), error) + call mpi_irecv(in_buffer(sr(1)), b_size(1), + > dp_type, predecessor(1), EAST, + > comm_rhs, requests(1), error) + call mpi_irecv(in_buffer(sr(2)), b_size(2), + > dp_type, successor(2), SOUTH, + > comm_rhs, requests(2), error) + call mpi_irecv(in_buffer(sr(3)), b_size(3), + > dp_type, predecessor(2), NORTH, + > comm_rhs, requests(3), error) + call mpi_irecv(in_buffer(sr(4)), b_size(4), + > dp_type, successor(3), BOTTOM, + > comm_rhs, requests(4), error) + call mpi_irecv(in_buffer(sr(5)), b_size(5), + > dp_type, predecessor(3), TOP, + > comm_rhs, requests(5), error) + + call mpi_isend(out_buffer(ss(0)), b_size(0), + > dp_type, successor(1), EAST, + > comm_rhs, requests(6), error) + call mpi_isend(out_buffer(ss(1)), b_size(1), + > dp_type, predecessor(1), WEST, + > comm_rhs, requests(7), error) + call mpi_isend(out_buffer(ss(2)), b_size(2), + > dp_type,successor(2), NORTH, + > comm_rhs, requests(8), error) + call mpi_isend(out_buffer(ss(3)), b_size(3), + > dp_type,predecessor(2), SOUTH, + > comm_rhs, requests(9), error) + call mpi_isend(out_buffer(ss(4)), b_size(4), + > dp_type,successor(3), TOP, + > comm_rhs, requests(10), error) + call mpi_isend(out_buffer(ss(5)), b_size(5), + > dp_type,predecessor(3), BOTTOM, + > comm_rhs,requests(11), error) + + + call mpi_waitall(12, requests, statuses, error) + if (timeron) call timer_stop(t_exch) + +c--------------------------------------------------------------------- +c unpack the data that has just been received; +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_bpack) + p0 = 0 + p1 = 0 + p2 = 0 + p3 = 0 + p4 = 0 + p5 = 0 + + do c = 1, ncells + + if (cell_coord(1,c) .ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = -2, -1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(1)+p0) + p0 = p0 + 1 + end do + end do + end do + end do + endif + + if (cell_coord(1,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = cell_size(1,c), cell_size(1,c)+1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(0)+p1) + p1 = p1 + 1 + end do + end do + end do + end do + end if + + if (cell_coord(2,c) .ne. 1) then + do k = 0, cell_size(3,c)-1 + do j = -2, -1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(3)+p2) + p2 = p2 + 1 + end do + end do + end do + end do + + endif + + if (cell_coord(2,c) .ne. ncells) then + do k = 0, cell_size(3,c)-1 + do j = cell_size(2,c), cell_size(2,c)+1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(2)+p3) + p3 = p3 + 1 + end do + end do + end do + end do + endif + + if (cell_coord(3,c) .ne. 1) then + do k = -2, -1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(5)+p4) + p4 = p4 + 1 + end do + end do + end do + end do + endif + + if (cell_coord(3,c) .ne. ncells) then + do k = cell_size(3,c), cell_size(3,c)+1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + u(m,i,j,k,c) = in_buffer(sr(4)+p5) + p5 = p5 + 1 + end do + end do + end do + end do + endif + +c--------------------------------------------------------------------- +c cells loop +c--------------------------------------------------------------------- + end do + if (timeron) call timer_stop(t_bpack) + +c--------------------------------------------------------------------- +c do the rest of the rhs that uses the copied face values +c--------------------------------------------------------------------- + call compute_rhs + + return + end diff --git b/NPB3.3-MPI/BT/define.f a/NPB3.3-MPI/BT/define.f new file mode 100644 index 0000000..03c4c6e --- /dev/null +++ a/NPB3.3-MPI/BT/define.f @@ -0,0 +1,64 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine compute_buffer_size(dim) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + + integer c, dim, face_size + + if (ncells .eq. 1) return + +c--------------------------------------------------------------------- +c compute the actual sizes of the buffers; note that there is +c always one cell face that doesn't need buffer space, because it +c is at the boundary of the grid +c--------------------------------------------------------------------- + west_size = 0 + east_size = 0 + + do c = 1, ncells + face_size = cell_size(2,c) * cell_size(3,c) * dim * 2 + if (cell_coord(1,c).ne.1) west_size = west_size + face_size + if (cell_coord(1,c).ne.ncells) east_size = east_size + + > face_size + end do + + north_size = 0 + south_size = 0 + do c = 1, ncells + face_size = cell_size(1,c)*cell_size(3,c) * dim * 2 + if (cell_coord(2,c).ne.1) south_size = south_size + face_size + if (cell_coord(2,c).ne.ncells) north_size = north_size + + > face_size + end do + + top_size = 0 + bottom_size = 0 + do c = 1, ncells + face_size = cell_size(1,c) * cell_size(2,c) * dim * 2 + if (cell_coord(3,c).ne.1) bottom_size = bottom_size + + > face_size + if (cell_coord(3,c).ne.ncells) top_size = top_size + + > face_size + end do + + start_send_west = 1 + start_send_east = start_send_west + west_size + start_send_south = start_send_east + east_size + start_send_north = start_send_south + south_size + start_send_bottom = start_send_north + north_size + start_send_top = start_send_bottom + bottom_size + start_recv_west = 1 + start_recv_east = start_recv_west + west_size + start_recv_south = start_recv_east + east_size + start_recv_north = start_recv_south + south_size + start_recv_bottom = start_recv_north + north_size + start_recv_top = start_recv_bottom + bottom_size + + return + end + diff --git b/NPB3.3-MPI/BT/epio.f a/NPB3.3-MPI/BT/epio.f new file mode 100644 index 0000000..52b6309 --- /dev/null +++ a/NPB3.3-MPI/BT/epio.f @@ -0,0 +1,165 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup_btio + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + character*(128) newfilenm + integer m + + if (node .lt. 10000) then + write (newfilenm, 996) filenm,node + else + print *, 'error generating file names (> 10000 nodes)' + stop + endif + +996 format (a,'.',i4.4) + + open (unit=99, file=newfilenm, form='unformatted', + $ status='unknown') + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine output_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + include 'header.h' + include 'mpinpb.h' + + integer ix, iio, jio, kio, cio, aio + + do cio=1,ncells + write(99) + $ ((((u(aio,ix, jio,kio,cio),aio=1,5), + $ ix=0, cell_size(1,cio)-1), + $ jio=0, cell_size(2,cio)-1), + $ kio=0, cell_size(3,cio)-1) + enddo + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + rewind(99) + call acc_sub_norms(idump+1) + + rewind(99) + idump_sub = 0 + endif + endif + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + include 'header.h' + include 'mpinpb.h' + + integer idump_cur + + integer ix, jio, kio, cio, ii, m, ichunk + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + do cio=1,ncells + read(99) + $ ((((u(m,ix, jio,kio,cio),m=1,5), + $ ix=0, cell_size(1,cio)-1), + $ jio=0, cell_size(2,cio)-1), + $ kio=0, cell_size(3,cio)-1) + enddo + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine btio_cleanup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + close(unit=99) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + double precision xce_acc(5) + + character*(128) newfilenm + integer m + + if (rd_interval .gt. 0) goto 20 + + if (node .lt. 10000) then + write (newfilenm, 996) filenm,node + else + print *, 'error generating file names (> 10000 nodes)' + stop + endif + +996 format (a,'.',i4.4) + + open (unit=99, file=newfilenm, + $ form='unformatted') + +c clear the last time step + + call clear_timestep + +c read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + close(unit=99) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end diff --git b/NPB3.3-MPI/BT/error.f a/NPB3.3-MPI/BT/error.f new file mode 100644 index 0000000..147a582 --- /dev/null +++ a/NPB3.3-MPI/BT/error.f @@ -0,0 +1,106 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine error_norm(rms) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c this function computes the norm of the difference between the +c computed solution and the exact solution +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer c, i, j, k, m, ii, jj, kk, d, error + double precision xi, eta, zeta, u_exact(5), rms(5), rms_work(5), + > add + + do m = 1, 5 + rms_work(m) = 0.0d0 + enddo + + do c = 1, ncells + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, u_exact) + + do m = 1, 5 + add = u(m,ii,jj,kk,c)-u_exact(m) + rms_work(m) = rms_work(m) + add*add + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + kk = kk + 1 + enddo + enddo + + call mpi_allreduce(rms_work, rms, 5, dp_type, + > MPI_SUM, comm_setup, error) + + do m = 1, 5 + do d = 1, 3 + rms(m) = rms(m) / dble(grid_points(d)-2) + enddo + rms(m) = dsqrt(rms(m)) + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine rhs_norm(rms) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer c, i, j, k, d, m, error + double precision rms(5), rms_work(5), add + + do m = 1, 5 + rms_work(m) = 0.0d0 + enddo + + do c = 1, ncells + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + add = rhs(m,i,j,k,c) + rms_work(m) = rms_work(m) + add*add + enddo + enddo + enddo + enddo + enddo + + call mpi_allreduce(rms_work, rms, 5, dp_type, + > MPI_SUM, comm_setup, error) + + do m = 1, 5 + do d = 1, 3 + rms(m) = rms(m) / dble(grid_points(d)-2) + enddo + rms(m) = dsqrt(rms(m)) + enddo + + return + end + diff --git b/NPB3.3-MPI/BT/exact_rhs.f a/NPB3.3-MPI/BT/exact_rhs.f new file mode 100644 index 0000000..26a2871 --- /dev/null +++ a/NPB3.3-MPI/BT/exact_rhs.f @@ -0,0 +1,360 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine exact_rhs + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c compute the right hand side based on exact solution +c--------------------------------------------------------------------- + + include 'header.h' + + double precision dtemp(5), xi, eta, zeta, dtpp + integer c, m, i, j, k, ip1, im1, jp1, + > jm1, km1, kp1 + + +c--------------------------------------------------------------------- +c loop over all cells owned by this node +c--------------------------------------------------------------------- + do c = 1, ncells + +c--------------------------------------------------------------------- +c initialize +c--------------------------------------------------------------------- + do k= 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = 0.0d0 + enddo + enddo + enddo + enddo + +c--------------------------------------------------------------------- +c xi-direction flux differences +c--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + zeta = dble(k+cell_low(3,c)) * dnzm1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + eta = dble(j+cell_low(2,c)) * dnym1 + + do i=-2*(1-start(1,c)), cell_size(1,c)+1-2*end(1,c) + xi = dble(i+cell_low(1,c)) * dnxm1 + + call exact_solution(xi, eta, zeta, dtemp) + do m = 1, 5 + ue(i,m) = dtemp(m) + enddo + + dtpp = 1.0d0 / dtemp(1) + + do m = 2, 5 + buf(i,m) = dtpp * dtemp(m) + enddo + + cuf(i) = buf(i,2) * buf(i,2) + buf(i,1) = cuf(i) + buf(i,3) * buf(i,3) + + > buf(i,4) * buf(i,4) + q(i) = 0.5d0*(buf(i,2)*ue(i,2) + buf(i,3)*ue(i,3) + + > buf(i,4)*ue(i,4)) + + enddo + + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + im1 = i-1 + ip1 = i+1 + + forcing(1,i,j,k,c) = forcing(1,i,j,k,c) - + > tx2*( ue(ip1,2)-ue(im1,2) )+ + > dx1tx1*(ue(ip1,1)-2.0d0*ue(i,1)+ue(im1,1)) + + forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tx2 * ( + > (ue(ip1,2)*buf(ip1,2)+c2*(ue(ip1,5)-q(ip1)))- + > (ue(im1,2)*buf(im1,2)+c2*(ue(im1,5)-q(im1))))+ + > xxcon1*(buf(ip1,2)-2.0d0*buf(i,2)+buf(im1,2))+ + > dx2tx1*( ue(ip1,2)-2.0d0* ue(i,2)+ue(im1,2)) + + forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tx2 * ( + > ue(ip1,3)*buf(ip1,2)-ue(im1,3)*buf(im1,2))+ + > xxcon2*(buf(ip1,3)-2.0d0*buf(i,3)+buf(im1,3))+ + > dx3tx1*( ue(ip1,3)-2.0d0*ue(i,3) +ue(im1,3)) + + forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tx2*( + > ue(ip1,4)*buf(ip1,2)-ue(im1,4)*buf(im1,2))+ + > xxcon2*(buf(ip1,4)-2.0d0*buf(i,4)+buf(im1,4))+ + > dx4tx1*( ue(ip1,4)-2.0d0* ue(i,4)+ ue(im1,4)) + + forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tx2*( + > buf(ip1,2)*(c1*ue(ip1,5)-c2*q(ip1))- + > buf(im1,2)*(c1*ue(im1,5)-c2*q(im1)))+ + > 0.5d0*xxcon3*(buf(ip1,1)-2.0d0*buf(i,1)+ + > buf(im1,1))+ + > xxcon4*(cuf(ip1)-2.0d0*cuf(i)+cuf(im1))+ + > xxcon5*(buf(ip1,5)-2.0d0*buf(i,5)+buf(im1,5))+ + > dx5tx1*( ue(ip1,5)-2.0d0* ue(i,5)+ ue(im1,5)) + enddo + +c--------------------------------------------------------------------- +c Fourth-order dissipation +c--------------------------------------------------------------------- + if (start(1,c) .gt. 0) then + do m = 1, 5 + i = 1 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (5.0d0*ue(i,m) - 4.0d0*ue(i+1,m) +ue(i+2,m)) + i = 2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (-4.0d0*ue(i-1,m) + 6.0d0*ue(i,m) - + > 4.0d0*ue(i+1,m) + ue(i+2,m)) + enddo + endif + + do i = start(1,c)*3, cell_size(1,c)-3*end(1,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp* + > (ue(i-2,m) - 4.0d0*ue(i-1,m) + + > 6.0d0*ue(i,m) - 4.0d0*ue(i+1,m) + ue(i+2,m)) + enddo + enddo + + if (end(1,c) .gt. 0) then + do m = 1, 5 + i = cell_size(1,c)-3 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (ue(i-2,m) - 4.0d0*ue(i-1,m) + + > 6.0d0*ue(i,m) - 4.0d0*ue(i+1,m)) + i = cell_size(1,c)-2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (ue(i-2,m) - 4.0d0*ue(i-1,m) + 5.0d0*ue(i,m)) + enddo + endif + + enddo + enddo + +c--------------------------------------------------------------------- +c eta-direction flux differences +c--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + zeta = dble(k+cell_low(3,c)) * dnzm1 + do i=start(1,c), cell_size(1,c)-end(1,c)-1 + xi = dble(i+cell_low(1,c)) * dnxm1 + + do j=-2*(1-start(2,c)), cell_size(2,c)+1-2*end(2,c) + eta = dble(j+cell_low(2,c)) * dnym1 + + call exact_solution(xi, eta, zeta, dtemp) + do m = 1, 5 + ue(j,m) = dtemp(m) + enddo + + dtpp = 1.0d0/dtemp(1) + + do m = 2, 5 + buf(j,m) = dtpp * dtemp(m) + enddo + + cuf(j) = buf(j,3) * buf(j,3) + buf(j,1) = cuf(j) + buf(j,2) * buf(j,2) + + > buf(j,4) * buf(j,4) + q(j) = 0.5d0*(buf(j,2)*ue(j,2) + buf(j,3)*ue(j,3) + + > buf(j,4)*ue(j,4)) + enddo + + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + jm1 = j-1 + jp1 = j+1 + + forcing(1,i,j,k,c) = forcing(1,i,j,k,c) - + > ty2*( ue(jp1,3)-ue(jm1,3) )+ + > dy1ty1*(ue(jp1,1)-2.0d0*ue(j,1)+ue(jm1,1)) + + forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - ty2*( + > ue(jp1,2)*buf(jp1,3)-ue(jm1,2)*buf(jm1,3))+ + > yycon2*(buf(jp1,2)-2.0d0*buf(j,2)+buf(jm1,2))+ + > dy2ty1*( ue(jp1,2)-2.0* ue(j,2)+ ue(jm1,2)) + + forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - ty2*( + > (ue(jp1,3)*buf(jp1,3)+c2*(ue(jp1,5)-q(jp1)))- + > (ue(jm1,3)*buf(jm1,3)+c2*(ue(jm1,5)-q(jm1))))+ + > yycon1*(buf(jp1,3)-2.0d0*buf(j,3)+buf(jm1,3))+ + > dy3ty1*( ue(jp1,3)-2.0d0*ue(j,3) +ue(jm1,3)) + + forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - ty2*( + > ue(jp1,4)*buf(jp1,3)-ue(jm1,4)*buf(jm1,3))+ + > yycon2*(buf(jp1,4)-2.0d0*buf(j,4)+buf(jm1,4))+ + > dy4ty1*( ue(jp1,4)-2.0d0*ue(j,4)+ ue(jm1,4)) + + forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - ty2*( + > buf(jp1,3)*(c1*ue(jp1,5)-c2*q(jp1))- + > buf(jm1,3)*(c1*ue(jm1,5)-c2*q(jm1)))+ + > 0.5d0*yycon3*(buf(jp1,1)-2.0d0*buf(j,1)+ + > buf(jm1,1))+ + > yycon4*(cuf(jp1)-2.0d0*cuf(j)+cuf(jm1))+ + > yycon5*(buf(jp1,5)-2.0d0*buf(j,5)+buf(jm1,5))+ + > dy5ty1*(ue(jp1,5)-2.0d0*ue(j,5)+ue(jm1,5)) + enddo + +c--------------------------------------------------------------------- +c Fourth-order dissipation +c--------------------------------------------------------------------- + if (start(2,c) .gt. 0) then + do m = 1, 5 + j = 1 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (5.0d0*ue(j,m) - 4.0d0*ue(j+1,m) +ue(j+2,m)) + j = 2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (-4.0d0*ue(j-1,m) + 6.0d0*ue(j,m) - + > 4.0d0*ue(j+1,m) + ue(j+2,m)) + enddo + endif + + do j = start(2,c)*3, cell_size(2,c)-3*end(2,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp* + > (ue(j-2,m) - 4.0d0*ue(j-1,m) + + > 6.0d0*ue(j,m) - 4.0d0*ue(j+1,m) + ue(j+2,m)) + enddo + enddo + + if (end(2,c) .gt. 0) then + do m = 1, 5 + j = cell_size(2,c)-3 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (ue(j-2,m) - 4.0d0*ue(j-1,m) + + > 6.0d0*ue(j,m) - 4.0d0*ue(j+1,m)) + j = cell_size(2,c)-2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (ue(j-2,m) - 4.0d0*ue(j-1,m) + 5.0d0*ue(j,m)) + + enddo + endif + + enddo + enddo + +c--------------------------------------------------------------------- +c zeta-direction flux differences +c--------------------------------------------------------------------- + do j=start(2,c), cell_size(2,c)-end(2,c)-1 + eta = dble(j+cell_low(2,c)) * dnym1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + xi = dble(i+cell_low(1,c)) * dnxm1 + + do k=-2*(1-start(3,c)), cell_size(3,c)+1-2*end(3,c) + zeta = dble(k+cell_low(3,c)) * dnzm1 + + call exact_solution(xi, eta, zeta, dtemp) + do m = 1, 5 + ue(k,m) = dtemp(m) + enddo + + dtpp = 1.0d0/dtemp(1) + + do m = 2, 5 + buf(k,m) = dtpp * dtemp(m) + enddo + + cuf(k) = buf(k,4) * buf(k,4) + buf(k,1) = cuf(k) + buf(k,2) * buf(k,2) + + > buf(k,3) * buf(k,3) + q(k) = 0.5d0*(buf(k,2)*ue(k,2) + buf(k,3)*ue(k,3) + + > buf(k,4)*ue(k,4)) + enddo + + do k=start(3,c), cell_size(3,c)-end(3,c)-1 + km1 = k-1 + kp1 = k+1 + + forcing(1,i,j,k,c) = forcing(1,i,j,k,c) - + > tz2*( ue(kp1,4)-ue(km1,4) )+ + > dz1tz1*(ue(kp1,1)-2.0d0*ue(k,1)+ue(km1,1)) + + forcing(2,i,j,k,c) = forcing(2,i,j,k,c) - tz2 * ( + > ue(kp1,2)*buf(kp1,4)-ue(km1,2)*buf(km1,4))+ + > zzcon2*(buf(kp1,2)-2.0d0*buf(k,2)+buf(km1,2))+ + > dz2tz1*( ue(kp1,2)-2.0d0* ue(k,2)+ ue(km1,2)) + + forcing(3,i,j,k,c) = forcing(3,i,j,k,c) - tz2 * ( + > ue(kp1,3)*buf(kp1,4)-ue(km1,3)*buf(km1,4))+ + > zzcon2*(buf(kp1,3)-2.0d0*buf(k,3)+buf(km1,3))+ + > dz3tz1*(ue(kp1,3)-2.0d0*ue(k,3)+ue(km1,3)) + + forcing(4,i,j,k,c) = forcing(4,i,j,k,c) - tz2 * ( + > (ue(kp1,4)*buf(kp1,4)+c2*(ue(kp1,5)-q(kp1)))- + > (ue(km1,4)*buf(km1,4)+c2*(ue(km1,5)-q(km1))))+ + > zzcon1*(buf(kp1,4)-2.0d0*buf(k,4)+buf(km1,4))+ + > dz4tz1*( ue(kp1,4)-2.0d0*ue(k,4) +ue(km1,4)) + + forcing(5,i,j,k,c) = forcing(5,i,j,k,c) - tz2 * ( + > buf(kp1,4)*(c1*ue(kp1,5)-c2*q(kp1))- + > buf(km1,4)*(c1*ue(km1,5)-c2*q(km1)))+ + > 0.5d0*zzcon3*(buf(kp1,1)-2.0d0*buf(k,1) + > +buf(km1,1))+ + > zzcon4*(cuf(kp1)-2.0d0*cuf(k)+cuf(km1))+ + > zzcon5*(buf(kp1,5)-2.0d0*buf(k,5)+buf(km1,5))+ + > dz5tz1*( ue(kp1,5)-2.0d0*ue(k,5)+ ue(km1,5)) + enddo + +c--------------------------------------------------------------------- +c Fourth-order dissipation +c--------------------------------------------------------------------- + if (start(3,c) .gt. 0) then + do m = 1, 5 + k = 1 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (5.0d0*ue(k,m) - 4.0d0*ue(k+1,m) +ue(k+2,m)) + k = 2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (-4.0d0*ue(k-1,m) + 6.0d0*ue(k,m) - + > 4.0d0*ue(k+1,m) + ue(k+2,m)) + enddo + endif + + do k = start(3,c)*3, cell_size(3,c)-3*end(3,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp* + > (ue(k-2,m) - 4.0d0*ue(k-1,m) + + > 6.0d0*ue(k,m) - 4.0d0*ue(k+1,m) + ue(k+2,m)) + enddo + enddo + + if (end(3,c) .gt. 0) then + do m = 1, 5 + k = cell_size(3,c)-3 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (ue(k-2,m) - 4.0d0*ue(k-1,m) + + > 6.0d0*ue(k,m) - 4.0d0*ue(k+1,m)) + k = cell_size(3,c)-2 + forcing(m,i,j,k,c) = forcing(m,i,j,k,c) - dssp * + > (ue(k-2,m) - 4.0d0*ue(k-1,m) + 5.0d0*ue(k,m)) + enddo + endif + + enddo + enddo + +c--------------------------------------------------------------------- +c now change the sign of the forcing function, +c--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + forcing(m,i,j,k,c) = -1.d0 * forcing(m,i,j,k,c) + enddo + enddo + enddo + enddo + + enddo + + return + end diff --git b/NPB3.3-MPI/BT/exact_solution.f a/NPB3.3-MPI/BT/exact_solution.f new file mode 100644 index 0000000..b093b46 --- /dev/null +++ a/NPB3.3-MPI/BT/exact_solution.f @@ -0,0 +1,29 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine exact_solution(xi,eta,zeta,dtemp) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c this function returns the exact solution at point xi, eta, zeta +c--------------------------------------------------------------------- + + include 'header.h' + + double precision xi, eta, zeta, dtemp(5) + integer m + + do m = 1, 5 + dtemp(m) = ce(m,1) + + > xi*(ce(m,2) + xi*(ce(m,5) + xi*(ce(m,8) + xi*ce(m,11)))) + + > eta*(ce(m,3) + eta*(ce(m,6) + eta*(ce(m,9) + eta*ce(m,12))))+ + > zeta*(ce(m,4) + zeta*(ce(m,7) + zeta*(ce(m,10) + + > zeta*ce(m,13)))) + enddo + + return + end + + diff --git b/NPB3.3-MPI/BT/fortran_io.f a/NPB3.3-MPI/BT/fortran_io.f new file mode 100644 index 0000000..d3085a0 --- /dev/null +++ a/NPB3.3-MPI/BT/fortran_io.f @@ -0,0 +1,174 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup_btio + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + character*(128) newfilenm + integer m, ierr + + if (node.eq.root) record_length = 40/fortran_rec_sz + call mpi_bcast(record_length, 1, MPI_INTEGER, + > root, comm_setup, ierr) + + open (unit=99, file=filenm, + $ form='unformatted', access='direct', + $ recl=record_length) + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine output_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + include 'header.h' + include 'mpinpb.h' + + integer ix, jio, kio, cio + + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=(cell_low(1,cio) + + $ PROBLEM_SIZE*((cell_low(2,cio)+jio) + + $ PROBLEM_SIZE*((cell_low(3,cio)+kio) + + $ PROBLEM_SIZE*idump_sub))) + + do ix=0,cell_size(1,cio)-1 + write(99, rec=iseek+ix+1) + $ u(1,ix, jio,kio,cio), + $ u(2,ix, jio,kio,cio), + $ u(3,ix, jio,kio,cio), + $ u(4,ix, jio,kio,cio), + $ u(5,ix, jio,kio,cio) + enddo + enddo + enddo + enddo + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + call acc_sub_norms(idump+1) + + idump_sub = 0 + endif + endif + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + include 'header.h' + include 'mpinpb.h' + + integer idump_cur + + integer ix, jio, kio, cio, ii, m, ichunk + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=(cell_low(1,cio) + + $ PROBLEM_SIZE*((cell_low(2,cio)+jio) + + $ PROBLEM_SIZE*((cell_low(3,cio)+kio) + + $ PROBLEM_SIZE*ii))) + + + do ix=0,cell_size(1,cio)-1 + read(99, rec=iseek+ix+1) + $ u(1,ix, jio,kio,cio), + $ u(2,ix, jio,kio,cio), + $ u(3,ix, jio,kio,cio), + $ u(4,ix, jio,kio,cio), + $ u(5,ix, jio,kio,cio) + enddo + enddo + enddo + enddo + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine btio_cleanup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + close(unit=99) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + include 'header.h' + include 'mpinpb.h' + + double precision xce_acc(5) + integer m + + if (rd_interval .gt. 0) goto 20 + + open (unit=99, file=filenm, + $ form='unformatted', access='direct', + $ recl=record_length) + +c clear the last time step + + call clear_timestep + +c read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + close(unit=99) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end diff --git b/NPB3.3-MPI/BT/full_mpiio.f a/NPB3.3-MPI/BT/full_mpiio.f new file mode 100644 index 0000000..ecfd41c --- /dev/null +++ a/NPB3.3-MPI/BT/full_mpiio.f @@ -0,0 +1,307 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup_btio + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer ierr + integer mstatus(MPI_STATUS_SIZE) + integer sizes(4), starts(4), subsizes(4) + integer cell_btype(maxcells), cell_ftype(maxcells) + integer cell_blength(maxcells) + integer info + character*20 cb_nodes, cb_size + integer c, m + integer cell_disp(maxcells) + + call mpi_bcast(collbuf_nodes, 1, MPI_INTEGER, + > root, comm_setup, ierr) + + call mpi_bcast(collbuf_size, 1, MPI_INTEGER, + > root, comm_setup, ierr) + + if (collbuf_nodes .eq. 0) then + info = MPI_INFO_NULL + else + write (cb_nodes,*) collbuf_nodes + write (cb_size,*) collbuf_size + call MPI_Info_create(info, ierr) + call MPI_Info_set(info, 'cb_nodes', cb_nodes, ierr) + call MPI_Info_set(info, 'cb_buffer_size', cb_size, ierr) + call MPI_Info_set(info, 'collective_buffering', 'true', ierr) + endif + + call MPI_Type_contiguous(5, MPI_DOUBLE_PRECISION, + $ element, ierr) + call MPI_Type_commit(element, ierr) + call MPI_Type_extent(element, eltext, ierr) + + do c = 1, ncells +c +c Outer array dimensions ar same for every cell +c + sizes(1) = IMAX+4 + sizes(2) = JMAX+4 + sizes(3) = KMAX+4 +c +c 4th dimension is cell number, total of maxcells cells +c + sizes(4) = maxcells +c +c Internal dimensions of cells can differ slightly between cells +c + subsizes(1) = cell_size(1, c) + subsizes(2) = cell_size(2, c) + subsizes(3) = cell_size(3, c) +c +c Cell is 4th dimension, 1 cell per cell type to handle varying +c cell sub-array sizes +c + subsizes(4) = 1 + +c +c type constructors use 0-based start addresses +c + starts(1) = 2 + starts(2) = 2 + starts(3) = 2 + starts(4) = c-1 + +c +c Create buftype for a cell +c + call MPI_Type_create_subarray(4, sizes, subsizes, + $ starts, MPI_ORDER_FORTRAN, element, + $ cell_btype(c), ierr) +c +c block length and displacement for joining cells - +c 1 cell buftype per block, cell buftypes have own displacment +c generated from cell number (4th array dimension) +c + cell_blength(c) = 1 + cell_disp(c) = 0 + + enddo +c +c Create combined buftype for all cells +c + call MPI_Type_struct(ncells, cell_blength, cell_disp, + $ cell_btype, combined_btype, ierr) + call MPI_Type_commit(combined_btype, ierr) + + do c = 1, ncells +c +c Entire array size +c + sizes(1) = PROBLEM_SIZE + sizes(2) = PROBLEM_SIZE + sizes(3) = PROBLEM_SIZE + +c +c Size of c'th cell +c + subsizes(1) = cell_size(1, c) + subsizes(2) = cell_size(2, c) + subsizes(3) = cell_size(3, c) + +c +c Starting point in full array of c'th cell +c + starts(1) = cell_low(1,c) + starts(2) = cell_low(2,c) + starts(3) = cell_low(3,c) + + call MPI_Type_create_subarray(3, sizes, subsizes, + $ starts, MPI_ORDER_FORTRAN, + $ element, cell_ftype(c), ierr) + cell_blength(c) = 1 + cell_disp(c) = 0 + enddo + + call MPI_Type_struct(ncells, cell_blength, cell_disp, + $ cell_ftype, combined_ftype, ierr) + call MPI_Type_commit(combined_ftype, ierr) + + iseek=0 + if (node .eq. root) then + call MPI_File_delete(filenm, MPI_INFO_NULL, ierr) + endif + + + call MPI_Barrier(comm_solve, ierr) + + call MPI_File_open(comm_solve, + $ filenm, + $ MPI_MODE_RDWR+MPI_MODE_CREATE, + $ MPI_INFO_NULL, fp, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error opening file' + stop + endif + + call MPI_File_set_view(fp, iseek, element, + $ combined_ftype, 'native', info, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error setting file view' + stop + endif + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine output_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + include 'header.h' + include 'mpinpb.h' + + integer mstatus(MPI_STATUS_SIZE) + integer ierr + + call MPI_File_write_at_all(fp, iseek, u, + $ 1, combined_btype, mstatus, ierr) + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error writing to file' + stop + endif + + call MPI_Type_size(combined_btype, iosize, ierr) + iseek = iseek + iosize/eltext + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + iseek = 0 + call acc_sub_norms(idump+1) + + iseek = 0 + idump_sub = 0 + endif + endif + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + include 'header.h' + include 'mpinpb.h' + + integer idump_cur + + integer ii, m, ichunk + integer ierr + integer mstatus(MPI_STATUS_SIZE) + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + + call MPI_File_read_at_all(fp, iseek, u, + $ 1, combined_btype, mstatus, ierr) + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error reading back file' + call MPI_File_close(fp, ierr) + stop + endif + + call MPI_Type_size(combined_btype, iosize, ierr) + iseek = iseek + iosize/eltext + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine btio_cleanup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + include 'header.h' + include 'mpinpb.h' + + integer ierr + + call MPI_File_close(fp, ierr) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + + subroutine accumulate_norms(xce_acc) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + double precision xce_acc(5) + integer m, ierr + + if (rd_interval .gt. 0) goto 20 + + call MPI_File_open(comm_solve, + $ filenm, + $ MPI_MODE_RDONLY, + $ MPI_INFO_NULL, + $ fp, + $ ierr) + + iseek = 0 + call MPI_File_set_view(fp, iseek, element, combined_ftype, + $ 'native', MPI_INFO_NULL, ierr) + +c clear the last time step + + call clear_timestep + +c read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + call MPI_File_close(fp, ierr) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end + diff --git b/NPB3.3-MPI/BT/header.h a/NPB3.3-MPI/BT/header.h new file mode 100644 index 0000000..cb815eb --- /dev/null +++ a/NPB3.3-MPI/BT/header.h @@ -0,0 +1,146 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- +c +c header.h +c +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + +c--------------------------------------------------------------------- +c The following include file is generated automatically by the +c "setparams" utility. It defines +c maxcells: the square root of the maximum number of processors +c problem_size: 12, 64, 102, 162 (for class T, A, B, C) +c dt_default: default time step for this problem size if no +c config file +c niter_default: default number of iterations for this problem size +c--------------------------------------------------------------------- + + include 'npbparams.h' + + integer aa, bb, cc, BLOCK_SIZE + parameter (aa=1, bb=2, cc=3, BLOCK_SIZE=5) + + integer ncells, grid_points(3) + double precision elapsed_time + common /global/ elapsed_time, ncells, grid_points + + double precision tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, + > dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, + > dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, + > ce(5,13), dxmax, dymax, dzmax, xxcon1, xxcon2, + > xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1, + > dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4, + > yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1, + > zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, + > dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, + > dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, + > c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt, + > dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, + > c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, + > c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16 + + common /constants/ tx1, tx2, tx3, ty1, ty2, ty3, tz1, tz2, tz3, + > dx1, dx2, dx3, dx4, dx5, dy1, dy2, dy3, dy4, + > dy5, dz1, dz2, dz3, dz4, dz5, dssp, dt, + > ce, dxmax, dymax, dzmax, xxcon1, xxcon2, + > xxcon3, xxcon4, xxcon5, dx1tx1, dx2tx1, dx3tx1, + > dx4tx1, dx5tx1, yycon1, yycon2, yycon3, yycon4, + > yycon5, dy1ty1, dy2ty1, dy3ty1, dy4ty1, dy5ty1, + > zzcon1, zzcon2, zzcon3, zzcon4, zzcon5, dz1tz1, + > dz2tz1, dz3tz1, dz4tz1, dz5tz1, dnxm1, dnym1, + > dnzm1, c1c2, c1c5, c3c4, c1345, conz1, c1, c2, + > c3, c4, c5, c4dssp, c5dssp, dtdssp, dttx1, bt, + > dttx2, dtty1, dtty2, dttz1, dttz2, c2dttx1, + > c2dtty1, c2dttz1, comz1, comz4, comz5, comz6, + > c3c4tx3, c3c4ty3, c3c4tz3, c2iv, con43, con16 + + integer EAST, WEST, NORTH, SOUTH, + > BOTTOM, TOP + + parameter (EAST=2000, WEST=3000, NORTH=4000, SOUTH=5000, + > BOTTOM=6000, TOP=7000) + + integer cell_coord (3,maxcells), cell_low (3,maxcells), + > cell_high (3,maxcells), cell_size(3,maxcells), + > predecessor(3), slice (3,maxcells), + > grid_size (3), successor(3) , + > start (3,maxcells), end (3,maxcells) + common /partition/ cell_coord, cell_low, cell_high, cell_size, + > grid_size, successor, predecessor, slice, + > start, end + + integer IMAX, JMAX, KMAX, MAX_CELL_DIM, BUF_SIZE + + parameter (MAX_CELL_DIM = (problem_size/maxcells)+1) + + parameter (IMAX=MAX_CELL_DIM,JMAX=MAX_CELL_DIM,KMAX=MAX_CELL_DIM) + + parameter (BUF_SIZE=MAX_CELL_DIM*MAX_CELL_DIM*(maxcells-1)*60+1) + + double precision + > us ( -1:IMAX, -1:JMAX, -1:KMAX, maxcells), + > vs ( -1:IMAX, -1:JMAX, -1:KMAX, maxcells), + > ws ( -1:IMAX, -1:JMAX, -1:KMAX, maxcells), + > qs ( -1:IMAX, -1:JMAX, -1:KMAX, maxcells), + > rho_i ( -1:IMAX, -1:JMAX, -1:KMAX, maxcells), + > square ( -1:IMAX, -1:JMAX, -1:KMAX, maxcells), + > forcing (5, 0:IMAX-1, 0:JMAX-1, 0:KMAX-1, maxcells), + > u (5, -2:IMAX+1,-2:JMAX+1,-2:KMAX+1, maxcells), + > rhs (5, -1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells), + > lhsc (5,5,-1:IMAX-1,-1:JMAX-1,-1:KMAX-1, maxcells), + > backsub_info (5, 0:MAX_CELL_DIM, 0:MAX_CELL_DIM, maxcells), + > in_buffer(BUF_SIZE), out_buffer(BUF_SIZE) + common /fields/ u, us, vs, ws, qs, rho_i, square, + > rhs, forcing, lhsc, in_buffer, out_buffer, + > backsub_info + + double precision cv(-2:MAX_CELL_DIM+1), rhon(-2:MAX_CELL_DIM+1), + > rhos(-2:MAX_CELL_DIM+1), rhoq(-2:MAX_CELL_DIM+1), + > cuf(-2:MAX_CELL_DIM+1), q(-2:MAX_CELL_DIM+1), + > ue(-2:MAX_CELL_DIM+1,5), buf(-2:MAX_CELL_DIM+1,5) + common /work_1d/ cv, rhon, rhos, rhoq, cuf, q, ue, buf + + integer west_size, east_size, bottom_size, top_size, + > north_size, south_size, start_send_west, + > start_send_east, start_send_south, start_send_north, + > start_send_bottom, start_send_top, start_recv_west, + > start_recv_east, start_recv_south, start_recv_north, + > start_recv_bottom, start_recv_top + common /box/ west_size, east_size, bottom_size, + > top_size, north_size, south_size, + > start_send_west, start_send_east, start_send_south, + > start_send_north, start_send_bottom, start_send_top, + > start_recv_west, start_recv_east, start_recv_south, + > start_recv_north, start_recv_bottom, start_recv_top + + double precision tmp_block(5,5), b_inverse(5,5), tmp_vec(5) + common /work_solve/ tmp_block, b_inverse, tmp_vec + +c +c These are used by btio +c + integer collbuf_nodes, collbuf_size, iosize, eltext, + $ combined_btype, fp, idump, record_length, element, + $ combined_ftype, idump_sub, rd_interval + common /btio/ collbuf_nodes, collbuf_size, iosize, eltext, + $ combined_btype, fp, idump, record_length, + $ idump_sub, rd_interval + double precision sum(niter_default), xce_sub(5) + common /btio/ sum, xce_sub + integer*8 iseek + common /btio/ iseek, element, combined_ftype + + + integer t_total, t_io, t_rhs, t_xsolve, t_ysolve, t_zsolve, + > t_bpack, t_exch, t_xcomm, t_ycomm, t_zcomm, t_last + parameter (t_total=1, t_io=2, t_rhs=3, t_xsolve=4, t_ysolve=5, + > t_zsolve=6, t_bpack=7, t_exch=8, t_xcomm=9, + > t_ycomm=10, t_zcomm=11, t_last=11) + logical timeron + common /tflags/ timeron + + + diff --git b/NPB3.3-MPI/BT/initialize.f a/NPB3.3-MPI/BT/initialize.f new file mode 100644 index 0000000..274cdb1 --- /dev/null +++ a/NPB3.3-MPI/BT/initialize.f @@ -0,0 +1,308 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine initialize + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c This subroutine initializes the field variable u using +c tri-linear transfinite interpolation of the boundary values +c--------------------------------------------------------------------- + + include 'header.h' + + integer c, i, j, k, m, ii, jj, kk, ix, iy, iz + double precision xi, eta, zeta, Pface(5,3,2), Pxi, Peta, + > Pzeta, temp(5) + +c--------------------------------------------------------------------- +c Later (in compute_rhs) we compute 1/u for every element. A few of +c the corner elements are not used, but it convenient (and faster) +c to compute the whole thing with a simple loop. Make sure those +c values are nonzero by initializing the whole thing here. +c--------------------------------------------------------------------- + do c = 1, ncells + do kk = -1, KMAX + do jj = -1, JMAX + do ii = -1, IMAX + do m = 1, 5 + u(m, ii, jj, kk, c) = 1.0 + end do + end do + end do + end do + end do +c--------------------------------------------------------------------- + + + +c--------------------------------------------------------------------- +c first store the "interpolated" values everywhere on the grid +c--------------------------------------------------------------------- + do c=1, ncells + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + + do ix = 1, 2 + call exact_solution(dble(ix-1), eta, zeta, + > Pface(1,1,ix)) + enddo + + do iy = 1, 2 + call exact_solution(xi, dble(iy-1) , zeta, + > Pface(1,2,iy)) + enddo + + do iz = 1, 2 + call exact_solution(xi, eta, dble(iz-1), + > Pface(1,3,iz)) + enddo + + do m = 1, 5 + Pxi = xi * Pface(m,1,2) + + > (1.0d0-xi) * Pface(m,1,1) + Peta = eta * Pface(m,2,2) + + > (1.0d0-eta) * Pface(m,2,1) + Pzeta = zeta * Pface(m,3,2) + + > (1.0d0-zeta) * Pface(m,3,1) + + u(m,ii,jj,kk,c) = Pxi + Peta + Pzeta - + > Pxi*Peta - Pxi*Pzeta - Peta*Pzeta + + > Pxi*Peta*Pzeta + + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + kk = kk+1 + enddo + enddo + +c--------------------------------------------------------------------- +c now store the exact values on the boundaries +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c west face +c--------------------------------------------------------------------- + c = slice(1,1) + ii = 0 + xi = 0.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + jj = jj + 1 + enddo + kk = kk + 1 + enddo + +c--------------------------------------------------------------------- +c east face +c--------------------------------------------------------------------- + c = slice(1,ncells) + ii = cell_size(1,c)-1 + xi = 1.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + jj = jj + 1 + enddo + kk = kk + 1 + enddo + +c--------------------------------------------------------------------- +c south face +c--------------------------------------------------------------------- + c = slice(2,1) + jj = 0 + eta = 0.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + kk = kk + 1 + enddo + + +c--------------------------------------------------------------------- +c north face +c--------------------------------------------------------------------- + c = slice(2,ncells) + jj = cell_size(2,c)-1 + eta = 1.0d0 + kk = 0 + do k = cell_low(3,c), cell_high(3,c) + zeta = dble(k) * dnzm1 + ii = 0 + do i = cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + kk = kk + 1 + enddo + +c--------------------------------------------------------------------- +c bottom face +c--------------------------------------------------------------------- + c = slice(3,1) + kk = 0 + zeta = 0.0d0 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i =cell_low(1,c), cell_high(1,c) + xi = dble(i) *dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + +c--------------------------------------------------------------------- +c top face +c--------------------------------------------------------------------- + c = slice(3,ncells) + kk = cell_size(3,c)-1 + zeta = 1.0d0 + jj = 0 + do j = cell_low(2,c), cell_high(2,c) + eta = dble(j) * dnym1 + ii = 0 + do i =cell_low(1,c), cell_high(1,c) + xi = dble(i) * dnxm1 + call exact_solution(xi, eta, zeta, temp) + do m = 1, 5 + u(m,ii,jj,kk,c) = temp(m) + enddo + ii = ii + 1 + enddo + jj = jj + 1 + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine lhsinit + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + + integer i, j, k, d, c, m, n + +c--------------------------------------------------------------------- +c loop over all cells +c--------------------------------------------------------------------- + do c = 1, ncells + +c--------------------------------------------------------------------- +c first, initialize the start and end arrays +c--------------------------------------------------------------------- + do d = 1, 3 + if (cell_coord(d,c) .eq. 1) then + start(d,c) = 1 + else + start(d,c) = 0 + endif + if (cell_coord(d,c) .eq. ncells) then + end(d,c) = 1 + else + end(d,c) = 0 + endif + enddo + +c--------------------------------------------------------------------- +c zero the whole left hand side for starters +c--------------------------------------------------------------------- + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1,5 + do n = 1, 5 + lhsc(m,n,i,j,k,c) = 0.0d0 + enddo + enddo + enddo + enddo + enddo + + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine lhsabinit(lhsa, lhsb, size) + implicit none + + integer size + double precision lhsa(5, 5, -1:size), lhsb(5, 5, -1:size) + + integer i, m, n + +c--------------------------------------------------------------------- +c next, set all diagonal values to 1. This is overkill, but convenient +c--------------------------------------------------------------------- + do i = 0, size + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,i) = 0.0d0 + lhsb(m,n,i) = 0.0d0 + enddo + lhsb(m,m,i) = 1.0d0 + enddo + enddo + + return + end + + + diff --git b/NPB3.3-MPI/BT/inputbt.data.sample a/NPB3.3-MPI/BT/inputbt.data.sample new file mode 100644 index 0000000..776654e --- /dev/null +++ a/NPB3.3-MPI/BT/inputbt.data.sample @@ -0,0 +1,5 @@ +200 number of time steps +0.0008d0 dt for class A = 0.0008d0. class B = 0.0003d0 class C = 0.0001d0 +64 64 64 +5 0 write interval (optional read interval) for BTIO +0 1000000 number of nodes in collective buffering and buffer size for BTIO diff --git b/NPB3.3-MPI/BT/make_set.f a/NPB3.3-MPI/BT/make_set.f new file mode 100644 index 0000000..ffab37c --- /dev/null +++ a/NPB3.3-MPI/BT/make_set.f @@ -0,0 +1,125 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine make_set + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c This function allocates space for a set of cells and fills the set +c such that communication between cells on different nodes is only +c nearest neighbor +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + + integer p, i, j, c, dir, size, excess, ierr,ierrcode + +c--------------------------------------------------------------------- +c compute square root; add small number to allow for roundoff +c (note: this is computed in setup_mpi.f also, but prefer to do +c it twice because of some include file problems). +c--------------------------------------------------------------------- + ncells = dint(dsqrt(dble(no_nodes) + 0.00001d0)) + +c--------------------------------------------------------------------- +c this makes coding easier +c--------------------------------------------------------------------- + p = ncells + +c--------------------------------------------------------------------- +c determine the location of the cell at the bottom of the 3D +c array of cells +c--------------------------------------------------------------------- + cell_coord(1,1) = mod(node,p) + cell_coord(2,1) = node/p + cell_coord(3,1) = 0 + +c--------------------------------------------------------------------- +c set the cell_coords for cells in the rest of the z-layers; +c this comes down to a simple linear numbering in the z-direct- +c ion, and to the doubly-cyclic numbering in the other dirs +c--------------------------------------------------------------------- + do c=2, p + cell_coord(1,c) = mod(cell_coord(1,c-1)+1,p) + cell_coord(2,c) = mod(cell_coord(2,c-1)-1+p,p) + cell_coord(3,c) = c-1 + end do + +c--------------------------------------------------------------------- +c offset all the coordinates by 1 to adjust for Fortran arrays +c--------------------------------------------------------------------- + do dir = 1, 3 + do c = 1, p + cell_coord(dir,c) = cell_coord(dir,c) + 1 + end do + end do + +c--------------------------------------------------------------------- +c slice(dir,n) contains the sequence number of the cell that is in +c coordinate plane n in the dir direction +c--------------------------------------------------------------------- + do dir = 1, 3 + do c = 1, p + slice(dir,cell_coord(dir,c)) = c + end do + end do + + +c--------------------------------------------------------------------- +c fill the predecessor and successor entries, using the indices +c of the bottom cells (they are the same at each level of k +c anyway) acting as if full periodicity pertains; note that p is +c added to those arguments to the mod functions that might +c otherwise return wrong values when using the modulo function +c--------------------------------------------------------------------- + i = cell_coord(1,1)-1 + j = cell_coord(2,1)-1 + + predecessor(1) = mod(i-1+p,p) + p*j + predecessor(2) = i + p*mod(j-1+p,p) + predecessor(3) = mod(i+1,p) + p*mod(j-1+p,p) + successor(1) = mod(i+1,p) + p*j + successor(2) = i + p*mod(j+1,p) + successor(3) = mod(i-1+p,p) + p*mod(j+1,p) + +c--------------------------------------------------------------------- +c now compute the sizes of the cells +c--------------------------------------------------------------------- + do dir= 1, 3 +c--------------------------------------------------------------------- +c set cell_coord range for each direction +c--------------------------------------------------------------------- + size = grid_points(dir)/p + excess = mod(grid_points(dir),p) + do c=1, ncells + if (cell_coord(dir,c) .le. excess) then + cell_size(dir,c) = size+1 + cell_low(dir,c) = (cell_coord(dir,c)-1)*(size+1) + cell_high(dir,c) = cell_low(dir,c)+size + else + cell_size(dir,c) = size + cell_low(dir,c) = excess*(size+1)+ + > (cell_coord(dir,c)-excess-1)*size + cell_high(dir,c) = cell_low(dir,c)+size-1 + endif + if (cell_size(dir, c) .le. 2) then + write(*,50) + 50 format(' Error: Cell size too small. Min size is 3') + ierrcode = 1 + call MPI_Abort(mpi_comm_world,ierrcode,ierr) + stop + endif + end do + end do + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + diff --git b/NPB3.3-MPI/BT/mpinpb.h a/NPB3.3-MPI/BT/mpinpb.h new file mode 100644 index 0000000..f621f08 --- /dev/null +++ a/NPB3.3-MPI/BT/mpinpb.h @@ -0,0 +1,12 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'mpif.h' + + integer node, no_nodes, total_nodes, root, comm_setup, + > comm_solve, comm_rhs, dp_type + logical active + common /mpistuff/ node, no_nodes, total_nodes, root, comm_setup, + > comm_solve, comm_rhs, dp_type, active + diff --git b/NPB3.3-MPI/BT/rhs.f a/NPB3.3-MPI/BT/rhs.f new file mode 100644 index 0000000..722f750 --- /dev/null +++ a/NPB3.3-MPI/BT/rhs.f @@ -0,0 +1,428 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine compute_rhs + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + + integer c, i, j, k, m + double precision rho_inv, uijk, up1, um1, vijk, vp1, vm1, + > wijk, wp1, wm1 + + + if (timeron) call timer_start(t_rhs) +c--------------------------------------------------------------------- +c loop over all cells owned by this node +c--------------------------------------------------------------------- + do c = 1, ncells + +c--------------------------------------------------------------------- +c compute the reciprocal of density, and the kinetic energy, +c and the speed of sound. +c--------------------------------------------------------------------- + do k = -1, cell_size(3,c) + do j = -1, cell_size(2,c) + do i = -1, cell_size(1,c) + rho_inv = 1.0d0/u(1,i,j,k,c) + rho_i(i,j,k,c) = rho_inv + us(i,j,k,c) = u(2,i,j,k,c) * rho_inv + vs(i,j,k,c) = u(3,i,j,k,c) * rho_inv + ws(i,j,k,c) = u(4,i,j,k,c) * rho_inv + square(i,j,k,c) = 0.5d0* ( + > u(2,i,j,k,c)*u(2,i,j,k,c) + + > u(3,i,j,k,c)*u(3,i,j,k,c) + + > u(4,i,j,k,c)*u(4,i,j,k,c) ) * rho_inv + qs(i,j,k,c) = square(i,j,k,c) * rho_inv + enddo + enddo + enddo + +c--------------------------------------------------------------------- +c copy the exact forcing term to the right hand side; because +c this forcing term is known, we can store it on the whole of every +c cell, including the boundary +c--------------------------------------------------------------------- + + do k = 0, cell_size(3,c)-1 + do j = 0, cell_size(2,c)-1 + do i = 0, cell_size(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = forcing(m,i,j,k,c) + enddo + enddo + enddo + enddo + + +c--------------------------------------------------------------------- +c compute xi-direction fluxes +c--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + uijk = us(i,j,k,c) + up1 = us(i+1,j,k,c) + um1 = us(i-1,j,k,c) + + rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dx1tx1 * + > (u(1,i+1,j,k,c) - 2.0d0*u(1,i,j,k,c) + + > u(1,i-1,j,k,c)) - + > tx2 * (u(2,i+1,j,k,c) - u(2,i-1,j,k,c)) + + rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dx2tx1 * + > (u(2,i+1,j,k,c) - 2.0d0*u(2,i,j,k,c) + + > u(2,i-1,j,k,c)) + + > xxcon2*con43 * (up1 - 2.0d0*uijk + um1) - + > tx2 * (u(2,i+1,j,k,c)*up1 - + > u(2,i-1,j,k,c)*um1 + + > (u(5,i+1,j,k,c)- square(i+1,j,k,c)- + > u(5,i-1,j,k,c)+ square(i-1,j,k,c))* + > c2) + + rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dx3tx1 * + > (u(3,i+1,j,k,c) - 2.0d0*u(3,i,j,k,c) + + > u(3,i-1,j,k,c)) + + > xxcon2 * (vs(i+1,j,k,c) - 2.0d0*vs(i,j,k,c) + + > vs(i-1,j,k,c)) - + > tx2 * (u(3,i+1,j,k,c)*up1 - + > u(3,i-1,j,k,c)*um1) + + rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dx4tx1 * + > (u(4,i+1,j,k,c) - 2.0d0*u(4,i,j,k,c) + + > u(4,i-1,j,k,c)) + + > xxcon2 * (ws(i+1,j,k,c) - 2.0d0*ws(i,j,k,c) + + > ws(i-1,j,k,c)) - + > tx2 * (u(4,i+1,j,k,c)*up1 - + > u(4,i-1,j,k,c)*um1) + + rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dx5tx1 * + > (u(5,i+1,j,k,c) - 2.0d0*u(5,i,j,k,c) + + > u(5,i-1,j,k,c)) + + > xxcon3 * (qs(i+1,j,k,c) - 2.0d0*qs(i,j,k,c) + + > qs(i-1,j,k,c)) + + > xxcon4 * (up1*up1 - 2.0d0*uijk*uijk + + > um1*um1) + + > xxcon5 * (u(5,i+1,j,k,c)*rho_i(i+1,j,k,c) - + > 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) + + > u(5,i-1,j,k,c)*rho_i(i-1,j,k,c)) - + > tx2 * ( (c1*u(5,i+1,j,k,c) - + > c2*square(i+1,j,k,c))*up1 - + > (c1*u(5,i-1,j,k,c) - + > c2*square(i-1,j,k,c))*um1 ) + enddo + enddo + enddo + +c--------------------------------------------------------------------- +c add fourth order xi-direction dissipation +c--------------------------------------------------------------------- + if (start(1,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + i = 1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * + > ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) + + > u(m,i+2,j,k,c)) + enddo + + i = 2 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > (-4.0d0*u(m,i-1,j,k,c) + 6.0d0*u(m,i,j,k,c) - + > 4.0d0*u(m,i+1,j,k,c) + u(m,i+2,j,k,c)) + enddo + enddo + enddo + endif + + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = 3*start(1,c),cell_size(1,c)-3*end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) + + > 6.0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) + + > u(m,i+2,j,k,c) ) + enddo + enddo + enddo + enddo + + + if (end(1,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + i = cell_size(1,c)-3 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i-2,j,k,c) - 4.0d0*u(m,i-1,j,k,c) + + > 6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i+1,j,k,c) ) + enddo + + i = cell_size(1,c)-2 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i-2,j,k,c) - 4.d0*u(m,i-1,j,k,c) + + > 5.d0*u(m,i,j,k,c) ) + enddo + enddo + enddo + endif + +c--------------------------------------------------------------------- +c compute eta-direction fluxes +c--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + vijk = vs(i,j,k,c) + vp1 = vs(i,j+1,k,c) + vm1 = vs(i,j-1,k,c) + rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dy1ty1 * + > (u(1,i,j+1,k,c) - 2.0d0*u(1,i,j,k,c) + + > u(1,i,j-1,k,c)) - + > ty2 * (u(3,i,j+1,k,c) - u(3,i,j-1,k,c)) + rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dy2ty1 * + > (u(2,i,j+1,k,c) - 2.0d0*u(2,i,j,k,c) + + > u(2,i,j-1,k,c)) + + > yycon2 * (us(i,j+1,k,c) - 2.0d0*us(i,j,k,c) + + > us(i,j-1,k,c)) - + > ty2 * (u(2,i,j+1,k,c)*vp1 - + > u(2,i,j-1,k,c)*vm1) + rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dy3ty1 * + > (u(3,i,j+1,k,c) - 2.0d0*u(3,i,j,k,c) + + > u(3,i,j-1,k,c)) + + > yycon2*con43 * (vp1 - 2.0d0*vijk + vm1) - + > ty2 * (u(3,i,j+1,k,c)*vp1 - + > u(3,i,j-1,k,c)*vm1 + + > (u(5,i,j+1,k,c) - square(i,j+1,k,c) - + > u(5,i,j-1,k,c) + square(i,j-1,k,c)) + > *c2) + rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dy4ty1 * + > (u(4,i,j+1,k,c) - 2.0d0*u(4,i,j,k,c) + + > u(4,i,j-1,k,c)) + + > yycon2 * (ws(i,j+1,k,c) - 2.0d0*ws(i,j,k,c) + + > ws(i,j-1,k,c)) - + > ty2 * (u(4,i,j+1,k,c)*vp1 - + > u(4,i,j-1,k,c)*vm1) + rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dy5ty1 * + > (u(5,i,j+1,k,c) - 2.0d0*u(5,i,j,k,c) + + > u(5,i,j-1,k,c)) + + > yycon3 * (qs(i,j+1,k,c) - 2.0d0*qs(i,j,k,c) + + > qs(i,j-1,k,c)) + + > yycon4 * (vp1*vp1 - 2.0d0*vijk*vijk + + > vm1*vm1) + + > yycon5 * (u(5,i,j+1,k,c)*rho_i(i,j+1,k,c) - + > 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) + + > u(5,i,j-1,k,c)*rho_i(i,j-1,k,c)) - + > ty2 * ((c1*u(5,i,j+1,k,c) - + > c2*square(i,j+1,k,c)) * vp1 - + > (c1*u(5,i,j-1,k,c) - + > c2*square(i,j-1,k,c)) * vm1) + enddo + enddo + enddo + +c--------------------------------------------------------------------- +c add fourth order eta-direction dissipation +c--------------------------------------------------------------------- + if (start(2,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + j = 1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * + > ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) + + > u(m,i,j+2,k,c)) + enddo + enddo + + j = 2 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > (-4.0d0*u(m,i,j-1,k,c) + 6.0d0*u(m,i,j,k,c) - + > 4.0d0*u(m,i,j+1,k,c) + u(m,i,j+2,k,c)) + enddo + enddo + enddo + endif + + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = 3*start(2,c), cell_size(2,c)-3*end(2,c)-1 + do i = start(1,c),cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) + + > 6.0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) + + > u(m,i,j+2,k,c) ) + enddo + enddo + enddo + enddo + + if (end(2,c) .gt. 0) then + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + j = cell_size(2,c)-3 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i,j-2,k,c) - 4.0d0*u(m,i,j-1,k,c) + + > 6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j+1,k,c) ) + enddo + enddo + + j = cell_size(2,c)-2 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i,j-2,k,c) - 4.d0*u(m,i,j-1,k,c) + + > 5.d0*u(m,i,j,k,c) ) + enddo + enddo + enddo + endif + +c--------------------------------------------------------------------- +c compute zeta-direction fluxes +c--------------------------------------------------------------------- + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + wijk = ws(i,j,k,c) + wp1 = ws(i,j,k+1,c) + wm1 = ws(i,j,k-1,c) + + rhs(1,i,j,k,c) = rhs(1,i,j,k,c) + dz1tz1 * + > (u(1,i,j,k+1,c) - 2.0d0*u(1,i,j,k,c) + + > u(1,i,j,k-1,c)) - + > tz2 * (u(4,i,j,k+1,c) - u(4,i,j,k-1,c)) + rhs(2,i,j,k,c) = rhs(2,i,j,k,c) + dz2tz1 * + > (u(2,i,j,k+1,c) - 2.0d0*u(2,i,j,k,c) + + > u(2,i,j,k-1,c)) + + > zzcon2 * (us(i,j,k+1,c) - 2.0d0*us(i,j,k,c) + + > us(i,j,k-1,c)) - + > tz2 * (u(2,i,j,k+1,c)*wp1 - + > u(2,i,j,k-1,c)*wm1) + rhs(3,i,j,k,c) = rhs(3,i,j,k,c) + dz3tz1 * + > (u(3,i,j,k+1,c) - 2.0d0*u(3,i,j,k,c) + + > u(3,i,j,k-1,c)) + + > zzcon2 * (vs(i,j,k+1,c) - 2.0d0*vs(i,j,k,c) + + > vs(i,j,k-1,c)) - + > tz2 * (u(3,i,j,k+1,c)*wp1 - + > u(3,i,j,k-1,c)*wm1) + rhs(4,i,j,k,c) = rhs(4,i,j,k,c) + dz4tz1 * + > (u(4,i,j,k+1,c) - 2.0d0*u(4,i,j,k,c) + + > u(4,i,j,k-1,c)) + + > zzcon2*con43 * (wp1 - 2.0d0*wijk + wm1) - + > tz2 * (u(4,i,j,k+1,c)*wp1 - + > u(4,i,j,k-1,c)*wm1 + + > (u(5,i,j,k+1,c) - square(i,j,k+1,c) - + > u(5,i,j,k-1,c) + square(i,j,k-1,c)) + > *c2) + rhs(5,i,j,k,c) = rhs(5,i,j,k,c) + dz5tz1 * + > (u(5,i,j,k+1,c) - 2.0d0*u(5,i,j,k,c) + + > u(5,i,j,k-1,c)) + + > zzcon3 * (qs(i,j,k+1,c) - 2.0d0*qs(i,j,k,c) + + > qs(i,j,k-1,c)) + + > zzcon4 * (wp1*wp1 - 2.0d0*wijk*wijk + + > wm1*wm1) + + > zzcon5 * (u(5,i,j,k+1,c)*rho_i(i,j,k+1,c) - + > 2.0d0*u(5,i,j,k,c)*rho_i(i,j,k,c) + + > u(5,i,j,k-1,c)*rho_i(i,j,k-1,c)) - + > tz2 * ( (c1*u(5,i,j,k+1,c) - + > c2*square(i,j,k+1,c))*wp1 - + > (c1*u(5,i,j,k-1,c) - + > c2*square(i,j,k-1,c))*wm1) + enddo + enddo + enddo + +c--------------------------------------------------------------------- +c add fourth order zeta-direction dissipation +c--------------------------------------------------------------------- + if (start(3,c) .gt. 0) then + k = 1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c)- dssp * + > ( 5.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) + + > u(m,i,j,k+2,c)) + enddo + enddo + enddo + + k = 2 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > (-4.0d0*u(m,i,j,k-1,c) + 6.0d0*u(m,i,j,k,c) - + > 4.0d0*u(m,i,j,k+1,c) + u(m,i,j,k+2,c)) + enddo + enddo + enddo + endif + + do k = 3*start(3,c), cell_size(3,c)-3*end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c),cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) + + > 6.0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) + + > u(m,i,j,k+2,c) ) + enddo + enddo + enddo + enddo + + if (end(3,c) .gt. 0) then + k = cell_size(3,c)-3 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i,j,k-2,c) - 4.0d0*u(m,i,j,k-1,c) + + > 6.0d0*u(m,i,j,k,c) - 4.0d0*u(m,i,j,k+1,c) ) + enddo + enddo + enddo + + k = cell_size(3,c)-2 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) - dssp * + > ( u(m,i,j,k-2,c) - 4.d0*u(m,i,j,k-1,c) + + > 5.d0*u(m,i,j,k,c) ) + enddo + enddo + enddo + endif + + do k = start(3,c), cell_size(3,c)-end(3,c)-1 + do j = start(2,c), cell_size(2,c)-end(2,c)-1 + do i = start(1,c), cell_size(1,c)-end(1,c)-1 + do m = 1, 5 + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) * dt + enddo + enddo + enddo + enddo + + enddo + + if (timeron) call timer_stop(t_rhs) + + return + end + + + + diff --git b/NPB3.3-MPI/BT/set_constants.f a/NPB3.3-MPI/BT/set_constants.f new file mode 100644 index 0000000..81397d4 --- /dev/null +++ a/NPB3.3-MPI/BT/set_constants.f @@ -0,0 +1,202 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine set_constants + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + + ce(1,1) = 2.0d0 + ce(1,2) = 0.0d0 + ce(1,3) = 0.0d0 + ce(1,4) = 4.0d0 + ce(1,5) = 5.0d0 + ce(1,6) = 3.0d0 + ce(1,7) = 0.5d0 + ce(1,8) = 0.02d0 + ce(1,9) = 0.01d0 + ce(1,10) = 0.03d0 + ce(1,11) = 0.5d0 + ce(1,12) = 0.4d0 + ce(1,13) = 0.3d0 + + ce(2,1) = 1.0d0 + ce(2,2) = 0.0d0 + ce(2,3) = 0.0d0 + ce(2,4) = 0.0d0 + ce(2,5) = 1.0d0 + ce(2,6) = 2.0d0 + ce(2,7) = 3.0d0 + ce(2,8) = 0.01d0 + ce(2,9) = 0.03d0 + ce(2,10) = 0.02d0 + ce(2,11) = 0.4d0 + ce(2,12) = 0.3d0 + ce(2,13) = 0.5d0 + + ce(3,1) = 2.0d0 + ce(3,2) = 2.0d0 + ce(3,3) = 0.0d0 + ce(3,4) = 0.0d0 + ce(3,5) = 0.0d0 + ce(3,6) = 2.0d0 + ce(3,7) = 3.0d0 + ce(3,8) = 0.04d0 + ce(3,9) = 0.03d0 + ce(3,10) = 0.05d0 + ce(3,11) = 0.3d0 + ce(3,12) = 0.5d0 + ce(3,13) = 0.4d0 + + ce(4,1) = 2.0d0 + ce(4,2) = 2.0d0 + ce(4,3) = 0.0d0 + ce(4,4) = 0.0d0 + ce(4,5) = 0.0d0 + ce(4,6) = 2.0d0 + ce(4,7) = 3.0d0 + ce(4,8) = 0.03d0 + ce(4,9) = 0.05d0 + ce(4,10) = 0.04d0 + ce(4,11) = 0.2d0 + ce(4,12) = 0.1d0 + ce(4,13) = 0.3d0 + + ce(5,1) = 5.0d0 + ce(5,2) = 4.0d0 + ce(5,3) = 3.0d0 + ce(5,4) = 2.0d0 + ce(5,5) = 0.1d0 + ce(5,6) = 0.4d0 + ce(5,7) = 0.3d0 + ce(5,8) = 0.05d0 + ce(5,9) = 0.04d0 + ce(5,10) = 0.03d0 + ce(5,11) = 0.1d0 + ce(5,12) = 0.3d0 + ce(5,13) = 0.2d0 + + c1 = 1.4d0 + c2 = 0.4d0 + c3 = 0.1d0 + c4 = 1.0d0 + c5 = 1.4d0 + + bt = dsqrt(0.5d0) + + dnxm1 = 1.0d0 / dble(grid_points(1)-1) + dnym1 = 1.0d0 / dble(grid_points(2)-1) + dnzm1 = 1.0d0 / dble(grid_points(3)-1) + + c1c2 = c1 * c2 + c1c5 = c1 * c5 + c3c4 = c3 * c4 + c1345 = c1c5 * c3c4 + + conz1 = (1.0d0-c1c5) + + tx1 = 1.0d0 / (dnxm1 * dnxm1) + tx2 = 1.0d0 / (2.0d0 * dnxm1) + tx3 = 1.0d0 / dnxm1 + + ty1 = 1.0d0 / (dnym1 * dnym1) + ty2 = 1.0d0 / (2.0d0 * dnym1) + ty3 = 1.0d0 / dnym1 + + tz1 = 1.0d0 / (dnzm1 * dnzm1) + tz2 = 1.0d0 / (2.0d0 * dnzm1) + tz3 = 1.0d0 / dnzm1 + + dx1 = 0.75d0 + dx2 = 0.75d0 + dx3 = 0.75d0 + dx4 = 0.75d0 + dx5 = 0.75d0 + + dy1 = 0.75d0 + dy2 = 0.75d0 + dy3 = 0.75d0 + dy4 = 0.75d0 + dy5 = 0.75d0 + + dz1 = 1.0d0 + dz2 = 1.0d0 + dz3 = 1.0d0 + dz4 = 1.0d0 + dz5 = 1.0d0 + + dxmax = dmax1(dx3, dx4) + dymax = dmax1(dy2, dy4) + dzmax = dmax1(dz2, dz3) + + dssp = 0.25d0 * dmax1(dx1, dmax1(dy1, dz1) ) + + c4dssp = 4.0d0 * dssp + c5dssp = 5.0d0 * dssp + + dttx1 = dt*tx1 + dttx2 = dt*tx2 + dtty1 = dt*ty1 + dtty2 = dt*ty2 + dttz1 = dt*tz1 + dttz2 = dt*tz2 + + c2dttx1 = 2.0d0*dttx1 + c2dtty1 = 2.0d0*dtty1 + c2dttz1 = 2.0d0*dttz1 + + dtdssp = dt*dssp + + comz1 = dtdssp + comz4 = 4.0d0*dtdssp + comz5 = 5.0d0*dtdssp + comz6 = 6.0d0*dtdssp + + c3c4tx3 = c3c4*tx3 + c3c4ty3 = c3c4*ty3 + c3c4tz3 = c3c4*tz3 + + dx1tx1 = dx1*tx1 + dx2tx1 = dx2*tx1 + dx3tx1 = dx3*tx1 + dx4tx1 = dx4*tx1 + dx5tx1 = dx5*tx1 + + dy1ty1 = dy1*ty1 + dy2ty1 = dy2*ty1 + dy3ty1 = dy3*ty1 + dy4ty1 = dy4*ty1 + dy5ty1 = dy5*ty1 + + dz1tz1 = dz1*tz1 + dz2tz1 = dz2*tz1 + dz3tz1 = dz3*tz1 + dz4tz1 = dz4*tz1 + dz5tz1 = dz5*tz1 + + c2iv = 2.5d0 + con43 = 4.0d0/3.0d0 + con16 = 1.0d0/6.0d0 + + xxcon1 = c3c4tx3*con43*tx3 + xxcon2 = c3c4tx3*tx3 + xxcon3 = c3c4tx3*conz1*tx3 + xxcon4 = c3c4tx3*con16*tx3 + xxcon5 = c3c4tx3*c1c5*tx3 + + yycon1 = c3c4ty3*con43*ty3 + yycon2 = c3c4ty3*ty3 + yycon3 = c3c4ty3*conz1*ty3 + yycon4 = c3c4ty3*con16*ty3 + yycon5 = c3c4ty3*c1c5*ty3 + + zzcon1 = c3c4tz3*con43*tz3 + zzcon2 = c3c4tz3*tz3 + zzcon3 = c3c4tz3*conz1*tz3 + zzcon4 = c3c4tz3*con16*tz3 + zzcon5 = c3c4tz3*c1c5*tz3 + + return + end diff --git b/NPB3.3-MPI/BT/setup_mpi.f a/NPB3.3-MPI/BT/setup_mpi.f new file mode 100644 index 0000000..987c6bf --- /dev/null +++ a/NPB3.3-MPI/BT/setup_mpi.f @@ -0,0 +1,64 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup_mpi + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c set up MPI stuff +c--------------------------------------------------------------------- + + implicit none + include 'mpinpb.h' + include 'npbparams.h' + integer error, color, nc + + call mpi_init(error) + + call mpi_comm_size(MPI_COMM_WORLD, total_nodes, error) + call mpi_comm_rank(MPI_COMM_WORLD, node, error) + + if (.not. convertdouble) then + dp_type = MPI_DOUBLE_PRECISION + else + dp_type = MPI_REAL + endif + +c--------------------------------------------------------------------- +c compute square root; add small number to allow for roundoff +c--------------------------------------------------------------------- + nc = dint(dsqrt(dble(total_nodes) + 0.00001d0)) + +c--------------------------------------------------------------------- +c We handle a non-square number of nodes by making the excess nodes +c inactive. However, we can never handle more cells than were compiled +c in. +c--------------------------------------------------------------------- + + if (nc .gt. maxcells) nc = maxcells + if (node .ge. nc*nc) then + active = .false. + color = 1 + else + active = .true. + color = 0 + end if + + call mpi_comm_split(MPI_COMM_WORLD,color,node,comm_setup,error) + if (.not. active) return + + call mpi_comm_size(comm_setup, no_nodes, error) + call mpi_comm_dup(comm_setup, comm_solve, error) + call mpi_comm_dup(comm_setup, comm_rhs, error) + +c--------------------------------------------------------------------- +c let node 0 be the root for the group (there is only one) +c--------------------------------------------------------------------- + root = 0 + + return + end + diff --git b/NPB3.3-MPI/BT/simple_mpiio.f a/NPB3.3-MPI/BT/simple_mpiio.f new file mode 100644 index 0000000..02e2700 --- /dev/null +++ a/NPB3.3-MPI/BT/simple_mpiio.f @@ -0,0 +1,213 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup_btio + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer m, ierr + + iseek=0 + + if (node .eq. root) then + call MPI_File_delete(filenm, MPI_INFO_NULL, ierr) + endif + + call MPI_Barrier(comm_solve, ierr) + + call MPI_File_open(comm_solve, + $ filenm, + $ MPI_MODE_RDWR + MPI_MODE_CREATE, + $ MPI_INFO_NULL, + $ fp, + $ ierr) + + call MPI_File_set_view(fp, + $ iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, + $ 'native', MPI_INFO_NULL, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error opening file' + stop + endif + + do m = 1, 5 + xce_sub(m) = 0.d0 + end do + + idump_sub = 0 + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine output_timestep + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + include 'header.h' + include 'mpinpb.h' + + integer count, jio, kio, cio, aio + integer ierr + integer mstatus(MPI_STATUS_SIZE) + + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=5*(cell_low(1,cio) + + $ PROBLEM_SIZE*((cell_low(2,cio)+jio) + + $ PROBLEM_SIZE*((cell_low(3,cio)+kio) + + $ PROBLEM_SIZE*idump_sub))) + + count=5*cell_size(1,cio) + + call MPI_File_write_at(fp, iseek, + $ u(1,0,jio,kio,cio), + $ count, MPI_DOUBLE_PRECISION, + $ mstatus, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error writing to file' + stop + endif + enddo + enddo + enddo + + idump_sub = idump_sub + 1 + if (rd_interval .gt. 0) then + if (idump_sub .ge. rd_interval) then + + call acc_sub_norms(idump+1) + + idump_sub = 0 + endif + endif + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine acc_sub_norms(idump_cur) + + include 'header.h' + include 'mpinpb.h' + + integer idump_cur + + integer count, jio, kio, cio, ii, m, ichunk + integer ierr + integer mstatus(MPI_STATUS_SIZE) + double precision xce_single(5) + + ichunk = idump_cur - idump_sub + 1 + do ii=0, idump_sub-1 + do cio=1,ncells + do kio=0, cell_size(3,cio)-1 + do jio=0, cell_size(2,cio)-1 + iseek=5*(cell_low(1,cio) + + $ PROBLEM_SIZE*((cell_low(2,cio)+jio) + + $ PROBLEM_SIZE*((cell_low(3,cio)+kio) + + $ PROBLEM_SIZE*ii))) + + count=5*cell_size(1,cio) + + call MPI_File_read_at(fp, iseek, + $ u(1,0,jio,kio,cio), + $ count, MPI_DOUBLE_PRECISION, + $ mstatus, ierr) + + if (ierr .ne. MPI_SUCCESS) then + print *, 'Error reading back file' + call MPI_File_close(fp, ierr) + stop + endif + enddo + enddo + enddo + + if (node .eq. root) print *, 'Reading data set ', ii+ichunk + + call error_norm(xce_single) + do m = 1, 5 + xce_sub(m) = xce_sub(m) + xce_single(m) + end do + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine btio_cleanup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer ierr + + call MPI_File_close(fp, ierr) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine accumulate_norms(xce_acc) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + double precision xce_acc(5) + integer m, ierr + + if (rd_interval .gt. 0) goto 20 + + call MPI_File_open(comm_solve, + $ filenm, + $ MPI_MODE_RDONLY, + $ MPI_INFO_NULL, + $ fp, + $ ierr) + + iseek = 0 + call MPI_File_set_view(fp, + $ iseek, MPI_DOUBLE_PRECISION, MPI_DOUBLE_PRECISION, + $ 'native', MPI_INFO_NULL, ierr) + +c clear the last time step + + call clear_timestep + +c read back the time steps and accumulate norms + + call acc_sub_norms(idump) + + call MPI_File_close(fp, ierr) + + 20 continue + do m = 1, 5 + xce_acc(m) = xce_sub(m) / dble(idump) + end do + + return + end + diff --git b/NPB3.3-MPI/BT/solve_subs.f a/NPB3.3-MPI/BT/solve_subs.f new file mode 100644 index 0000000..351489a --- /dev/null +++ a/NPB3.3-MPI/BT/solve_subs.f @@ -0,0 +1,642 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine matvec_sub(ablock,avec,bvec) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c subtracts bvec=bvec - ablock*avec +c--------------------------------------------------------------------- + + implicit none + + double precision ablock,avec,bvec + dimension ablock(5,5),avec(5),bvec(5) + +c--------------------------------------------------------------------- +c rhs(i,ic,jc,kc,ccell) = rhs(i,ic,jc,kc,ccell) +c $ - lhs(i,1,ablock,ia,ja,ka,acell)* +c--------------------------------------------------------------------- + bvec(1) = bvec(1) - ablock(1,1)*avec(1) + > - ablock(1,2)*avec(2) + > - ablock(1,3)*avec(3) + > - ablock(1,4)*avec(4) + > - ablock(1,5)*avec(5) + bvec(2) = bvec(2) - ablock(2,1)*avec(1) + > - ablock(2,2)*avec(2) + > - ablock(2,3)*avec(3) + > - ablock(2,4)*avec(4) + > - ablock(2,5)*avec(5) + bvec(3) = bvec(3) - ablock(3,1)*avec(1) + > - ablock(3,2)*avec(2) + > - ablock(3,3)*avec(3) + > - ablock(3,4)*avec(4) + > - ablock(3,5)*avec(5) + bvec(4) = bvec(4) - ablock(4,1)*avec(1) + > - ablock(4,2)*avec(2) + > - ablock(4,3)*avec(3) + > - ablock(4,4)*avec(4) + > - ablock(4,5)*avec(5) + bvec(5) = bvec(5) - ablock(5,1)*avec(1) + > - ablock(5,2)*avec(2) + > - ablock(5,3)*avec(3) + > - ablock(5,4)*avec(4) + > - ablock(5,5)*avec(5) + + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine matmul_sub(ablock, bblock, cblock) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c subtracts a(i,j,k) X b(i,j,k) from c(i,j,k) +c--------------------------------------------------------------------- + + implicit none + + double precision ablock, bblock, cblock + dimension ablock(5,5), bblock(5,5), cblock(5,5) + + + cblock(1,1) = cblock(1,1) - ablock(1,1)*bblock(1,1) + > - ablock(1,2)*bblock(2,1) + > - ablock(1,3)*bblock(3,1) + > - ablock(1,4)*bblock(4,1) + > - ablock(1,5)*bblock(5,1) + cblock(2,1) = cblock(2,1) - ablock(2,1)*bblock(1,1) + > - ablock(2,2)*bblock(2,1) + > - ablock(2,3)*bblock(3,1) + > - ablock(2,4)*bblock(4,1) + > - ablock(2,5)*bblock(5,1) + cblock(3,1) = cblock(3,1) - ablock(3,1)*bblock(1,1) + > - ablock(3,2)*bblock(2,1) + > - ablock(3,3)*bblock(3,1) + > - ablock(3,4)*bblock(4,1) + > - ablock(3,5)*bblock(5,1) + cblock(4,1) = cblock(4,1) - ablock(4,1)*bblock(1,1) + > - ablock(4,2)*bblock(2,1) + > - ablock(4,3)*bblock(3,1) + > - ablock(4,4)*bblock(4,1) + > - ablock(4,5)*bblock(5,1) + cblock(5,1) = cblock(5,1) - ablock(5,1)*bblock(1,1) + > - ablock(5,2)*bblock(2,1) + > - ablock(5,3)*bblock(3,1) + > - ablock(5,4)*bblock(4,1) + > - ablock(5,5)*bblock(5,1) + cblock(1,2) = cblock(1,2) - ablock(1,1)*bblock(1,2) + > - ablock(1,2)*bblock(2,2) + > - ablock(1,3)*bblock(3,2) + > - ablock(1,4)*bblock(4,2) + > - ablock(1,5)*bblock(5,2) + cblock(2,2) = cblock(2,2) - ablock(2,1)*bblock(1,2) + > - ablock(2,2)*bblock(2,2) + > - ablock(2,3)*bblock(3,2) + > - ablock(2,4)*bblock(4,2) + > - ablock(2,5)*bblock(5,2) + cblock(3,2) = cblock(3,2) - ablock(3,1)*bblock(1,2) + > - ablock(3,2)*bblock(2,2) + > - ablock(3,3)*bblock(3,2) + > - ablock(3,4)*bblock(4,2) + > - ablock(3,5)*bblock(5,2) + cblock(4,2) = cblock(4,2) - ablock(4,1)*bblock(1,2) + > - ablock(4,2)*bblock(2,2) + > - ablock(4,3)*bblock(3,2) + > - ablock(4,4)*bblock(4,2) + > - ablock(4,5)*bblock(5,2) + cblock(5,2) = cblock(5,2) - ablock(5,1)*bblock(1,2) + > - ablock(5,2)*bblock(2,2) + > - ablock(5,3)*bblock(3,2) + > - ablock(5,4)*bblock(4,2) + > - ablock(5,5)*bblock(5,2) + cblock(1,3) = cblock(1,3) - ablock(1,1)*bblock(1,3) + > - ablock(1,2)*bblock(2,3) + > - ablock(1,3)*bblock(3,3) + > - ablock(1,4)*bblock(4,3) + > - ablock(1,5)*bblock(5,3) + cblock(2,3) = cblock(2,3) - ablock(2,1)*bblock(1,3) + > - ablock(2,2)*bblock(2,3) + > - ablock(2,3)*bblock(3,3) + > - ablock(2,4)*bblock(4,3) + > - ablock(2,5)*bblock(5,3) + cblock(3,3) = cblock(3,3) - ablock(3,1)*bblock(1,3) + > - ablock(3,2)*bblock(2,3) + > - ablock(3,3)*bblock(3,3) + > - ablock(3,4)*bblock(4,3) + > - ablock(3,5)*bblock(5,3) + cblock(4,3) = cblock(4,3) - ablock(4,1)*bblock(1,3) + > - ablock(4,2)*bblock(2,3) + > - ablock(4,3)*bblock(3,3) + > - ablock(4,4)*bblock(4,3) + > - ablock(4,5)*bblock(5,3) + cblock(5,3) = cblock(5,3) - ablock(5,1)*bblock(1,3) + > - ablock(5,2)*bblock(2,3) + > - ablock(5,3)*bblock(3,3) + > - ablock(5,4)*bblock(4,3) + > - ablock(5,5)*bblock(5,3) + cblock(1,4) = cblock(1,4) - ablock(1,1)*bblock(1,4) + > - ablock(1,2)*bblock(2,4) + > - ablock(1,3)*bblock(3,4) + > - ablock(1,4)*bblock(4,4) + > - ablock(1,5)*bblock(5,4) + cblock(2,4) = cblock(2,4) - ablock(2,1)*bblock(1,4) + > - ablock(2,2)*bblock(2,4) + > - ablock(2,3)*bblock(3,4) + > - ablock(2,4)*bblock(4,4) + > - ablock(2,5)*bblock(5,4) + cblock(3,4) = cblock(3,4) - ablock(3,1)*bblock(1,4) + > - ablock(3,2)*bblock(2,4) + > - ablock(3,3)*bblock(3,4) + > - ablock(3,4)*bblock(4,4) + > - ablock(3,5)*bblock(5,4) + cblock(4,4) = cblock(4,4) - ablock(4,1)*bblock(1,4) + > - ablock(4,2)*bblock(2,4) + > - ablock(4,3)*bblock(3,4) + > - ablock(4,4)*bblock(4,4) + > - ablock(4,5)*bblock(5,4) + cblock(5,4) = cblock(5,4) - ablock(5,1)*bblock(1,4) + > - ablock(5,2)*bblock(2,4) + > - ablock(5,3)*bblock(3,4) + > - ablock(5,4)*bblock(4,4) + > - ablock(5,5)*bblock(5,4) + cblock(1,5) = cblock(1,5) - ablock(1,1)*bblock(1,5) + > - ablock(1,2)*bblock(2,5) + > - ablock(1,3)*bblock(3,5) + > - ablock(1,4)*bblock(4,5) + > - ablock(1,5)*bblock(5,5) + cblock(2,5) = cblock(2,5) - ablock(2,1)*bblock(1,5) + > - ablock(2,2)*bblock(2,5) + > - ablock(2,3)*bblock(3,5) + > - ablock(2,4)*bblock(4,5) + > - ablock(2,5)*bblock(5,5) + cblock(3,5) = cblock(3,5) - ablock(3,1)*bblock(1,5) + > - ablock(3,2)*bblock(2,5) + > - ablock(3,3)*bblock(3,5) + > - ablock(3,4)*bblock(4,5) + > - ablock(3,5)*bblock(5,5) + cblock(4,5) = cblock(4,5) - ablock(4,1)*bblock(1,5) + > - ablock(4,2)*bblock(2,5) + > - ablock(4,3)*bblock(3,5) + > - ablock(4,4)*bblock(4,5) + > - ablock(4,5)*bblock(5,5) + cblock(5,5) = cblock(5,5) - ablock(5,1)*bblock(1,5) + > - ablock(5,2)*bblock(2,5) + > - ablock(5,3)*bblock(3,5) + > - ablock(5,4)*bblock(4,5) + > - ablock(5,5)*bblock(5,5) + + + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine binvcrhs( lhs,c,r ) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c +c--------------------------------------------------------------------- + + implicit none + + double precision pivot, coeff, lhs + dimension lhs(5,5) + double precision c(5,5), r(5) + +c--------------------------------------------------------------------- +c +c--------------------------------------------------------------------- + + pivot = 1.00d0/lhs(1,1) + lhs(1,2) = lhs(1,2)*pivot + lhs(1,3) = lhs(1,3)*pivot + lhs(1,4) = lhs(1,4)*pivot + lhs(1,5) = lhs(1,5)*pivot + c(1,1) = c(1,1)*pivot + c(1,2) = c(1,2)*pivot + c(1,3) = c(1,3)*pivot + c(1,4) = c(1,4)*pivot + c(1,5) = c(1,5)*pivot + r(1) = r(1) *pivot + + coeff = lhs(2,1) + lhs(2,2)= lhs(2,2) - coeff*lhs(1,2) + lhs(2,3)= lhs(2,3) - coeff*lhs(1,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(1,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(1,5) + c(2,1) = c(2,1) - coeff*c(1,1) + c(2,2) = c(2,2) - coeff*c(1,2) + c(2,3) = c(2,3) - coeff*c(1,3) + c(2,4) = c(2,4) - coeff*c(1,4) + c(2,5) = c(2,5) - coeff*c(1,5) + r(2) = r(2) - coeff*r(1) + + coeff = lhs(3,1) + lhs(3,2)= lhs(3,2) - coeff*lhs(1,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(1,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(1,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(1,5) + c(3,1) = c(3,1) - coeff*c(1,1) + c(3,2) = c(3,2) - coeff*c(1,2) + c(3,3) = c(3,3) - coeff*c(1,3) + c(3,4) = c(3,4) - coeff*c(1,4) + c(3,5) = c(3,5) - coeff*c(1,5) + r(3) = r(3) - coeff*r(1) + + coeff = lhs(4,1) + lhs(4,2)= lhs(4,2) - coeff*lhs(1,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(1,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(1,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(1,5) + c(4,1) = c(4,1) - coeff*c(1,1) + c(4,2) = c(4,2) - coeff*c(1,2) + c(4,3) = c(4,3) - coeff*c(1,3) + c(4,4) = c(4,4) - coeff*c(1,4) + c(4,5) = c(4,5) - coeff*c(1,5) + r(4) = r(4) - coeff*r(1) + + coeff = lhs(5,1) + lhs(5,2)= lhs(5,2) - coeff*lhs(1,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(1,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(1,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(1,5) + c(5,1) = c(5,1) - coeff*c(1,1) + c(5,2) = c(5,2) - coeff*c(1,2) + c(5,3) = c(5,3) - coeff*c(1,3) + c(5,4) = c(5,4) - coeff*c(1,4) + c(5,5) = c(5,5) - coeff*c(1,5) + r(5) = r(5) - coeff*r(1) + + + pivot = 1.00d0/lhs(2,2) + lhs(2,3) = lhs(2,3)*pivot + lhs(2,4) = lhs(2,4)*pivot + lhs(2,5) = lhs(2,5)*pivot + c(2,1) = c(2,1)*pivot + c(2,2) = c(2,2)*pivot + c(2,3) = c(2,3)*pivot + c(2,4) = c(2,4)*pivot + c(2,5) = c(2,5)*pivot + r(2) = r(2) *pivot + + coeff = lhs(1,2) + lhs(1,3)= lhs(1,3) - coeff*lhs(2,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(2,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(2,5) + c(1,1) = c(1,1) - coeff*c(2,1) + c(1,2) = c(1,2) - coeff*c(2,2) + c(1,3) = c(1,3) - coeff*c(2,3) + c(1,4) = c(1,4) - coeff*c(2,4) + c(1,5) = c(1,5) - coeff*c(2,5) + r(1) = r(1) - coeff*r(2) + + coeff = lhs(3,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(2,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(2,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(2,5) + c(3,1) = c(3,1) - coeff*c(2,1) + c(3,2) = c(3,2) - coeff*c(2,2) + c(3,3) = c(3,3) - coeff*c(2,3) + c(3,4) = c(3,4) - coeff*c(2,4) + c(3,5) = c(3,5) - coeff*c(2,5) + r(3) = r(3) - coeff*r(2) + + coeff = lhs(4,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(2,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(2,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(2,5) + c(4,1) = c(4,1) - coeff*c(2,1) + c(4,2) = c(4,2) - coeff*c(2,2) + c(4,3) = c(4,3) - coeff*c(2,3) + c(4,4) = c(4,4) - coeff*c(2,4) + c(4,5) = c(4,5) - coeff*c(2,5) + r(4) = r(4) - coeff*r(2) + + coeff = lhs(5,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(2,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(2,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(2,5) + c(5,1) = c(5,1) - coeff*c(2,1) + c(5,2) = c(5,2) - coeff*c(2,2) + c(5,3) = c(5,3) - coeff*c(2,3) + c(5,4) = c(5,4) - coeff*c(2,4) + c(5,5) = c(5,5) - coeff*c(2,5) + r(5) = r(5) - coeff*r(2) + + + pivot = 1.00d0/lhs(3,3) + lhs(3,4) = lhs(3,4)*pivot + lhs(3,5) = lhs(3,5)*pivot + c(3,1) = c(3,1)*pivot + c(3,2) = c(3,2)*pivot + c(3,3) = c(3,3)*pivot + c(3,4) = c(3,4)*pivot + c(3,5) = c(3,5)*pivot + r(3) = r(3) *pivot + + coeff = lhs(1,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(3,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(3,5) + c(1,1) = c(1,1) - coeff*c(3,1) + c(1,2) = c(1,2) - coeff*c(3,2) + c(1,3) = c(1,3) - coeff*c(3,3) + c(1,4) = c(1,4) - coeff*c(3,4) + c(1,5) = c(1,5) - coeff*c(3,5) + r(1) = r(1) - coeff*r(3) + + coeff = lhs(2,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(3,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(3,5) + c(2,1) = c(2,1) - coeff*c(3,1) + c(2,2) = c(2,2) - coeff*c(3,2) + c(2,3) = c(2,3) - coeff*c(3,3) + c(2,4) = c(2,4) - coeff*c(3,4) + c(2,5) = c(2,5) - coeff*c(3,5) + r(2) = r(2) - coeff*r(3) + + coeff = lhs(4,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(3,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(3,5) + c(4,1) = c(4,1) - coeff*c(3,1) + c(4,2) = c(4,2) - coeff*c(3,2) + c(4,3) = c(4,3) - coeff*c(3,3) + c(4,4) = c(4,4) - coeff*c(3,4) + c(4,5) = c(4,5) - coeff*c(3,5) + r(4) = r(4) - coeff*r(3) + + coeff = lhs(5,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(3,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(3,5) + c(5,1) = c(5,1) - coeff*c(3,1) + c(5,2) = c(5,2) - coeff*c(3,2) + c(5,3) = c(5,3) - coeff*c(3,3) + c(5,4) = c(5,4) - coeff*c(3,4) + c(5,5) = c(5,5) - coeff*c(3,5) + r(5) = r(5) - coeff*r(3) + + + pivot = 1.00d0/lhs(4,4) + lhs(4,5) = lhs(4,5)*pivot + c(4,1) = c(4,1)*pivot + c(4,2) = c(4,2)*pivot + c(4,3) = c(4,3)*pivot + c(4,4) = c(4,4)*pivot + c(4,5) = c(4,5)*pivot + r(4) = r(4) *pivot + + coeff = lhs(1,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(4,5) + c(1,1) = c(1,1) - coeff*c(4,1) + c(1,2) = c(1,2) - coeff*c(4,2) + c(1,3) = c(1,3) - coeff*c(4,3) + c(1,4) = c(1,4) - coeff*c(4,4) + c(1,5) = c(1,5) - coeff*c(4,5) + r(1) = r(1) - coeff*r(4) + + coeff = lhs(2,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(4,5) + c(2,1) = c(2,1) - coeff*c(4,1) + c(2,2) = c(2,2) - coeff*c(4,2) + c(2,3) = c(2,3) - coeff*c(4,3) + c(2,4) = c(2,4) - coeff*c(4,4) + c(2,5) = c(2,5) - coeff*c(4,5) + r(2) = r(2) - coeff*r(4) + + coeff = lhs(3,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(4,5) + c(3,1) = c(3,1) - coeff*c(4,1) + c(3,2) = c(3,2) - coeff*c(4,2) + c(3,3) = c(3,3) - coeff*c(4,3) + c(3,4) = c(3,4) - coeff*c(4,4) + c(3,5) = c(3,5) - coeff*c(4,5) + r(3) = r(3) - coeff*r(4) + + coeff = lhs(5,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(4,5) + c(5,1) = c(5,1) - coeff*c(4,1) + c(5,2) = c(5,2) - coeff*c(4,2) + c(5,3) = c(5,3) - coeff*c(4,3) + c(5,4) = c(5,4) - coeff*c(4,4) + c(5,5) = c(5,5) - coeff*c(4,5) + r(5) = r(5) - coeff*r(4) + + + pivot = 1.00d0/lhs(5,5) + c(5,1) = c(5,1)*pivot + c(5,2) = c(5,2)*pivot + c(5,3) = c(5,3)*pivot + c(5,4) = c(5,4)*pivot + c(5,5) = c(5,5)*pivot + r(5) = r(5) *pivot + + coeff = lhs(1,5) + c(1,1) = c(1,1) - coeff*c(5,1) + c(1,2) = c(1,2) - coeff*c(5,2) + c(1,3) = c(1,3) - coeff*c(5,3) + c(1,4) = c(1,4) - coeff*c(5,4) + c(1,5) = c(1,5) - coeff*c(5,5) + r(1) = r(1) - coeff*r(5) + + coeff = lhs(2,5) + c(2,1) = c(2,1) - coeff*c(5,1) + c(2,2) = c(2,2) - coeff*c(5,2) + c(2,3) = c(2,3) - coeff*c(5,3) + c(2,4) = c(2,4) - coeff*c(5,4) + c(2,5) = c(2,5) - coeff*c(5,5) + r(2) = r(2) - coeff*r(5) + + coeff = lhs(3,5) + c(3,1) = c(3,1) - coeff*c(5,1) + c(3,2) = c(3,2) - coeff*c(5,2) + c(3,3) = c(3,3) - coeff*c(5,3) + c(3,4) = c(3,4) - coeff*c(5,4) + c(3,5) = c(3,5) - coeff*c(5,5) + r(3) = r(3) - coeff*r(5) + + coeff = lhs(4,5) + c(4,1) = c(4,1) - coeff*c(5,1) + c(4,2) = c(4,2) - coeff*c(5,2) + c(4,3) = c(4,3) - coeff*c(5,3) + c(4,4) = c(4,4) - coeff*c(5,4) + c(4,5) = c(4,5) - coeff*c(5,5) + r(4) = r(4) - coeff*r(5) + + + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine binvrhs( lhs,r ) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c +c--------------------------------------------------------------------- + + implicit none + + double precision pivot, coeff, lhs + dimension lhs(5,5) + double precision r(5) + +c--------------------------------------------------------------------- +c +c--------------------------------------------------------------------- + + + pivot = 1.00d0/lhs(1,1) + lhs(1,2) = lhs(1,2)*pivot + lhs(1,3) = lhs(1,3)*pivot + lhs(1,4) = lhs(1,4)*pivot + lhs(1,5) = lhs(1,5)*pivot + r(1) = r(1) *pivot + + coeff = lhs(2,1) + lhs(2,2)= lhs(2,2) - coeff*lhs(1,2) + lhs(2,3)= lhs(2,3) - coeff*lhs(1,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(1,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(1,5) + r(2) = r(2) - coeff*r(1) + + coeff = lhs(3,1) + lhs(3,2)= lhs(3,2) - coeff*lhs(1,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(1,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(1,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(1,5) + r(3) = r(3) - coeff*r(1) + + coeff = lhs(4,1) + lhs(4,2)= lhs(4,2) - coeff*lhs(1,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(1,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(1,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(1,5) + r(4) = r(4) - coeff*r(1) + + coeff = lhs(5,1) + lhs(5,2)= lhs(5,2) - coeff*lhs(1,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(1,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(1,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(1,5) + r(5) = r(5) - coeff*r(1) + + + pivot = 1.00d0/lhs(2,2) + lhs(2,3) = lhs(2,3)*pivot + lhs(2,4) = lhs(2,4)*pivot + lhs(2,5) = lhs(2,5)*pivot + r(2) = r(2) *pivot + + coeff = lhs(1,2) + lhs(1,3)= lhs(1,3) - coeff*lhs(2,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(2,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(2,5) + r(1) = r(1) - coeff*r(2) + + coeff = lhs(3,2) + lhs(3,3)= lhs(3,3) - coeff*lhs(2,3) + lhs(3,4)= lhs(3,4) - coeff*lhs(2,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(2,5) + r(3) = r(3) - coeff*r(2) + + coeff = lhs(4,2) + lhs(4,3)= lhs(4,3) - coeff*lhs(2,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(2,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(2,5) + r(4) = r(4) - coeff*r(2) + + coeff = lhs(5,2) + lhs(5,3)= lhs(5,3) - coeff*lhs(2,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(2,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(2,5) + r(5) = r(5) - coeff*r(2) + + + pivot = 1.00d0/lhs(3,3) + lhs(3,4) = lhs(3,4)*pivot + lhs(3,5) = lhs(3,5)*pivot + r(3) = r(3) *pivot + + coeff = lhs(1,3) + lhs(1,4)= lhs(1,4) - coeff*lhs(3,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(3,5) + r(1) = r(1) - coeff*r(3) + + coeff = lhs(2,3) + lhs(2,4)= lhs(2,4) - coeff*lhs(3,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(3,5) + r(2) = r(2) - coeff*r(3) + + coeff = lhs(4,3) + lhs(4,4)= lhs(4,4) - coeff*lhs(3,4) + lhs(4,5)= lhs(4,5) - coeff*lhs(3,5) + r(4) = r(4) - coeff*r(3) + + coeff = lhs(5,3) + lhs(5,4)= lhs(5,4) - coeff*lhs(3,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(3,5) + r(5) = r(5) - coeff*r(3) + + + pivot = 1.00d0/lhs(4,4) + lhs(4,5) = lhs(4,5)*pivot + r(4) = r(4) *pivot + + coeff = lhs(1,4) + lhs(1,5)= lhs(1,5) - coeff*lhs(4,5) + r(1) = r(1) - coeff*r(4) + + coeff = lhs(2,4) + lhs(2,5)= lhs(2,5) - coeff*lhs(4,5) + r(2) = r(2) - coeff*r(4) + + coeff = lhs(3,4) + lhs(3,5)= lhs(3,5) - coeff*lhs(4,5) + r(3) = r(3) - coeff*r(4) + + coeff = lhs(5,4) + lhs(5,5)= lhs(5,5) - coeff*lhs(4,5) + r(5) = r(5) - coeff*r(4) + + + pivot = 1.00d0/lhs(5,5) + r(5) = r(5) *pivot + + coeff = lhs(1,5) + r(1) = r(1) - coeff*r(5) + + coeff = lhs(2,5) + r(2) = r(2) - coeff*r(5) + + coeff = lhs(3,5) + r(3) = r(3) - coeff*r(5) + + coeff = lhs(4,5) + r(4) = r(4) - coeff*r(5) + + + return + end + + + diff --git b/NPB3.3-MPI/BT/verify.f a/NPB3.3-MPI/BT/verify.f new file mode 100644 index 0000000..d1863f2 --- /dev/null +++ a/NPB3.3-MPI/BT/verify.f @@ -0,0 +1,434 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine verify(no_time_steps, class, verified) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c verification routine +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + double precision xcrref(5),xceref(5),xcrdif(5),xcedif(5), + > epsilon, xce(5), xcr(5), dtref + integer m, no_time_steps + character class + logical verified + +c--------------------------------------------------------------------- +c tolerance level +c--------------------------------------------------------------------- + epsilon = 1.0d-08 + verified = .true. + +c--------------------------------------------------------------------- +c compute the error norm and the residual norm, and exit if not printing +c--------------------------------------------------------------------- + + if (iotype .ne. 0) then + call accumulate_norms(xce) + else + call error_norm(xce) + endif + + call copy_faces + + call rhs_norm(xcr) + + do m = 1, 5 + xcr(m) = xcr(m) / dt + enddo + + if (node .ne. 0) return + + class = 'U' + + do m = 1,5 + xcrref(m) = 1.0 + xceref(m) = 1.0 + end do + +c--------------------------------------------------------------------- +c reference data for 12X12X12 grids after 60 time steps, with DT = 1.0d-02 +c--------------------------------------------------------------------- + if ( (grid_points(1) .eq. 12 ) .and. + > (grid_points(2) .eq. 12 ) .and. + > (grid_points(3) .eq. 12 ) .and. + > (no_time_steps .eq. 60 )) then + + class = 'S' + dtref = 1.0d-2 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 1.7034283709541311d-01 + xcrref(2) = 1.2975252070034097d-02 + xcrref(3) = 3.2527926989486055d-02 + xcrref(4) = 2.6436421275166801d-02 + xcrref(5) = 1.9211784131744430d-01 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 4.9976913345811579d-04 + xceref(2) = 4.5195666782961927d-05 + xceref(3) = 7.3973765172921357d-05 + xceref(4) = 7.3821238632439731d-05 + xceref(5) = 8.9269630987491446d-04 + else + xceref(1) = 0.1149036328945d+02 + xceref(2) = 0.9156788904727d+00 + xceref(3) = 0.2857899428614d+01 + xceref(4) = 0.2598273346734d+01 + xceref(5) = 0.2652795397547d+02 + endif + +c--------------------------------------------------------------------- +c reference data for 24X24X24 grids after 200 time steps, with DT = 0.8d-3 +c--------------------------------------------------------------------- + elseif ( (grid_points(1) .eq. 24) .and. + > (grid_points(2) .eq. 24) .and. + > (grid_points(3) .eq. 24) .and. + > (no_time_steps . eq. 200) ) then + + class = 'W' + dtref = 0.8d-3 +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 0.1125590409344d+03 + xcrref(2) = 0.1180007595731d+02 + xcrref(3) = 0.2710329767846d+02 + xcrref(4) = 0.2469174937669d+02 + xcrref(5) = 0.2638427874317d+03 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.4419655736008d+01 + xceref(2) = 0.4638531260002d+00 + xceref(3) = 0.1011551749967d+01 + xceref(4) = 0.9235878729944d+00 + xceref(5) = 0.1018045837718d+02 + else + xceref(1) = 0.6729594398612d+02 + xceref(2) = 0.5264523081690d+01 + xceref(3) = 0.1677107142637d+02 + xceref(4) = 0.1508721463436d+02 + xceref(5) = 0.1477018363393d+03 + endif + + +c--------------------------------------------------------------------- +c reference data for 64X64X64 grids after 200 time steps, with DT = 0.8d-3 +c--------------------------------------------------------------------- + elseif ( (grid_points(1) .eq. 64) .and. + > (grid_points(2) .eq. 64) .and. + > (grid_points(3) .eq. 64) .and. + > (no_time_steps . eq. 200) ) then + + class = 'A' + dtref = 0.8d-3 +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 1.0806346714637264d+02 + xcrref(2) = 1.1319730901220813d+01 + xcrref(3) = 2.5974354511582465d+01 + xcrref(4) = 2.3665622544678910d+01 + xcrref(5) = 2.5278963211748344d+02 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 4.2348416040525025d+00 + xceref(2) = 4.4390282496995698d-01 + xceref(3) = 9.6692480136345650d-01 + xceref(4) = 8.8302063039765474d-01 + xceref(5) = 9.7379901770829278d+00 + else + xceref(1) = 0.6482218724961d+02 + xceref(2) = 0.5066461714527d+01 + xceref(3) = 0.1613931961359d+02 + xceref(4) = 0.1452010201481d+02 + xceref(5) = 0.1420099377681d+03 + endif + +c--------------------------------------------------------------------- +c reference data for 102X102X102 grids after 200 time steps, +c with DT = 3.0d-04 +c--------------------------------------------------------------------- + elseif ( (grid_points(1) .eq. 102) .and. + > (grid_points(2) .eq. 102) .and. + > (grid_points(3) .eq. 102) .and. + > (no_time_steps . eq. 200) ) then + + class = 'B' + dtref = 3.0d-4 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 1.4233597229287254d+03 + xcrref(2) = 9.9330522590150238d+01 + xcrref(3) = 3.5646025644535285d+02 + xcrref(4) = 3.2485447959084092d+02 + xcrref(5) = 3.2707541254659363d+03 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 5.2969847140936856d+01 + xceref(2) = 4.4632896115670668d+00 + xceref(3) = 1.3122573342210174d+01 + xceref(4) = 1.2006925323559144d+01 + xceref(5) = 1.2459576151035986d+02 + else + xceref(1) = 0.1477545106464d+03 + xceref(2) = 0.1108895555053d+02 + xceref(3) = 0.3698065590331d+02 + xceref(4) = 0.3310505581440d+02 + xceref(5) = 0.3157928282563d+03 + endif + +c--------------------------------------------------------------------- +c reference data for 162X162X162 grids after 200 time steps, +c with DT = 1.0d-04 +c--------------------------------------------------------------------- + elseif ( (grid_points(1) .eq. 162) .and. + > (grid_points(2) .eq. 162) .and. + > (grid_points(3) .eq. 162) .and. + > (no_time_steps . eq. 200) ) then + + class = 'C' + dtref = 1.0d-4 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 0.62398116551764615d+04 + xcrref(2) = 0.50793239190423964d+03 + xcrref(3) = 0.15423530093013596d+04 + xcrref(4) = 0.13302387929291190d+04 + xcrref(5) = 0.11604087428436455d+05 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.16462008369091265d+03 + xceref(2) = 0.11497107903824313d+02 + xceref(3) = 0.41207446207461508d+02 + xceref(4) = 0.37087651059694167d+02 + xceref(5) = 0.36211053051841265d+03 + else + xceref(1) = 0.2597156483475d+03 + xceref(2) = 0.1985384289495d+02 + xceref(3) = 0.6517950485788d+02 + xceref(4) = 0.5757235541520d+02 + xceref(5) = 0.5215668188726d+03 + endif + + +c--------------------------------------------------------------------- +c reference data for 408x408x408 grids after 250 time steps, +c with DT = 0.2d-04 +c--------------------------------------------------------------------- + elseif ( (grid_points(1) .eq. 408) .and. + > (grid_points(2) .eq. 408) .and. + > (grid_points(3) .eq. 408) .and. + > (no_time_steps . eq. 250) ) then + + class = 'D' + dtref = 0.2d-4 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 0.2533188551738d+05 + xcrref(2) = 0.2346393716980d+04 + xcrref(3) = 0.6294554366904d+04 + xcrref(4) = 0.5352565376030d+04 + xcrref(5) = 0.3905864038618d+05 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.3100009377557d+03 + xceref(2) = 0.2424086324913d+02 + xceref(3) = 0.7782212022645d+02 + xceref(4) = 0.6835623860116d+02 + xceref(5) = 0.6065737200368d+03 + else + xceref(1) = 0.3813781566713d+03 + xceref(2) = 0.3160872966198d+02 + xceref(3) = 0.9593576357290d+02 + xceref(4) = 0.8363391989815d+02 + xceref(5) = 0.7063466087423d+03 + endif + + +c--------------------------------------------------------------------- +c reference data for 1020x1020x1020 grids after 250 time steps, +c with DT = 0.4d-05 +c--------------------------------------------------------------------- + elseif ( (grid_points(1) .eq. 1020) .and. + > (grid_points(2) .eq. 1020) .and. + > (grid_points(3) .eq. 1020) .and. + > (no_time_steps . eq. 250) ) then + + class = 'E' + dtref = 0.4d-5 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of residual. +c--------------------------------------------------------------------- + xcrref(1) = 0.9795372484517d+05 + xcrref(2) = 0.9739814511521d+04 + xcrref(3) = 0.2467606342965d+05 + xcrref(4) = 0.2092419572860d+05 + xcrref(5) = 0.1392138856939d+06 + +c--------------------------------------------------------------------- +c Reference values of RMS-norms of solution error. +c--------------------------------------------------------------------- + + if (iotype .eq. 0) then + xceref(1) = 0.4327562208414d+03 + xceref(2) = 0.3699051964887d+02 + xceref(3) = 0.1089845040954d+03 + xceref(4) = 0.9462517622043d+02 + xceref(5) = 0.7765512765309d+03 + else +c wr_interval = 5 + xceref(1) = 0.4729898413058d+03 + xceref(2) = 0.4145899331704d+02 + xceref(3) = 0.1192850917138d+03 + xceref(4) = 0.1032746026932d+03 + xceref(5) = 0.8270322177634d+03 +c wr_interval = 10 +c xceref(1) = 0.4718135916251d+03 +c xceref(2) = 0.4132620259096d+02 +c xceref(3) = 0.1189831133503d+03 +c xceref(4) = 0.1030212798803d+03 +c xceref(5) = 0.8255924078458d+03 + endif + + else + verified = .false. + endif + +c--------------------------------------------------------------------- +c verification test for residuals if gridsize is one of +c the defined grid sizes above (class .ne. 'U') +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Compute the difference of solution values and the known reference +c values. +c--------------------------------------------------------------------- + do m = 1, 5 + + xcrdif(m) = dabs((xcr(m)-xcrref(m))/xcrref(m)) + xcedif(m) = dabs((xce(m)-xceref(m))/xceref(m)) + + enddo + +c--------------------------------------------------------------------- +c Output the comparison of computed results to known cases. +c--------------------------------------------------------------------- + + if (class .ne. 'U') then + write(*, 1990) class + 1990 format(' Verification being performed for class ', a) + write (*,2000) epsilon + 2000 format(' accuracy setting for epsilon = ', E20.13) + verified = (dabs(dt-dtref) .le. epsilon) + if (.not.verified) then + class = 'U' + write (*,1000) dtref + 1000 format(' DT does not match the reference value of ', + > E15.8) + endif + else + write(*, 1995) + 1995 format(' Unknown class') + endif + + + if (class .ne. 'U') then + write (*,2001) + else + write (*, 2005) + endif + + 2001 format(' Comparison of RMS-norms of residual') + 2005 format(' RMS-norms of residual') + do m = 1, 5 + if (class .eq. 'U') then + write(*, 2015) m, xcr(m) + else if (xcrdif(m) .le. epsilon) then + write (*,2011) m,xcr(m),xcrref(m),xcrdif(m) + else + verified = .false. + write (*,2010) m,xcr(m),xcrref(m),xcrdif(m) + endif + enddo + + if (class .ne. 'U') then + write (*,2002) + else + write (*,2006) + endif + 2002 format(' Comparison of RMS-norms of solution error') + 2006 format(' RMS-norms of solution error') + + do m = 1, 5 + if (class .eq. 'U') then + write(*, 2015) m, xce(m) + else if (xcedif(m) .le. epsilon) then + write (*,2011) m,xce(m),xceref(m),xcedif(m) + else + verified = .false. + write (*,2010) m,xce(m),xceref(m),xcedif(m) + endif + enddo + + 2010 format(' FAILURE: ', i2, E20.13, E20.13, E20.13) + 2011 format(' ', i2, E20.13, E20.13, E20.13) + 2015 format(' ', i2, E20.13) + + if (class .eq. 'U') then + write(*, 2022) + write(*, 2023) + 2022 format(' No reference values provided') + 2023 format(' No verification performed') + else if (verified) then + write(*, 2020) + 2020 format(' Verification Successful') + else + write(*, 2021) + 2021 format(' Verification failed') + endif + + return + + + end diff --git b/NPB3.3-MPI/BT/work_lhs.h a/NPB3.3-MPI/BT/work_lhs.h new file mode 100644 index 0000000..d9bc9e4 --- /dev/null +++ a/NPB3.3-MPI/BT/work_lhs.h @@ -0,0 +1,14 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- +c +c work_lhs.h +c +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + double precision fjac(5, 5, -2:MAX_CELL_DIM+1), + > njac(5, 5, -2:MAX_CELL_DIM+1), + > lhsa(5, 5, -1:MAX_CELL_DIM), + > lhsb(5, 5, -1:MAX_CELL_DIM), + > tmp1, tmp2, tmp3 + common /work_lhs/ fjac, njac, lhsa, lhsb, tmp1, tmp2, tmp3 diff --git b/NPB3.3-MPI/BT/work_lhs_vec.h a/NPB3.3-MPI/BT/work_lhs_vec.h new file mode 100644 index 0000000..a97054f --- /dev/null +++ a/NPB3.3-MPI/BT/work_lhs_vec.h @@ -0,0 +1,14 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- +c +c work_lhs_vec.h +c +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + double precision fjac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1), + > njac(5, 5, -2:MAX_CELL_DIM+1, -2:MAX_CELL_DIM+1), + > lhsa(5, 5, -1:MAX_CELL_DIM, -1:MAX_CELL_DIM), + > lhsb(5, 5, -1:MAX_CELL_DIM, -1:MAX_CELL_DIM), + > tmp1, tmp2, tmp3 + common /work_lhs/ fjac, njac, lhsa, lhsb, tmp1, tmp2, tmp3 diff --git b/NPB3.3-MPI/BT/x_solve.f a/NPB3.3-MPI/BT/x_solve.f new file mode 100644 index 0000000..ecc2c02 --- /dev/null +++ a/NPB3.3-MPI/BT/x_solve.f @@ -0,0 +1,771 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_solve + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c +c Performs line solves in X direction by first factoring +c the block-tridiagonal matrix into an upper triangular matrix, +c and then performing back substitution to solve for the unknow +c vectors of each line. +c +c Make sure we treat elements zero to cell_size in the direction +c of the sweep. +c +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + integer c, istart, stage, + > first, last, recv_id, error, r_status(MPI_STATUS_SIZE), + > isize,jsize,ksize,send_id + + istart = 0 + + if (timeron) call timer_start(t_xsolve) +c--------------------------------------------------------------------- +c in our terminology stage is the number of the cell in the x-direction +c i.e. stage = 1 means the start of the line stage=ncells means end +c--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(1,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +c--------------------------------------------------------------------- +c set last-cell flag +c--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +c--------------------------------------------------------------------- +c This is the first cell, so solve without receiving data +c--------------------------------------------------------------------- + first = 1 +c call lhsx(c) + call x_solve_cell(first,last,c) + else +c--------------------------------------------------------------------- +c Not the first cell of this line, so receive info from +c processor working on preceeding cell +c--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_xcomm) + call x_receive_solve_info(recv_id,c) +c--------------------------------------------------------------------- +c overlap computations and communications +c--------------------------------------------------------------------- +c call lhsx(c) +c--------------------------------------------------------------------- +c wait for completion +c--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) +c--------------------------------------------------------------------- +c install C'(istart) and rhs'(istart) to be used in this cell +c--------------------------------------------------------------------- + call x_unpack_solve_info(c) + call x_solve_cell(first,last,c) + endif + + if (last .eq. 0) call x_send_solve_info(send_id,c) + enddo + +c--------------------------------------------------------------------- +c now perform backsubstitution in reverse direction +c--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(1,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +c--------------------------------------------------------------------- +c last cell, so perform back substitute without waiting +c--------------------------------------------------------------------- + call x_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_xcomm) + call x_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) + call x_unpack_backsub_info(c) + call x_backsubstitute(first,last,c) + endif + if (first .eq. 0) call x_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_xsolve) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_unpack_solve_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack C'(-1) and rhs'(-1) for +c all j and k +c--------------------------------------------------------------------- + + include 'header.h' + integer j,k,m,n,ptr,c,istart + + istart = 0 + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_send_solve_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send C'(iend) and rhs'(iend) for +c all j and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer j,k,m,n,isize,ptr,c,jp,kp + integer error,send_id,buffer_size + + isize = cell_size(1,c)-1 + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +c--------------------------------------------------------------------- +c pack up buffer +c--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +c--------------------------------------------------------------------- +c send buffer +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, successor(1), + > WEST+jp+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_send_backsub_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send U(istart) for all j and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer j,k,n,ptr,c,istart,jp,kp + integer error,send_id,buffer_size + +c--------------------------------------------------------------------- +c Send element 0 to previous processor +c--------------------------------------------------------------------- + istart = 0 + jp = cell_coord(2,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,istart,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, predecessor(1), + > EAST+jp+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_unpack_backsub_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack U(isize) for all j and k +c--------------------------------------------------------------------- + + include 'header.h' + integer j,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_receive_backsub_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer error,recv_id,jp,kp,c,buffer_size + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, + > dp_type, successor(1), + > EAST+jp+kp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_receive_solve_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer jp,kp,recv_id,error,c,buffer_size + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, + > dp_type, predecessor(1), + > WEST+jp+kp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_backsubstitute(first, last, c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c back solve: if last cell, then generate U(isize)=rhs(isize) +c else assume U(isize) is loaded in un pack backsub_info +c so just use it +c after call u(istart) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + + integer first, last, c, i, j, k + integer m,n,isize,jsize,ksize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do j=start(2,c),jsize +c--------------------------------------------------------------------- +c U(isize) uses info from previous cell if not last cell +c--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) + > - lhsc(m,n,isize,j,k,c)* + > backsub_info(n,j,k,c) +c--------------------------------------------------------------------- +c rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) +c $ - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c) +c--------------------------------------------------------------------- + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=start(2,c),jsize + do i=isize-1,istart,-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) + > - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_solve_cell(first,last,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c performs guaussian elimination on this cell. +c +c assumes that unpacking routines for non-first cells +c preload C' and rhs' from previous cell. +c +c assumed send happens outside this routine, but that +c c'(IMAX) and rhs'(IMAX) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + include 'work_lhs.h' + + integer first,last,c + integer i,j,k,isize,ksize,jsize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + + call lhsabinit(lhsa, lhsb, isize) + + do k=start(3,c),ksize + do j=start(2,c),jsize + +c--------------------------------------------------------------------- +c This function computes the left hand side in the xi-direction +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c determine a (labeled f) and n jacobians for cell c +c--------------------------------------------------------------------- + do i = start(1,c)-1, cell_size(1,c) - end(1,c) + + tmp1 = rho_i(i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 +c--------------------------------------------------------------------- +c +c--------------------------------------------------------------------- + fjac(1,1,i) = 0.0d+00 + fjac(1,2,i) = 1.0d+00 + fjac(1,3,i) = 0.0d+00 + fjac(1,4,i) = 0.0d+00 + fjac(1,5,i) = 0.0d+00 + + fjac(2,1,i) = -(u(2,i,j,k,c) * tmp2 * + > u(2,i,j,k,c)) + > + c2 * qs(i,j,k,c) + fjac(2,2,i) = ( 2.0d+00 - c2 ) + > * ( u(2,i,j,k,c) * tmp1 ) + fjac(2,3,i) = - c2 * ( u(3,i,j,k,c) * tmp1 ) + fjac(2,4,i) = - c2 * ( u(4,i,j,k,c) * tmp1 ) + fjac(2,5,i) = c2 + + fjac(3,1,i) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2 + fjac(3,2,i) = u(3,i,j,k,c) * tmp1 + fjac(3,3,i) = u(2,i,j,k,c) * tmp1 + fjac(3,4,i) = 0.0d+00 + fjac(3,5,i) = 0.0d+00 + + fjac(4,1,i) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2 + fjac(4,2,i) = u(4,i,j,k,c) * tmp1 + fjac(4,3,i) = 0.0d+00 + fjac(4,4,i) = u(2,i,j,k,c) * tmp1 + fjac(4,5,i) = 0.0d+00 + + fjac(5,1,i) = ( c2 * 2.0d0 * qs(i,j,k,c) + > - c1 * ( u(5,i,j,k,c) * tmp1 ) ) + > * ( u(2,i,j,k,c) * tmp1 ) + fjac(5,2,i) = c1 * u(5,i,j,k,c) * tmp1 + > - c2 + > * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2 + > + qs(i,j,k,c) ) + fjac(5,3,i) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) ) + > * tmp2 + fjac(5,4,i) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) ) + > * tmp2 + fjac(5,5,i) = c1 * ( u(2,i,j,k,c) * tmp1 ) + + njac(1,1,i) = 0.0d+00 + njac(1,2,i) = 0.0d+00 + njac(1,3,i) = 0.0d+00 + njac(1,4,i) = 0.0d+00 + njac(1,5,i) = 0.0d+00 + + njac(2,1,i) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i) = con43 * c3c4 * tmp1 + njac(2,3,i) = 0.0d+00 + njac(2,4,i) = 0.0d+00 + njac(2,5,i) = 0.0d+00 + + njac(3,1,i) = - c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i) = 0.0d+00 + njac(3,3,i) = c3c4 * tmp1 + njac(3,4,i) = 0.0d+00 + njac(3,5,i) = 0.0d+00 + + njac(4,1,i) = - c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i) = 0.0d+00 + njac(4,3,i) = 0.0d+00 + njac(4,4,i) = c3c4 * tmp1 + njac(4,5,i) = 0.0d+00 + + njac(5,1,i) = - ( con43 * c3c4 + > - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) + > - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) + > - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) + > - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i) = ( con43 * c3c4 + > - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i) = ( c1345 ) * tmp1 + + enddo +c--------------------------------------------------------------------- +c now jacobians set, so form left hand side in x direction +c--------------------------------------------------------------------- + do i = start(1,c), isize - end(1,c) + + tmp1 = dt * tx1 + tmp2 = dt * tx2 + + lhsa(1,1,i) = - tmp2 * fjac(1,1,i-1) + > - tmp1 * njac(1,1,i-1) + > - tmp1 * dx1 + lhsa(1,2,i) = - tmp2 * fjac(1,2,i-1) + > - tmp1 * njac(1,2,i-1) + lhsa(1,3,i) = - tmp2 * fjac(1,3,i-1) + > - tmp1 * njac(1,3,i-1) + lhsa(1,4,i) = - tmp2 * fjac(1,4,i-1) + > - tmp1 * njac(1,4,i-1) + lhsa(1,5,i) = - tmp2 * fjac(1,5,i-1) + > - tmp1 * njac(1,5,i-1) + + lhsa(2,1,i) = - tmp2 * fjac(2,1,i-1) + > - tmp1 * njac(2,1,i-1) + lhsa(2,2,i) = - tmp2 * fjac(2,2,i-1) + > - tmp1 * njac(2,2,i-1) + > - tmp1 * dx2 + lhsa(2,3,i) = - tmp2 * fjac(2,3,i-1) + > - tmp1 * njac(2,3,i-1) + lhsa(2,4,i) = - tmp2 * fjac(2,4,i-1) + > - tmp1 * njac(2,4,i-1) + lhsa(2,5,i) = - tmp2 * fjac(2,5,i-1) + > - tmp1 * njac(2,5,i-1) + + lhsa(3,1,i) = - tmp2 * fjac(3,1,i-1) + > - tmp1 * njac(3,1,i-1) + lhsa(3,2,i) = - tmp2 * fjac(3,2,i-1) + > - tmp1 * njac(3,2,i-1) + lhsa(3,3,i) = - tmp2 * fjac(3,3,i-1) + > - tmp1 * njac(3,3,i-1) + > - tmp1 * dx3 + lhsa(3,4,i) = - tmp2 * fjac(3,4,i-1) + > - tmp1 * njac(3,4,i-1) + lhsa(3,5,i) = - tmp2 * fjac(3,5,i-1) + > - tmp1 * njac(3,5,i-1) + + lhsa(4,1,i) = - tmp2 * fjac(4,1,i-1) + > - tmp1 * njac(4,1,i-1) + lhsa(4,2,i) = - tmp2 * fjac(4,2,i-1) + > - tmp1 * njac(4,2,i-1) + lhsa(4,3,i) = - tmp2 * fjac(4,3,i-1) + > - tmp1 * njac(4,3,i-1) + lhsa(4,4,i) = - tmp2 * fjac(4,4,i-1) + > - tmp1 * njac(4,4,i-1) + > - tmp1 * dx4 + lhsa(4,5,i) = - tmp2 * fjac(4,5,i-1) + > - tmp1 * njac(4,5,i-1) + + lhsa(5,1,i) = - tmp2 * fjac(5,1,i-1) + > - tmp1 * njac(5,1,i-1) + lhsa(5,2,i) = - tmp2 * fjac(5,2,i-1) + > - tmp1 * njac(5,2,i-1) + lhsa(5,3,i) = - tmp2 * fjac(5,3,i-1) + > - tmp1 * njac(5,3,i-1) + lhsa(5,4,i) = - tmp2 * fjac(5,4,i-1) + > - tmp1 * njac(5,4,i-1) + lhsa(5,5,i) = - tmp2 * fjac(5,5,i-1) + > - tmp1 * njac(5,5,i-1) + > - tmp1 * dx5 + + lhsb(1,1,i) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(1,1,i) + > + tmp1 * 2.0d+00 * dx1 + lhsb(1,2,i) = tmp1 * 2.0d+00 * njac(1,2,i) + lhsb(1,3,i) = tmp1 * 2.0d+00 * njac(1,3,i) + lhsb(1,4,i) = tmp1 * 2.0d+00 * njac(1,4,i) + lhsb(1,5,i) = tmp1 * 2.0d+00 * njac(1,5,i) + + lhsb(2,1,i) = tmp1 * 2.0d+00 * njac(2,1,i) + lhsb(2,2,i) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(2,2,i) + > + tmp1 * 2.0d+00 * dx2 + lhsb(2,3,i) = tmp1 * 2.0d+00 * njac(2,3,i) + lhsb(2,4,i) = tmp1 * 2.0d+00 * njac(2,4,i) + lhsb(2,5,i) = tmp1 * 2.0d+00 * njac(2,5,i) + + lhsb(3,1,i) = tmp1 * 2.0d+00 * njac(3,1,i) + lhsb(3,2,i) = tmp1 * 2.0d+00 * njac(3,2,i) + lhsb(3,3,i) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(3,3,i) + > + tmp1 * 2.0d+00 * dx3 + lhsb(3,4,i) = tmp1 * 2.0d+00 * njac(3,4,i) + lhsb(3,5,i) = tmp1 * 2.0d+00 * njac(3,5,i) + + lhsb(4,1,i) = tmp1 * 2.0d+00 * njac(4,1,i) + lhsb(4,2,i) = tmp1 * 2.0d+00 * njac(4,2,i) + lhsb(4,3,i) = tmp1 * 2.0d+00 * njac(4,3,i) + lhsb(4,4,i) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(4,4,i) + > + tmp1 * 2.0d+00 * dx4 + lhsb(4,5,i) = tmp1 * 2.0d+00 * njac(4,5,i) + + lhsb(5,1,i) = tmp1 * 2.0d+00 * njac(5,1,i) + lhsb(5,2,i) = tmp1 * 2.0d+00 * njac(5,2,i) + lhsb(5,3,i) = tmp1 * 2.0d+00 * njac(5,3,i) + lhsb(5,4,i) = tmp1 * 2.0d+00 * njac(5,4,i) + lhsb(5,5,i) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(5,5,i) + > + tmp1 * 2.0d+00 * dx5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i+1) + > - tmp1 * njac(1,1,i+1) + > - tmp1 * dx1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i+1) + > - tmp1 * njac(1,2,i+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i+1) + > - tmp1 * njac(1,3,i+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i+1) + > - tmp1 * njac(1,4,i+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i+1) + > - tmp1 * njac(1,5,i+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i+1) + > - tmp1 * njac(2,1,i+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i+1) + > - tmp1 * njac(2,2,i+1) + > - tmp1 * dx2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i+1) + > - tmp1 * njac(2,3,i+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i+1) + > - tmp1 * njac(2,4,i+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i+1) + > - tmp1 * njac(2,5,i+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i+1) + > - tmp1 * njac(3,1,i+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i+1) + > - tmp1 * njac(3,2,i+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i+1) + > - tmp1 * njac(3,3,i+1) + > - tmp1 * dx3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i+1) + > - tmp1 * njac(3,4,i+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i+1) + > - tmp1 * njac(3,5,i+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i+1) + > - tmp1 * njac(4,1,i+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i+1) + > - tmp1 * njac(4,2,i+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i+1) + > - tmp1 * njac(4,3,i+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i+1) + > - tmp1 * njac(4,4,i+1) + > - tmp1 * dx4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i+1) + > - tmp1 * njac(4,5,i+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i+1) + > - tmp1 * njac(5,1,i+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i+1) + > - tmp1 * njac(5,2,i+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i+1) + > - tmp1 * njac(5,3,i+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i+1) + > - tmp1 * njac(5,4,i+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i+1) + > - tmp1 * njac(5,5,i+1) + > - tmp1 * dx5 + + enddo + + +c--------------------------------------------------------------------- +c outer most do loops - sweeping in i direction +c--------------------------------------------------------------------- + if (first .eq. 1) then + +c--------------------------------------------------------------------- +c multiply c(istart,j,k) by b_inverse and copy back to c +c multiply rhs(istart) by b_inverse(istart) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,istart), + > lhsc(1,1,istart,j,k,c), + > rhs(1,istart,j,k,c) ) + + endif + +c--------------------------------------------------------------------- +c begin inner most do loop +c do all the elements of the cell unless last +c--------------------------------------------------------------------- + do i=istart+first,isize-last + +c--------------------------------------------------------------------- +c rhs(i) = rhs(i) - A*rhs(i-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i), + > rhs(1,i-1,j,k,c),rhs(1,i,j,k,c)) + +c--------------------------------------------------------------------- +c B(i) = B(i) - C(i-1)*A(i) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i), + > lhsc(1,1,i-1,j,k,c), + > lhsb(1,1,i)) + + +c--------------------------------------------------------------------- +c multiply c(i,j,k) by b_inverse and copy back to c +c multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i), + > lhsc(1,1,i,j,k,c), + > rhs(1,i,j,k,c) ) + + enddo + +c--------------------------------------------------------------------- +c Now finish up special cases for last cell +c--------------------------------------------------------------------- + if (last .eq. 1) then + +c--------------------------------------------------------------------- +c rhs(isize) = rhs(isize) - A*rhs(isize-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,isize), + > rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c)) + +c--------------------------------------------------------------------- +c B(isize) = B(isize) - C(isize-1)*A(isize) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,isize), + > lhsc(1,1,isize-1,j,k,c), + > lhsb(1,1,isize)) + +c--------------------------------------------------------------------- +c multiply rhs() by b_inverse() and copy to rhs +c--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,isize), + > rhs(1,isize,j,k,c) ) + + endif + enddo + enddo + + + return + end + diff --git b/NPB3.3-MPI/BT/x_solve_vec.f a/NPB3.3-MPI/BT/x_solve_vec.f new file mode 100644 index 0000000..a4deef2 --- /dev/null +++ a/NPB3.3-MPI/BT/x_solve_vec.f @@ -0,0 +1,799 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_solve + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c +c Performs line solves in X direction by first factoring +c the block-tridiagonal matrix into an upper triangular matrix, +c and then performing back substitution to solve for the unknow +c vectors of each line. +c +c Make sure we treat elements zero to cell_size in the direction +c of the sweep. +c +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + integer c, istart, stage, + > first, last, recv_id, error, r_status(MPI_STATUS_SIZE), + > isize,jsize,ksize,send_id + + istart = 0 + + if (timeron) call timer_start(t_xsolve) +c--------------------------------------------------------------------- +c in our terminology stage is the number of the cell in the x-direct +c i.e. stage = 1 means the start of the line stage=ncells means end +c--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(1,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +c--------------------------------------------------------------------- +c set last-cell flag +c--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +c--------------------------------------------------------------------- +c This is the first cell, so solve without receiving data +c--------------------------------------------------------------------- + first = 1 +c call lhsx(c) + call x_solve_cell(first,last,c) + else +c--------------------------------------------------------------------- +c Not the first cell of this line, so receive info from +c processor working on preceeding cell +c--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_xcomm) + call x_receive_solve_info(recv_id,c) +c--------------------------------------------------------------------- +c overlap computations and communications +c--------------------------------------------------------------------- +c call lhsx(c) +c--------------------------------------------------------------------- +c wait for completion +c--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) +c--------------------------------------------------------------------- +c install C'(istart) and rhs'(istart) to be used in this cell +c--------------------------------------------------------------------- + call x_unpack_solve_info(c) + call x_solve_cell(first,last,c) + endif + + if (last .eq. 0) call x_send_solve_info(send_id,c) + enddo + +c--------------------------------------------------------------------- +c now perform backsubstitution in reverse direction +c--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(1,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +c--------------------------------------------------------------------- +c last cell, so perform back substitute without waiting +c--------------------------------------------------------------------- + call x_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_xcomm) + call x_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_xcomm) + call x_unpack_backsub_info(c) + call x_backsubstitute(first,last,c) + endif + if (first .eq. 0) call x_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_xsolve) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_unpack_solve_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack C'(-1) and rhs'(-1) for +c all j and k +c--------------------------------------------------------------------- + + include 'header.h' + integer j,k,m,n,ptr,c,istart + + istart = 0 + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,istart-1,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_send_solve_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send C'(iend) and rhs'(iend) for +c all j and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer j,k,m,n,isize,ptr,c,jp,kp + integer error,send_id,buffer_size + + isize = cell_size(1,c)-1 + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +c--------------------------------------------------------------------- +c pack up buffer +c--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,isize,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +c--------------------------------------------------------------------- +c send buffer +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, successor(1), + > WEST+jp+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_send_backsub_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send U(istart) for all j and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer j,k,n,ptr,c,istart,jp,kp + integer error,send_id,buffer_size + +c--------------------------------------------------------------------- +c Send element 0 to previous processor +c--------------------------------------------------------------------- + istart = 0 + jp = cell_coord(2,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,istart,j,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_xcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, predecessor(1), + > EAST+jp+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_xcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_unpack_backsub_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack U(isize) for all j and k +c--------------------------------------------------------------------- + + include 'header.h' + integer j,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do j=0,JMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,j,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_receive_backsub_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer error,recv_id,jp,kp,c,buffer_size + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, + > dp_type, successor(1), + > EAST+jp+kp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_receive_solve_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer jp,kp,recv_id,error,c,buffer_size + jp = cell_coord(2,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, + > dp_type, predecessor(1), + > WEST+jp+kp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_backsubstitute(first, last, c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c back solve: if last cell, then generate U(isize)=rhs(isize) +c else assume U(isize) is loaded in un pack backsub_info +c so just use it +c after call u(istart) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + + integer first, last, c, i, j, k + integer m,n,isize,jsize,ksize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do j=start(2,c),jsize +c--------------------------------------------------------------------- +c U(isize) uses info from previous cell if not last cell +c--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) + > - lhsc(m,n,isize,j,k,c)* + > backsub_info(n,j,k,c) +c--------------------------------------------------------------------- +c rhs(m,isize,j,k,c) = rhs(m,isize,j,k,c) +c $ - lhsc(m,n,isize,j,k,c)*rhs(n,isize+1,j,k,c) +c--------------------------------------------------------------------- + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=start(2,c),jsize + do i=isize-1,istart,-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) + > - lhsc(m,n,i,j,k,c)*rhs(n,i+1,j,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine x_solve_cell(first,last,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c performs guaussian elimination on this cell. +c +c assumes that unpacking routines for non-first cells +c preload C' and rhs' from previous cell. +c +c assumed send happens outside this routine, but that +c c'(IMAX) and rhs'(IMAX) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + include 'work_lhs_vec.h' + + integer first,last,c + integer i,j,k,m,n,isize,ksize,jsize,istart + + istart = 0 + isize = cell_size(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + +c--------------------------------------------------------------------- +c zero the left hand side for starters +c set diagonal values to 1. This is overkill, but convenient +c--------------------------------------------------------------------- + do j = 0, jsize + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,0,j) = 0.0d0 + lhsb(m,n,0,j) = 0.0d0 + lhsa(m,n,isize,j) = 0.0d0 + lhsb(m,n,isize,j) = 0.0d0 + enddo + lhsb(m,m,0,j) = 1.0d0 + lhsb(m,m,isize,j) = 1.0d0 + enddo + enddo + + do k=start(3,c),ksize + +c--------------------------------------------------------------------- +c This function computes the left hand side in the xi-direction +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c determine a (labeled f) and n jacobians for cell c +c--------------------------------------------------------------------- + do j=start(2,c),jsize + do i = start(1,c)-1, cell_size(1,c) - end(1,c) + + tmp1 = rho_i(i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 +c--------------------------------------------------------------------- +c +c--------------------------------------------------------------------- + fjac(1,1,i,j) = 0.0d+00 + fjac(1,2,i,j) = 1.0d+00 + fjac(1,3,i,j) = 0.0d+00 + fjac(1,4,i,j) = 0.0d+00 + fjac(1,5,i,j) = 0.0d+00 + + fjac(2,1,i,j) = -(u(2,i,j,k,c) * tmp2 * + > u(2,i,j,k,c)) + > + c2 * qs(i,j,k,c) + fjac(2,2,i,j) = ( 2.0d+00 - c2 ) + > * ( u(2,i,j,k,c) * tmp1 ) + fjac(2,3,i,j) = - c2 * ( u(3,i,j,k,c) * tmp1 ) + fjac(2,4,i,j) = - c2 * ( u(4,i,j,k,c) * tmp1 ) + fjac(2,5,i,j) = c2 + + fjac(3,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) * tmp2 + fjac(3,2,i,j) = u(3,i,j,k,c) * tmp1 + fjac(3,3,i,j) = u(2,i,j,k,c) * tmp1 + fjac(3,4,i,j) = 0.0d+00 + fjac(3,5,i,j) = 0.0d+00 + + fjac(4,1,i,j) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) * tmp2 + fjac(4,2,i,j) = u(4,i,j,k,c) * tmp1 + fjac(4,3,i,j) = 0.0d+00 + fjac(4,4,i,j) = u(2,i,j,k,c) * tmp1 + fjac(4,5,i,j) = 0.0d+00 + + fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c) + > - c1 * ( u(5,i,j,k,c) * tmp1 ) ) + > * ( u(2,i,j,k,c) * tmp1 ) + fjac(5,2,i,j) = c1 * u(5,i,j,k,c) * tmp1 + > - c2 + > * ( u(2,i,j,k,c)*u(2,i,j,k,c) * tmp2 + > + qs(i,j,k,c) ) + fjac(5,3,i,j) = - c2 * ( u(3,i,j,k,c)*u(2,i,j,k,c) ) + > * tmp2 + fjac(5,4,i,j) = - c2 * ( u(4,i,j,k,c)*u(2,i,j,k,c) ) + > * tmp2 + fjac(5,5,i,j) = c1 * ( u(2,i,j,k,c) * tmp1 ) + + njac(1,1,i,j) = 0.0d+00 + njac(1,2,i,j) = 0.0d+00 + njac(1,3,i,j) = 0.0d+00 + njac(1,4,i,j) = 0.0d+00 + njac(1,5,i,j) = 0.0d+00 + + njac(2,1,i,j) = - con43 * c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i,j) = con43 * c3c4 * tmp1 + njac(2,3,i,j) = 0.0d+00 + njac(2,4,i,j) = 0.0d+00 + njac(2,5,i,j) = 0.0d+00 + + njac(3,1,i,j) = - c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i,j) = 0.0d+00 + njac(3,3,i,j) = c3c4 * tmp1 + njac(3,4,i,j) = 0.0d+00 + njac(3,5,i,j) = 0.0d+00 + + njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i,j) = 0.0d+00 + njac(4,3,i,j) = 0.0d+00 + njac(4,4,i,j) = c3c4 * tmp1 + njac(4,5,i,j) = 0.0d+00 + + njac(5,1,i,j) = - ( con43 * c3c4 + > - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) + > - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) + > - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) + > - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i,j) = ( con43 * c3c4 + > - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i,j) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i,j) = ( c1345 ) * tmp1 + + enddo + enddo + +c--------------------------------------------------------------------- +c now jacobians set, so form left hand side in x direction +c--------------------------------------------------------------------- + do j=start(2,c),jsize + do i = start(1,c), isize - end(1,c) + + tmp1 = dt * tx1 + tmp2 = dt * tx2 + + lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i-1,j) + > - tmp1 * njac(1,1,i-1,j) + > - tmp1 * dx1 + lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i-1,j) + > - tmp1 * njac(1,2,i-1,j) + lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i-1,j) + > - tmp1 * njac(1,3,i-1,j) + lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i-1,j) + > - tmp1 * njac(1,4,i-1,j) + lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i-1,j) + > - tmp1 * njac(1,5,i-1,j) + + lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i-1,j) + > - tmp1 * njac(2,1,i-1,j) + lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i-1,j) + > - tmp1 * njac(2,2,i-1,j) + > - tmp1 * dx2 + lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i-1,j) + > - tmp1 * njac(2,3,i-1,j) + lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i-1,j) + > - tmp1 * njac(2,4,i-1,j) + lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i-1,j) + > - tmp1 * njac(2,5,i-1,j) + + lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i-1,j) + > - tmp1 * njac(3,1,i-1,j) + lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i-1,j) + > - tmp1 * njac(3,2,i-1,j) + lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i-1,j) + > - tmp1 * njac(3,3,i-1,j) + > - tmp1 * dx3 + lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i-1,j) + > - tmp1 * njac(3,4,i-1,j) + lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i-1,j) + > - tmp1 * njac(3,5,i-1,j) + + lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i-1,j) + > - tmp1 * njac(4,1,i-1,j) + lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i-1,j) + > - tmp1 * njac(4,2,i-1,j) + lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i-1,j) + > - tmp1 * njac(4,3,i-1,j) + lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i-1,j) + > - tmp1 * njac(4,4,i-1,j) + > - tmp1 * dx4 + lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i-1,j) + > - tmp1 * njac(4,5,i-1,j) + + lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i-1,j) + > - tmp1 * njac(5,1,i-1,j) + lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i-1,j) + > - tmp1 * njac(5,2,i-1,j) + lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i-1,j) + > - tmp1 * njac(5,3,i-1,j) + lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i-1,j) + > - tmp1 * njac(5,4,i-1,j) + lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i-1,j) + > - tmp1 * njac(5,5,i-1,j) + > - tmp1 * dx5 + + lhsb(1,1,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(1,1,i,j) + > + tmp1 * 2.0d+00 * dx1 + lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j) + lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j) + lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j) + lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j) + + lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j) + lhsb(2,2,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(2,2,i,j) + > + tmp1 * 2.0d+00 * dx2 + lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j) + lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j) + lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j) + + lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j) + lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j) + lhsb(3,3,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(3,3,i,j) + > + tmp1 * 2.0d+00 * dx3 + lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j) + lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j) + + lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j) + lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j) + lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j) + lhsb(4,4,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(4,4,i,j) + > + tmp1 * 2.0d+00 * dx4 + lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j) + + lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j) + lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j) + lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j) + lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j) + lhsb(5,5,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(5,5,i,j) + > + tmp1 * 2.0d+00 * dx5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i+1,j) + > - tmp1 * njac(1,1,i+1,j) + > - tmp1 * dx1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i+1,j) + > - tmp1 * njac(1,2,i+1,j) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i+1,j) + > - tmp1 * njac(1,3,i+1,j) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i+1,j) + > - tmp1 * njac(1,4,i+1,j) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i+1,j) + > - tmp1 * njac(1,5,i+1,j) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i+1,j) + > - tmp1 * njac(2,1,i+1,j) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i+1,j) + > - tmp1 * njac(2,2,i+1,j) + > - tmp1 * dx2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i+1,j) + > - tmp1 * njac(2,3,i+1,j) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i+1,j) + > - tmp1 * njac(2,4,i+1,j) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i+1,j) + > - tmp1 * njac(2,5,i+1,j) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i+1,j) + > - tmp1 * njac(3,1,i+1,j) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i+1,j) + > - tmp1 * njac(3,2,i+1,j) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i+1,j) + > - tmp1 * njac(3,3,i+1,j) + > - tmp1 * dx3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i+1,j) + > - tmp1 * njac(3,4,i+1,j) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i+1,j) + > - tmp1 * njac(3,5,i+1,j) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i+1,j) + > - tmp1 * njac(4,1,i+1,j) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i+1,j) + > - tmp1 * njac(4,2,i+1,j) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i+1,j) + > - tmp1 * njac(4,3,i+1,j) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i+1,j) + > - tmp1 * njac(4,4,i+1,j) + > - tmp1 * dx4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i+1,j) + > - tmp1 * njac(4,5,i+1,j) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i+1,j) + > - tmp1 * njac(5,1,i+1,j) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i+1,j) + > - tmp1 * njac(5,2,i+1,j) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i+1,j) + > - tmp1 * njac(5,3,i+1,j) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i+1,j) + > - tmp1 * njac(5,4,i+1,j) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i+1,j) + > - tmp1 * njac(5,5,i+1,j) + > - tmp1 * dx5 + + enddo + enddo + + +c--------------------------------------------------------------------- +c outer most do loops - sweeping in i direction +c--------------------------------------------------------------------- + if (first .eq. 1) then + +c--------------------------------------------------------------------- +c multiply c(istart,j,k) by b_inverse and copy back to c +c multiply rhs(istart) by b_inverse(istart) and copy to rhs +c--------------------------------------------------------------------- +!dir$ ivdep + do j=start(2,c),jsize + call binvcrhs( lhsb(1,1,istart,j), + > lhsc(1,1,istart,j,k,c), + > rhs(1,istart,j,k,c) ) + enddo + + endif + +c--------------------------------------------------------------------- +c begin inner most do loop +c do all the elements of the cell unless last +c--------------------------------------------------------------------- +!dir$ ivdep +!dir$ interchange(i,j) + do j=start(2,c),jsize + do i=istart+first,isize-last + +c--------------------------------------------------------------------- +c rhs(i) = rhs(i) - A*rhs(i-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,j), + > rhs(1,i-1,j,k,c),rhs(1,i,j,k,c)) + +c--------------------------------------------------------------------- +c B(i) = B(i) - C(i-1)*A(i) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,j), + > lhsc(1,1,i-1,j,k,c), + > lhsb(1,1,i,j)) + + +c--------------------------------------------------------------------- +c multiply c(i,j,k) by b_inverse and copy back to c +c multiply rhs(1,j,k) by b_inverse(1,j,k) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i,j), + > lhsc(1,1,i,j,k,c), + > rhs(1,i,j,k,c) ) + + enddo + enddo + +c--------------------------------------------------------------------- +c Now finish up special cases for last cell +c--------------------------------------------------------------------- + if (last .eq. 1) then + +!dir$ ivdep + do j=start(2,c),jsize +c--------------------------------------------------------------------- +c rhs(isize) = rhs(isize) - A*rhs(isize-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,isize,j), + > rhs(1,isize-1,j,k,c),rhs(1,isize,j,k,c)) + +c--------------------------------------------------------------------- +c B(isize) = B(isize) - C(isize-1)*A(isize) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,isize,j), + > lhsc(1,1,isize-1,j,k,c), + > lhsb(1,1,isize,j)) + +c--------------------------------------------------------------------- +c multiply rhs() by b_inverse() and copy to rhs +c--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,isize,j), + > rhs(1,isize,j,k,c) ) + enddo + + endif + enddo + + + return + end + diff --git b/NPB3.3-MPI/BT/y_solve.f a/NPB3.3-MPI/BT/y_solve.f new file mode 100644 index 0000000..50a028b --- /dev/null +++ a/NPB3.3-MPI/BT/y_solve.f @@ -0,0 +1,781 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_solve + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Performs line solves in Y direction by first factoring +c the block-tridiagonal matrix into an upper triangular matrix, +c and then performing back substitution to solve for the unknow +c vectors of each line. +c +c Make sure we treat elements zero to cell_size in the direction +c of the sweep. +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer + > c, jstart, stage, + > first, last, recv_id, error, r_status(MPI_STATUS_SIZE), + > isize,jsize,ksize,send_id + + jstart = 0 + + if (timeron) call timer_start(t_ysolve) +c--------------------------------------------------------------------- +c in our terminology stage is the number of the cell in the y-direction +c i.e. stage = 1 means the start of the line stage=ncells means end +c--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(2,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +c--------------------------------------------------------------------- +c set last-cell flag +c--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +c--------------------------------------------------------------------- +c This is the first cell, so solve without receiving data +c--------------------------------------------------------------------- + first = 1 +c call lhsy(c) + call y_solve_cell(first,last,c) + else +c--------------------------------------------------------------------- +c Not the first cell of this line, so receive info from +c processor working on preceeding cell +c--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_ycomm) + call y_receive_solve_info(recv_id,c) +c--------------------------------------------------------------------- +c overlap computations and communications +c--------------------------------------------------------------------- +c call lhsy(c) +c--------------------------------------------------------------------- +c wait for completion +c--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) +c--------------------------------------------------------------------- +c install C'(jstart+1) and rhs'(jstart+1) to be used in this cell +c--------------------------------------------------------------------- + call y_unpack_solve_info(c) + call y_solve_cell(first,last,c) + endif + + if (last .eq. 0) call y_send_solve_info(send_id,c) + enddo + +c--------------------------------------------------------------------- +c now perform backsubstitution in reverse direction +c--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(2,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +c--------------------------------------------------------------------- +c last cell, so perform back substitute without waiting +c--------------------------------------------------------------------- + call y_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_ycomm) + call y_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) + call y_unpack_backsub_info(c) + call y_backsubstitute(first,last,c) + endif + if (first .eq. 0) call y_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_ysolve) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_unpack_solve_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack C'(-1) and rhs'(-1) for +c all i and k +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,k,m,n,ptr,c,jstart + + jstart = 0 + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_send_solve_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send C'(jend) and rhs'(jend) for +c all i and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,k,m,n,jsize,ptr,c,ip,kp + integer error,send_id,buffer_size + + jsize = cell_size(2,c)-1 + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +c--------------------------------------------------------------------- +c pack up buffer +c--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +c--------------------------------------------------------------------- +c send buffer +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, successor(2), + > SOUTH+ip+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_send_backsub_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send U(jstart) for all i and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,k,n,ptr,c,jstart,ip,kp + integer error,send_id,buffer_size + +c--------------------------------------------------------------------- +c Send element 0 to previous processor +c--------------------------------------------------------------------- + jstart = 0 + ip = cell_coord(1,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jstart,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, predecessor(2), + > NORTH+ip+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_unpack_backsub_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack U(jsize) for all i and k +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_receive_backsub_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer error,recv_id,ip,kp,c,buffer_size + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, + > dp_type, successor(2), + > NORTH+ip+kp*NCELLS, comm_solve, + > recv_id, error) + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_receive_solve_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer ip,kp,recv_id,error,c,buffer_size + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, + > dp_type, predecessor(2), + > SOUTH+ip+kp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_backsubstitute(first, last, c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c back solve: if last cell, then generate U(jsize)=rhs(jsize) +c else assume U(jsize) is loaded in un pack backsub_info +c so just use it +c after call u(jstart) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,jstart + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do i=start(1,c),isize +c--------------------------------------------------------------------- +c U(jsize) uses info from previous cell if not last cell +c--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c) + > - lhsc(m,n,i,jsize,k,c)* + > backsub_info(n,i,k,c) + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=jsize-1,jstart,-1 + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) + > - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_solve_cell(first,last,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c performs guaussian elimination on this cell. +c +c assumes that unpacking routines for non-first cells +c preload C' and rhs' from previous cell. +c +c assumed send happens outside this routine, but that +c c'(JMAX) and rhs'(JMAX) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + include 'work_lhs.h' + + integer first,last,c + integer i,j,k,isize,ksize,jsize,jstart + double precision utmp(6,-2:JMAX+1) + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + + call lhsabinit(lhsa, lhsb, jsize) + + do k=start(3,c),ksize + do i=start(1,c),isize + +c--------------------------------------------------------------------- +c This function computes the left hand side for the three y-factors +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Compute the indices for storing the tri-diagonal matrix; +c determine a (labeled f) and n jacobians for cell c +c--------------------------------------------------------------------- + do j = start(2,c)-1, cell_size(2,c)-end(2,c) + utmp(1,j) = 1.0d0 / u(1,i,j,k,c) + utmp(2,j) = u(2,i,j,k,c) + utmp(3,j) = u(3,i,j,k,c) + utmp(4,j) = u(4,i,j,k,c) + utmp(5,j) = u(5,i,j,k,c) + utmp(6,j) = qs(i,j,k,c) + end do + + do j = start(2,c)-1, cell_size(2,c)-end(2,c) + + tmp1 = utmp(1,j) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,j) = 0.0d+00 + fjac(1,2,j) = 0.0d+00 + fjac(1,3,j) = 1.0d+00 + fjac(1,4,j) = 0.0d+00 + fjac(1,5,j) = 0.0d+00 + + fjac(2,1,j) = - ( utmp(2,j)*utmp(3,j) ) + > * tmp2 + fjac(2,2,j) = utmp(3,j) * tmp1 + fjac(2,3,j) = utmp(2,j) * tmp1 + fjac(2,4,j) = 0.0d+00 + fjac(2,5,j) = 0.0d+00 + + fjac(3,1,j) = - ( utmp(3,j)*utmp(3,j)*tmp2) + > + c2 * utmp(6,j) + fjac(3,2,j) = - c2 * utmp(2,j) * tmp1 + fjac(3,3,j) = ( 2.0d+00 - c2 ) + > * utmp(3,j) * tmp1 + fjac(3,4,j) = - c2 * utmp(4,j) * tmp1 + fjac(3,5,j) = c2 + + fjac(4,1,j) = - ( utmp(3,j)*utmp(4,j) ) + > * tmp2 + fjac(4,2,j) = 0.0d+00 + fjac(4,3,j) = utmp(4,j) * tmp1 + fjac(4,4,j) = utmp(3,j) * tmp1 + fjac(4,5,j) = 0.0d+00 + + fjac(5,1,j) = ( c2 * 2.0d0 * utmp(6,j) + > - c1 * utmp(5,j) * tmp1 ) + > * utmp(3,j) * tmp1 + fjac(5,2,j) = - c2 * utmp(2,j)*utmp(3,j) + > * tmp2 + fjac(5,3,j) = c1 * utmp(5,j) * tmp1 + > - c2 * ( utmp(6,j) + > + utmp(3,j)*utmp(3,j) * tmp2 ) + fjac(5,4,j) = - c2 * ( utmp(3,j)*utmp(4,j) ) + > * tmp2 + fjac(5,5,j) = c1 * utmp(3,j) * tmp1 + + njac(1,1,j) = 0.0d+00 + njac(1,2,j) = 0.0d+00 + njac(1,3,j) = 0.0d+00 + njac(1,4,j) = 0.0d+00 + njac(1,5,j) = 0.0d+00 + + njac(2,1,j) = - c3c4 * tmp2 * utmp(2,j) + njac(2,2,j) = c3c4 * tmp1 + njac(2,3,j) = 0.0d+00 + njac(2,4,j) = 0.0d+00 + njac(2,5,j) = 0.0d+00 + + njac(3,1,j) = - con43 * c3c4 * tmp2 * utmp(3,j) + njac(3,2,j) = 0.0d+00 + njac(3,3,j) = con43 * c3c4 * tmp1 + njac(3,4,j) = 0.0d+00 + njac(3,5,j) = 0.0d+00 + + njac(4,1,j) = - c3c4 * tmp2 * utmp(4,j) + njac(4,2,j) = 0.0d+00 + njac(4,3,j) = 0.0d+00 + njac(4,4,j) = c3c4 * tmp1 + njac(4,5,j) = 0.0d+00 + + njac(5,1,j) = - ( c3c4 + > - c1345 ) * tmp3 * (utmp(2,j)**2) + > - ( con43 * c3c4 + > - c1345 ) * tmp3 * (utmp(3,j)**2) + > - ( c3c4 - c1345 ) * tmp3 * (utmp(4,j)**2) + > - c1345 * tmp2 * utmp(5,j) + + njac(5,2,j) = ( c3c4 - c1345 ) * tmp2 * utmp(2,j) + njac(5,3,j) = ( con43 * c3c4 + > - c1345 ) * tmp2 * utmp(3,j) + njac(5,4,j) = ( c3c4 - c1345 ) * tmp2 * utmp(4,j) + njac(5,5,j) = ( c1345 ) * tmp1 + + enddo + +c--------------------------------------------------------------------- +c now joacobians set, so form left hand side in y direction +c--------------------------------------------------------------------- + do j = start(2,c), jsize-end(2,c) + + tmp1 = dt * ty1 + tmp2 = dt * ty2 + + lhsa(1,1,j) = - tmp2 * fjac(1,1,j-1) + > - tmp1 * njac(1,1,j-1) + > - tmp1 * dy1 + lhsa(1,2,j) = - tmp2 * fjac(1,2,j-1) + > - tmp1 * njac(1,2,j-1) + lhsa(1,3,j) = - tmp2 * fjac(1,3,j-1) + > - tmp1 * njac(1,3,j-1) + lhsa(1,4,j) = - tmp2 * fjac(1,4,j-1) + > - tmp1 * njac(1,4,j-1) + lhsa(1,5,j) = - tmp2 * fjac(1,5,j-1) + > - tmp1 * njac(1,5,j-1) + + lhsa(2,1,j) = - tmp2 * fjac(2,1,j-1) + > - tmp1 * njac(2,1,j-1) + lhsa(2,2,j) = - tmp2 * fjac(2,2,j-1) + > - tmp1 * njac(2,2,j-1) + > - tmp1 * dy2 + lhsa(2,3,j) = - tmp2 * fjac(2,3,j-1) + > - tmp1 * njac(2,3,j-1) + lhsa(2,4,j) = - tmp2 * fjac(2,4,j-1) + > - tmp1 * njac(2,4,j-1) + lhsa(2,5,j) = - tmp2 * fjac(2,5,j-1) + > - tmp1 * njac(2,5,j-1) + + lhsa(3,1,j) = - tmp2 * fjac(3,1,j-1) + > - tmp1 * njac(3,1,j-1) + lhsa(3,2,j) = - tmp2 * fjac(3,2,j-1) + > - tmp1 * njac(3,2,j-1) + lhsa(3,3,j) = - tmp2 * fjac(3,3,j-1) + > - tmp1 * njac(3,3,j-1) + > - tmp1 * dy3 + lhsa(3,4,j) = - tmp2 * fjac(3,4,j-1) + > - tmp1 * njac(3,4,j-1) + lhsa(3,5,j) = - tmp2 * fjac(3,5,j-1) + > - tmp1 * njac(3,5,j-1) + + lhsa(4,1,j) = - tmp2 * fjac(4,1,j-1) + > - tmp1 * njac(4,1,j-1) + lhsa(4,2,j) = - tmp2 * fjac(4,2,j-1) + > - tmp1 * njac(4,2,j-1) + lhsa(4,3,j) = - tmp2 * fjac(4,3,j-1) + > - tmp1 * njac(4,3,j-1) + lhsa(4,4,j) = - tmp2 * fjac(4,4,j-1) + > - tmp1 * njac(4,4,j-1) + > - tmp1 * dy4 + lhsa(4,5,j) = - tmp2 * fjac(4,5,j-1) + > - tmp1 * njac(4,5,j-1) + + lhsa(5,1,j) = - tmp2 * fjac(5,1,j-1) + > - tmp1 * njac(5,1,j-1) + lhsa(5,2,j) = - tmp2 * fjac(5,2,j-1) + > - tmp1 * njac(5,2,j-1) + lhsa(5,3,j) = - tmp2 * fjac(5,3,j-1) + > - tmp1 * njac(5,3,j-1) + lhsa(5,4,j) = - tmp2 * fjac(5,4,j-1) + > - tmp1 * njac(5,4,j-1) + lhsa(5,5,j) = - tmp2 * fjac(5,5,j-1) + > - tmp1 * njac(5,5,j-1) + > - tmp1 * dy5 + + lhsb(1,1,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(1,1,j) + > + tmp1 * 2.0d+00 * dy1 + lhsb(1,2,j) = tmp1 * 2.0d+00 * njac(1,2,j) + lhsb(1,3,j) = tmp1 * 2.0d+00 * njac(1,3,j) + lhsb(1,4,j) = tmp1 * 2.0d+00 * njac(1,4,j) + lhsb(1,5,j) = tmp1 * 2.0d+00 * njac(1,5,j) + + lhsb(2,1,j) = tmp1 * 2.0d+00 * njac(2,1,j) + lhsb(2,2,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(2,2,j) + > + tmp1 * 2.0d+00 * dy2 + lhsb(2,3,j) = tmp1 * 2.0d+00 * njac(2,3,j) + lhsb(2,4,j) = tmp1 * 2.0d+00 * njac(2,4,j) + lhsb(2,5,j) = tmp1 * 2.0d+00 * njac(2,5,j) + + lhsb(3,1,j) = tmp1 * 2.0d+00 * njac(3,1,j) + lhsb(3,2,j) = tmp1 * 2.0d+00 * njac(3,2,j) + lhsb(3,3,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(3,3,j) + > + tmp1 * 2.0d+00 * dy3 + lhsb(3,4,j) = tmp1 * 2.0d+00 * njac(3,4,j) + lhsb(3,5,j) = tmp1 * 2.0d+00 * njac(3,5,j) + + lhsb(4,1,j) = tmp1 * 2.0d+00 * njac(4,1,j) + lhsb(4,2,j) = tmp1 * 2.0d+00 * njac(4,2,j) + lhsb(4,3,j) = tmp1 * 2.0d+00 * njac(4,3,j) + lhsb(4,4,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(4,4,j) + > + tmp1 * 2.0d+00 * dy4 + lhsb(4,5,j) = tmp1 * 2.0d+00 * njac(4,5,j) + + lhsb(5,1,j) = tmp1 * 2.0d+00 * njac(5,1,j) + lhsb(5,2,j) = tmp1 * 2.0d+00 * njac(5,2,j) + lhsb(5,3,j) = tmp1 * 2.0d+00 * njac(5,3,j) + lhsb(5,4,j) = tmp1 * 2.0d+00 * njac(5,4,j) + lhsb(5,5,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(5,5,j) + > + tmp1 * 2.0d+00 * dy5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,j+1) + > - tmp1 * njac(1,1,j+1) + > - tmp1 * dy1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,j+1) + > - tmp1 * njac(1,2,j+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,j+1) + > - tmp1 * njac(1,3,j+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,j+1) + > - tmp1 * njac(1,4,j+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,j+1) + > - tmp1 * njac(1,5,j+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,j+1) + > - tmp1 * njac(2,1,j+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,j+1) + > - tmp1 * njac(2,2,j+1) + > - tmp1 * dy2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,j+1) + > - tmp1 * njac(2,3,j+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,j+1) + > - tmp1 * njac(2,4,j+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,j+1) + > - tmp1 * njac(2,5,j+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,j+1) + > - tmp1 * njac(3,1,j+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,j+1) + > - tmp1 * njac(3,2,j+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,j+1) + > - tmp1 * njac(3,3,j+1) + > - tmp1 * dy3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,j+1) + > - tmp1 * njac(3,4,j+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,j+1) + > - tmp1 * njac(3,5,j+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,j+1) + > - tmp1 * njac(4,1,j+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,j+1) + > - tmp1 * njac(4,2,j+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,j+1) + > - tmp1 * njac(4,3,j+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,j+1) + > - tmp1 * njac(4,4,j+1) + > - tmp1 * dy4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,j+1) + > - tmp1 * njac(4,5,j+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,j+1) + > - tmp1 * njac(5,1,j+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,j+1) + > - tmp1 * njac(5,2,j+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,j+1) + > - tmp1 * njac(5,3,j+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,j+1) + > - tmp1 * njac(5,4,j+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,j+1) + > - tmp1 * njac(5,5,j+1) + > - tmp1 * dy5 + + enddo + + +c--------------------------------------------------------------------- +c outer most do loops - sweeping in i direction +c--------------------------------------------------------------------- + if (first .eq. 1) then + +c--------------------------------------------------------------------- +c multiply c(i,jstart,k) by b_inverse and copy back to c +c multiply rhs(jstart) by b_inverse(jstart) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,jstart), + > lhsc(1,1,i,jstart,k,c), + > rhs(1,i,jstart,k,c) ) + + endif + +c--------------------------------------------------------------------- +c begin inner most do loop +c do all the elements of the cell unless last +c--------------------------------------------------------------------- + do j=jstart+first,jsize-last + +c--------------------------------------------------------------------- +c subtract A*lhs_vector(j-1) from lhs_vector(j) +c +c rhs(j) = rhs(j) - A*rhs(j-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,j), + > rhs(1,i,j-1,k,c),rhs(1,i,j,k,c)) + +c--------------------------------------------------------------------- +c B(j) = B(j) - C(j-1)*A(j) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,j), + > lhsc(1,1,i,j-1,k,c), + > lhsb(1,1,j)) + +c--------------------------------------------------------------------- +c multiply c(i,j,k) by b_inverse and copy back to c +c multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,j), + > lhsc(1,1,i,j,k,c), + > rhs(1,i,j,k,c) ) + + enddo + +c--------------------------------------------------------------------- +c Now finish up special cases for last cell +c--------------------------------------------------------------------- + if (last .eq. 1) then + +c--------------------------------------------------------------------- +c rhs(jsize) = rhs(jsize) - A*rhs(jsize-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,jsize), + > rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c)) + +c--------------------------------------------------------------------- +c B(jsize) = B(jsize) - C(jsize-1)*A(jsize) +c call matmul_sub(aa,i,jsize,k,c, +c $ cc,i,jsize-1,k,c,bb,i,jsize,k,c) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,jsize), + > lhsc(1,1,i,jsize-1,k,c), + > lhsb(1,1,jsize)) + +c--------------------------------------------------------------------- +c multiply rhs(jsize) by b_inverse(jsize) and copy to rhs +c--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,jsize), + > rhs(1,i,jsize,k,c) ) + + endif + enddo + enddo + + + return + end + + + diff --git b/NPB3.3-MPI/BT/y_solve_vec.f a/NPB3.3-MPI/BT/y_solve_vec.f new file mode 100644 index 0000000..e954028 --- /dev/null +++ a/NPB3.3-MPI/BT/y_solve_vec.f @@ -0,0 +1,798 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_solve + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Performs line solves in Y direction by first factoring +c the block-tridiagonal matrix into an upper triangular matrix, +c and then performing back substitution to solve for the unknow +c vectors of each line. +c +c Make sure we treat elements zero to cell_size in the direction +c of the sweep. +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer + > c, jstart, stage, + > first, last, recv_id, error, r_status(MPI_STATUS_SIZE), + > isize,jsize,ksize,send_id + + jstart = 0 + + if (timeron) call timer_start(t_ysolve) +c--------------------------------------------------------------------- +c in our terminology stage is the number of the cell in the y-direct +c i.e. stage = 1 means the start of the line stage=ncells means end +c--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(2,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 + +c--------------------------------------------------------------------- +c set last-cell flag +c--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +c--------------------------------------------------------------------- +c This is the first cell, so solve without receiving data +c--------------------------------------------------------------------- + first = 1 +c call lhsy(c) + call y_solve_cell(first,last,c) + else +c--------------------------------------------------------------------- +c Not the first cell of this line, so receive info from +c processor working on preceeding cell +c--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_ycomm) + call y_receive_solve_info(recv_id,c) +c--------------------------------------------------------------------- +c overlap computations and communications +c--------------------------------------------------------------------- +c call lhsy(c) +c--------------------------------------------------------------------- +c wait for completion +c--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) +c--------------------------------------------------------------------- +c install C'(jstart+1) and rhs'(jstart+1) to be used in this cell +c--------------------------------------------------------------------- + call y_unpack_solve_info(c) + call y_solve_cell(first,last,c) + endif + + if (last .eq. 0) call y_send_solve_info(send_id,c) + enddo + +c--------------------------------------------------------------------- +c now perform backsubstitution in reverse direction +c--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(2,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +c--------------------------------------------------------------------- +c last cell, so perform back substitute without waiting +c--------------------------------------------------------------------- + call y_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_ycomm) + call y_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_ycomm) + call y_unpack_backsub_info(c) + call y_backsubstitute(first,last,c) + endif + if (first .eq. 0) call y_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_ysolve) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_unpack_solve_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack C'(-1) and rhs'(-1) for +c all i and k +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,k,m,n,ptr,c,jstart + + jstart = 0 + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,jstart-1,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_send_solve_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send C'(jend) and rhs'(jend) for +c all i and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,k,m,n,jsize,ptr,c,ip,kp + integer error,send_id,buffer_size + + jsize = cell_size(2,c)-1 + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +c--------------------------------------------------------------------- +c pack up buffer +c--------------------------------------------------------------------- + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jsize,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +c--------------------------------------------------------------------- +c send buffer +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, successor(2), + > SOUTH+ip+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_send_backsub_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send U(jstart) for all i and k +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,k,n,ptr,c,jstart,ip,kp + integer error,send_id,buffer_size + +c--------------------------------------------------------------------- +c Send element 0 to previous processor +c--------------------------------------------------------------------- + jstart = 0 + ip = cell_coord(1,c)-1 + kp = cell_coord(3,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,jstart,k,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + if (timeron) call timer_start(t_ycomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, predecessor(2), + > NORTH+ip+kp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_ycomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_unpack_backsub_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack U(jsize) for all i and k +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,k,n,ptr,c + + ptr = 0 + do k=0,KMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,k,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_receive_backsub_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer error,recv_id,ip,kp,c,buffer_size + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, + > dp_type, successor(2), + > NORTH+ip+kp*NCELLS, comm_solve, + > recv_id, error) + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_receive_solve_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer ip,kp,recv_id,error,c,buffer_size + ip = cell_coord(1,c) - 1 + kp = cell_coord(3,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, + > dp_type, predecessor(2), + > SOUTH+ip+kp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_backsubstitute(first, last, c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c back solve: if last cell, then generate U(jsize)=rhs(jsize) +c else assume U(jsize) is loaded in un pack backsub_info +c so just use it +c after call u(jstart) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,jstart + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + if (last .eq. 0) then + do k=start(3,c),ksize + do i=start(1,c),isize +c--------------------------------------------------------------------- +c U(jsize) uses info from previous cell if not last cell +c--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,jsize,k,c) = rhs(m,i,jsize,k,c) + > - lhsc(m,n,i,jsize,k,c)* + > backsub_info(n,i,k,c) + enddo + enddo + enddo + enddo + endif + do k=start(3,c),ksize + do j=jsize-1,jstart,-1 + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) + > - lhsc(m,n,i,j,k,c)*rhs(n,i,j+1,k,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine y_solve_cell(first,last,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c performs guaussian elimination on this cell. +c +c assumes that unpacking routines for non-first cells +c preload C' and rhs' from previous cell. +c +c assumed send happens outside this routine, but that +c c'(JMAX) and rhs'(JMAX) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + include 'work_lhs_vec.h' + + integer first,last,c + integer i,j,k,m,n,isize,ksize,jsize,jstart + + jstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-1 + ksize = cell_size(3,c)-end(3,c)-1 + +c--------------------------------------------------------------------- +c zero the left hand side for starters +c set diagonal values to 1. This is overkill, but convenient +c--------------------------------------------------------------------- + do i = 0, isize + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,i,0) = 0.0d0 + lhsb(m,n,i,0) = 0.0d0 + lhsa(m,n,i,jsize) = 0.0d0 + lhsb(m,n,i,jsize) = 0.0d0 + enddo + lhsb(m,m,i,0) = 1.0d0 + lhsb(m,m,i,jsize) = 1.0d0 + enddo + enddo + + do k=start(3,c),ksize + +c--------------------------------------------------------------------- +c This function computes the left hand side for the three y-factors +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Compute the indices for storing the tri-diagonal matrix; +c determine a (labeled f) and n jacobians for cell c +c--------------------------------------------------------------------- + + do j = start(2,c)-1, cell_size(2,c)-end(2,c) + do i=start(1,c),isize + + tmp1 = 1.0d0 / u(1,i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,i,j) = 0.0d+00 + fjac(1,2,i,j) = 0.0d+00 + fjac(1,3,i,j) = 1.0d+00 + fjac(1,4,i,j) = 0.0d+00 + fjac(1,5,i,j) = 0.0d+00 + + fjac(2,1,i,j) = - ( u(2,i,j,k,c)*u(3,i,j,k,c) ) + > * tmp2 + fjac(2,2,i,j) = u(3,i,j,k,c) * tmp1 + fjac(2,3,i,j) = u(2,i,j,k,c) * tmp1 + fjac(2,4,i,j) = 0.0d+00 + fjac(2,5,i,j) = 0.0d+00 + + fjac(3,1,i,j) = - ( u(3,i,j,k,c)*u(3,i,j,k,c)*tmp2) + > + c2 * qs(i,j,k,c) + fjac(3,2,i,j) = - c2 * u(2,i,j,k,c) * tmp1 + fjac(3,3,i,j) = ( 2.0d+00 - c2 ) + > * u(3,i,j,k,c) * tmp1 + fjac(3,4,i,j) = - c2 * u(4,i,j,k,c) * tmp1 + fjac(3,5,i,j) = c2 + + fjac(4,1,i,j) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) ) + > * tmp2 + fjac(4,2,i,j) = 0.0d+00 + fjac(4,3,i,j) = u(4,i,j,k,c) * tmp1 + fjac(4,4,i,j) = u(3,i,j,k,c) * tmp1 + fjac(4,5,i,j) = 0.0d+00 + + fjac(5,1,i,j) = ( c2 * 2.0d0 * qs(i,j,k,c) + > - c1 * u(5,i,j,k,c) * tmp1 ) + > * u(3,i,j,k,c) * tmp1 + fjac(5,2,i,j) = - c2 * u(2,i,j,k,c)*u(3,i,j,k,c) + > * tmp2 + fjac(5,3,i,j) = c1 * u(5,i,j,k,c) * tmp1 + > - c2 * ( qs(i,j,k,c) + > + u(3,i,j,k,c)*u(3,i,j,k,c) * tmp2 ) + fjac(5,4,i,j) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) ) + > * tmp2 + fjac(5,5,i,j) = c1 * u(3,i,j,k,c) * tmp1 + + njac(1,1,i,j) = 0.0d+00 + njac(1,2,i,j) = 0.0d+00 + njac(1,3,i,j) = 0.0d+00 + njac(1,4,i,j) = 0.0d+00 + njac(1,5,i,j) = 0.0d+00 + + njac(2,1,i,j) = - c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i,j) = c3c4 * tmp1 + njac(2,3,i,j) = 0.0d+00 + njac(2,4,i,j) = 0.0d+00 + njac(2,5,i,j) = 0.0d+00 + + njac(3,1,i,j) = - con43 * c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i,j) = 0.0d+00 + njac(3,3,i,j) = con43 * c3c4 * tmp1 + njac(3,4,i,j) = 0.0d+00 + njac(3,5,i,j) = 0.0d+00 + + njac(4,1,i,j) = - c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i,j) = 0.0d+00 + njac(4,3,i,j) = 0.0d+00 + njac(4,4,i,j) = c3c4 * tmp1 + njac(4,5,i,j) = 0.0d+00 + + njac(5,1,i,j) = - ( c3c4 + > - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) + > - ( con43 * c3c4 + > - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) + > - ( c3c4 - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) + > - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i,j) = ( c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i,j) = ( con43 * c3c4 + > - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i,j) = ( c3c4 - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i,j) = ( c1345 ) * tmp1 + + enddo + enddo + +c--------------------------------------------------------------------- +c now joacobians set, so form left hand side in y direction +c--------------------------------------------------------------------- + do j = start(2,c), jsize-end(2,c) + do i=start(1,c),isize + + tmp1 = dt * ty1 + tmp2 = dt * ty2 + + lhsa(1,1,i,j) = - tmp2 * fjac(1,1,i,j-1) + > - tmp1 * njac(1,1,i,j-1) + > - tmp1 * dy1 + lhsa(1,2,i,j) = - tmp2 * fjac(1,2,i,j-1) + > - tmp1 * njac(1,2,i,j-1) + lhsa(1,3,i,j) = - tmp2 * fjac(1,3,i,j-1) + > - tmp1 * njac(1,3,i,j-1) + lhsa(1,4,i,j) = - tmp2 * fjac(1,4,i,j-1) + > - tmp1 * njac(1,4,i,j-1) + lhsa(1,5,i,j) = - tmp2 * fjac(1,5,i,j-1) + > - tmp1 * njac(1,5,i,j-1) + + lhsa(2,1,i,j) = - tmp2 * fjac(2,1,i,j-1) + > - tmp1 * njac(2,1,i,j-1) + lhsa(2,2,i,j) = - tmp2 * fjac(2,2,i,j-1) + > - tmp1 * njac(2,2,i,j-1) + > - tmp1 * dy2 + lhsa(2,3,i,j) = - tmp2 * fjac(2,3,i,j-1) + > - tmp1 * njac(2,3,i,j-1) + lhsa(2,4,i,j) = - tmp2 * fjac(2,4,i,j-1) + > - tmp1 * njac(2,4,i,j-1) + lhsa(2,5,i,j) = - tmp2 * fjac(2,5,i,j-1) + > - tmp1 * njac(2,5,i,j-1) + + lhsa(3,1,i,j) = - tmp2 * fjac(3,1,i,j-1) + > - tmp1 * njac(3,1,i,j-1) + lhsa(3,2,i,j) = - tmp2 * fjac(3,2,i,j-1) + > - tmp1 * njac(3,2,i,j-1) + lhsa(3,3,i,j) = - tmp2 * fjac(3,3,i,j-1) + > - tmp1 * njac(3,3,i,j-1) + > - tmp1 * dy3 + lhsa(3,4,i,j) = - tmp2 * fjac(3,4,i,j-1) + > - tmp1 * njac(3,4,i,j-1) + lhsa(3,5,i,j) = - tmp2 * fjac(3,5,i,j-1) + > - tmp1 * njac(3,5,i,j-1) + + lhsa(4,1,i,j) = - tmp2 * fjac(4,1,i,j-1) + > - tmp1 * njac(4,1,i,j-1) + lhsa(4,2,i,j) = - tmp2 * fjac(4,2,i,j-1) + > - tmp1 * njac(4,2,i,j-1) + lhsa(4,3,i,j) = - tmp2 * fjac(4,3,i,j-1) + > - tmp1 * njac(4,3,i,j-1) + lhsa(4,4,i,j) = - tmp2 * fjac(4,4,i,j-1) + > - tmp1 * njac(4,4,i,j-1) + > - tmp1 * dy4 + lhsa(4,5,i,j) = - tmp2 * fjac(4,5,i,j-1) + > - tmp1 * njac(4,5,i,j-1) + + lhsa(5,1,i,j) = - tmp2 * fjac(5,1,i,j-1) + > - tmp1 * njac(5,1,i,j-1) + lhsa(5,2,i,j) = - tmp2 * fjac(5,2,i,j-1) + > - tmp1 * njac(5,2,i,j-1) + lhsa(5,3,i,j) = - tmp2 * fjac(5,3,i,j-1) + > - tmp1 * njac(5,3,i,j-1) + lhsa(5,4,i,j) = - tmp2 * fjac(5,4,i,j-1) + > - tmp1 * njac(5,4,i,j-1) + lhsa(5,5,i,j) = - tmp2 * fjac(5,5,i,j-1) + > - tmp1 * njac(5,5,i,j-1) + > - tmp1 * dy5 + + lhsb(1,1,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(1,1,i,j) + > + tmp1 * 2.0d+00 * dy1 + lhsb(1,2,i,j) = tmp1 * 2.0d+00 * njac(1,2,i,j) + lhsb(1,3,i,j) = tmp1 * 2.0d+00 * njac(1,3,i,j) + lhsb(1,4,i,j) = tmp1 * 2.0d+00 * njac(1,4,i,j) + lhsb(1,5,i,j) = tmp1 * 2.0d+00 * njac(1,5,i,j) + + lhsb(2,1,i,j) = tmp1 * 2.0d+00 * njac(2,1,i,j) + lhsb(2,2,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(2,2,i,j) + > + tmp1 * 2.0d+00 * dy2 + lhsb(2,3,i,j) = tmp1 * 2.0d+00 * njac(2,3,i,j) + lhsb(2,4,i,j) = tmp1 * 2.0d+00 * njac(2,4,i,j) + lhsb(2,5,i,j) = tmp1 * 2.0d+00 * njac(2,5,i,j) + + lhsb(3,1,i,j) = tmp1 * 2.0d+00 * njac(3,1,i,j) + lhsb(3,2,i,j) = tmp1 * 2.0d+00 * njac(3,2,i,j) + lhsb(3,3,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(3,3,i,j) + > + tmp1 * 2.0d+00 * dy3 + lhsb(3,4,i,j) = tmp1 * 2.0d+00 * njac(3,4,i,j) + lhsb(3,5,i,j) = tmp1 * 2.0d+00 * njac(3,5,i,j) + + lhsb(4,1,i,j) = tmp1 * 2.0d+00 * njac(4,1,i,j) + lhsb(4,2,i,j) = tmp1 * 2.0d+00 * njac(4,2,i,j) + lhsb(4,3,i,j) = tmp1 * 2.0d+00 * njac(4,3,i,j) + lhsb(4,4,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(4,4,i,j) + > + tmp1 * 2.0d+00 * dy4 + lhsb(4,5,i,j) = tmp1 * 2.0d+00 * njac(4,5,i,j) + + lhsb(5,1,i,j) = tmp1 * 2.0d+00 * njac(5,1,i,j) + lhsb(5,2,i,j) = tmp1 * 2.0d+00 * njac(5,2,i,j) + lhsb(5,3,i,j) = tmp1 * 2.0d+00 * njac(5,3,i,j) + lhsb(5,4,i,j) = tmp1 * 2.0d+00 * njac(5,4,i,j) + lhsb(5,5,i,j) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(5,5,i,j) + > + tmp1 * 2.0d+00 * dy5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i,j+1) + > - tmp1 * njac(1,1,i,j+1) + > - tmp1 * dy1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i,j+1) + > - tmp1 * njac(1,2,i,j+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i,j+1) + > - tmp1 * njac(1,3,i,j+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i,j+1) + > - tmp1 * njac(1,4,i,j+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i,j+1) + > - tmp1 * njac(1,5,i,j+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i,j+1) + > - tmp1 * njac(2,1,i,j+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i,j+1) + > - tmp1 * njac(2,2,i,j+1) + > - tmp1 * dy2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i,j+1) + > - tmp1 * njac(2,3,i,j+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i,j+1) + > - tmp1 * njac(2,4,i,j+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i,j+1) + > - tmp1 * njac(2,5,i,j+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i,j+1) + > - tmp1 * njac(3,1,i,j+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i,j+1) + > - tmp1 * njac(3,2,i,j+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i,j+1) + > - tmp1 * njac(3,3,i,j+1) + > - tmp1 * dy3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i,j+1) + > - tmp1 * njac(3,4,i,j+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i,j+1) + > - tmp1 * njac(3,5,i,j+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i,j+1) + > - tmp1 * njac(4,1,i,j+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i,j+1) + > - tmp1 * njac(4,2,i,j+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i,j+1) + > - tmp1 * njac(4,3,i,j+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i,j+1) + > - tmp1 * njac(4,4,i,j+1) + > - tmp1 * dy4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i,j+1) + > - tmp1 * njac(4,5,i,j+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i,j+1) + > - tmp1 * njac(5,1,i,j+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i,j+1) + > - tmp1 * njac(5,2,i,j+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i,j+1) + > - tmp1 * njac(5,3,i,j+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i,j+1) + > - tmp1 * njac(5,4,i,j+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i,j+1) + > - tmp1 * njac(5,5,i,j+1) + > - tmp1 * dy5 + + enddo + enddo + + +c--------------------------------------------------------------------- +c outer most do loops - sweeping in i direction +c--------------------------------------------------------------------- + if (first .eq. 1) then + +c--------------------------------------------------------------------- +c multiply c(i,jstart,k) by b_inverse and copy back to c +c multiply rhs(jstart) by b_inverse(jstart) and copy to rhs +c--------------------------------------------------------------------- +!dir$ ivdep + do i=start(1,c),isize + call binvcrhs( lhsb(1,1,i,jstart), + > lhsc(1,1,i,jstart,k,c), + > rhs(1,i,jstart,k,c) ) + enddo + + endif + +c--------------------------------------------------------------------- +c begin inner most do loop +c do all the elements of the cell unless last +c--------------------------------------------------------------------- + do j=jstart+first,jsize-last +!dir$ ivdep + do i=start(1,c),isize + +c--------------------------------------------------------------------- +c subtract A*lhs_vector(j-1) from lhs_vector(j) +c +c rhs(j) = rhs(j) - A*rhs(j-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,j), + > rhs(1,i,j-1,k,c),rhs(1,i,j,k,c)) + +c--------------------------------------------------------------------- +c B(j) = B(j) - C(j-1)*A(j) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,j), + > lhsc(1,1,i,j-1,k,c), + > lhsb(1,1,i,j)) + +c--------------------------------------------------------------------- +c multiply c(i,j,k) by b_inverse and copy back to c +c multiply rhs(i,1,k) by b_inverse(i,1,k) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i,j), + > lhsc(1,1,i,j,k,c), + > rhs(1,i,j,k,c) ) + + enddo + enddo + +c--------------------------------------------------------------------- +c Now finish up special cases for last cell +c--------------------------------------------------------------------- + if (last .eq. 1) then + +!dir$ ivdep + do i=start(1,c),isize +c--------------------------------------------------------------------- +c rhs(jsize) = rhs(jsize) - A*rhs(jsize-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,jsize), + > rhs(1,i,jsize-1,k,c),rhs(1,i,jsize,k,c)) + +c--------------------------------------------------------------------- +c B(jsize) = B(jsize) - C(jsize-1)*A(jsize) +c call matmul_sub(aa,i,jsize,k,c, +c $ cc,i,jsize-1,k,c,bb,i,jsize,k,c) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,jsize), + > lhsc(1,1,i,jsize-1,k,c), + > lhsb(1,1,i,jsize)) + +c--------------------------------------------------------------------- +c multiply rhs(jsize) by b_inverse(jsize) and copy to rhs +c--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,i,jsize), + > rhs(1,i,jsize,k,c) ) + enddo + + endif + enddo + + + return + end + + + diff --git b/NPB3.3-MPI/BT/z_solve.f a/NPB3.3-MPI/BT/z_solve.f new file mode 100644 index 0000000..796fccd --- /dev/null +++ a/NPB3.3-MPI/BT/z_solve.f @@ -0,0 +1,786 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_solve + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Performs line solves in Z direction by first factoring +c the block-tridiagonal matrix into an upper triangular matrix, +c and then performing back substitution to solve for the unknow +c vectors of each line. +c +c Make sure we treat elements zero to cell_size in the direction +c of the sweep. +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer c, kstart, stage, + > first, last, recv_id, error, r_status(MPI_STATUS_SIZE), + > isize,jsize,ksize,send_id + + kstart = 0 + + if (timeron) call timer_start(t_zsolve) +c--------------------------------------------------------------------- +c in our terminology stage is the number of the cell in the y-direction +c i.e. stage = 1 means the start of the line stage=ncells means end +c--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(3,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 +c--------------------------------------------------------------------- +c set last-cell flag +c--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +c--------------------------------------------------------------------- +c This is the first cell, so solve without receiving data +c--------------------------------------------------------------------- + first = 1 +c call lhsz(c) + call z_solve_cell(first,last,c) + else +c--------------------------------------------------------------------- +c Not the first cell of this line, so receive info from +c processor working on preceeding cell +c--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_zcomm) + call z_receive_solve_info(recv_id,c) +c--------------------------------------------------------------------- +c overlap computations and communications +c--------------------------------------------------------------------- +c call lhsz(c) +c--------------------------------------------------------------------- +c wait for completion +c--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) +c--------------------------------------------------------------------- +c install C'(kstart+1) and rhs'(kstart+1) to be used in this cell +c--------------------------------------------------------------------- + call z_unpack_solve_info(c) + call z_solve_cell(first,last,c) + endif + + if (last .eq. 0) call z_send_solve_info(send_id,c) + enddo + +c--------------------------------------------------------------------- +c now perform backsubstitution in reverse direction +c--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(3,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +c--------------------------------------------------------------------- +c last cell, so perform back substitute without waiting +c--------------------------------------------------------------------- + call z_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_zcomm) + call z_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) + call z_unpack_backsub_info(c) + call z_backsubstitute(first,last,c) + endif + if (first .eq. 0) call z_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_zsolve) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_unpack_solve_info(c) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack C'(-1) and rhs'(-1) for +c all i and j +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,j,m,n,ptr,c,kstart + + kstart = 0 + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_send_solve_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send C'(kend) and rhs'(kend) for +c all i and j +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,j,m,n,ksize,ptr,c,ip,jp + integer error,send_id,buffer_size + + ksize = cell_size(3,c)-1 + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +c--------------------------------------------------------------------- +c pack up buffer +c--------------------------------------------------------------------- + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +c--------------------------------------------------------------------- +c send buffer +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, successor(3), + > BOTTOM+ip+jp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_send_backsub_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send U(jstart) for all i and j +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,j,n,ptr,c,kstart,ip,jp + integer error,send_id,buffer_size + +c--------------------------------------------------------------------- +c Send element 0 to previous processor +c--------------------------------------------------------------------- + kstart = 0 + ip = cell_coord(1,c)-1 + jp = cell_coord(2,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,kstart,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, predecessor(3), + > TOP+ip+jp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_unpack_backsub_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack U(ksize) for all i and j +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,j,n,ptr,c + + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,j,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_receive_backsub_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer error,recv_id,ip,jp,c,buffer_size + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, + > dp_type, successor(3), + > TOP+ip+jp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_receive_solve_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer ip,jp,recv_id,error,c,buffer_size + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, + > dp_type, predecessor(3), + > BOTTOM+ip+jp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_backsubstitute(first, last, c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c back solve: if last cell, then generate U(ksize)=rhs(ksize) +c else assume U(ksize) is loaded in un pack backsub_info +c so just use it +c after call u(kstart) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,kstart + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + if (last .eq. 0) then + do j=start(2,c),jsize + do i=start(1,c),isize +c--------------------------------------------------------------------- +c U(jsize) uses info from previous cell if not last cell +c--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c) + > - lhsc(m,n,i,j,ksize,c)* + > backsub_info(n,i,j,c) + enddo + enddo + enddo + enddo + endif + do k=ksize-1,kstart,-1 + do j=start(2,c),jsize + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) + > - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_solve_cell(first,last,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c performs guaussian elimination on this cell. +c +c assumes that unpacking routines for non-first cells +c preload C' and rhs' from previous cell. +c +c assumed send happens outside this routine, but that +c c'(KMAX) and rhs'(KMAX) will be sent to next cell. +c--------------------------------------------------------------------- + + include 'header.h' + include 'work_lhs.h' + + integer first,last,c + integer i,j,k,isize,ksize,jsize,kstart + double precision utmp(6,-2:KMAX+1) + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + + call lhsabinit(lhsa, lhsb, ksize) + + do j=start(2,c),jsize + do i=start(1,c),isize + +c--------------------------------------------------------------------- +c This function computes the left hand side for the three z-factors +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Compute the indices for storing the block-diagonal matrix; +c determine c (labeled f) and s jacobians for cell c +c--------------------------------------------------------------------- + do k = start(3,c)-1, cell_size(3,c)-end(3,c) + utmp(1,k) = 1.0d0 / u(1,i,j,k,c) + utmp(2,k) = u(2,i,j,k,c) + utmp(3,k) = u(3,i,j,k,c) + utmp(4,k) = u(4,i,j,k,c) + utmp(5,k) = u(5,i,j,k,c) + utmp(6,k) = qs(i,j,k,c) + end do + + do k = start(3,c)-1, cell_size(3,c)-end(3,c) + + tmp1 = utmp(1,k) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,k) = 0.0d+00 + fjac(1,2,k) = 0.0d+00 + fjac(1,3,k) = 0.0d+00 + fjac(1,4,k) = 1.0d+00 + fjac(1,5,k) = 0.0d+00 + + fjac(2,1,k) = - ( utmp(2,k)*utmp(4,k) ) + > * tmp2 + fjac(2,2,k) = utmp(4,k) * tmp1 + fjac(2,3,k) = 0.0d+00 + fjac(2,4,k) = utmp(2,k) * tmp1 + fjac(2,5,k) = 0.0d+00 + + fjac(3,1,k) = - ( utmp(3,k)*utmp(4,k) ) + > * tmp2 + fjac(3,2,k) = 0.0d+00 + fjac(3,3,k) = utmp(4,k) * tmp1 + fjac(3,4,k) = utmp(3,k) * tmp1 + fjac(3,5,k) = 0.0d+00 + + fjac(4,1,k) = - (utmp(4,k)*utmp(4,k) * tmp2 ) + > + c2 * utmp(6,k) + fjac(4,2,k) = - c2 * utmp(2,k) * tmp1 + fjac(4,3,k) = - c2 * utmp(3,k) * tmp1 + fjac(4,4,k) = ( 2.0d+00 - c2 ) + > * utmp(4,k) * tmp1 + fjac(4,5,k) = c2 + + fjac(5,1,k) = ( c2 * 2.0d0 * utmp(6,k) + > - c1 * ( utmp(5,k) * tmp1 ) ) + > * ( utmp(4,k) * tmp1 ) + fjac(5,2,k) = - c2 * ( utmp(2,k)*utmp(4,k) ) + > * tmp2 + fjac(5,3,k) = - c2 * ( utmp(3,k)*utmp(4,k) ) + > * tmp2 + fjac(5,4,k) = c1 * ( utmp(5,k) * tmp1 ) + > - c2 * ( utmp(6,k) + > + utmp(4,k)*utmp(4,k) * tmp2 ) + fjac(5,5,k) = c1 * utmp(4,k) * tmp1 + + njac(1,1,k) = 0.0d+00 + njac(1,2,k) = 0.0d+00 + njac(1,3,k) = 0.0d+00 + njac(1,4,k) = 0.0d+00 + njac(1,5,k) = 0.0d+00 + + njac(2,1,k) = - c3c4 * tmp2 * utmp(2,k) + njac(2,2,k) = c3c4 * tmp1 + njac(2,3,k) = 0.0d+00 + njac(2,4,k) = 0.0d+00 + njac(2,5,k) = 0.0d+00 + + njac(3,1,k) = - c3c4 * tmp2 * utmp(3,k) + njac(3,2,k) = 0.0d+00 + njac(3,3,k) = c3c4 * tmp1 + njac(3,4,k) = 0.0d+00 + njac(3,5,k) = 0.0d+00 + + njac(4,1,k) = - con43 * c3c4 * tmp2 * utmp(4,k) + njac(4,2,k) = 0.0d+00 + njac(4,3,k) = 0.0d+00 + njac(4,4,k) = con43 * c3 * c4 * tmp1 + njac(4,5,k) = 0.0d+00 + + njac(5,1,k) = - ( c3c4 + > - c1345 ) * tmp3 * (utmp(2,k)**2) + > - ( c3c4 - c1345 ) * tmp3 * (utmp(3,k)**2) + > - ( con43 * c3c4 + > - c1345 ) * tmp3 * (utmp(4,k)**2) + > - c1345 * tmp2 * utmp(5,k) + + njac(5,2,k) = ( c3c4 - c1345 ) * tmp2 * utmp(2,k) + njac(5,3,k) = ( c3c4 - c1345 ) * tmp2 * utmp(3,k) + njac(5,4,k) = ( con43 * c3c4 + > - c1345 ) * tmp2 * utmp(4,k) + njac(5,5,k) = ( c1345 )* tmp1 + + + enddo + +c--------------------------------------------------------------------- +c now joacobians set, so form left hand side in z direction +c--------------------------------------------------------------------- + do k = start(3,c), ksize-end(3,c) + + tmp1 = dt * tz1 + tmp2 = dt * tz2 + + lhsa(1,1,k) = - tmp2 * fjac(1,1,k-1) + > - tmp1 * njac(1,1,k-1) + > - tmp1 * dz1 + lhsa(1,2,k) = - tmp2 * fjac(1,2,k-1) + > - tmp1 * njac(1,2,k-1) + lhsa(1,3,k) = - tmp2 * fjac(1,3,k-1) + > - tmp1 * njac(1,3,k-1) + lhsa(1,4,k) = - tmp2 * fjac(1,4,k-1) + > - tmp1 * njac(1,4,k-1) + lhsa(1,5,k) = - tmp2 * fjac(1,5,k-1) + > - tmp1 * njac(1,5,k-1) + + lhsa(2,1,k) = - tmp2 * fjac(2,1,k-1) + > - tmp1 * njac(2,1,k-1) + lhsa(2,2,k) = - tmp2 * fjac(2,2,k-1) + > - tmp1 * njac(2,2,k-1) + > - tmp1 * dz2 + lhsa(2,3,k) = - tmp2 * fjac(2,3,k-1) + > - tmp1 * njac(2,3,k-1) + lhsa(2,4,k) = - tmp2 * fjac(2,4,k-1) + > - tmp1 * njac(2,4,k-1) + lhsa(2,5,k) = - tmp2 * fjac(2,5,k-1) + > - tmp1 * njac(2,5,k-1) + + lhsa(3,1,k) = - tmp2 * fjac(3,1,k-1) + > - tmp1 * njac(3,1,k-1) + lhsa(3,2,k) = - tmp2 * fjac(3,2,k-1) + > - tmp1 * njac(3,2,k-1) + lhsa(3,3,k) = - tmp2 * fjac(3,3,k-1) + > - tmp1 * njac(3,3,k-1) + > - tmp1 * dz3 + lhsa(3,4,k) = - tmp2 * fjac(3,4,k-1) + > - tmp1 * njac(3,4,k-1) + lhsa(3,5,k) = - tmp2 * fjac(3,5,k-1) + > - tmp1 * njac(3,5,k-1) + + lhsa(4,1,k) = - tmp2 * fjac(4,1,k-1) + > - tmp1 * njac(4,1,k-1) + lhsa(4,2,k) = - tmp2 * fjac(4,2,k-1) + > - tmp1 * njac(4,2,k-1) + lhsa(4,3,k) = - tmp2 * fjac(4,3,k-1) + > - tmp1 * njac(4,3,k-1) + lhsa(4,4,k) = - tmp2 * fjac(4,4,k-1) + > - tmp1 * njac(4,4,k-1) + > - tmp1 * dz4 + lhsa(4,5,k) = - tmp2 * fjac(4,5,k-1) + > - tmp1 * njac(4,5,k-1) + + lhsa(5,1,k) = - tmp2 * fjac(5,1,k-1) + > - tmp1 * njac(5,1,k-1) + lhsa(5,2,k) = - tmp2 * fjac(5,2,k-1) + > - tmp1 * njac(5,2,k-1) + lhsa(5,3,k) = - tmp2 * fjac(5,3,k-1) + > - tmp1 * njac(5,3,k-1) + lhsa(5,4,k) = - tmp2 * fjac(5,4,k-1) + > - tmp1 * njac(5,4,k-1) + lhsa(5,5,k) = - tmp2 * fjac(5,5,k-1) + > - tmp1 * njac(5,5,k-1) + > - tmp1 * dz5 + + lhsb(1,1,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(1,1,k) + > + tmp1 * 2.0d+00 * dz1 + lhsb(1,2,k) = tmp1 * 2.0d+00 * njac(1,2,k) + lhsb(1,3,k) = tmp1 * 2.0d+00 * njac(1,3,k) + lhsb(1,4,k) = tmp1 * 2.0d+00 * njac(1,4,k) + lhsb(1,5,k) = tmp1 * 2.0d+00 * njac(1,5,k) + + lhsb(2,1,k) = tmp1 * 2.0d+00 * njac(2,1,k) + lhsb(2,2,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(2,2,k) + > + tmp1 * 2.0d+00 * dz2 + lhsb(2,3,k) = tmp1 * 2.0d+00 * njac(2,3,k) + lhsb(2,4,k) = tmp1 * 2.0d+00 * njac(2,4,k) + lhsb(2,5,k) = tmp1 * 2.0d+00 * njac(2,5,k) + + lhsb(3,1,k) = tmp1 * 2.0d+00 * njac(3,1,k) + lhsb(3,2,k) = tmp1 * 2.0d+00 * njac(3,2,k) + lhsb(3,3,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(3,3,k) + > + tmp1 * 2.0d+00 * dz3 + lhsb(3,4,k) = tmp1 * 2.0d+00 * njac(3,4,k) + lhsb(3,5,k) = tmp1 * 2.0d+00 * njac(3,5,k) + + lhsb(4,1,k) = tmp1 * 2.0d+00 * njac(4,1,k) + lhsb(4,2,k) = tmp1 * 2.0d+00 * njac(4,2,k) + lhsb(4,3,k) = tmp1 * 2.0d+00 * njac(4,3,k) + lhsb(4,4,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(4,4,k) + > + tmp1 * 2.0d+00 * dz4 + lhsb(4,5,k) = tmp1 * 2.0d+00 * njac(4,5,k) + + lhsb(5,1,k) = tmp1 * 2.0d+00 * njac(5,1,k) + lhsb(5,2,k) = tmp1 * 2.0d+00 * njac(5,2,k) + lhsb(5,3,k) = tmp1 * 2.0d+00 * njac(5,3,k) + lhsb(5,4,k) = tmp1 * 2.0d+00 * njac(5,4,k) + lhsb(5,5,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(5,5,k) + > + tmp1 * 2.0d+00 * dz5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,k+1) + > - tmp1 * njac(1,1,k+1) + > - tmp1 * dz1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,k+1) + > - tmp1 * njac(1,2,k+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,k+1) + > - tmp1 * njac(1,3,k+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,k+1) + > - tmp1 * njac(1,4,k+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,k+1) + > - tmp1 * njac(1,5,k+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,k+1) + > - tmp1 * njac(2,1,k+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,k+1) + > - tmp1 * njac(2,2,k+1) + > - tmp1 * dz2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,k+1) + > - tmp1 * njac(2,3,k+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,k+1) + > - tmp1 * njac(2,4,k+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,k+1) + > - tmp1 * njac(2,5,k+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,k+1) + > - tmp1 * njac(3,1,k+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,k+1) + > - tmp1 * njac(3,2,k+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,k+1) + > - tmp1 * njac(3,3,k+1) + > - tmp1 * dz3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,k+1) + > - tmp1 * njac(3,4,k+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,k+1) + > - tmp1 * njac(3,5,k+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,k+1) + > - tmp1 * njac(4,1,k+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,k+1) + > - tmp1 * njac(4,2,k+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,k+1) + > - tmp1 * njac(4,3,k+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,k+1) + > - tmp1 * njac(4,4,k+1) + > - tmp1 * dz4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,k+1) + > - tmp1 * njac(4,5,k+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,k+1) + > - tmp1 * njac(5,1,k+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,k+1) + > - tmp1 * njac(5,2,k+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,k+1) + > - tmp1 * njac(5,3,k+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,k+1) + > - tmp1 * njac(5,4,k+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,k+1) + > - tmp1 * njac(5,5,k+1) + > - tmp1 * dz5 + + enddo + + +c--------------------------------------------------------------------- +c outer most do loops - sweeping in i direction +c--------------------------------------------------------------------- + if (first .eq. 1) then + +c--------------------------------------------------------------------- +c multiply c(i,j,kstart) by b_inverse and copy back to c +c multiply rhs(kstart) by b_inverse(kstart) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,kstart), + > lhsc(1,1,i,j,kstart,c), + > rhs(1,i,j,kstart,c) ) + + endif + +c--------------------------------------------------------------------- +c begin inner most do loop +c do all the elements of the cell unless last +c--------------------------------------------------------------------- + do k=kstart+first,ksize-last + +c--------------------------------------------------------------------- +c subtract A*lhs_vector(k-1) from lhs_vector(k) +c +c rhs(k) = rhs(k) - A*rhs(k-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,k), + > rhs(1,i,j,k-1,c),rhs(1,i,j,k,c)) + +c--------------------------------------------------------------------- +c B(k) = B(k) - C(k-1)*A(k) +c call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,k), + > lhsc(1,1,i,j,k-1,c), + > lhsb(1,1,k)) + +c--------------------------------------------------------------------- +c multiply c(i,j,k) by b_inverse and copy back to c +c multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,k), + > lhsc(1,1,i,j,k,c), + > rhs(1,i,j,k,c) ) + + enddo + +c--------------------------------------------------------------------- +c Now finish up special cases for last cell +c--------------------------------------------------------------------- + if (last .eq. 1) then + +c--------------------------------------------------------------------- +c rhs(ksize) = rhs(ksize) - A*rhs(ksize-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,ksize), + > rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c)) + +c--------------------------------------------------------------------- +c B(ksize) = B(ksize) - C(ksize-1)*A(ksize) +c call matmul_sub(aa,i,j,ksize,c, +c $ cc,i,j,ksize-1,c,bb,i,j,ksize,c) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,ksize), + > lhsc(1,1,i,j,ksize-1,c), + > lhsb(1,1,ksize)) + +c--------------------------------------------------------------------- +c multiply rhs(ksize) by b_inverse(ksize) and copy to rhs +c--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,ksize), + > rhs(1,i,j,ksize,c) ) + + endif + enddo + enddo + + + return + end + + + + + + diff --git b/NPB3.3-MPI/BT/z_solve_vec.f a/NPB3.3-MPI/BT/z_solve_vec.f new file mode 100644 index 0000000..bb84b0e --- /dev/null +++ a/NPB3.3-MPI/BT/z_solve_vec.f @@ -0,0 +1,803 @@ +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_solve + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Performs line solves in Z direction by first factoring +c the block-tridiagonal matrix into an upper triangular matrix, +c and then performing back substitution to solve for the unknow +c vectors of each line. +c +c Make sure we treat elements zero to cell_size in the direction +c of the sweep. +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer c, kstart, stage, + > first, last, recv_id, error, r_status(MPI_STATUS_SIZE), + > isize,jsize,ksize,send_id + + kstart = 0 + + if (timeron) call timer_start(t_zsolve) +c--------------------------------------------------------------------- +c in our terminology stage is the number of the cell in the y-direct +c i.e. stage = 1 means the start of the line stage=ncells means end +c--------------------------------------------------------------------- + do stage = 1,ncells + c = slice(3,stage) + isize = cell_size(1,c) - 1 + jsize = cell_size(2,c) - 1 + ksize = cell_size(3,c) - 1 +c--------------------------------------------------------------------- +c set last-cell flag +c--------------------------------------------------------------------- + if (stage .eq. ncells) then + last = 1 + else + last = 0 + endif + + if (stage .eq. 1) then +c--------------------------------------------------------------------- +c This is the first cell, so solve without receiving data +c--------------------------------------------------------------------- + first = 1 +c call lhsz(c) + call z_solve_cell(first,last,c) + else +c--------------------------------------------------------------------- +c Not the first cell of this line, so receive info from +c processor working on preceeding cell +c--------------------------------------------------------------------- + first = 0 + if (timeron) call timer_start(t_zcomm) + call z_receive_solve_info(recv_id,c) +c--------------------------------------------------------------------- +c overlap computations and communications +c--------------------------------------------------------------------- +c call lhsz(c) +c--------------------------------------------------------------------- +c wait for completion +c--------------------------------------------------------------------- + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) +c--------------------------------------------------------------------- +c install C'(kstart+1) and rhs'(kstart+1) to be used in this cell +c--------------------------------------------------------------------- + call z_unpack_solve_info(c) + call z_solve_cell(first,last,c) + endif + + if (last .eq. 0) call z_send_solve_info(send_id,c) + enddo + +c--------------------------------------------------------------------- +c now perform backsubstitution in reverse direction +c--------------------------------------------------------------------- + do stage = ncells, 1, -1 + c = slice(3,stage) + first = 0 + last = 0 + if (stage .eq. 1) first = 1 + if (stage .eq. ncells) then + last = 1 +c--------------------------------------------------------------------- +c last cell, so perform back substitute without waiting +c--------------------------------------------------------------------- + call z_backsubstitute(first, last,c) + else + if (timeron) call timer_start(t_zcomm) + call z_receive_backsub_info(recv_id,c) + call mpi_wait(send_id,r_status,error) + call mpi_wait(recv_id,r_status,error) + if (timeron) call timer_stop(t_zcomm) + call z_unpack_backsub_info(c) + call z_backsubstitute(first,last,c) + endif + if (first .eq. 0) call z_send_backsub_info(send_id,c) + enddo + + if (timeron) call timer_stop(t_zsolve) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_unpack_solve_info(c) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack C'(-1) and rhs'(-1) for +c all i and j +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,j,m,n,ptr,c,kstart + + kstart = 0 + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + lhsc(m,n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + rhs(n,i,j,kstart-1,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_send_solve_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send C'(kend) and rhs'(kend) for +c all i and j +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,j,m,n,ksize,ptr,c,ip,jp + integer error,send_id,buffer_size + + ksize = cell_size(3,c)-1 + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + +c--------------------------------------------------------------------- +c pack up buffer +c--------------------------------------------------------------------- + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = lhsc(m,n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,ksize,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + +c--------------------------------------------------------------------- +c send buffer +c--------------------------------------------------------------------- + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, successor(3), + > BOTTOM+ip+jp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_send_backsub_info(send_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c pack up and send U(jstart) for all i and j +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer i,j,n,ptr,c,kstart,ip,jp + integer error,send_id,buffer_size + +c--------------------------------------------------------------------- +c Send element 0 to previous processor +c--------------------------------------------------------------------- + kstart = 0 + ip = cell_coord(1,c)-1 + jp = cell_coord(2,c)-1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + in_buffer(ptr+n) = rhs(n,i,j,kstart,c) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + if (timeron) call timer_start(t_zcomm) + call mpi_isend(in_buffer, buffer_size, + > dp_type, predecessor(3), + > TOP+ip+jp*NCELLS, comm_solve, + > send_id,error) + if (timeron) call timer_stop(t_zcomm) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_unpack_backsub_info(c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c unpack U(ksize) for all i and j +c--------------------------------------------------------------------- + + include 'header.h' + + integer i,j,n,ptr,c + + ptr = 0 + do j=0,JMAX-1 + do i=0,IMAX-1 + do n=1,BLOCK_SIZE + backsub_info(n,i,j,c) = out_buffer(ptr+n) + enddo + ptr = ptr+BLOCK_SIZE + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_receive_backsub_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer error,recv_id,ip,jp,c,buffer_size + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM*BLOCK_SIZE + call mpi_irecv(out_buffer, buffer_size, + > dp_type, successor(3), + > TOP+ip+jp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_receive_solve_info(recv_id,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c post mpi receives +c--------------------------------------------------------------------- + + include 'header.h' + include 'mpinpb.h' + + integer ip,jp,recv_id,error,c,buffer_size + ip = cell_coord(1,c) - 1 + jp = cell_coord(2,c) - 1 + buffer_size=MAX_CELL_DIM*MAX_CELL_DIM* + > (BLOCK_SIZE*BLOCK_SIZE + BLOCK_SIZE) + call mpi_irecv(out_buffer, buffer_size, + > dp_type, predecessor(3), + > BOTTOM+ip+jp*NCELLS, comm_solve, + > recv_id, error) + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_backsubstitute(first, last, c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c back solve: if last cell, then generate U(ksize)=rhs(ksize) +c else assume U(ksize) is loaded in un pack backsub_info +c so just use it +c after call u(kstart) will be sent to next cell +c--------------------------------------------------------------------- + + include 'header.h' + + integer first, last, c, i, k + integer m,n,j,jsize,isize,ksize,kstart + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + if (last .eq. 0) then + do j=start(2,c),jsize + do i=start(1,c),isize +c--------------------------------------------------------------------- +c U(jsize) uses info from previous cell if not last cell +c--------------------------------------------------------------------- + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,ksize,c) = rhs(m,i,j,ksize,c) + > - lhsc(m,n,i,j,ksize,c)* + > backsub_info(n,i,j,c) + enddo + enddo + enddo + enddo + endif + do k=ksize-1,kstart,-1 + do j=start(2,c),jsize + do i=start(1,c),isize + do m=1,BLOCK_SIZE + do n=1,BLOCK_SIZE + rhs(m,i,j,k,c) = rhs(m,i,j,k,c) + > - lhsc(m,n,i,j,k,c)*rhs(n,i,j,k+1,c) + enddo + enddo + enddo + enddo + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine z_solve_cell(first,last,c) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c performs guaussian elimination on this cell. +c +c assumes that unpacking routines for non-first cells +c preload C' and rhs' from previous cell. +c +c assumed send happens outside this routine, but that +c c'(KMAX) and rhs'(KMAX) will be sent to next cell. +c--------------------------------------------------------------------- + + include 'header.h' + include 'work_lhs_vec.h' + + integer first,last,c + integer i,j,k,m,n,isize,ksize,jsize,kstart + + kstart = 0 + isize = cell_size(1,c)-end(1,c)-1 + jsize = cell_size(2,c)-end(2,c)-1 + ksize = cell_size(3,c)-1 + +c--------------------------------------------------------------------- +c zero the left hand side for starters +c set diagonal values to 1. This is overkill, but convenient +c--------------------------------------------------------------------- + do i = 0, isize + do m = 1, 5 + do n = 1, 5 + lhsa(m,n,i,0) = 0.0d0 + lhsb(m,n,i,0) = 0.0d0 + lhsa(m,n,i,ksize) = 0.0d0 + lhsb(m,n,i,ksize) = 0.0d0 + enddo + lhsb(m,m,i,0) = 1.0d0 + lhsb(m,m,i,ksize) = 1.0d0 + enddo + enddo + + do j=start(2,c),jsize + +c--------------------------------------------------------------------- +c This function computes the left hand side for the three z-factors +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Compute the indices for storing the block-diagonal matrix; +c determine c (labeled f) and s jacobians for cell c +c--------------------------------------------------------------------- + + do k = start(3,c)-1, cell_size(3,c)-end(3,c) + do i=start(1,c),isize + + tmp1 = 1.0d0 / u(1,i,j,k,c) + tmp2 = tmp1 * tmp1 + tmp3 = tmp1 * tmp2 + + fjac(1,1,i,k) = 0.0d+00 + fjac(1,2,i,k) = 0.0d+00 + fjac(1,3,i,k) = 0.0d+00 + fjac(1,4,i,k) = 1.0d+00 + fjac(1,5,i,k) = 0.0d+00 + + fjac(2,1,i,k) = - ( u(2,i,j,k,c)*u(4,i,j,k,c) ) + > * tmp2 + fjac(2,2,i,k) = u(4,i,j,k,c) * tmp1 + fjac(2,3,i,k) = 0.0d+00 + fjac(2,4,i,k) = u(2,i,j,k,c) * tmp1 + fjac(2,5,i,k) = 0.0d+00 + + fjac(3,1,i,k) = - ( u(3,i,j,k,c)*u(4,i,j,k,c) ) + > * tmp2 + fjac(3,2,i,k) = 0.0d+00 + fjac(3,3,i,k) = u(4,i,j,k,c) * tmp1 + fjac(3,4,i,k) = u(3,i,j,k,c) * tmp1 + fjac(3,5,i,k) = 0.0d+00 + + fjac(4,1,i,k) = - (u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 ) + > + c2 * qs(i,j,k,c) + fjac(4,2,i,k) = - c2 * u(2,i,j,k,c) * tmp1 + fjac(4,3,i,k) = - c2 * u(3,i,j,k,c) * tmp1 + fjac(4,4,i,k) = ( 2.0d+00 - c2 ) + > * u(4,i,j,k,c) * tmp1 + fjac(4,5,i,k) = c2 + + fjac(5,1,i,k) = ( c2 * 2.0d0 * qs(i,j,k,c) + > - c1 * ( u(5,i,j,k,c) * tmp1 ) ) + > * ( u(4,i,j,k,c) * tmp1 ) + fjac(5,2,i,k) = - c2 * ( u(2,i,j,k,c)*u(4,i,j,k,c) ) + > * tmp2 + fjac(5,3,i,k) = - c2 * ( u(3,i,j,k,c)*u(4,i,j,k,c) ) + > * tmp2 + fjac(5,4,i,k) = c1 * ( u(5,i,j,k,c) * tmp1 ) + > - c2 * ( qs(i,j,k,c) + > + u(4,i,j,k,c)*u(4,i,j,k,c) * tmp2 ) + fjac(5,5,i,k) = c1 * u(4,i,j,k,c) * tmp1 + + njac(1,1,i,k) = 0.0d+00 + njac(1,2,i,k) = 0.0d+00 + njac(1,3,i,k) = 0.0d+00 + njac(1,4,i,k) = 0.0d+00 + njac(1,5,i,k) = 0.0d+00 + + njac(2,1,i,k) = - c3c4 * tmp2 * u(2,i,j,k,c) + njac(2,2,i,k) = c3c4 * tmp1 + njac(2,3,i,k) = 0.0d+00 + njac(2,4,i,k) = 0.0d+00 + njac(2,5,i,k) = 0.0d+00 + + njac(3,1,i,k) = - c3c4 * tmp2 * u(3,i,j,k,c) + njac(3,2,i,k) = 0.0d+00 + njac(3,3,i,k) = c3c4 * tmp1 + njac(3,4,i,k) = 0.0d+00 + njac(3,5,i,k) = 0.0d+00 + + njac(4,1,i,k) = - con43 * c3c4 * tmp2 * u(4,i,j,k,c) + njac(4,2,i,k) = 0.0d+00 + njac(4,3,i,k) = 0.0d+00 + njac(4,4,i,k) = con43 * c3 * c4 * tmp1 + njac(4,5,i,k) = 0.0d+00 + + njac(5,1,i,k) = - ( c3c4 + > - c1345 ) * tmp3 * (u(2,i,j,k,c)**2) + > - ( c3c4 - c1345 ) * tmp3 * (u(3,i,j,k,c)**2) + > - ( con43 * c3c4 + > - c1345 ) * tmp3 * (u(4,i,j,k,c)**2) + > - c1345 * tmp2 * u(5,i,j,k,c) + + njac(5,2,i,k) = ( c3c4 - c1345 ) * tmp2 * u(2,i,j,k,c) + njac(5,3,i,k) = ( c3c4 - c1345 ) * tmp2 * u(3,i,j,k,c) + njac(5,4,i,k) = ( con43 * c3c4 + > - c1345 ) * tmp2 * u(4,i,j,k,c) + njac(5,5,i,k) = ( c1345 )* tmp1 + + + enddo + enddo + +c--------------------------------------------------------------------- +c now joacobians set, so form left hand side in z direction +c--------------------------------------------------------------------- + do k = start(3,c), ksize-end(3,c) + do i=start(1,c),isize + + tmp1 = dt * tz1 + tmp2 = dt * tz2 + + lhsa(1,1,i,k) = - tmp2 * fjac(1,1,i,k-1) + > - tmp1 * njac(1,1,i,k-1) + > - tmp1 * dz1 + lhsa(1,2,i,k) = - tmp2 * fjac(1,2,i,k-1) + > - tmp1 * njac(1,2,i,k-1) + lhsa(1,3,i,k) = - tmp2 * fjac(1,3,i,k-1) + > - tmp1 * njac(1,3,i,k-1) + lhsa(1,4,i,k) = - tmp2 * fjac(1,4,i,k-1) + > - tmp1 * njac(1,4,i,k-1) + lhsa(1,5,i,k) = - tmp2 * fjac(1,5,i,k-1) + > - tmp1 * njac(1,5,i,k-1) + + lhsa(2,1,i,k) = - tmp2 * fjac(2,1,i,k-1) + > - tmp1 * njac(2,1,i,k-1) + lhsa(2,2,i,k) = - tmp2 * fjac(2,2,i,k-1) + > - tmp1 * njac(2,2,i,k-1) + > - tmp1 * dz2 + lhsa(2,3,i,k) = - tmp2 * fjac(2,3,i,k-1) + > - tmp1 * njac(2,3,i,k-1) + lhsa(2,4,i,k) = - tmp2 * fjac(2,4,i,k-1) + > - tmp1 * njac(2,4,i,k-1) + lhsa(2,5,i,k) = - tmp2 * fjac(2,5,i,k-1) + > - tmp1 * njac(2,5,i,k-1) + + lhsa(3,1,i,k) = - tmp2 * fjac(3,1,i,k-1) + > - tmp1 * njac(3,1,i,k-1) + lhsa(3,2,i,k) = - tmp2 * fjac(3,2,i,k-1) + > - tmp1 * njac(3,2,i,k-1) + lhsa(3,3,i,k) = - tmp2 * fjac(3,3,i,k-1) + > - tmp1 * njac(3,3,i,k-1) + > - tmp1 * dz3 + lhsa(3,4,i,k) = - tmp2 * fjac(3,4,i,k-1) + > - tmp1 * njac(3,4,i,k-1) + lhsa(3,5,i,k) = - tmp2 * fjac(3,5,i,k-1) + > - tmp1 * njac(3,5,i,k-1) + + lhsa(4,1,i,k) = - tmp2 * fjac(4,1,i,k-1) + > - tmp1 * njac(4,1,i,k-1) + lhsa(4,2,i,k) = - tmp2 * fjac(4,2,i,k-1) + > - tmp1 * njac(4,2,i,k-1) + lhsa(4,3,i,k) = - tmp2 * fjac(4,3,i,k-1) + > - tmp1 * njac(4,3,i,k-1) + lhsa(4,4,i,k) = - tmp2 * fjac(4,4,i,k-1) + > - tmp1 * njac(4,4,i,k-1) + > - tmp1 * dz4 + lhsa(4,5,i,k) = - tmp2 * fjac(4,5,i,k-1) + > - tmp1 * njac(4,5,i,k-1) + + lhsa(5,1,i,k) = - tmp2 * fjac(5,1,i,k-1) + > - tmp1 * njac(5,1,i,k-1) + lhsa(5,2,i,k) = - tmp2 * fjac(5,2,i,k-1) + > - tmp1 * njac(5,2,i,k-1) + lhsa(5,3,i,k) = - tmp2 * fjac(5,3,i,k-1) + > - tmp1 * njac(5,3,i,k-1) + lhsa(5,4,i,k) = - tmp2 * fjac(5,4,i,k-1) + > - tmp1 * njac(5,4,i,k-1) + lhsa(5,5,i,k) = - tmp2 * fjac(5,5,i,k-1) + > - tmp1 * njac(5,5,i,k-1) + > - tmp1 * dz5 + + lhsb(1,1,i,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(1,1,i,k) + > + tmp1 * 2.0d+00 * dz1 + lhsb(1,2,i,k) = tmp1 * 2.0d+00 * njac(1,2,i,k) + lhsb(1,3,i,k) = tmp1 * 2.0d+00 * njac(1,3,i,k) + lhsb(1,4,i,k) = tmp1 * 2.0d+00 * njac(1,4,i,k) + lhsb(1,5,i,k) = tmp1 * 2.0d+00 * njac(1,5,i,k) + + lhsb(2,1,i,k) = tmp1 * 2.0d+00 * njac(2,1,i,k) + lhsb(2,2,i,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(2,2,i,k) + > + tmp1 * 2.0d+00 * dz2 + lhsb(2,3,i,k) = tmp1 * 2.0d+00 * njac(2,3,i,k) + lhsb(2,4,i,k) = tmp1 * 2.0d+00 * njac(2,4,i,k) + lhsb(2,5,i,k) = tmp1 * 2.0d+00 * njac(2,5,i,k) + + lhsb(3,1,i,k) = tmp1 * 2.0d+00 * njac(3,1,i,k) + lhsb(3,2,i,k) = tmp1 * 2.0d+00 * njac(3,2,i,k) + lhsb(3,3,i,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(3,3,i,k) + > + tmp1 * 2.0d+00 * dz3 + lhsb(3,4,i,k) = tmp1 * 2.0d+00 * njac(3,4,i,k) + lhsb(3,5,i,k) = tmp1 * 2.0d+00 * njac(3,5,i,k) + + lhsb(4,1,i,k) = tmp1 * 2.0d+00 * njac(4,1,i,k) + lhsb(4,2,i,k) = tmp1 * 2.0d+00 * njac(4,2,i,k) + lhsb(4,3,i,k) = tmp1 * 2.0d+00 * njac(4,3,i,k) + lhsb(4,4,i,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(4,4,i,k) + > + tmp1 * 2.0d+00 * dz4 + lhsb(4,5,i,k) = tmp1 * 2.0d+00 * njac(4,5,i,k) + + lhsb(5,1,i,k) = tmp1 * 2.0d+00 * njac(5,1,i,k) + lhsb(5,2,i,k) = tmp1 * 2.0d+00 * njac(5,2,i,k) + lhsb(5,3,i,k) = tmp1 * 2.0d+00 * njac(5,3,i,k) + lhsb(5,4,i,k) = tmp1 * 2.0d+00 * njac(5,4,i,k) + lhsb(5,5,i,k) = 1.0d+00 + > + tmp1 * 2.0d+00 * njac(5,5,i,k) + > + tmp1 * 2.0d+00 * dz5 + + lhsc(1,1,i,j,k,c) = tmp2 * fjac(1,1,i,k+1) + > - tmp1 * njac(1,1,i,k+1) + > - tmp1 * dz1 + lhsc(1,2,i,j,k,c) = tmp2 * fjac(1,2,i,k+1) + > - tmp1 * njac(1,2,i,k+1) + lhsc(1,3,i,j,k,c) = tmp2 * fjac(1,3,i,k+1) + > - tmp1 * njac(1,3,i,k+1) + lhsc(1,4,i,j,k,c) = tmp2 * fjac(1,4,i,k+1) + > - tmp1 * njac(1,4,i,k+1) + lhsc(1,5,i,j,k,c) = tmp2 * fjac(1,5,i,k+1) + > - tmp1 * njac(1,5,i,k+1) + + lhsc(2,1,i,j,k,c) = tmp2 * fjac(2,1,i,k+1) + > - tmp1 * njac(2,1,i,k+1) + lhsc(2,2,i,j,k,c) = tmp2 * fjac(2,2,i,k+1) + > - tmp1 * njac(2,2,i,k+1) + > - tmp1 * dz2 + lhsc(2,3,i,j,k,c) = tmp2 * fjac(2,3,i,k+1) + > - tmp1 * njac(2,3,i,k+1) + lhsc(2,4,i,j,k,c) = tmp2 * fjac(2,4,i,k+1) + > - tmp1 * njac(2,4,i,k+1) + lhsc(2,5,i,j,k,c) = tmp2 * fjac(2,5,i,k+1) + > - tmp1 * njac(2,5,i,k+1) + + lhsc(3,1,i,j,k,c) = tmp2 * fjac(3,1,i,k+1) + > - tmp1 * njac(3,1,i,k+1) + lhsc(3,2,i,j,k,c) = tmp2 * fjac(3,2,i,k+1) + > - tmp1 * njac(3,2,i,k+1) + lhsc(3,3,i,j,k,c) = tmp2 * fjac(3,3,i,k+1) + > - tmp1 * njac(3,3,i,k+1) + > - tmp1 * dz3 + lhsc(3,4,i,j,k,c) = tmp2 * fjac(3,4,i,k+1) + > - tmp1 * njac(3,4,i,k+1) + lhsc(3,5,i,j,k,c) = tmp2 * fjac(3,5,i,k+1) + > - tmp1 * njac(3,5,i,k+1) + + lhsc(4,1,i,j,k,c) = tmp2 * fjac(4,1,i,k+1) + > - tmp1 * njac(4,1,i,k+1) + lhsc(4,2,i,j,k,c) = tmp2 * fjac(4,2,i,k+1) + > - tmp1 * njac(4,2,i,k+1) + lhsc(4,3,i,j,k,c) = tmp2 * fjac(4,3,i,k+1) + > - tmp1 * njac(4,3,i,k+1) + lhsc(4,4,i,j,k,c) = tmp2 * fjac(4,4,i,k+1) + > - tmp1 * njac(4,4,i,k+1) + > - tmp1 * dz4 + lhsc(4,5,i,j,k,c) = tmp2 * fjac(4,5,i,k+1) + > - tmp1 * njac(4,5,i,k+1) + + lhsc(5,1,i,j,k,c) = tmp2 * fjac(5,1,i,k+1) + > - tmp1 * njac(5,1,i,k+1) + lhsc(5,2,i,j,k,c) = tmp2 * fjac(5,2,i,k+1) + > - tmp1 * njac(5,2,i,k+1) + lhsc(5,3,i,j,k,c) = tmp2 * fjac(5,3,i,k+1) + > - tmp1 * njac(5,3,i,k+1) + lhsc(5,4,i,j,k,c) = tmp2 * fjac(5,4,i,k+1) + > - tmp1 * njac(5,4,i,k+1) + lhsc(5,5,i,j,k,c) = tmp2 * fjac(5,5,i,k+1) + > - tmp1 * njac(5,5,i,k+1) + > - tmp1 * dz5 + + enddo + enddo + + +c--------------------------------------------------------------------- +c outer most do loops - sweeping in i direction +c--------------------------------------------------------------------- + if (first .eq. 1) then + +c--------------------------------------------------------------------- +c multiply c(i,j,kstart) by b_inverse and copy back to c +c multiply rhs(kstart) by b_inverse(kstart) and copy to rhs +c--------------------------------------------------------------------- +!dir$ ivdep + do i=start(1,c),isize + call binvcrhs( lhsb(1,1,i,kstart), + > lhsc(1,1,i,j,kstart,c), + > rhs(1,i,j,kstart,c) ) + enddo + + endif + +c--------------------------------------------------------------------- +c begin inner most do loop +c do all the elements of the cell unless last +c--------------------------------------------------------------------- + do k=kstart+first,ksize-last +!dir$ ivdep + do i=start(1,c),isize + +c--------------------------------------------------------------------- +c subtract A*lhs_vector(k-1) from lhs_vector(k) +c +c rhs(k) = rhs(k) - A*rhs(k-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,k), + > rhs(1,i,j,k-1,c),rhs(1,i,j,k,c)) + +c--------------------------------------------------------------------- +c B(k) = B(k) - C(k-1)*A(k) +c call matmul_sub(aa,i,j,k,c,cc,i,j,k-1,c,bb,i,j,k,c) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,k), + > lhsc(1,1,i,j,k-1,c), + > lhsb(1,1,i,k)) + +c--------------------------------------------------------------------- +c multiply c(i,j,k) by b_inverse and copy back to c +c multiply rhs(i,j,1) by b_inverse(i,j,1) and copy to rhs +c--------------------------------------------------------------------- + call binvcrhs( lhsb(1,1,i,k), + > lhsc(1,1,i,j,k,c), + > rhs(1,i,j,k,c) ) + + enddo + enddo + +c--------------------------------------------------------------------- +c Now finish up special cases for last cell +c--------------------------------------------------------------------- + if (last .eq. 1) then + +!dir$ ivdep + do i=start(1,c),isize +c--------------------------------------------------------------------- +c rhs(ksize) = rhs(ksize) - A*rhs(ksize-1) +c--------------------------------------------------------------------- + call matvec_sub(lhsa(1,1,i,ksize), + > rhs(1,i,j,ksize-1,c),rhs(1,i,j,ksize,c)) + +c--------------------------------------------------------------------- +c B(ksize) = B(ksize) - C(ksize-1)*A(ksize) +c call matmul_sub(aa,i,j,ksize,c, +c $ cc,i,j,ksize-1,c,bb,i,j,ksize,c) +c--------------------------------------------------------------------- + call matmul_sub(lhsa(1,1,i,ksize), + > lhsc(1,1,i,j,ksize-1,c), + > lhsb(1,1,i,ksize)) + +c--------------------------------------------------------------------- +c multiply rhs(ksize) by b_inverse(ksize) and copy to rhs +c--------------------------------------------------------------------- + call binvrhs( lhsb(1,1,i,ksize), + > rhs(1,i,j,ksize,c) ) + enddo + + endif + enddo + + + return + end + + + + + + diff --git b/NPB3.3-MPI/CG/Makefile a/NPB3.3-MPI/CG/Makefile new file mode 100644 index 0000000..e9f0c98 --- /dev/null +++ a/NPB3.3-MPI/CG/Makefile @@ -0,0 +1,23 @@ +SHELL=/bin/sh +BENCHMARK=cg +BENCHMARKU=CG + +include ../config/make.def + +OBJS = cg.o ${COMMON}/print_results.o \ + ${COMMON}/${RAND}.o ${COMMON}/timers.o + +include ../sys/make.common + +${PROGRAM}: config ${OBJS} + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB} + +cg.o: cg.f mpinpb.h npbparams.h timing.h + ${FCOMPILE} cg.f + +clean: + - rm -f *.o *~ + - rm -f npbparams.h core + + + diff --git b/NPB3.3-MPI/CG/cg.f a/NPB3.3-MPI/CG/cg.f new file mode 100644 index 0000000..9a82466 --- /dev/null +++ a/NPB3.3-MPI/CG/cg.f @@ -0,0 +1,1864 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.3 ! +! ! +! C G ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.3 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007 ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.3. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.3, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + + +c--------------------------------------------------------------------- +c +c Authors: M. Yarrow +c C. Kuszmaul +c R. F. Van der Wijngaart +c H. Jin +c +c--------------------------------------------------------------------- + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + program cg +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + + implicit none + + include 'mpinpb.h' + include 'timing.h' + integer status(MPI_STATUS_SIZE), request, ierr + + include 'npbparams.h' + +c--------------------------------------------------------------------- +c num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows. +c num_proc_cols and num_proc_cols are to be found in npbparams.h. +c When num_procs is not square, then num_proc_cols must be = 2*num_proc_rows. +c--------------------------------------------------------------------- + integer num_procs + parameter( num_procs = num_proc_cols * num_proc_rows ) + + + +c--------------------------------------------------------------------- +c Class specific parameters: +c It appears here for reference only. +c These are their values, however, this info is imported in the npbparams.h +c include file, which is written by the sys/setparams.c program. +c--------------------------------------------------------------------- + +C---------- +C Class S: +C---------- +CC parameter( na=1400, +CC > nonzer=7, +CC > shift=10., +CC > niter=15, +CC > rcond=1.0d-1 ) +C---------- +C Class W: +C---------- +CC parameter( na=7000, +CC > nonzer=8, +CC > shift=12., +CC > niter=15, +CC > rcond=1.0d-1 ) +C---------- +C Class A: +C---------- +CC parameter( na=14000, +CC > nonzer=11, +CC > shift=20., +CC > niter=15, +CC > rcond=1.0d-1 ) +C---------- +C Class B: +C---------- +CC parameter( na=75000, +CC > nonzer=13, +CC > shift=60., +CC > niter=75, +CC > rcond=1.0d-1 ) +C---------- +C Class C: +C---------- +CC parameter( na=150000, +CC > nonzer=15, +CC > shift=110., +CC > niter=75, +CC > rcond=1.0d-1 ) +C---------- +C Class D: +C---------- +CC parameter( na=1500000, +CC > nonzer=21, +CC > shift=500., +CC > niter=100, +CC > rcond=1.0d-1 ) +C---------- +C Class E: +C---------- +CC parameter( na=9000000, +CC > nonzer=26, +CC > shift=1500., +CC > niter=100, +CC > rcond=1.0d-1 ) + + + + integer nz + parameter( nz = na*(nonzer+1)/num_procs*(nonzer+1)+nonzer + > + na*(nonzer+2+num_procs/256)/num_proc_cols ) + + + + common / partit_size / naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + integer naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + + + common / main_int_mem / colidx, rowstr, + > iv, arow, acol + integer colidx(nz), rowstr(na+1), + > iv(2*na+1), arow(nz), acol(nz) + + + common / main_flt_mem / v, aelt, a, + > x, + > z, + > p, + > q, + > r, + > w + double precision v(na+1), aelt(nz), a(nz), + > x(na/num_proc_rows+2), + > z(na/num_proc_rows+2), + > p(na/num_proc_rows+2), + > q(na/num_proc_rows+2), + > r(na/num_proc_rows+2), + > w(na/num_proc_rows+2) + + + common /urando/ amult, tran + double precision amult, tran + + + + integer l2npcols + integer reduce_exch_proc(num_proc_cols) + integer reduce_send_starts(num_proc_cols) + integer reduce_send_lengths(num_proc_cols) + integer reduce_recv_starts(num_proc_cols) + integer reduce_recv_lengths(num_proc_cols) + + integer i, j, k, it + + double precision zeta, randlc + external randlc + double precision rnorm + double precision norm_temp1(2), norm_temp2(2) + + double precision t, tmax, mflops + external timer_read + double precision timer_read + character class + logical verified + double precision zeta_verify_value, epsilon, err + + double precision tsum(t_last+2), t1(t_last+2), + > tming(t_last+2), tmaxg(t_last+2) + character t_recs(t_last+2)*8 + + data t_recs/'total', 'conjg', 'rcomm', 'ncomm', + > ' totcomp', ' totcomm'/ + + +c--------------------------------------------------------------------- +c Set up mpi initialization and number of proc testing +c--------------------------------------------------------------------- + call initialize_mpi + + + if( na .eq. 1400 .and. + & nonzer .eq. 7 .and. + & niter .eq. 15 .and. + & shift .eq. 10.d0 ) then + class = 'S' + zeta_verify_value = 8.5971775078648d0 + else if( na .eq. 7000 .and. + & nonzer .eq. 8 .and. + & niter .eq. 15 .and. + & shift .eq. 12.d0 ) then + class = 'W' + zeta_verify_value = 10.362595087124d0 + else if( na .eq. 14000 .and. + & nonzer .eq. 11 .and. + & niter .eq. 15 .and. + & shift .eq. 20.d0 ) then + class = 'A' + zeta_verify_value = 17.130235054029d0 + else if( na .eq. 75000 .and. + & nonzer .eq. 13 .and. + & niter .eq. 75 .and. + & shift .eq. 60.d0 ) then + class = 'B' + zeta_verify_value = 22.712745482631d0 + else if( na .eq. 150000 .and. + & nonzer .eq. 15 .and. + & niter .eq. 75 .and. + & shift .eq. 110.d0 ) then + class = 'C' + zeta_verify_value = 28.973605592845d0 + else if( na .eq. 1500000 .and. + & nonzer .eq. 21 .and. + & niter .eq. 100 .and. + & shift .eq. 500.d0 ) then + class = 'D' + zeta_verify_value = 52.514532105794d0 + else if( na .eq. 9000000 .and. + & nonzer .eq. 26 .and. + & niter .eq. 100 .and. + & shift .eq. 1.5d3 ) then + class = 'E' + zeta_verify_value = 77.522164599383d0 + else + class = 'U' + endif + + if( me .eq. root )then + write( *,1000 ) + write( *,1001 ) na + write( *,1002 ) niter + write( *,1003 ) nprocs + write( *,1004 ) nonzer + write( *,1005 ) shift + 1000 format(//,' NAS Parallel Benchmarks 3.3 -- CG Benchmark', /) + 1001 format(' Size: ', i10 ) + 1002 format(' Iterations: ', i5 ) + 1003 format(' Number of active processes: ', i5 ) + 1004 format(' Number of nonzeroes per row: ', i8) + 1005 format(' Eigenvalue shift: ', e8.3) + endif + + if (.not. convertdouble) then + dp_type = MPI_DOUBLE_PRECISION + else + dp_type = MPI_REAL + endif + + + naa = na + nzz = nz + + +c--------------------------------------------------------------------- +c Set up processor info, such as whether sq num of procs, etc +c--------------------------------------------------------------------- + call setup_proc_info( num_procs, + > num_proc_rows, + > num_proc_cols ) + + +c--------------------------------------------------------------------- +c Set up partition's submatrix info: firstcol, lastcol, firstrow, lastrow +c--------------------------------------------------------------------- + call setup_submatrix_info( l2npcols, + > reduce_exch_proc, + > reduce_send_starts, + > reduce_send_lengths, + > reduce_recv_starts, + > reduce_recv_lengths ) + + + do i = 1, t_last + call timer_clear(i) + end do + +c--------------------------------------------------------------------- +c Inialize random number generator +c--------------------------------------------------------------------- + tran = 314159265.0D0 + amult = 1220703125.0D0 + zeta = randlc( tran, amult ) + +c--------------------------------------------------------------------- +c Set up partition's sparse random matrix for given class size +c--------------------------------------------------------------------- + call makea(naa, nzz, a, colidx, rowstr, nonzer, + > firstrow, lastrow, firstcol, lastcol, + > rcond, arow, acol, aelt, v, iv, shift) + + + +c--------------------------------------------------------------------- +c Note: as a result of the above call to makea: +c values of j used in indexing rowstr go from 1 --> lastrow-firstrow+1 +c values of colidx which are col indexes go from firstcol --> lastcol +c So: +c Shift the col index vals from actual (firstcol --> lastcol ) +c to local, i.e., (1 --> lastcol-firstcol+1) +c--------------------------------------------------------------------- + do j=1,lastrow-firstrow+1 + do k=rowstr(j),rowstr(j+1)-1 + colidx(k) = colidx(k) - firstcol + 1 + enddo + enddo + +c--------------------------------------------------------------------- +c set starting vector to (1, 1, .... 1) +c--------------------------------------------------------------------- + do i = 1, na/num_proc_rows+1 + x(i) = 1.0D0 + enddo + + zeta = 0.0d0 + +c--------------------------------------------------------------------- +c----> +c Do one iteration untimed to init all code and data page tables +c----> (then reinit, start timing, to niter its) +c--------------------------------------------------------------------- + do it = 1, 1 + +c--------------------------------------------------------------------- +c The call to the conjugate gradient routine: +c--------------------------------------------------------------------- + call conj_grad ( colidx, + > rowstr, + > x, + > z, + > a, + > p, + > q, + > r, + > w, + > rnorm, + > l2npcols, + > reduce_exch_proc, + > reduce_send_starts, + > reduce_send_lengths, + > reduce_recv_starts, + > reduce_recv_lengths ) + +c--------------------------------------------------------------------- +c zeta = shift + 1/(x.z) +c So, first: (x.z) +c Also, find norm of z +c So, first: (z.z) +c--------------------------------------------------------------------- + norm_temp1(1) = 0.0d0 + norm_temp1(2) = 0.0d0 + do j=1, lastcol-firstcol+1 + norm_temp1(1) = norm_temp1(1) + x(j)*z(j) + norm_temp1(2) = norm_temp1(2) + z(j)*z(j) + enddo + + do i = 1, l2npcols + if (timeron) call timer_start(t_ncomm) + call mpi_irecv( norm_temp2, + > 2, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( norm_temp1, + > 2, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_ncomm) + + norm_temp1(1) = norm_temp1(1) + norm_temp2(1) + norm_temp1(2) = norm_temp1(2) + norm_temp2(2) + enddo + + norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) ) + + +c--------------------------------------------------------------------- +c Normalize z to obtain x +c--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + x(j) = norm_temp1(2)*z(j) + enddo + + + enddo ! end of do one iteration untimed + + +c--------------------------------------------------------------------- +c set starting vector to (1, 1, .... 1) +c--------------------------------------------------------------------- +c +c NOTE: a questionable limit on size: should this be na/num_proc_cols+1 ? +c + do i = 1, na/num_proc_rows+1 + x(i) = 1.0D0 + enddo + + zeta = 0.0d0 + +c--------------------------------------------------------------------- +c Synchronize and start timing +c--------------------------------------------------------------------- + do i = 1, t_last + call timer_clear(i) + end do + call mpi_barrier( mpi_comm_world, + > ierr ) + + call timer_clear( 1 ) + call timer_start( 1 ) + +c--------------------------------------------------------------------- +c----> +c Main Iteration for inverse power method +c----> +c--------------------------------------------------------------------- + do it = 1, niter + +c--------------------------------------------------------------------- +c The call to the conjugate gradient routine: +c--------------------------------------------------------------------- + call conj_grad ( colidx, + > rowstr, + > x, + > z, + > a, + > p, + > q, + > r, + > w, + > rnorm, + > l2npcols, + > reduce_exch_proc, + > reduce_send_starts, + > reduce_send_lengths, + > reduce_recv_starts, + > reduce_recv_lengths ) + + +c--------------------------------------------------------------------- +c zeta = shift + 1/(x.z) +c So, first: (x.z) +c Also, find norm of z +c So, first: (z.z) +c--------------------------------------------------------------------- + norm_temp1(1) = 0.0d0 + norm_temp1(2) = 0.0d0 + do j=1, lastcol-firstcol+1 + norm_temp1(1) = norm_temp1(1) + x(j)*z(j) + norm_temp1(2) = norm_temp1(2) + z(j)*z(j) + enddo + + do i = 1, l2npcols + if (timeron) call timer_start(t_ncomm) + call mpi_irecv( norm_temp2, + > 2, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( norm_temp1, + > 2, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_ncomm) + + norm_temp1(1) = norm_temp1(1) + norm_temp2(1) + norm_temp1(2) = norm_temp1(2) + norm_temp2(2) + enddo + + norm_temp1(2) = 1.0d0 / sqrt( norm_temp1(2) ) + + + if( me .eq. root )then + zeta = shift + 1.0d0 / norm_temp1(1) + if( it .eq. 1 ) write( *,9000 ) + write( *,9001 ) it, rnorm, zeta + endif + 9000 format( /,' iteration ||r|| zeta' ) + 9001 format( 4x, i5, 7x, e20.14, f20.13 ) + +c--------------------------------------------------------------------- +c Normalize z to obtain x +c--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + x(j) = norm_temp1(2)*z(j) + enddo + + + enddo ! end of main iter inv pow meth + + call timer_stop( 1 ) + +c--------------------------------------------------------------------- +c End of timed section +c--------------------------------------------------------------------- + + t = timer_read( 1 ) + + call mpi_reduce( t, + > tmax, + > 1, + > dp_type, + > MPI_MAX, + > root, + > mpi_comm_world, + > ierr ) + + if( me .eq. root )then + write(*,100) + 100 format(' Benchmark completed ') + + epsilon = 1.d-10 + if (class .ne. 'U') then + + err = abs( zeta - zeta_verify_value )/zeta_verify_value + if( err .le. epsilon ) then + verified = .TRUE. + write(*, 200) + write(*, 201) zeta + write(*, 202) err + 200 format(' VERIFICATION SUCCESSFUL ') + 201 format(' Zeta is ', E20.13) + 202 format(' Error is ', E20.13) + else + verified = .FALSE. + write(*, 300) + write(*, 301) zeta + write(*, 302) zeta_verify_value + 300 format(' VERIFICATION FAILED') + 301 format(' Zeta ', E20.13) + 302 format(' The correct zeta is ', E20.13) + endif + else + verified = .FALSE. + write (*, 400) + write (*, 401) + write (*, 201) zeta + 400 format(' Problem size unknown') + 401 format(' NO VERIFICATION PERFORMED') + endif + + + if( tmax .ne. 0. ) then + mflops = float( 2*niter*na ) + & * ( 3.+float( nonzer*(nonzer+1) ) + & + 25.*(5.+float( nonzer*(nonzer+1) )) + & + 3. ) / tmax / 1000000.0 + else + mflops = 0.0 + endif + + call print_results('CG', class, na, 0, 0, + > niter, nnodes_compiled, nprocs, tmax, + > mflops, ' floating point', + > verified, npbversion, compiletime, + > cs1, cs2, cs3, cs4, cs5, cs6, cs7) + + + endif + + + if (.not.timeron) goto 999 + + do i = 1, t_last + t1(i) = timer_read(i) + end do + t1(t_conjg) = t1(t_conjg) - t1(t_rcomm) + t1(t_last+2) = t1(t_rcomm) + t1(t_ncomm) + t1(t_last+1) = t1(t_total) - t1(t_last+2) + + call MPI_Reduce(t1, tsum, t_last+2, dp_type, MPI_SUM, + > 0, MPI_COMM_WORLD, ierr) + call MPI_Reduce(t1, tming, t_last+2, dp_type, MPI_MIN, + > 0, MPI_COMM_WORLD, ierr) + call MPI_Reduce(t1, tmaxg, t_last+2, dp_type, MPI_MAX, + > 0, MPI_COMM_WORLD, ierr) + + if (me .eq. 0) then + write(*, 800) nprocs + do i = 1, t_last+2 + tsum(i) = tsum(i) / nprocs + write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i) + end do + endif + 800 format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', + > 5x, 'average') + 810 format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4)) + + 999 continue + call mpi_finalize(ierr) + + + + end ! end main + + + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine initialize_mpi +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'mpinpb.h' + include 'timing.h' + + integer ierr, fstatus + + + call mpi_init( ierr ) + call mpi_comm_rank( mpi_comm_world, me, ierr ) + call mpi_comm_size( mpi_comm_world, nprocs, ierr ) + root = 0 + + if (me .eq. root) then + open (unit=2,file='timer.flag',status='old',iostat=fstatus) + timeron = .false. + if (fstatus .eq. 0) then + timeron = .true. + close(2) + endif + endif + + call mpi_bcast(timeron, 1, MPI_LOGICAL, 0, mpi_comm_world, ierr) + + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine setup_proc_info( num_procs, + > num_proc_rows, + > num_proc_cols ) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'mpinpb.h' + + common / partit_size / naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + integer naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + + integer num_procs, num_proc_cols, num_proc_rows + integer i, ierr + integer log2nprocs + +c--------------------------------------------------------------------- +c num_procs must be a power of 2, and num_procs=num_proc_cols*num_proc_rows +c When num_procs is not square, then num_proc_cols = 2*num_proc_rows +c--------------------------------------------------------------------- +c First, number of procs must be power of two. +c--------------------------------------------------------------------- + if( nprocs .ne. num_procs )then + if( me .eq. root ) write( *,9000 ) nprocs, num_procs + 9000 format( /,'Error: ',/,'num of procs allocated (', + > i4, ' )', + > /,'is not equal to',/, + > 'compiled number of procs (', + > i4, ' )',/ ) + call mpi_finalize(ierr) + stop + endif + + + i = num_proc_cols + 100 continue + if( i .ne. 1 .and. i/2*2 .ne. i )then + if ( me .eq. root ) then + write( *,* ) 'Error: num_proc_cols is ', + > num_proc_cols, + > ' which is not a power of two' + endif + call mpi_finalize(ierr) + stop + endif + i = i / 2 + if( i .ne. 0 )then + goto 100 + endif + + i = num_proc_rows + 200 continue + if( i .ne. 1 .and. i/2*2 .ne. i )then + if ( me .eq. root ) then + write( *,* ) 'Error: num_proc_rows is ', + > num_proc_rows, + > ' which is not a power of two' + endif + call mpi_finalize(ierr) + stop + endif + i = i / 2 + if( i .ne. 0 )then + goto 200 + endif + + log2nprocs = 0 + i = nprocs + 300 continue + if( i .ne. 1 .and. i/2*2 .ne. i )then + write( *,* ) 'Error: nprocs is ', + > nprocs, + > ' which is not a power of two' + call mpi_finalize(ierr) + stop + endif + i = i / 2 + if( i .ne. 0 )then + log2nprocs = log2nprocs + 1 + goto 300 + endif + +CC write( *,* ) 'nprocs, log2nprocs: ',nprocs,log2nprocs + + + npcols = num_proc_cols + nprows = num_proc_rows + + + return + end + + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine setup_submatrix_info( l2npcols, + > reduce_exch_proc, + > reduce_send_starts, + > reduce_send_lengths, + > reduce_recv_starts, + > reduce_recv_lengths ) + > +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'mpinpb.h' + + integer col_size, row_size + + common / partit_size / naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + integer naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + + integer reduce_exch_proc(*) + integer reduce_send_starts(*) + integer reduce_send_lengths(*) + integer reduce_recv_starts(*) + integer reduce_recv_lengths(*) + + integer i, j + integer div_factor + integer l2npcols + + + proc_row = me / npcols + proc_col = me - proc_row*npcols + + + +c--------------------------------------------------------------------- +c If naa evenly divisible by npcols, then it is evenly divisible +c by nprows +c--------------------------------------------------------------------- + + if( naa/npcols*npcols .eq. naa )then + col_size = naa/npcols + firstcol = proc_col*col_size + 1 + lastcol = firstcol - 1 + col_size + row_size = naa/nprows + firstrow = proc_row*row_size + 1 + lastrow = firstrow - 1 + row_size +c--------------------------------------------------------------------- +c If naa not evenly divisible by npcols, then first subdivide for nprows +c and then, if npcols not equal to nprows (i.e., not a sq number of procs), +c get col subdivisions by dividing by 2 each row subdivision. +c--------------------------------------------------------------------- + else + if( proc_row .lt. naa - naa/nprows*nprows)then + row_size = naa/nprows+ 1 + firstrow = proc_row*row_size + 1 + lastrow = firstrow - 1 + row_size + else + row_size = naa/nprows + firstrow = (naa - naa/nprows*nprows)*(row_size+1) + > + (proc_row-(naa-naa/nprows*nprows)) + > *row_size + 1 + lastrow = firstrow - 1 + row_size + endif + if( npcols .eq. nprows )then + if( proc_col .lt. naa - naa/npcols*npcols )then + col_size = naa/npcols+ 1 + firstcol = proc_col*col_size + 1 + lastcol = firstcol - 1 + col_size + else + col_size = naa/npcols + firstcol = (naa - naa/npcols*npcols)*(col_size+1) + > + (proc_col-(naa-naa/npcols*npcols)) + > *col_size + 1 + lastcol = firstcol - 1 + col_size + endif + else + if( (proc_col/2) .lt. + > naa - naa/(npcols/2)*(npcols/2) )then + col_size = naa/(npcols/2) + 1 + firstcol = (proc_col/2)*col_size + 1 + lastcol = firstcol - 1 + col_size + else + col_size = naa/(npcols/2) + firstcol = (naa - naa/(npcols/2)*(npcols/2)) + > *(col_size+1) + > + ((proc_col/2)-(naa-naa/(npcols/2)*(npcols/2))) + > *col_size + 1 + lastcol = firstcol - 1 + col_size + endif +CC write( *,* ) col_size,firstcol,lastcol + if( mod( me,2 ) .eq. 0 )then + lastcol = firstcol - 1 + (col_size-1)/2 + 1 + else + firstcol = firstcol + (col_size-1)/2 + 1 + lastcol = firstcol - 1 + col_size/2 +CC write( *,* ) firstcol,lastcol + endif + endif + endif + + + + if( npcols .eq. nprows )then + send_start = 1 + send_len = lastrow - firstrow + 1 + else + if( mod( me,2 ) .eq. 0 )then + send_start = 1 + send_len = (1 + lastrow-firstrow+1)/2 + else + send_start = (1 + lastrow-firstrow+1)/2 + 1 + send_len = (lastrow-firstrow+1)/2 + endif + endif + + + + +c--------------------------------------------------------------------- +c Transpose exchange processor +c--------------------------------------------------------------------- + + if( npcols .eq. nprows )then + exch_proc = mod( me,nprows )*nprows + me/nprows + else + exch_proc = 2*(mod( me/2,nprows )*nprows + me/2/nprows) + > + mod( me,2 ) + endif + + + + i = npcols / 2 + l2npcols = 0 + do while( i .gt. 0 ) + l2npcols = l2npcols + 1 + i = i / 2 + enddo + + +c--------------------------------------------------------------------- +c Set up the reduce phase schedules... +c--------------------------------------------------------------------- + + div_factor = npcols + do i = 1, l2npcols + + j = mod( proc_col+div_factor/2, div_factor ) + > + proc_col / div_factor * div_factor + reduce_exch_proc(i) = proc_row*npcols + j + + div_factor = div_factor / 2 + + enddo + + + do i = l2npcols, 1, -1 + + if( nprows .eq. npcols )then + reduce_send_starts(i) = send_start + reduce_send_lengths(i) = send_len + reduce_recv_lengths(i) = lastrow - firstrow + 1 + else + reduce_recv_lengths(i) = send_len + if( i .eq. l2npcols )then + reduce_send_lengths(i) = lastrow-firstrow+1 - send_len + if( me/2*2 .eq. me )then + reduce_send_starts(i) = send_start + send_len + else + reduce_send_starts(i) = 1 + endif + else + reduce_send_lengths(i) = send_len + reduce_send_starts(i) = send_start + endif + endif + reduce_recv_starts(i) = send_start + + enddo + + + exch_recv_length = lastcol - firstcol + 1 + + + return + end + + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine conj_grad ( colidx, + > rowstr, + > x, + > z, + > a, + > p, + > q, + > r, + > w, + > rnorm, + > l2npcols, + > reduce_exch_proc, + > reduce_send_starts, + > reduce_send_lengths, + > reduce_recv_starts, + > reduce_recv_lengths ) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Floaging point arrays here are named as in NPB1 spec discussion of +c CG algorithm +c--------------------------------------------------------------------- + + implicit none + + include 'mpinpb.h' + include 'timing.h' + + integer status(MPI_STATUS_SIZE ), request + + + common / partit_size / naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + integer naa, nzz, + > npcols, nprows, + > proc_col, proc_row, + > firstrow, + > lastrow, + > firstcol, + > lastcol, + > exch_proc, + > exch_recv_length, + > send_start, + > send_len + + + + double precision x(*), + > z(*), + > a(nzz) + integer colidx(nzz), rowstr(naa+1) + + double precision p(*), + > q(*), + > r(*), + > w(*) ! used as work temporary + + integer l2npcols + integer reduce_exch_proc(l2npcols) + integer reduce_send_starts(l2npcols) + integer reduce_send_lengths(l2npcols) + integer reduce_recv_starts(l2npcols) + integer reduce_recv_lengths(l2npcols) + + integer i, j, k, ierr + integer cgit, cgitmax + + double precision d, sum, rho, rho0, alpha, beta, rnorm + + external timer_read + double precision timer_read + + data cgitmax / 25 / + + + if (timeron) call timer_start(t_conjg) +c--------------------------------------------------------------------- +c Initialize the CG algorithm: +c--------------------------------------------------------------------- + do j=1,naa/nprows+1 + q(j) = 0.0d0 + z(j) = 0.0d0 + r(j) = x(j) + p(j) = r(j) + w(j) = 0.0d0 + enddo + + +c--------------------------------------------------------------------- +c rho = r.r +c Now, obtain the norm of r: First, sum squares of r elements locally... +c--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + sum = sum + r(j)*r(j) + enddo + +c--------------------------------------------------------------------- +c Exchange and sum with procs identified in reduce_exch_proc +c (This is equivalent to mpi_allreduce.) +c Sum the partial sums of rho, leaving rho on all processors +c--------------------------------------------------------------------- + do i = 1, l2npcols + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( rho, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( sum, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + + sum = sum + rho + enddo + rho = sum + + + +c--------------------------------------------------------------------- +c----> +c The conj grad iteration loop +c----> +c--------------------------------------------------------------------- + do cgit = 1, cgitmax + + +c--------------------------------------------------------------------- +c q = A.p +c The partition submatrix-vector multiply: use workspace w +c--------------------------------------------------------------------- + do j=1,lastrow-firstrow+1 + sum = 0.d0 + do k=rowstr(j),rowstr(j+1)-1 + sum = sum + a(k)*p(colidx(k)) + enddo + w(j) = sum + enddo + +c--------------------------------------------------------------------- +c Sum the partition submatrix-vec A.p's across rows +c Exchange and sum piece of w with procs identified in reduce_exch_proc +c--------------------------------------------------------------------- + do i = l2npcols, 1, -1 + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( q(reduce_recv_starts(i)), + > reduce_recv_lengths(i), + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( w(reduce_send_starts(i)), + > reduce_send_lengths(i), + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + do j=send_start,send_start + reduce_recv_lengths(i) - 1 + w(j) = w(j) + q(j) + enddo + enddo + + +c--------------------------------------------------------------------- +c Exchange piece of q with transpose processor: +c--------------------------------------------------------------------- + if( l2npcols .ne. 0 )then + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( q, + > exch_recv_length, + > dp_type, + > exch_proc, + > 1, + > mpi_comm_world, + > request, + > ierr ) + + call mpi_send( w(send_start), + > send_len, + > dp_type, + > exch_proc, + > 1, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + else + do j=1,exch_recv_length + q(j) = w(j) + enddo + endif + + +c--------------------------------------------------------------------- +c Clear w for reuse... +c--------------------------------------------------------------------- + do j=1, max( lastrow-firstrow+1, lastcol-firstcol+1 ) + w(j) = 0.0d0 + enddo + + +c--------------------------------------------------------------------- +c Obtain p.q +c--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + sum = sum + p(j)*q(j) + enddo + +c--------------------------------------------------------------------- +c Obtain d with a sum-reduce +c--------------------------------------------------------------------- + do i = 1, l2npcols + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( d, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( sum, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + + sum = sum + d + enddo + d = sum + + +c--------------------------------------------------------------------- +c Obtain alpha = rho / (p.q) +c--------------------------------------------------------------------- + alpha = rho / d + +c--------------------------------------------------------------------- +c Save a temporary of rho +c--------------------------------------------------------------------- + rho0 = rho + +c--------------------------------------------------------------------- +c Obtain z = z + alpha*p +c and r = r - alpha*q +c--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + z(j) = z(j) + alpha*p(j) + r(j) = r(j) - alpha*q(j) + enddo + +c--------------------------------------------------------------------- +c rho = r.r +c Now, obtain the norm of r: First, sum squares of r elements locally... +c--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + sum = sum + r(j)*r(j) + enddo + +c--------------------------------------------------------------------- +c Obtain rho with a sum-reduce +c--------------------------------------------------------------------- + do i = 1, l2npcols + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( rho, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( sum, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + + sum = sum + rho + enddo + rho = sum + +c--------------------------------------------------------------------- +c Obtain beta: +c--------------------------------------------------------------------- + beta = rho / rho0 + +c--------------------------------------------------------------------- +c p = r + beta*p +c--------------------------------------------------------------------- + do j=1, lastcol-firstcol+1 + p(j) = r(j) + beta*p(j) + enddo + + + + enddo ! end of do cgit=1,cgitmax + + + +c--------------------------------------------------------------------- +c Compute residual norm explicitly: ||r|| = ||x - A.z|| +c First, form A.z +c The partition submatrix-vector multiply +c--------------------------------------------------------------------- + do j=1,lastrow-firstrow+1 + sum = 0.d0 + do k=rowstr(j),rowstr(j+1)-1 + sum = sum + a(k)*z(colidx(k)) + enddo + w(j) = sum + enddo + + + +c--------------------------------------------------------------------- +c Sum the partition submatrix-vec A.z's across rows +c--------------------------------------------------------------------- + do i = l2npcols, 1, -1 + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( r(reduce_recv_starts(i)), + > reduce_recv_lengths(i), + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( w(reduce_send_starts(i)), + > reduce_send_lengths(i), + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + + do j=send_start,send_start + reduce_recv_lengths(i) - 1 + w(j) = w(j) + r(j) + enddo + enddo + + +c--------------------------------------------------------------------- +c Exchange piece of q with transpose processor: +c--------------------------------------------------------------------- + if( l2npcols .ne. 0 )then + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( r, + > exch_recv_length, + > dp_type, + > exch_proc, + > 1, + > mpi_comm_world, + > request, + > ierr ) + + call mpi_send( w(send_start), + > send_len, + > dp_type, + > exch_proc, + > 1, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + else + do j=1,exch_recv_length + r(j) = w(j) + enddo + endif + + +c--------------------------------------------------------------------- +c At this point, r contains A.z +c--------------------------------------------------------------------- + sum = 0.0d0 + do j=1, lastcol-firstcol+1 + d = x(j) - r(j) + sum = sum + d*d + enddo + +c--------------------------------------------------------------------- +c Obtain d with a sum-reduce +c--------------------------------------------------------------------- + do i = 1, l2npcols + if (timeron) call timer_start(t_rcomm) + call mpi_irecv( d, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > request, + > ierr ) + call mpi_send( sum, + > 1, + > dp_type, + > reduce_exch_proc(i), + > i, + > mpi_comm_world, + > ierr ) + call mpi_wait( request, status, ierr ) + if (timeron) call timer_stop(t_rcomm) + + sum = sum + d + enddo + d = sum + + + if( me .eq. root ) rnorm = sqrt( d ) + + if (timeron) call timer_stop(t_conjg) + + + return + end ! end of routine conj_grad + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine makea( n, nz, a, colidx, rowstr, nonzer, + > firstrow, lastrow, firstcol, lastcol, + > rcond, arow, acol, aelt, v, iv, shift ) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + integer n, nz, nonzer + integer firstrow, lastrow, firstcol, lastcol + integer colidx(nz), rowstr(n+1) + integer iv(2*n+1), arow(nz), acol(nz) + double precision v(n+1), aelt(nz) + double precision rcond, a(nz), shift + +c--------------------------------------------------------------------- +c generate the test problem for benchmark 6 +c makea generates a sparse matrix with a +c prescribed sparsity distribution +c +c parameter type usage +c +c input +c +c n i number of cols/rows of matrix +c nz i nonzeros as declared array size +c rcond r*8 condition number +c shift r*8 main diagonal shift +c +c output +c +c a r*8 array for nonzeros +c colidx i col indices +c rowstr i row pointers +c +c workspace +c +c iv, arow, acol i +c v, aelt r*8 +c--------------------------------------------------------------------- + + integer i, nnza, iouter, ivelt, ivelt1, irow, nzv, jcol + +c--------------------------------------------------------------------- +c nonzer is approximately (int(sqrt(nnza /n))); +c--------------------------------------------------------------------- + + double precision size, ratio, scale + external sparse, sprnvc, vecset + + size = 1.0D0 + ratio = rcond ** (1.0D0 / dfloat(n)) + nnza = 0 + +c--------------------------------------------------------------------- +c Initialize iv(n+1 .. 2n) to zero. +c Used by sprnvc to mark nonzero positions +c--------------------------------------------------------------------- + + do i = 1, n + iv(n+i) = 0 + enddo + do iouter = 1, n + nzv = nonzer + call sprnvc( n, nzv, v, colidx, iv(1), iv(n+1) ) + call vecset( n, v, colidx, nzv, iouter, .5D0 ) + do ivelt = 1, nzv + jcol = colidx(ivelt) + if (jcol.ge.firstcol .and. jcol.le.lastcol) then + scale = size * v(ivelt) + do ivelt1 = 1, nzv + irow = colidx(ivelt1) + if (irow.ge.firstrow .and. irow.le.lastrow) then + nnza = nnza + 1 + if (nnza .gt. nz) goto 9999 + acol(nnza) = jcol + arow(nnza) = irow + aelt(nnza) = v(ivelt1) * scale + endif + enddo + endif + enddo + size = size * ratio + enddo + + +c--------------------------------------------------------------------- +c ... add the identity * rcond to the generated matrix to bound +c the smallest eigenvalue from below by rcond +c--------------------------------------------------------------------- + do i = firstrow, lastrow + if (i.ge.firstcol .and. i.le.lastcol) then + iouter = n + i + nnza = nnza + 1 + if (nnza .gt. nz) goto 9999 + acol(nnza) = i + arow(nnza) = i + aelt(nnza) = rcond - shift + endif + enddo + + +c--------------------------------------------------------------------- +c ... make the sparse matrix from list of elements with duplicates +c (v and iv are used as workspace) +c--------------------------------------------------------------------- + call sparse( a, colidx, rowstr, n, arow, acol, aelt, + > firstrow, lastrow, + > v, iv(1), iv(n+1), nnza ) + return + + 9999 continue + write(*,*) 'Space for matrix elements exceeded in makea' + write(*,*) 'nnza, nzmax = ',nnza, nz + write(*,*) ' iouter = ',iouter + + stop + end +c-------end of makea------------------------------ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine sparse( a, colidx, rowstr, n, arow, acol, aelt, + > firstrow, lastrow, + > x, mark, nzloc, nnza ) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + integer colidx(*), rowstr(*) + integer firstrow, lastrow + integer n, arow(*), acol(*), nnza + double precision a(*), aelt(*) + +c--------------------------------------------------------------------- +c rows range from firstrow to lastrow +c the rowstr pointers are defined for nrows = lastrow-firstrow+1 values +c--------------------------------------------------------------------- + integer nzloc(n), nrows + double precision x(n) + logical mark(n) + +c--------------------------------------------------- +c generate a sparse matrix from a list of +c [col, row, element] tri +c--------------------------------------------------- + + integer i, j, jajp1, nza, k, nzrow + double precision xi + +c--------------------------------------------------------------------- +c how many rows of result +c--------------------------------------------------------------------- + nrows = lastrow - firstrow + 1 + +c--------------------------------------------------------------------- +c ...count the number of triples in each row +c--------------------------------------------------------------------- + do j = 1, n + rowstr(j) = 0 + mark(j) = .false. + enddo + rowstr(n+1) = 0 + + do nza = 1, nnza + j = (arow(nza) - firstrow + 1) + 1 + rowstr(j) = rowstr(j) + 1 + enddo + + rowstr(1) = 1 + do j = 2, nrows+1 + rowstr(j) = rowstr(j) + rowstr(j-1) + enddo + + +c--------------------------------------------------------------------- +c ... rowstr(j) now is the location of the first nonzero +c of row j of a +c--------------------------------------------------------------------- + + +c--------------------------------------------------------------------- +c ... do a bucket sort of the triples on the row index +c--------------------------------------------------------------------- + do nza = 1, nnza + j = arow(nza) - firstrow + 1 + k = rowstr(j) + a(k) = aelt(nza) + colidx(k) = acol(nza) + rowstr(j) = rowstr(j) + 1 + enddo + + +c--------------------------------------------------------------------- +c ... rowstr(j) now points to the first element of row j+1 +c--------------------------------------------------------------------- + do j = nrows, 1, -1 + rowstr(j+1) = rowstr(j) + enddo + rowstr(1) = 1 + + +c--------------------------------------------------------------------- +c ... generate the actual output rows by adding elements +c--------------------------------------------------------------------- + nza = 0 + do i = 1, n + x(i) = 0.0 + mark(i) = .false. + enddo + + jajp1 = rowstr(1) + do j = 1, nrows + nzrow = 0 + +c--------------------------------------------------------------------- +c ...loop over the jth row of a +c--------------------------------------------------------------------- + do k = jajp1 , rowstr(j+1)-1 + i = colidx(k) + x(i) = x(i) + a(k) + if ( (.not. mark(i)) .and. (x(i) .ne. 0.D0)) then + mark(i) = .true. + nzrow = nzrow + 1 + nzloc(nzrow) = i + endif + enddo + +c--------------------------------------------------------------------- +c ... extract the nonzeros of this row +c--------------------------------------------------------------------- + do k = 1, nzrow + i = nzloc(k) + mark(i) = .false. + xi = x(i) + x(i) = 0.D0 + if (xi .ne. 0.D0) then + nza = nza + 1 + a(nza) = xi + colidx(nza) = i + endif + enddo + jajp1 = rowstr(j+1) + rowstr(j+1) = nza + rowstr(1) + enddo +CC write (*, 11000) nza + return +11000 format ( //,'final nonzero count in sparse ', + 1 /,'number of nonzeros = ', i16 ) + end +c-------end of sparse----------------------------- + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine sprnvc( n, nz, v, iv, nzloc, mark ) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + double precision v(*) + integer n, nz, iv(*), nzloc(n), nn1 + integer mark(n) + common /urando/ amult, tran + double precision amult, tran + + +c--------------------------------------------------------------------- +c generate a sparse n-vector (v, iv) +c having nzv nonzeros +c +c mark(i) is set to 1 if position i is nonzero. +c mark is all zero on entry and is reset to all zero before exit +c this corrects a performance bug found by John G. Lewis, caused by +c reinitialization of mark on every one of the n calls to sprnvc +c--------------------------------------------------------------------- + + integer nzrow, nzv, ii, i, icnvrt + + external randlc, icnvrt + double precision randlc, vecelt, vecloc + + + nzv = 0 + nzrow = 0 + nn1 = 1 + 50 continue + nn1 = 2 * nn1 + if (nn1 .lt. n) goto 50 + +c--------------------------------------------------------------------- +c nn1 is the smallest power of two not less than n +c--------------------------------------------------------------------- + +100 continue + if (nzv .ge. nz) goto 110 + vecelt = randlc( tran, amult ) + +c--------------------------------------------------------------------- +c generate an integer between 1 and n in a portable manner +c--------------------------------------------------------------------- + vecloc = randlc(tran, amult) + i = icnvrt(vecloc, nn1) + 1 + if (i .gt. n) goto 100 + +c--------------------------------------------------------------------- +c was this integer generated already? +c--------------------------------------------------------------------- + if (mark(i) .eq. 0) then + mark(i) = 1 + nzrow = nzrow + 1 + nzloc(nzrow) = i + nzv = nzv + 1 + v(nzv) = vecelt + iv(nzv) = i + endif + goto 100 +110 continue + do ii = 1, nzrow + i = nzloc(ii) + mark(i) = 0 + enddo + return + end +c-------end of sprnvc----------------------------- + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + function icnvrt(x, ipwr2) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + double precision x + integer ipwr2, icnvrt + +c--------------------------------------------------------------------- +c scale a double precision number x in (0,1) by a power of 2 and chop it +c--------------------------------------------------------------------- + icnvrt = int(ipwr2 * x) + + return + end +c-------end of icnvrt----------------------------- + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + subroutine vecset(n, v, iv, nzv, i, val) +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + integer n, iv(*), nzv, i, k + double precision v(*), val + +c--------------------------------------------------------------------- +c set ith element of sparse vector (v, iv) with +c nzv nonzeros to val +c--------------------------------------------------------------------- + + logical set + + set = .false. + do k = 1, nzv + if (iv(k) .eq. i) then + v(k) = val + set = .true. + endif + enddo + if (.not. set) then + nzv = nzv + 1 + v(nzv) = val + iv(nzv) = i + endif + return + end +c-------end of vecset----------------------------- + diff --git b/NPB3.3-MPI/CG/mpinpb.h a/NPB3.3-MPI/CG/mpinpb.h new file mode 100644 index 0000000..1f0368c --- /dev/null +++ a/NPB3.3-MPI/CG/mpinpb.h @@ -0,0 +1,9 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'mpif.h' + + integer me, nprocs, root, dp_type + common /mpistuff/ me, nprocs, root, dp_type + diff --git b/NPB3.3-MPI/CG/timing.h a/NPB3.3-MPI/CG/timing.h new file mode 100644 index 0000000..2000af1 --- /dev/null +++ a/NPB3.3-MPI/CG/timing.h @@ -0,0 +1,5 @@ + integer t_total, t_conjg, t_rcomm, t_ncomm, t_last + parameter (t_total=1, t_conjg=2, t_rcomm=3, t_ncomm=4, t_last=4) + + logical timeron + common /timers/ timeron diff --git b/NPB3.3-MPI/DT/DGraph.c a/NPB3.3-MPI/DT/DGraph.c new file mode 100644 index 0000000..5d5839d --- /dev/null +++ a/NPB3.3-MPI/DT/DGraph.c @@ -0,0 +1,184 @@ +#include +#include +#include + +#include "DGraph.h" + +DGArc *newArc(DGNode *tl,DGNode *hd){ + DGArc *ar=(DGArc *)malloc(sizeof(DGArc)); + ar->tail=tl; + ar->head=hd; + return ar; +} +void arcShow(DGArc *ar){ + DGNode *tl=(DGNode *)ar->tail, + *hd=(DGNode *)ar->head; + fprintf(stderr,"%d. |%s ->%s\n",ar->id,tl->name,hd->name); +} + +DGNode *newNode(char *nm){ + DGNode *nd=(DGNode *)malloc(sizeof(DGNode)); + nd->attribute=0; + nd->color=0; + nd->inDegree=0; + nd->outDegree=0; + nd->maxInDegree=SMALL_BLOCK_SIZE; + nd->maxOutDegree=SMALL_BLOCK_SIZE; + nd->inArc=(DGArc **)malloc(nd->maxInDegree*sizeof(DGArc*)); + nd->outArc=(DGArc **)malloc(nd->maxOutDegree*sizeof(DGArc*)); + nd->name=strdup(nm); + nd->feat=NULL; + return nd; +} +void nodeShow(DGNode* nd){ + fprintf( stderr,"%3d.%s: (%d,%d)\n", + nd->id,nd->name,nd->inDegree,nd->outDegree); +/* + if(nd->verified==1) fprintf(stderr,"%ld.%s\t: usable.",nd->id,nd->name); + else if(nd->verified==0) fprintf(stderr,"%ld.%s\t: unusable.",nd->id,nd->name); + else fprintf(stderr,"%ld.%s\t: notverified.",nd->id,nd->name); +*/ +} + +DGraph* newDGraph(char* nm){ + DGraph *dg=(DGraph *)malloc(sizeof(DGraph)); + dg->numNodes=0; + dg->numArcs=0; + dg->maxNodes=BLOCK_SIZE; + dg->maxArcs=BLOCK_SIZE; + dg->node=(DGNode **)malloc(dg->maxNodes*sizeof(DGNode*)); + dg->arc=(DGArc **)malloc(dg->maxArcs*sizeof(DGArc*)); + dg->name=strdup(nm); + return dg; +} +int AttachNode(DGraph* dg, DGNode* nd) { + int i=0,j,len=0; + DGNode **nds =NULL, *tmpnd=NULL; + DGArc **ar=NULL; + + if (dg->numNodes == dg->maxNodes-1 ) { + dg->maxNodes += BLOCK_SIZE; + nds =(DGNode **) calloc(dg->maxNodes,sizeof(DGNode*)); + memcpy(nds,dg->node,(dg->maxNodes-BLOCK_SIZE)*sizeof(DGNode*)); + free(dg->node); + dg->node=nds; + } + + len = strlen( nd->name); + for (i = 0; i < dg->numNodes; i++) { + tmpnd =dg->node[ i]; + ar=NULL; + if ( strlen( tmpnd->name) != len ) continue; + if ( strncmp( nd->name, tmpnd->name, len) ) continue; + if ( nd->inDegree > 0 ) { + tmpnd->maxInDegree += nd->maxInDegree; + ar =(DGArc **) calloc(tmpnd->maxInDegree,sizeof(DGArc*)); + memcpy(ar,tmpnd->inArc,(tmpnd->inDegree)*sizeof(DGArc*)); + free(tmpnd->inArc); + tmpnd->inArc=ar; + for (j = 0; j < nd->inDegree; j++ ) { + nd->inArc[ j]->head = tmpnd; + } + memcpy( &(tmpnd->inArc[ tmpnd->inDegree]), nd->inArc, nd->inDegree*sizeof( DGArc *)); + tmpnd->inDegree += nd->inDegree; + } + if ( nd->outDegree > 0 ) { + tmpnd->maxOutDegree += nd->maxOutDegree; + ar =(DGArc **) calloc(tmpnd->maxOutDegree,sizeof(DGArc*)); + memcpy(ar,tmpnd->outArc,(tmpnd->outDegree)*sizeof(DGArc*)); + free(tmpnd->outArc); + tmpnd->outArc=ar; + for (j = 0; j < nd->outDegree; j++ ) { + nd->outArc[ j]->tail = tmpnd; + } + memcpy( &(tmpnd->outArc[tmpnd->outDegree]),nd->outArc,nd->outDegree*sizeof( DGArc *)); + tmpnd->outDegree += nd->outDegree; + } + free(nd); + return i; + } + nd->id = dg->numNodes; + dg->node[dg->numNodes] = nd; + dg->numNodes++; +return nd->id; +} +int AttachArc(DGraph *dg,DGArc* nar){ +int arcId = -1; +int i=0,newNumber=0; +DGNode *head = nar->head, + *tail = nar->tail; +DGArc **ars=NULL,*probe=NULL; +/*fprintf(stderr,"AttachArc %ld\n",dg->numArcs); */ + if ( !tail || !head ) return arcId; + if ( dg->numArcs == dg->maxArcs-1 ) { + dg->maxArcs += BLOCK_SIZE; + ars =(DGArc **) calloc(dg->maxArcs,sizeof(DGArc*)); + memcpy(ars,dg->arc,(dg->maxArcs-BLOCK_SIZE)*sizeof(DGArc*)); + free(dg->arc); + dg->arc=ars; + } + for(i = 0; i < tail->outDegree; i++ ) { /* parallel arc */ + probe = tail->outArc[ i]; + if(probe->head == head + && + probe->length == nar->length + ){ + free(nar); + return probe->id; + } + } + + nar->id = dg->numArcs; + arcId=dg->numArcs; + dg->arc[dg->numArcs] = nar; + dg->numArcs++; + + head->inArc[ head->inDegree] = nar; + head->inDegree++; + if ( head->inDegree >= head->maxInDegree ) { + newNumber = head->maxInDegree + SMALL_BLOCK_SIZE; + ars =(DGArc **) calloc(newNumber,sizeof(DGArc*)); + memcpy(ars,head->inArc,(head->inDegree)*sizeof(DGArc*)); + free(head->inArc); + head->inArc=ars; + head->maxInDegree = newNumber; + } + tail->outArc[ tail->outDegree] = nar; + tail->outDegree++; + if(tail->outDegree >= tail->maxOutDegree ) { + newNumber = tail->maxOutDegree + SMALL_BLOCK_SIZE; + ars =(DGArc **) calloc(newNumber,sizeof(DGArc*)); + memcpy(ars,tail->outArc,(tail->outDegree)*sizeof(DGArc*)); + free(tail->outArc); + tail->outArc=ars; + tail->maxOutDegree = newNumber; + } +/*fprintf(stderr,"AttachArc: head->in=%d tail->out=%ld\n",head->inDegree,tail->outDegree);*/ +return arcId; +} +void graphShow(DGraph *dg,int DetailsLevel){ + int i=0,j=0; + fprintf(stderr,"%d.%s: (%d,%d)\n",dg->id,dg->name,dg->numNodes,dg->numArcs); + if ( DetailsLevel < 1) return; + for (i = 0; i < dg->numNodes; i++ ) { + DGNode *focusNode = dg->node[ i]; + if(DetailsLevel >= 2) { + for (j = 0; j < focusNode->inDegree; j++ ) { + fprintf(stderr,"\t "); + nodeShow(focusNode->inArc[ j]->tail); + } + } + nodeShow(focusNode); + if ( DetailsLevel < 2) continue; + for (j = 0; j < focusNode->outDegree; j++ ) { + fprintf(stderr, "\t "); + nodeShow(focusNode->outArc[ j]->head); + } + fprintf(stderr, "---\n"); + } + fprintf(stderr,"----------------------------------------\n"); + if ( DetailsLevel < 3) return; +} + + + diff --git b/NPB3.3-MPI/DT/DGraph.h a/NPB3.3-MPI/DT/DGraph.h new file mode 100644 index 0000000..f38f898 --- /dev/null +++ a/NPB3.3-MPI/DT/DGraph.h @@ -0,0 +1,43 @@ +#ifndef _DGRAPH +#define _DGRAPH + +#define BLOCK_SIZE 128 +#define SMALL_BLOCK_SIZE 32 + +typedef struct{ + int id; + void *tail,*head; + int length,width,attribute,maxWidth; +}DGArc; + +typedef struct{ + int maxInDegree,maxOutDegree; + int inDegree,outDegree; + int id; + char *name; + DGArc **inArc,**outArc; + int depth,height,width; + int color,attribute,address,verified; + void *feat; +}DGNode; + +typedef struct{ + int maxNodes,maxArcs; + int id; + char *name; + int numNodes,numArcs; + DGNode **node; + DGArc **arc; +} DGraph; + +DGArc *newArc(DGNode *tl,DGNode *hd); +void arcShow(DGArc *ar); +DGNode *newNode(char *nm); +void nodeShow(DGNode* nd); + +DGraph* newDGraph(char *nm); +int AttachNode(DGraph *dg,DGNode *nd); +int AttachArc(DGraph *dg,DGArc* nar); +void graphShow(DGraph *dg,int DetailsLevel); + +#endif diff --git b/NPB3.3-MPI/DT/Makefile a/NPB3.3-MPI/DT/Makefile new file mode 100644 index 0000000..687ac33 --- /dev/null +++ a/NPB3.3-MPI/DT/Makefile @@ -0,0 +1,26 @@ +SHELL=/bin/sh +BENCHMARK=dt +BENCHMARKU=DT + +include ../config/make.def + +include ../sys/make.common +#Override PROGRAM +DTPROGRAM = $(BINDIR)/$(BENCHMARK).$(CLASS).x + +OBJS = dt.o DGraph.o \ + ${COMMON}/c_print_results.o ${COMMON}/c_timers.o ${COMMON}/c_randdp.o + + +${PROGRAM}: config ${OBJS} + ${CLINK} ${CLINKFLAGS} -o ${DTPROGRAM} ${OBJS} ${CMPI_LIB} + +.c.o: + ${CCOMPILE} $< + +dt.o: dt.c npbparams.h +DGraph.o: DGraph.c DGraph.h + +clean: + - rm -f *.o *~ mputil* + - rm -f dt npbparams.h core diff --git b/NPB3.3-MPI/DT/README a/NPB3.3-MPI/DT/README new file mode 100644 index 0000000..873e3ae --- /dev/null +++ a/NPB3.3-MPI/DT/README @@ -0,0 +1,22 @@ +Data Traffic benchmark DT is new in the NPB suite +(released as part of NPB3.x-MPI package). +---------------------------------------------------- + +DT is written in C and same executable can run on any number of processors, +provided this number is not less than the number of nodes in the communication +graph. DT benchmark takes one argument: BH, WH, or SH. This argument +specifies the communication graph Black Hole, White Hole, or SHuffle +respectively. The current release contains verification numbers for +CLASSES S, W, A, and B only. Classes C and D are defined, but verification +numbers are not provided in this release. + +The following table summarizes the number of nodes in the communication +graph based on CLASS and graph TYPE. + +CLASS N_Source N_Nodes(BH,WH) N_Nodes(SH) + S 4 5 12 + W 8 11 32 + A 16 21 80 + B 32 43 192 + C 64 85 448 + D 128 171 1024 diff --git b/NPB3.3-MPI/DT/dt.c a/NPB3.3-MPI/DT/dt.c new file mode 100644 index 0000000..1ee85f6 --- /dev/null +++ a/NPB3.3-MPI/DT/dt.c @@ -0,0 +1,759 @@ +/************************************************************************* + * * + * N A S P A R A L L E L B E N C H M A R K S 3.3 * + * * + * D T * + * * + ************************************************************************* + * * + * This benchmark is part of the NAS Parallel Benchmark 3.3 suite. * + * * + * Permission to use, copy, distribute and modify this software * + * for any purpose with or without fee is hereby granted. We * + * request, however, that all derived work reference the NAS * + * Parallel Benchmarks 3.3. This software is provided "as is" * + * without express or implied warranty. * + * * + * Information on NPB 3.3, including the technical report, the * + * original specifications, source code, results and information * + * on how to submit new results, is available at: * + * * + * http: www.nas.nasa.gov/Software/NPB * + * * + * Send comments or suggestions to npb@nas.nasa.gov * + * Send bug reports to npb-bugs@nas.nasa.gov * + * * + * NAS Parallel Benchmarks Group * + * NASA Ames Research Center * + * Mail Stop: T27A-1 * + * Moffett Field, CA 94035-1000 * + * * + * E-mail: npb@nas.nasa.gov * + * Fax: (650) 604-3957 * + * * + ************************************************************************* + * * + * Author: M. Frumkin * * + * * + *************************************************************************/ + +#include +#include +#include + +#include "mpi.h" +#include "npbparams.h" + +#ifndef CLASS +#define CLASS 'S' +#define NUM_PROCS 1 +#endif + +int passed_verification; +extern double randlc( double *X, double *A ); +extern +void c_print_results( char *name, + char class, + int n1, + int n2, + int n3, + int niter, + int nprocs_compiled, + int nprocs_total, + double t, + double mops, + char *optype, + int passed_verification, + char *npbversion, + char *compiletime, + char *mpicc, + char *clink, + char *cmpi_lib, + char *cmpi_inc, + char *cflags, + char *clinkflags ); + +void timer_clear( int n ); +void timer_start( int n ); +void timer_stop( int n ); +double timer_read( int n ); +int timer_on=0,timers_tot=64; + +int verify(char *bmname,double rnm2){ + double verify_value=0.0; + double epsilon=1.0E-8; + char cls=CLASS; + int verified=-1; + if (cls != 'U') { + if(cls=='S') { + if(strstr(bmname,"BH")){ + verify_value=30892725.0; + }else if(strstr(bmname,"WH")){ + verify_value=67349758.0; + }else if(strstr(bmname,"SH")){ + verify_value=58875767.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = 0; + }else if(cls=='W') { + if(strstr(bmname,"BH")){ + verify_value = 4102461.0; + }else if(strstr(bmname,"WH")){ + verify_value = 204280762.0; + }else if(strstr(bmname,"SH")){ + verify_value = 186944764.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = 0; + }else if(cls=='A') { + if(strstr(bmname,"BH")){ + verify_value = 17809491.0; + }else if(strstr(bmname,"WH")){ + verify_value = 1289925229.0; + }else if(strstr(bmname,"SH")){ + verify_value = 610856482.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = 0; + }else if(cls=='B') { + if(strstr(bmname,"BH")){ + verify_value = 4317114.0; + }else if(strstr(bmname,"WH")){ + verify_value = 7877279917.0; + }else if(strstr(bmname,"SH")){ + verify_value = 1836863082.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + verified = 0; + } + }else if(cls=='C') { + if(strstr(bmname,"BH")){ + verify_value = 0.0; + }else if(strstr(bmname,"WH")){ + verify_value = 0.0; + }else if(strstr(bmname,"SH")){ + verify_value = 0.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + verified = -1; + } + }else if(cls=='D') { + if(strstr(bmname,"BH")){ + verify_value = 0.0; + }else if(strstr(bmname,"WH")){ + verify_value = 0.0; + }else if(strstr(bmname,"SH")){ + verify_value = 0.0; + }else{ + fprintf(stderr,"No such benchmark as %s.\n",bmname); + } + verified = -1; + }else{ + fprintf(stderr,"No such class as %c.\n",cls); + } + fprintf(stderr," %s L2 Norm = %f\n",bmname,rnm2); + if(verified==-1){ + fprintf(stderr," No verification was performed.\n"); + }else if( rnm2 - verify_value < epsilon && + rnm2 - verify_value > -epsilon) { /* abs here does not work on ALTIX */ + verified = 1; + fprintf(stderr," Deviation = %f\n",(rnm2 - verify_value)); + }else{ + verified = 0; + fprintf(stderr," The correct verification value = %f\n",verify_value); + fprintf(stderr," Got value = %f\n",rnm2); + } + }else{ + verified = -1; + } + return verified; + } + +int ipowMod(int a,long long int n,int md){ + int seed=1,q=a,r=1; + if(n<0){ + fprintf(stderr,"ipowMod: exponent must be nonnegative exp=%lld\n",n); + n=-n; /* temp fix */ +/* return 1; */ + } + if(md<=0){ + fprintf(stderr,"ipowMod: module must be positive mod=%d",md); + return 1; + } + if(n==0) return 1; + while(n>1){ + int n2 = n/2; + if (n2*2==n){ + seed = (q*q)%md; + q=seed; + n = n2; + }else{ + seed = (r*q)%md; + r=seed; + n = n-1; + } + } + seed = (r*q)%md; + return seed; +} + +#include "DGraph.h" +DGraph *buildSH(char cls){ +/* + Nodes of the graph must be topologically sorted + to avoid MPI deadlock. +*/ + DGraph *dg; + int numSources=NUM_SOURCES; /* must be power of 2 */ + int numOfLayers=0,tmpS=numSources>>1; + int firstLayerNode=0; + DGArc *ar=NULL; + DGNode *nd=NULL; + int mask=0x0,ndid=0,ndoff=0; + int i=0,j=0; + char nm[BLOCK_SIZE]; + + sprintf(nm,"DT_SH.%c",cls); + dg=newDGraph(nm); + + while(tmpS>1){ + numOfLayers++; + tmpS>>=1; + } + for(i=0;inode[ndid],nd); + AttachArc(dg,ar); + ndoff+=mask; + ndid=firstLayerNode+ndoff; + ar=newArc(dg->node[ndid],nd); + AttachArc(dg,ar); + } + firstLayerNode+=numSources; + } + mask=0x00000001<node[ndid],nd); + AttachArc(dg,ar); + ndoff+=mask; + ndid=firstLayerNode+ndoff; + ar=newArc(dg->node[ndid],nd); + AttachArc(dg,ar); + } +return dg; +} +DGraph *buildWH(char cls){ +/* + Nodes of the graph must be topologically sorted + to avoid MPI deadlock. +*/ + int i=0,j=0; + int numSources=NUM_SOURCES,maxInDeg=4; + int numLayerNodes=numSources,firstLayerNode=0; + int totComparators=0; + int numPrevLayerNodes=numLayerNodes; + int id=0,sid=0; + DGraph *dg; + DGNode *nd=NULL,*source=NULL,*tmp=NULL,*snd=NULL; + DGArc *ar=NULL; + char nm[BLOCK_SIZE]; + + sprintf(nm,"DT_WH.%c",cls); + dg=newDGraph(nm); + + for(i=0;imaxInDeg){ + numLayerNodes=numLayerNodes/maxInDeg; + if(numLayerNodes*maxInDeg=numPrevLayerNodes) break; + snd=dg->node[firstLayerNode+sid]; + ar=newArc(dg->node[id],snd); + AttachArc(dg,ar); + } + } + firstLayerNode+=numPrevLayerNodes; + numPrevLayerNodes=numLayerNodes; + } + source=newNode("Source"); + AttachNode(dg,source); + for(i=0;inode[firstLayerNode+i]; + ar=newArc(source,nd); + AttachArc(dg,ar); + } + + for(i=0;inumNodes/2;i++){ /* Topological sorting */ + tmp=dg->node[i]; + dg->node[i]=dg->node[dg->numNodes-1-i]; + dg->node[i]->id=i; + dg->node[dg->numNodes-1-i]=tmp; + dg->node[dg->numNodes-1-i]->id=dg->numNodes-1-i; + } +return dg; +} +DGraph *buildBH(char cls){ +/* + Nodes of the graph must be topologically sorted + to avoid MPI deadlock. +*/ + int i=0,j=0; + int numSources=NUM_SOURCES,maxInDeg=4; + int numLayerNodes=numSources,firstLayerNode=0; + DGraph *dg; + DGNode *nd=NULL, *snd=NULL, *sink=NULL; + DGArc *ar=NULL; + int totComparators=0; + int numPrevLayerNodes=numLayerNodes; + int id=0, sid=0; + char nm[BLOCK_SIZE]; + + sprintf(nm,"DT_BH.%c",cls); + dg=newDGraph(nm); + + for(i=0;imaxInDeg){ + numLayerNodes=numLayerNodes/maxInDeg; + if(numLayerNodes*maxInDeg=numPrevLayerNodes) break; + snd=dg->node[firstLayerNode+sid]; + ar=newArc(snd,dg->node[id]); + AttachArc(dg,ar); + } + } + firstLayerNode+=numPrevLayerNodes; + numPrevLayerNodes=numLayerNodes; + } + sink=newNode("Sink"); + AttachNode(dg,sink); + for(i=0;inode[firstLayerNode+i]; + ar=newArc(nd,sink); + AttachArc(dg,ar); + } +return dg; +} + +typedef struct{ + int len; + double* val; +} Arr; +Arr *newArr(int len){ + Arr *arr=(Arr *)malloc(sizeof(Arr)); + arr->len=len; + arr->val=(double *)malloc(len*sizeof(double)); + return arr; +} +void arrShow(Arr* a){ + if(!a) fprintf(stderr,"-- NULL array\n"); + else{ + fprintf(stderr,"-- length=%d\n",a->len); + } +} +double CheckVal(Arr *feat){ + double csum=0.0; + int i=0; + for(i=0;ilen;i++){ + csum+=feat->val[i]*feat->val[i]/feat->len; /* The truncation does not work since + result will be 0 for large len */ + } + return csum; +} +int GetFNumDPar(int* mean, int* stdev){ + *mean=NUM_SAMPLES; + *stdev=STD_DEVIATION; + return 0; +} +int GetFeatureNum(char *mbname,int id){ + double tran=314159265.0; + double A=2*id+1; + double denom=randlc(&tran,&A); + char cval='S'; + int mean=NUM_SAMPLES,stdev=128; + int rtfs=0,len=0; + GetFNumDPar(&mean,&stdev); + rtfs=ipowMod((int)(1/denom)*(int)cval,(long long int) (2*id+1),2*stdev); + if(rtfs<0) rtfs=-rtfs; + len=mean-stdev+rtfs; + return len; +} +Arr* RandomFeatures(char *bmname,int fdim,int id){ + int len=GetFeatureNum(bmname,id)*fdim; + Arr* feat=newArr(len); + int nxg=2,nyg=2,nzg=2,nfg=5; + int nx=421,ny=419,nz=1427,nf=3527; + long long int expon=(len*(id+1))%3141592; + int seedx=ipowMod(nxg,expon,nx), + seedy=ipowMod(nyg,expon,ny), + seedz=ipowMod(nzg,expon,nz), + seedf=ipowMod(nfg,expon,nf); + int i=0; + if(timer_on){ + timer_clear(id+1); + timer_start(id+1); + } + for(i=0;ival[i]=seedx; + feat->val[i+1]=seedy; + feat->val[i+2]=seedz; + feat->val[i+3]=seedf; + } + if(timer_on){ + timer_stop(id+1); + fprintf(stderr,"** RandomFeatures time in node %d = %f\n",id,timer_read(id+1)); + } + return feat; +} +void Resample(Arr *a,int blen){ + long long int i=0,j=0,jlo=0,jhi=0; + double avval=0.0; + double *nval=(double *)malloc(blen*sizeof(double)); + Arr *tmp=newArr(10); + for(i=0;ilen-1;i++){ + jlo=(int)(0.5*(2*i-1)*(blen/a->len)); + jhi=(int)(0.5*(2*i+1)*(blen/a->len)); + + avval=a->val[i]/(jhi-jlo+1); + for(j=jlo;j<=jhi;j++){ + nval[j]+=avval; + } + } + nval[0]=a->val[0]; + nval[blen-1]=a->val[a->len-1]; + free(a->val); + a->val=nval; + a->len=blen; +} +#define fielddim 4 +Arr* WindowFilter(Arr *a, Arr* b,int w){ + int i=0,j=0,k=0; + double rms0=0.0,rms1=0.0,rmsm1=0.0; + double weight=((double) (w+1))/(w+2); + + w+=1; + if(timer_on){ + timer_clear(w); + timer_start(w); + } + if(a->lenlen) Resample(a,b->len); + if(a->len>b->len) Resample(b,a->len); + for(i=fielddim;ilen-fielddim;i+=fielddim){ + rms0=(a->val[i]-b->val[i])*(a->val[i]-b->val[i]) + +(a->val[i+1]-b->val[i+1])*(a->val[i+1]-b->val[i+1]) + +(a->val[i+2]-b->val[i+2])*(a->val[i+2]-b->val[i+2]) + +(a->val[i+3]-b->val[i+3])*(a->val[i+3]-b->val[i+3]); + j=i+fielddim; + rms1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j]) + +(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1]) + +(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2]) + +(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]); + j=i-fielddim; + rmsm1=(a->val[j]-b->val[j])*(a->val[j]-b->val[j]) + +(a->val[j+1]-b->val[j+1])*(a->val[j+1]-b->val[j+1]) + +(a->val[j+2]-b->val[j+2])*(a->val[j+2]-b->val[j+2]) + +(a->val[j+3]-b->val[j+3])*(a->val[j+3]-b->val[j+3]); + k=0; + if(rms1val[i]=weight*b->val[i]; + a->val[i+1]=weight*b->val[i+1]; + a->val[i+2]=weight*b->val[i+2]; + a->val[i+3]=weight*b->val[i+3]; + }else if(k==1){ + j=i+fielddim; + a->val[i]=weight*b->val[j]; + a->val[i+1]=weight*b->val[j+1]; + a->val[i+2]=weight*b->val[j+2]; + a->val[i+3]=weight*b->val[j+3]; + }else { /*if(k==-1)*/ + j=i-fielddim; + a->val[i]=weight*b->val[j]; + a->val[i+1]=weight*b->val[j+1]; + a->val[i+2]=weight*b->val[j+2]; + a->val[i+3]=weight*b->val[j+3]; + } + } + if(timer_on){ + timer_stop(w); + fprintf(stderr,"** WindowFilter time in node %d = %f\n",(w-1),timer_read(w)); + } + return a; +} + +int SendResults(DGraph *dg,DGNode *nd,Arr *feat){ + int i=0,tag=0; + DGArc *ar=NULL; + DGNode *head=NULL; + if(!feat) return 0; + for(i=0;ioutDegree;i++){ + ar=nd->outArc[i]; + if(ar->tail!=nd) continue; + head=ar->head; + tag=ar->id; + if(head->address!=nd->address){ + MPI_Send(&feat->len,1,MPI_INT,head->address,tag,MPI_COMM_WORLD); + MPI_Send(feat->val,feat->len,MPI_DOUBLE,head->address,tag,MPI_COMM_WORLD); + } + } + return 1; +} +Arr* CombineStreams(DGraph *dg,DGNode *nd){ + Arr *resfeat=newArr(NUM_SAMPLES*fielddim); + int i=0,len=0,tag=0; + DGArc *ar=NULL; + DGNode *tail=NULL; + MPI_Status status; + Arr *feat=NULL,*featp=NULL; + + if(nd->inDegree==0) return NULL; + for(i=0;iinDegree;i++){ + ar=nd->inArc[i]; + if(ar->head!=nd) continue; + tail=ar->tail; + if(tail->address!=nd->address){ + len=0; + tag=ar->id; + MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status); + feat=newArr(len); + MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status); + resfeat=WindowFilter(resfeat,feat,nd->id); + free(feat); + }else{ + featp=(Arr *)tail->feat; + feat=newArr(featp->len); + memcpy(feat->val,featp->val,featp->len*sizeof(double)); + resfeat=WindowFilter(resfeat,feat,nd->id); + free(feat); + } + } + for(i=0;ilen;i++) resfeat->val[i]=((int)resfeat->val[i])/nd->inDegree; + nd->feat=resfeat; + return nd->feat; +} +double Reduce(Arr *a,int w){ + double retv=0.0; + if(timer_on){ + timer_clear(w); + timer_start(w); + } + retv=(int)(w*CheckVal(a));/* The casting needed for node + and array dependent verifcation */ + if(timer_on){ + timer_stop(w); + fprintf(stderr,"** Reduce time in node %d = %f\n",(w-1),timer_read(w)); + } + return retv; +} + +double ReduceStreams(DGraph *dg,DGNode *nd){ + double csum=0.0; + int i=0,len=0,tag=0; + DGArc *ar=NULL; + DGNode *tail=NULL; + Arr *feat=NULL; + double retv=0.0; + + for(i=0;iinDegree;i++){ + ar=nd->inArc[i]; + if(ar->head!=nd) continue; + tail=ar->tail; + if(tail->address!=nd->address){ + MPI_Status status; + len=0; + tag=ar->id; + MPI_Recv(&len,1,MPI_INT,tail->address,tag,MPI_COMM_WORLD,&status); + feat=newArr(len); + MPI_Recv(feat->val,feat->len,MPI_DOUBLE,tail->address,tag,MPI_COMM_WORLD,&status); + csum+=Reduce(feat,(nd->id+1)); + free(feat); + }else{ + csum+=Reduce(tail->feat,(nd->id+1)); + } + } + if(nd->inDegree>0)csum=(((long long int)csum)/nd->inDegree); + retv=(nd->id+1)*csum; + return retv; +} + +int ProcessNodes(DGraph *dg,int me){ + double chksum=0.0; + Arr *feat=NULL; + int i=0,verified=0,tag; + DGNode *nd=NULL; + double rchksum=0.0; + MPI_Status status; + + for(i=0;inumNodes;i++){ + nd=dg->node[i]; + if(nd->address!=me) continue; + if(strstr(nd->name,"Source")){ + nd->feat=RandomFeatures(dg->name,fielddim,nd->id); + SendResults(dg,nd,nd->feat); + }else if(strstr(nd->name,"Sink")){ + chksum=ReduceStreams(dg,nd); + tag=dg->numArcs+nd->id; /* make these to avoid clash with arc tags */ + MPI_Send(&chksum,1,MPI_DOUBLE,0,tag,MPI_COMM_WORLD); + }else{ + feat=CombineStreams(dg,nd); + SendResults(dg,nd,feat); + } + } + if(me==0){ /* Report node */ + rchksum=0.0; + chksum=0.0; + for(i=0;inumNodes;i++){ + nd=dg->node[i]; + if(!strstr(nd->name,"Sink")) continue; + tag=dg->numArcs+nd->id; /* make these to avoid clash with arc tags */ + MPI_Recv(&rchksum,1,MPI_DOUBLE,nd->address,tag,MPI_COMM_WORLD,&status); + chksum+=rchksum; + } + verified=verify(dg->name,chksum); + } +return verified; +} + +int main(int argc,char **argv ){ + int my_rank,comm_size; + int i; + DGraph *dg=NULL; + int verified=0, featnum=0; + double bytes_sent=2.0,tot_time=0.0; + + MPI_Init( &argc, &argv ); + MPI_Comm_rank( MPI_COMM_WORLD, &my_rank ); + MPI_Comm_size( MPI_COMM_WORLD, &comm_size ); + + if(argc!=2|| + ( strncmp(argv[1],"BH",2)!=0 + &&strncmp(argv[1],"WH",2)!=0 + &&strncmp(argv[1],"SH",2)!=0 + ) + ){ + if(my_rank==0){ + fprintf(stderr,"** Usage: mpirun -np N ../bin/dt.S GraphName\n"); + fprintf(stderr,"** Where \n - N is integer number of MPI processes\n"); + fprintf(stderr," - S is the class S, W, or A \n"); + fprintf(stderr," - GraphName is the communication graph name BH, WH, or SH.\n"); + fprintf(stderr," - the number of MPI processes N should not be be less than \n"); + fprintf(stderr," the number of nodes in the graph\n"); + } + MPI_Finalize(); + exit(0); + } + if(strncmp(argv[1],"BH",2)==0){ + dg=buildBH(CLASS); + }else if(strncmp(argv[1],"WH",2)==0){ + dg=buildWH(CLASS); + }else if(strncmp(argv[1],"SH",2)==0){ + dg=buildSH(CLASS); + } + + if(timer_on&&dg->numNodes+1>timers_tot){ + timer_on=0; + if(my_rank==0) + fprintf(stderr,"Not enough timers. Node timeing is off. \n"); + } + if(dg->numNodes>comm_size){ + if(my_rank==0){ + fprintf(stderr,"** The number of MPI processes should not be less than \n"); + fprintf(stderr,"** the number of nodes in the graph\n"); + fprintf(stderr,"** Number of MPI processes = %d\n",comm_size); + fprintf(stderr,"** Number nodes in the graph = %d\n",dg->numNodes); + } + MPI_Finalize(); + exit(0); + } + for(i=0;inumNodes;i++){ + dg->node[i]->address=i; + } + if( my_rank == 0 ){ + printf( "\n\n NAS Parallel Benchmarks 3.3 -- DT Benchmark\n\n" ); + graphShow(dg,0); + timer_clear(0); + timer_start(0); + } + verified=ProcessNodes(dg,my_rank); + + featnum=NUM_SAMPLES*fielddim; + bytes_sent=featnum*dg->numArcs; + bytes_sent/=1048576; + if(my_rank==0){ + timer_stop(0); + tot_time=timer_read(0); + c_print_results( dg->name, + CLASS, + featnum, + 0, + 0, + dg->numNodes, + 0, + comm_size, + tot_time, + bytes_sent/tot_time, + "bytes transmitted", + verified, + NPBVERSION, + COMPILETIME, + MPICC, + CLINK, + CMPI_LIB, + CMPI_INC, + CFLAGS, + CLINKFLAGS ); + } + MPI_Finalize(); + return 1; +} diff --git b/NPB3.3-MPI/EP/Makefile a/NPB3.3-MPI/EP/Makefile new file mode 100644 index 0000000..5fa8cc3 --- /dev/null +++ a/NPB3.3-MPI/EP/Makefile @@ -0,0 +1,23 @@ +SHELL=/bin/sh +BENCHMARK=ep +BENCHMARKU=EP + +include ../config/make.def + +OBJS = ep.o ${COMMON}/print_results.o ${COMMON}/${RAND}.o ${COMMON}/timers.o + +include ../sys/make.common + +${PROGRAM}: config ${OBJS} + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB} + + +ep.o: ep.f mpinpb.h npbparams.h + ${FCOMPILE} ep.f + +clean: + - rm -f *.o *~ + - rm -f npbparams.h core + + + diff --git b/NPB3.3-MPI/EP/README a/NPB3.3-MPI/EP/README new file mode 100644 index 0000000..6eb3657 --- /dev/null +++ a/NPB3.3-MPI/EP/README @@ -0,0 +1,6 @@ +This code implements the random-number generator described in the +NAS Parallel Benchmark document RNR Technical Report RNR-94-007. +The code is "embarrassingly" parallel in that no communication is +required for the generation of the random numbers itself. There is +no special requirement on the number of processors used for running +the benchmark. diff --git b/NPB3.3-MPI/EP/ep.f a/NPB3.3-MPI/EP/ep.f new file mode 100644 index 0000000..c112100 --- /dev/null +++ a/NPB3.3-MPI/EP/ep.f @@ -0,0 +1,359 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.3 ! +! ! +! E P ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.3 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007 ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.3. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.3, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + + +c--------------------------------------------------------------------- +c +c Authors: P. O. Frederickson +c D. H. Bailey +c A. C. Woo +c R. F. Van der Wijngaart +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- + program EMBAR +c--------------------------------------------------------------------- +C +c This is the MPI version of the APP Benchmark 1, +c the "embarassingly parallel" benchmark. +c +c +c M is the Log_2 of the number of complex pairs of uniform (0, 1) random +c numbers. MK is the Log_2 of the size of each batch of uniform random +c numbers. MK can be set for convenience on a given system, since it does +c not affect the results. + + implicit none + + include 'npbparams.h' + include 'mpinpb.h' + + double precision Mops, epsilon, a, s, t1, t2, t3, t4, x, x1, + > x2, q, sx, sy, tm, an, tt, gc, dum(3), + > timer_read + double precision sx_verify_value, sy_verify_value, sx_err, sy_err + integer mk, mm, nn, nk, nq, np, ierr, node, no_nodes, + > i, ik, kk, l, k, nit, ierrcode, no_large_nodes, + > np_add, k_offset, j + logical verified, timers_enabled + external randlc, timer_read + double precision randlc, qq + character*15 size + + integer fstatus + integer t_total, t_gpairs, t_randn, t_rcomm, t_last + parameter (t_total=1, t_gpairs=2, t_randn=3, t_rcomm=4, t_last=4) + double precision tsum(t_last+2), t1m(t_last+2), + > tming(t_last+2), tmaxg(t_last+2) + character t_recs(t_last+2)*8 + + parameter (mk = 16, mm = m - mk, nn = 2 ** mm, + > nk = 2 ** mk, nq = 10, epsilon=1.d-8, + > a = 1220703125.d0, s = 271828183.d0) + + common/storage/ x(2*nk), q(0:nq-1), qq(10000) + data dum /1.d0, 1.d0, 1.d0/ + + data t_recs/'total', 'gpairs', 'randn', 'rcomm', + > ' totcomp', ' totcomm'/ + + + call mpi_init(ierr) + call mpi_comm_rank(MPI_COMM_WORLD,node,ierr) + call mpi_comm_size(MPI_COMM_WORLD,no_nodes,ierr) + + root = 0 + + if (.not. convertdouble) then + dp_type = MPI_DOUBLE_PRECISION + else + dp_type = MPI_REAL + endif + + if (node.eq.root) then + +c Because the size of the problem is too large to store in a 32-bit +c integer for some classes, we put it into a string (for printing). +c Have to strip off the decimal point put in there by the floating +c point print statement (internal file) + + write(*, 1000) + write(size, '(f15.0)' ) 2.d0**(m+1) + j = 15 + if (size(j:j) .eq. '.') j = j - 1 + write (*,1001) size(1:j) + write(*, 1003) no_nodes + + 1000 format(/,' NAS Parallel Benchmarks 3.3 -- EP Benchmark',/) + 1001 format(' Number of random numbers generated: ', a15) + 1003 format(' Number of active processes: ', 2x, i13, /) + + open (unit=2,file='timer.flag',status='old',iostat=fstatus) + timers_enabled = .false. + if (fstatus .eq. 0) then + timers_enabled = .true. + close(2) + endif + endif + + call mpi_bcast(timers_enabled, 1, MPI_LOGICAL, root, + > MPI_COMM_WORLD, ierr) + + verified = .false. + +c Compute the number of "batches" of random number pairs generated +c per processor. Adjust if the number of processors does not evenly +c divide the total number + + np = nn / no_nodes + no_large_nodes = mod(nn, no_nodes) + if (node .lt. no_large_nodes) then + np_add = 1 + else + np_add = 0 + endif + np = np + np_add + + if (np .eq. 0) then + write (6, 1) no_nodes, nn + 1 format ('Too many nodes:',2i6) + ierrcode = 1 + call mpi_abort(MPI_COMM_WORLD,ierrcode,ierr) + stop + endif + +c Call the random number generator functions and initialize +c the x-array to reduce the effects of paging on the timings. +c Also, call all mathematical functions that are used. Make +c sure these initializations cannot be eliminated as dead code. + + call vranlc(0, dum(1), dum(2), dum(3)) + dum(1) = randlc(dum(2), dum(3)) + do 5 i = 1, 2*nk + x(i) = -1.d99 + 5 continue + Mops = log(sqrt(abs(max(1.d0,1.d0)))) + +c--------------------------------------------------------------------- +c Synchronize before placing time stamp +c--------------------------------------------------------------------- + do i = 1, t_last + call timer_clear(i) + end do + call mpi_barrier(MPI_COMM_WORLD, ierr) + call timer_start(1) + + t1 = a + call vranlc(0, t1, a, x) + +c Compute AN = A ^ (2 * NK) (mod 2^46). + + t1 = a + + do 100 i = 1, mk + 1 + t2 = randlc(t1, t1) + 100 continue + + an = t1 + tt = s + gc = 0.d0 + sx = 0.d0 + sy = 0.d0 + + do 110 i = 0, nq - 1 + q(i) = 0.d0 + 110 continue + +c Each instance of this loop may be performed independently. We compute +c the k offsets separately to take into account the fact that some nodes +c have more numbers to generate than others + + if (np_add .eq. 1) then + k_offset = node * np -1 + else + k_offset = no_large_nodes*(np+1) + (node-no_large_nodes)*np -1 + endif + + do 150 k = 1, np + kk = k_offset + k + t1 = s + t2 = an + +c Find starting seed t1 for this kk. + + do 120 i = 1, 100 + ik = kk / 2 + if (2 * ik .ne. kk) t3 = randlc(t1, t2) + if (ik .eq. 0) goto 130 + t3 = randlc(t2, t2) + kk = ik + 120 continue + +c Compute uniform pseudorandom numbers. + 130 continue + + if (timers_enabled) call timer_start(t_randn) + call vranlc(2 * nk, t1, a, x) + if (timers_enabled) call timer_stop(t_randn) + +c Compute Gaussian deviates by acceptance-rejection method and +c tally counts in concentric square annuli. This loop is not +c vectorizable. + + if (timers_enabled) call timer_start(t_gpairs) + + do 140 i = 1, nk + x1 = 2.d0 * x(2*i-1) - 1.d0 + x2 = 2.d0 * x(2*i) - 1.d0 + t1 = x1 ** 2 + x2 ** 2 + if (t1 .le. 1.d0) then + t2 = sqrt(-2.d0 * log(t1) / t1) + t3 = (x1 * t2) + t4 = (x2 * t2) + l = max(abs(t3), abs(t4)) + q(l) = q(l) + 1.d0 + sx = sx + t3 + sy = sy + t4 + endif + 140 continue + + if (timers_enabled) call timer_stop(t_gpairs) + + 150 continue + + if (timers_enabled) call timer_start(t_rcomm) + call mpi_allreduce(sx, x, 1, dp_type, + > MPI_SUM, MPI_COMM_WORLD, ierr) + sx = x(1) + call mpi_allreduce(sy, x, 1, dp_type, + > MPI_SUM, MPI_COMM_WORLD, ierr) + sy = x(1) + call mpi_allreduce(q, x, nq, dp_type, + > MPI_SUM, MPI_COMM_WORLD, ierr) + if (timers_enabled) call timer_stop(t_rcomm) + + do i = 1, nq + q(i-1) = x(i) + enddo + + do 160 i = 0, nq - 1 + gc = gc + q(i) + 160 continue + + call timer_stop(1) + tm = timer_read(1) + + call mpi_allreduce(tm, x, 1, dp_type, + > MPI_MAX, MPI_COMM_WORLD, ierr) + tm = x(1) + + if (node.eq.root) then + nit=0 + verified = .true. + if (m.eq.24) then + sx_verify_value = -3.247834652034740D+3 + sy_verify_value = -6.958407078382297D+3 + elseif (m.eq.25) then + sx_verify_value = -2.863319731645753D+3 + sy_verify_value = -6.320053679109499D+3 + elseif (m.eq.28) then + sx_verify_value = -4.295875165629892D+3 + sy_verify_value = -1.580732573678431D+4 + elseif (m.eq.30) then + sx_verify_value = 4.033815542441498D+4 + sy_verify_value = -2.660669192809235D+4 + elseif (m.eq.32) then + sx_verify_value = 4.764367927995374D+4 + sy_verify_value = -8.084072988043731D+4 + elseif (m.eq.36) then + sx_verify_value = 1.982481200946593D+5 + sy_verify_value = -1.020596636361769D+5 + elseif (m.eq.40) then + sx_verify_value = -5.319717441530D+05 + sy_verify_value = -3.688834557731D+05 + else + verified = .false. + endif + if (verified) then + sx_err = abs((sx - sx_verify_value)/sx_verify_value) + sy_err = abs((sy - sy_verify_value)/sy_verify_value) + verified = ((sx_err.le.epsilon) .and. (sy_err.le.epsilon)) + endif + Mops = 2.d0**(m+1)/tm/1000000.d0 + + write (6,11) tm, m, gc, sx, sy, (i, q(i), i = 0, nq - 1) + 11 format ('EP Benchmark Results:'//'CPU Time =',f10.4/'N = 2^', + > i5/'No. Gaussian Pairs =',f15.0/'Sums = ',1p,2d25.15/ + > 'Counts:'/(i3,0p,f15.0)) + + call print_results('EP', class, m+1, 0, 0, nit, npm, + > no_nodes, tm, Mops, + > 'Random numbers generated', + > verified, npbversion, compiletime, cs1, + > cs2, cs3, cs4, cs5, cs6, cs7) + + endif + + + if (.not.timers_enabled) goto 999 + + do i = 1, t_last + t1m(i) = timer_read(i) + end do + t1m(t_last+2) = t1m(t_rcomm) + t1m(t_last+1) = t1m(t_total) - t1m(t_last+2) + + call MPI_Reduce(t1m, tsum, t_last+2, dp_type, MPI_SUM, + > 0, MPI_COMM_WORLD, ierr) + call MPI_Reduce(t1m, tming, t_last+2, dp_type, MPI_MIN, + > 0, MPI_COMM_WORLD, ierr) + call MPI_Reduce(t1m, tmaxg, t_last+2, dp_type, MPI_MAX, + > 0, MPI_COMM_WORLD, ierr) + + if (node .eq. 0) then + write(*, 800) no_nodes + do i = 1, t_last+2 + tsum(i) = tsum(i) / no_nodes + write(*, 810) i, t_recs(i), tming(i), tmaxg(i), tsum(i) + end do + endif + 800 format(' nprocs =', i6, 11x, 'minimum', 5x, 'maximum', + > 5x, 'average') + 810 format(' timer ', i2, '(', A8, ') :', 3(2x,f10.4)) + + 999 continue + call mpi_finalize(ierr) + + end diff --git b/NPB3.3-MPI/EP/mpinpb.h a/NPB3.3-MPI/EP/mpinpb.h new file mode 100644 index 0000000..1f13637 --- /dev/null +++ a/NPB3.3-MPI/EP/mpinpb.h @@ -0,0 +1,9 @@ + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + include 'mpif.h' + + integer me, nprocs, root, dp_type + common /mpistuff/ me, nprocs, root, dp_type + diff --git b/NPB3.3-MPI/FT/Makefile a/NPB3.3-MPI/FT/Makefile new file mode 100644 index 0000000..1cc6e14 --- /dev/null +++ a/NPB3.3-MPI/FT/Makefile @@ -0,0 +1,23 @@ +SHELL=/bin/sh +BENCHMARK=ft +BENCHMARKU=FT + +include ../config/make.def + +include ../sys/make.common + +OBJS = ft.o ${COMMON}/${RAND}.o ${COMMON}/print_results.o ${COMMON}/timers.o + +${PROGRAM}: config ${OBJS} + ${FLINK} ${FLINKFLAGS} -o ${PROGRAM} ${OBJS} ${FMPI_LIB} + + + +.f.o: + ${FCOMPILE} $< + +ft.o: ft.f global.h mpinpb.h npbparams.h + +clean: + - rm -f *.o *~ mputil* + - rm -f ft npbparams.h core diff --git b/NPB3.3-MPI/FT/README a/NPB3.3-MPI/FT/README new file mode 100644 index 0000000..ab08b36 --- /dev/null +++ a/NPB3.3-MPI/FT/README @@ -0,0 +1,5 @@ +This code implements the time integration of a three-dimensional +partial differential equation using the Fast Fourier Transform. +Some of the dimension statements are not F77 conforming and will +not work using the g77 compiler. All dimension statements, +however, are legal F90. \ No newline at end of file diff --git b/NPB3.3-MPI/FT/ft.f a/NPB3.3-MPI/FT/ft.f new file mode 100644 index 0000000..8c46f14 --- /dev/null +++ a/NPB3.3-MPI/FT/ft.f @@ -0,0 +1,2034 @@ +!-------------------------------------------------------------------------! +! ! +! N A S P A R A L L E L B E N C H M A R K S 3.3 ! +! ! +! F T ! +! ! +!-------------------------------------------------------------------------! +! ! +! This benchmark is part of the NAS Parallel Benchmark 3.3 suite. ! +! It is described in NAS Technical Reports 95-020 and 02-007 ! +! ! +! Permission to use, copy, distribute and modify this software ! +! for any purpose with or without fee is hereby granted. We ! +! request, however, that all derived work reference the NAS ! +! Parallel Benchmarks 3.3. This software is provided "as is" ! +! without express or implied warranty. ! +! ! +! Information on NPB 3.3, including the technical report, the ! +! original specifications, source code, results and information ! +! on how to submit new results, is available at: ! +! ! +! http://www.nas.nasa.gov/Software/NPB/ ! +! ! +! Send comments or suggestions to npb@nas.nasa.gov ! +! ! +! NAS Parallel Benchmarks Group ! +! NASA Ames Research Center ! +! Mail Stop: T27A-1 ! +! Moffett Field, CA 94035-1000 ! +! ! +! E-mail: npb@nas.nasa.gov ! +! Fax: (650) 604-3957 ! +! ! +!-------------------------------------------------------------------------! + +!TO REDUCE THE AMOUNT OF MEMORY REQUIRED BY THE BENCHMARK WE NO LONGER +!STORE THE ENTIRE TIME EVOLUTION ARRAY "EX" FOR ALL TIME STEPS, BUT +!JUST FOR THE FIRST. ALSO, IT IS STORED ONLY FOR THE PART OF THE GRID +!FOR WHICH THE CALLING PROCESSOR IS RESPONSIBLE, SO THAT THE MEMORY +!USAGE BECOMES SCALABLE. THIS NEW ARRAY IS CALLED "TWIDDLE" (SEE +!NPB3.0-SER) + +!TO AVOID PROBLEMS WITH VERY LARGE ARRAY SIZES THAT ARE COMPUTED BY +!MULTIPLYING GRID DIMENSIONS (CAUSING INTEGER OVERFLOW IN THE VARIABLE +!NTOTAL) AND SUBSEQUENTLY DIVIDING BY THE NUMBER OF PROCESSORS, WE +!COMPUTE THE SIZE OF ARRAY PARTITIONS MORE CONSERVATIVELY AS +!((NX*NY)/NP)*NZ, WHERE NX, NY, AND NZ ARE GRID DIMENSIONS AND NP IS +!THE NUMBER OF PROCESSORS, THE RESULT IS STORED IN "NTDIVNP". FOR THE +!PERFORMANCE CALCULATION WE STORE THE TOTAL NUMBER OF GRID POINTS IN A +!FLOATING POINT NUMBER "NTOTAL_F" INSTEAD OF AN INTEGER. +!THIS FIX WILL FAIL IF THE NUMBER OF PROCESSORS IS SMALL. + +!UGLY HACK OF SUBROUTINE IPOW46: FOR VERY LARGE GRIDS THE SINGLE EXPONENT +!FROM NPB2.3 MAY NOT FIT IN A 32-BIT INTEGER. HOWEVER, WE KNOW THAT THE +!"EXPONENT" ARGUMENT OF THIS ROUTINE CAN ALWAYS BE FACTORED INTO A TERM +!DIVISIBLE BY NX (EXP_1) AND ANOTHER TERM (EXP_2). NX IS USUALLY A POWER +!OF TWO, SO WE CAN KEEP HALVING IT UNTIL THE PRODUCT OF EXP_1 +!AND EXP_2 IS SMALL ENOUGH (NAMELY EXP_2 ITSELF). THIS UPDATED VERSION +!OF IPWO46, WHICH NOW TAKES THE TWO FACTORS OF "EXPONENT" AS SEPARATE +!ARGUMENTS, MAY BREAK DOWN IF EXP_1 DOES NOT CONTAIN A LARGE POWER OF TWO. + +c--------------------------------------------------------------------- +c +c Authors: D. Bailey +c W. Saphir +c R. F. Van der Wijngaart +c +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c FT benchmark +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + program ft + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'mpif.h' + include 'global.h' + integer i, ierr + +c--------------------------------------------------------------------- +c u0, u1, u2 are the main arrays in the problem. +c Depending on the decomposition, these arrays will have different +c dimensions. To accomodate all possibilities, we allocate them as +c one-dimensional arrays and pass them to subroutines for different +c views +c - u0 contains the initial (transformed) initial condition +c - u1 and u2 are working arrays +c--------------------------------------------------------------------- + + double complex u0(ntdivnp), + > u1(ntdivnp), + > u2(ntdivnp) + double precision twiddle(ntdivnp) +c--------------------------------------------------------------------- +c Large arrays are in common so that they are allocated on the +c heap rather than the stack. This common block is not +c referenced directly anywhere else. Padding is to avoid accidental +c cache problems, since all array sizes are powers of two. +c--------------------------------------------------------------------- + + double complex pad1(3), pad2(3), pad3(3) + common /bigarrays/ u0, pad1, u1, pad2, u2, pad3, twiddle + + integer iter + double precision total_time, mflops + logical verified + character class + + call MPI_Init(ierr) + +c--------------------------------------------------------------------- +c Run the entire problem once to make sure all data is touched. +c This reduces variable startup costs, which is important for such a +c short benchmark. The other NPB 2 implementations are similar. +c--------------------------------------------------------------------- + do i = 1, t_max + call timer_clear(i) + end do + + call timer_start(T_init) + call setup() + call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3)) + call compute_initial_conditions(u1, dims(1,1), dims(2,1), + > dims(3,1)) + call fft_init (dims(1,1)) + call fft(1, u1, u0) + call timer_stop(T_init) + if (me .eq. 0) then + print *,'Initialization time =',timer_read(T_init) + endif + +c--------------------------------------------------------------------- +c Start over from the beginning. Note that all operations must +c be timed, in contrast to other benchmarks. +c--------------------------------------------------------------------- + do i = 1, t_max + call timer_clear(i) + end do + call MPI_Barrier(MPI_COMM_WORLD, ierr) + + call timer_start(T_total) + if (timers_enabled) call timer_start(T_setup) + + call compute_indexmap(twiddle, dims(1,3), dims(2,3), dims(3,3)) + call compute_initial_conditions(u1, dims(1,1), dims(2,1), + > dims(3,1)) + call fft_init (dims(1,1)) + +! if (timers_enabled) call synchup() + if (timers_enabled) call timer_stop(T_setup) + + if (timers_enabled) call timer_start(T_fft) + call fft(1, u1, u0) + if (timers_enabled) call timer_stop(T_fft) + + do iter = 1, niter + if (timers_enabled) call timer_start(T_evolve) + call evolve(u0, u1, twiddle, dims(1,1), dims(2,1), dims(3,1)) + if (timers_enabled) call timer_stop(T_evolve) + if (timers_enabled) call timer_start(T_fft) + call fft(-1, u1, u2) + if (timers_enabled) call timer_stop(T_fft) +! if (timers_enabled) call synchup() + if (timers_enabled) call timer_start(T_checksum) + call checksum(iter, u2, dims(1,1), dims(2,1), dims(3,1)) + if (timers_enabled) call timer_stop(T_checksum) + end do + + call verify(nx, ny, nz, niter, verified, class) + call timer_stop(t_total) + if (np .ne. np_min) verified = .false. + total_time = timer_read(t_total) + + if( total_time .ne. 0. ) then + mflops = 1.0d-6*ntotal_f * + > (14.8157+7.19641*log(ntotal_f) + > + (5.23518+7.21113*log(ntotal_f))*niter) + > /total_time + else + mflops = 0.0 + endif + if (me .eq. 0) then + call print_results('FT', class, nx, ny, nz, niter, np_min, np, + > total_time, mflops, ' floating point', verified, + > npbversion, compiletime, cs1, cs2, cs3, cs4, cs5, cs6, cs7) + endif + if (timers_enabled) call print_timers() + call MPI_Finalize(ierr) + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine evolve(u0, u1, twiddle, d1, d2, d3) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c evolve u0 -> u1 (t time steps) in fourier space +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer d1, d2, d3 + double precision exi + double complex u0(d1,d2,d3) + double complex u1(d1,d2,d3) + double precision twiddle(d1,d2,d3) + integer i, j, k + + do k = 1, d3 + do j = 1, d2 + do i = 1, d1 + u0(i,j,k) = u0(i,j,k)*(twiddle(i,j,k)) + u1(i,j,k) = u0(i,j,k) + end do + end do + end do + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine compute_initial_conditions(u0, d1, d2, d3) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Fill in array u0 with initial conditions from +c random number generator +c--------------------------------------------------------------------- + implicit none + include 'global.h' + integer d1, d2, d3 + double complex u0(d1, d2, d3) + integer k + double precision x0, start, an, dummy + +c--------------------------------------------------------------------- +c 0-D and 1-D layouts are easy because each processor gets a contiguous +c chunk of the array, in the Fortran ordering sense. +c For a 2-D layout, it's a bit more complicated. We always +c have entire x-lines (contiguous) in processor. +c We can do ny/np1 of them at a time since we have +c ny/np1 contiguous in y-direction. But then we jump +c by z-planes (nz/np2 of them, total). +c For the 0-D and 1-D layouts we could do larger chunks, but +c this turns out to have no measurable impact on performance. +c--------------------------------------------------------------------- + + + start = seed +c--------------------------------------------------------------------- +c Jump to the starting element for our first plane. +c--------------------------------------------------------------------- + call ipow46(a, 2*nx, (zstart(1)-1)*ny + (ystart(1)-1), an) + dummy = randlc(start, an) + call ipow46(a, 2*nx, ny, an) + +c--------------------------------------------------------------------- +c Go through by z planes filling in one square at a time. +c--------------------------------------------------------------------- + do k = 1, dims(3, 1) ! nz/np2 + x0 = start + call vranlc(2*nx*dims(2, 1), x0, a, u0(1, 1, k)) + if (k .ne. dims(3, 1)) dummy = randlc(start, an) + end do + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine ipow46(a, exp_1, exp_2, result) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c compute a^exponent mod 2^46 +c--------------------------------------------------------------------- + + implicit none + double precision a, result, dummy, q, r + integer exp_1, exp_2, n, n2, ierr + external randlc + double precision randlc + logical two_pow +c--------------------------------------------------------------------- +c Use +c a^n = a^(n/2)*a^(n/2) if n even else +c a^n = a*a^(n-1) if n odd +c--------------------------------------------------------------------- + result = 1 + if (exp_2 .eq. 0 .or. exp_1 .eq. 0) return + q = a + r = 1 + n = exp_1 + two_pow = .true. + + do while (two_pow) + n2 = n/2 + if (n2 * 2 .eq. n) then + dummy = randlc(q, q) + n = n2 + else + n = n * exp_2 + two_pow = .false. + endif + end do + + do while (n .gt. 1) + n2 = n/2 + if (n2 * 2 .eq. n) then + dummy = randlc(q, q) + n = n2 + else + dummy = randlc(r, q) + n = n-1 + endif + end do + dummy = randlc(r, q) + result = r + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine setup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'mpinpb.h' + include 'global.h' + + integer ierr, i, j, fstatus + debug = .FALSE. + + call MPI_Comm_size(MPI_COMM_WORLD, np, ierr) + call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr) + + if (.not. convertdouble) then + dc_type = MPI_DOUBLE_COMPLEX + else + dc_type = MPI_COMPLEX + endif + + + if (me .eq. 0) then + write(*, 1000) + + open (unit=2,file='timer.flag',status='old',iostat=fstatus) + timers_enabled = .false. + if (fstatus .eq. 0) then + timers_enabled = .true. + close(2) + endif + + open (unit=2,file='inputft.data',status='old', iostat=fstatus) + + if (fstatus .eq. 0) then + write(*,233) + 233 format(' Reading from input file inputft.data') + read (2,*) niter + read (2,*) layout_type + read (2,*) np1, np2 + close(2) + +c--------------------------------------------------------------------- +c check to make sure input data is consistent +c--------------------------------------------------------------------- + + +c--------------------------------------------------------------------- +c 1. product of processor grid dims must equal number of processors +c--------------------------------------------------------------------- + + if (np1 * np2 .ne. np) then + write(*, 238) + 238 format(' np1 and np2 given in input file are not valid.') + write(*, 239) np1*np2, np + 239 format(' Product is ', i5, ' and should be ', i5) + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + +c--------------------------------------------------------------------- +c 2. layout type must be valid +c--------------------------------------------------------------------- + + if (layout_type .ne. layout_0D .and. + > layout_type .ne. layout_1D .and. + > layout_type .ne. layout_2D) then + write(*, 240) + 240 format(' Layout type specified in inputft.data is + > invalid ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + +c--------------------------------------------------------------------- +c 3. 0D layout must be 1x1 grid +c--------------------------------------------------------------------- + + if (layout_type .eq. layout_0D .and. + > (np1 .ne.1 .or. np2 .ne. 1)) then + write(*, 241) + 241 format(' For 0D layout, both np1 and np2 must be 1 ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif +c--------------------------------------------------------------------- +c 4. 1D layout must be 1xN grid +c--------------------------------------------------------------------- + + if (layout_type .eq. layout_1D .and. np1 .ne. 1) then + write(*, 242) + 242 format(' For 1D layout, np1 must be 1 ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + + else + write(*,234) + niter = niter_default + if (np .eq. 1) then + np1 = 1 + np2 = 1 + layout_type = layout_0D + else if (np .le. nz) then + np1 = 1 + np2 = np + layout_type = layout_1D + else + np1 = nz + np2 = np/nz + layout_type = layout_2D + endif + endif + + if (np .lt. np_min) then + write(*, 10) np_min + 10 format(' Error: Compiled for ', I5, ' processors. ') + write(*, 11) np + 11 format(' Only ', i5, ' processors found ') + call MPI_Abort(MPI_COMM_WORLD, 1, ierr) + endif + + 234 format(' No input file inputft.data. Using compiled defaults') + write(*, 1001) nx, ny, nz + write(*, 1002) niter + write(*, 1004) np + write(*, 1005) np1, np2 + if (np .ne. np_min) write(*, 1006) np_min + + if (layout_type .eq. layout_0D) then + write(*, 1010) '0D' + else if (layout_type .eq. layout_1D) then + write(*, 1010) '1D' + else + write(*, 1010) '2D' + endif + + 1000 format(//,' NAS Parallel Benchmarks 3.3 -- FT Benchmark',/) + 1001 format(' Size : ', i4, 'x', i4, 'x', i4) + 1002 format(' Iterations : ', 7x, i7) + 1004 format(' Number of processes : ', 7x, i7) + 1005 format(' Processor array : ', 5x, i4, 'x', i4) + 1006 format(' WARNING: compiled for ', i5, ' processes. ', + > ' Will not verify. ') + 1010 format(' Layout type : ', 9x, A5) + endif + + +c--------------------------------------------------------------------- +c Broadcast parameters +c--------------------------------------------------------------------- + call MPI_BCAST(np1, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr) + call MPI_BCAST(np2, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr) + call MPI_BCAST(layout_type, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, + & ierr) + call MPI_BCAST(niter, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr) + call MPI_BCAST(timers_enabled, 1, MPI_LOGICAL, 0, MPI_COMM_WORLD, + & ierr) + + if (np1 .eq. 1 .and. np2 .eq. 1) then + layout_type = layout_0D + else if (np1 .eq. 1) then + layout_type = layout_1D + else + layout_type = layout_2D + endif + + if (layout_type .eq. layout_0D) then + do i = 1, 3 + dims(1, i) = nx + dims(2, i) = ny + dims(3, i) = nz + end do + else if (layout_type .eq. layout_1D) then + dims(1, 1) = nx + dims(2, 1) = ny + dims(3, 1) = nz + + dims(1, 2) = nx + dims(2, 2) = ny + dims(3, 2) = nz + + dims(1, 3) = nz + dims(2, 3) = nx + dims(3, 3) = ny + else if (layout_type .eq. layout_2D) then + dims(1, 1) = nx + dims(2, 1) = ny + dims(3, 1) = nz + + dims(1, 2) = ny + dims(2, 2) = nx + dims(3, 2) = nz + + dims(1, 3) = nz + dims(2, 3) = nx + dims(3, 3) = ny + + endif + do i = 1, 3 + dims(2, i) = dims(2, i) / np1 + dims(3, i) = dims(3, i) / np2 + end do + + +c--------------------------------------------------------------------- +c Determine processor coordinates of this processor +c Processor grid is np1xnp2. +c Arrays are always (n1, n2/np1, n3/np2) +c Processor coords are zero-based. +c--------------------------------------------------------------------- + me2 = mod(me, np2) ! goes from 0...np2-1 + me1 = me/np2 ! goes from 0...np1-1 +c--------------------------------------------------------------------- +c Communicators for rows/columns of processor grid. +c commslice1 is communicator of all procs with same me1, ranked as me2 +c commslice2 is communicator of all procs with same me2, ranked as me1 +c mpi_comm_split(comm, color, key, ...) +c--------------------------------------------------------------------- + call MPI_Comm_split(MPI_COMM_WORLD, me1, me2, commslice1, ierr) + call MPI_Comm_split(MPI_COMM_WORLD, me2, me1, commslice2, ierr) +! if (timers_enabled) call synchup() + + if (debug) print *, 'proc coords: ', me, me1, me2 + +c--------------------------------------------------------------------- +c Determine which section of the grid is owned by this +c processor. +c--------------------------------------------------------------------- + if (layout_type .eq. layout_0d) then + + do i = 1, 3 + xstart(i) = 1 + xend(i) = nx + ystart(i) = 1 + yend(i) = ny + zstart(i) = 1 + zend(i) = nz + end do + + else if (layout_type .eq. layout_1d) then + + xstart(1) = 1 + xend(1) = nx + ystart(1) = 1 + yend(1) = ny + zstart(1) = 1 + me2 * nz/np2 + zend(1) = (me2+1) * nz/np2 + + xstart(2) = 1 + xend(2) = nx + ystart(2) = 1 + yend(2) = ny + zstart(2) = 1 + me2 * nz/np2 + zend(2) = (me2+1) * nz/np2 + + xstart(3) = 1 + xend(3) = nx + ystart(3) = 1 + me2 * ny/np2 + yend(3) = (me2+1) * ny/np2 + zstart(3) = 1 + zend(3) = nz + + else if (layout_type .eq. layout_2d) then + + xstart(1) = 1 + xend(1) = nx + ystart(1) = 1 + me1 * ny/np1 + yend(1) = (me1+1) * ny/np1 + zstart(1) = 1 + me2 * nz/np2 + zend(1) = (me2+1) * nz/np2 + + xstart(2) = 1 + me1 * nx/np1 + xend(2) = (me1+1)*nx/np1 + ystart(2) = 1 + yend(2) = ny + zstart(2) = zstart(1) + zend(2) = zend(1) + + xstart(3) = xstart(2) + xend(3) = xend(2) + ystart(3) = 1 + me2 *ny/np2 + yend(3) = (me2+1)*ny/np2 + zstart(3) = 1 + zend(3) = nz + endif + +c--------------------------------------------------------------------- +c Set up info for blocking of ffts and transposes. This improves +c performance on cache-based systems. Blocking involves +c working on a chunk of the problem at a time, taking chunks +c along the first, second, or third dimension. +c +c - In cffts1 blocking is on 2nd dimension (with fft on 1st dim) +c - In cffts2/3 blocking is on 1st dimension (with fft on 2nd and 3rd dims) + +c Since 1st dim is always in processor, we'll assume it's long enough +c (default blocking factor is 16 so min size for 1st dim is 16) +c The only case we have to worry about is cffts1 in a 2d decomposition. +c so the blocking factor should not be larger than the 2nd dimension. +c--------------------------------------------------------------------- + + fftblock = fftblock_default + fftblockpad = fftblockpad_default + + if (layout_type .eq. layout_2d) then + if (dims(2, 1) .lt. fftblock) fftblock = dims(2, 1) + if (dims(2, 2) .lt. fftblock) fftblock = dims(2, 2) + if (dims(2, 3) .lt. fftblock) fftblock = dims(2, 3) + endif + + if (fftblock .ne. fftblock_default) fftblockpad = fftblock+3 + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine compute_indexmap(twiddle, d1, d2, d3) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c compute function from local (i,j,k) to ibar^2+jbar^2+kbar^2 +c for time evolution exponent. +c--------------------------------------------------------------------- + + implicit none + include 'mpinpb.h' + include 'global.h' + integer d1, d2, d3 + integer i, j, k, ii, ii2, jj, ij2, kk + double precision ap, twiddle(d1, d2, d3) + +c--------------------------------------------------------------------- +c this function is very different depending on whether +c we are in the 0d, 1d or 2d layout. Compute separately. +c basically we want to convert the fortran indices +c 1 2 3 4 5 6 7 8 +c to +c 0 1 2 3 -4 -3 -2 -1 +c The following magic formula does the trick: +c mod(i-1+n/2, n) - n/2 +c--------------------------------------------------------------------- + + ap = - 4.d0 * alpha * pi *pi + + if (layout_type .eq. layout_0d) then ! xyz layout + do i = 1, dims(1,3) + ii = mod(i+xstart(3)-2+nx/2, nx) - nx/2 + ii2 = ii*ii + do j = 1, dims(2,3) + jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2 + ij2 = jj*jj+ii2 + do k = 1, dims(3,3) + kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2 + twiddle(i,j,k) = dexp(ap*dfloat(kk*kk+ij2)) + end do + end do + end do + else if (layout_type .eq. layout_1d) then ! zxy layout + do i = 1,dims(2,3) + ii = mod(i+xstart(3)-2+nx/2, nx) - nx/2 + ii2 = ii*ii + do j = 1,dims(3,3) + jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2 + ij2 = jj*jj+ii2 + do k = 1,dims(1,3) + kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2 + twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2)) + end do + end do + end do + else if (layout_type .eq. layout_2d) then ! zxy layout + do i = 1,dims(2,3) + ii = mod(i+xstart(3)-2+nx/2, nx) - nx/2 + ii2 = ii*ii + do j = 1, dims(3,3) + jj = mod(j+ystart(3)-2+ny/2, ny) - ny/2 + ij2 = jj*jj+ii2 + do k =1,dims(1,3) + kk = mod(k+zstart(3)-2+nz/2, nz) - nz/2 + twiddle(k,i,j) = dexp(ap*dfloat(kk*kk+ij2)) + end do + end do + end do + else + print *, ' Unknown layout type ', layout_type + stop + endif + + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine print_timers() + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + integer i, ierr + include 'global.h' + include 'mpinpb.h' + character*25 tstrings(T_max+2) + double precision t1(T_max+2), tsum(T_max+2), + > tming(T_max+2), tmaxg(T_max+2) + data tstrings / ' total ', + > ' setup ', + > ' fft ', + > ' evolve ', + > ' checksum ', + > ' fftlow ', + > ' fftcopy ', + > ' transpose ', + > ' transpose1_loc ', + > ' transpose1_glo ', + > ' transpose1_fin ', + > ' transpose2_loc ', + > ' transpose2_glo ', + > ' transpose2_fin ', + > ' sync ', + > ' init ', + > ' totcomp ', + > ' totcomm ' / + + do i = 1, t_max + t1(i) = timer_read(i) + end do + t1(t_max+2) = t1(t_transxzglo) + t1(t_transxyglo) + t1(t_synch) + t1(t_max+1) = t1(t_total) - t1(t_max+2) + + call MPI_Reduce(t1, tsum, t_max+2, MPI_DOUBLE_PRECISION, + > MPI_SUM, 0, MPI_COMM_WORLD, ierr) + call MPI_Reduce(t1, tming, t_max+2, MPI_DOUBLE_PRECISION, + > MPI_MIN, 0, MPI_COMM_WORLD, ierr) + call MPI_Reduce(t1, tmaxg, t_max+2, MPI_DOUBLE_PRECISION, + > MPI_MAX, 0, MPI_COMM_WORLD, ierr) + + if (me .ne. 0) return + write(*, 800) np + do i = 1, t_max+2 + if (tsum(i) .ne. 0.0d0) then + write(*, 810) i, tstrings(i), tming(i), tmaxg(i), tsum(i)/np + endif + end do + 800 format(' nprocs =', i6, 19x, 'minimum', 5x, 'maximum', + > 5x, 'average') + 810 format(' timer ', i2, '(', A16, ') :', 3(2X,F10.4)) + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine fft(dir, x1, x2) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer dir + double complex x1(ntdivnp), x2(ntdivnp) + + double complex scratch(fftblockpad_default*maxdim*2) + +c--------------------------------------------------------------------- +c note: args x1, x2 must be different arrays +c note: args for cfftsx are (direction, layout, xin, xout, scratch) +c xin/xout may be the same and it can be somewhat faster +c if they are +c note: args for transpose are (layout1, layout2, xin, xout) +c xin/xout must be different +c--------------------------------------------------------------------- + + if (dir .eq. 1) then + if (layout_type .eq. layout_0d) then + call cffts1(1, dims(1,1), dims(2,1), dims(3,1), + > x1, x1, scratch) + call cffts2(1, dims(1,2), dims(2,2), dims(3,2), + > x1, x1, scratch) + call cffts3(1, dims(1,3), dims(2,3), dims(3,3), + > x1, x2, scratch) + else if (layout_type .eq. layout_1d) then + call cffts1(1, dims(1,1), dims(2,1), dims(3,1), + > x1, x1, scratch) + call cffts2(1, dims(1,2), dims(2,2), dims(3,2), + > x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_xy_z(2, 3, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(1, dims(1,3), dims(2,3), dims(3,3), + > x2, x2, scratch) + else if (layout_type .eq. layout_2d) then + call cffts1(1, dims(1,1), dims(2,1), dims(3,1), + > x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_y(1, 2, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(1, dims(1,2), dims(2,2), dims(3,2), + > x2, x2, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_z(2, 3, x2, x1) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(1, dims(1,3), dims(2,3), dims(3,3), + > x1, x2, scratch) + endif + else + if (layout_type .eq. layout_0d) then + call cffts3(-1, dims(1,3), dims(2,3), dims(3,3), + > x1, x1, scratch) + call cffts2(-1, dims(1,2), dims(2,2), dims(3,2), + > x1, x1, scratch) + call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), + > x1, x2, scratch) + else if (layout_type .eq. layout_1d) then + call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), + > x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_yz(3, 2, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts2(-1, dims(1,2), dims(2,2), dims(3,2), + > x2, x2, scratch) + call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), + > x2, x2, scratch) + else if (layout_type .eq. layout_2d) then + call cffts1(-1, dims(1,3), dims(2,3), dims(3,3), + > x1, x1, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_z(3, 2, x1, x2) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(-1, dims(1,2), dims(2,2), dims(3,2), + > x2, x2, scratch) + if (timers_enabled) call timer_start(T_transpose) + call transpose_x_y(2, 1, x2, x1) + if (timers_enabled) call timer_stop(T_transpose) + call cffts1(-1, dims(1,1), dims(2,1), dims(3,1), + > x1, x2, scratch) + endif + endif + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine cffts1(is, d1, d2, d3, x, xout, y) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'global.h' + integer is, d1, d2, d3, logd1 + double complex x(d1,d2,d3) + double complex xout(d1,d2,d3) + double complex y(fftblockpad, d1, 2) + integer i, j, k, jj + + logd1 = ilog2(d1) + + do k = 1, d3 + do jj = 0, d2 - fftblock, fftblock + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, fftblock + do i = 1, d1 + y(j,i,1) = x(i,j+jj,k) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + + if (timers_enabled) call timer_start(T_fftlow) + call cfftz (is, logd1, d1, y, y(1,1,2)) + if (timers_enabled) call timer_stop(T_fftlow) + + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, fftblock + do i = 1, d1 + xout(i,j+jj,k) = y(j,i,1) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine cffts2(is, d1, d2, d3, x, xout, y) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'global.h' + integer is, d1, d2, d3, logd2 + double complex x(d1,d2,d3) + double complex xout(d1,d2,d3) + double complex y(fftblockpad, d2, 2) + integer i, j, k, ii + + logd2 = ilog2(d2) + + do k = 1, d3 + do ii = 0, d1 - fftblock, fftblock + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, d2 + do i = 1, fftblock + y(i,j,1) = x(i+ii,j,k) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + + if (timers_enabled) call timer_start(T_fftlow) + call cfftz (is, logd2, d2, y, y(1, 1, 2)) + if (timers_enabled) call timer_stop(T_fftlow) + + if (timers_enabled) call timer_start(T_fftcopy) + do j = 1, d2 + do i = 1, fftblock + xout(i+ii,j,k) = y(i,j,1) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine cffts3(is, d1, d2, d3, x, xout, y) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + + include 'global.h' + integer is, d1, d2, d3, logd3 + double complex x(d1,d2,d3) + double complex xout(d1,d2,d3) + double complex y(fftblockpad, d3, 2) + integer i, j, k, ii + + logd3 = ilog2(d3) + + do j = 1, d2 + do ii = 0, d1 - fftblock, fftblock + if (timers_enabled) call timer_start(T_fftcopy) + do k = 1, d3 + do i = 1, fftblock + y(i,k,1) = x(i+ii,j,k) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + + if (timers_enabled) call timer_start(T_fftlow) + call cfftz (is, logd3, d3, y, y(1, 1, 2)) + if (timers_enabled) call timer_stop(T_fftlow) + + if (timers_enabled) call timer_start(T_fftcopy) + do k = 1, d3 + do i = 1, fftblock + xout(i+ii,j,k) = y(i,k,1) + enddo + enddo + if (timers_enabled) call timer_stop(T_fftcopy) + enddo + enddo + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine fft_init (n) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c compute the roots-of-unity array that will be used for subsequent FFTs. +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + + integer m,n,nu,ku,i,j,ln + double precision t, ti + + +c--------------------------------------------------------------------- +c Initialize the U array with sines and cosines in a manner that permits +c stride one access at each FFT iteration. +c--------------------------------------------------------------------- + nu = n + m = ilog2(n) + u(1) = m + ku = 2 + ln = 1 + + do j = 1, m + t = pi / ln + + do i = 0, ln - 1 + ti = i * t + u(i+ku) = dcmplx (cos (ti), sin(ti)) + enddo + + ku = ku + ln + ln = 2 * ln + enddo + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine cfftz (is, m, n, x, y) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Computes NY N-point complex-to-complex FFTs of X using an algorithm due +c to Swarztrauber. X is both the input and the output array, while Y is a +c scratch array. It is assumed that N = 2^M. Before calling CFFTZ to +c perform FFTs, the array U must be initialized by calling CFFTZ with IS +c set to 0 and M set to MX, where MX is the maximum value of M for any +c subsequent call. +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + + integer is,m,n,i,j,l,mx + double complex x, y + + dimension x(fftblockpad,n), y(fftblockpad,n) + +c--------------------------------------------------------------------- +c Check if input parameters are invalid. +c--------------------------------------------------------------------- + mx = u(1) + if ((is .ne. 1 .and. is .ne. -1) .or. m .lt. 1 .or. m .gt. mx) + > then + write (*, 1) is, m, mx + 1 format ('CFFTZ: Either U has not been initialized, or else'/ + > 'one of the input parameters is invalid', 3I5) + stop + endif + +c--------------------------------------------------------------------- +c Perform one variant of the Stockham FFT. +c--------------------------------------------------------------------- + do l = 1, m, 2 + call fftz2 (is, l, m, n, fftblock, fftblockpad, u, x, y) + if (l .eq. m) goto 160 + call fftz2 (is, l + 1, m, n, fftblock, fftblockpad, u, y, x) + enddo + + goto 180 + +c--------------------------------------------------------------------- +c Copy Y to X. +c--------------------------------------------------------------------- + 160 do j = 1, n + do i = 1, fftblock + x(i,j) = y(i,j) + enddo + enddo + + 180 continue + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine fftz2 (is, l, m, n, ny, ny1, u, x, y) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + +c--------------------------------------------------------------------- +c Performs the L-th iteration of the second variant of the Stockham FFT. +c--------------------------------------------------------------------- + + implicit none + + integer is,k,l,m,n,ny,ny1,n1,li,lj,lk,ku,i,j,i11,i12,i21,i22 + double complex u,x,y,u1,x11,x21 + dimension u(n), x(ny1,n), y(ny1,n) + + +c--------------------------------------------------------------------- +c Set initial parameters. +c--------------------------------------------------------------------- + + n1 = n / 2 + lk = 2 ** (l - 1) + li = 2 ** (m - l) + lj = 2 * lk + ku = li + 1 + + do i = 0, li - 1 + i11 = i * lk + 1 + i12 = i11 + n1 + i21 = i * lj + 1 + i22 = i21 + lk + if (is .ge. 1) then + u1 = u(ku+i) + else + u1 = dconjg (u(ku+i)) + endif + +c--------------------------------------------------------------------- +c This loop is vectorizable. +c--------------------------------------------------------------------- + do k = 0, lk - 1 + do j = 1, ny + x11 = x(j,i11+k) + x21 = x(j,i12+k) + y(j,i21+k) = x11 + x21 + y(j,i22+k) = u1 * (x11 - x21) + enddo + enddo + enddo + + return + end + +c--------------------------------------------------------------------- + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + integer function ilog2(n) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + integer n, nn, lg + if (n .eq. 1) then + ilog2=0 + return + endif + lg = 1 + nn = 2 + do while (nn .lt. n) + nn = nn*2 + lg = lg+1 + end do + ilog2 = lg + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_yz(l1, l2, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + + call transpose2_local(dims(1,l1),dims(2, l1)*dims(3, l1), + > xin, xout) + + call transpose2_global(xout, xin) + + call transpose2_finish(dims(1,l1),dims(2, l1)*dims(3, l1), + > xin, xout) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_xy_z(l1, l2, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + + call transpose2_local(dims(1,l1)*dims(2, l1),dims(3, l1), + > xin, xout) + call transpose2_global(xout, xin) + call transpose2_finish(dims(1,l1)*dims(2, l1),dims(3, l1), + > xin, xout) + + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose2_local(n1, n2, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'mpinpb.h' + include 'global.h' + integer n1, n2 + double complex xin(n1, n2), xout(n2, n1) + + double complex z(transblockpad, transblock) + + integer i, j, ii, jj + + if (timers_enabled) call timer_start(T_transxzloc) + +c--------------------------------------------------------------------- +c If possible, block the transpose for cache memory systems. +c How much does this help? Example: R8000 Power Challenge (90 MHz) +c Blocked version decreases time spend in this routine +c from 14 seconds to 5.2 seconds on 8 nodes class A. +c--------------------------------------------------------------------- + + if (n1 .lt. transblock .or. n2 .lt. transblock) then + if (n1 .ge. n2) then + do j = 1, n2 + do i = 1, n1 + xout(j, i) = xin(i, j) + end do + end do + else + do i = 1, n1 + do j = 1, n2 + xout(j, i) = xin(i, j) + end do + end do + endif + else + do j = 0, n2-1, transblock + do i = 0, n1-1, transblock + +c--------------------------------------------------------------------- +c Note: compiler should be able to take j+jj out of inner loop +c--------------------------------------------------------------------- + do jj = 1, transblock + do ii = 1, transblock + z(jj,ii) = xin(i+ii, j+jj) + end do + end do + + do ii = 1, transblock + do jj = 1, transblock + xout(j+jj, i+ii) = z(jj,ii) + end do + end do + + end do + end do + endif + if (timers_enabled) call timer_stop(T_transxzloc) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose2_global(xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + include 'mpinpb.h' + double complex xin(ntdivnp) + double complex xout(ntdivnp) + integer ierr + +! if (timers_enabled) call synchup() + + if (timers_enabled) call timer_start(T_transxzglo) + call mpi_alltoall(xin, ntdivnp/np, dc_type, + > xout, ntdivnp/np, dc_type, + > commslice1, ierr) + if (timers_enabled) call timer_stop(T_transxzglo) + + return + end + + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose2_finish(n1, n2, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer n1, n2, ioff + double complex xin(n2, n1/np2, 0:np2-1), xout(n2*np2, n1/np2) + + integer i, j, p + + if (timers_enabled) call timer_start(T_transxzfin) + do p = 0, np2-1 + ioff = p*n2 + do j = 1, n1/np2 + do i = 1, n2 + xout(i+ioff, j) = xin(i, j, p) + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxzfin) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_z(l1, l2, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + + call transpose_x_z_local(dims(1,l1),dims(2,l1),dims(3,l1), + > xin, xout) + call transpose_x_z_global(dims(1,l1),dims(2,l1),dims(3,l1), + > xout, xin) + call transpose_x_z_finish(dims(1,l2),dims(2,l2),dims(3,l2), + > xin, xout) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_z_local(d1, d2, d3, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer d1, d2, d3 + double complex xin(d1,d2,d3) + double complex xout(d3,d2,d1) + integer block1, block3 + integer i, j, k, kk, ii, i1, k1 + + double complex buf(transblockpad, maxdim) + if (timers_enabled) call timer_start(T_transxzloc) + if (d1 .lt. 32) goto 100 + block3 = d3 + if (block3 .eq. 1) goto 100 + if (block3 .gt. transblock) block3 = transblock + block1 = d1 + if (block1*block3 .gt. transblock*transblock) + > block1 = transblock*transblock/block3 +c--------------------------------------------------------------------- +c blocked transpose +c--------------------------------------------------------------------- + do j = 1, d2 + do kk = 0, d3-block3, block3 + do ii = 0, d1-block1, block1 + + do k = 1, block3 + k1 = k + kk + do i = 1, block1 + buf(k, i) = xin(i+ii, j, k1) + end do + end do + + do i = 1, block1 + i1 = i + ii + do k = 1, block3 + xout(k+kk, j, i1) = buf(k, i) + end do + end do + + end do + end do + end do + goto 200 + + +c--------------------------------------------------------------------- +c basic transpose +c--------------------------------------------------------------------- + 100 continue + + do j = 1, d2 + do k = 1, d3 + do i = 1, d1 + xout(k, j, i) = xin(i, j, k) + end do + end do + end do + +c--------------------------------------------------------------------- +c all done +c--------------------------------------------------------------------- + 200 continue + + if (timers_enabled) call timer_stop(T_transxzloc) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_z_global(d1, d2, d3, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + include 'mpinpb.h' + integer d1, d2, d3 + double complex xin(d3,d2,d1) + double complex xout(d3,d2,d1) ! not real layout, but right size + integer ierr + +! if (timers_enabled) call synchup() + +c--------------------------------------------------------------------- +c do transpose among all processes with same 1-coord (me1) +c--------------------------------------------------------------------- + if (timers_enabled)call timer_start(T_transxzglo) + call mpi_alltoall(xin, d1*d2*d3/np2, dc_type, + > xout, d1*d2*d3/np2, dc_type, + > commslice1, ierr) + if (timers_enabled) call timer_stop(T_transxzglo) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_z_finish(d1, d2, d3, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer d1, d2, d3 + double complex xin(d1/np2, d2, d3, 0:np2-1) + double complex xout(d1,d2,d3) + integer i, j, k, p, ioff + if (timers_enabled) call timer_start(T_transxzfin) +c--------------------------------------------------------------------- +c this is the most straightforward way of doing it. the +c calculation in the inner loop doesn't help. +c do i = 1, d1/np2 +c do j = 1, d2 +c do k = 1, d3 +c do p = 0, np2-1 +c ii = i + p*d1/np2 +c xout(ii, j, k) = xin(i, j, k, p) +c end do +c end do +c end do +c end do +c--------------------------------------------------------------------- + + do p = 0, np2-1 + ioff = p*d1/np2 + do k = 1, d3 + do j = 1, d2 + do i = 1, d1/np2 + xout(i+ioff, j, k) = xin(i, j, k, p) + end do + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxzfin) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_y(l1, l2, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer l1, l2 + double complex xin(ntdivnp), xout(ntdivnp) + +c--------------------------------------------------------------------- +c xy transpose is a little tricky, since we don't want +c to touch 3rd axis. But alltoall must involve 3rd axis (most +c slowly varying) to be efficient. So we do +c (nx, ny/np1, nz/np2) -> (ny/np1, nz/np2, nx) (local) +c (ny/np1, nz/np2, nx) -> ((ny/np1*nz/np2)*np1, nx/np1) (global) +c then local finish. +c--------------------------------------------------------------------- + + + call transpose_x_y_local(dims(1,l1),dims(2,l1),dims(3,l1), + > xin, xout) + call transpose_x_y_global(dims(1,l1),dims(2,l1),dims(3,l1), + > xout, xin) + call transpose_x_y_finish(dims(1,l2),dims(2,l2),dims(3,l2), + > xin, xout) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_y_local(d1, d2, d3, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer d1, d2, d3 + double complex xin(d1, d2, d3) + double complex xout(d2, d3, d1) + integer i, j, k + if (timers_enabled) call timer_start(T_transxyloc) + + do k = 1, d3 + do i = 1, d1 + do j = 1, d2 + xout(j,k,i)=xin(i,j,k) + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxyloc) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_y_global(d1, d2, d3, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + include 'mpinpb.h' + integer d1, d2, d3 +c--------------------------------------------------------------------- +c array is in form (ny/np1, nz/np2, nx) +c--------------------------------------------------------------------- + double complex xin(d2,d3,d1) + double complex xout(d2,d3,d1) ! not real layout but right size + integer ierr + +! if (timers_enabled) call synchup() + +c--------------------------------------------------------------------- +c do transpose among all processes with same 1-coord (me1) +c--------------------------------------------------------------------- + if (timers_enabled) call timer_start(T_transxyglo) + call mpi_alltoall(xin, d1*d2*d3/np1, dc_type, + > xout, d1*d2*d3/np1, dc_type, + > commslice2, ierr) + if (timers_enabled) call timer_stop(T_transxyglo) + + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine transpose_x_y_finish(d1, d2, d3, xin, xout) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + integer d1, d2, d3 + double complex xin(d1/np1, d3, d2, 0:np1-1) + double complex xout(d1,d2,d3) + integer i, j, k, p, ioff + if (timers_enabled) call timer_start(T_transxyfin) +c--------------------------------------------------------------------- +c this is the most straightforward way of doing it. the +c calculation in the inner loop doesn't help. +c do i = 1, d1/np1 +c do j = 1, d2 +c do k = 1, d3 +c do p = 0, np1-1 +c ii = i + p*d1/np1 +c note order is screwy bcz we have (ny/np1, nz/np2, nx) -> (ny, nx/np1, nz/np2) +c xout(ii, j, k) = xin(i, k, j, p) +c end do +c end do +c end do +c end do +c--------------------------------------------------------------------- + + do p = 0, np1-1 + ioff = p*d1/np1 + do k = 1, d3 + do j = 1, d2 + do i = 1, d1/np1 + xout(i+ioff, j, k) = xin(i, k, j, p) + end do + end do + end do + end do + if (timers_enabled) call timer_stop(T_transxyfin) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine checksum(i, u1, d1, d2, d3) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + include 'mpinpb.h' + integer i, d1, d2, d3 + double complex u1(d1, d2, d3) + integer j, q,r,s, ierr + double complex chk,allchk + chk = (0.0,0.0) + + do j=1,1024 + q = mod(j, nx)+1 + if (q .ge. xstart(1) .and. q .le. xend(1)) then + r = mod(3*j,ny)+1 + if (r .ge. ystart(1) .and. r .le. yend(1)) then + s = mod(5*j,nz)+1 + if (s .ge. zstart(1) .and. s .le. zend(1)) then + chk=chk+u1(q-xstart(1)+1,r-ystart(1)+1,s-zstart(1)+1) + end if + end if + end if + end do + chk = chk/ntotal_f + + if (timers_enabled) call timer_start(T_synch) + call MPI_Reduce(chk, allchk, 1, dc_type, MPI_SUM, + > 0, MPI_COMM_WORLD, ierr) + if (timers_enabled) call timer_stop(T_synch) + if (me .eq. 0) then + write (*, 30) i, allchk + 30 format (' T =',I5,5X,'Checksum =',1P2D22.12) + endif + +c sums(i) = allchk +c If we compute the checksum for diagnostic purposes, we let i be +c negative, so the result will not be stored in an array + if (i .gt. 0) sums(i) = allchk + + return + end + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine synchup + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + include 'mpinpb.h' + integer ierr + call timer_start(T_synch) + call mpi_barrier(MPI_COMM_WORLD, ierr) + call timer_stop(T_synch) + return + end + + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + subroutine verify (d1, d2, d3, nt, verified, class) + +c--------------------------------------------------------------------- +c--------------------------------------------------------------------- + + implicit none + include 'global.h' + include 'mpinpb.h' + integer d1, d2, d3, nt + character class + logical verified + integer ierr, size, i + double precision err, epsilon + +c--------------------------------------------------------------------- +c Reference checksums +c--------------------------------------------------------------------- + double complex csum_ref(25) + + + class = 'U' + + if (me .ne. 0) return + + epsilon = 1.0d-12 + verified = .FALSE. + + if (d1 .eq. 64 .and. + > d2 .eq. 64 .and. + > d3 .eq. 64 .and. + > nt .eq. 6) then +c--------------------------------------------------------------------- +c Sample size reference checksums +c--------------------------------------------------------------------- + class = 'S' + csum_ref(1) = dcmplx(5.546087004964D+02, 4.845363331978D+02) + csum_ref(2) = dcmplx(5.546385409189D+02, 4.865304269511D+02) + csum_ref(3) = dcmplx(5.546148406171D+02, 4.883910722336D+02) + csum_ref(4) = dcmplx(5.545423607415D+02, 4.901273169046D+02) + csum_ref(5) = dcmplx(5.544255039624D+02, 4.917475857993D+02) + csum_ref(6) = dcmplx(5.542683411902D+02, 4.932597244941D+02) + + else if (d1 .eq. 128 .and. + > d2 .eq. 128 .and. + > d3 .eq. 32 .and. + > nt .eq. 6) then +c--------------------------------------------------------------------- +c Class W size reference checksums +c--------------------------------------------------------------------- + class = 'W' + csum_ref(1) = dcmplx(5.673612178944D+02, 5.293246849175D+02) + csum_ref(2) = dcmplx(5.631436885271D+02, 5.282149986629D+02) + csum_ref(3) = dcmplx(5.594024089970D+02, 5.270996558037D+02) + csum_ref(4) = dcmplx(5.560698047020D+02, 5.260027904925D+02) + csum_ref(5) = dcmplx(5.530898991250D+02, 5.249400845633D+02) + csum_ref(6) = dcmplx(5.504159734538D+02, 5.239212247086D+02) + + else if (d1 .eq. 256 .and. + > d2 .eq. 256 .and. + > d3 .eq. 128 .and. + > nt .eq. 6) then +c--------------------------------------------------------------------- +c Class A size reference checksums +c--------------------------------------------------------------------- + class = 'A' + csum_ref(1) = dcmplx(5.046735008193D+02, 5.114047905510D+02) + csum_ref(2) = dcmplx(5.059412319734D+02, 5.098809666433D+02) + csum_ref(3) = dcmplx(5.069376896287D+02, 5.098144042213D+02) + csum_ref(4) = dcmplx(5.077892868474D+02, 5.101336130759D+02) + csum_ref(5) = dcmplx(5.085233095391D+02, 5.104914655194D+02) + csum_ref(6) = dcmplx(5.091487099959D+02, 5.107917842803D+02) + + else if (d1 .eq. 512 .and. + > d2 .eq. 256 .and. + > d3 .eq. 256 .and. + > nt .eq. 20) then +c--------------------------------------------------------------------- +c Class B size reference checksums +c--------------------------------------------------------------------- + class = 'B' + csum_ref(1) = dcmplx(5.177643571579D+02, 5.077803458597D+02) + csum_ref(2) = dcmplx(5.154521291263D+02, 5.088249431599D+02) + csum_ref(3) = dcmplx(5.146409228649D+02, 5.096208912659D+02) + csum_ref(4) = dcmplx(5.142378756213D+02, 5.101023387619D+02) + csum_ref(5) = dcmplx(5.139626667737D+02, 5.103976610617D+02) + csum_ref(6) = dcmplx(5.137423460082D+02, 5.105948019802D+02) + csum_ref(7) = dcmplx(5.135547056878D+02, 5.107404165783D+02) + csum_ref(8) = dcmplx(5.133910925466D+02, 5.108576573661D+02) + csum_ref(9) = dcmplx(5.132470705390D+02, 5.109577278523D+02) + csum_ref(10) = dcmplx(5.131197729984D+02, 5.110460304483D+02) + csum_ref(11) = dcmplx(5.130070319283D+02, 5.111252433800D+02) + csum_ref(12) = dcmplx(5.129070537032D+02, 5.111968077718D+02) + csum_ref(13) = dcmplx(5.128182883502D+02, 5.112616233064D+02) + csum_ref(14) = dcmplx(5.127393733383D+02, 5.113203605551D+02) + csum_ref(15) = dcmplx(5.126691062020D+02, 5.113735928093D+02) + csum_ref(16) = dcmplx(5.126064276004D+02, 5.114218460548D+02) + csum_ref(17) = dcmplx(5.125504076570D+02, 5.114656139760D+02) + csum_ref(18) = dcmplx(5.125002331720D+02, 5.115053595966D+02) + csum_ref(19) = dcmplx(5.124551951846D+02, 5.115415130407D+02) + csum_ref(20) = dcmplx(5.124146770029D+02, 5.115744692211D+02) + + else if (d1 .eq. 512 .and. + > d2 .eq. 512 .and. + > d3 .eq. 512 .and. + > nt .eq. 20) then +c--------------------------------------------------------------------- +c Class C size reference checksums +c--------------------------------------------------------------------- + class = 'C' + csum_ref(1) = dcmplx(5.195078707457D+02, 5.149019699238D+02) + csum_ref(2) = dcmplx(5.155422171134D+02, 5.127578201997D+02) + csum_ref(3) = dcmplx(5.144678022222D+02, 5.122251847514D+02) + csum_ref(4) = dcmplx(5.140150594328D+02, 5.121090289018D+02) + csum_ref(5) = dcmplx(5.137550426810D+02, 5.121143685824D+02) + csum_ref(6) = dcmplx(5.135811056728D+02, 5.121496764568D+02) + csum_ref(7) = dcmplx(5.134569343165D+02, 5.121870921893D+02) + csum_ref(8) = dcmplx(5.133651975661D+02, 5.122193250322D+02) + csum_ref(9) = dcmplx(5.132955192805D+02, 5.122454735794D+02) + csum_ref(10) = dcmplx(5.132410471738D+02, 5.122663649603D+02) + csum_ref(11) = dcmplx(5.131971141679D+02, 5.122830879827D+02) + csum_ref(12) = dcmplx(5.131605205716D+02, 5.122965869718D+02) + csum_ref(13) = dcmplx(5.131290734194D+02, 5.123075927445D+02) + csum_ref(14) = dcmplx(5.131012720314D+02, 5.123166486553D+02) + csum_ref(15) = dcmplx(5.130760908195D+02, 5.123241541685D+02) + csum_ref(16) = dcmplx(5.130528295923D+02, 5.123304037599D+02) + csum_ref(17) = dcmplx(5.130310107773D+02, 5.123356167976D+02) + csum_ref(18) = dcmplx(5.130103090133D+02, 5.123399592211D+02) + csum_ref(19) = dcmplx(5.129905029333D+02, 5.123435588985D+02) + csum_ref(20) = dcmplx(5.129714421109D+02, 5.123465164008D+02) + + else if (d1 .eq. 2048 .and. + > d2 .eq. 1024 .and. + > d3 .eq. 1024 .and. + > nt .eq. 25) then +c--------------------------------------------------------------------- +c Class D size reference checksums +c--------------------------------------------------------------------- + class = 'D' + csum_ref(1) = dcmplx(5.122230065252D+02, 5.118534037109D+02) + csum_ref(2) = dcmplx(5.120463975765D+02, 5.117061181082D+02) + csum_ref(3) = dcmplx(5.119865766760D+02, 5.117096364601D+02) + csum_ref(4) = dcmplx(5.119518799488D+02, 5.117373863950D+02) + csum_ref(5) = dcmplx(5.119269088223D+02, 5.117680347632D+02) + csum_ref(6) = dcmplx(5.119082416858D+02, 5.117967875532D+02) + csum_ref(7) = dcmplx(5.118943814638D+02, 5.118225281841D+02) + csum_ref(8) = dcmplx(5.118842385057D+02, 5.118451629348D+02) + csum_ref(9) = dcmplx(5.118769435632D+02, 5.118649119387D+02) + csum_ref(10) = dcmplx(5.118718203448D+02, 5.118820803844D+02) + csum_ref(11) = dcmplx(5.118683569061D+02, 5.118969781011D+02) + csum_ref(12) = dcmplx(5.118661708593D+02, 5.119098918835D+02) + csum_ref(13) = dcmplx(5.118649768950D+02, 5.119210777066D+02) + csum_ref(14) = dcmplx(5.118645605626D+02, 5.119307604484D+02) + csum_ref(15) = dcmplx(5.118647586618D+02, 5.119391362671D+02) + csum_ref(16) = dcmplx(5.118654451572D+02, 5.119463757241D+02) + csum_ref(17) = dcmplx(5.118665212451D+02, 5.119526269238D+02) + csum_ref(18) = dcmplx(5.118679083821D+02, 5.119580184108D+02) + csum_ref(19) = dcmplx(5.118695433664D+02, 5.119626617538D+02) + csum_ref(20) = dcmplx(5.118713748264D+02, 5.119666538138D+02) + csum_ref(21) = dcmplx(5.118733606701D+02, 5.119700787219D+02) + csum_ref(22) = dcmplx(5.118754661974D+02, 5.119730095953D+02) + csum_ref(23) = dcmplx(5.118776626738D+02, 5.119755100241D+02) + csum_ref(24) = dcmplx(5.118799262314D+02, 5.119776353561D+02) + csum_ref(25) = dcmplx(5.118822370068D+02, 5.119794338060D+02) + + else if (d1 .eq. 4096 .and. + > d2 .eq. 2048 .and. + > d3 .eq. 2048 .and. + > nt .eq. 25) then +c--------------------------------------------------------------------- +c Class E size reference checksums +c--------------------------------------------------------------------- + class = 'E' + csum_ref(1) = dcmplx(5.121601045346D+02, 5.117395998266D+02) + csum_ref(2) = dcmplx(5.120905403678D+02, 5.118614716182D+02) + csum_ref(3) = dcmplx(5.120623229306D+02, 5.119074203747D+02) + csum_ref(4) = dcmplx(5.120438418997D+02, 5.119345900733D+02) + csum_ref(5) = dcmplx(5.120311521872D+02, 5.119551325550D+02) + csum_ref(6) = dcmplx(5.120226088809D+02, 5.119720179919D+02) + csum_ref(7) = dcmplx(5.120169296534D+02, 5.119861371665D+02) + csum_ref(8) = dcmplx(5.120131225172D+02, 5.119979364402D+02) + csum_ref(9) = dcmplx(5.120104767108D+02, 5.120077674092D+02) + csum_ref(10) = dcmplx(5.120085127969D+02, 5.120159443121D+02) + csum_ref(11) = dcmplx(5.120069224127D+02, 5.120227453670D+02) + csum_ref(12) = dcmplx(5.120055158164D+02, 5.120284096041D+02) + csum_ref(13) = dcmplx(5.120041820159D+02, 5.120331373793D+02) + csum_ref(14) = dcmplx(5.120028605402D+02, 5.120370938679D+02) + csum_ref(15) = dcmplx(5.120015223011D+02, 5.120404138831D+02) + csum_ref(16) = dcmplx(5.120001570022D+02, 5.120432068837D+02) + csum_ref(17) = dcmplx(5.119987650555D+02, 5.120455615860D+02) + csum_ref(18) = dcmplx(5.119973525091D+02, 5.120475499442D+02) + csum_ref(19) = dcmplx(5.119959279472D+02, 5.120492304629D+02) + csum_ref(20) = dcmplx(5.119945006558D+02, 5.120506508902D+02) + csum_ref(21) = dcmplx(5.119930795911D+02, 5.120518503782D+02) + csum_ref(22) = dcmplx(5.119916728462D+02, 5.120528612016D+02) + csum_ref(23) = dcmplx(5.119902874185D+02, 5.120537101195D+02) + csum_ref(24) = dcmplx(5.119889291565D+02, 5.120544194514D+02) + csum_ref(25) = dcmplx(5.119876028049D+02, 5.120550079284D+02) + + endif + + + if (class .ne. 'U') then + + do i = 1, nt + err = abs( (sums(i) - csum_ref(i)) / csum_ref(i) ) + if (.not.(err .le. epsilon)) goto 100 + end do + verified = .TRUE. + 100 continue + + endif + + call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr) + if (size .ne. np) then + write(*, 4010) np + write(*, 4011) + write(*, 4012) +c--------------------------------------------------------------------- +c multiple statements because some Fortran compilers have +c problems with long strings. +c--------------------------------------------------------------------- + 4010 format( ' Warning: benchmark was compiled for ', i5, + > 'processors') + 4011 format( ' Must be run on this many processors for official', + > ' verification') + 4012 format( ' so memory access is repeatable') + verified = .false. + endif + + if (class .ne. 'U') then + if (verified) then + write(*,2000) + 2000 format(' Result verification successful') + else + write(*,2001) + 2001 format(' Result verification failed') + endif + endif + print *, 'class = ', class + + return + end + + diff --git b/NPB3.3-MPI/FT/global.h a/NPB3.3-MPI/FT/global.h new file mode 100644 index 0000000..29a8656 --- /dev/null +++ a/NPB3.3-MPI/FT/global.h @@ -0,0 +1,134 @@ + include 'npbparams.h' + +c 2D processor array -> 2D grid decomposition (by pencils) +c If processor array is 1xN or -> 1D grid decomposition (by planes) +c If processor array is 1x1 -> 0D grid decomposition +c For simplicity, do not treat Nx1 (np2 = 1) specially + integer np1, np2, np + +c basic decomposition strategy + integer layout_type + integer layout_0D, layout_1D, layout_2D + parameter (layout_0D = 0, layout_1D = 1, layout_2D = 2) + + common /procgrid/ np1, np2, layout_type, np + + +c Cache blocking params. These values are good for most +c RISC processors. +c FFT parameters: +c fftblock controls how many ffts are done at a time. +c The default is appropriate for most cache-based machines +c On vector machines, the FFT can be vectorized with vector +c length equal to the block size, so the block size should +c be as large as possible. This is the size of the smallest +c dimension of the problem: 128 for class A, 256 for class B and +c 512 for class C. +c Transpose parameters: +c transblock is the blocking factor for the transposes when there +c is a 1-D layout. On vector machines it should probably be +c large (largest dimension of the problem). + + + integer fftblock_default, fftblockpad_default + parameter (fftblock_default=16, fftblockpad_default=18) + integer transblock, transblockpad + parameter(transblock=32, transblockpad=34) + + integer fftblock, fftblockpad + common /blockinfo/ fftblock, fftblockpad + +c we need a bunch of logic to keep track of how +c arrays are laid out. +c coords of this processor + integer me, me1, me2 + common /coords/ me, me1, me2 +c need a communicator for row/col in processor grid + integer commslice1, commslice2 + common /comms/ commslice1, commslice2 + + + +c There are basically three stages +c 1: x-y-z layout +c 2: after x-transform (before y) +c 3: after y-transform (before z) +c The computation proceeds logically as + +c set up initial conditions +c fftx(1) +c transpose (1->2) +c ffty(2) +c transpose (2->3) +c fftz(3) +c time evolution +c fftz(3) +c transpose (3->2) +c ffty(2) +c transpose (2->1) +c fftx(1) +c compute residual(1) + +c for the 0D, 1D, 2D strategies, the layouts look like xxx +c +c 0D 1D 2D +c 1: xyz xyz xyz +c 2: xyz xyz yxz +c 3: xyz zyx zxy + +c the array dimensions are stored in dims(coord, phase) + integer dims(3, 3) + integer xstart(3), ystart(3), zstart(3) + integer xend(3), yend(3), zend(3) + common /layout/ dims, + > xstart, ystart, zstart, + > xend, yend, zend + + integer T_total, T_setup, T_fft, T_evolve, T_checksum, + > T_fftlow, T_fftcopy, T_transpose, + > T_transxzloc, T_transxzglo, T_transxzfin, + > T_transxyloc, T_transxyglo, T_transxyfin, + > T_synch, T_init, T_max + parameter (T_total = 1, T_setup = 2, T_fft = 3, + > T_evolve = 4, T_checksum = 5, + > T_fftlow = 6, T_fftcopy = 7, T_transpose = 8, + > T_transxzloc = 9, T_transxzglo = 10, T_transxzfin = 11, + > T_transxyloc = 12, T_transxyglo = 13, + > T_transxyfin = 14, T_synch = 15, T_init = 16, + > T_max = 16) + + + + logical timers_enabled + + + external timer_read + double precision timer_read + external ilog2 + integer ilog2 + + external randlc + double precision randlc + + +c other stuff + logical debug, debugsynch + common /dbg/ debug, debugsynch, timers_enabled + + double precision seed, a, pi, alpha + parameter (seed = 314159265.d0, a = 1220703125.d0, + > pi = 3.141592653589793238d0, alpha=1.0d-6) + +c roots of unity array +c relies on x being largest dimension? + double complex u(nx) + common /ucomm/ u + + +c for checksum data + double complex sums(0:niter_default) + common /sumcomm/ sums + +c number of iterations + integer niter + common /iter/ niter diff --git b/NPB3.3-MPI/FT/inputft.data.sample a/NPB3.3-MPI/FT/inputft.data.sample new file mode 100644 index 0000000..448ac42 --- /dev/null +++ a/NPB3.3-MPI/FT/inputft.data.sample @@ -0,0 +1,3 @@ +6 ! number of iterations +2 ! layout type. 0 = 0d, 1 = 1d, 2 = 2d +2 4 ! processor layout. 0d must be "1 1"; 1d must be "1 N" diff --git b/NPB3.3-MPI/FT/mpinpb.h a/NPB3.3-MPI/FT/mpinpb.h new file mode 100644 index 0000000..e43e552 --- /dev/null +++ a/NPB3.3-MPI/FT/mpinpb.h @@ -0,0 +1,4 @@ + include 'mpif.h' +c mpi data types + integer dc_type + common /mpistuff/ dc_type diff --git b/NPB3.3-MPI/IS/Makefile a/NPB3.3-MPI/IS/Makefile new file mode 100644 index 0000000..0ac4ae9 --- /dev/null +++ a/NPB3.3-MPI/IS/Makefile @@ -0,0 +1,23 @@ +SHELL=/bin/sh +BENCHMARK=is +BENCHMARKU=IS + +include ../config/make.def + +include ../sys/make.common + +OBJS = is.o ${COMMON}/c_print_results.o ${COMMON}/c_timers.o + + +${PROGRAM}: config ${OBJS} + ${CLINK} ${CLINKFLAGS} -o ${PROGRAM} ${OBJS} ${CMPI_LIB} + +.c.o: + ${CCOMPILE} $< + +is.o: is.c npbparams.h + + +clean: + - rm -f *.o *~ mputil* + - rm -f is npbparams.h core diff --git b/NPB3.3-MPI/IS/is.c a/NPB3.3-MPI/IS/is.c new file mode 100644 index 0000000..39e64ab --- /dev/null +++ a/NPB3.3-MPI/IS/is.c @@ -0,0 +1,1150 @@ +/************************************************************************* + * * + * N A S P A R A L L E L B E N C H M A R K S 3.3 * + * * + * I S * + * * + ************************************************************************* + * * + * This benchmark is part of the NAS Parallel Benchmark 3.3 suite. * + * It is described in NAS Technical Report 95-020. * + * * + * Permission to use, copy, distribute and modify this software * + * for any purpose with or without fee is hereby granted. We * + * request, however, that all derived work reference the NAS * + * Parallel Benchmarks 3.3. This software is provided "as is" * + * without express or implied warranty. * + * * + * Information on NPB 3.3, including the technical report, the * + * original specifications, source code, results and information * + * on how to submit new results, is available at: * + * * + * http://www.nas.nasa.gov/Software/NPB * + * * + * Send comments or suggestions to npb@nas.nasa.gov * + * Send bug reports to npb-bugs@nas.nasa.gov * + * * + * NAS Parallel Benchmarks Group * + * NASA Ames Research Center * + * Mail Stop: T27A-1 * + * Moffett Field, CA 94035-1000 * + * * + * E-mail: npb@nas.nasa.gov * + * Fax: (650) 604-3957 * + * * + ************************************************************************* + * * + * Author: M. Yarrow * + * H. Jin * + * * + *************************************************************************/ + +#include "mpi.h" +#include "npbparams.h" +#include +#include + +/******************/ +/* default values */ +/******************/ +#ifndef CLASS +#define CLASS 'S' +#define NUM_PROCS 1 +#endif +#define MIN_PROCS 1 + + +/*************/ +/* CLASS S */ +/*************/ +#if CLASS == 'S' +#define TOTAL_KEYS_LOG_2 16 +#define MAX_KEY_LOG_2 11 +#define NUM_BUCKETS_LOG_2 9 +#endif + + +/*************/ +/* CLASS W */ +/*************/ +#if CLASS == 'W' +#define TOTAL_KEYS_LOG_2 20 +#define MAX_KEY_LOG_2 16 +#define NUM_BUCKETS_LOG_2 10 +#endif + +/*************/ +/* CLASS A */ +/*************/ +#if CLASS == 'A' +#define TOTAL_KEYS_LOG_2 23 +#define MAX_KEY_LOG_2 19 +#define NUM_BUCKETS_LOG_2 10 +#endif + + +/*************/ +/* CLASS B */ +/*************/ +#if CLASS == 'B' +#define TOTAL_KEYS_LOG_2 25 +#define MAX_KEY_LOG_2 21 +#define NUM_BUCKETS_LOG_2 10 +#endif + + +/*************/ +/* CLASS C */ +/*************/ +#if CLASS == 'C' +#define TOTAL_KEYS_LOG_2 27 +#define MAX_KEY_LOG_2 23 +#define NUM_BUCKETS_LOG_2 10 +#endif + + +/*************/ +/* CLASS D */ +/*************/ +#if CLASS == 'D' +#define TOTAL_KEYS_LOG_2 29 +#define MAX_KEY_LOG_2 27 +#define NUM_BUCKETS_LOG_2 10 +#undef MIN_PROCS +#define MIN_PROCS 4 +#endif + + +#define TOTAL_KEYS (1 << TOTAL_KEYS_LOG_2) +#define MAX_KEY (1 << MAX_KEY_LOG_2) +#define NUM_BUCKETS (1 << NUM_BUCKETS_LOG_2) +#define NUM_KEYS (TOTAL_KEYS/NUM_PROCS*MIN_PROCS) + +/*****************************************************************/ +/* On larger number of processors, since the keys are (roughly) */ +/* gaussian distributed, the first and last processor sort keys */ +/* in a large interval, requiring array sizes to be larger. Note */ +/* that for large NUM_PROCS, NUM_KEYS is, however, a small number*/ +/* The required array size also depends on the bucket size used. */ +/* The following values are validated for the 1024-bucket setup. */ +/*****************************************************************/ +#if NUM_PROCS < 256 +#define SIZE_OF_BUFFERS 3*NUM_KEYS/2 +#elif NUM_PROCS < 512 +#define SIZE_OF_BUFFERS 5*NUM_KEYS/2 +#elif NUM_PROCS < 1024 +#define SIZE_OF_BUFFERS 4*NUM_KEYS +#else +#define SIZE_OF_BUFFERS 13*NUM_KEYS/2 +#endif + +/*****************************************************************/ +/* NOTE: THIS CODE CANNOT BE RUN ON ARBITRARILY LARGE NUMBERS OF */ +/* PROCESSORS. THE LARGEST VERIFIED NUMBER IS 1024. INCREASE */ +/* MAX_PROCS AT YOUR PERIL */ +/*****************************************************************/ +#if CLASS == 'S' +#define MAX_PROCS 128 +#else +#define MAX_PROCS 1024 +#endif + +#define MAX_ITERATIONS 10 +#define TEST_ARRAY_SIZE 5 + + +/***********************************/ +/* Enable separate communication, */ +/* computation timing and printout */ +/***********************************/ +#define TIMING_ENABLED +#ifdef NO_MTIMERS +#undef TIMINIG_ENABLED +#define TIMER_START( x ) +#define TIMER_STOP( x ) +#else +#define TIMER_START( x ) if (timeron) timer_start( x ) +#define TIMER_STOP( x ) if (timeron) timer_stop( x ) +#define T_TOTAL 0 +#define T_RANK 1 +#define T_RCOMM 2 +#define T_VERIFY 3 +#define T_LAST 3 +#endif +int timeron; + + +/*************************************/ +/* Typedef: if necessary, change the */ +/* size of int here by changing the */ +/* int type to, say, long */ +/*************************************/ +typedef int INT_TYPE; +typedef long INT_TYPE2; +#define MP_KEY_TYPE MPI_INT + + + +/********************/ +/* MPI properties: */ +/********************/ +int my_rank, + comm_size; + + +/********************/ +/* Some global info */ +/********************/ +INT_TYPE *key_buff_ptr_global, /* used by full_verify to get */ + total_local_keys, /* copies of rank info */ + total_lesser_keys; + + +int passed_verification; + + + +/************************************/ +/* These are the three main arrays. */ +/* See SIZE_OF_BUFFERS def above */ +/************************************/ +INT_TYPE key_array[SIZE_OF_BUFFERS], + key_buff1[SIZE_OF_BUFFERS], + key_buff2[SIZE_OF_BUFFERS], + bucket_size[NUM_BUCKETS+TEST_ARRAY_SIZE], /* Top 5 elements for */ + bucket_size_totals[NUM_BUCKETS+TEST_ARRAY_SIZE], /* part. ver. vals */ + bucket_ptrs[NUM_BUCKETS], + process_bucket_distrib_ptr1[NUM_BUCKETS+TEST_ARRAY_SIZE], + process_bucket_distrib_ptr2[NUM_BUCKETS+TEST_ARRAY_SIZE]; +int send_count[MAX_PROCS], recv_count[MAX_PROCS], + send_displ[MAX_PROCS], recv_displ[MAX_PROCS]; + + +/**********************/ +/* Partial verif info */ +/**********************/ +INT_TYPE2 test_index_array[TEST_ARRAY_SIZE], + test_rank_array[TEST_ARRAY_SIZE], + + S_test_index_array[TEST_ARRAY_SIZE] = + {48427,17148,23627,62548,4431}, + S_test_rank_array[TEST_ARRAY_SIZE] = + {0,18,346,64917,65463}, + + W_test_index_array[TEST_ARRAY_SIZE] = + {357773,934767,875723,898999,404505}, + W_test_rank_array[TEST_ARRAY_SIZE] = + {1249,11698,1039987,1043896,1048018}, + + A_test_index_array[TEST_ARRAY_SIZE] = + {2112377,662041,5336171,3642833,4250760}, + A_test_rank_array[TEST_ARRAY_SIZE] = + {104,17523,123928,8288932,8388264}, + + B_test_index_array[TEST_ARRAY_SIZE] = + {41869,812306,5102857,18232239,26860214}, + B_test_rank_array[TEST_ARRAY_SIZE] = + {33422937,10244,59149,33135281,99}, + + C_test_index_array[TEST_ARRAY_SIZE] = + {44172927,72999161,74326391,129606274,21736814}, + C_test_rank_array[TEST_ARRAY_SIZE] = + {61147,882988,266290,133997595,133525895}, + + D_test_index_array[TEST_ARRAY_SIZE] = + {1317351170,995930646,1157283250,1503301535,1453734525}, + D_test_rank_array[TEST_ARRAY_SIZE] = + {1,36538729,1978098519,2145192618,2147425337}; + + + +/***********************/ +/* function prototypes */ +/***********************/ +double randlc( double *X, double *A ); + +void full_verify( void ); + +void c_print_results( char *name, + char class, + int n1, + int n2, + int n3, + int niter, + int nprocs_compiled, + int nprocs_total, + double t, + double mops, + char *optype, + int passed_verification, + char *npbversion, + char *compiletime, + char *mpicc, + char *clink, + char *cmpi_lib, + char *cmpi_inc, + char *cflags, + char *clinkflags ); + +void timer_clear( int n ); +void timer_start( int n ); +void timer_stop( int n ); +double timer_read( int n ); + + + +/* + * FUNCTION RANDLC (X, A) + * + * This routine returns a uniform pseudorandom double precision number in the + * range (0, 1) by using the linear congruential generator + * + * x_{k+1} = a x_k (mod 2^46) + * + * where 0 < x_k < 2^46 and 0 < a < 2^46. This scheme generates 2^44 numbers + * before repeating. The argument A is the same as 'a' in the above formula, + * and X is the same as x_0. A and X must be odd double precision integers + * in the range (1, 2^46). The returned value RANDLC is normalized to be + * between 0 and 1, i.e. RANDLC = 2^(-46) * x_1. X is updated to contain + * the new seed x_1, so that subsequent calls to RANDLC using the same + * arguments will generate a continuous sequence. + * + * This routine should produce the same results on any computer with at least + * 48 mantissa bits in double precision floating point data. On Cray systems, + * double precision should be disabled. + * + * David H. Bailey October 26, 1990 + * + * IMPLICIT DOUBLE PRECISION (A-H, O-Z) + * SAVE KS, R23, R46, T23, T46 + * DATA KS/0/ + * + * If this is the first call to RANDLC, compute R23 = 2 ^ -23, R46 = 2 ^ -46, + * T23 = 2 ^ 23, and T46 = 2 ^ 46. These are computed in loops, rather than + * by merely using the ** operator, in order to insure that the results are + * exact on all systems. This code assumes that 0.5D0 is represented exactly. + */ + + +/*****************************************************************/ +/************* R A N D L C ************/ +/************* ************/ +/************* portable random number generator ************/ +/*****************************************************************/ + +double randlc( double *X, double *A ) +{ + static int KS=0; + static double R23, R46, T23, T46; + double T1, T2, T3, T4; + double A1; + double A2; + double X1; + double X2; + double Z; + int i, j; + + if (KS == 0) + { + R23 = 1.0; + R46 = 1.0; + T23 = 1.0; + T46 = 1.0; + + for (i=1; i<=23; i++) + { + R23 = 0.50 * R23; + T23 = 2.0 * T23; + } + for (i=1; i<=46; i++) + { + R46 = 0.50 * R46; + T46 = 2.0 * T46; + } + KS = 1; + } + +/* Break A into two parts such that A = 2^23 * A1 + A2 and set X = N. */ + + T1 = R23 * *A; + j = T1; + A1 = j; + A2 = *A - T23 * A1; + +/* Break X into two parts such that X = 2^23 * X1 + X2, compute + Z = A1 * X2 + A2 * X1 (mod 2^23), and then + X = 2^23 * Z + A2 * X2 (mod 2^46). */ + + T1 = R23 * *X; + j = T1; + X1 = j; + X2 = *X - T23 * X1; + T1 = A1 * X2 + A2 * X1; + + j = R23 * T1; + T2 = j; + Z = T1 - T23 * T2; + T3 = T23 * Z + A2 * X2; + j = R46 * T3; + T4 = j; + *X = T3 - T46 * T4; + return(R46 * *X); +} + + + +/*****************************************************************/ +/************ F I N D _ M Y _ S E E D ************/ +/************ ************/ +/************ returns parallel random number seq seed ************/ +/*****************************************************************/ + +/* + * Create a random number sequence of total length nn residing + * on np number of processors. Each processor will therefore have a + * subsequence of length nn/np. This routine returns that random + * number which is the first random number for the subsequence belonging + * to processor rank kn, and which is used as seed for proc kn ran # gen. + */ + +double find_my_seed( int kn, /* my processor rank, 0<=kn<=num procs */ + int np, /* np = num procs */ + long nn, /* total num of ran numbers, all procs */ + double s, /* Ran num seed, for ex.: 314159265.00 */ + double a ) /* Ran num gen mult, try 1220703125.00 */ +{ + + long i; + + double t1,t2,t3,an; + long mq,nq,kk,ik; + + + + nq = nn / np; + + for( mq=0; nq>1; mq++,nq/=2 ) + ; + + t1 = a; + + for( i=1; i<=mq; i++ ) + t2 = randlc( &t1, &t1 ); + + an = t1; + + kk = kn; + t1 = s; + t2 = an; + + for( i=1; i<=100; i++ ) + { + ik = kk / 2; + if( 2 * ik != kk ) + t3 = randlc( &t1, &t2 ); + if( ik == 0 ) + break; + t3 = randlc( &t2, &t2 ); + kk = ik; + } + + return( t1 ); + +} + + + + +/*****************************************************************/ +/************* C R E A T E _ S E Q ************/ +/*****************************************************************/ + +void create_seq( double seed, double a ) +{ + double x; + int i, k; + + k = MAX_KEY/4; + + for (i=0; i 0 ) + MPI_Irecv( &k, + 1, + MP_KEY_TYPE, + my_rank-1, + 1000, + MPI_COMM_WORLD, + &request ); + if( my_rank < comm_size-1 ) + MPI_Send( &key_array[last_local_key], + 1, + MP_KEY_TYPE, + my_rank+1, + 1000, + MPI_COMM_WORLD ); + if( my_rank > 0 ) + MPI_Wait( &request, &status ); + +/* Confirm that neighbor's greatest key value + is not greater than my least key value */ + j = 0; + if( my_rank > 0 && total_local_keys > 0 ) + if( k > key_array[0] ) + j++; + + +/* Confirm keys correctly sorted: count incorrectly sorted keys, if any */ + for( i=1; i key_array[i] ) + j++; + + + if( j != 0 ) + { + printf( "Processor %d: Full_verify: number of keys out of sort: %d\n", + my_rank, j ); + } + else + passed_verification++; + + TIMER_STOP( T_VERIFY ); + +} + + + + +/*****************************************************************/ +/************* R A N K ****************/ +/*****************************************************************/ + + +void rank( int iteration ) +{ + + INT_TYPE i, k; + + INT_TYPE shift = MAX_KEY_LOG_2 - NUM_BUCKETS_LOG_2; + INT_TYPE key; + INT_TYPE2 bucket_sum_accumulator, j, m; + INT_TYPE local_bucket_sum_accumulator; + INT_TYPE min_key_val, max_key_val; + INT_TYPE *key_buff_ptr; + + + + TIMER_START( T_RANK ); + +/* Iteration alteration of keys */ + if(my_rank == 0 ) + { + key_array[iteration] = iteration; + key_array[iteration+MAX_ITERATIONS] = MAX_KEY - iteration; + } + + +/* Initialize */ + for( i=0; i> shift]++; + + +/* Accumulative bucket sizes are the bucket pointers */ + bucket_ptrs[0] = 0; + for( i=1; i< NUM_BUCKETS; i++ ) + bucket_ptrs[i] = bucket_ptrs[i-1] + bucket_size[i-1]; + + +/* Sort into appropriate bucket */ + for( i=0; i> shift]++] = key; + } + + TIMER_STOP( T_RANK ); + TIMER_START( T_RCOMM ); + +/* Get the bucket size totals for the entire problem. These + will be used to determine the redistribution of keys */ + MPI_Allreduce( bucket_size, + bucket_size_totals, + NUM_BUCKETS+TEST_ARRAY_SIZE, + MP_KEY_TYPE, + MPI_SUM, + MPI_COMM_WORLD ); + + TIMER_STOP( T_RCOMM ); + TIMER_START( T_RANK ); + +/* Determine Redistibution of keys: accumulate the bucket size totals + till this number surpasses NUM_KEYS (which the average number of keys + per processor). Then all keys in these buckets go to processor 0. + Continue accumulating again until supassing 2*NUM_KEYS. All keys + in these buckets go to processor 1, etc. This algorithm guarantees + that all processors have work ranking; no processors are left idle. + The optimum number of buckets, however, does not result in as high + a degree of load balancing (as even a distribution of keys as is + possible) as is obtained from increasing the number of buckets, but + more buckets results in more computation per processor so that the + optimum number of buckets turns out to be 1024 for machines tested. + Note that process_bucket_distrib_ptr1 and ..._ptr2 hold the bucket + number of first and last bucket which each processor will have after + the redistribution is done. */ + + bucket_sum_accumulator = 0; + local_bucket_sum_accumulator = 0; + send_displ[0] = 0; + process_bucket_distrib_ptr1[0] = 0; + for( i=0, j=0; i= (j+1)*NUM_KEYS ) + { + send_count[j] = local_bucket_sum_accumulator; + if( j != 0 ) + { + send_displ[j] = send_displ[j-1] + send_count[j-1]; + process_bucket_distrib_ptr1[j] = + process_bucket_distrib_ptr2[j-1]+1; + } + process_bucket_distrib_ptr2[j++] = i; + local_bucket_sum_accumulator = 0; + } + } + +/* When NUM_PROCS approaching NUM_BUCKETS, it is highly possible + that the last few processors don't get any buckets. So, we + need to set counts properly in this case to avoid any fallouts. */ + while( j < comm_size ) + { + send_count[j] = 0; + process_bucket_distrib_ptr1[j] = 1; + j++; + } + + TIMER_STOP( T_RANK ); + TIMER_START( T_RCOMM ); + +/* This is the redistribution section: first find out how many keys + each processor will send to every other processor: */ + MPI_Alltoall( send_count, + 1, + MPI_INT, + recv_count, + 1, + MPI_INT, + MPI_COMM_WORLD ); + +/* Determine the receive array displacements for the buckets */ + recv_displ[0] = 0; + for( i=1; i