Second Code Improvement, Milestone G, code PARK TITLE OF AGREEMENT: Numerical Simulations For Active Tectonic Processes: Increasing Interoperability And Performance AGREEMENT NUMBER: JPL Task Plan No. 83-6791 TEXT OF MILESTONE: code PARK on 1024 CPU machine with 400,000 elements, 50,000 time steps in 5 times the Baseline code Uses MPI for parallel earthquake code and parallel multipole library. PROBLEM BEING SOLVED: Compute the history of slip, slip velocity, and stress on a vertical strike-slip fault that results from using state-of-the-art rate and state frictional constitutive laws on the fault for a specific geographic setting at Parkfield, California. The boundary conditions are those appropriate for Parkfield and the distribution of constitutive properties on the fault zone are as realistic as our ability to characterize the subsurface properties of the fault there allow. The methods developed in solving this problem can be generalized to other geologic settings in which the fault geometry, the boundary conditions are not so simple and multiple faults are involved. DESCRIPTION OF COMPUTER CODES USED: The main program is a boundary element program that determines the stress on every element of the fault surface due to slip on every other element, using a Greens function approach. The fault constitutive law is used to determine what the slip velocity will be for that stress and this velocity multiplied by the time step gives the slip to be used to calculate the stress in the next time increment. This involves the forward time integration of coupled ordinary differential equations. The integration for previous milestones was done with a fifth order Runge-Kutta scheme with adaptive step size control. Because the time-steps range over ten orders of magnitude, depending on whether the fault is slipping very slowly in the interseismic period or very fast during an earthquake, the adaptive step-size control is an essential element in the solution. For this milestone the fifth order Runge Kutta routine that involved six stages, namely the determination of derivatives six times for every time step, was replaced by a faster second order routine, that involves only two stages and reuses the derivatives calculated at the end of the previous time step to begin the integration at the beginning of the current time step. The adaptive step size scheme is retained. Tests to date show that, as expected, this second order routine is 3 times faster per time step than the fifth order one. The disadvantage is that each time step is smaller, but tests to date suggest that a given model time is reached in essentially the same CPU time as with the fifth order routine. For future users who may want to also try the fifth order routine that uses two copyrighted subroutines from Numerical Recipes, an alternative version of the park.f program is provided that works with those subroutines and contains instructions on how to obtain and modify them to work in this application. The main program calls a variety of subroutines and the one of these subroutines that calculates the derivatives used in the forward time integration itself calls a Fast Multipole library that is suitable for such Green's functions problems. The Multipole approach allows a number of computations to scale as N log N rather than N^2 as would otherwise be the case. The particular Fast Multipole approach being used allows determination of the degree of grouping of the remote cells based on an analytical approximation to the Greens function. In order to reduce computation time it also renumbers the elements so that those that are near in space are also near in memory. The main program and most of its subroutines are written in Fortran 90. The Fast Multipole library and its interface program to the main program and its subroutines are written in C. All of these programs run in parallel using MPI. DOCUMENTATION: Contained herein and in the subdirectories under the 2nd_Code_Improv_Milestone directory in which this file is found. SCALING ANALYSIS: In the directory scaling, found in the same 2nd_Code_Improv_Milestone directory in which this file is found, are files that show how the job scales with number of processors. Sixty six scaling runs were done, on several different machines, including the Compaq-HP AlphaServer named halem at the NCCS at Goddard, the DEC altix machines, Columbia2 and Columbia17 at NASA Ames, and the Dell Linux cluster at JPL. An attempt was made to do runs for 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 procdessors in as many of those machines as possible, but this was only possible on the JPL Dell cluster, 256 CPUs being the maximum number tested on halem and columbia2 and 512 being the maximum number on colubia17. The acaling tests were done on models with 150,000 elements on halem and columbia2, and 400,000 on Columbia17 and the JPL cluster. The number of time steps used in these scaling runs varied from 17 to 100, depending on the machine and time available for the tests, some of which were done with the entire machine dedicated to the tests, whereas the full 400,000 element, Second Code Improvement Milestone run was done for 50,000 time steps. In the scaling directory is a data table giving the time per step for all the scaling runs, as well as plots showing dependence of time per step, efficiency, and overhead on number of processors. The scaling data show that not much speedup is gained by going from one to two processors, a problem that existed in the scaling in the First Code Improvement Milestone that we have not been able to solve, although extensive efforts were made by the scientists working on this code and by several experts at all three of the sites where the code was run. Whether this problem could be fixed in the future is not clear. It may be an intrinsic problem in the Fast Multipole Library, although for the astrophysics application the library was originally created for this problem has not been seen. Some continued work on this problem may be warranted, since solving it would decrease run times by a factor of two. Efficiency and overhead are nearly constant from 2-64 processors on all machines tested, but it falls off at the largest number of processors for all machines tested, due to too much time spent in interprocess communication. The behavior is most easily seen on the overhead plots, as well as by examination of the tables. For halem it rises somewhat going to 128 CPUs and in going to 256 CPUs it increaseds drastincaly, the time to run on 256 being greater than on 128. For Columbia2 and Columbia17 the scaling is good to 256 CPUs but on Columbia17 for which a 512 CPU run was done, it increases drastically for 512 CPUs, taking more time than on 256. Similar behavior was found on the JPL Dell cluster,in which the time on 512 and 256 CPUs was the same. The behavior on 1024 CPUs was even worse, it taking considerably more time than on 512 CPUs. As a result of this behavior, the 50,000 step Second Code Improvement run was actually done on 512 processors on the JPL Dell cluster. The milestone could equally well have been met on 256 CPUs. Due to the poor scaling above 256 CPUs on all machines tested, the only way the milestone was able to be met was to combine the advantages of 1) using the second order Runge Kutta integration scheme and 2) running on the fastest of all the machines tested, namely the JPL Dell cluster. The relative speed of the different machines can be see both in the tables and in the plot of time per step vs. number of CPUs. The implications of these scaling runs is that it is important to do scaling tests on any new machine the code may be ported to and, with the results of that available, to run it on a small enough number of processors that the interprocess communication does not dominate the behavior. With their current hardware, on Columbia and the JPL cluster the largest number of processors that should be used is 256. The usable number of processors might increase if many more than 400,000 elements were used, but the behavior is similar for both 150,000 and 400,000 elements. LOCATION FOR CODES, DATA, ETC.: Code and documentation can be found at the web site (http://www.servogrid.org/slide/GEM/PARK/). Within the appropriately named subdirectories under the 2nd_Code_Improv_Milestone directory in which this file is found can be found all the necessary material that describes the Second Code Improvement Milestone and gives instructions that would allow one to duplicate it. Included in the "in" and "out" directories are all the materials from the Second Code Improvement Milestone run with 400000 elements and 512 processors for 50,000 time steps. For code testing purposes on one's own system it is useful to set the number of time steps in the prk.dat.400003 file to a smaller number than 50000 for the initial run; even 1 or 2 would be reasonable for the first run. The materials in these directories include: Milestone_Certification_Data.txt - a file that gives the time required for the Second Code Improvement run and describes various parameters of the run. README-setting_up_input_files.txt - a file that tells one how to understand the input files including an explanation of how the elements are created from the input files. README-Compile.txt - a file that tells how to create both the multipole library and the PARK fault files using the appropriate Makefiles. in - a directory that contains the input files that were used in the Second Code Improvement run. out - a directory that contains the output files that were generated in the Second Code Improvement run. src-bin - a directory that contains the PARK and related fault application files used in the Second Code Improvement run. scaling - a directory that contains data and plots showing how execution time depends on number of elelents and processors. One file is a table giving the walltime per step for many scaling runs. One file is a Microsoft Word file with 2 imbedded plots that show the dependence on number of processors of 1) execution time,2) efficiency, and 3) overhead. In addition these 2 plots are also contained, one each, in 2 separate WindowsMetafile files. downloads - a directory that contains a unix-compressed tar file, PARK_Package_2nd_Improv.tar.Z, that allows one to generate the files needed for the Second Code Improvement run. This compressed file is the only place on the website where all the files for the Fast Multipole Library can be found, since these are too large for convenient storage or downloading in their uncompressed form. All of these library files are in the t17-7 directory. If one wants to use the copyrighted Numerical Recipes routines that use the fifth order Runge Kutta scheme, please see either the README-src-bin.txt file in the src-bin directory or the header for the park.f file to learn what needs to be done to create the Numerical Recipes subroutines. t17-7 - A directory containing the Fast Multipole Library. This is not explicitly included in the directory structure on the web site, but is in the directory structure that will be created when the compressed tar file is obtained, uncompressed and extracted. The tar files were created in the following way. While in the directory containing the 2nd_Code_Improv_Milestone directory, the following command was issued: tar -cvf PARK_Package_2nd_Improv.tar 2nd_Code_Improv_Milestone Then the .tar files were compressed by issuing: compress PARK_Package_2nd_Improv.tar to produce the PARK_Package_2nd_Improv.tar.Z file. These can be uncompressed and the directory str4ucture and files restored on unix systems by issuing: uncompress PARK_Package_2nd_Improv.tar.Z and then tar -xvf PARK_Package_2nd_Improv.tar SIGNIFICANCE OF ACHIEVING THE MILESTONES: Achieving the Second Code Improvement Milestone is significant because it opens the way to run significant sized problems since the earthquake code as well as the Fast Multipole library run in parallel under MPI. It presents to the scientific community fast parallel codes that allow creating simulations of the entire earthquake cycle in a 3D model that uses the most accurate description of fault friction, rate and state friction, and the quasi-dynamic radiation damping approximation to full elastodynamics. We have shown that the code can be run using a very large number of elements in the model compared to what could be done in the past. This means that enough elements can now be used that is it possible to represent a reasonably sized fault with elements that are small enough that they can properly represent the behavior of a continuum. Larger numbers of elements also allow occurrence in the simulation of earthquakes with a large range of sizes. This means that it will be possible to study in the simulations in what situations small earthquakes occur in isolation and in what situations they may cascade or grow into larger ones. This could help gain an understanding of whether patterns of microseismicity might be used to help predict earthquakes. The attainment of this milestone not only represents an advance in our computational ability to simulate earthquakes, it will allow us to understand the earthquake process better by creating data sets that can be compared with data on real earthquakes.