
Space decomposition based parallelization solutions for the combined finite-discrete element method in 2D

T. Lukas*, G.G. Schiava D’Albano, A. Munjiza

School of Engineering and Materials Science, Queen Mary, University of London, UK

A R T I C L E I N F O

Article history:

Received 1 July 2014

Received in revised form 6 October 2014

Accepted 8 October 2014

Available online 29 October 2014

Keywords:
Parallelization
Load balancing
PC cluster
Combined finite-discrete element method (FDEM)

The combined finite-discrete element method (FDEM) belongs to a family of methods of computational mechanics of discontinua. The method is suitable for problems of discontinua, where particles are deformable and can fracture or fragment. The applications of FDEM have spread over a number of disciplines including rock mechanics, where problems like mining, mineral processing or rock blasting can be solved by employing FDEM. In this work, a novel approach for the parallelization of two-dimensional (2D) FDEM aiming at clusters and desktop computers is developed. Dynamic domain decomposition based parallelization solvers covering all aspects of FDEM have been developed. These have been implemented into the open source Y2D software package and have been tested on a PC cluster. The overall performance and scalability of the parallel code have been studied using numerical examples. The results obtained confirm the suitability of the parallel implementation for solving large scale problems. © 2014 Institute of Rock and Soil Mechanics, Chinese Academy of Sciences. Production and hosting by Elsevier B.V. All rights reserved.

1. Introduction

The combined finite-discrete element method (FDEM), pioneered by Munjiza (2004), has become a tool of choice for problems of discontinua, where particles are deformable and can fracture or fragment. This capability is especially useful for solving problems in rock engineering (Mahabadi et al., 2010a,b; Mahabadi, 2012; Grasselli et al., 2014; Lisjak et al., 2014; Rougier et al., 2014).

The limitation of FDEM is that it is CPU-intensive and, as a consequence, it is difficult to analyze large scale problems on sequential CPU hardware. Thus, the use of high-performance parallel computers is indispensable.

Parallelization strategies for both the finite element method (FEM) (Smith et al., 2013) and the discrete element method (DEM) (Sawley and Cleary, 1999; Hendrickson and Devine, 2000) are well established. FDEM combines finite element based analysis of continua with discrete element based transient dynamics, contact detection (CD) and contact interaction solutions. As a consequence, parallelization strategies developed for FEM or DEM alone cannot be directly applied to FDEM.

The parallelization of FDEM itself is somewhat less explored. A computational procedure for finite-discrete element simulation on shared-memory parallel computers was presented by Owen et al. (2000) and Owen and Feng (2001). Parallelization for distributed-memory parallel computers following the multiple-instructions/multiple-data (MIMD) paradigm was done by Wang et al. (2004). In all those studies, a master/slave approach was adopted: one master processor performed the domain decomposition and load balancing (LB) tasks and then distributed work to the slave processors.

Some general strategies for the parallelization of FDEM are described by Munjiza et al. (2012). Static domain decomposition based parallelization of FDEM is presented by Schiava D’Albano (2014). Lei et al. (2014) developed a hardware independent FDEM parallelization framework by using the parallel virtual machine (PVM). PVM creates a single virtual parallel machine from a heterogeneous system of computers, and the execution of the parallel program is coordinated through sending and receiving messages (Geist et al., 1994).

In recent years, the use of graphics processing units (GPUs) has also been explored for both FEM and DEM. The GPU implementation of FDEM was presented by Zhang et al. (2013). The GPU parallelization of the coupled FEM/DEM approach (CDEM) was described by Wang et al. (2013).

The parallelization strategy described in this paper aims at performing all tasks (domain decomposition, LB) concurrently on all processors. As such, the authors hope that it will provide an additional contribution to developments in the field. In the rest of the paper, a detailed description of the proposed two-dimensional (2D) parallelization solutions is provided, together with numerical results.

2. The combined finite-discrete element method

FDEM couples DEM and FEM by generating a finite element mesh separately for each particle (discrete element) located within the computational domain. Thus, finite element based analysis of continua is combined with discrete element based transient dynamics, CD and contact interaction solutions.

The equation of motion is solved for each element separately. The governing equation is

M ü + C u̇ + F_int = F_ext + F_joint    (1)

where M is the mass matrix, C is the damping matrix, ü is the acceleration vector, u̇ is the velocity vector, F_int is the vector of internal forces, F_ext is the vector of external forces (including contact forces), and F_joint is the vector of forces calculated from joint elements.

The joint element acts as a bond between the edges of two triangular elements. The bond is a representation of a fracture mechanism termed the “combined single and smeared crack model” developed by Munjiza et al. (1999). The crack model is based on experimental stress-strain curves. For instance, if the forces pulling both triangular elements apart exceed the ultimate tensile strength, a crack is created by breaking the bond. Thus, the transition from continua to discontinua is introduced.

An explicit time integration scheme based on a central difference method is employed to solve the equation of motion. The following calculations must be performed in each time step:

(1) Evaluation of internal forces based on the deformation of particles.

(2) Evaluation of joint forces based on the deformation of joint elements.

(3) Fracture of joints.

(4) Contact detection (CD).

(5) Contact interaction (evaluation of contact forces).

(6) Application of external forces.

(7) Solution of the equation of motion for each discrete element separately.

It should be noted that even though the CD could be done in each time step, it would be very expensive. Thus, a buffer controlling the CD frequency is introduced and the CD is done only if the maximum traveled distance exceeds the size of this buffer.
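To make the structure of one time step concrete, the following C sketch shows how such an explicit loop with a buffered contact detection trigger could be organized. It is an illustration only: the Node structure, the task functions and the travelled-distance bookkeeping are hypothetical placeholders and do not reproduce the actual Y2D data layout or API.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical per-node state; Y2D itself stores these in flat arrays. */
typedef struct {
    double x, y;    /* current position        */
    double vx, vy;  /* velocity                */
    double fx, fy;  /* accumulated nodal force */
    double mass;    /* lumped nodal mass       */
} Node;

/* Empty stubs standing in for tasks (1)-(6) listed above. */
static void evaluate_internal_forces(Node *n, size_t c)           { (void)n; (void)c; }
static void evaluate_joint_forces_and_fracture(Node *n, size_t c) { (void)n; (void)c; }
static void do_contact_detection(Node *n, size_t c)               { (void)n; (void)c; }
static void evaluate_contact_forces(Node *n, size_t c)            { (void)n; (void)c; }
static void apply_external_forces(Node *n, size_t c)              { (void)n; (void)c; }

/* Explicit central-difference loop with a buffered CD trigger: the contact
 * search is repeated only when the accumulated upper bound on the travelled
 * distance exceeds the buffer size.                                         */
void run_simulation(Node *nodes, size_t n, double dt,
                    long nsteps, double cd_buffer)
{
    double travelled = cd_buffer;             /* forces an initial CD        */

    for (long step = 0; step < nsteps; step++) {
        double vmax = 0.0;

        for (size_t i = 0; i < n; i++)        /* reset force accumulators    */
            nodes[i].fx = nodes[i].fy = 0.0;

        evaluate_internal_forces(nodes, n);            /* task (1)           */
        evaluate_joint_forces_and_fracture(nodes, n);  /* tasks (2), (3)     */

        if (travelled >= cd_buffer) {                  /* task (4)           */
            do_contact_detection(nodes, n);
            travelled = 0.0;
        }
        evaluate_contact_forces(nodes, n);             /* task (5)           */
        apply_external_forces(nodes, n);               /* task (6)           */

        /* task (7): explicit update, node by node                           */
        for (size_t i = 0; i < n; i++) {
            nodes[i].vx += dt * nodes[i].fx / nodes[i].mass;
            nodes[i].vy += dt * nodes[i].fy / nodes[i].mass;
            nodes[i].x  += dt * nodes[i].vx;
            nodes[i].y  += dt * nodes[i].vy;
            double v = sqrt(nodes[i].vx * nodes[i].vx +
                            nodes[i].vy * nodes[i].vy);
            if (v > vmax) vmax = v;
        }
        travelled += vmax * dt;   /* conservative bound on travel since last CD */
    }
}
```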

The CD is the process of finding all contacting couples, i.e. all pairs of interacting discrete elements (DEs). The computational domain is overlaid by a CD grid of a chosen cell size (Fig. 5). Each cell can contain one or more DEs, i.e. a list of DEs for each cell is assembled. It is worth mentioning that a linear increase in the cell size results in a quadratic increase of the CD cost.

The CD algorithms developed for FDEM include the Munjiza-NBS algorithm (Munjiza et al., 1995), the MR algorithm (Munjiza et al., 2006) and the MS algorithm (Schiava D’Albano et al., 2013). Each algorithm finds contacting couples by processing the data saved in the CD grid in a different way.

Contact interaction is the mathematical model used to compute the penetration of one DE into another DE. The contact force evaluation from the calculated penetration is based on a penalty function method (Munjiza and Andrews, 2000).

The FDEM code Y2D was originally written to illustrate the concepts described in the FDEM book written by Munjiza (2004). Y2D employs the NBS CD algorithm. The cell size of the CD grid is equal to the diameter d_CD of the circumscribed circle of the largest triangular element (Fig. 5).
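As an illustration of how elements are mapped to CD grid cells of size d_CD, the following C sketch bins element centres into integer cells and chains them into one singly connected list per cell. It is a minimal sketch of the cell-binning step that NBS-type algorithms build on, not the actual Munjiza-NBS implementation; all names and the data layout are assumptions.

```c
#include <math.h>

/* Minimal cell-binning sketch underlying NBS-type contact detection.
 * Each discrete element is mapped to an integer cell of size d and
 * chained into a singly linked list per cell (head[] / next[] arrays). */
typedef struct {
    int nx, ny;      /* number of cells in x and y                 */
    double x0, y0;   /* lower-left corner of the CD grid           */
    double d;        /* cell size (circumscribed-circle diameter)  */
    int *head;       /* head[c] = first element in cell c, or -1   */
    int *next;       /* next[e] = next element in same cell, or -1 */
} CdGrid;

static void cd_grid_build(CdGrid *g, const double *cx, const double *cy,
                          int nelem)
{
    for (int c = 0; c < g->nx * g->ny; c++) g->head[c] = -1;

    for (int e = 0; e < nelem; e++) {
        /* integerised cell coordinates of the element centre */
        int ix = (int)floor((cx[e] - g->x0) / g->d);
        int iy = (int)floor((cy[e] - g->y0) / g->d);
        int c  = iy * g->nx + ix;

        g->next[e] = g->head[c];   /* push element onto the cell list */
        g->head[c] = e;
    }
}
```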

It is out of the scope of this paper to provide a detailed description of the above principles. The details of each can be found in the FDEM book (Munjiza, 2004).

3. Parallelization strategy adopted in this work

Parallelization strategies usually attempt to divide the large problem (computational domain) into a number of smaller sub-problems (sub-domains). Domain decomposition is one way to accomplish this. A good parallel implementation must meet two, often competing, requirements: each processor must be kept busy doing useful work, and communication between processors must be kept to a minimum. For typical problems in FDEM, the first requirement can be achieved only by employing dynamic domain decomposition and LB, since objects (discrete and finite elements) migrate from one sub-domain to another, thus creating a workload imbalance.

Partitioning methods used for domain decomposition can be either geometric or topological (Hendrickson and Devine, 2000). Geometric methods divide the computational domain by exploiting the location of objects in the simulation. Topological methods are employed to partition the domain depending on the connectivity of interactions instead of geometric positions. The connectivity is generally described as a graph.

One of the most commonly used topological methods is the partitioner termed METIS (Karypis, 2013). Its message-passing interface (MPI) implementation is called ParMETIS (Karypis and Schloegel, 2013). The shapes of sub-domains generated by ParMETIS are irregular and, thus, add to the complexity of a parallel implementation (Owen et al., 2000; Owen and Feng, 2001). Therefore, a geometric method based on the recursive coordinate bisection (RCB) algorithm, first introduced by Berger and Bokhari (1987), has been adopted in this work. The two main advantages of the RCB algorithm are the ease of implementation, due to the simple rectangular shape of the generated sub-domains, and the fact that the RCB algorithm is incremental: a small change in the position of elements within the sub-domain causes only a small change in the partitioning and, thus, only a fraction of the elements is redistributed among processors.

To demonstrate the parallelization strategy used, the computational domain is discretized into a rectangular grid of sub-domains by using a modified RCB algorithm (Srinivasan et al., 1997) (see Section 3.2 for details). Each sub-domain is then assigned to a single processor in the PC cluster. The parallel version of the sequential FDEM program (Y2D code) is implemented by using MPI (Pacheco, 1997; Gropp et al., 1999).

3.1. Classification of elements

In order to adopt the chosen partitioning algorithm, each sub-domain is confined to a rectangular shape. A buffer-zone is introduced around the borders of each sub-domain (see Fig. 1). The buffer-zone around the borders of a sub-domain controls the frequency of domain decomposition (migration of elements from one processor to another). The presence of new elements within the sub-domain means that the CD must be performed. Thus, for a typical FDEM simulation, the size of the buffer-zone is usually set equal to that of the buffer controlling the frequency of the CD.

A bigger size of the buffer-zone means a higher number of elements shared among processors. This increases the communication overhead within each time step (updating nodal forces), but the migration, which is a very expensive operation, is performed less often. Thus, for a highly dynamic example, the size of the buffer-zone can be set bigger, while for a quasi-static problem it can be set smaller.

Elements are classified into several categories (statuses) depending on their location within the sub-domain. This is based on a modified approach of Wang et al. (2004). Constant strain triangular elements located inside a sub-domain (Fig. 1) are marked as internal (A). Elements located at the border between two sub-domains are shared by two processors and marked as interfacial (B) (Fig. 1). Lastly, elements located at the borders between three/four sub-domains are shared by three/four processors (C3/C4) (Fig. 1). It is assumed that the size of the biggest element is smaller than the size of the sub-domain.

Joint elements are not classified by their location in the sub-domain but by the combination of triangular elements (A-A, A-B, A-C3, etc.) to which they are attached. For instance, a combination B-C3/C4 gives the joint element a status of B (Fig. 2) since both triangular elements are located on only two processors. A combination A-B/C3/C4 gives status A (Fig. 2), etc. A joint element attached to only one triangular element is deleted (Fig. 2), since the same joint element is located on a different processor where both of its triangular elements are present.
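The two classification rules above can be summarized in a short sketch: a triangle status is assigned from the position of the element relative to the sub-domain borders, and a joint status is taken from the triangle shared by the fewest processors. The enum encoding, the use of the element centre, and the use of the buffer-zone width as the interfacial band are assumptions of this sketch, not the Y2D implementation.

```c
/* Statuses: internal (A), shared by two (B), three (C3) or four (C4)
 * processors; DELETE marks a joint whose complete pair of triangles
 * lives on another processor.  Encoding is illustrative only.         */
typedef enum {
    STATUS_DELETE = 0,
    STATUS_A  = 1,   /* internal: 1 processor  */
    STATUS_B  = 2,   /* border:   2 processors */
    STATUS_C3 = 3,   /* corner:   3 processors */
    STATUS_C4 = 4    /* corner:   4 processors */
} Status;

/* Triangle status from the position of its centre (cx, cy) relative to
 * the sub-domain rectangle [xmin,xmax] x [ymin,ymax]; the band of
 * width b along each border is treated as interfacial (assumption).   */
static Status classify_triangle(double cx, double cy,
                                double xmin, double xmax,
                                double ymin, double ymax,
                                double b, int corner_has_four_subdomains)
{
    int near_x = (cx < xmin + b) || (cx > xmax - b);  /* left/right border */
    int near_y = (cy < ymin + b) || (cy > ymax - b);  /* bottom/top border */

    if (near_x && near_y)   /* corner region: 3 or 4 sub-domains meet here */
        return corner_has_four_subdomains ? STATUS_C4 : STATUS_C3;
    if (near_x || near_y)   /* single border: shared by two processors     */
        return STATUS_B;
    return STATUS_A;        /* strictly inside the sub-domain              */
}

/* Joint status: governed by the triangle shared by the fewest processors
 * (e.g. A-B -> A, B-C3 -> B); a joint attached to only one triangle on
 * this processor is deleted.                                             */
static Status joint_status(Status tri1, Status tri2, int both_triangles_present)
{
    if (!both_triangles_present)
        return STATUS_DELETE;
    return (tri1 < tri2) ? tri1 : tri2;
}
```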

Fig. 1. Classification of constant strain triangular elements into internal (A) and interfacial (B, C3, C4) elements depending on their position.

3.2. Domain partitioning

The computational domain is partitioned by a modified RCB algorithm (Srinivasan et al., 1997), originally developed for molecular dynamics simulations. The main advantage of this algorithm is the reduced complexity of the parallel implementation. Partitioning is performed hierarchically, which means that the domain is systematically partitioned at different levels. The number of partition levels equals the dimensionality of the domain (two for a 2D problem).

Fig. 2. Classification of joint elements into internal (A) and interfacial (B, C3, C4) elements depending on the statuses of the triangular elements to which the joint element is attached.

For instance, partitioning for 16 processors (Fig. 3a) is performed in two steps. In the first step, the domain is divided in the x direction into four columns containing an equal number of elements and, in the second step, each column is again divided in the y direction into four rows. For the same example, the original RCB algorithm would require four steps: the domain is first divided in half in the x direction, then each partition is divided in half in the y direction, then each partition is again divided in the x direction, etc. (Fig. 3b).
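A minimal sketch of the two-level split is given below: the x-coordinates of all elements are cut into px columns of (nearly) equal counts, and the y-coordinates within each column are then cut into py rows. The real partitioner works on the LB grid and moves borders incrementally; the raw-coordinate version here, and all names in it, are illustrative assumptions.

```c
#include <stdlib.h>

/* Hierarchical coordinate partitioning sketch for px*py processors:
 * step 1 cuts the whole element set into px columns by x-coordinate,
 * step 2 cuts each column into py rows by y-coordinate, always into
 * groups with (nearly) equal element counts.                          */
static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Put ncuts cut coordinates that split n values into ncuts+1 groups of
 * (nearly) equal size into cuts[].  The input array is sorted in place. */
static void equal_count_cuts(double *coords, int n, int ncuts, double *cuts)
{
    if (n <= 0 || ncuts <= 0) return;
    qsort(coords, (size_t)n, sizeof(double), cmp_double);
    for (int k = 1; k <= ncuts; k++)
        cuts[k - 1] = coords[(long)k * n / (ncuts + 1)];
}

static void hierarchical_partition(const double *x, const double *y, int n,
                                   int px, int py,
                                   double *xcuts,   /* px-1 values      */
                                   double *ycuts)   /* px*(py-1) values */
{
    double *tmp = malloc((size_t)n * sizeof(double));
    if (!tmp) return;

    for (int i = 0; i < n; i++) tmp[i] = x[i];
    equal_count_cuts(tmp, n, px - 1, xcuts);          /* step 1: columns */

    for (int c = 0; c < px; c++) {                    /* step 2: rows    */
        double lo = (c == 0)      ? -1e300 : xcuts[c - 1];
        double hi = (c == px - 1) ?  1e300 : xcuts[c];
        int m = 0;
        for (int i = 0; i < n; i++)
            if (x[i] >= lo && x[i] < hi) tmp[m++] = y[i];
        equal_count_cuts(tmp, m, py - 1, &ycuts[c * (py - 1)]);
    }
    free(tmp);
}
```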

3.3. Nodal forces

Nodal forces are a summation of internal forces, joint forces and contact forces. Internal forces (triangular elements) and joint forces are calculated for both internal and interfacial elements, but any force of an interfacial element is divided by 2, 3 or 4 depending on the number of processors sharing the element.

Contact forces between triangular elements are divided by a number depending on the classification of those elements. For instance, an internal element in contact with any interfacial element produces a unique contact force and therefore this force is divided by 1.0. If two interfacial elements (B-C4) are in contact, the contact force is calculated on only two out of four processors and must therefore be divided by 2.0. In general, the contact force between interfacial elements must be divided by the lower number of processors derived from the classification of both elements. Two exceptions to this rule exist. Firstly, if one interfacial element B is located at a horizontal border and the second interfacial element B at a vertical border, the contact force is unique because the corresponding counterparts of these two elements are located on two different processors (Fig. 4a). Secondly, the contact force between two interfacial elements C3, one located at the corner and the second located at the border, must be divided by 2.0 (Fig. 4b).
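The default rule can be written down compactly as below: the statuses map to processor counts (A = 1, B = 2, C3 = 3, C4 = 4) and the contact force is divided by the smaller count of the two elements in contact. The two geometric exceptions described above (perpendicular B borders, and C3 at a corner against C3 at a border) need extra positional information and are deliberately left out of this hedged sketch.

```c
/* Element statuses and the number of processors each implies
 * (same classification as in Section 3.1); encoding is illustrative. */
typedef enum { ST_A = 1, ST_B = 2, ST_C3 = 3, ST_C4 = 4 } ElemStatus;

/* Default divisor for a contact force between two elements: the lower
 * of the two processor counts, e.g. A-B -> 1.0, B-C4 -> 2.0.
 * The two exceptions discussed in the text must be handled separately. */
static double contact_force_divisor(ElemStatus s1, ElemStatus s2)
{
    double n1 = (double)s1;   /* status value equals the processor count */
    double n2 = (double)s2;
    return (n1 < n2) ? n1 : n2;
}
```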

When the calculation of nodal forces for all elements is finished, the nodal forces of interfacial elements are exchanged between the corresponding processors in each time step.

3.4. Contact detection

It follows from the rules outlined in the nodal and contact forces section that any sequential CD algorithm can be used directly in the parallel implementation of the code without any modification. Instead of performing a global contact search and parallelizing it, the CD is performed locally on each sub-domain, independently from other sub-domains. The advantage of this approach is a simplification of parallel programming.

Singly connected lists of interfacial and internal elements located in the proximity of each border/corner are assembled during the CD (Fig. 5). For the example in Fig. 5, this makes 16 lists in total. These lists are later used to assemble messages during the force exchange and also to check the new positions of elements during the migration of elements.

3.5. Migration of elements between processors

As mentioned above, elements migrate from one sub-domain to another. Since the buffer zone (calculated from the CD buffer) is introduced around the borders of the sub-domain (Fig. 1), migration occurs only if the maximum traveled distance is bigger than or equal to the size of the buffer zone. Thus, migration of elements is not performed in every time step, which would be very expensive in terms of CPU time and increased communication overhead.

When forces are exchanged, the equation of motion (Eq. (1)) for each element is solved, resulting in new positions of elements. The status of all interfacial and internal elements located close to the borders of the sub-domain must therefore be updated.

Fig. 3. Grid of 16 processors partitioned by: (a) Hierarchical RCB algorithm; (b) Original RCB algorithm.

Fig. 4. Contact force between: (a) Two interfacial elements B located at perpendicular borders; (b) Two interfacial elements C3 located at a border and a corner.

Fig. 5. Singly connected lists of elements located in the proximity of the borders of a sub-domain, assembled during the CD.

Fig. 6. (a) Horizontal send-receive communication between 8 processors in one time slot, performed in 3 steps. (b) Vertical communication performed in two time slots.

Singly connected lists assembled during the CD (Fig. 5) are employed to check the new positions of elements, and the status of each element is updated if necessary. This reduces the cost of migration significantly since only the positions of elements located in the proximity of the borders of the sub-domain are checked. The approximate position of an element is already known beforehand since a separate list is assembled for each border/corner.

Elements which leave the sub-domain must be deleted from the database, and new elements that have moved into the sub-domain from neighboring processors are added to the database. It is therefore necessary to perform a new CD search in the next time step. It is worth noting that the CD is not done from scratch; the singly connected lists of contacting couples are only updated, taking into account deleted/received elements.

3.6. Communication between processors

All main communications (nodal force exchange, migration of elements, redistribution of elements during the LB) are performed in two separate stages. Horizontal messages (assembled for the right and left borders) are exchanged in the first stage. In the second stage, messages assembled for the top and bottom borders are exchanged vertically. Information for a neighboring diagonal processor (if needed) is first sent in the horizontal message and then sent again from the receiving processor in the vertical message. Hence, there is no diagonal communication.

Both horizontal and vertical communications are divided into two time slots. In the first time slot, messages are exchanged between processors in columns/rows 0-1, 2-3, 4-5, etc., and in the second time slot between columns/rows 1-2, 3-4, etc. Fig. 6b shows both time slots for the vertical communication.

Fig. 7. Partitioning of the global LB grid in the x and y directions.

The horizontal communication in each time slot is done in several steps to avoid interlocking (Fig. 6a). The number in column (NIC) is calculated for each processor and the communication pairs for horizontal communication are assembled. In the first step, messages between processors with equal NIC are exchanged (Fig. 6a). In the second step, messages between processors with NIC ± 1 are sent, and in the third step, communication between processors with NIC ± 2 is performed, etc. The vertical communication in each time slot is done in one step since each processor has only one neighbor at the top/bottom border (Fig. 6b). Thus, all vertical communication is finished in two steps in total, regardless of the number of processors.
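A sketch of the vertical stage is given below, using blocking MPI_Sendrecv calls. It assumes a column-major rank numbering (rank = col*py + row, so the NIC corresponds to the row index) and placeholder buffers and tags; the horizontal stage follows the same pattern but loops over the NIC offsets 0, ±1, ±2, ... described above.

```c
#include <mpi.h>

/* Vertical stage of the blocking exchange on a px*py processor grid.
 * Assumption of this sketch: ranks are numbered column by column,
 * rank = col*py + row, so the "number in column" (NIC) is the row index.
 * Buffers, counts and tags are placeholders.                           */
static void exchange_vertical(double *send_up, double *recv_up,
                              double *send_dn, double *recv_dn,
                              int count, int col, int row, int py,
                              MPI_Comm comm)
{
    int up   = (row + 1 < py) ? col * py + row + 1 : MPI_PROC_NULL;
    int down = (row > 0)      ? col * py + row - 1 : MPI_PROC_NULL;

    /* Time slot 0 pairs rows 0-1, 2-3, ...; time slot 1 pairs rows 1-2, 3-4, ... */
    for (int slot = 0; slot < 2; slot++) {
        int talk_up   = (row % 2 == slot)     ? up   : MPI_PROC_NULL;
        int talk_down = (row % 2 == 1 - slot) ? down : MPI_PROC_NULL;

        MPI_Sendrecv(send_up, count, MPI_DOUBLE, talk_up,   0,
                     recv_up, count, MPI_DOUBLE, talk_up,   0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(send_dn, count, MPI_DOUBLE, talk_down, 0,
                     recv_dn, count, MPI_DOUBLE, talk_down, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```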

It should be noted that the communication described above employs so-called blocking communication. If the computation on the receiving processor is completed faster than the computation on the sending processor, the receiving processor must wait for the sending processor to send the message. This provides the user with a means to synchronize the execution of the program on different processors. As a consequence, overlapping communication and computation is not possible.

Fig. 8. A box filled by 32,400 discrete elements, each comprising 6 finite elements. The box is fixed in both the x and y directions. Elements are moving in the diagonal direction. Domain decomposition for 16 processors at time: (a) 0 s; (b) 2 s; (c) 7.5 s.

4. Load balancing

Each sub-domain is assigned to a single processor in the PC cluster. Elements migrate between processors (the size of the sub-domains does not change) until the imbalance in the workload exceeds a value specified in the input file. Then re-partitioning (the size of each sub-domain is updated) and LB are performed. This is done in the following steps.

Step 1. The LB grid is assembled during the CD on each processor locally. Each cell in the LB grid contains the list of elements and the count of elements. Since the count of elements alone is not enough to estimate the workload in each cell, the number of contacting couples for each element located in the cell is added to the count. After adding the “weight” of contacting couples, the counts of elements in the local LB grids are exchanged between all processors and the global LB grid is assembled.
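A minimal sketch of the per-cell workload estimate is shown below: each element contributes 1 plus its number of contacting couples to the weight of its LB-grid cell, and the local grids are then summed over all processors. The array names and the use of MPI_Allreduce for the global sum are assumptions of the sketch.

```c
#include <mpi.h>

/* Local workload per LB-grid cell: element count plus the "weight" of
 * contacting couples, followed by a global sum over all processors.   */
static void assemble_global_lb_grid(const int *cell_of_elem,    /* LB cell per element     */
                                    const int *couples_of_elem, /* contacting couples/elem */
                                    int nelem,
                                    long *local_w, long *global_w, int ncells,
                                    MPI_Comm comm)
{
    for (int c = 0; c < ncells; c++) local_w[c] = 0;

    for (int e = 0; e < nelem; e++)
        local_w[cell_of_elem[e]] += 1 + couples_of_elem[e];

    /* One possible way to build the global LB grid from the local ones. */
    MPI_Allreduce(local_w, global_w, ncells, MPI_LONG, MPI_SUM, comm);
}
```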

The cell size of the CD grid is given by the maximum element diameter d_CD (see Munjiza et al. (1995)). The cell size of the LB grid, d_LB, is then given by

d_LB = d_CD / I_LB    (2)

where I_LB is a parameter larger than 1. Hence, the LB grid has a finer resolution than the CD grid in order to achieve better re-partitioning and an even distribution of the workload. The limitation of setting a finer LB grid is an increased communication overhead and higher RAM requirements. Thus, the reasonable range of I_LB is from 2 to 10. I_LB is set to 4 in the numerical examples presented in this paper.

Table 1. Recorded CPU time and calculated speedup for the box filled by 32,400 discrete elements.

Step 2. Re-partitioning is performed by using the global LB grid. The partitioning procedure comprises:

(1) Calculation of the sums of each column in the LB grid.

(2) Partitioning in the x direction into a specified number of columns (Step 1 in Fig. 7).

(3) Calculation of the sums of each row in the LB grid for each column separately.

(4) Partitioning of each column in the y direction into a specified number of rows (Step 2 in Fig. 7).

Partitioning into four processors is illustrated by the example in Fig. 7; a minimal sketch of the equal-weight one-dimensional split used in both directions is given below.
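The column and row splits both reduce to the same one-dimensional operation: placing cuts in an array of LB-grid weights so that each part carries roughly the same total weight, with cut positions restricted to LB-cell boundaries. The sketch below shows one way to do this; it is illustrative only and ignores the incremental movement of existing borders.

```c
/* Place ncuts cuts in a row/column of LB-grid weights w[0..ncells-1] so
 * that each of the ncuts+1 parts carries roughly the same total weight.
 * cut_index[k] is the first cell of part k+1, i.e. borders move in whole
 * LB-cell increments.                                                    */
static void equal_weight_cuts(const long *w, int ncells,
                              int ncuts, int *cut_index)
{
    long total = 0;
    for (int c = 0; c < ncells; c++) total += w[c];

    long running = 0;
    int  k = 0;
    for (int c = 0; c < ncells && k < ncuts; c++) {
        running += w[c];
        /* place cut k once (k+1)/(ncuts+1) of the total weight is reached */
        while (k < ncuts && running * (ncuts + 1) >= total * (long)(k + 1))
            cut_index[k++] = c + 1;
    }
}
```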

Fig. 9. Speedup up to 32 processors for the box filled by 32,400 discrete elements.

Step 3. Positions of elements saved in cells in the proximity of borders/corners are checked and the status of each element is updated. Elements are then redistributed among processors. Since the RCB algorithm is an incremental partitioner, only a small change in the size of each sub-domain is needed to perform the LB. This significantly reduces the cost of re-partitioning since only a small number of elements need to be redistributed among processors.

It is worth noting that the borders move in increments. The size of the increment is equal to the cell size of the LB grid, d_LB. Thus, a small imbalance is created during each re-partitioning. This can be minimized by setting a smaller d_LB.

Fig. 10. Material distribution for the Barre Granite Brazilian disc, where green represents quartz, blue represents feldspar and orange represents biotite.

5. CPU performance tests

The parallel code was tested on a PC cluster with 3592 nodes. Each node contains two 8-core 2.60 GHz Intel Xeon E5-2670 CPUs and 32 GB of DDR3 1600 MHz RAM.

5.1. A box filled by 32,400 discrete elements

To illustrate the LB and re-partitioning procedures, a box filled by 32,400 discrete elements, each comprising 6 finite elements, with an initial velocity of 100 m/s in the diagonal direction, was tested on up to 32 processors. The box is fixed in both the x and y directions. The properties of each element are as follows: modulus of elasticity E = 990 MPa, Poisson's ratio ν = 0.5, and contact penalty of 1.32 GPa. The time step is 0.1 ms and the simulation was run for 75,000 time steps. The domain decomposition for 16 processors at different times is shown in Fig. 8.

The recorded CPU time and calculated speedup are summarized in Table 1 and the speedup is plotted in Fig. 9. The ideal speedup in Fig. 9 means that the speedup is equal to the number of processors.

This performance test can be considered a worst case scenario from the LB point of view, since all discrete elements are moving in the same direction across the box. Moreover, the number of contact interactions, which are very expensive in terms of CPU time, is limited for the majority of the simulation time. Therefore, the performance is dominated by the communication overhead, caused mainly by element migration and also by the redistribution of elements during the LB. This is especially true for 2 processors, since the size of the messages (the number of elements located at the border between processors) is quite big. The performance improves with a higher number of processors (Table 1). The results suggest that the communication cost scales well with an increasing number of processors.

5.2. Brazilian disc test

The Barre Granite Brazilian disc test is numerically simulated on up to 32 processors. Barre Granite is a heterogeneous rock comprising approximately 24% quartz, 68% feldspar and 8% biotite (Nasseri et al., 2006). The input file was generated using Y-GUI (Mahabadi et al., 2010b), as shown in Fig. 10. The material properties for the test are summarized in Table 2 (Mahabadi, 2012). The shear strength is set to a high value to prevent fracturing in Mode II (fracturing caused by shear stress). The radius of the disc is 40 mm with a unit thickness. The disc comprises 52,308 triangular elements and 75,466 joint elements. The loading rate of the platens is 0.5 mm/s.

The recorded simulation time for different numbers of processors and the calculated speedup are summarized in Table 3 and the speedup is plotted in Fig. 11. It can be seen from Fig. 11 that the speedup has an almost linear trend. The decrease in speedup for a higher number of processors is expected, as the ratio of computation (number of elements assigned to each sub-domain) to communication decreases.

Stress σyy and fracture patterns for the sequential and parallel solutions are shown in Fig. 12a-c. The kinetic energy of the system for the whole simulation time for different numbers of processors is plotted in Fig. 12d.

The fracture patterns obtained for different numbers of processors show a good correspondence. The small differences are caused by the presence of rounding errors. Rounding errors are introduced due to the limited amount of memory available for storing real numbers (Goldberg, 1991). Numbers like π would need an infinite amount of memory, thus only approximate values of real numbers are stored. The different fracture patterns are reflected in the relative changes in the kinetic energy (Fig. 12d).

Table 2. Material properties for the Barre Granite Brazilian disc test (Mahabadi, 2012).

Table 3. Recorded CPU time and calculated speedup for the Brazilian disc test.

The simple Coulomb friction model implemented in Y2D is not suitable for quasi-static problems (such as the Brazilian disc). A version of the Y code named Y-Geo (Mahabadi et al., 2012) addresses this problem, among others. It should be noted that the proposed parallelization strategy is directly applicable to Y-Geo as well as to other versions of the Y code.

Fig. 11. Speedup up to 32 processors for the Brazilian disc test.

6. Conclusions

A dynamic domain decomposition parallelization strategy for FDEM has been presented in this work. Performance tests of the current parallel implementation confirm the suitability of this approach for parallelizing FDEM.

The speedup calculated for the Brazilian disc test simulation scales well with an increasing number of processors. Decreasing performance with an increasing number of processors can be observed as the ratio of computation to communication decreases. Thus, the performance is expected to improve with increasing problem size.

The speedup calculated for the box filled by discrete elements scales linearly and is approximately equal to half the number of processors used, since the simulation time is dominated by the communication cost. This example shows that the communication cost scales well with a higher number of processors.

Fig. 12. Results obtained for the Brazilian disc test: (a) Stress σyy (Pa), 1 processor at 0.15 s; (b) Stress σyy (Pa), 16 processors at 0.15 s; (c) Stress σyy (Pa), 32 processors at 0.15 s; (d) Kinetic energy of the system for 1, 16 and 32 processors.

The performance tests show the suitability of the proposed communication pattern: the number of horizontal communication steps increases only moderately, while the number of vertical communication steps remains constant, independent of the number of processors.

The current version of the parallel implementation employs blocking communication. Further improvement of the performance should be achieved by employing non-blocking communication, thus performing communication concurrently with computations. The actual implementation of the parallelization is available in the Y2D open source format.

Conflict of interest

The authors wish to confirm that there are no known conflicts of interest associated with this publication, and there has been no significant financial support for this work that could have influenced its outcome.

Acknowledgments

The authors would like to express their thanks to Mohammad Saadatfar from the Australian National University for access to NCI's Raijin supercomputer.

References

Berger MJ, Bokhari SH. A partitioning strategy for nonuniform problems on multiprocessors. IEEE Transactions on Computers 1987;C-36(5):570-80.

Geist A, Beguelin A, Dongarra J, Jiang W, Manchek R, Sunderam V. PVM: Parallel Virtual Machine: a user's guide and tutorial for networked parallel computing. Cambridge, Massachusetts: MIT Press; 1994.

Goldberg D. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys 1991;23:5-48.

Grasselli G, Lisjak A, Mahabadi OK, Tatone BSA. Influence of pre-existing discontinuities and bedding planes on hydraulic fracturing initiation. European Journal of Environmental and Civil Engineering 2014. http://dx.doi.org/10.1080/19648189.2014.906367 [in press].

Gropp W, Lusk E, Thakur R. Using MPI-2: advanced features of the message-passing interface. Cambridge, Massachusetts: MIT Press; 1999.

Hendrickson B, Devine K. Dynamic load balancing in computational mechanics. Computer Methods in Applied Mechanics and Engineering 2000;184(2-4):485-500.

Karypis G, Schloegel K. ParMETIS 4.0: parallel graph partitioning and sparse matrix ordering library. Minneapolis, MN, USA: Department of Computer Science and Engineering, University of Minnesota; 2013.

Karypis G. METIS 5.1.0: a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing ordering of sparse matrices. Minneapolis, MN, USA: Department of Computer Science and Engineering, University of Minnesota; 2013.

Lei Z, Rougier E, Knight EE, Munjiza A. A framework for grand scale parallelization of the combined finite discrete element method in 2D. Computational Particle Mechanics 2014;1(3):307-19.

Lisjak A, Grasselli G, Vietor T. Continuum-discontinuum analysis of failure mechanisms around unsupported circular excavations in anisotropic clay shales. International Journal of Rock Mechanics and Mining Sciences 2014;65:96-115.

Mahabadi OK, Cottrell BE, Grasselli G. An example of realistic modelling of rock dynamics problems: FEM/DEM simulation of dynamic Brazilian test on Barre Granite. Rock Mechanics and Rock Engineering 2010a;43(6):707-16.

Mahabadi OK, Grasselli G, Munjiza A. Y-GUI: a graphical user interface and preprocessor for the combined finite-discrete element code, Y2D, incorporating material heterogeneity. Computers and Geosciences 2010b;36(2):241-52.

Mahabadi OK, Lisjak A, Grasselli G, Munjiza A. Y-Geo: a new combined finite-discrete element numerical code for geomechanical applications. International Journal of Geomechanics 2012;12(6):676-88.

Mahabadi OK. Investigating the influence of micro-scale heterogeneity and microstructure on the failure and mechanical behaviour of geomaterials. PhD Thesis. Toronto, Canada: University of Toronto; 2012.

Munjiza A, Andrews KRF, White JK. Combined single and smeared crack model in combined finite-discrete element analysis. International Journal for Numerical Methods in Engineering 1999;44(1):41-57.

Munjiza A, Andrews KRF. Penalty function method for combined finite-discrete element systems comprising large number of separate bodies. International Journal for Numerical Methods in Engineering 2000;49(11):1377-96.

Munjiza A, Knight EE, Rougier E. Computational mechanics of discontinua. Chichester, UK: John Wiley & Sons, Inc.; 2012.

Munjiza A, Owen DRJ, Bicanic N. A combined finite-discrete element method in transient dynamics of fracturing solids. Engineering Computations 1995;12(2):145-74.

Munjiza A, Rougier E, John NWM. MR linear contact detection algorithm. International Journal for Numerical Methods in Engineering 2006;66(1):46-71.

Munjiza A. The combined finite-discrete element method. Chichester, UK: John Wiley & Sons, Inc.; 2004.

Nasseri MHB, Mohanty B, Young B. Fracture toughness measurements and acoustic emission activity in brittle rocks. Pure and Applied Geophysics 2006;163(5-6):917-45.

Owen DRJ, Feng YT, Han K, Peric D. Dynamic domain decomposition and load balancing in parallel simulation of finite/discrete elements. In: European Congress on Computational Methods in Applied Sciences and Engineering. Barcelona, Spain: ECCOMAS; 2000. p. 11-4.

Owen DRJ, Feng YT. Parallelised finite/discrete element simulation of multi-fracturing solids and discrete systems. Engineering Computations 2001;18(3-4):557-76.

Pacheco PS. Parallel programming with MPI. San Francisco, CA, USA: Morgan Kaufmann Publishers; 1997.

Rougier E, Knight EE, Broome ST, Sussman AJ, Munjiza A. Validation of a three-dimensional finite-discrete element method using experimental results of the Split Hopkinson Pressure Bar test. International Journal of Rock Mechanics and Mining Sciences 2014;70:101-8.

Sawley M, Cleary P. A parallel discrete element method for industrial granular flow simulations. EPFL Supercomputing Review 1999;11:23-9.

Schiava D’Albano GG, Munjiza A, Lukas T. Novel MS (Munjiza-Schiava) contact detection algorithm for multi-core architectures. In: Particles 2013, Stuttgart, Germany; 2013.

Schiava D’Albano GG. Computational and algorithmic solutions for large scale combined finite-discrete elements simulations. PhD Thesis. London, UK: Queen Mary, University of London; 2014.

Smith IM, Griffiths DV, Margetts L. Programming the finite element method. 5th ed. Chichester, UK: John Wiley & Sons, Inc.; 2013.

Srinivasan SG, Ashok I, Jonsson H, Kalonji G, Zahorjan J. Dynamic-domain-decomposition parallel molecular dynamics. Computer Physics Communications 1997;102(1-3):44-58.

Wang F, Feng YT, Owen DRJ. Parallelization for finite-discrete element analysis in a distributed-memory environment. International Journal of Computational Engineering Science 2004;5(1):1-23.

Wang L, Li S, Zhang G, Ma Z, Zhang L. A GPU-based parallel procedure for nonlinear analysis of complex structures using a coupled FEM/DEM approach. Mathematical Problems in Engineering 2013:1-15. http://dx.doi.org/10.1155/2013/618980.

Zhang L, Quigley SF, Chan AHC. A fast scalable implementation of the two-dimensional triangular discrete element method on a GPU platform. Advances in Engineering Software 2013;60-61:70-80.

Tomas Lukas obtained an MSc in Mechanical Engineering at VSB-Technical University of Ostrava, after which he worked at the same university as a research assistant in the field of rotordynamics. After two years, he started PhD studies at Queen Mary, University of London (QMUL) under the supervision of Professor Antonio Munjiza, where he conducted research on the parallelization of the combined finite-discrete element method. During his PhD studies, he worked as a part-time research assistant and was also employed as a teaching assistant at QMUL. He obtained his PhD degree in Autumn 2014.

*Corresponding author. Tel.: +44 (0)20 7882 5300.

E-mail address: t.lukas@qmul.ac.uk (T. Lukas).

Peer review under responsibility of Institute of Rock and Soil Mechanics, Chinese Academy of Sciences.

1674-7755 © 2014 Institute of Rock and Soil Mechanics, Chinese Academy of Sciences. Production and hosting by Elsevier B.V. All rights reserved.

http://dx.doi.org/10.1016/j.jrmge.2014.10.001

