Chapter 31. Parallel Processing

The following sections describe the parallel-processing features of FLUENT.

• Section 31.1: Introduction to Parallel Processing
• Section 31.2: Starting Parallel FLUENT on a Windows System
• Section 31.3: Starting Parallel FLUENT on a Linux/UNIX System
• Section 31.4: Checking Network Connectivity
• Section 31.5: Partitioning the Grid
• Section 31.6: Checking and Improving Parallel Performance

31.1 Introduction to Parallel Processing

The FLUENT serial solver manages file input and output, data storage, and flow field calculations using a single solver process on a single computer. FLUENT's parallel solver allows you to compute a solution by using multiple processes that may be executing on the same computer, or on different computers in a network. Figures 31.1.1 and 31.1.2 illustrate the serial and parallel FLUENT architectures.

Parallel processing in FLUENT involves an interaction between FLUENT, a host process, and a set of compute-node processes. FLUENT interacts with the host process and the collection of compute nodes using a utility called cortex, which manages FLUENT's user interface and basic graphical functions.

Parallel FLUENT splits up the grid and data into multiple partitions, then assigns each grid partition to a different compute process (or node). The number of partitions must be an integral multiple of the number of compute nodes available to you (e.g., 8 partitions for 1, 2, 4, or 8 compute nodes). The compute-node processes can be executed on a massively parallel computer, a multiple-CPU workstation, or a network cluster of computers.

Note: In general, as the number of compute nodes increases, turnaround time for the solution will decrease. However, parallel efficiency decreases as the ratio of communication to computation increases, so you should be careful to choose a large enough problem for the parallel machine.

FLUENT uses a host process that does not contain any grid data. Instead, the host process only interprets commands from FLUENT’s graphics-related interface, cortex.


Figure 31.1.1: Serial FLUENT Architecture (cortex, a single solver process holding the cell/face/node data, and file input/output to disk)

Figure 31.1.2: Parallel FLUENT Architecture (cortex and the host process with file input/output to disk, connected by socket to compute node 0; compute nodes 0-3 each hold their own cell/face/node data and communicate via MPI)


The host distributes those commands via a socket interconnect to a single designated compute node called compute-node-0. This specialized compute node distributes the host commands to the other compute nodes. Each compute node simultaneously executes the same program on its own data set. Communication from the compute nodes to the host is possible only through compute-node-0 and only when all compute nodes have synchronized with each other.

Each compute node is virtually connected to every other compute node, and relies on inter-process communication to perform such functions as sending and receiving arrays, synchronizing, and performing global operations (such as summations over all cells). Inter-process communication is managed by a message-passing library. For example, the message-passing library could be a vendor implementation of the Message Passing Interface (MPI) standard, as depicted in Figure 31.1.2.

All of the parallel FLUENT processes (as well as the serial process) are identified by a unique integer ID. The host collects messages from compute-node-0 and performs operations (such as printing, displaying messages, and writing to a file) on all of the data, in the same way as the serial solver.

Recommended Usage of Parallel FLUENT

The recommended procedure for using parallel FLUENT is as follows:

1. Start up the parallel solver. See Section 31.2: Starting Parallel FLUENT on a Windows System and Section 31.3: Starting Parallel FLUENT on a Linux/UNIX System for details.

2. Read your case file and have FLUENT partition the grid automatically upon loading it. It is best to partition after the problem is set up, since partitioning has some model dependencies (e.g., adaption on non-conformal interfaces, sliding-mesh and shell-conduction encapsulation). Note that there are other approaches for partitioning, including manual partitioning in either the serial or the parallel solver. See Section 31.5: Partitioning the Grid for details.

3. Review the partitions and perform partitioning again, if necessary. See Section 31.5.6: Checking the Partitions for details on checking your partitions.

4. Calculate a solution. See Section 31.6: Checking and Improving Parallel Performance for information on checking and improving the parallel performance.


31.2 Starting Parallel FLUENT on a Windows System

You can run FLUENT on a Windows system using either command line options or the graphical user interface. Information about starting FLUENT on a Windows system is provided in the following sections:

• Section 31.2.1: Starting Parallel FLUENT on a Windows System Using Command Line Options
• Section 31.2.2: Starting Parallel FLUENT on a Windows System Using the Graphical User Interface
• Section 31.2.3: Starting Parallel FLUENT with the Fluent Launcher
• Section 31.2.4: Starting Parallel FLUENT with the Microsoft Job Scheduler (win64 Only)

Note: See the separate installation instructions for more information about installing parallel FLUENT for Windows. The startup instructions below assume that you have properly set up the necessary software, based on the appropriate installation instructions. Additional information about installation issues can also be found in the Frequently Asked Questions section of the Fluent Inc. User Services Center (www.fluentusers.com).

31.2.1 Starting Parallel FLUENT on a Windows System Using Command Line Options

To start the parallel version of FLUENT using command line options, you can use the following syntax in a Command Prompt window:

fluent version -tnprocs [-pinterconnect] [-mpi=mpi type] -cnf=hosts file -path\\computer name\share name

where

• version must be replaced by the version of FLUENT you want to run (2d, 3d, 2ddp, or 3ddp).

• -path\\computer name\share name specifies the computer name and the shared network name for the Fluent.Inc directory in UNC form. For example, if FLUENT has been installed on computer1 and shared as fluent.inc, then you should replace share name by the UNC name for the shared directory, \\computer1\fluent.inc.


• -pinterconnect (optional) specifies the type of interconnect. The ethernet interconnect is used by default if the option is not explicitly specified. See Table 31.2.1, Table 31.2.2, and Table 31.2.3 for more information.

• -mpi=mpi type (optional) specifies the type of MPI. If the option is not specified, the default MPI for the given interconnect will be used (the use of the default MPI is recommended). The available MPIs for Windows are shown in Table 31.2.2.

• -cnf=hosts file specifies the hosts file, which contains a list of the computers on which you want to run the parallel job. If the hosts file is not located in the directory where you are typing the startup command, you will need to supply the full pathname to the file. You can use a plain text editor such as Notepad to create the hosts file. The only restriction on the filename is that there should be no spaces in it. For example, hosts.txt is an acceptable hosts file name, but my hosts.txt is not.

Your hosts file (e.g., hosts.txt) might contain the following entries:

computer1
computer2

Note: The last entry must be followed by a blank line.

If a computer in the network is a multiprocessor, you can list it more than once. For example, if computer1 has 2 CPUs, then, to take advantage of both CPUs, the hosts.txt file should list computer1 twice:

computer1
computer1
computer2

• -tnprocs specifies the number of processes to use. When the -cnf option is present, the hosts file argument is used to determine which computers to use for the parallel job. For example, if there are 8 computers listed in the hosts file and you want to run a job with 4 processes, set nprocs to 4 (i.e., -t4) and FLUENT will use the first 4 machines listed in the hosts file.

For example, the full command line to start a 3d parallel job on the first 4 computers listed in a hosts file called hosts.txt is as follows:

fluent 3d -t4 -cnf=hosts.txt -path\\computer1\fluent.inc


The default interconnect (ethernet) and the default communication library (mpich2) will be used since these options are not specified.
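If you need to override these defaults, the -p and -mpi flags described above can be combined in the same command. As an illustrative sketch only (reusing the hypothetical machine and share names from above), the following would start a four-process double-precision 3D job that selects Microsoft MPI explicitly on a win64 system:

fluent 3ddp -t4 -mpi=ms -cnf=hosts.txt -path\\computer1\fluent.inc

Before selecting an MPI this way, confirm in Table 31.2.3 that it is supported for your architecture and interconnect.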

Note: The first time that you try to run FLUENT in parallel, a separate Command Prompt will open, prompting you to verify the current Windows account that you are logged into. Press the <Enter> key if the account is correct. If you have a new account password, enter your password and press the <Enter> key, then verify your password and press the <Enter> key again. Once the username and password have been verified and encrypted into the Windows Registry, FLUENT parallel will launch.

The supported interconnects for dedicated parallel ntx86 and win64 Windows machines, the associated MPIs for them, and the corresponding syntax are listed in Tables 31.2.1-31.2.3:

Table 31.2.1: Supported Interconnects for the Windows Platform

Platform   Processor   Architecture   Interconnects*
Windows    32-bit      ntx86          ethernet (default)
Windows    64-bit      win64          ethernet (default), infiniband, myrinet

(*) Node processes on the same machine communicate by shared memory.

Table 31.2.2: Available MPIs for Windows Platforms

MPI      Syntax (flag)   Communication Library   Notes
mpich2   -mpi=mpich2     MPICH2 MPI              (1), (2)
ms       -mpi=ms         Microsoft MPI           (1), (2)
net      -mpi=net        socket                  (1), (2)

(1) Used with Shared Memory Machines (SMM), where the memory is shared between the processors on a single machine.
(2) Used with Distributed Memory Machines (DMM), where each processor has its own memory associated with it.


Table 31.2.3: Supported MPIs for Windows Architectures (Per Interconnect)

Architecture   Ethernet                    Myrinet   Infiniband
ntx86          mpich2 (default), net       -         -
win64          mpich2 (default), ms, net   ms        ms

31.2.2 Starting Parallel FLUENT on a Windows System Using the Graphical User Interface

To run parallel FLUENT using the graphical user interface, type the usual startup command without a version (i.e., fluent -t2), and then use the Select Solver panel (Figure 31.2.1) to specify the parallel architecture and version information.

File −→Run...

Figure 31.2.1: The Select Solver Panel


Perform the following actions:

1. Under Versions, specify the 3D or 2D single- or double-precision version by turning the 3D and Double Precision options on or off, and turn on the Parallel option.

2. Under Options, select the interconnect or system in the Interconnect drop-down list. The Default setting is recommended, because it selects the interconnect that should provide the best overall parallel performance for your dedicated parallel machine. For a symmetric multi-processor (SMP) system, the Default setting uses shared memory for communication. If you prefer to select a specific interconnect, you can choose Ethernet/Shared Memory MPI, Myrinet, Infiniband, or Ethernet via sockets. For more information about these interconnects, see Table 31.2.1, Table 31.2.2, and Table 31.2.3.

3. Set the number of CPUs in the Processes field.

4. (optional) Specify the name of a file containing a list of machines, one per line, in the Hosts File field.

5. Click the Run button to start the parallel version. No additional setup is required once the solver starts.

Note: The first time that you try to run FLUENT in parallel, a separate Command Prompt will open, prompting you to verify the current Windows account that you are logged into. Press the <Enter> key if the account is correct. If you have a new account password, enter your password and press the <Enter> key, then verify your password and press the <Enter> key again. Once the username and password have been verified and encrypted into the Windows Registry, FLUENT parallel will launch.


31.2.3 Starting Parallel FLUENT with the Fluent Launcher

The Fluent Launcher (Figure 31.2.2) is a stand-alone Windows application that allows you to launch FLUENT jobs from a computer with a Windows operating system to a cluster of computers. Settings made in the Fluent Launcher panel (Figure 31.2.2) are used to create a FLUENT parallel command. This command is then distributed to your network, where typically another application may manage the session(s).

You can create a shortcut on your desktop pointing to the Fluent Launcher executable at

FLUENT_INC\fluent6.x\launcher\launcher.exe

where FLUENT_INC is the root path to where FLUENT is installed (i.e., usually the FLUENT_INC environment variable) and x indicates the release version of FLUENT.

Figure 31.2.2: The Fluent Launcher Panel


The Fluent Launcher allows you to perform the following:

• Set options for your FLUENT executable, such as indicating a specific release or a version number.

• Set parallel options, such as indicating the number of parallel processes (or if you want to run a serial process), and an MPI type to use for parallel computations.

• Set additional options, such as specifying the name and location of the current working folder or a journal file.

When you are ready to launch your serial or parallel application, you can check the validity of the settings using the Check button (messages are displayed in the Log Information window). When you are satisfied with the settings, click the Launch button to start the parallel processes.

To return to your default settings for the Fluent Launcher, based on your current FLUENT installation, click the Default button. The fields in the Fluent Launcher panel will return to their original settings.

When you are finished using the Fluent Launcher, click the Close button. Any settings that you have made in the panel are preserved when you re-open the Fluent Launcher.


Setting the Path to FLUENT

You need to specify where FLUENT is installed on your system using the Fluent.Inc Path field, or click ... to browse through your directory structure to locate the installation folder (using the UNC path if applicable). Once set, various fields in the Fluent Launcher (e.g., Release, MPI Types, etc.) are automatically populated with the available options, depending on the FLUENT installations that are available.

Setting Executable Options with the Fluent Launcher

Under Executable Options, you can indicate the release number, as well as the version of the FLUENT executable that you want to run.

Specifying a FLUENT Release

Depending on what FLUENT releases are available in the Fluent.Inc Path, you can specify the number associated with a given release in the Release list. The list is populated with the FLUENT release numbers that are available in the Fluent.Inc Path field.

Specifying the Version of FLUENT

You can specify the dimensionality and the precision of the FLUENT product using the Version list. There are four possible choices: 2d, 2ddp, 3d, or 3ddp. The 2d and 3d options provide single-precision results for two-dimensional or three-dimensional problems, respectively. The 2ddp and 3ddp options provide double-precision results for two-dimensional or three-dimensional problems, respectively.

Setting Parallel Options with the Fluent Launcher

Under Parallel Options, you can indicate the number of FLUENT processes, the specific computer architecture you want to run the processes on, the type of MPI, as well as a listing of computer nodes that you want to use in the calculations.

Specifying the Number of FLUENT Processes

You can specify the number of FLUENT processes in the Number of Processes field. You can use the drop-down list to select from pre-set values of serial, 1, 2, 4, 8, 16, 32, or 64, or you can manually enter the number into the field yourself (e.g., 3, 10, etc.). The number of parallel processes can range from 1 to 1024. If Number of Processes is equal to 1, you might want to consider running the FLUENT job using the serial setting.


Specifying the Computer Architecture

You can specify the computer architecture using the Architecture drop-down list. Depending on the selected release, the available options are ntx86 and win64.

Specifying the MPI Type

You can specify the MPI to use for the parallel computations using the MPI Types field. The list of MPI types varies depending on the selected release and the selected architecture. There are several options, based on the operating system of the parallel cluster. For more information about the available MPI types, see Tables 31.2.1-31.2.2.

Specifying the List of Machines to Run FLUENT

Specify the hosts file using the Machine List or File field. You can use the ... button to browse for a hosts file, or you can enter the machine names directly into the text field. Machine names can be separated either by a comma or a space.

Setting Additional Options with the Fluent Launcher

Under Additional Options, you can specify a working folder and/or a journal file. In addition, you can specify whether to use the Microsoft Scheduler or whether to use benchmarking options.

Specifying the Working Folder

You can specify the path of your current working directory using the Working Folder field, or click ... to browse through your directory structure. Note that a UNC path cannot be set as a working folder.

Specifying a Journal File

You can specify the path and name of a journal file using the Journal File field, or click ... to browse through your directory structure to locate the file. Using the journal file, you can automatically load the case, compile any user-defined functions, iterate until the solution converges, and write results to an output file.
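For reference, a minimal journal of this kind might look like the sketch below. The file names and iteration count are hypothetical, and the exact text commands can vary between FLUENT releases, so treat it as an outline rather than a ready-made script:

; read the case, initialize the flow, iterate, and save the data
/file/read-case mycase.cas
/solve/initialize/initialize-flow
/solve/iterate 500
/file/write-data mycase.dat

Since a journal is simply a recorded sequence of the same text commands you could type at the FLUENT console, the most reliable way to build one is to record it from an interactive session.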


Specifying Whether or Not to Use the Microsoft Job Scheduler (win64 MS MPI Only)

For the Windows 64-bit MS MPI only, you can specify that you want to use the Microsoft Job Scheduler (see Section 31.2.4: Starting Parallel FLUENT with the Microsoft Job Scheduler (win64 Only)) by selecting the Use Microsoft Scheduler check box. Once selected, you can then enter a machine name in the with Head Node text field. If you are running FLUENT on the head node, then you can keep the field empty. This option translates into the proper parallel command line syntax for using the Microsoft Job Scheduler.

Specifying Whether or Not to Use the Fluent Launcher for Benchmarking

If you are creating benchmark cases using parallel FLUENT, you can enable the Benchmark check box. This option requires having several benchmarking-related files available on your machine. If you are missing any of the files, the Fluent Launcher informs you which files you need and how to locate them.

Fluent Launcher Example

The Fluent Launcher takes the options that you have specified and uses those settings to create a FLUENT parallel command. This command (displayed in the Log Information window) is then distributed to your network, where typically another application may manage the session(s).

For example, suppose that, in the Fluent Launcher panel, you specified your Fluent.Inc Path to be \\my_computer\Fluent.Inc and, under Executable Options, you selected 6.3.20 for the Release and 3d for the Version. Then, under Parallel Options, you selected 2 for the number of Processes, win64 for the Architecture, selected mpich2 in the MPI Types field, and entered the location of a Z:\fluent.hosts file in the Machine List or File field. If you click the Check button, the command is displayed in the Log Information window. When you click the Launch button, the Fluent Launcher would then generate the following parallel command:

\\my_computer\Fluent.Inc\ntbin\win64\fluent.exe 3d -r6.3.20 -t2 -mpi=mpich2 -cnf=Z:\fluent.hosts -awin64


31.2.4 Starting Parallel FLUENT with the Microsoft Job Scheduler (win64 Only)

The Microsoft Job Scheduler allows you to manage multiple jobs and tasks, allocate computer resources, send tasks to compute nodes, and monitor jobs, tasks, and compute nodes. FLUENT currently supports Windows XP as well as the Windows Server operating system (win64 only). The Windows Server operating system includes a "compute cluster package" (CCP) that combines the Microsoft MPI type (msmpi) and the Microsoft Job Scheduler.

FLUENT provides a means of using the Microsoft Job Scheduler via the following flag in the parallel command:

-ccp head-node-name

where -ccp indicates the use of the compute cluster package, and head-node-name indicates the name of the head node of the computer cluster. For example, if you want to use the Job Scheduler, the corresponding command syntax would be:

fluent 3d -t2 -ccp head-node-name

Likewise, if you do not want to use the Job Scheduler, the following command syntax can be used with msmpi:

fluent 3d -t2 -pmsmpi -cnf=host

Note: The first time that you try to run FLUENT in parallel, a separate Command Prompt will open, prompting you to verify the current Windows account that you are logged into. If you have a new account password, enter your password and press the <Enter> key. If you want FLUENT to remember your password on this machine, press the Y key and then press the <Enter> key. Once the username and password have been verified and encrypted into the Windows Registry, FLUENT parallel will launch.

Note: If you do not want to use the Microsoft Job Scheduler, but you still want to use msmpi, you will need to stop the Microsoft Compute Cluster MPI Service through the Control Panel, and start your own instance of SMPD (the process manager for msmpi on Windows) by issuing the following command on each host on which you want to run FLUENT:

start smpd -d 0


31.3 Starting Parallel FLUENT on a Linux/UNIX System

You can run FLUENT on a Linux/UNIX system using either command line options or the graphical user interface. Information about starting FLUENT on a Linux/UNIX system is provided in the following sections:

• Section 31.3.1: Starting Parallel FLUENT on a Linux/UNIX System Using Command Line Options
• Section 31.3.2: Starting Parallel FLUENT on a Linux/UNIX System Using the Graphical User Interface
• Section 31.3.3: Setting Up Your Remote Shell and Secure Shell Clients

31.3.1 Starting Parallel FLUENT on a Linux/UNIX System Using Command Line Options

To start the parallel version of FLUENT using command line options, you can use the following syntax in a command prompt window:

fluent version -tnprocs [-pinterconnect] [-mpi=mpi type] -cnf=hosts file

where

• version must be replaced by the version of FLUENT you want to run (2d, 3d, 2ddp, or 3ddp).

• -pinterconnect (optional) specifies the type of interconnect. The ethernet interconnect is used by default if the option is not explicitly specified. See Table 31.3.1, Table 31.3.2, and Table 31.3.3 for more information.

• -mpi=mpi type (optional) specifies the type of MPI. If the option is not specified, the default MPI for the given interconnect will be used (the use of the default MPI is recommended). The available MPIs for Linux/UNIX are shown in Table 31.3.2.

• -cnf=hosts file specifies the hosts file, which contains a list of the computers on which you want to run the parallel job. If the hosts file is not located in the directory where you are typing the startup command, you will need to supply the full pathname to the file. You can use a plain text editor to create the hosts file. The only restriction on the filename is that there should be no spaces in it. For example, hosts.txt is an acceptable hosts file name, but my hosts.txt is not.


Your hosts file (e.g., hosts.txt) might contain the following entries:

computer1
computer2

Note: The last entry must be followed by a blank line.

If a computer in the network is a multiprocessor, you can list it more than once. For example, if computer1 has 2 CPUs, then, to take advantage of both CPUs, the hosts.txt file should list computer1 twice:

computer1
computer1
computer2

• -tnprocs specifies the number of processes to use. When the -cnf option is present, the hosts file argument is used to determine which computers to use for the parallel job. For example, if there are 10 computers listed in the hosts file and you want to run a job with 5 processes, set nprocs to 5 (i.e., -t5) and FLUENT will use the first 5 machines listed in the hosts file.

For example, to use the Myrinet interconnect, and to start the 3D solver with 4 compute nodes on the machines defined in the text file called fluent.hosts, you can enter the following in the command prompt:

fluent 3d -t4 -pmyrinet -cnf=fluent.hosts

Note that if the optional -cnf=hosts file is specified, a compute node will be spawned on each machine listed in the file hosts file. (If you enter this optional argument, do not include the square brackets.)

The supported interconnects for parallel Linux/UNIX machines are listed below (Table 31.3.1, Table 31.3.2, and Table 31.3.3), along with their associated communication libraries, the corresponding syntax, and the supported architectures:


Table 31.3.1: Supported Interconnects for Linux/UNIX Platforms (Per Platform)

Platform   Processor        Architecture      Interconnects/Systems*
Linux      32-bit           lnx86             ethernet (default), infiniband, myrinet
Linux      64-bit           lnamd64           ethernet (default), infiniband, myrinet, crayx
Linux      64-bit Itanium   lnia64            ethernet (default), infiniband, myrinet, altix
Sun        32-bit           ultra             vendor** (default), ethernet
Sun        64-bit           ultra 64          vendor** (default), ethernet
SGI        32-bit           irix65 mips4      vendor** (default), ethernet
SGI        64-bit           irix65 mips4 64   vendor** (default), ethernet
HP         32-bit           hpux11            vendor** (default), ethernet
HP         64-bit PA-RISC   hpux11 64         vendor** (default), ethernet
HP         64-bit Itanium   hpux11 ia64       vendor** (default), ethernet
IBM        32-bit           aix51             vendor** (default), ethernet
IBM        64-bit           aix51 64          vendor** (default), ethernet

(*) Node processes on the same machine communicate by shared memory.
(**) vendor indicates a proprietary vendor interconnect. The specific proprietary interconnects that are supported are dictated by those that the vendor's MPI supports.


Table 31.3.2: Available MPIs for Linux/UNIX Platforms

MPI       Syntax (flag)   Communication Library   Notes
hp        -mpi=hp         HP MPI                  General purpose, for SMPs and clusters
intel     -mpi=intel      Intel MPI               General purpose, for SMPs and clusters
mpich2    -mpi=mpich2     MPICH2                  MPI-2 implementation from Argonne National Laboratory; for both SMPs and Ethernet clusters
mpich     -mpi=mpich      MPICH1                  Legacy MPI from Argonne National Laboratory
mpichmx   -mpi=mpichmx    MPICH-MX                Only for Myrinet MX clusters
mvapich   -mpi=mvapich    MVAPICH                 Only for Infiniband clusters
sgi       -mpi=sgi        SGI MPI for Altix       Only for SGI Altix systems (SMP); must start FLUENT on a system where parallel node processes are to run
cray      -mpi=cray       Cray MPI for XD1        Only for Cray XD1 systems
vendor    -mpi=vendor     Vendor MPI
net       -mpi=net        socket


Table 31.3.3: Supported MPIs for Linux/UNIX Architectures (Per Interconnect)

Architecture      Ethernet                       Myrinet                  Infiniband                     Proprietary Systems
lnx86             hp (default), mpich2, net      hp                       hp                             -
lnamd64           hp (default), intel, net       hp (default), mpich-mx   hp (default), intel, mvapich   cray [for -pcrayx]
lnia64            hp (default), intel, net       hp                       hp (default), intel            sgi [for -paltix]
aix51 64          vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
hpux11 64         vendor (default), net          -                        -                              vendor [for -pvendor]
hpux11 ia64       vendor (default), net          -                        -                              vendor [for -pvendor]
irix65 mips4 64   vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
ultra 64          vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
aix51             vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
hpux11            vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
irix65 mips4      vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
ultra             vendor (default), mpich, net   -                        -                              vendor [for -pvendor]
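To illustrate how these tables combine, consider a hypothetical lnamd64 cluster with an Infiniband interconnect and a hosts file named fluent.hosts. An eight-process double-precision run that accepts the default MPI for that interconnect (hp, per Table 31.3.3) could be started with:

fluent 3ddp -t8 -pinfiniband -cnf=fluent.hosts

Appending -mpi=intel or -mpi=mvapich would instead select one of the alternative Infiniband MPIs listed for that architecture.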


31.3.2 Starting Parallel FLUENT on a Linux/UNIX System Using the Graphical User Interface

To run parallel FLUENT using the graphical user interface, type the usual startup command without a version (i.e., fluent), and then use the Select Solver panel (Figure 31.3.1) to specify the parallel architecture and version information.

File −→Run...

Figure 31.3.1: The Select Solver Panel


Perform the following steps:

1. Under Versions, specify the 3D or 2D single- or double-precision version by turning the 3D and Double Precision options on or off, and turn on the Parallel option.

2. Under Options, select the interconnect or system in the Interconnect drop-down list. The Default setting is recommended, because it selects the interconnect that should provide the best overall parallel performance for your dedicated parallel machine. For a symmetric multi-processor (SMP) system, the Default setting uses shared memory for communication. If you prefer to select a specific interconnect, you can choose Ethernet/Shared Memory MPI, Myrinet, Infiniband, Altix, Cray, or Ethernet via sockets. For more information about these interconnects, see Table 31.3.1, Table 31.3.2, and Table 31.3.3.

3. Set the number of CPUs in the Processes field.

4. (optional) Specify the name of a file containing a list of machines, one per line, in the Hosts File field.

5. Click the Run button to start the parallel version. No additional setup is required once the solver starts.

31.3.3 Setting Up Your Remote Shell and Secure Shell Clients

For cluster computing on Linux or UNIX systems, most parallel versions of FLUENT require your user account to be set up such that you can connect to all nodes on the cluster (using either the remote shell (rsh) client or the secure shell (ssh) client) without having to enter a password each time for each machine. Provided that the appropriate server daemons (either rshd or sshd) are running, this section briefly describes how you can configure your system in order to use FLUENT for parallel computing.


Configuring the rsh Client

The remote shell client (rsh) is widely deployed and used. It is generally easy to configure; configuration involves adding all the machine names, each on its own line, to the .rhosts file in your home directory.

If you refer to the machine you are currently logged on to as the 'client', and to the remote machine to which you seek password-less login as the 'server', then on the server you can add the name of your client machine to the .rhosts file. The name can be a local name or a fully qualified name with the domain suffix. Similarly, you can add other clients from which you require similar access to this server. These machines are then "trusted" and remote access is allowed without the further need for a password.

This setup assumes you have the same userid on all the machines. Otherwise, each line in the .rhosts file needs to contain the machine name as well as the userid for the client that you want access for. Please refer to your system documentation for further usage options. Note that for security purposes, the .rhosts file must be readable only by the user.
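As a minimal sketch (the host names and userid here are hypothetical), a .rhosts file on the server that grants password-less rsh access to user jdoe from two client machines could contain:

computer1.example.com jdoe
computer2.example.com jdoe

After creating the file, restrict its permissions so that only you can read it, for example with chmod 600 ~/.rhosts.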

Configuring the ssh Client

The secure shell client (ssh) is a more secure alternative to rsh and is also widely used. Depending on the specific protocol and the version deployed, configuration involves a few steps. SSH1 and SSH2 are two current protocols. OpenSSH is an open implementation of the SSH2 protocol and is backwards compatible with the SSH1 protocol. To add a client machine, with respect to user configuration, the following steps are involved:

1. Generate a public-private key pair using ssh-keygen (or using a graphical user interface client). For example:

% ssh-keygen -t dsa

creates a Digital Signature Algorithm (DSA) type key pair.

2. Place your public key on the remote host.

• For SSH1, insert the contents of the client's ~/.ssh/identity.pub into the server's ~/.ssh/authorized_keys.

• For SSH2, insert the contents of the client's ~/.ssh/id_dsa.pub into the server's ~/.ssh/authorized_keys2.

The client machine is now added to the access list and the user is no longer required to type in a password each time. For additional information, consult your system administrator or refer to your system documentation.
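A minimal sketch of step 2 for SSH2, assuming a hypothetical remote host named server1 that already has a ~/.ssh directory:

% scp ~/.ssh/id_dsa.pub server1:~/client_key.pub
% ssh server1 "cat ~/client_key.pub >> ~/.ssh/authorized_keys2"
% ssh server1

The final ssh command should now log you in without prompting for a password; if it does not, check the permissions on ~/.ssh and on the authorized keys file on the server.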


31.4 Checking Network Connectivity

For any compute node, you can print network connectivity information that includes the hostname, architecture, process ID, and ID of the selected compute node and all machines connected to it. The ID of the selected compute node is marked with an asterisk. The ID for the FLUENT host process is always host. The compute nodes are numbered sequentially starting from node-0. All compute nodes are completely connected. In addition, compute node 0 is connected to the host process.

To obtain connectivity information for a compute node, you can use the Parallel Connectivity panel (Figure 31.4.1).

Parallel −→Show Connectivity...

Figure 31.4.1: The Parallel Connectivity Panel

Indicate the compute node ID for which connectivity information is desired in the Compute Node field, and then click the Print button. Sample output for compute node 0 is shown below:

------------------------------------------------------------------------------
ID     Comm.   Hostname   O.S.       PID     Mach ID   HW ID   Name
------------------------------------------------------------------------------
host   net     balin      Linux-32   17272   0         7       Fluent Host
n3     hp      balin      Linux-32   17307   1         10      Fluent Node
n2     hp      filio      Linux-32   17306   0         -1      Fluent Node
n1     hp      bofur      Linux-32   17305   0         1       Fluent Node
n0*    hp      balin      Linux-32   17273   2         11      Fluent Node
------------------------------------------------------------------------------

O.S. is the architecture, Comm. is the communication library (i.e., the MPI type), PID is the process ID number, Mach ID is the compute node ID, and HW ID is an identifier specific to the interconnect used.


31.5 Partitioning the Grid

Information about grid partitioning is provided in the following sections:

• Section 31.5.1: Overview of Grid Partitioning
• Section 31.5.2: Preparing Hexcore Meshes for Partitioning
• Section 31.5.3: Partitioning the Grid Automatically
• Section 31.5.4: Partitioning the Grid Manually
• Section 31.5.5: Grid Partitioning Methods
• Section 31.5.6: Checking the Partitions
• Section 31.5.7: Load Distribution

31.5.1 Overview of Grid Partitioning

When you use the parallel solver in FLUENT, you need to partition or subdivide the grid into groups of cells that can be solved on separate processors (see Figure 31.5.1). You can either use the automatic partitioning algorithms when reading an unpartitioned grid into the parallel solver (the recommended approach, described in Section 31.5.3: Partitioning the Grid Automatically), or perform the partitioning yourself in the serial solver or after reading a mesh into the parallel solver (as described in Section 31.5.4: Partitioning the Grid Manually). In either case, the available partitioning methods are those described in Section 31.5.5: Grid Partitioning Methods.

You can partition the grid before or after you set up the problem (by defining models, boundary conditions, etc.), although it is better to partition after the setup, due to some model dependencies (e.g., adaption on non-conformal interfaces, sliding-mesh and shell-conduction encapsulation).

Note: If your case file contains sliding meshes, or non-conformal interfaces on which you plan to perform adaption during the calculation, you will have to partition it in the serial solver. See Sections 31.5.3 and 31.5.4 for more information.

Note: If your case file contains a mesh generated by the GAMBIT Hex Core meshing scheme or the TGrid Mesh/Hexcore menu option (hexcore mesh), you must filter the mesh using the tpoly utility or TGrid prior to partitioning the grid. See Section 31.5.2: Preparing Hexcore Meshes for Partitioning for more information.

Note that the relative distribution of cells among compute nodes will be maintained during grid adaption, except if non-conformal interfaces are present, so repartitioning after adaption is not required. See Section 31.5.7: Load Distribution for more information.


If you use the serial solver to set up the problem before partitioning, the machine on which you perform this task must have enough memory to read in the grid. If your grid is too large to be read into the serial solver, you can read the unpartitioned grid directly into the parallel solver (using the memory available in all the defined hosts) and have it automatically partitioned. In this case you will set up the problem after an initial partition has been made. You will then be able to manually repartition the case if necessary. See Sections 31.5.3 and 31.5.4 for additional details and limitations, and Section 31.5.6: Checking the Partitions for details about checking the partitions.

Figure 31.5.1: Partitioning the Grid (the domain before partitioning, and the same domain after partitioning into Partition 0 and Partition 1, separated by an interface boundary)


31.5.2 Preparing Hexcore Meshes for Partitioning

If you generate meshes using the GAMBIT Hex Core meshing scheme or the TGrid Mesh/Hexcore menu option (hexcore meshes), the meshes often have features that can interfere with partitioning. Such features include hanging nodes and overlapping parent-child faces, and are located at the transition between the core of hexahedral cells and the surrounding body-fitted mesh. To remove these features before you partition your hexcore meshes, you must convert the transitional hexahedral cells into polyhedra. The dimensions of each of these transitional cells remain the same after conversion, but the cells will have more than the original 6 faces.

The conversion to polyhedra must take place prior to reading the mesh into FLUENT, and can be done using either the tpoly utility or TGrid. When you use the tpoly utility, you must specify an input case file that contains a hexcore mesh. This file can be in either ASCII or binary format, and the file should be unzipped. If the input file does not contain a hexcore mesh, then none of the cells are converted to polyhedra. You should also specify an output case file name. Once the input file has been processed by the tpoly filter, an ASCII output file is generated.

Note: The output case file resulting from a tpoly conversion contains only mesh information. None of the solver-related data of the input file is retained.

To convert a file using the tpoly filter, before starting FLUENT, type the following (a concrete example appears below):

utility tpoly input filename output filename

You can also use TGrid to convert the transitional cells to polyhedra. You must either read in or create the hexcore mesh in TGrid, and then save the mesh as a case file with polyhedra. To do this, use the File/Write/Case... menu option, being sure to enable the Write As Polyhedra option in the Select File dialog box.
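For instance, with hypothetical file names, converting a hexcore mesh stored in hexcore.cas and writing the filtered result to hexcore-poly.cas with the tpoly utility would look like this:

utility tpoly hexcore.cas hexcore-poly.cas

The resulting hexcore-poly.cas can then be read into FLUENT and partitioned as usual.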

Limitations

Converted hexcore meshes have the following limitations:

• The following grid manipulation tools are not available on polyhedral meshes:
  – extrude-face-zone under the modify-zone option
  – fuse
  – skewness smoothing
  – swapping (will not affect polyhedral cells)

• The polyhedral cells that result from the conversion are not eligible for adaption. For more information about adaption, see Chapter 26: Adapting the Grid.


31.5.3 Partitioning the Grid Automatically

For automatic grid partitioning, you can select the bisection method and other options for creating the grid partitions before reading a case file into the parallel version of the solver. For some of the methods, you can perform pretesting to ensure that the best possible partition is performed. See Section 31.5.5: Grid Partitioning Methods for information about the partitioning methods available in FLUENT.

Note that if your case file contains sliding meshes, or non-conformal interfaces on which you plan to perform adaption during the calculation, you will need to partition it in the serial solver, and then read it into the parallel solver, with the Case File option turned on in the Auto Partition Grid panel (the default setting).

The procedure for partitioning automatically in the parallel solver is as follows:

1. (optional) Set the partitioning parameters in the Auto Partition Grid panel (Figure 31.5.2).

Parallel −→Auto Partition...

Figure 31.5.2: The Auto Partition Grid Panel

If you are reading in a mesh file or a case file for which no partition information is available, and you keep the Case File option turned on, FLUENT will partition the grid using the method displayed in the Method drop-down list. If you want to specify the partitioning method and associated options yourself, the procedure is as follows:

(a) Turn off the Case File option. The other options in the panel will become available.

(b) Select the bisection method in the Method drop-down list. The choices are the techniques described in Section 31.5.5: Bisection Methods.

(c) You can choose to independently apply partitioning to each cell zone, or you can allow partitions to cross zone boundaries using the Across Zones check button. It is recommended that you not partition cell zones independently (by turning off the Across Zones check button) unless cells in different zones will require significantly different amounts of computation during the solution phase (e.g., if the domain contains both solid and fluid zones).

(d) If you have chosen the Principal Axes or Cartesian Axes method, you can improve the partitioning by enabling the automatic testing of the different bisection directions before the actual partitioning occurs. To use pretesting, turn on the Pre-Test option. Pretesting is described in Section 31.5.5: Pretesting.

(e) Click OK.

If you have a case file where you have already partitioned the grid, and the number of partitions divides evenly into the number of compute nodes, you can keep the default selection of Case File in the Auto Partition Grid panel. This instructs FLUENT to use the partitions in the case file.

2. Read the case file.

File −→ Read −→Case...

Reporting During Auto Partitioning

As the grid is automatically partitioned, some information about the partitioning process will be printed in the text (console) window. If you want additional information, you can print a report from the Partition Grid panel after the partitioning is completed.

Parallel −→Partition...

When you click the Print Active Partitions or Print Stored Partitions button in the Partition Grid panel, FLUENT will print the partition ID, number of cells, faces, and interfaces, and the ratio of interfaces to faces for each active or stored partition in the console window. In addition, it will print the minimum and maximum cell, face, interface, and face-ratio variations. See Section 31.5.6: Interpreting Partition Statistics for details.

You can examine the partitions graphically by following the directions in Section 31.5.6: Checking the Partitions.

31.5.4 Partitioning the Grid Manually

Automatic partitioning in the parallel solver (described in Section 31.5.3: Partitioning the Grid Automatically) is the recommended approach to grid partitioning, but it is also possible to partition the grid manually in either the serial solver or the parallel solver. After automatic or manual partitioning, you will be able to inspect the partitions created (see Section 31.5.6: Checking the Partitions) and optionally repartition the grid, if necessary. Again, you can do so within the serial or the parallel solver, using the Partition Grid panel. A partitioned grid may also be used in the serial solver without any loss in performance.


Guidelines for Partitioning the Grid

The following steps are recommended for partitioning a grid manually:

1. Partition the grid using the default bisection method (Principal Axes) and optimization (Smooth).

2. Examine the partition statistics, which are described in Section 31.5.6: Interpreting Partition Statistics. Your aim is to achieve small values of Interface ratio variation and Global interface ratio while maintaining a balanced load (Cell variation). If the statistics are not acceptable, try one of the other bisection methods.

3. Once you determine the best bisection method for your problem, you can turn on Pre-Test (see Section 31.5.5: Pretesting) to improve it further, if desired.

4. You can also improve the partitioning using the Merge optimization, if desired.

Instructions for manual partitioning are provided below.

Using the Partition Grid Panel

For grid partitioning, you need to select the bisection method for creating the grid partitions, set the number of partitions, select the zones and/or registers, and choose the optimizations to be used. For some methods, you can also perform pretesting to ensure that the best possible bisection is performed. Once you have set all the parameters in the Partition Grid panel to your satisfaction, click the Partition button to subdivide the grid into the selected number of partitions using the prescribed method and optimization(s). See above for recommended partitioning strategies.

You can set the relevant inputs in the Partition Grid panel (Figure 31.5.3 in the parallel solver, or Figure 31.5.4 in the serial solver) in the following manner:

Parallel −→Partition...

1. Select the bisection method in the Method drop-down list. The choices are the techniques described in Section 31.5.5: Bisection Methods.

2. Set the desired number of grid partitions in the Number integer number field. You can use the counter arrows to increase or decrease the value, instead of typing in the box. The number of grid partitions must be an integral multiple of the number of processors available for parallel computing.


Figure 31.5.3: The Partition Grid Panel in the Parallel Solver

Figure 31.5.4: The Partition Grid Panel in the Serial Solver


3. You can choose to independently apply partitioning to each cell zone, or you can allow partitions to cross zone boundaries using the Across Zones check button. It is recommended that you not partition cell zones independently (by turning off the Across Zones check button) unless cells in different zones will require significantly different amounts of computation during the solution phase (e.g., if the domain contains both solid and fluid zones).

4. You can select Encapsulate Grid Interfaces if you would like the cells surrounding all non-conformal grid interfaces in your mesh to reside in a single partition at all times during the calculation. If your case file contains non-conformal interfaces on which you plan to perform adaption during the calculation, you will have to partition it in the serial solver, with the Encapsulate Grid Interfaces and Encapsulate for Adaption options turned on.

5. If you have enabled the Encapsulate Grid Interfaces option in the serial solver, the Encapsulate for Adaption option will also be available. When you select this option, additional layers of cells are encapsulated such that transfer of cells will be unnecessary during parallel adaption.

6. You can activate and control the desired optimization methods (described in Section 31.5.5: Optimizations) using the items under Optimizations. You can activate the Merge and Smooth schemes by turning on the Do check button next to each one. For each scheme, you can also set the number of Iterations. Each optimization scheme will be applied until appropriate criteria are met, or the maximum number of iterations has been executed. If the Iterations counter is set to 0, the optimization scheme will be applied until completion, without limit on the maximum number of iterations.

7. If you have chosen the Principal Axes or Cartesian Axes method, you can improve the partitioning by enabling the automatic testing of the different bisection directions before the actual partitioning occurs. To use pretesting, turn on the Pre-Test option. Pretesting is described in Section 31.5.5: Pretesting.

8. In the Zones and/or Registers lists, select the zone(s) and/or register(s) that you want to partition. For most cases, you will select all Zones (the default) to partition the entire domain. See below for details.

9. You can assign selected Zones and/or Registers to a specific partition ID by entering a value for the Set Selected Zones and Registers to Partition ID. For example, if the Number of partitions for your grid is 2, then you can only use IDs of 0 or 1. If you have three partitions, then you can enter IDs of 0, 1, or 2. This can be useful in situations where the gradient in a region is known to be high. In such cases, you can mark the region or zone and set the marked cells to one of the partition IDs, thus preventing the partition from going through that region. This in turn will facilitate convergence. This is also useful in cases where mesh manipulation tools are not available in parallel. In this case, you can assign the related cells to a particular ID so that the grid manipulation tools become functional.

If you are running the parallel solver, and you have marked your region and assigned an ID to the selected Zones and/or Registers, click the Use Stored Partitions button to make the new partitions valid. Refer to the example described later in this section for a demonstration of how selected registers are assigned to a partition.

10. Click the Partition button to partition the grid.

11. If you decide that the new partitions are better than the previous ones (if the grid was already partitioned), click the Use Stored Partitions button to make the newly stored cell partitions the active cell partitions. The active cell partition is used for the current calculation, while the stored cell partition (the last partition performed) is used when you save a case file.

12. When using the dynamic mesh model in your parallel simulations, the Partition panel includes an Auto Repartition option and a Repartition Interval setting. These parallel partitioning options are provided because FLUENT migrates cells when local remeshing and smoothing are performed; as a result, the partition interface can become very wrinkled and the load balance may deteriorate. By default, the Auto Repartition option is selected: a percentage of interface faces and loads are automatically traced, and FLUENT automatically determines the most appropriate repartition interval based on various simulation parameters. If the Auto Repartition option provides insufficient results, the Repartition Interval setting can be used instead. The Repartition Interval setting lets you specify the interval (in time steps or iterations, respectively) at which a repartition is enforced. When repartitioning is not desired, you can set the Repartition Interval to zero.

Note: When dynamic meshes and local remeshing are utilized, updated meshes may be slightly different in parallel FLUENT (when compared to serial FLUENT, or to a parallel solution created with a different number of compute nodes), resulting in very small differences in the solutions.


Example of Setting Selected Registers to Specified Partition IDs

1. Start FLUENT in parallel. The case in this example was partitioned across two nodes.

2. Read in your case.

3. Display the grid with the Partitions option enabled in the Display Grid panel (Figure 31.5.5).

Figure 31.5.5: The Partitioned Grid

4. Adapt your region and mark your cells (see Section 26.7.3: Performing Region Adaption). This creates a register.


5. Open the Partition Grid panel.

6. Keep the Set Selected Zones and Registers to Partition ID set to 0 and click the corresponding button. This prints the following output to the FLUENT console window:

>> 2 Active Partitions:
----------------------------------------------------------------------
Collective Partition Statistics:      Minimum    Maximum    Total
----------------------------------------------------------------------
Cell count                            459        459        918
Mean cell count deviation             0.0%       0.0%
Partition boundary cell count         11         11         22
Partition boundary cell count ratio   2.4%       2.4%       2.4%

Face count                            764        1714       2461
Mean face count deviation             -38.3%     38.3%
Partition boundary face count         13         13         17
Partition boundary face count ratio   0.8%       1.7%       0.7%

Partition neighbor count              1          1
----------------------------------------------------------------------
Partition Method                      Principal Axes
Stored Partition Count                2
Done.


7. Click the Use Stored Partitions button to make the new partitions valid. This migrates the partitions to the compute nodes. The following output is then printed to the FLUENT console window:

Migrating partitions to compute-nodes.
>> 2 Active Partitions:
P  Cells  I-Cells  Cell Ratio  Faces  I-Faces  Face Ratio  Neighbors
0  672    24       0.036       2085   29       0.014       1
1  246    24       0.098       425    29       0.068       1
----------------------------------------------------------------------
Collective Partition Statistics:      Minimum    Maximum    Total
----------------------------------------------------------------------
Cell count                            246        672        918
Mean cell count deviation             -46.4%     46.4%
Partition boundary cell count         24         24         48
Partition boundary cell count ratio   3.6%       9.8%       5.2%

Face count                            425        2085       2461
Mean face count deviation             -66.1%     66.1%
Partition boundary face count         29         29         49
Partition boundary face count ratio   1.4%       6.8%       2.0%

Partition neighbor count              1          1
----------------------------------------------------------------------
Partition Method                      Principal Axes
Stored Partition Count                2
Done.

8. Display the grid (Figure 31.5.6).

9. This time, set the Set Selected Zones and Registers to Partition ID to 1 and click the corresponding button. This prints a report to the FLUENT console.

10. Click the Use Stored Partitions button to make the new partitions valid and to migrate the partitions to the compute nodes.

11. Display the grid (Figure 31.5.7). Notice that the partition now appears in a different location, as specified by your partition ID.

Note: Although this example demonstrates setting selected registers to specific partition IDs in parallel, it can be similarly applied in serial.


Figure 31.5.6: The Partitioned ID Set to Zero

Figure 31.5.7: The Partitioned ID Set to 1


Partitioning Within Zones or Registers

The ability to restrict partitioning to cell zones or registers gives you the flexibility to apply different partitioning strategies to subregions of a domain. For example, if your geometry consists of a cylindrical plenum connected to a rectangular duct, you may want to partition the plenum using the Cylindrical Axes method, and the duct using the Cartesian Axes method.

If the plenum and the duct are contained in two different cell zones, you can select one at a time and perform the desired partitioning, as described in Section 31.5.4: Using the Partition Grid Panel. If they are not in two different cell zones, you can create a cell register (basically a list of cells) for each region using the functions that are used to mark cells for adaption. These functions allow you to mark cells based on physical location, cell volume, gradient or isovalue of a particular variable, and other parameters. See Chapter 26: Adapting the Grid for information about marking cells for adaption. Section 26.11.1: Manipulating Adaption Registers provides information about manipulating different registers to create new ones. Once you have created a register, you can partition within it as described above.

i

Note that partitioning within zones or registers is not available when Metis is selected as the partition Method.

For dynamic mesh applications (see item 11 above), FLUENT stores the partition method used to partition the respective zone. Therefore, if repartitioning is done, FLUENT uses the same method that was used to partition the mesh.

Reporting During Partitioning

As the grid is partitioned, information about the partitioning process will be printed in the text (console) window. By default, the solver prints the number of partitions created, the number of bisections performed, the time required for the partitioning, and the minimum and maximum cell, face, interface, and face-ratio variations. (See Section 31.5.6: Interpreting Partition Statistics for details.) If you increase the Verbosity to 2 from the default value of 1, the partition method used, the partition ID, the number of cells, faces, and interfaces, and the ratio of interfaces to faces for each partition will also be printed in the console window. If you decrease the Verbosity to 0, only the number of partitions created and the time required for the partitioning will be reported.

You can request that a portion of this report be printed again after the partitioning is completed. When you click the Print Active Partitions or Print Stored Partitions button in the parallel solver, FLUENT will print the partition ID, number of cells, faces, and interfaces, and the ratio of interfaces to faces for each active or stored partition in the console window. In addition, it will print the minimum and maximum cell, face, interface, and face-ratio variations. In the serial solver, you will obtain the same information about the stored partition when you click Print Partitions. See Section 31.5.6: Interpreting Partition Statistics for details.

i

Recall that to make the stored cell partitions the active cell partitions you must click the Use Stored Partitions button. The active cell partition is used for the current calculation, while the stored cell partition (the last partition performed) is used when you save a case file.

Resetting the Partition Parameters

If you change your mind about your partition parameter settings, you can easily return to the default settings assigned by FLUENT by clicking on the Default button. When you click the Default button, it will become the Reset button. The Reset button allows you to return to the most recently saved settings (i.e., the values that were set before you clicked on Default). After execution, the Reset button will become the Default button again.

31.5.5 Grid Partitioning Methods

Partitioning the grid for parallel processing has three major goals:

• Create partitions with equal numbers of cells.

• Minimize the number of partition interfaces, i.e., decrease partition boundary surface area.

• Minimize the number of partition neighbors.

Balancing the partitions (equalizing the number of cells) ensures that each processor has an equal load and that the partitions will be ready to communicate at about the same time. Since communication between partitions can be a relatively time-consuming process, minimizing the number of interfaces can reduce the time associated with this data interchange. Minimizing the number of partition neighbors reduces the chances for network and routing contention. In addition, minimizing partition neighbors is important on machines where the cost of initiating message passing is expensive compared to the cost of sending longer messages. This is especially true for workstations connected in a network.

The partitioning schemes in FLUENT use bisection algorithms to create the partitions, but unlike schemes that require the number of partitions to be a power of two, these schemes have no limitations on the number of partitions. For each available processor, you will create the same number of partitions (i.e., the total number of partitions will be an integral multiple of the number of processors).
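As a concrete illustration of the last point, the following minimal Python sketch (not FLUENT's implementation; the function and the random centroids are invented for the example) shows how recursive coordinate bisection can produce a partition count that is not a power of two:

import numpy as np

def bisect(cells, n_parts):
    # recursively split cell centroids into n_parts groups of roughly
    # equal size, bisecting along the direction of longest extent
    if n_parts == 1:
        return [cells]
    n_left = n_parts // 2
    n_right = n_parts - n_left              # counts need not be equal (e.g., 3 -> 1 + 2)
    axis = np.ptp(cells, axis=0).argmax()   # direction of longest extent
    order = cells[:, axis].argsort()
    cut = len(cells) * n_left // n_parts    # child sizes proportional to target counts
    return bisect(cells[order[:cut]], n_left) + bisect(cells[order[cut:]], n_right)

# dividing 900 hypothetical 2D cell centroids into 3 partitions:
# one "bisection" into ~1/3 and ~2/3, then the larger child is bisected again
cells = np.random.rand(900, 2)
print([len(p) for p in bisect(cells, 3)])   # [300, 300, 300]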


Bisection Methods

The grid is partitioned using a bisection algorithm. The selected algorithm is applied to the parent domain, and then recursively applied to the child subdomains. For example, to divide the grid into four partitions, the solver bisects the entire (parent) domain into two child domains, and then repeats the bisection for each of the child domains, yielding four partitions in total. To divide the grid into three partitions, the solver "bisects" the parent domain to create two partitions, one approximately twice as large as the other, and then bisects the larger child domain again to create three partitions in total.

The grid can be partitioned using one of the algorithms listed below. The most efficient choice is problem-dependent, so you can try different methods until you find the one that is best for your problem. See Section 31.5.4: Guidelines for Partitioning the Grid for recommended partitioning strategies.

Cartesian Axes bisects the domain based on the Cartesian coordinates of the cells (see Figure 31.5.8). It bisects the parent domain and all subsequent child subdomains perpendicular to the coordinate direction with the longest extent of the active domain. It is often referred to as coordinate bisection.

Cartesian Strip uses coordinate bisection but restricts all bisections to the Cartesian direction of longest extent of the parent domain (see Figure 31.5.9). You can often minimize the number of partition neighbors using this approach.

Cartesian X-, Y-, Z-Coordinate bisects the domain based on the selected Cartesian coordinate. It bisects the parent domain and all subsequent child subdomains perpendicular to the specified coordinate direction. (See Figure 31.5.9.)

Cartesian R Axes bisects the domain based on the shortest radial distance from the cell centers to that Cartesian axis (x, y, or z) which produces the smallest interface size. This method is available only in 3D.

Cartesian RX-, RY-, RZ-Coordinate bisects the domain based on the shortest radial distance from the cell centers to the selected Cartesian axis (x, y, or z). These methods are available only in 3D.

Cylindrical Axes bisects the domain based on the cylindrical coordinates of the cells. This method is available only in 3D.

Cylindrical R-, Theta-, Z-Coordinate bisects the domain based on the selected cylindrical coordinate. These methods are available only in 3D.

Metis uses the METIS software package for partitioning irregular graphs, developed by Karypis and Kumar at the University of Minnesota and the Army HPC Research Center. It uses a multilevel approach in which the vertices and edges on the fine graph are coalesced to form a coarse graph. The coarse graph is partitioned, and then uncoarsened back to the original graph. During coarsening and uncoarsening, algorithms are applied to permit high-quality partitions. Detailed information about METIS can be found in its manual [172].

i

Note that when using the socket version (-pnet), the METIS partitioner is not available. In this case, METIS partitioning can be obtained using the partition filter, as described below.

i

If you create non-conformal interfaces, and generate virtual polygonal faces, your METIS partition can cross non-conformal interfaces by using the connectivity of the virtual polygonal faces. This improves load balancing for the parallel solver and minimizes communication by decreasing the number of partition interface cells.

Polar Axes bisects the domain based on the polar coordinates of the cells (see Figure 31.5.12). This method is available only in 2D.

Polar R-Coordinate, Polar Theta-Coordinate bisects the domain based on the selected polar coordinate (see Figure 31.5.12). These methods are available only in 2D.

Principal Axes bisects the domain based on a coordinate frame aligned with the principal axes of the domain (see Figure 31.5.10). This reduces to Cartesian bisection when the principal axes are aligned with the Cartesian axes. The algorithm is also referred to as moment, inertial, or moment-of-inertia partitioning. This is the default bisection method in FLUENT.

Principal Strip uses moment bisection but restricts all bisections to the principal axis of longest extent of the parent domain (see Figure 31.5.11). You can often minimize the number of partition neighbors using this approach.

Principal X-, Y-, Z-Coordinate bisects the domain based on the selected principal coordinate (see Figure 31.5.11).

Spherical Axes bisects the domain based on the spherical coordinates of the cells. This method is available only in 3D.

Spherical Rho-, Theta-, Phi-Coordinate bisects the domain based on the selected spherical coordinate. These methods are available only in 3D.


Figure 31.5.8: Partitions Created with the Cartesian Axes Method

Figure 31.5.9: Partitions Created with the Cartesian Strip or Cartesian X-Coordinate Method


Figure 31.5.10: Partitions Created with the Principal Axes Method

Figure 31.5.11: Partitions Created with the Principal Strip or Principal X-Coordinate Method


Figure 31.5.12: Partitions Created with the Polar Axes or Polar Theta-Coordinate Method

Optimizations

Additional optimizations can be applied to improve the quality of the grid partitions. The heuristic of bisecting perpendicular to the direction of longest domain extent is not always the best choice for creating the smallest interface boundary. A "pre-testing" operation (see Section 31.5.5: Pretesting) can be applied to choose the best direction automatically before partitioning.

In addition, the following iterative optimization schemes exist:

Smooth attempts to minimize the number of partition interfaces by swapping cells between partitions. The scheme traverses the partition boundary and gives cells to the neighboring partition if doing so decreases the interface boundary surface area. (See Figure 31.5.13.)

Merge attempts to eliminate orphan clusters from each partition. An orphan cluster is a group of cells in which every cell has at least one face that coincides with an interface boundary. (See Figure 31.5.14.) Orphan clusters can degrade multigrid performance and lead to large communication costs.

In general, the Smooth and Merge schemes are relatively inexpensive optimization tools.


Figure 31.5.13: The Smooth Optimization Scheme

Figure 31.5.14: The Merge Optimization Scheme
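To make the Smooth idea concrete, here is a greatly simplified Python sketch (not FLUENT's implementation; the structured toy grid and the greedy pass are assumptions for illustration, and unlike the real scheme this sketch does not guard load balance):

import numpy as np

def interface_faces(part):
    # count internal faces whose two cells lie in different partitions
    return (np.count_nonzero(part[1:, :] != part[:-1, :]) +
            np.count_nonzero(part[:, 1:] != part[:, :-1]))

def smooth(part, passes=5):
    # greedy sketch: hand a boundary cell to a neighboring partition
    # whenever the swap reduces the total interface face count
    part = part.copy()
    ni, nj = part.shape
    for _ in range(passes):
        changed = False
        for i in range(ni):
            for j in range(nj):
                neighbors = {part[x, y]
                             for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                             if 0 <= x < ni and 0 <= y < nj}
                for cand in neighbors - {part[i, j]}:
                    before = interface_faces(part)
                    old = part[i, j]
                    part[i, j] = cand
                    if interface_faces(part) < before:
                        changed = True           # keep the improving swap
                    else:
                        part[i, j] = old         # revert
        if not changed:
            break
    return part

# a straight boundary with two misplaced cells (a jagged interface)
part = np.zeros((6, 6), dtype=int)
part[:, 3:] = 1
part[2, 2], part[3, 3] = 1, 0
print(interface_faces(part))          # 10 interface faces before smoothing
print(interface_faces(smooth(part)))  # 6 after: the jag is smoothed away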


Pretesting

If you choose the Principal Axes or Cartesian Axes method, you can improve the bisection by testing different directions before performing the actual bisection. If you choose not to use pretesting (the default), FLUENT performs the bisection perpendicular to the direction of longest domain extent.

If pretesting is enabled, it will occur automatically when you click the Partition button in the Partition Grid panel, or when you read in the grid if you are using automatic partitioning. The bisection algorithm tests all coordinate directions and chooses the one that yields the fewest partition interfaces for the final bisection.

Note that using pretesting increases the time required for partitioning. For 2D problems, partitioning takes about 3 times as long as without pretesting; for 3D problems, about 4 times as long.
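As an illustration of the pretesting idea, the following Python sketch (not FLUENT code; the names and the toy grid are assumptions for the example) counts the faces cut by a median bisection along each coordinate direction and keeps the direction with the fewest. On this simple box the answer coincides with the longest-extent default; the benefit of pretesting appears on irregular domains where the two can differ.

import numpy as np

def cut_faces(centroids, faces, axis):
    # internal faces whose two cells fall on opposite sides of a
    # median bisection along the given coordinate direction
    median = np.median(centroids[:, axis])
    side = centroids[:, axis] > median
    return sum(side[i] != side[j] for i, j in faces)

def pretest_axis(centroids, faces):
    # test every coordinate direction, keep the one cutting the fewest faces
    return min(range(centroids.shape[1]),
               key=lambda ax: cut_faces(centroids, faces, ax))

# a toy 6 x 3 structured grid of unit cells (hypothetical data)
nx, ny = 6, 3
centroids = np.array([[i + 0.5, j + 0.5] for i in range(nx) for j in range(ny)])
idx = lambda i, j: i * ny + j
faces = ([(idx(i, j), idx(i + 1, j)) for i in range(nx - 1) for j in range(ny)]
         + [(idx(i, j), idx(i, j + 1)) for i in range(nx) for j in range(ny - 1)])
print(pretest_axis(centroids, faces))  # 0: an x-bisection cuts 3 faces, a y-bisection cuts 6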

Using the Partition Filter

As noted above, you can use the METIS partitioning method through a filter, in addition to within the Auto Partition Grid and Partition Grid panels. To perform METIS partitioning on an unpartitioned grid, use the File/Import/Partition/Metis... menu item.

File −→ Import −→ Partition −→ Metis...

FLUENT will use the METIS partitioner to partition the grid, and then read the partitioned grid into the solver. The number of partitions will be equal to the number of processes. You can then proceed with the model definition and solution.

i

Direct import to the parallel solver through the partition filter requires that the host machine has enough memory to run the filter for the specified grid. If not, you will need to run the filter on a machine that does have enough memory. You can either start the parallel solver on the machine with enough memory and repeat the process described above, or run the filter manually on the new machine and then read the partitioned grid into the parallel solver on the host machine.

To manually partition a grid using the partition filter, enter the following command:

utility partition input_filename partition_count output_filename

where input_filename is the filename for the grid to be partitioned, partition_count is the number of partitions desired, and output_filename is the filename for the partitioned grid. You can then read the partitioned grid into the solver (using the standard File/Read/Case... menu item) and proceed with the model definition and solution.
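For example, an 8-partition grid could be produced as follows (the case file names here are hypothetical):

utility partition aircraft.cas 8 aircraft-8part.cas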


When the File/Import/Partition/Metis... menu item is used to import an unpartitioned grid into the parallel solver, the METIS partitioner partitions the entire grid. You may also partition each cell zone individually, using the File/Import/Partition/Metis Zone... menu item.

File −→ Import −→ Partition −→ Metis Zone...

This method can be useful for balancing the work load. For example, if a case has a fluid zone and a solid zone, the computation in the fluid zone is more expensive than in the solid zone, so partitioning each zone individually will result in a more balanced work load.

31.5.6 Checking the Partitions

After partitioning a grid, you should check the partition information and examine the partitions graphically.

Interpreting Partition Statistics

You can request a report to be printed after partitioning (either automatic or manual) is completed. In the parallel solver, click the Print Active Partitions or Print Stored Partitions button in the Partition Grid panel. In the serial solver, click the Print Partitions button.

FLUENT distinguishes between two cell partition schemes within a parallel problem: the active cell partition and the stored cell partition. Initially, both are set to the cell partition that was established upon reading the case file. If you re-partition the grid using the Partition Grid panel, the new partition will be referred to as the stored cell partition. To make it the active cell partition, you need to click the Use Stored Partitions button in the Partition Grid panel. The active cell partition is used for the current calculation, while the stored cell partition (the last partition performed) is used when you save a case file. This distinction is made mainly to allow you to partition a case on one machine or network of machines and solve it on a different one. Thanks to the two separate partitioning schemes, you could use the parallel solver with a certain number of compute nodes to subdivide a grid into an arbitrary different number of partitions, suitable for a different parallel machine, save the case file, and then load it into the designated machine. When you click Print Partitions in the serial solver, you will obtain information about the stored partition.

The output generated by the partitioning process includes information about the recursive subdivision and iterative optimization processes. This is followed by information about the final partitioned grid, including the partition ID, number of cells, number of faces, number of interface faces, ratio of interface faces to faces for each partition, number of neighboring partitions, and cell, face, interface, neighbor, mean cell, face ratio, and global face ratio variations. Variations are the minimum and maximum values of the respective quantities over the present partitions. For example, in the sample output below, partitions 0 and 3 have the minimum number of interface faces (10), and partitions 1 and 2 have the maximum number of interface faces (19); hence the variation is 10–19. Your aim is to achieve small values of Interface ratio variation and Global interface ratio while maintaining a balanced load (Cell variation).

>> Partitions:
P   Cells  I-Cells  Cell Ratio  Faces  I-Faces  Face Ratio  Neighbors
0     134       10       0.075    217       10       0.046          1
1     137       19       0.139    222       19       0.086          2
2     134       19       0.142    218       19       0.087          2
3     137       10       0.073    223       10       0.045          1
------
Partition count              = 4
Cell variation               = (134 - 137)
Mean cell variation          = ( -1.1% - 1.1%)
Intercell variation          = (10 - 19)
Intercell ratio variation    = ( 7.3% - 14.2%)
Global intercell ratio       = 10.7%
Face variation               = (217 - 223)
Interface variation          = (10 - 19)
Interface ratio variation    = ( 4.5% - 8.7%)
Global interface ratio       = 3.4%
Neighbor variation           = (1 - 2)

Computing connected regions; type ^C to interrupt.
Connected region count = 4

Note that partition IDs correspond directly to compute node IDs when a case file is read into the parallel solver. When the number of partitions in a case file is larger than the number of compute nodes, but is evenly divisible by the number of compute nodes, then the distribution is such that partitions with IDs 0 to (M − 1) are mapped onto compute node 0, partitions with IDs M to (2M − 1) onto compute node 1, etc., where M is equal to the ratio of the number of partitions to the number of compute nodes.
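The mapping just described is easy to express; the following Python sketch (illustrative only, not FLUENT code) reproduces it:

def node_for_partition(partition_id, num_partitions, num_nodes):
    # assumes the partition count is evenly divisible by the node count
    m = num_partitions // num_nodes      # M partitions per compute node
    return partition_id // m

# 8 partitions on 2 compute nodes: IDs 0-3 -> node 0, IDs 4-7 -> node 1
print([node_for_partition(p, 8, 2) for p in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]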


Examining Partitions Graphically

To further aid interpretation of the partition information, you can draw contours of the grid partitions, as illustrated in Figures 31.5.8–31.5.12.

Display −→ Contours...

To display the active cell partition or the stored cell partition (which are described above), select Active Cell Partition or Stored Cell Partition in the Cell Info... category of the Contours Of drop-down list, and turn off the display of Node Values. (See Section 28.1.2: Displaying Contours and Profiles for information about displaying contours.)

i

If you have not already done so in the setup of your problem, you will need to perform a solution initialization in order to use the Contours panel.

31.5.7 Load Distribution

If the speeds of the processors that will be used for a parallel calculation differ significantly, you can specify a load distribution for partitioning, using the load-distribution text command.

parallel −→ partition −→ set −→ load-distribution

For example, if you will be solving on three compute nodes, and one machine is twice as fast as the other two, then you may want to assign twice as many cells to the first machine as to the others (i.e., a load vector of (2 1 1)). During subsequent grid partitioning, partition 0 will end up with twice as many cells as partitions 1 and 2. Note that for this example, you would then need to start up FLUENT such that compute node 0 is the fast machine, since partition 0, with twice as many cells as the others, will be mapped onto compute node 0. Alternatively, in this situation, you could enable the load balancing feature (described in Section 31.6.2: Load Balancing) to have FLUENT automatically attempt to discern any difference in load among the compute nodes.
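As a quick illustration of how a load vector translates into cell counts, consider the following Python sketch (illustrative only; the grid size is hypothetical):

def cells_per_partition(total_cells, load_vector):
    # cells assigned proportionally to each compute node's load weight
    total_weight = sum(load_vector)
    return [total_cells * w // total_weight for w in load_vector]

# load vector (2 1 1) on a hypothetical 100,000-cell grid:
print(cells_per_partition(100000, [2, 1, 1]))  # [50000, 25000, 25000]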

i

If you adapt a grid that contains non-conformal interfaces, and you want to rebalance the load on the compute nodes, you will have to save your case and data files after adaption, read the case and data files into the serial solver, repartition using the Encapsulate Grid Interfaces and Encapsulate for Adaption options in the Partition Grid panel, and save the case and data files again. You will then be able to read the manually repartitioned case and data files into the parallel solver, and continue the solution from where you left off.

31.6 Checking and Improving Parallel Performance

To determine how well the parallel solver is working, you can measure computation and communication times, and the overall parallel efficiency, using the performance meter. You can also control the amount of communication between compute nodes in order to optimize the parallel solver, and take advantage of the automatic load balancing feature of FLUENT.

Information about checking and improving parallel performance is provided in the following sections:

• Section 31.6.1: Checking Parallel Performance

• Section 31.6.2: Optimizing the Parallel Solver

31.6.1 Checking Parallel Performance

The performance meter allows you to report the wall clock time elapsed during a computation, as well as message-passing statistics. Since the performance meter is always activated, you can access the statistics by printing them after the computation is completed.

To view the current statistics, use the Parallel/Timer/Usage menu item.

Parallel −→ Timer −→ Usage

Performance statistics will be printed in the text window (console). To clear the performance meter so that you can eliminate past statistics from the future report, use the Parallel/Timer/Reset menu item.

Parallel −→ Timer −→ Reset


The following example demonstrates how the current parallel statistics are displayed in the console window:

Performance Timer for 1 iterations on 4 compute nodes
  Average wall-clock time per iteration:     4.901 sec
  Global reductions per iteration:             408 ops
  Global reductions time per iteration:      0.000 sec (0.0%)
  Message count per iteration:                 801 messages
  Data transfer per iteration:               9.585 MB
  LE solves per iteration:                      12 solves
  LE wall-clock time per iteration:          2.445 sec (49.9%)
  LE global solves per iteration:               27 solves
  LE global wall-clock time per iteration:   0.246 sec (5.0%)
  AMG cycles per iteration:                     64 cycles
  Relaxation sweeps per iteration:            4160 sweeps
  Relaxation exchanges per iteration:          920 exchanges

  Total wall-clock time:                     4.901 sec
  Total CPU time:                           17.030 sec

A description of the parallel statistics is as follows:

• Average wall-clock time per iteration describes the average real (wall clock) time per iteration.

• Global reductions per iteration describes the number of global reduction operations (such as variable summations over all processes). This requires communication among all processes. A global reduction is a collective operation over all processes for the given job that reduces a vector quantity (the length given by the number of processes or nodes) to a scalar quantity (e.g., taking the sum or maximum of a particular quantity). The number of global reductions cannot be calculated from any other readily known quantities; it is generally dependent on the algorithm being used and the problem being solved.

• Global reductions time per iteration describes the time per iteration for the global reduction operations.

• Message count per iteration describes the number of messages sent between all processes per iteration. This is important with regard to communication latency, especially on high-latency interconnects. A message is defined as a single point-to-point, send-and-receive operation between any two processes. (This excludes global, collective operations such as global reductions.) In terms of domain decomposition, a message is passed from the process governing one subdomain to a process governing another (usually adjacent) subdomain. The message count per iteration is usually dependent on the algorithm being used and the problem being solved. The message counts reported are totals over all processors. The message count provides some insight into the impact of communication latency on parallel performance: a higher message count indicates that parallel performance may be more adversely affected if a high-latency interconnect is being used. Ethernet has a higher latency than Myrinet or Infiniband; thus, a high message count will more adversely affect performance with Ethernet than with Infiniband.

• Data transfer per iteration describes the amount of data communicated between processors per iteration. This is important with respect to interconnect bandwidth. Data transfer per iteration is usually dependent on the algorithm being used and the problem being solved. This number generally increases with increases in problem size, number of partitions, and physics complexity. The data transfer per iteration may provide some insight into the impact of communication bandwidth (speed) on parallel performance. The precise impact is often difficult to quantify because it depends on many things, including the ratio of data transfer to calculations and the ratio of communication bandwidth to CPU speed. The unit of data transfer is a byte.

• LE solves per iteration describes the number of linear systems being solved per iteration. This number is dependent on the physics (non-reacting versus reacting flow) and the algorithms (pressure-based versus density-based solver), but is independent of mesh size. For the pressure-based solver, this is usually the number of transport equations being solved (mass, momentum, energy, etc.).

• LE wall-clock time per iteration describes the time (wall-clock) spent in the linear equation solvers (i.e., multigrid).

• LE global solves per iteration describes the number of solutions on the coarse level of the AMG solver where the entire linear system has been pushed to a single processor (n0). The system is pushed to a single processor to reduce the computation time during the solution on that level. Scaling generally is not adversely affected because the number of unknowns is small on the coarser levels.

• LE global wall-clock time per iteration describes the time (wall-clock) per iteration for the linear equation global solutions (see above).

• AMG cycles per iteration describes the average number of multigrid cycles (V, W, flexible, etc.) per iteration.


• Relaxation sweeps per iteration describes the number of relaxation sweeps (or iterative solutions) on all levels for all equations per iteration. A relaxation sweep is usually one iteration of Gauss-Seidel or ILU.

• Relaxation exchanges per iteration describes the number of solution communications between processors during the relaxation process in AMG. This number may be less than the number of sweeps because of shifting the linear system on coarser levels to a single node/process.

• Time-step updates per iteration describes the number of sub-iterations on the time step per iteration.

• Time-step wall-clock time per iteration describes the time per sub-iteration.

• Total wall-clock time describes the total wall-clock time.

• Total CPU time describes the total CPU time used by all processes. This does not include any wait time for load imbalances or for communications (other than packing and unpacking local buffers).

The most relevant quantity is the Total wall-clock time. This quantity can be used to gauge parallel performance (speedup and efficiency) by comparing it to the corresponding quantity from a serial analysis (the command line should contain -t1 in order to obtain the statistics from a serial analysis). In lieu of a serial analysis, an approximation of the parallel speedup may be found in the ratio of Total CPU time to Total wall-clock time.
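The following Python sketch (illustrative only, not part of FLUENT) shows how the timer statistics above translate into speedup and efficiency estimates:

def parallel_metrics(total_cpu_time, total_wall_time, num_nodes, serial_wall_time=None):
    # prefer a measured serial run (-t1); otherwise use the CPU/wall ratio
    if serial_wall_time is not None:
        speedup = serial_wall_time / total_wall_time
    else:
        speedup = total_cpu_time / total_wall_time
    return speedup, speedup / num_nodes

# sample statistics above: 17.030 s CPU, 4.901 s wall, 4 compute nodes
s, e = parallel_metrics(17.030, 4.901, 4)
print("speedup %.2f, efficiency %.0f%%" % (s, 100 * e))  # speedup 3.47, efficiency 87%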


31.6.2 Optimizing the Parallel Solver

Increasing the Report Interval

In FLUENT, you can reduce communication and improve parallel performance by increasing the report interval for residual printing/plotting or other solution monitoring reports. You can modify the value for Reporting Interval in the Iterate panel.

Solve −→ Iterate...

i

Note that you will be unable to interrupt iterations until the end of each report interval.

Load Balancing

A dynamic load balancing capability is available in FLUENT. The principal reason for using parallel processing is to reduce the turnaround time of your simulation, ideally by a factor proportional to the collective speed of the computing resources used. If, for example, you were using four CPUs to solve your problem, then you would expect to reduce the turnaround time by a factor of four. This is of course the ideal situation, and assumes that there is very little communication needed among the CPUs, that the CPUs are all of equal speed, and that the CPUs are dedicated to your job. In practice, this is often not the case. For example, CPU speeds can vary if you are solving in parallel on a cluster that includes nodes with different clock speeds, other jobs may be competing for use of one or more of the CPUs, and network traffic either from within the parallel solver or generated from external sources may delay some of the necessary communication among the CPUs.

If you enable dynamic load balancing in FLUENT, the load across the computational and networking resources will be monitored periodically. If the load balancer determines that performance can be improved by redistributing the cells among the compute nodes, it will automatically do so. There is a time penalty associated with load balancing itself, and so it is disabled by default. If you will be using a dedicated homogeneous resource, or if you are using a heterogeneous resource but have accounted for differences in CPU speeds during partitioning by specifying a load distribution (see Section 31.5.7: Load Distribution), then you may not need to use load balancing.

i

Note that when the shell conduction model is used, you will not be able to turn on load balancing.


To enable and control FLUENT’s automatic load balancing feature, use the Load Balance panel (Figure 31.6.1). Load balancing will automatically detect and analyze parallel performance, and redistribute cells between the existing compute nodes to optimize it.

Parallel −→ Load Balance...

Figure 31.6.1: The Load Balance Panel

The procedure for using load balancing is as follows:

1. Turn on the Load Balancing option.

2. Select the bisection method to create new grid partitions in the Partition Method drop-down list. The choices are the techniques described in Section 31.5.5: Bisection Methods. As part of the automatic load balancing procedure, the grid will be repartitioned into several small partitions using the specified method. The resulting partitions will then be distributed among the compute nodes to achieve a more balanced load.

3. Specify the desired Balance Interval. When a value of 0 is specified, FLUENT will internally determine the best value to use, initially using an interval of 25 iterations. You can override this behavior by specifying a non-zero value. FLUENT will then attempt to perform load balancing after every N iterations, where N is the specified Balance Interval. You should be careful to select an interval that is large enough to outweigh the cost of performing the load balancing operations; a rough break-even check is sketched after this list.
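The following back-of-the-envelope Python sketch (illustrative only; all timings are assumed values, not FLUENT measurements) shows the trade-off mentioned in step 3: balancing after every N iterations pays off only if the accumulated per-iteration savings exceed the one-time cost of the balancing step itself:

def balancing_pays_off(interval, time_saved_per_iter, rebalance_cost):
    # worthwhile only if savings accumulated over the interval exceed
    # the one-time cost of the load balancing step itself
    return interval * time_saved_per_iter > rebalance_cost

# assumed timings: 0.2 s saved per iteration, 10 s to rebalance
print(balancing_pays_off(25, 0.2, 10.0))   # False: 25 iterations recover only 5 s
print(balancing_pays_off(100, 0.2, 10.0))  # True: 100 iterations recover 20 s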


Note that you can interrupt the calculation at any time, turn the load balancing feature off (or on), and then continue the calculation.

i

If problems arise in your computations due to adaption, you can turn off the automatic load balancing, which occurs any time that mesh adaption is performed in parallel.

To instruct the solver to skip the load balancing step, issue the following Scheme command:

(disable-load-balance-after-adaption)

To return to the default behavior, use the following Scheme command:

(enable-load-balance-after-adaption)
