:sequential_nav: next

.. _tutorial-performanceoptions:

Performance Options
*******************

This reference section lists several options that can help you make the most
of your current SIESTA installation. It covers neither options that affect the
physics of your calculation (basis sets, mesh cut-off, etc.) nor options
related to SCF convergence. The options in this guide should not affect the
numerical quality of your results, while still enabling speedups in your
calculation.

.. note::

   There is no "absolute best" value for any of these options. You should test
   what works best for your specific calculation on your specific hardware.

---------

SIESTA Input Options
====================

The following options can be set within the FDF file you use for your
calculations. Refer to the manual for further information. Default values are
given in parentheses, when available.

Regular SCF calculations
------------------------

These options can speed up your calculations when running single-point SCFs,
molecular dynamics, and similar methods.

* BlockSize (64)

  The BlockSize indicates how orbitals are distributed across processors,
  which in turn affects how both the grid and the matrices are distributed. It
  is usually convenient to set it to a power of 2 (16, 32, 64, etc.), but the
  best value can depend on your hardware.

* Diag.Algorithm (DivideAndConquer)

  This is the main algorithm used for Hamiltonian diagonalization. Other
  interesting alternatives are MRRR and ELPA (if ELPA support is enabled).

* NumberOfEigenStates (all orbitals)

  This variable is useful when using the Expert, MRRR and ELPA diagonalization
  algorithms. When positive, it indicates the total number of eigenvectors to
  find (i.e. the number of orbitals/bands).
  But it is more useful when negative: for example, ``NumberOfEigenStates -50``
  will find all eigenvectors up to the 50th above the Fermi level (or, more
  correctly, above half the number of electrons). This option can provide
  great speedups, but setting it to a low number **will** affect the physics
  of your system.

* Diag.ParallelOverK (False)

  When true, this enables parallelization of the diagonalization over k-points
  instead of orbitals. This is only useful when the number of k-points is much
  larger than both the number of orbitals and the number of CPUs used.

TranSIESTA specific
-------------------

The following options can be useful for TranSIESTA calculations, specifically
when relying on the Block-Tri-Diagonal (BTD) algorithm, which is the default.
Please refer to the manual for more details.

* TS.BTD.Optimize (speed)

  Selects which kind of BTD optimization is performed. Can be "speed" or
  "memory"; the latter is particularly useful if your TS calculations are
  using too much RAM.

* TS.BTD.Spectral (column)

  Selects the algorithm used to compute the spectral function; can be "column"
  or "propagation".

* TS.BTD.Pivot (*first electrode*)

  Selects the partitioning of the BTD matrix. There are several options for
  this input; please refer to the manual for further details.

---------

External Parallelization Options
================================

SIESTA offers a hybrid OpenMP+MPI parallelization. The options presented here
lie outside SIESTA-specific territory, as they are controlled by external
launchers (such as mpirun) or by environment variables read by external
libraries (such as OMP_NUM_THREADS for OpenMP).

Tasks and threads
-----------------

.. hint::

   * Parallelization in OpenMP is done over *threads*. These share memory, so
     it is a bad idea to have the threads of one task split over hardware
     sockets.
   * Parallelization in MPI is done over *tasks*. These do not share memory
     and you could, in principle, have as many tasks as orbitals in your
     system.
   * When both MPI and OpenMP are used, each MPI *task* launches M OpenMP
     *threads*.
     So if you have N tasks, you will end up using N*M processors.

The number of *threads* per task can usually be controlled with the
OMP_NUM_THREADS environment variable. Under SLURM, this is usually also
controlled using the ``--cpus-per-task`` option. The total number of *tasks*
is set via the MPI launcher (mpirun, mpiexec, srun) with the ``-n`` or ``-np``
options:

.. code:: shell

   # No spaces around "=" in a shell assignment
   export OMP_NUM_THREADS=4
   mpiexec -np 56 siesta input.fdf > output

The previous lines will run SIESTA using 56 tasks with 4 threads each, for a
total of 224 CPUs. This is equivalent to the following *srun* command:

.. code:: shell

   srun -n 56 --cpus-per-task=4 siesta input.fdf > output

In general, you should avoid using an odd number of threads or tasks. As a
general rule of thumb, the number of threads should be a divisor of the number
of cores per socket in your computer. In HPC centers this information is
available in their documentation; if you are not sure, you can run a command
such as ``lscpu`` and look for the following lines:

.. code:: shell

   Core(s) per socket:  10
   Socket(s):           1

This means that this particular computer has only one socket with 10 cores.

CPU Binding
-----------

In general, tasks and threads may "jump" between different physical CPUs
during the execution of a program. This is usually undesirable, and as such
you may want to **bind** your tasks (i.e., never allow them to leave the same
CPU). For the examples above, this can be set easily by relying on the MPI
launcher:

.. code:: shell

   export OMP_NUM_THREADS=4
   mpiexec -np 56 --bind-to core --map-by socket:PE=4 siesta input.fdf > output

In this case, the ``--map-by`` option indicates that tasks will be ordered by
CPUs in each socket, jumping to every 4th CPU. So task 0 will go to CPU 0,
task 1 will go to CPU 4, task 2 will go to CPU 8, and so on. The CPUs that
were skipped are then assigned to the threads spawned by OpenMP.

.. note::

   On newer versions of OpenMPI (greater than 5.0), the "socket" option is
   deprecated in favor of the new "package" option.
   This means that the line above becomes ``--map-by package:PE=4``.

Similarly, for srun:

.. code:: shell

   srun -n 56 --cpus-per-task=4 --cpu-bind=cores siesta input.fdf > output

In HPC centers
--------------

These are general recommendations when using SIESTA on large supercomputers.

* Try to rely on *srun* instead of *mpirun* or *mpiexec*. HPC centers usually
  set up srun to provide additional performance-related configuration (for
  example, the core binding option is usually enabled by default).

* *Read each center's documentation*. It will not only give you important
  hardware information (such as the number of CPUs per socket), but may also
  suggest better MPI-related options for your job.

* *Aim to fully utilize nodes*. If the center provides 112-CPU nodes, avoid
  using something like 128 MPI tasks; rather, aim for 112 (one node), 224 (two
  nodes) and so on. In addition, do not mindlessly increase the number of
  nodes you are using, since parallel efficiency can degrade quickly.
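
The node arithmetic in the last recommendation can be checked with a few lines
of shell before submitting a job. The following is only a minimal sketch: it
assumes a hypothetical center with 112-CPU nodes and 4 OpenMP threads per MPI
task, and the variable names are illustrative (they are not SIESTA, MPI or
SLURM settings).

.. code:: shell

   # Hypothetical values: take the real ones from your center's documentation.
   CPUS_PER_NODE=112      # CPUs available on each node
   THREADS_PER_TASK=4     # OpenMP threads per MPI task
   NODES=2                # number of nodes requested

   # The threads per task must evenly divide the CPUs of a node,
   # otherwise some CPUs would be left idle.
   if [ $(( CPUS_PER_NODE % THREADS_PER_TASK )) -ne 0 ]; then
       echo "THREADS_PER_TASK must divide CPUS_PER_NODE" >&2
       exit 1
   fi

   # MPI tasks that fit in one node, and totals for the whole job
   TASKS_PER_NODE=$(( CPUS_PER_NODE / THREADS_PER_TASK ))
   TOTAL_TASKS=$(( NODES * TASKS_PER_NODE ))
   TOTAL_CPUS=$(( TOTAL_TASKS * THREADS_PER_TASK ))

   echo "tasks=${TOTAL_TASKS} threads=${THREADS_PER_TASK} cpus=${TOTAL_CPUS}"

With these assumed values, the resulting task count is what would be passed to
``-n`` (or ``-np``), and the thread count to ``--cpus-per-task`` and
OMP_NUM_THREADS, so that both nodes are fully used.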