:sequential_nav: next

..  _tutorial-performanceoptions:

Performance Options
*******************

This reference section contains several options that might be helpful when
trying to make the most out of your current SIESTA installation.

This section covers neither options that affect the physics of your
calculation (basis sets, mesh cut-off, etc.) nor options related to SCF
convergence. The options in this reference guide should not affect the numerical
quality of your results, while still enabling speedups in your calculation.

.. note::
   There are no "absolute best" values for all of these options. You should test
   what works best for your specific calculation on your specific hardware.

---------

SIESTA Input Options
====================

The following options can be set within the FDF file you use for your
calculations. Refer to the manual for further information. The defaults are
given in parentheses, when available.

Regular SCF calculations
------------------------
These options can speed up your calculations when running single-point SCFs,
molecular dynamics, and similar methods.

* BlockSize (64)
   The BlockSize indicates how orbitals are distributed across processors, which
   has an effect on how both grid and matrices are distributed. It is usually
   convenient to set it to a power of 2 (16, 32, 64, etc.), but it can depend on
   your own hardware's specification.

* Diag.Algorithm (DivideAndConquer)
   This is the main algorithm used for Hamiltonian diagonalization. Other
   interesting alternatives are MRRR and ELPA (if ELPA support is enabled).

* NumberOfEigenStates (all orbitals)
   This variable is useful when using the Expert, MRRR and ELPA diagonalization
   algorithms. When positive, it indicates the total number of eigenvectors to
   find (i.e. the number of orbitals/bands). It is more useful when negative: for
   example, ``NumberOfEigenStates -50`` will find all eigenvectors up to the 50th
   above the Fermi level (or, more correctly, above half the number of
   electrons). This option can provide large speedups, but requesting too few
   eigenstates **will** affect the physics of your system.

* Diag.ParallelOverK (False)
   When true, this enables parallelization of the diagonalization over k-points
   instead of orbitals. This is only useful when the number of k-points is much
   larger than the total number of orbitals and CPUs used.
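
Since the best value of an option such as BlockSize depends on the system and
the hardware, a small scan is often the easiest way to choose it. The following
is a minimal sketch that only generates one input file per candidate value. The
file names ``base.fdf`` and ``input_bs*.fdf`` are hypothetical, and the
substitution assumes that the option is written exactly as ``BlockSize`` at the
start of a line.

.. code:: shell

   # Hypothetical base input; in practice this would be your real FDF file.
   printf 'SystemLabel test\nBlockSize 64\n' > base.fdf

   # Write one input file per candidate BlockSize (powers of 2).
   for bs in 16 32 64; do
       sed "s/^BlockSize .*/BlockSize ${bs}/" base.fdf > "input_bs${bs}.fdf"
   done

   # Show the value that ended up in each generated file.
   grep -H '^BlockSize' input_bs*.fdf

Each generated file can then be run with your usual launcher, and the timings
reported at the end of each SIESTA output can be compared.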


TranSIESTA specific
-------------------
The following options can be useful for TranSIESTA calculations, specifically
when relying on the Block-Tri-Diagonal algorithm (which is the default). Please
refer to the manual for more details.

* TS.BTD.Optimize (speed)
   Selects which kind of BTD optimization is performed. Can be "speed" or
   "memory"; this second case is particularly useful if your TS calculations are
   eating up too much RAM.

* TS.BTD.Spectral (column)
   Selects the algorithm to compute the spectral function; can be column or
   propagation.

* TS.BTD.Pivot (*first electrode*)
   Selects the partitioning of the BTD matrix. There are several options for
   this input, please refer to the manual for further details.
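
As an illustration, the non-default values mentioned above could be collected in
a small FDF fragment. This is only a sketch: the file name ``ts_options.fdf`` is
hypothetical, and whether these values help depends on your system.

.. code:: shell

   # Write a hypothetical FDF fragment with the non-default BTD options.
   cat > ts_options.fdf <<'EOF'
   # Trade speed for a lower memory footprint
   TS.BTD.Optimize   memory
   # Alternative algorithm for the spectral function
   TS.BTD.Spectral   propagation
   EOF

   # Print the fragment; these lines would go into your main FDF file.
   cat ts_options.fdf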

---------

External Parallelization Options
================================

SIESTA offers a hybrid OpenMP+MPI parallelization. The options presented here
lie outside of the SIESTA-specific territory, as they are controlled by external
launchers (such as mpirun) or environment variables (such as OMP_NUM_THREADS).


Tasks and threads
-----------------
.. hint::

   * Parallelization in OpenMP is done over *threads*. These share memory and
     thus it is a bad idea to have threads split over hardware sockets.

   * Parallelization in MPI is done over *tasks*. These do not share memory and
     you could, in principle, have as many tasks as orbitals in your system.

   * When both MPI and OpenMP are used, each MPI *task* launches M OpenMP
     *threads*. So if you have N tasks, you will end up using N*M processors.

The number of *threads* per task is usually controlled by the OMP_NUM_THREADS
environment variable. Under SLURM, it is usually also controlled using the
"--cpus-per-task" option. The total number of *tasks* is set via the MPI
launcher (mpirun, mpiexec, srun) with the "-n" or "-np" options:

.. code:: shell

   export OMP_NUM_THREADS=4
   mpiexec -np 56 siesta input.fdf > output

The previous lines will run SIESTA using 56 tasks with 4 threads each, for a
total of 224 CPUs. This is equivalent to the following *srun* command:

.. code:: shell

   srun -n 56 --cpus-per-task=4 siesta input.fdf > output

In general, you should avoid using an odd number of threads or tasks.
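
Inside a SLURM job script, the thread count can be derived from the allocation
instead of being typed twice. The sketch below assumes that the job was
submitted with "--cpus-per-task", in which case SLURM exports the
``SLURM_CPUS_PER_TASK`` variable; the fallback to a single thread and the
``tasks`` variable are illustrative choices.

.. code:: shell

   # Use the SLURM allocation if available, otherwise fall back to 1 thread.
   export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"

   # Number of MPI tasks, i.e. the value passed to -n or -np.
   tasks=56

   # Total CPUs in use: tasks times threads per task.
   echo "tasks=${tasks} threads=${OMP_NUM_THREADS} cpus=$((tasks * OMP_NUM_THREADS))"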

As a general rule of thumb, the number of threads should be a divisor of the
number of cores per socket in your computer. In HPC centers this information is
available in their documentation; if you are not sure, you can run a command
such as "lscpu" and look for the following lines:

.. code:: shell

      Core(s) per socket:  10
      Socket(s):           1

This means that this computer in particular has only one socket with 10 cores.
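
The candidate thread counts for a given socket can be listed with a few lines
of shell. This is a minimal sketch: ``thread_candidates`` is a hypothetical
helper, and the value 10 is taken by hand from the "lscpu" output above.

.. code:: shell

   # Hypothetical helper: print every divisor of the cores-per-socket count.
   thread_candidates() {
       cores=$1
       t=1
       while [ "$t" -le "$cores" ]; do
           if [ $((cores % t)) -eq 0 ]; then
               echo "$t"
           fi
           t=$((t + 1))
       done
   }

   # Cores per socket, as reported by lscpu in the example above.
   thread_candidates 10

For 10 cores per socket the candidates are 1, 2, 5 and 10; combined with the
advice against odd counts, 2 and 10 are the natural values to test first.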


CPU Binding
-----------
In general, tasks and threads may "jump" between different physical CPUs during
the execution of a program. This is usually undesirable, and as such you may
want to **bind** your tasks (i.e., never allow them to leave the same CPU). For
the examples above, this can be set easily by relying on the MPI launcher:

.. code:: shell

   export OMP_NUM_THREADS=4
   mpiexec -np 56 --bind-to core --map-by socket:PE=4 siesta input.fdf > output


In this case, the "--map-by" option indicates that tasks will be ordered by CPUs
in each socket, by jumping to every 4th CPU. So task 0 will go to CPU 0, task 1
will go to CPU 4, task 2 will go to CPU 8, and so on. Then the CPUs we skipped
will be assigned to each of the threads spawned by OpenMP.
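
The resulting layout can be sketched with simple arithmetic. The loop below is
only an illustration of the idea, under the simplifying assumption that
consecutive tasks are packed one after another within a single socket; the
actual placement is decided by the MPI launcher.

.. code:: shell

   # Processing elements (CPUs) reserved per task, as in PE=4.
   pe=4

   # Each task owns a contiguous range of CPUs starting at task * pe.
   for task in 0 1 2; do
       first=$((task * pe))
       last=$((first + pe - 1))
       echo "task ${task}: CPUs ${first}-${last}"
   done

To check what was actually applied on your machine, Open MPI launchers accept
the ``--report-bindings`` option, which prints the binding of each task at
startup.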


.. note::
   On newer versions of OpenMPI (greater than 5.0), the "socket" option is
   deprecated in favor of the new "package" option. This means that the line
   above becomes ``--map-by package:PE=4``.

Similarly, for srun:

.. code:: shell

   srun -n 56 --cpus-per-task=4 --cpu-bind=cores siesta input.fdf > output


In HPC centers
--------------
These are general recommendations when using SIESTA on large supercomputers.

* Try to rely on *srun* instead of *mpirun* or *mpiexec*. HPC centers usually
  set up srun to provide additional configurations for performance (for example,
  the core binding option is usually enabled by default).

* *Read each center's documentation*. It will not only give you important
  hardware information (such as the number of CPUs per socket), but may also
  provide better MPI-related options for your job.

* *Aim to fully utilize nodes*. If the center provides 112-CPU nodes, avoid
  using something like 128 MPI tasks; rather, aim for 112 (one node), 224
  (two nodes) and so on. In addition, do not mindlessly increase the number of
  nodes you are using, since parallel efficiency might degrade quickly.
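
The task count that fills a set of nodes exactly can be computed before writing
the job script. The sketch below uses the 112-CPU nodes from the example above;
the variable names are illustrative, and the command is only printed, not
executed.

.. code:: shell

   cpus_per_node=112   # taken from the center's documentation
   threads=4           # OpenMP threads per MPI task
   nodes=2

   # Tasks that fit on one node, and in the whole allocation.
   tasks_per_node=$((cpus_per_node / threads))
   total_tasks=$((nodes * tasks_per_node))

   # Print the matching launch line instead of running it.
   echo "srun -n ${total_tasks} --cpus-per-task=${threads} siesta input.fdf"

With these values each node hosts 28 tasks of 4 threads, so two nodes give the
56 tasks and 224 CPUs used in the earlier examples.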