🔪 The Sharp Bits 🔪#

Read ahead for some pitfalls, counter-intuitive behavior, and sharp edges that we had to introduce in order to make this work.

Parallelization#

NetKet makes use of parallelism in two principal ways:

Implicitly, through the just-in-time compilation of XLA: on CPU, vector instructions are used (as well as multiple threads for certain linear algebra operations) and, similarly, on GPU calculations are parallelized to run on all available CUDA cores.
Explicitly, by distributing the Markov chains and samples across multiple nodes/devices. This is achieved by using MPI (with mpi4jax), or alternatively by using the native collective communication built into jax (still experimental).

In the following we go through the peculiarities of using those different approaches.

XLA Multi-threading#

NetKet computations run mostly via Jax’s XLA. Compared to NetKet 2, this means that we can automatically benefit from multiple CPU cores without having to use MPI. This is because mathematical operations such as matrix multiplications and others are split into sub-chunks and distributed across different CPU cores. This behaviour is triggered only for matrices/vectors above a certain size, and it does not perform particularly well for small matrices or if you have many CPU cores. To disable this behaviour, refer to Jax#743, which mainly suggests setting the following two flags in the XLA_FLAGS environment variable:

export XLA_FLAGS="--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1"

On Linux it is also possible to control the cores visible to XLA with taskset.
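The same flags can also be set from Python, provided this happens before jax (or netket) is imported. The following is a minimal sketch and not part of NetKet: the helper names `disable_xla_cpu_multithreading` and `visible_cpu_count` are hypothetical, and the flag strings are copied verbatim from the export line above. `os.sched_getaffinity` exists only on Linux, where it reflects restrictions such as those imposed by taskset.

```python
import os


def disable_xla_cpu_multithreading(env=os.environ):
    """Append the two flags quoted above to XLA_FLAGS, keeping any existing flags.

    Hypothetical helper: it only edits the environment mapping it is given.
    """
    wanted = [
        "--xla_cpu_multi_thread_eigen=false",
        "intra_op_parallelism_threads=1",
    ]
    current = env.get("XLA_FLAGS", "").split()
    merged = current + [flag for flag in wanted if flag not in current]
    env["XLA_FLAGS"] = " ".join(merged)
    return env["XLA_FLAGS"]


def visible_cpu_count():
    """Number of cores this process may run on (respects taskset on Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without affinity support (e.g. macOS).
    return os.cpu_count() or 1


# Call this before `import jax` / `import netket`.
disable_xla_cpu_multithreading()
print(os.environ["XLA_FLAGS"], visible_cpu_count())
```

Merging into the existing value rather than overwriting it keeps flags that were already exported in the shell.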

MPI (mpi4jax)#

This requires that mpi4py and mpi4jax are installed; please refer to Installation#MPI.

When using netket it is crucial to run Python with the same implementation and version of MPI that the mpi4py module is compiled against. If you encounter issues, you can check whether your MPI environment is set up properly by running:

$ mpirun -np 2 python3 -m netket.tools.check_mpi
mpi4py_available             : True
mpi4jax_available            : True
available_cpus (rank 0)       : 12
n_nodes                      : 1
mpi4py | MPI version         : (3, 1)
mpi4py | MPI library_version : Open MPI v4.1.0, package: Open MPI brew@BigSur Distribution, ident: 4.1.0,  repo rev: v4.1.0, Dec 18, 2020

This should print some basic information about the MPI installation and, in particular, pick up the correct n_nodes. If you get the same output multiple times, each with n_nodes : 1, this is a clear sign that your MPI setup is broken. The tool above also reports the number of (logical) CPUs that might be subscribed by Jax on every independent MPI rank during linear algebra operations. Be mindful that Jax, in general, is like an invasive plant and tends to use all the resources it can access, and the environment variables above might not prevent it from making use of the available_cpus. On Mac it is not possible to control this number. On Linux it can be controlled using taskset or --bind-to core when using mpirun.
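The symptom described above (the report printed several times, each copy showing n_nodes : 1) can be checked mechanically if you capture the tool's output. This is only an illustrative sketch: `looks_like_broken_mpi` is a hypothetical helper, not something shipped with NetKet, and it assumes the `key : value` line layout shown in the sample output.

```python
def looks_like_broken_mpi(report: str) -> bool:
    """Return True if the report was printed more than once with n_nodes == 1 each time.

    Hypothetical helper; assumes lines of the form 'n_nodes   : <int>'.
    """
    n_nodes = []
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "n_nodes":
            n_nodes.append(int(value))
    # Several independent copies, each believing it is alone, means the ranks
    # did not find each other.
    return len(n_nodes) > 1 and all(n == 1 for n in n_nodes)


duplicated = "n_nodes                      : 1\nn_nodes                      : 1\n"
print(looks_like_broken_mpi(duplicated))
```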

Native Jax parallelism (experimental)#

Historically the principal way to run netket in parallel has been to use MPI via mpi4py and mpi4jax. However, recently jax gained support for shared arrays and collective operations on multiple devices/nodes (see here and here) and we adapted netket to support those, enabling native parallelism via jax.

Warning

This feature is still experimental and not everything may work perfectly right out of the box. Any feedback, be it positive or negative, would be greatly appreciated.

Single Process#

To run on a single process with multiple devices on a single node, usually all that is necessary is to set the environment flag NETKET_EXPERIMENTAL_SHARDING=1, e.g. by setting it before importing netket:

GPU

import os
os.environ['NETKET_EXPERIMENTAL_SHARDING'] = '1'  # environment values must be strings

import netket as nk
# ...

CPU

You can force jax to use multiple threads as cpu devices (see jax 101), e.g.:

import os
os.environ['XLA_FLAGS'] = '--xla_force_host_platform_device_count=8'
os.environ['NETKET_EXPERIMENTAL_SHARDING'] = '1'  # environment values must be strings

import netket as nk
# ...
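Since the flag has to be set before netket is imported, it can help to make that ordering explicit. The guard below is a hedged sketch: `enable_netket_sharding` is a hypothetical name, and checking `sys.modules` is just one simple way to detect that the import has already happened.

```python
import os
import sys


def enable_netket_sharding(env=os.environ, loaded=sys.modules):
    """Set NETKET_EXPERIMENTAL_SHARDING, refusing to do so after netket was imported.

    Hypothetical helper, not part of NetKet.
    """
    if "netket" in loaded:
        raise RuntimeError(
            "NETKET_EXPERIMENTAL_SHARDING must be set before importing netket"
        )
    # os.environ only accepts string values, hence "1" rather than 1.
    env["NETKET_EXPERIMENTAL_SHARDING"] = "1"


# Typical use, at the very top of the main script:
#     enable_netket_sharding()
#     import netket as nk
```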

Multi-Process#

Background: Jax internally uses the grpc library (launching a http server) for setup and book-keeping of the cluster and the nvidia nccl library for communication between gpus, and (experimentally) gloo for communication between cpus. Note that even if launched with mpirun, mpi is currently not used for communication (until somebody writes a plugin for it), but the environment variables set by it are instead picked up by jax.distributed.initialize and used to set up the other communication libraries.

To launch netket on a multi-node cluster usually all that is required is to add a call to jax.distributed.initialize() at the top of the main script, e.g. as follows:

GPU

import jax
jax.distributed.initialize()

import os
os.environ['NETKET_EXPERIMENTAL_SHARDING'] = '1'  # environment values must be strings

import netket as nk
# ...

It is required that libnccl2 and libnccl2-dev are installed in addition to cuda. If you run into communication errors, you might want to set the environment variable NCCL_DEBUG=INFO for detailed error messages.

CPU (experimental, requires jax>=0.4.23; see jax #11182 (comment))

import jax
jax.config.update("jax_cpu_enable_gloo_collectives", True)
jax.distributed.initialize()

import os
os.environ['NETKET_EXPERIMENTAL_SHARDING'] = '1'  # environment values must be strings

import netket as nk
# ...

Then, these scripts can be conveniently launched with srun (on slurm clusters) or mpirun (openmpi only). For more details and manual setups we refer to the jax documentation.
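When debugging a launch, it can be useful to print what the launcher actually exported to each process. The sketch below reads the rank and world-size variables that srun (SLURM_PROCID, SLURM_NTASKS) and Open MPI's mpirun (OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE) set. `launcher_info` is a hypothetical helper for inspection only; it does not claim to reproduce the detection logic inside jax.distributed.initialize.

```python
import os


def launcher_info(env=os.environ):
    """Return (launcher, rank, world_size) from launcher variables, or None.

    Hypothetical debugging aid; jax performs its own, more involved detection.
    """
    if "SLURM_PROCID" in env and "SLURM_NTASKS" in env:
        return "slurm", int(env["SLURM_PROCID"]), int(env["SLURM_NTASKS"])
    if "OMPI_COMM_WORLD_RANK" in env and "OMPI_COMM_WORLD_SIZE" in env:
        return (
            "openmpi",
            int(env["OMPI_COMM_WORLD_RANK"]),
            int(env["OMPI_COMM_WORLD_SIZE"]),
        )
    # Not started through srun or Open MPI's mpirun.
    return None


print(launcher_info())
```

If this prints None on every process, the script was probably started without srun or mpirun, and a manual setup as described in the jax documentation is needed.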

GRPC incompatibility with http proxy wildcards#

We noticed that communication errors can arise when an http proxy is used on the cluster. Grpc will try to communicate with the other nodes via the proxy whenever those nodes are excluded in the no_proxy variable only via wildcards (e.g. no_proxy=10.0.0.*), which we found grpc cannot parse. To avoid this, all addresses need to be included explicitly.
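Listing the addresses explicitly can be automated. The following sketch expands entries of the form `a.b.c.*` into the 256 addresses they stand for; `expand_no_proxy_wildcards` is a hypothetical helper, and it assumes the wildcard only ever replaces the last octet of an IPv4 address, as in the example above. Other entries are passed through unchanged.

```python
def expand_no_proxy_wildcards(value: str) -> str:
    """Replace 'a.b.c.*' entries of a no_proxy string by explicit addresses.

    Hypothetical helper; only a trailing '.*' on the last octet is handled.
    """
    expanded = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        if entry.endswith(".*"):
            prefix = entry[:-2]
            expanded.extend(f"{prefix}.{i}" for i in range(256))
        else:
            expanded.append(entry)
    return ",".join(expanded)


# Example: set the result before calling jax.distributed.initialize().
#     os.environ["no_proxy"] = expand_no_proxy_wildcards(os.environ["no_proxy"])
print(expand_no_proxy_wildcards("localhost,10.0.0.*")[:40])
```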

Alternatively, a simple way to work around it is to disable the proxy completely for jax by unsetting the respective environment variables (see grpc docs) e.g. as follows:

import os
# pop(..., None) avoids a KeyError when a variable is not set
os.environ.pop('http_proxy', None)
os.environ.pop('https_proxy', None)
os.environ.pop('no_proxy', None)

import jax
jax.distributed.initialize()

Multiple GPU devices per process#

According to our testing, it is best to use 1 process per gpu on the cluster.

Nevertheless, if you want to use multiple gpus per process you can force jax to do so by setting local_device_ids, e.g. extracting it from CUDA_VISIBLE_DEVICES as follows:

import os
import jax
ldi = list(map(int, os.environ.get('CUDA_VISIBLE_DEVICES').split(',')))
jax.distributed.initialize(local_device_ids=ldi)
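The snippet above assumes that CUDA_VISIBLE_DEVICES is set and contains a comma-separated list of integers. A slightly more defensive variant is sketched here; `local_device_ids_from_env` is a hypothetical helper. It returns None when the variable is unset or empty, which is also the default value of the local_device_ids argument, and it will still raise a ValueError if the variable holds non-integer identifiers.

```python
import os


def local_device_ids_from_env(env=os.environ):
    """Parse CUDA_VISIBLE_DEVICES into a list of ints, or None if unset/empty.

    Hypothetical helper; non-integer entries raise ValueError.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if not raw:
        return None
    return [int(token) for token in raw.split(",") if token.strip()]


# Intended use:
#     jax.distributed.initialize(local_device_ids=local_device_ids_from_env())
print(local_device_ids_from_env())
```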

Using GPUs#

Jax supports GPUs, so your calculations should run fine on GPU. However, there are a few gotchas:

GPUs have a much higher overhead, so you will see very poor performance at small system sizes (typically below 40 spins).
Not all Metropolis transition rules work on GPUs. To work around that, those rules have been rewritten in numpy in order to run on the CPU; you might therefore need to use netket.sampler.MetropolisSamplerNumpy instead of netket.sampler.MetropolisSampler.

Eventually we would like the selection to be automatic, but this has not yet been implemented.

Please open tickets if you find issues!

Running on CPU when GPUs are present#

If you have the CUDA version of jaxlib installed, then computations will, by default, run on the GPU. For small systems this will be very inefficient. To check if this is the case, run the following code:

import jax
print(jax.devices())

If the output is [CpuDevice(id=0)], then computations will run on the CPU by default. If instead you see something like [GpuDevice(id=0)], computations will run on the GPU.

To force Jax/XLA to run computations on the CPU, set the environment variable

export JAX_PLATFORM_NAME="cpu"

NaNs in training and loss of precision#

If you find NaNs while training, especially if you are using your own model, there might be a few reasons:

It might simply be a precision issue, as you might be using single precision (np.float32, np.complex64) instead of double precision (np.float64, np.complex128). Be careful: if you use float and complex as dtype, they will not always behave as you expect! They are known as weak dtypes, and when multiplied by a single-precision number they will be converted to single precision. This issue might manifest especially when using Flax, which respects type promotion, as opposed to jax.example_libraries.stax, which does not.
Check the initial parameters. In NetKet 2, models were always initialized with normally distributed weights. In NetKet 3, netket.nn layers use the same default (normal distribution with standard deviation 0.01), but if you use general flax layers they might use different initializers. Different initialisation distributions have particularly strong effects when working with complex-valued models. A good way to enforce the same distribution across all your weights, similar to the NetKet 2 behaviour, is to use init_parameters().
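The weak-dtype behaviour from the first point can be seen directly in jax. This is a small illustration rather than NetKet code: plain jax disables 64-bit types by default, so the sketch enables them explicitly through the jax_enable_x64 option, and the array names are arbitrary.

```python
import jax

# Plain jax truncates to 32 bit unless 64-bit support is switched on.
jax.config.update("jax_enable_x64", True)

import jax.numpy as jnp

x32 = jnp.ones(3, dtype=jnp.float32)
x64 = jnp.ones(3, dtype=jnp.float64)

# Python scalars are weakly typed: they adopt the precision of the array.
weak_real = x32 * 2.0
weak_cplx = x32 * 1j

# An explicitly double-precision array forces promotion to double precision.
strong = x32 * x64

print(weak_real.dtype, weak_cplx.dtype, strong.dtype)
# float32 complex64 float64
```

A single-precision array anywhere in a chain of operations with weakly typed values is therefore enough to keep the result in single precision.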
