Running libEnsemble
Introduction
libEnsemble runs with one manager and multiple workers. Each worker may run either a generator or a simulator function (both are written in Python). Generators determine the parameters/inputs for simulations. Simulator functions run and manage simulations, which often involve running a user application (see Executor).
Note
As of version 1.3.0, the generator can be run as a thread on the manager, using the libE_specs option gen_on_manager. When using this option, set the number of workers desired for running simulations. See Running generator on the manager for more details.
To use libEnsemble, you will need a calling script, which in turn will specify generator and simulator functions. Many examples are available.
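For illustration, a minimal calling script might resemble the following sketch; it uses the uniform_random_sample generator and six_hump_camel simulator bundled with libEnsemble (adjust names, dimensions, and bounds for your own problem):

import numpy as np

from libensemble.libE import libE
from libensemble.gen_funcs.sampling import uniform_random_sample
from libensemble.sim_funcs.six_hump_camel import six_hump_camel
from libensemble.tools import parse_args, add_unique_random_streams

# Read the number of workers and comms type from the command line
nworkers, is_manager, libE_specs, _ = parse_args()

# Simulator: evaluate the six-hump camel function at each generated "x"
sim_specs = {"sim_f": six_hump_camel, "in": ["x"], "out": [("f", float)]}

# Generator: sample "x" uniformly within the given bounds
gen_specs = {
    "gen_f": uniform_random_sample,
    "out": [("x", float, (2,))],
    "user": {"gen_batch_size": 50, "lb": np.array([-3.0, -2.0]), "ub": np.array([3.0, 2.0])},
}

persis_info = add_unique_random_streams({}, nworkers + 1)
exit_criteria = {"sim_max": 100}  # stop after 100 simulations

H, persis_info, flag = libE(sim_specs, gen_specs, exit_criteria, persis_info, libE_specs=libE_specs)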
There are currently three communication options for libEnsemble, determining how the manager and workers communicate: mpi, local, and tcp. The default is mpi.
Note
You do not need the mpi communication mode to use the MPI Executor. The communication modes described here refer only to how the libEnsemble manager and workers communicate.
MPI mode
This option uses mpi4py for the manager/worker communication. It is used automatically if you run your libEnsemble calling script with an MPI runner such as:
mpirun -np N python myscript.py
where N is the number of processes. This will launch one manager and N-1 workers.
This option requires mpi4py to be installed to interface with the MPI on your system. It works on a standalone system and with both central and distributed modes of running libEnsemble on multi-node systems. It also potentially scales the best when running with many workers on HPC systems.
Limitations of MPI mode
If launching MPI applications from workers, then MPI is nested. This is not supported with Open MPI. This can be overcome by using a proxy launcher (see Balsam). This nesting does work with MPICH and its derivative MPI implementations.
This mode is also unsuitable when running on the launch nodes of three-tier systems (e.g., Summit); in that case, local mode is recommended.
Local mode
This option uses Python's built-in multiprocessing module. The comms type local and the number of workers nworkers may be provided in libE_specs.
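For example, a minimal sketch in the calling script (four workers assumed):

# In the calling script, before the libE() call
libE_specs = {"comms": "local", "nworkers": 4}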
Then run:
python myscript.py
Or, if the script uses the parse_args function, or an Ensemble object with Ensemble(parse_args=True), you can specify these on the command line:
python myscript.py --comms local --nworkers N
This will launch one manager and N workers.
libEnsemble will run on one node in this scenario. To prevent applications from being launched onto this node (if running libEnsemble on a compute node), set libE_specs["dedicated_mode"] = True. This mode is often used to run on a launch node of a three-tier system (e.g., Summit), ensuring the whole compute-node allocation is available for launching apps. In that case, make sure there are no imports of mpi4py in your Python scripts.
Note that on macOS (since Python 3.8) and Windows, the default multiprocessing start method is "spawn" instead of "fork"; to resolve many related issues, we recommend placing calling script code in an if __name__ == "__main__": block.
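A sketch of the recommended structure:

from libensemble.libE import libE

def main():
    # build sim_specs, gen_specs, exit_criteria, then call libE() here
    ...

if __name__ == "__main__":
    main()  # ensures the libE() call runs only in the main process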
Limitations of local mode
Workers cannot be distributed across nodes.
In some scenarios, any import of mpi4py will cause this mode to break.
Local mode does not have the potential scaling of MPI mode, but it is sufficient for most users.
TCP mode
Run the manager on one system and launch workers to remote systems or nodes over TCP. Configure through libE_specs, or on the command line if using an Ensemble object with Ensemble(parse_args=True).
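For instance, a command-line launch might resemble the following (hostnames are placeholders):
python myscript.py --comms tcp --workers machine1 machine2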
Reverse-ssh interface
Set comms to ssh to launch workers on remote ssh-accessible systems. This co-locates workers, user functions, and any applications. User functions can also be persistent, unlike when launching remote functions via Globus Compute.
The remote working directory and Python need to be specified. This may resemble:
python myscript.py --comms ssh --workers machine1 machine2 --worker_pwd /home/workers --worker_python /home/.conda/.../python
Limitations of TCP mode
There cannot be two calls to libE() or Ensemble.run() in the same script.
Further Command Line Options
See the parse_args function in Convenience Tools for further command-line options.
Persistent Workers
In a regular (non-persistent) worker, the user’s generator or simulation function is called whenever the worker receives work. A persistent worker is one that continues to run the generator or simulation function between work units, maintaining the local data environment.
A common use-case consists of a persistent generator (such as persistent_aposmm) that maintains optimization data while generating new simulation inputs. The persistent generator runs on a dedicated worker while in persistent mode. This requires an appropriate allocation function that will run the generator as persistent.
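For example, the calling script can select the bundled start_only_persistent allocation function (a sketch; pass alloc_specs to libE() alongside the other specs):

from libensemble.alloc_funcs.start_only_persistent import only_persistent_gens

# Allocation function that starts the generator in persistent mode
alloc_specs = {"alloc_f": only_persistent_gens}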
When running with a persistent generator, it is important to remember that a worker will be dedicated to the generator and cannot run simulations. For example, the following run:
mpirun -np 3 python my_script.py
starts one manager, one worker with a persistent generator, and one worker for running simulations.
If this example were run as:
mpirun -np 2 python my_script.py
then no simulations would be able to run, since the only worker would be dedicated to the persistent generator.
Running generator on the manager
The majority of libEnsemble use cases run a single generator. The libE_specs option gen_on_manager causes the generator function to run on a thread on the manager. This supports persistent user functions, shares data structures with the manager, and avoids the additional communication needed when the generator runs on a worker. When using this option, the number of workers specified should be the (maximum) number of concurrent simulations.
If modifying a workflow to use gen_on_manager, consider the following:
Set nworkers to the number of workers desired for running simulations.
If using add_unique_random_streams() to seed random streams, the default generator seed will be zero.
If you have a line like libE_specs["num_resource_sets"] = nworkers - 1, it should be removed.
If the generator does use resources, num_resource_sets can be increased as needed so that the generator and all simulations are resourced.
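For example, a sketch of the relevant settings (with gen_on_manager set, nworkers counts only simulation workers):

libE_specs = {"gen_on_manager": True, "comms": "local", "nworkers": 4}  # up to 4 concurrent simulations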
Environment Variables
Environment variables required in your run environment can be set in your Python sim or gen function. For example:
os.environ["OMP_NUM_THREADS"] = "4"
set in your simulation script before the Executor submit command will export the setting to your run. For running a bash script in a sub-environment when using the Executor, see the env_script option to the MPI Executor.
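Put together, the inside of a simulation function might resemble this sketch ("my_app" and the process count are placeholders; the application is assumed to have been registered with the executor in the calling script):

import os
from libensemble.executors.executor import Executor

def run_app_sim(H, persis_info, sim_specs, libE_info):
    # Export the setting before launching the application
    os.environ["OMP_NUM_THREADS"] = "4"  # environment values must be strings
    exctr = Executor.executor  # the executor instance registered in the calling script
    task = exctr.submit(app_name="my_app", num_procs=8)  # placeholder app and process count
    task.wait()  # block until the launched application finishes
    # ... read results and return the output array as usual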
liberegister / libesubmit
Command-line utilities for preparing and launching libEnsemble workflows onto almost any machine and any scheduler, using a PSI/J Python implementation.
liberegister
Creates an initial, platform-independent PSI/J serialization of a libEnsemble submission. Run this utility on a calling script:
liberegister my_calling_script.py --comms local --nworkers 4
This produces an initial my_calling_script.json serialization conforming to PSI/J’s specification:
my_calling_script.json
{
    "version": 0.1,
    "type": "JobSpec",
    "data": {
        "name": "libe-job",
        "executable": "python",
        "arguments": [
            "my_calling_script.py",
            "--comms",
            "local",
            "--nworkers",
            "4"
        ],
        "directory": null,
        "inherit_environment": true,
        "environment": {
            "PYTHONNOUSERSITE": "1"
        },
        "stdin_path": null,
        "stdout_path": null,
        "stderr_path": null,
        "resources": {
            "node_count": 1,
            "process_count": null,
            "process_per_node": null,
            "cpu_cores_per_process": null,
            "gpu_cores_per_process": null,
            "exclusive_node_use": true
        },
        "attributes": {
            "duration": "30",
            "queue_name": null,
            "project_name": null,
            "reservation_id": null,
            "custom_attributes": {}
        },
        "launcher": null
    }
}
libesubmit
Further parameterizes a serialization and submits a corresponding job to the specified scheduler:
libesubmit my_calling_script.json -q debug -A project -s slurm --nnodes 8
Results in:
*** libEnsemble 0.9.3 ***
Imported PSI/J serialization: my_calling_script.json. Preparing submission...
Calling script: my_calling_script.py
...found! Proceeding.
Submitting Job!: Job[id=ce4ead75-a3a4-42a3-94ff-c44b3b2c7e61, native_id=None, executor=None, status=JobStatus[NEW, time=1658167808.5125017]]
$ squeue --long --users=user
Mon Jul 18 13:10:15 2022
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2508936 debug ce4ead75 user PENDING 0:00 30:00 8 (Priority)
This also produces a Job-specific representation, e.g.:
8ba9de56.my_calling_script.json
{
    "version": 0.1,
    "type": "JobSpec",
    "data": {
        "name": "libe-job",
        "executable": "/Users/jnavarro/miniconda3/envs/libe/bin/python3.9",
        "arguments": [
            "my_calling_script.py",
            "--comms",
            "local",
            "--nworkers",
            "4"
        ],
        "directory": "/home/user/libensemble/scratch",
        "inherit_environment": true,
        "environment": {
            "PYTHONNOUSERSITE": "1"
        },
        "stdin_path": null,
        "stdout_path": "8ba9de56.my_calling_script.out",
        "stderr_path": "8ba9de56.my_calling_script.err",
        "resources": {
            "node_count": 8,
            "process_count": null,
            "process_per_node": null,
            "cpu_cores_per_process": null,
            "gpu_cores_per_process": null,
            "exclusive_node_use": true
        },
        "attributes": {
            "duration": "30",
            "queue_name": "debug",
            "project_name": "project",
            "reservation_id": null,
            "custom_attributes": {}
        },
        "launcher": null
    }
}
If libesubmit is run on a .json serialization from liberegister and can’t find the specified calling script, it will help search for matching candidate scripts.
Further Run Information
For running on multi-node platforms and supercomputers, there are alternative ways to configure libEnsemble for the available resources. See the Running on HPC Systems guide for more information, including examples for specific systems.