Running libEnsemble
Introduction
libEnsemble runs with one manager and multiple workers. Each worker may run either a generator or a simulator function (both are written in Python). Generators determine the parameters/inputs for simulations. Simulator functions run and manage simulations, which often involve running a user application (see Executor).
Note
As of version 1.3.0, the generator can be run as a thread on the manager, using the libE_specs option gen_on_manager. When using this option, set the number of workers desired for running simulations. See Running generator on the manager for more details.
To use libEnsemble, you will need a calling script, which in turn will specify generator and simulator functions. Many examples are available.
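For illustration, a minimal calling script might resemble the following sketch; it uses the uniform_random_sample generator and six_hump_camel simulator bundled with libEnsemble (adjust names, dimensions, and bounds for your own problem):

import numpy as np

from libensemble.libE import libE
from libensemble.gen_funcs.sampling import uniform_random_sample
from libensemble.sim_funcs.six_hump_camel import six_hump_camel
from libensemble.tools import parse_args, add_unique_random_streams

# Read the number of workers and comms type from the command line
nworkers, is_manager, libE_specs, _ = parse_args()

# Simulator: evaluate the six-hump camel function at each generated "x"
sim_specs = {"sim_f": six_hump_camel, "in": ["x"], "out": [("f", float)]}

# Generator: sample "x" uniformly within the given bounds
gen_specs = {
    "gen_f": uniform_random_sample,
    "out": [("x", float, (2,))],
    "user": {"gen_batch_size": 50, "lb": np.array([-3.0, -2.0]), "ub": np.array([3.0, 2.0])},
}

persis_info = add_unique_random_streams({}, nworkers + 1)
exit_criteria = {"sim_max": 100}  # stop after 100 simulations

H, persis_info, flag = libE(sim_specs, gen_specs, exit_criteria, persis_info, libE_specs=libE_specs)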
There are currently three communication options for libEnsemble, determining how the manager and workers communicate: mpi, local, and tcp. The default is mpi.
Note
You do not need the mpi communication mode to use the MPI Executor. The communication modes described here refer only to how the libEnsemble manager and workers communicate.
MPI mode
This option uses mpi4py for the manager/worker communication. It is used automatically if you run your libEnsemble calling script with an MPI runner such as:
mpirun -np N python myscript.py
where N is the number of processes. This will launch one manager and N-1 workers.
This option requires mpi4py to be installed to interface with the MPI on your system. It works on a standalone system and with both central and distributed modes of running libEnsemble on multi-node systems. It also potentially scales the best when running with many workers on HPC systems.
Limitations of MPI mode
If launching MPI applications from workers, then MPI is nested. This is not supported with Open MPI. This can be overcome by using a proxy launcher (see Balsam). This nesting does work with MPICH and its derivative MPI implementations.
This mode is also unsuitable when running on the launch nodes of three-tier systems (e.g., Summit); in that case, local mode is recommended.
Local mode
This option uses Python's built-in multiprocessing module. The comms type local and the number of workers nworkers may be provided in libE_specs.
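For example, a minimal sketch in the calling script (four workers assumed):

# In the calling script, before the libE() call
libE_specs = {"comms": "local", "nworkers": 4}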
Then run:
python myscript.py
Or, if the script uses the parse_args function, or an Ensemble object with Ensemble(parse_args=True), you can specify these on the command line:
python myscript.py --comms local --nworkers N
This will launch one manager and N workers.
libEnsemble will run on one node in this scenario. To prevent applications from being launched onto this node (if running libEnsemble on a compute node), set libE_specs["dedicated_mode"] = True. This mode is often used to run on a launch node of a three-tier system (e.g., Summit), ensuring the whole compute-node allocation is available for launching apps. In that case, make sure there are no imports of mpi4py in your Python scripts.
Note that on macOS (since Python 3.8) and Windows, the default multiprocessing start method is "spawn" instead of "fork"; to resolve many related issues, we recommend placing calling script code in an if __name__ == "__main__": block.
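A sketch of the recommended structure:

from libensemble.libE import libE

def main():
    # build sim_specs, gen_specs, exit_criteria, then call libE() here
    ...

if __name__ == "__main__":
    main()  # ensures the libE() call runs only in the main process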
Limitations of local mode
Workers cannot be distributed across nodes.
In some scenarios, any import of mpi4py will cause this mode to break.
Local mode does not have the potential scaling of MPI mode, but it is sufficient for most users.
TCP mode
Run the manager on one system and launch workers to remote systems or nodes over TCP. Configure through libE_specs, or on the command line if using an Ensemble object with Ensemble(parse_args=True).
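For instance, a command-line launch might resemble the following (hostnames are placeholders):
python myscript.py --comms tcp --workers machine1 machine2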
Reverse-ssh interface
Set comms to ssh to launch workers on remote ssh-accessible systems. This co-locates workers, user functions, and any applications. User functions can also be persistent, unlike when launching remote functions via Globus Compute.
The remote working directory and Python need to be specified. This may resemble:
python myscript.py --comms ssh --workers machine1 machine2 --worker_pwd /home/workers --worker_python /home/.conda/.../python
Limitations of TCP mode
There cannot be two calls to libE() or Ensemble.run() in the same script.
Further Command Line Options
See the parse_args function in Convenience Tools for further command-line options.
Persistent Workers
In a regular (non-persistent) worker, the user’s generator or simulation function is called whenever the worker receives work. A persistent worker is one that continues to run the generator or simulation function between work units, maintaining the local data environment.
A common use-case consists of a persistent generator (such as persistent_aposmm) that maintains optimization data while generating new simulation inputs. The persistent generator runs on a dedicated worker while in persistent mode. This requires an appropriate allocation function that will run the generator as persistent.
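For example, the calling script can select the bundled start_only_persistent allocation function (a sketch; pass alloc_specs to libE() alongside the other specs):

from libensemble.alloc_funcs.start_only_persistent import only_persistent_gens

# Allocation function that starts the generator in persistent mode
alloc_specs = {"alloc_f": only_persistent_gens}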
When running with a persistent generator, it is important to remember that a worker will be dedicated to the generator and cannot run simulations. For example, the following run:
mpirun -np 3 python my_script.py
starts one manager, one worker with a persistent generator, and one worker for running simulations.
If this example were run as:
mpirun -np 2 python my_script.py
then no simulations would be able to run, since the only worker would be dedicated to the persistent generator.
Running generator on the manager
The majority of libEnsemble use cases run a single generator. The libE_specs option gen_on_manager causes the generator function to run on a thread on the manager. This supports persistent user functions, shares data structures with the manager, and avoids the additional communication needed when the generator runs on a worker. When using this option, the number of workers specified should be the (maximum) number of concurrent simulations.
If modifying a workflow to use gen_on_manager, consider the following:
Set nworkers to the number of workers desired for running simulations.
If using add_unique_random_streams() to seed random streams, the default generator seed will be zero.
If you have a line like libE_specs["num_resource_sets"] = nworkers - 1, it should be removed.
If the generator does use resources, num_resource_sets can be increased as needed so that the generator and all simulations are resourced.
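For example, a sketch of the relevant settings (with gen_on_manager set, nworkers counts only simulation workers):

libE_specs = {"gen_on_manager": True, "comms": "local", "nworkers": 4}  # up to 4 concurrent simulations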
Environment Variables
Environment variables required in your run environment can be set in your Python sim or gen function. For example:
os.environ["OMP_NUM_THREADS"] = "4"
set in your simulation script before the Executor submit command will export the setting to your run. For running a bash script in a sub-environment when using the Executor, see the env_script option to the MPI Executor.
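Put together, the inside of a simulation function might resemble this sketch ("my_app" and the process count are placeholders; the application is assumed to have been registered with the executor in the calling script):

import os
from libensemble.executors.executor import Executor

def run_app_sim(H, persis_info, sim_specs, libE_info):
    # Export the setting before launching the application
    os.environ["OMP_NUM_THREADS"] = "4"  # environment values must be strings
    exctr = Executor.executor  # the executor instance registered in the calling script
    task = exctr.submit(app_name="my_app", num_procs=8)  # placeholder app and process count
    task.wait()  # block until the launched application finishes
    # ... read results and return the output array as usual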
liberegister / libesubmit
Command-line utilities for preparing and launching libEnsemble workflows onto almost any machine and any scheduler, using a PSI/J Python implementation.
liberegister
Creates an initial, platform-independent PSI/J serialization of a libEnsemble submission. Run this utility on a calling script:
liberegister my_calling_script.py --comms local --nworkers 4
This produces an initial my_calling_script.json serialization conforming to PSI/J’s specification:
my_calling_script.json
{
    "version": 0.1,
    "type": "JobSpec",
    "data": {
        "name": "libe-job",
        "executable": "python",
        "arguments": [
            "my_calling_script.py",
            "--comms",
            "local",
            "--nworkers",
            "4"
        ],
        "directory": null,
        "inherit_environment": true,
        "environment": {
            "PYTHONNOUSERSITE": "1"
        },
        "stdin_path": null,
        "stdout_path": null,
        "stderr_path": null,
        "resources": {
            "node_count": 1,
            "process_count": null,
            "process_per_node": null,
            "cpu_cores_per_process": null,
            "gpu_cores_per_process": null,
            "exclusive_node_use": true
        },
        "attributes": {
            "duration": "30",
            "queue_name": null,
            "project_name": null,
            "reservation_id": null,
            "custom_attributes": {}
        },
        "launcher": null
    }
}
libesubmit
Further parameterizes a serialization and submits a corresponding job to the specified scheduler:
libesubmit my_calling_script.json -q debug -A project -s slurm --nnodes 8
Results in:
*** libEnsemble 0.9.3 ***
Imported PSI/J serialization: my_calling_script.json. Preparing submission...
Calling script: my_calling_script.py
...found! Proceeding.
Submitting Job!: Job[id=ce4ead75-a3a4-42a3-94ff-c44b3b2c7e61, native_id=None, executor=None, status=JobStatus[NEW, time=1658167808.5125017]]
$ squeue --long --users=user
Mon Jul 18 13:10:15 2022
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2508936 debug ce4ead75 user PENDING 0:00 30:00 8 (Priority)
This also produces a Job-specific representation, e.g.:
8ba9de56.my_calling_script.json
{
    "version": 0.1,
    "type": "JobSpec",
    "data": {
        "name": "libe-job",
        "executable": "/Users/jnavarro/miniconda3/envs/libe/bin/python3.9",
        "arguments": [
            "my_calling_script.py",
            "--comms",
            "local",
            "--nworkers",
            "4"
        ],
        "directory": "/home/user/libensemble/scratch",
        "inherit_environment": true,
        "environment": {
            "PYTHONNOUSERSITE": "1"
        },
        "stdin_path": null,
        "stdout_path": "8ba9de56.my_calling_script.out",
        "stderr_path": "8ba9de56.my_calling_script.err",
        "resources": {
            "node_count": 8,
            "process_count": null,
            "process_per_node": null,
            "cpu_cores_per_process": null,
            "gpu_cores_per_process": null,
            "exclusive_node_use": true
        },
        "attributes": {
            "duration": "30",
            "queue_name": "debug",
            "project_name": "project",
            "reservation_id": null,
            "custom_attributes": {}
        },
        "launcher": null
    }
}
If libesubmit is run on a .json serialization from liberegister and can’t find the specified calling script, it will help search for matching candidate scripts.
Further Run Information
For running on multi-node platforms and supercomputers, there are alternative ways to configure libEnsemble for the available resources. See the Running on HPC Systems guide for more information, including examples for specific systems.