Bitfusion Client Service – The Bash in the Rue Morgue

This blog post shows you how to create a Bitfusion Client service for your VMs. Such VMs can boot with remote GPUs pre-allocated and available to applications without any need to invoke Bitfusion on the command line.

Introduction – Is That the Short Straw You Drew, Nancy?

The biggest value of Bitfusion is that you can share GPUs with other users from a pool of servers across the network. You invoke Bitfusion to allocate GPUs from the pool, to run an application, and to deallocate them when you are done. This increases the GPU utilization and makes sharing relatively painless—you don’t have to arrange schedules with other users, you don’t have to spin down your VM to allow someone else access, and you don’t have to port your code and its environment to special hosts with GPUs.

But maybe you want more. Maybe…

  • You want Bitfusion to be even more invisible; you don’t want to invoke it from the command line at all.
  • You want to allocate a set of GPUs for a whole session of application runs. This allows for apples-to-apples comparisons across runs. It also guarantees that once you’ve begun a session, you’ll own the GPUs until you finish.

These “maybes” coincide with what you might want in a GPUaaS environment, paying (or at least being tracked) for a session that comes with GPUs, whether or not you run anything that needs them.

Well, you can have all of this by leveraging what happens when you use Bitfusion to launch a bash shell, instead of a regular ML application.

Bitfusion Bash Session – Is This the Road the Chicken Should Cross, Alex?

The bitfusion run -n <N> <application> command does three things:

  1. Allocates N GPUs and sets up a Bitfusion environment
  2. Runs the application in that Bitfusion environment, which intercepts CUDA calls and forwards them to the remote GPUs
  3. Deallocates the GPUs and tears down the environment

Here is a specific example using CUDA sample code that comes with the CUDA toolkit.

cd /usr/local/cuda/samples/0_Simple/matrixMul
bitfusion run -n 1 -- ./matrixMul
...regular successful program output
#
#  Some comments
#  • alloc 1 GPU, create config file, and set environment variables to intercept CUDA calls
#  • run matrixMul, a CUDA app
#  • dealloc 1 GPU, unset environment variables

But you can use Bitfusion to run anything, including bash. Under bash, the Bitfusion environment will stay up until you explicitly exit bash. This provides a method to allocate the GPUs once and use them to run several applications. And any commands you run inside that bash shell do not need to be prefixed with a Bitfusion command.

Figure 1

The Bitfusion Environment – Are There No Places Like Holmes, Sherlock?

The Bitfusion environment has three parts: it intercepts calls to the CUDA driver (libcuda.so), it contains a configuration file that identifies the previously allocated GPUs, and it sets environment variables that specify information needed by the Bitfusion software. You can see this environment from the inner bash command line.

Figure 2
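If you want to inspect the environment yourself, a small helper along these lines may be handy. It is only a sketch: the function name show_bitfusion_env is made up for illustration, and the list of variable names is an assumption taken from the version 2.0.1 example later in this post, so other versions may set different ones.

```shell
# Print the variables in the current shell that look Bitfusion-related.
# ASSUMPTION: the name patterns below come from the 2.0.1 example in this
# post; they are not an official or complete list.
show_bitfusion_env() {
    env | grep -E '^(BF_[A-Za-z0-9_]*|LD_LIBRARY_PATH|LD_AUDIT|LD_PRELOAD|OPENCL_VENDOR_PATH|NCCL_P2P_DISABLE|IBV_FORK_SAFE)=' | sort
}
```

Run show_bitfusion_env in an ordinary shell and again inside a Bitfusion bash shell; the second listing should be the longer one.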

Plan of Attack – Are You Digging a Garden, or a Grave with that Spade, Sam?

You now know the two things you need to set up a Bitfusion client service:

  • A Bitfusion bash shell lets you run sequential applications with no further Bitfusion commands
  • You can easily clone the Bitfusion bash shell by duplicating its environment variables

A client service needs:

  • To run a bash script under Bitfusion that captures the client environment
  • This “capture” script to generate a bash profile script that will replicate the environment variables in any new shells that are launched
  • This “capture” script to keep itself alive (or Bitfusion will detect completion and deallocate the GPUs)

This blog shows you how to set up the service with systemd. The service uses three files. You must write the service file and the “capture” shell script, but the profile script will be created dynamically by the “capture” script.

Figure 3

The Client Service File – Is This Done in the Service of the Queen, Ellery?

A Bitfusion cluster comprises vCenter, Bitfusion servers (appliances with GPUs), and Bitfusion clients (VMs running applications needing acceleration from the server GPUs). This client service, not surprisingly, is written on and will run on the Bitfusion clients. It provisions, invisibly, the Bitfusion GPUs that the applications need.

Below is the text of the systemd service file, which defines the Bitfusion client service.

cat /lib/systemd/system/bitfusion-client.service
#
[Unit]
Description=Start Bitfusion Client Environment
#
[Service]
# Set User (and/or Group) to not run as root and cause log files to be written in user's .bitfusion subdir
User=root
#User=<username>
#Group=<usergroup>
Type=simple
# Edit the bitfusion run command to allocate the number of GPUs and partial size which you need
ExecStart=/usr/bin/bitfusion run -n 1 -- bash /opt/bitfusion/bitfusion-client-env
ExecStopPost=/bin/rm /etc/profile.d/bitfusion-client-env.sh
#ExecStopPost=/bin/rm /home/<username>/.bitfusion-client-env.sh
RestartSec=5
Restart=always
KillMode=process
#
[Install]
WantedBy=multi-user.target
Alias=bfcenv.service

The key line is the one that runs the “capture” script under Bitfusion:

ExecStart=/usr/bin/bitfusion run -n 1 -- bash /opt/bitfusion/bitfusion-client-env

This allocates a single GPU. Use the -n option to allocate a different number of GPUs. Use the -p or -m option to allocate partial GPUs, so that your application runs within a partition of GPU memory (-p 0.314 would allocate 31.4% of GPU memory; -m 4000 would allocate 4000 MB of GPU memory). See the Bitfusion User Guide.

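As written, the service will be launched by root and will provide the service for all users. To run it by and for a single user, uncomment the second User line, the Group line, and the second ExecStopPost line, filling in the fields in angle brackets (<username> and <usergroup>). You will probably also want to comment out the lines they replace (User=root and the first ExecStopPost), since the all-users profile script will not exist in that setup.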
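If you would rather not edit the service file each time you change the allocation, a systemd drop-in is one alternative. The sketch below is an assumption-laden illustration, not part of the original setup: the function name write_gpu_override and the file name override.conf are hypothetical, and it assumes your Bitfusion version accepts -n and -p together. The empty ExecStart= line is standard systemd syntax for clearing the command inherited from the main unit file.

```shell
# Write a drop-in that replaces the service's ExecStart line.
# Usage: write_gpu_override <drop-in directory> <GPU count> <memory fraction>
# ASSUMPTION: combining -n and -p is accepted by your Bitfusion version.
write_gpu_override() {
    dropin_dir="$1"
    gpus="$2"
    fraction="$3"
    mkdir -p "$dropin_dir" || return 1
    {
        printf '[Service]\n'
        # An empty ExecStart= clears the command from the main unit file.
        printf 'ExecStart=\n'
        printf 'ExecStart=/usr/bin/bitfusion run -n %s -p %s -- bash /opt/bitfusion/bitfusion-client-env\n' "$gpus" "$fraction"
    } > "$dropin_dir/override.conf"
}

# Example (as root), followed by a daemon-reload and a service restart:
#   write_gpu_override /etc/systemd/system/bitfusion-client.service.d 1 0.5
```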

The Capture Script – Are You Mimicking Me Like a Parrot, Hercule? Or — Can You Mirror the Big, Blue Marble, Jane? (Challenging names; do you purposely pose punning problems for us plain old, regular folk in Peoria or Corpus Christi, Agatha?)

To make a correct capture script, you need to identify all the environment variables created or modified by Bitfusion. You can do this by running diff on the output of env inside and outside of a Bitfusion shell. You may still need to do manual comparison if the variables are printed in different orders, but the diff output will at least be a starting point.

env > outside.txt
bitfusion run -n 1 -- bash
env > inside.txt
exit
diff outside.txt inside.txt > bitfusionenv.txt
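To sidestep the ordering problem, you could sort both dumps before comparing them. The helper below is a sketch under a couple of assumptions: the function name env_only_inside is invented here, and it treats each line of the env output as one variable, so a value containing a newline would confuse it.

```shell
# Print the lines that appear only in the "inside" dump, i.e. variables
# that Bitfusion added or changed.
# Usage: env_only_inside outside.txt inside.txt
env_only_inside() {
    eoi_out="$(mktemp)" || return 1
    eoi_in="$(mktemp)" || return 1
    # Sort and compare under the same locale so comm sees consistent ordering.
    LC_ALL=C sort "$1" > "$eoi_out"
    LC_ALL=C sort "$2" > "$eoi_in"
    # -1 hides lines unique to outside, -3 hides lines common to both.
    LC_ALL=C comm -13 "$eoi_out" "$eoi_in"
    rm -f "$eoi_out" "$eoi_in"
}
```

Changed variables such as SHLVL show up with their new value, alongside variables that exist only inside the Bitfusion shell.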

Once you have your list of environment variables, write a script that recreates each of those variables and place it in /opt/bitfusion/bitfusion-client-env, where the systemd service file expects it to be. Note: you can ignore the variable SHLVL.

Here is an example valid for version 2.0.1 of Bitfusion.

cat /opt/bitfusion/bitfusion-client-env
BITFUSIONTARGFILE=/etc/profile.d/bitfusion-client-env.sh
#BITFUSIONTARGFILE=/home/<username>/.bitfusion-client-env.sh
#
/bin/chmod 644 $BF_ADAPTOR_CONFIG
#
/bin/echo "export LD_LIBRARY_PATH=\"/opt/bitfusion/lib/x86_64-linux-gnu/bitfusion/lib/nvml:/opt/intel/opencl/lib64:/opt/bitfusion/lib/x86_64-linux-gnu/bitfusion/lib/cuda:/etc/bitfusion/icd:/opt/bitfusion/lib/x86_64-linux-gnu/bitfusion/lib/opencl:/opt/bitfusion/lib/x86_64-linux-gnu/bitfusion/lib:\$LD_LIBRARY_PATH\"" > $BITFUSIONTARGFILE
/bin/echo "export BF_USER_COMMAND=bash" >> $BITFUSIONTARGFILE
/bin/echo "export BF_ENABLE_RDMA_TWO_HOPS=$BF_ENABLE_RDMA_TWO_HOPS" >> $BITFUSIONTARGFILE
/bin/echo "export BF_LICENSE_FILE=$BF_LICENSE_FILE" >> $BITFUSIONTARGFILE
/bin/echo "export BF_LOG_FILE=$BF_LOG_FILE" >> $BITFUSIONTARGFILE
/bin/echo "export BF_DISABLE_DEVPTR_BUF_SCAN=$BF_DISABLE_DEVPTR_BUF_SCAN" >> $BITFUSIONTARGFILE
/bin/echo "export BF_CACHE_STORE_ROOT=$BF_CACHE_STORE_ROOT" >> $BITFUSIONTARGFILE
/bin/echo "export OPENCL_VENDOR_PATH=$OPENCL_VENDOR_PATH" >> $BITFUSIONTARGFILE
/bin/echo "export BF_CACHE_STORE_CLEANUP_THRESHOLD=$BF_CACHE_STORE_CLEANUP_THRESHOLD" >> $BITFUSIONTARGFILE
/bin/echo "export BF_ENABLE_CUDA_CACHING_ALL=$BF_ENABLE_CUDA_CACHING_ALL" >> $BITFUSIONTARGFILE
/bin/echo "export LD_AUDIT=\"/opt/bitfusion/lib/x86_64-linux-gnu/bitfusion/lib/libBFAudit.so:\$LD_AUDIT\"" >> $BITFUSIONTARGFILE
/bin/echo "export NCCL_P2P_DISABLE=1" >> $BITFUSIONTARGFILE
#
/bin/echo "export BF_ADAPTOR_PATH=$BF_ADAPTOR_PATH" >> $BITFUSIONTARGFILE
/bin/echo "export BF_ADAPTOR_CONFIG=$BF_ADAPTOR_CONFIG" >> $BITFUSIONTARGFILE
/bin/echo "export IBV_FORK_SAFE=1" >> $BITFUSIONTARGFILE
/bin/echo "export BF_ADAPTOR_RDMA=$BF_ADAPTOR_RDMA" >> $BITFUSIONTARGFILE
/bin/echo "export LD_PRELOAD=\":/opt/bitfusion/lib/x86_64-linux-gnu/bitfusion/lib/libsyscall_intercept.so:\$LD_PRELOAD\"" >> $BITFUSIONTARGFILE
#
# If expecting service to be run by root...
#    Set log file so local user will succeed in writing it.
#    Set cache lock file so local user will be able to access it.
#/bin/echo "export BF_LOG_FILE=~/.bitfusion/bf_Global.log" >> $BITFUSIONTARGFILE
#/bin/echo "export BF_CACHE_STORE_ROOT=~/.bitfusion/cache/" >> $BITFUSIONTARGFILE
#
/bin/sleep infinity

Note the following about this file:

  • At the top, define BITFUSIONTARGFILE either for all users, or uncomment the subsequent line and define it for a single user, replacing <username> with the actual name. This is the file we create dynamically that brings up new shells with the Bitfusion environment.
  • The chmod command close to the top lets users access the Bitfusion configuration file their applications will need to identify the allocated GPUs.
  • The new value for the variable LD_LIBRARY_PATH is set by prefixing the previous value with Bitfusion directories.
  • If you are launching the service as root, uncomment the two lines before the sleep command. They set up logging and caching files which are accessible to normal users.
  • The last line prevents the shell script from completing, and in turn, keeps the Bitfusion session alive.
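If the one-echo-per-variable style feels hard to maintain, a loop over variable names is one possible alternative. This is a sketch, not the script used above: the function name emit_exports is hypothetical, and it copies each variable's complete current value, whereas the script above prefixes LD_LIBRARY_PATH, LD_AUDIT, and LD_PRELOAD onto whatever the login shell already has. You would still want to handle those three separately.

```shell
# Print an "export NAME='value'" line for each named variable that is set.
# Values are single-quoted so spaces and special characters survive.
emit_exports() {
    for ee_name in "$@"; do
        # printenv fails for unset names; skip those.
        ee_value="$(printenv "$ee_name")" || continue
        # Escape embedded single quotes as '\'' for safe single-quoting.
        ee_escaped="$(printf '%s' "$ee_value" | sed "s/'/'\\\\''/g")"
        printf "export %s='%s'\n" "$ee_name" "$ee_escaped"
    done
}

# Example, inside the capture script:
#   emit_exports BF_ADAPTOR_PATH BF_ADAPTOR_CONFIG BF_ADAPTOR_RDMA >> "$BITFUSIONTARGFILE"
```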

Private User Session – Are You a Lone Wolfe, Nero?

The files in the above two sections give you the option of launching the Bitfusion client service as root or as a regular, but specific, user. You just have to uncomment and complete the lines appropriate to your choice.

If the service is launched by root, then all users who log in to the VM (a Bitfusion client VM) can use the GPUs allocated by Bitfusion. Otherwise, only the specific user can use the GPUs provided by the service.

If you want to run the service by and for a specific user, there is one more file you must edit. The user’s file, ~/.profile, needs to find and execute the local copy of the dynamically-created profile script.

The lines to add are listed here:

cat ~/.profile
...
# use local Bitfusion profile if it exists
if [ -f "$HOME/.bitfusion-client-env.sh" ] ; then
. "$HOME/.bitfusion-client-env.sh"
fi
...

Ready, Set, Go – Can You Keep Up with the Joneses, Jupiter?

All that remains to be done is to enable the Bitfusion client service so it will start automatically when the VM boots up. To enable the service:
[sudo] systemctl enable bitfusion-client
Now if you reboot the system, the service will start. Just log in and you can successfully run your CUDA applications as if there were local GPUs. Any time you start the service, you will have to log in to a new shell to join the Bitfusion session; the current shell’s environment will not have been changed.
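A quick way to tell whether the shell you are in has joined the session is to look for the Bitfusion configuration file. The check below is a rough sketch: the function name bitfusion_session_ready is made up, and it assumes that a set BF_ADAPTOR_CONFIG pointing at a readable file means the profile script has been sourced. It cannot tell you whether the file is stale after a service restart.

```shell
# Succeed if this shell appears to carry the Bitfusion environment.
# ASSUMPTION: a set BF_ADAPTOR_CONFIG naming a readable file is taken as
# evidence that the generated profile script was sourced at login.
bitfusion_session_ready() {
    [ -n "${BF_ADAPTOR_CONFIG:-}" ] && [ -r "$BF_ADAPTOR_CONFIG" ]
}

# Example:
#   bitfusion_session_ready || echo "Log in to a new shell to join the session"
```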

You may want to run other systemd commands, as well. Here is a list of common systemd service commands, each with a brief comment:

# Start (or re-start) the service
[sudo] systemctl [re]start bitfusion-client
#
# Stop the service
[sudo] systemctl stop bitfusion-client
#
# Is the service running? ":q" to quit
systemctl status bitfusion-client
#
# Automatically start the service
[sudo] systemctl enable bitfusion-client
#
# Run this between a start and stop if you’ve edited the service
[sudo] systemctl daemon-reload
#
# View the service log; ":q" to quit.
# Helps to debug launch problems.
journalctl -u bitfusion-client

Limitations – Are You Yet on the Side of the Angels, Charlie?

The client service you have created here, tautologically, is a service you have created yourself. It is not a feature built into Bitfusion. We expect Bitfusion to introduce many enhancements, services, and features over the course of time. But for now, consider some limitations of what you have done.

If you are running this service for yourself, you might be satisfied. But if you are running it for other users, notice you have not constrained them to stay within the bounds of the service. For example, a user could:

  • Run Bitfusion from the command line to allocate and use different GPUs (while not deallocating the initial GPUs)
  • Modify or stop the service with systemd commands

Also, the service, by itself, could be judged to have rough edges:

  • The service does not terminate if the GPUs are left idle
  • An enabled service that experiences a temporary shutdown (a longish period of network congestion, etc.) should restart itself, but the current shell will lose access permanently because some of its environment variables become invalid; you will have to log out (say, of ssh) and log back in

The administrator, however, can ameliorate some of those shortcomings from the vCenter Bitfusion plug-in. See the Bitfusion User Guide to:

  1. Limit the number of GPUs a client may allocate
  2. Set a client “idle GPU” timeout to force GPUs to be deallocated if they sit unused for too long

On the other hand, if your users happen to be responsible human beings, the kind that can be trusted with a bit of responsibility, then their ability to access the systemd commands means they can address issues or modify the service without requiring their VM to be restarted and without consuming any more of your valuable time.

Conclusion – Is Latin Word Order Truly Unimportant in an Opus Magnum, Thomas?

This blog was more of a howdunnit than a whodunnit…

But we done it.

(Irrelevant Aside and Challenge: want a short whodunnit parody of an iconic movie, in verse, with a surprising, yet inevitable conclusion? Find the text of Mae Z. Scanlan’s “Gone with the Wind Murder Case”. If I still have my copy, it’s in a box in the attic somewhere, but this blog has made me think it would be nice to read it again.)

We have built a Bitfusion Client service with aspects of GPUaaS. It gives you access to a Bitfusion session with GPUs available for your applications. You do not have to type Bitfusion commands at the prompt; the GPUs are allocated for the life of the service. The means of doing this was to run bash under Bitfusion, keep that bash alive under systemd, and then replicate its environment in any new log-ins.

Top-Ten Rejected Section Titles

  1. Should I buff high, but mar low, Philip?
  2. Is it into that good night I should not go gently, Dirk?
  3. Is that what you’d do in the mo-o-rning, if you had a hammer, Mike?
  4. Is that sandwich good and hardy, Frank and Joe?
  5. Are you reading pop psychology about Venus and Mars, Veronica?
  6. Are these biscuits golden brown, Father?
  7. Shall I traverse the wooden bridge or the rock ford, Jim?
  8. If the water could be liquid or frozen, would you want to warsh or ski, Vic?
  9. Who sang Me and Bobby McGee, Travis?
  10. Will you have, using the lead pipe on that billiard, room to apply the necessary mustard, Colonel?