Model governance is difficult. There are many challenges to manage, from securely onboarding and scanning a model, to deployment and management concerns such as ongoing access control. That’s why at VMware Explore 2024, we introduced the Model Gallery as a new key capability in Private AI Foundation with NVIDIA.

In this article, we will walkthrough secure onboarding of a Llama-3-8b-instruct to the Model Gallery, governance strategies and deployment in your private cloud. But there’s so much more you can do with the Model Gallery. Once you have models stored in an OCI compatible format you can begin to add CI/CD processes to automate integration and deployment of model runtimes alongside GenAI applications.

Workflow Overview

To securely onboard any model to the Model Gallery, we are going to deploy an AI workstation to act as our DMZ for ensuring the models we onboard to our infrastructure are both secure and performant. Once we have a workstation, we need to download a model, perform static security scanning, ensure it runs correctly, complete runtime security scanning and a performance evaluation before uploading it to the Model Gallery for later use outside the DMZ.

Once uploaded to the Model Gallery, we can govern models and adjust access control as well as revisions.

Finally, developers will be able to easily deploy the model via CLI in their own environment.

You can watch a demonstration of this workflow in our 2024 Breakout Session.

Prerequisites

To follow along with this guide, you will need:

VMware Cloud Foundation 5.2.1 or later
VMware Private AI Foundation with NVIDIA Add-on
VCF Preparation for Private AI
Harbor Deployed on VCF
A content library with a Deep Learning VM Image

Infrastructure Configuration

To get started, we need to provision a Deep Learning VM. The Deep Learning VM is a VM image delivered as part of VMware Private AI Foundation with NVIDIA. It is preconfigured with popular ML tools and is optimized and validated by NVIDIA and VMware for GPU acceleration in VMware Cloud Foundation. We’re going to use the Private AI quick start to configure our environment.

Here you can select between air gapped and cloud connected. Follow the respective instructions needed to deploy a Deep Learning VM:

From here, we run the Private AI Quick Start Wizard in VCF Automation Catalog setup wizard to create 5 templates for us to deploy. But before we do so, we need to make sure we have configured a VM class for our large language model inferencing workload.

In this guide, we make use of the Triton TensorRT-LLM backend to run Llama3-8b-instruct. To use this backend, we need to create a new VMclass with our vGPU profile connected as a PCI device.

Then, we set the UVM parameter under Advanced Parameters.

Now we are ready to run the quick start, using the VM class we created earlier.

When the quick start wizard is complete we should see 5 new catalog items in Service Broker.

Deploy an AI Workstation

Let’s use the AI workstation catalog item to deploy a VM that we can use to onboard models. When deploying this AI Workstation, we select the same VM Class we configured earlier.

Deployment can take up to 30 minutes.

Check prerequisites

Once the Deep Learning VM is deployed and booted correctly we can ssh into it using the password we set in VCF Automation.

Let’s make sure everything is running properly and we have the necessary tools.

Firstly, let’s see if our GPU is installed properly by running:

nvidia-smi

1	nvidia-smi

Here we should see our vGPU profile and its drivers are all installed successfully.

Next, from VCF 5.2.1 the Deep Learning VM comes pre-installed with the Private AI Solutions CLI. Make sure it’s installed by running:

pais --help

1	pais --help

Here we expect an output similar to the following. If this CLI command is not found, the appropriate version of the Deep Learning VM is not installed. This can be addressed by updating the Deep Learning VM image in Content Library to the most up to date version.

Download a model

To download and run our model, we need to install and run Triton Inference Server.

docker run -itd --gpus all --rm -p 8000:8000 -p 8001:80001 -p 8002:8002 -v /opt/data:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

1	docker run -itd --gpus all --rm -p 8000:8000 -p 8001:80001 -p 8002:8002 -v /opt/data:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3

This command will download and run our Triton container with the triton-trtllm engine installed for us. You can see the above command is making use of the Nvidia public registry. In an air gapped environments, this should be replaced with our private registry. You’ll also notice this command is binding our data disk to the container so that we can easily transfer models from the VM into the container.

Once the container image has been downloaded and the container is running, open a console in the container to install all the additional components we need.

First, get the container ID by running:

docker ps

docker ps

Then, open a console in the container by running the following command:

docker exec -it ＜containerID&gt; bash

1	docker exec -it ＜containerID> bash

In the container

Download the triton CLI.

pip install git+https://github.com/triton-inference-server/triton_cli.git@0.0.7

1	pip install git+https://github.com/triton-inference-server/[email protected]

Since we’re going to download llama3, a model that requires acceptance of terms and conditions and approval to download, we must log in to Hugging Face Hub.

huggingface-cli login

1	huggingface-cli login

Do not add your token as a git credential.

Finally, we are ready to download and build the model. Unlike other inference servers Triton requires building engine files for the LLM to be runnable. Depending on the size of the model you choose this can take quite a while. If you’d like to follow this guide with a smaller model you can replace llama3 with gpt2

triton import -m llama-3-8b-instruct --backend tensorrtllm --model-store=/models

1	triton import -m llama-3-8b-instruct --backend tensorrtllm --model-store=/models

Now that we have imported the model we need to make sure the engine is packaged with the files so we can automatically run it later.

# Create an engine directory for llama3 in the model-repository
mkdir /models/llama3-3-8b-instruct/1/engine
cd /models/llama-3-8b-instruct/1/engine

# move engine files to the same directory
mv /tmp/engines/llama-3-8b-instruct/* ./

# Update references to the files we moved
sed -i -e 's#"/tmp/engines/llama-3-8b-instruct#"/models/llama-3-8b-instruct/1/engine#' $(rgrep -l llama-3-8b-instruct /models)

# Create an engine directory for llama3 in the model-repository

mkdir /models/llama3-3-8b-instruct/1/engine

cd /models/llama-3-8b-instruct/1/engine

# move engine files to the same directory

mv /tmp/engines/llama-3-8b-instruct/* ./

# Update references to the files we moved

sed -i -e 's#"/tmp/engines/llama-3-8b-instruct#"/models/llama-3-8b-instruct/1/engine#' $(rgrep -l llama-3-8b-instruct /models)

We’re now ready to run the model and test it.

triton start --mode=local --model-store=/models

1	triton start --mode=local --model-store=/models

Now from a separate terminal in the Deep Learning VM we can make a simple POST request for a completion.

curl -X POST localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"parameters": {"temperature": 0.3,"stream":false}, "text_input": "Write python code to generate prime numbers.", "max_tokens": 250}'

1	curl -X POST localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"parameters": {"temperature": 0.3,"stream":false}, "text_input": "Write python code to generate prime numbers.", "max_tokens": 250}'

Security and Performance Evaluation

Next, we should make sure the model we’ve downloaded is secure, performant and fits our use case. These tests will vary depending on the specifics of the model you want to run and the app you plan to create. This section has a few suggestions you can use as a starting point.

You should validate the hash of the model downloaded using the triton import command. Ensure it has not been corrupted during the download.
Static file scanning tools. Standard enterprise AV scanners can be scanned for known threats in executable files that may be packaged as part of the model. Additionally, other new OSS static scanning tools like modelscan can be used to check for deserialization attacks.
Runtime validation scans can be used to check that the model does not produce inappropriate content. In our Explore 2024 demonstration we use Giskard to automate some of this scanning.

For Giskard, we wrote a simple script that first wraps a call to the model in their Giskard Model type. Behind the scenes it is using OpenAI to evaluate our model’s performance. However, this can be replaced with a suitable on-premise open source model.

This is the most basic scan possible and can be augmented with your own tests for both security and performance. When complete, it builds a report of any the issues it may have detected. This includes harmfulness, hallucinations, sensitive information disclosure, prompt injection and stereotype propagation. In our case, we found quite a few false positives, but some great examples of where LLM guardrails should be added to our application.

This test could be repeated after adding LLM guardrails and RAG to test our complete Agent against a suite of issues. It is important to note that at this stage we are simply onboarding a model for our developers to use. It is unlikely for a model to pass all of Giskard’s runtime tests without LLM guardrails. In the model onboarding phase, we must validate whether this model is safe to deploy internally by our development team, and is free from malware or a potential supply chain attack adjusting model performance. At this stage we should understand the model’s baseline capabilities and weaknesses that must be secured.

Upload to Model Gallery

When we are reasonably confident that our model performs well enough to onboard into our environment, we can upload it to our Model Gallery.

Add the Model Gallery to your system’s trust store

Download your Harbor registry’s certificate.

openssl s_client -connect harbor-repo-paih.vmware.com:443 2&gt;/dev/null ＜/dev/null |&nbsp; sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' &gt;harbor-repo-paih.vmware.com.pem

1	openssl s_client -connect harbor-repo-paih.vmware.com:443 2>/dev/null ＜/dev/null \|  sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' >harbor-repo-paih.vmware.com.pem

NOTE: The above method is susceptible to a man in the middle attack and is only for demonstration purposes. Secure certificate management is out of scope for this article.

Convert your certificate and copy it to the ca-certificate store.

openssl x509 -outform der -in harbor-repo-paih.vmware.com.pem -out harbor-repo-paih.vmware.com.crt

1	openssl x509 -outform der -in harbor-repo-paih.vmware.com.pem -out harbor-repo-paih.vmware.com.crt

sudo cp harbor-repo-paih.vmware.com.crt /usr/local/share/ca-certificates/harbor-repo-paih.vmware.com.crt

1	sudo cp harbor-repo-paih.vmware.com.crt /usr/local/share/ca-certificates/harbor-repo-paih.vmware.com.crt

Then update your system certificate store.

sudo update-ca-certificates.

1	sudo update-ca-certificates.

Authenticate your local docker system to Harbor

Running the following command will log you into the Harbor registry in the CLI.

docker login harbor-repo-paih.vmware.com

1	docker login harbor-repo-paih.vmware.com

Push model

Before we push the model, we need to create a new Model Gallery in Harbor. This can be done by creating a new Harbor Project.

Now we’re ready to push the model we’ve built!

The following command will push the contents of the /opt/data directory into a new repository called meta/llama-3-8b-instruct

cd /opt/data
pais models push --modelName meta/llama-3-8b-instruct --modelStore harbor-repo
paih.vmware.com/model-gallery --tag approved

cd /opt/data

pais models push --modelName meta/llama-3-8b-instruct --modelStore harbor-repo

paih.vmware.com/model-gallery --tag approved

Pushing again to the same modelName will create new revisions of the model which can be tagged according to your preference. You’ll note each revision is given a unique digest to maintain integrity.

We can list our models with the following command:

pais models list --modelStore harbor-repo-paih.vmware.com/model-gallery

1	pais models list --modelStore harbor-repo-paih.vmware.com/model-gallery

We can also list the revisions of a model:

In Harbor we can see this Model Gallery and control access, image signing, integration with CI/CD and much more.

Auto run Llama3 in a new Deep Learning VM

With a Model Gallery set up with a few models, we can now provide access directly to our developers or ML operations team. Here they can use the Triton Inference Server self-service tile in VCF Automation to request a DL VM with Triton already running.

For this demo, we edited the base triton template to use the Triton image with the trtllm engine running by default. This was done by updating the line highlighted in yellow.

Once deployed we can see that Triton is already running and using our GPU by running:

# See how our GPU is running including processes using it
nvidia-smi

# Investigate running containers
docker ps

# See how our GPU is running including processes using it

nvidia-smi

# Investigate running containers

docker ps

Next, the developer needs to ensure that this Deep Learning VM trusts our Model Gallery: follow the same steps in the section above to add your registry’s certificate and log in with docker. However, this could be automated by injecting the certificate at deploy time as part of our cloud-init.

We then make sure our model-repository directory is owned by the vmware user:

sudo chown vmware /home/vmware/model_repository

1	sudo chown vmware /home/vmware/model_repository

And finally pull the model directly:

pais models pull --modelStore harbor-repo-paih.vmware.com/model-gallery --modelName meta/llama-3-8b-instruct --tag approved

1	pais models pull --modelStore harbor-repo-paih.vmware.com/model-gallery --modelName meta/llama-3-8b-instruct --tag approved

Now we can see since Triton is running in poll mode it has loaded the model.

docker logs ＜container-id&gt;

1	docker logs ＜container-id>

Now the developer can use the triton API to make POST requests for completions!

That wraps up our explainer of the demonstration you saw at VMware Explore in 2024, but there’s much more you can do with the Model Gallery. Once you have models stored in an OCI compatible format you can begin to add CI/CD processes to automate integration and deployment of model runtimes alongside our GenAI applications.

You can learn more about VMware Private AI here.