Model governance is difficult. There are many challenges to manage, from securely onboarding and scanning a model, to deployment and management concerns such as ongoing access control. That’s why at VMware Explore 2024, we introduced the Model Gallery as a new key capability in Private AI Foundation with NVIDIA.
In this article, we will walk through secure onboarding of Llama-3-8b-instruct to the Model Gallery, governance strategies, and deployment in your private cloud. But there’s so much more you can do with the Model Gallery. Once you have models stored in an OCI-compatible format, you can begin to add CI/CD processes to automate integration and deployment of model runtimes alongside GenAI applications.
Workflow Overview
To securely onboard any model to the Model Gallery, we are going to deploy an AI workstation to act as our DMZ, ensuring the models we onboard to our infrastructure are both secure and performant. Once we have a workstation, we need to download a model, perform static security scanning, ensure it runs correctly, and complete runtime security scanning and a performance evaluation before uploading it to the Model Gallery for later use outside the DMZ.
Once uploaded to the Model Gallery, we can govern models and adjust access control as well as revisions.
Finally, developers will be able to easily deploy the model via CLI in their own environment.
You can watch a demonstration of this workflow in our 2024 Breakout Session.
Prerequisites
To follow along with this guide, you will need:
- VMware Cloud Foundation 5.2.1 or later
- VMware Private AI Foundation with NVIDIA Add-on
- VCF Preparation for Private AI
- Harbor Deployed on VCF
- A content library with a Deep Learning VM Image
Infrastructure Configuration
To get started, we need to provision a Deep Learning VM. The Deep Learning VM is a VM image delivered as part of VMware Private AI Foundation with NVIDIA. It is preconfigured with popular ML tools and is optimized and validated by NVIDIA and VMware for GPU acceleration in VMware Cloud Foundation. We’re going to use the Private AI quick start to configure our environment.
Here you can select between air-gapped and cloud-connected deployments. Follow the respective instructions needed to deploy a Deep Learning VM:
From here, we run the Private AI Quick Start in the VCF Automation Catalog Setup Wizard to create 5 templates for us to deploy. But before we do so, we need to make sure we have configured a VM class for our large language model inferencing workload.
In this guide, we make use of the Triton TensorRT-LLM backend to run Llama-3-8b-instruct. To use this backend, we need to create a new VM class with our vGPU profile attached as a PCI device.
Then, we set the UVM parameter under Advanced Parameters.
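As an illustration, the Unified Memory (UVM) setting is typically enabled by adding an advanced parameter of the following form, where the index matches the vGPU PCI device on the VM class; treat the exact key as an assumption and verify it against the VMware Private AI Foundation with NVIDIA documentation for your release.

pciPassthru0.cfg.enable_uvm = 1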
Now we are ready to run the quick start, using the VM class we created earlier.
When the quick start wizard is complete, we should see 5 new catalog items in Service Broker.
Deploy an AI Workstation
Let’s use the AI workstation catalog item to deploy a VM that we can use to onboard models. When deploying this AI Workstation, we select the same VM Class we configured earlier.
Deployment can take up to 30 minutes.
Check prerequisites
Once the Deep Learning VM is deployed and booted correctly, we can SSH into it using the password we set in VCF Automation.
Let’s make sure everything is running properly and we have the necessary tools.
Firstly, let’s see if our GPU is installed properly by running:
nvidia-smi
Here we should see that our vGPU profile and its drivers are all installed successfully.
Next, as of VCF 5.2.1, the Deep Learning VM comes pre-installed with the Private AI Solutions CLI. Make sure it’s installed by running:
pais --help
Here we expect output similar to the following. If this CLI command is not found, the appropriate version of the Deep Learning VM is not installed. This can be addressed by updating the Deep Learning VM image in the Content Library to the most up-to-date version.
Download a model
To download and run our model, we need to install and run Triton Inference Server.
docker run -itd --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /opt/data:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
This command will download and run our Triton container with the triton-trtllm engine installed for us. You can see the above command is making use of the NVIDIA public registry. In an air-gapped environment, this should be replaced with our private registry. You’ll also notice this command is binding our data disk to the container so that we can easily transfer models from the VM into the container.
Once the container image has been downloaded and the container is running, open a console in the container to install all the additional components we need.
First, get the container ID by running:
docker ps
Then, open a console in the container by running the following command:
docker exec -it <containerID> bash
In the container
Download the Triton CLI.
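The exact install command depends on the Triton CLI release you want; a minimal sketch, assuming a pip install from the triton_cli GitHub project (pin the tag that matches your Triton container version), is:

pip install git+https://github.com/triton-inference-server/triton_cli.git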
Since we’re going to download llama3, a model that requires acceptance of terms and conditions and approval to download, we must log in to Hugging Face Hub.
huggingface-cli login
Do not add your token as a git credential.
Finally, we are ready to download and build the model. Unlike other inference servers, Triton requires building engine files for the LLM to be runnable. Depending on the size of the model you choose, this can take quite a while. If you’d like to follow this guide with a smaller model, you can replace llama-3-8b-instruct with gpt2.
triton import -m llama-3-8b-instruct --backend tensorrtllm --model-store=/models
Now that we have imported the model, we need to make sure the engine is packaged with the model files so we can automatically run it later.
# Create an engine directory for llama3 in the model repository
mkdir /models/llama-3-8b-instruct/1/engine
cd /models/llama-3-8b-instruct/1/engine

# Move engine files to the same directory
mv /tmp/engines/llama-3-8b-instruct/* ./

# Update references to the files we moved
sed -i -e 's#"/tmp/engines/llama-3-8b-instruct#"/models/llama-3-8b-instruct/1/engine#' $(rgrep -l llama-3-8b-instruct /models)
We’re now ready to run the model and test it.
triton start --mode=local --model-store=/models
Now from a separate terminal in the Deep Learning VM we can make a simple POST request for a completion.
curl -X POST localhost:8000/v2/models/llama-3-8b-instruct/generate -d '{"parameters": {"temperature": 0.3,"stream":false}, "text_input": "Write python code to generate prime numbers.", "max_tokens": 250}'
Security and Performance Evaluation
Next, we should make sure the model we’ve downloaded is secure, performant and fits our use case. These tests will vary depending on the specifics of the model you want to run and the app you plan to create. This section has a few suggestions you can use as a starting point.
- Validate the hash of the model downloaded using the triton import command, and ensure it has not been corrupted during the download.
- Static file scanning tools. Standard enterprise AV scanners can scan for known threats in executable files that may be packaged as part of the model. Additionally, newer OSS static scanning tools like modelscan can be used to check for deserialization attacks (see the example after this list).
- Runtime validation scans can be used to check that the model does not produce inappropriate content. In our Explore 2024 demonstration we use Giskard to automate some of this scanning.
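As a minimal sketch of the static scanning step, assuming modelscan is installed from PyPI and the model repository was written to /opt/data as above (flags may vary between modelscan releases):

# Install the OSS scanner and scan the downloaded model artifacts
pip install modelscan
modelscan -p /opt/data/llama-3-8b-instruct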
For Giskard, we wrote a simple script that first wraps a call to the model in their Giskard Model type. Behind the scenes, Giskard uses OpenAI to evaluate our model’s responses; however, this can be replaced with a suitable on-premises open source model.
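A minimal sketch of such a script is shown below; it assumes the local Triton generate endpoint from earlier, an illustrative "question" feature name, and an OPENAI_API_KEY available for Giskard’s default evaluator. It is not the exact script from the demo.

import giskard
import pandas as pd
import requests

# Assumed local Triton generate endpoint from the earlier curl example
TRITON_URL = "http://localhost:8000/v2/models/llama-3-8b-instruct/generate"

def model_predict(df: pd.DataFrame):
    # Send each prompt to Triton and collect the generated text
    outputs = []
    for prompt in df["question"]:
        resp = requests.post(TRITON_URL, json={
            "text_input": prompt,
            "max_tokens": 250,
            "parameters": {"temperature": 0.3, "stream": False},
        })
        outputs.append(resp.json().get("text_output", ""))
    return outputs

# Wrap the call to the model in Giskard's Model type
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="llama-3-8b-instruct on Triton",
    description="General purpose assistant being onboarded to the Model Gallery",
    feature_names=["question"],
)

# Run the LLM scan (uses an OpenAI evaluator by default) and save the report
report = giskard.scan(giskard_model)
report.to_html("llama3_scan_report.html")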
This is the most basic scan possible and can be augmented with your own tests for both security and performance. When complete, it builds a report of any issues it may have detected, including harmfulness, hallucinations, sensitive information disclosure, prompt injection, and stereotype propagation. In our case, we found quite a few false positives, but also some great examples of where LLM guardrails should be added to our application.
This test could be repeated after adding LLM guardrails and RAG to test our complete agent against a suite of issues. It is important to note that at this stage we are simply onboarding a model for our developers to use; it is unlikely for a model to pass all of Giskard’s runtime tests without LLM guardrails. In the model onboarding phase, we must validate that this model is safe for our development team to deploy internally and is free from malware or a potential supply chain attack that adjusts model performance. At this stage we should understand the model’s baseline capabilities and the weaknesses that must be secured.
Upload to Model Gallery
When we are reasonably confident that our model performs well enough to onboard into our environment, we can upload it to our Model Gallery.
Add the Model Gallery to your system’s trust store
Download your Harbor registry’s certificate.
openssl s_client -connect harbor-repo-paih.vmware.com:443 2>/dev/null </dev/null | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > harbor-repo-paih.vmware.com.pem
NOTE: The above method is susceptible to a man-in-the-middle attack and is only for demonstration purposes. Secure certificate management is out of scope for this article.
Convert your certificate and copy it to the ca-certificate store.
openssl x509 -outform pem -in harbor-repo-paih.vmware.com.pem -out harbor-repo-paih.vmware.com.crt
sudo cp harbor-repo-paih.vmware.com.crt /usr/local/share/ca-certificates/harbor-repo-paih.vmware.com.crt
Then update your system certificate store.
sudo update-ca-certificates
Authenticate your local docker system to Harbor
Running the following command will log you into the Harbor registry in the CLI.
docker login harbor-repo-paih.vmware.com
Push model
Before we push the model, we need to create a new Model Gallery in Harbor. This can be done by creating a new Harbor Project.
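If you prefer to script the project creation (for example, as part of a CI/CD pipeline), a sketch using the Harbor v2.0 REST API with an admin account is shown below; the project name model-gallery and the credential placeholders are assumptions for illustration:

curl -u '<harbor-admin>:<password>' -X POST "https://harbor-repo-paih.vmware.com/api/v2.0/projects" \
  -H "Content-Type: application/json" \
  -d '{"project_name": "model-gallery", "metadata": {"public": "false"}}'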
Now we’re ready to push the model we’ve built!
The following command will push the contents of the /opt/data directory into a new repository called meta/llama-3-8b-instruct.
cd /opt/data
pais models push --modelName meta/llama-3-8b-instruct --modelStore harbor-repo-paih.vmware.com/model-gallery --tag approved
Pushing again to the same modelName will create new revisions of the model, which can be tagged according to your preference. You’ll note each revision is given a unique digest to maintain integrity.
We can list our models with the following command:
pais models list --modelStore harbor-repo-paih.vmware.com/model-gallery
We can also list the revisions of a model.
In Harbor, we can see this Model Gallery and control access, image signing, integration with CI/CD, and much more.
Auto run Llama3 in a new Deep Learning VM
With a Model Gallery set up with a few models, we can now provide access directly to our developers or ML operations team. Here they can use the Triton Inference Server self-service tile in VCF Automation to request a DL VM with Triton already running.
For this demo, we edited the base triton template to use the Triton image with the trtllm engine running by default. This was done by updating the line highlighted in yellow.
Once deployed, we can verify that Triton is already running and using our GPU:
# See how our GPU is running, including processes using it
nvidia-smi

# Investigate running containers
docker ps
Next, the developer needs to ensure that this Deep Learning VM trusts our Model Gallery: follow the same steps in the section above to add your registry’s certificate and log in with Docker. However, this could be automated by injecting the certificate at deploy time as part of our cloud-init, as sketched below.
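A minimal cloud-config sketch, assuming the certificate is appended to the Deep Learning VM’s cloud-init in the catalog template and the guest image uses the Debian-style ca-certificates store shown earlier (paths may differ in your image):

#cloud-config
write_files:
  - path: /usr/local/share/ca-certificates/harbor-repo-paih.vmware.com.crt
    permissions: '0644'
    content: |
      -----BEGIN CERTIFICATE-----
      <your Harbor CA certificate>
      -----END CERTIFICATE-----
runcmd:
  - update-ca-certificates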
We then make sure our model-repository directory is owned by the vmware user:
sudo chown vmware /home/vmware/model_repository
And finally pull the model directly:
pais models pull --modelStore harbor-repo-paih.vmware.com/model-gallery --modelName meta/llama-3-8b-instruct --tag approved
Now we can see that, since Triton is running in poll mode, it has loaded the model:
docker logs <container-id>
Now the developer can use the triton API to make POST requests for completions!
That wraps up our explainer of the demonstration you saw at VMware Explore 2024, but there’s much more you can do with the Model Gallery. Once you have models stored in an OCI-compatible format, you can begin to add CI/CD processes to automate integration and deployment of model runtimes alongside your GenAI applications.
You can learn more about VMware Private AI here.