This blog was co-written by Arnab Chakraborty and Ahmed Rachid Hazourli.
In a previous blog, we discussed the growing significance of generative artificial intelligence (GenAI) and large language models (LLMs) across various industries and explained the importance of using a robust data platform to support these advanced AI technologies. We highlighted a few of VMware Tanzu Greenplum’s cutting-edge capabilities, such as massively parallel processing (MPP), data federation, real-time data processing, text analytics, and geospatial data support, which make it an ideal choice for handling the data demands of generative AI models. In this post, we delve deeper into Tanzu Greenplum’s role as an AI and analytics data platform.
VMware Tanzu Greenplum is a unified data platform that has already helped customers break down data silos and bring more types of development directly to their data.
One of its greatest strengths lies in the many ways it enables users to run in-database machine learning (ML) and to train and fine-tune leading LLMs. Here are some of the ways it does that.
Leveraging in-database machine learning with Python and R for LLM training
Tanzu Greenplum’s seamless integration of Python and R brings the power of these popular programming languages directly into the data platform.
Data scientists can use the extensive libraries and frameworks available in Python (e.g., Hugging Face Transformers, TensorFlow, Keras, Scikit-learn, Pandas, and NumPy) and R to train and fine-tune LLMs on the vast datasets stored within Tanzu Greenplum.
This in-database machine learning approach eliminates the need for data transfers, helping to ensure that data remains secure and significantly reducing processing times—all in a single, distributed platform.
Tanzu Greenplum’s container services
Tanzu Greenplum’s support for containers (PL/Container) further extends the platform’s capabilities for LLM training and experimentation. PL/Container allows users to deploy custom containers using specialized ML frameworks and dependencies, providing flexibility and expandability to the ML workflow within Tanzu Greenplum.
By running customized containers directly within Tanzu Greenplum, data scientists can fine-tune LLM architectures, experiment with hyperparameters, and work with pre-trained models efficiently. This streamlined approach to experimentation makes model development more agile and lets teams iterate quickly toward better-performing models.
Empowering LLM training with MADlib and PostgresML extensions
Tanzu Greenplum’s integrated extensions, MADlib and PostgresML, elevate the platform to new heights for LLM training and advanced analytics.
MADlib and PostgresML provide a vast array of machine learning and statistical functions. Data scientists and analysts can easily perform complex tasks, such as regression, classification, clustering, and more at scale. These extensions allow you to run machine learning algorithms directly within the database by harnessing the parallel processing capabilities of Tanzu Greenplum.
Training AI models with MADlib and PostgresML
With this combination of MADlib and PostgresML, organizations can effortlessly develop and deploy sophisticated LLMs, facilitating deeper insights and informed decision making.
GPU acceleration in Tanzu Greenplum
Tanzu Greenplum supports GPU-enabled infrastructure, giving data scientists additional computational power for LLM training. GPUs allow for highly efficient parallel processing, which substantially accelerates ML model training and fine-tuning.
By leveraging GPUs in Tanzu Greenplum, organizations can significantly reduce training time and can handle larger datasets without compromising performance.
You have a solid data platform established. Now what?
Once companies have trained and fine-tuned AI models, they usually start looking to use them on their own data for things like chatbots, recommendation systems, or search engines. However, a challenge soon arises.
Embeddings are the result of transforming data or complex objects, like texts, images, or audio, into a list of numbers in a high-dimensional space. They are a powerful way to standardize and unify the way we represent data.
But how can organizations manage and deploy AI models, get useful insights from these embeddings, and also store and query ML-generated data qualified by embeddings at scale?
The resulting vector representations can be used for a variety of tasks, including text generation, chatbots, text summarization, image generation, and natural language processing tasks such as question answering.
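To make the idea of an embedding more concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up values chosen for illustration (real models produce hundreds or thousands of dimensions), and cosine similarity is just one common way to compare embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example only.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "invoice": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for word in ("kitten", "invoice"):
        score = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS[word])
        print(f"cat vs {word}: {score:.3f}")
```

Items with related meaning end up with vectors that point in similar directions, so "cat" scores much closer to "kitten" than to "invoice" in this toy data.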
How Tanzu Greenplum helps store embeddings using pgvector
Tanzu Greenplum is capable of storing and querying vector embeddings at large scale thanks to the pgvector extension.
Pgvector is an open source extension for PostgreSQL that adds the ability to store and search over machine learning–generated vector embeddings. It provides different capabilities that enable you to identify both exact and approximate nearest neighbors. It’s designed to work seamlessly with other PostgreSQL and Tanzu Greenplum features, including indexing and querying.
This brings vector database capabilities to the Tanzu Greenplum data warehouse, which enables users to perform fast retrieval and efficient semantic similarity searches on text, image, audio, and video.
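As a rough illustration of what storing and querying embeddings can look like, the sketch below builds the SQL from Python. The `documents` table, its columns, and the three-dimensional `vector(3)` type are hypothetical choices for this example; the distance operators (`<->` for L2 distance, `<#>` for negative inner product, `<=>` for cosine distance) are the ones pgvector defines. It assumes a psycopg-style driver that accepts `%s` placeholders, and nothing here connects to a database.

```python
# Hypothetical schema for illustration; adjust the dimension to your model.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigint,
    content text,
    embedding vector(3)
);
"""

# pgvector distance operators, keyed by a friendly name.
DISTANCE_OPERATORS = {"l2": "<->", "inner_product": "<#>", "cosine": "<=>"}


def to_vector_literal(values):
    """Format numbers as a pgvector text literal such as '[1.0,2.5,0.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def nearest_neighbors_sql(metric="l2", limit=5):
    """Build a parameterized nearest-neighbor query for the chosen metric."""
    op = DISTANCE_OPERATORS[metric]
    return (
        "SELECT id, content FROM documents "
        f"ORDER BY embedding {op} %s::vector LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(CREATE_SQL)
    print(nearest_neighbors_sql("cosine", limit=3))
    print(to_vector_literal([0.1, 0.2, 0.3]))
```

With a database cursor in hand, the query would typically be run as `cursor.execute(nearest_neighbors_sql("cosine"), (to_vector_literal(query_embedding),))`, passing the query embedding as a parameter rather than interpolating it into the SQL text.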
Vector database and vector similarity search
Pgvector also supports indexing, which can speed up similarity searches by returning approximate rather than exact nearest neighbors. There are two types of indexes: Inverted File with Flat Compression (IVFFlat) and Hierarchical Navigable Small World (HNSW).
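The two index types are created with ordinary `CREATE INDEX` statements. The following sketch generates them from Python; the table and column names are placeholders, and the parameter values (`lists = 100`, `m = 16`, `ef_construction = 64`) are starting points taken from pgvector's documentation rather than tuned recommendations. Which options are available depends on the pgvector version shipped with your Greenplum release, so treat this as an assumption to verify.

```python
# Operator classes pgvector provides for the vector type.
OPCLASSES = {"vector_l2_ops", "vector_ip_ops", "vector_cosine_ops"}


def _check_opclass(opclass):
    if opclass not in OPCLASSES:
        raise ValueError(f"unknown operator class: {opclass}")


def ivfflat_index_sql(table, column, lists=100, opclass="vector_l2_ops"):
    """Build a CREATE INDEX statement for an IVFFlat index."""
    _check_opclass(opclass)
    return (
        f"CREATE INDEX ON {table} USING ivfflat ({column} {opclass}) "
        f"WITH (lists = {int(lists)});"
    )


def hnsw_index_sql(table, column, m=16, ef_construction=64,
                   opclass="vector_l2_ops"):
    """Build a CREATE INDEX statement for an HNSW index."""
    _check_opclass(opclass)
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} {opclass}) "
        f"WITH (m = {int(m)}, ef_construction = {int(ef_construction)});"
    )


if __name__ == "__main__":
    # Table and column names are interpolated directly, so pass trusted
    # identifiers only, never user input.
    print(ivfflat_index_sql("documents", "embedding"))
    print(hnsw_index_sql("documents", "embedding", opclass="vector_cosine_ops"))
```

The operator class should match the distance operator used in queries (for example, `vector_cosine_ops` with `<=>`), otherwise the planner cannot use the index for that search.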
Why choose Tanzu Greenplum and pgvector
Many companies would like to store, query, and perform vector semantic searches within their data warehouses without managing another vector database.
Fortunately, combining Tanzu Greenplum and pgvector can help you build scalable ML-enabled analytics and AI applications using embeddings from AI models; get to faster insights; and perform fast retrieval, similarity, and semantic search over massive amounts of vector embeddings and unstructured data.
Organizations can apply these ML capabilities in e-commerce, media, healthcare, and other applications to find similar patterns within their data. For example, embeddings could be used for patient similarity searches that support medical diagnoses, image similarity search in media databases, fraud detection in financial transactions, or e-commerce product recommendations.
VMware Greenplum is one of few data warehouses with vector search capabilities.
In addition, Tanzu Greenplum and pgvector can be used to build massive-scale data applications without adding operational burden. For example, a streaming data application could use pgvector to provide a list of film recommendations similar to the one you just watched.
Leverage VMware Greenplum’s vector similarity search for movie recommendations.
Another potential use case could be building chatbots tailored to your business, using Tanzu Greenplum as your own knowledge base with relevant information from your documents as described in this post. This approach helps to build industry-specific LLM applications using retrieval augmented generation (RAG) and to develop AI agents leveraging popular development frameworks, such as LangChain and LlamaIndex.
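The retrieval step of RAG can be sketched in a few lines of plain Python. This is an in-memory stand-in, not the implementation from the referenced post: in a real deployment the nearest-neighbor lookup would be a pgvector query against Tanzu Greenplum, and the embeddings would come from a model. The chunk texts, two-dimensional vectors, and prompt wording below are all invented for illustration.

```python
import heapq
import math


def retrieve(query_embedding, chunks, k=2):
    """Return the texts of the k chunks closest to the query (L2 distance)."""
    nearest = heapq.nsmallest(
        k, chunks, key=lambda chunk: math.dist(query_embedding, chunk[1])
    )
    return [text for text, _ in nearest]


def build_prompt(question, context_chunks):
    """Assemble a prompt that grounds the model in the retrieved context."""
    context = "\n".join(f"- {text}" for text in context_chunks)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base: (text, embedding) pairs with made-up vectors.
KNOWLEDGE_BASE = [
    ("Refunds are issued within 14 days.", [0.9, 0.1]),
    ("Our office is in Paris.", [0.1, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.2]),
]

if __name__ == "__main__":
    question = "How do refunds work?"
    query_embedding = [1.0, 0.0]  # stand-in for a model-generated embedding
    context = retrieve(query_embedding, KNOWLEDGE_BASE, k=2)
    print(build_prompt(question, context))
```

The assembled prompt would then be sent to an LLM, which answers from the retrieved passages instead of relying only on what it learned during training.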
Building an AI-powered chatbot using pgvector, OpenAI, and VMware Greenplum.
Conclusion
Tanzu Greenplum is an ideal choice for organizations that are working with AI and seeking to augment their data platforms for LLMs and generative AI. Its vector database and semantic search functionality, combined with its advanced in-database machine learning capabilities, provides not only search and similarity capabilities for high-dimensional data but also a fully fledged data warehouse for end-to-end machine learning pipelines.
What’s more, these functionalities are integrated into a comprehensive and flexible data platform that can handle diverse datasets and workloads, including large-scale, real-time analytics. While other vector databases might offer specialized capabilities, Tanzu Greenplum’s broad feature set and flexible architecture make it a versatile tool for managing data and analytics needs.
Whether you’re working with natural language processing or image recognition, Tanzu Greenplum provides the speed and accuracy you need to seamlessly meet the real-time demands of AI and your business.