Earlier this year we introduced Tanzu Data Lake, a Hadoop-based, enterprise-grade solution that provides a unified, secure environment for storing and processing vast amounts of diverse data, from structured tables to unstructured logs and streaming data. Built on open source technologies, Tanzu Data Lake offers flexibility and interoperability, allowing organizations to leverage their existing investments while embracing modern data architectures for their data lake requirements.
In this latest release, Tanzu Data Lake delivers new integrations for Apache Spark and Apache Iceberg, unlocking powerful capabilities for batch processing, scalable analytics, and open table formats.
As a core component of the recently announced Tanzu Data Intelligence, Tanzu Data Lake plays a critical role in unifying structured and unstructured data across environments. These enhancements make it even more effective for powering AI-ready analytics, modern lakehouse architectures, and enterprise-grade data pipelines.
Apache Spark with Tanzu Data Lake
Apache Spark is an open source, distributed processing system used for big data workloads. It provides in-memory computation capabilities, making it significantly faster than traditional disk-based processing engines. Spark’s versatile APIs support various data processing tasks:
- Streaming data – Analyzing real-time data as it arrives
- Machine learning – Building and deploying ML models
- Batch processing – Efficiently processing large datasets in chunks
- Graph analytics – Analyzing relationships within connected data
Spark’s ability to handle complex analytics and its extensive ecosystem of libraries make it an indispensable tool for modern data operations. The integration of Tanzu Data Lake with Apache Spark creates a powerful synergy, offering numerous advantages for organizations looking to harness the full potential of their data.
Enhanced performance and scalability
Tanzu Data Lake provides a highly scalable storage layer on Hadoop HDFS, enabling organizations to store petabytes of data without compromising performance. When combined with Spark’s in-memory processing capabilities, data analysis becomes incredibly fast. The distributed nature of both components ensures that the solution can scale horizontally to accommodate increasing data volumes and processing demands.
Simplified data management
Tanzu Data Lake offers tools for centralized data management. This can simplify the process of discovering, understanding, and governing data across the organization. With Spark’s integration, data processing jobs can be seamlessly managed and monitored within Tanzu Data Lake, enabling better consistency and compliance.
Accelerated application development
The robust APIs and rich ecosystem of Apache Spark, combined with the comprehensive features of Tanzu Data Lake, can accelerate the development and deployment of data-intensive applications. Data scientists can leverage familiar programming languages like Python, Scala, Java, and R to build sophisticated analytics and machine learning models directly on the platform.
Robust security
Tanzu Data Lake incorporates enterprise-grade security features, including data encryption, access control, and auditing, helping ensure that sensitive data remains protected.
Cost efficiency
By leveraging open source technologies and providing a flexible, cloud native architecture, Tanzu Data Lake with Apache Spark can help organizations optimize their data infrastructure costs. It can reduce the need for expensive proprietary solutions and allows for efficient resource utilization.
Apache Iceberg with Tanzu Data Lake
Apache Iceberg is an open table format built for high-performance analytics on large datasets. Its integration with Tanzu Data Lake brings powerful capabilities like ACID transactions, schema evolution, time travel, and efficient querying.
When used with Tanzu Data Lake, Iceberg can leverage the underlying infrastructure and services provided by Tanzu:
- Scalable storage – Utilize Tanzu Data Lake’s scalable storage solutions for large datasets
- Compute resources – Benefit from Tanzu’s integrated compute engines for high-performance data processing
- Unified data access – Provide a unified interface for various data sources and types within the Tanzu ecosystem
This enables enterprises to better manage data at petabyte scale with more reliability and flexibility, accelerating analytics, improving data consistency, and supporting AI-ready workloads. The key benefits are explained below.
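As a starting point, here is a hedged sketch of creating and loading an Iceberg table from PySpark. The catalog name `lake`, the warehouse path, and the table schema are assumptions for illustration, not Tanzu defaults, and the Iceberg Spark runtime jar must be on the classpath.

```python
from pyspark.sql import SparkSession

# "lake" is a hypothetical Hadoop-backed Iceberg catalog; the warehouse
# path is a placeholder. Assumes the Iceberg Spark runtime is available.
spark = (
    SparkSession.builder.appName("iceberg-example")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs:///warehouse")
    .getOrCreate()
)

# USING iceberg selects the open table format for this table
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id  BIGINT,
        customer  STRING,
        amount    DECIMAL(10, 2),
        order_ts  TIMESTAMP
    )
    USING iceberg
""")

spark.sql("""
    INSERT INTO lake.sales.orders
    VALUES (1, 'acme', 42.50, TIMESTAMP '2024-01-15 10:30:00')
""")
```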
Schema evolution
Iceberg natively supports schema evolution, allowing for safer and more reliable changes to data schemas over time. This includes:
- Adding new columns
- Renaming existing columns
- Reordering columns
- Changing column types (with compatible conversions)
This flexibility is crucial for evolving data requirements without disrupting existing analytical pipelines.
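The schema changes above map to plain `ALTER TABLE` statements. This sketch assumes a hypothetical `lake.sales.orders` Iceberg table and a session with Iceberg's Spark SQL extensions enabled (reordering in particular requires them).

```python
# Hypothetical table in an assumed Iceberg catalog named "lake".

# Add a new column -- safe, existing data files are untouched
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN region STRING")

# Rename a column -- a metadata-only change, no data rewrite
spark.sql("ALTER TABLE lake.sales.orders RENAME COLUMN customer TO customer_name")

# Widen a column type (a compatible conversion, e.g. increasing
# decimal precision or int -> bigint)
spark.sql("ALTER TABLE lake.sales.orders ALTER COLUMN amount TYPE DECIMAL(12, 2)")

# Reorder columns (requires Iceberg's Spark SQL extensions)
spark.sql("ALTER TABLE lake.sales.orders ALTER COLUMN region FIRST")
```

Because Iceberg tracks columns by ID rather than by name or position, these changes do not invalidate or rewrite existing data files.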
Transactional guarantees
Iceberg provides ACID (atomicity, consistency, isolation, durability) transactional guarantees for data operations:
- Atomic writes – All-or-nothing writes ensure data integrity
- Consistent views – Readers always see a consistent snapshot of the data, even during ongoing writes
- Isolation – Concurrent operations do not interfere with each other
- Durability – Once data is committed, it is persistent
These transactional capabilities are essential for reliable data pipelines and accurate reporting.
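A common way to exercise these guarantees is an atomic upsert with `MERGE INTO`, which Iceberg supports when its Spark SQL extensions are enabled. In this sketch, `updates_df` and the table name are hypothetical stand-ins.

```python
# "updates_df" stands in for a DataFrame of incoming changes; the table
# name is hypothetical, and Iceberg's Spark SQL extensions are assumed.
updates_df.createOrReplaceTempView("updates")

# The merge commits as a single new snapshot or not at all; concurrent
# readers keep seeing the previous snapshot until the commit succeeds.
spark.sql("""
    MERGE INTO lake.sales.orders t
    USING updates u
    ON t.order_id = u.order_id
    WHEN MATCHED THEN UPDATE SET t.amount = u.amount
    WHEN NOT MATCHED THEN INSERT *
""")
```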
Hidden partitioning
Iceberg introduces hidden partitioning, where the partitioning logic is managed by the table format rather than being exposed in the file paths. This offers several benefits:
- Partition evolution – Allows changing partition schemes over time without rewriting all existing data
- Improved query performance – Query engines can optimize queries based on the hidden partitioning metadata
- Simplified data management – Users do not need to manually manage complex directory structures
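In practice, hidden partitioning means declaring a transform over a source column; readers and writers only ever reference the source column. Catalog and table names below are hypothetical, and partition evolution requires Iceberg's Spark SQL extensions.

```python
# Partition by day, derived from order_ts; the derived partition column
# never appears in queries or file paths users must manage.
spark.sql("""
    CREATE TABLE lake.sales.orders_by_day (
        order_id  BIGINT,
        amount    DECIMAL(10, 2),
        order_ts  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# Queries filter on the source column; Iceberg prunes partitions
# automatically from table metadata.
spark.sql("""
    SELECT * FROM lake.sales.orders_by_day
    WHERE order_ts >= TIMESTAMP '2024-01-01 00:00:00'
""")

# Partition evolution: switch future writes to monthly partitions
# without rewriting existing daily-partitioned data.
spark.sql("""
    ALTER TABLE lake.sales.orders_by_day
    REPLACE PARTITION FIELD days(order_ts) WITH month(order_ts)
""")
```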
Time travel and rollbacks
Iceberg’s versioning capabilities enable time travel and rollbacks, allowing users to:
- Query historical versions of the data
- Roll back a table to a previous state in case of errors or data corruption
These capabilities are invaluable for auditing, reproducibility, and disaster recovery.
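Both operations are available directly from Spark SQL. The snapshot ID below is a placeholder, the table name is hypothetical, and the `CALL` procedure requires Iceberg's Spark SQL extensions.

```python
# Inspect the table's snapshot history via Iceberg's metadata table
spark.sql("SELECT snapshot_id, committed_at FROM lake.sales.orders.snapshots").show()

# Time travel: query the table as of a point in time...
spark.sql("""
    SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2024-01-15 00:00:00'
""")

# ...or as of a specific snapshot ID (placeholder value shown)
spark.sql("SELECT * FROM lake.sales.orders VERSION AS OF 1234567890123456789")

# Roll back to a known-good snapshot after a bad write
spark.sql("""
    CALL lake.system.rollback_to_snapshot('sales.orders', 1234567890123456789)
""")
```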
Efficient data compaction
Iceberg tables can be efficiently compacted to optimize query performance and reduce storage costs. The table format tracks data files, making it easier to identify and combine small files into larger, more efficient ones without impacting ongoing queries.
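Compaction is typically run as an Iceberg maintenance procedure. This sketch assumes the same hypothetical catalog and table as above, with Iceberg's Spark SQL extensions enabled; the target file size shown is an example value.

```python
# Rewrite small files into ~512 MB targets. Readers are not blocked: they
# continue to see the pre-compaction snapshot until the rewrite commits.
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'sales.orders',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```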
With the integration of Apache Spark and Apache Iceberg, Tanzu Data Lake takes a significant step forward in delivering a more scalable, open, and high-performance foundation for modern data and AI workloads. These capabilities enable enterprises to simplify data architecture, accelerate time-to-insight, and power next-gen applications with confidence. To learn more, feel free to reach out to us.