Unlocking High Performance with Delta Engine: The Keynote at Spark + AI Summit 2020

Updated on Dec 26, 2023

Table of Contents:

Introduction
The Need for Delta Lake
Introducing Delta Engine
Delta Engine's Components
  4.1 Improved Query Optimizer
  4.2 Native Vectorized Execution Engine
  4.3 Caching Layer
Hardware Trends and Workload Characteristics
  5.1 Hardware Trends
  5.2 Workload Characteristics
How Delta Engine Achieves High Performance
  6.1 Data-Level Parallelism
  6.2 Instruction-Level Parallelism
  6.3 Optimizing String Operations
  6.4 Optimizing Regular Expressions
Conclusion

Article: Achieving High Performance with Delta Engine

Introduction

In the era of data lakes and analytics, a reliable and structured architecture is crucial. Delta Lake fills that gap in the modern lakehouse architecture: it brings structure and reliability to data lakes and supports a range of downstream data use cases. However, structure, reliability, and indexing alone are not enough for analytic workloads that require fast execution. Those workloads also need a high-performance query engine that delivers the performance expected of traditional analytics. This is where Delta Engine, developed by Databricks, comes into play.

The Need for Delta Lake

Delta Lake has proven to be an essential component of modern data lakes. It makes a data lake structured and reliable, and it supports a variety of downstream data use cases. Analytic workloads, however, demand more than structure and reliability. Large-scale analytics workloads made up of small queries that must execute swiftly need more than indexing: like traditional analytic workloads, they need a high-performance query engine that can process data efficiently.

Introducing Delta Engine

Databricks introduced Delta Engine as a high-performance query engine designed to work seamlessly with data lakes. Delta Engine builds on Apache Spark 3.0 and is fully compatible with Spark's APIs, including the SQL and DataFrame APIs, as well as the Koalas DataFrame API. It targets large performance gains for SQL and DataFrame workloads and comprises three critical components: an improved query optimizer, a native vectorized execution engine, and a caching layer that improves input/output (IO) throughput.

Delta Engine's Components

4.1 Improved Query Optimizer

To speed up analytic workloads, Delta Engine includes an improved query optimizer. It extends Spark's cost-based optimizer and adaptive query execution optimizer with more advanced statistics, so that query plans are chosen from a more complete statistical picture of the data. With this technique, Delta Engine can deliver up to an 18x increase in performance for star schema workloads.

4.2 Native Vectorized Execution Engine

Delta Engine's native vectorized execution engine is designed from scratch in C++ to achieve maximum performance. It exploits data-level parallelism and instruction-level parallelism, making full use of the CPU's capabilities. By transposing data into a column-oriented memory format, the execution engine enables efficient processing of large datasets. This approach significantly reduces cache misses and improves instruction throughput, leading to faster query execution.

4.3 Caching Layer

The caching layer in Delta Engine plays a crucial role in enhancing IO throughput. Situated between the execution engine and cloud object storage, the caching layer automatically selects input data to cache for the user. Unlike a simple byte cache, Delta Engine's caching layer transcodes data into a more CPU-efficient format. This format allows for quicker decoding and processing of queries, making use of the local NVMe SSDs available in cloud virtual machines. This caching layer delivers up to a five times increase in scan performance for all types of workloads.
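
The difference between a plain byte cache and a transcoding cache can be sketched in Python. This is only an illustration under stated assumptions, not Delta Engine's implementation: the TranscodingCache class, the fake_object_store function standing in for cloud object storage, the comma-separated text used as the stored format, and the packed int64 array used as the "CPU-efficient" format are all hypothetical choices made for the example.

```python
from array import array
from typing import Callable, Dict


class TranscodingCache:
    """Toy read-through cache that keeps decoded columns, not raw bytes.

    Hypothetical sketch: `fetch` stands in for a cloud object storage read.
    """

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._decoded: Dict[str, array] = {}
        self.remote_reads = 0

    def read_column(self, key: str) -> array:
        cached = self._decoded.get(key)
        if cached is not None:
            return cached  # hit: no remote read and no re-parsing
        raw = self._fetch(key)  # miss: one slow remote read
        self.remote_reads += 1
        # Transcode once: text-encoded integers become a packed int64 array.
        tokens = raw.decode("ascii").split(",")
        column = array("q", (int(tok) for tok in tokens))
        self._decoded[key] = column
        return column


def fake_object_store(key: str) -> bytes:
    # Assumed stand-in for remote storage; returns comma-separated integers.
    return b"3,1,4,1,5,9,2,6"


cache = TranscodingCache(fake_object_store)
first = cache.read_column("part-0000")
second = cache.read_column("part-0000")  # served from the decoded copy
```

The second read pays neither the remote fetch nor the parsing cost, which is the point of caching the transcoded form rather than the original bytes.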

Hardware Trends and Workload Characteristics

5.1 Hardware Trends

Understanding hardware trends is vital when designing high-performance software solutions. While CPU clock frequencies have remained relatively stable over the years, IO throughput has increased significantly. This shift has made CPUs the bottleneck, particularly for analytics workloads that rely heavily on IO operations. To address this, Databricks launched Project Tungsten within Apache Spark, focused on optimizing CPU efficiency. However, even with these optimizations, CPUs continue to be a limiting factor in achieving the next level of performance.

5.2 Workload Characteristics

Modern workloads present unique challenges. Businesses are moving faster than ever, and data teams have limited time to properly model and prepare their data. Traditional data modeling approaches are not feasible when business contexts change frequently. As a result, string manipulations and poorly modeled data become prevalent, and both hurt query execution performance. To cater to these modern workloads, Delta Engine's Photon execution engine was designed to deliver both agility and performance. Photon optimizes Spark SQL workloads, accounting for the characteristics of not-so-well-modeled data.

How Delta Engine Achieves High Performance

6.1 Data-Level Parallelism

One of the key techniques employed by Delta Engine is data-level parallelism. Delta Engine leverages vectorized execution to exploit data-level parallelism and instruction-level parallelism offered by modern CPUs. By transposing data into a column-oriented format, Delta Engine ensures that memory access is sequential, reducing cache misses and enabling efficient loop execution. This approach allows the CPU to process multiple data points simultaneously, leading to significant performance improvements.
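
The layout change described above can be illustrated with a small Python sketch. It is not Delta Engine code: the sample rows and the scale_column helper are hypothetical, and the Python interpreter itself does not emit SIMD instructions. The sketch only shows the shape of the data and of the loop that a compiled engine could vectorize.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    (1, 10.0, "a"),
    (2, 20.0, "b"),
    (3, 30.0, "c"),
    (4, 40.0, "d"),
]

# Transpose into a column-oriented layout: one contiguous, typed array
# per numeric field, so values of the same column sit next to each other.
ids = array("q", (r[0] for r in rows))
prices = array("d", (r[1] for r in rows))


def scale_column(values: array, factor: float) -> array:
    """Apply one simple operation to every element of a packed column."""
    # Sequential access over a single data type with no per-row branching:
    # in a compiled engine, this loop shape is a candidate for SIMD.
    return array("d", (v * factor for v in values))


discounted = scale_column(prices, 0.5)
total = sum(discounted)
```

Only the prices column is touched by the computation; the ids and the string field are never read, which is part of why columnar layouts reduce cache misses.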

6.2 Instruction-Level Parallelism

Instruction-level parallelism is another critical aspect of Delta Engine's high-performance design. Delta Engine breaks complex loops into smaller, simpler loops so that the CPU can exploit instruction-level parallelism effectively. Within each simple loop, the CPU's out-of-order window can reorder independent instructions and execute several of them in parallel. This technique reduces CPU idle time and maximizes instruction throughput, resulting in faster query execution.
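
Splitting a complex loop into simpler ones (often called loop fission) can be sketched as follows. The fused and split functions and the revenue threshold are hypothetical, and Python does not expose instruction-level parallelism directly. The sketch only shows the restructuring and that both forms compute the same result.

```python
from array import array
from typing import Tuple


def fused(prices: array, qty: array) -> Tuple[array, float]:
    """One complex loop: multiply, store, filter, and accumulate together."""
    revenue = array("d")
    total = 0.0
    for p, q in zip(prices, qty):
        r = p * q
        revenue.append(r)
        if r > 100.0:
            total += r
    return revenue, total


def split(prices: array, qty: array) -> Tuple[array, float]:
    """Same result, computed as two small loops with simple bodies."""
    # Loop 1: pure element-wise multiply; every iteration is independent.
    revenue = array("d", (p * q for p, q in zip(prices, qty)))
    # Loop 2: filter and accumulate over the materialized column.
    total = 0.0
    for r in revenue:
        if r > 100.0:
            total += r
    return revenue, total


prices = array("d", [10.0, 50.0, 2.5, 30.0])
qty = array("d", [5.0, 3.0, 4.0, 4.0])
fused_result = fused(prices, qty)
split_result = split(prices, qty)
```

In a compiled engine, the first loop of the split version has no dependency between iterations, which gives the out-of-order window more independent work to overlap.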

6.3 Optimizing String Operations

In modern workloads, strings are pervasive and often unoptimized, which hurts query performance. Delta Engine addresses this by splitting string processing into two steps. First, a specialized SIMD ASCII detection algorithm quickly determines whether a string consists entirely of ASCII characters. If it does, every character occupies exactly one byte, so the second step can use optimized fixed-length operations instead of general variable-length character handling. This technique significantly improves the performance of string functions such as "upper" and "substring."
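
The two-step approach can be sketched in Python. This is a scalar stand-in rather than the SIMD algorithm itself: the is_ascii and upper_utf8 functions are hypothetical, and the check tests 8 bytes at a time with a single bit mask, where a SIMD implementation would test a wider register per instruction.

```python
HIGH_BITS = 0x8080808080808080  # the top bit of each of 8 bytes


def is_ascii(data: bytes) -> bool:
    """Return True if no byte in `data` has its high bit set.

    Tests 8 bytes per step with one mask, a scalar stand-in for a SIMD check.
    """
    n = len(data)
    full = n - (n % 8)
    for i in range(0, full, 8):
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            return False
    # Tail of fewer than 8 bytes, checked one byte at a time.
    return all(b < 0x80 for b in data[full:])


def upper_utf8(data: bytes) -> bytes:
    """Uppercase UTF-8 encoded text, with a fast path for pure ASCII."""
    if is_ascii(data):
        # Fast path: one byte per character, so a byte-wise transformation
        # is safe and the output has the same length as the input.
        return data.upper()
    # General path: decode variable-length UTF-8 and apply Unicode rules.
    return data.decode("utf-8").upper().encode("utf-8")


ascii_result = upper_utf8(b"delta engine photon")
mixed_result = upper_utf8("café au lait".encode("utf-8"))
```

The detection pass is cheap relative to the work it saves, because most of the cost of the general path comes from decoding variable-length characters.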

6.4 Optimizing Regular Expressions

Regular expressions play a crucial role in data manipulation and cleansing. Because they are used so frequently in real-world scenarios, Databricks has put considerable effort into optimizing regular expression evaluation in Delta Engine, with the goal of efficient and fast processing of regex-based operations.

Conclusion

Delta Engine, developed by Databricks, brings unprecedented performance to data lakes. By combining an improved query optimizer, a native vectorized execution engine, and an advanced caching layer, Delta Engine enables high-speed data processing for analytics workloads. It leverages data-level parallelism and instruction-level parallelism, optimizing execution for both well-modeled and not-so-well-modeled data. Delta Engine's techniques, including optimized string operations and accelerated regular expressions, ensure exceptional performance for SQL and DataFrame workloads. By utilizing Delta Engine, organizations can unlock the full potential of their data lakes and streamline their analytics processes.

Highlights

Delta Engine is a high-performance query engine designed for data lakes.
It builds on Apache Spark 3.0 and has full compatibility with Spark's APIs.
Delta Engine includes an improved query optimizer for enhanced performance.
The native vectorized execution engine exploits data-level and instruction-level parallelism for speed.
The caching layer optimizes IO throughput for faster data access.
Hardware trends and workload characteristics drive the design of Delta Engine.
Data-level parallelism and instruction-level parallelism are key techniques for achieving high performance.
Optimized string operations and regular expressions further improve query execution speed.
Delta Engine enables agile and high-performance analytics on data lakes.
Photon, Delta Engine's execution engine, delivers outstanding performance for Spark SQL workloads.

FAQ

Q: Is Delta Engine compatible with Spark's APIs?
A: Yes, Delta Engine is fully compatible with Spark's SQL and DataFrame APIs, ensuring seamless integration with existing Spark applications.

Q: How does Delta Engine optimize string operations?
A: Delta Engine employs a specialized SIMD ASCII detection algorithm to quickly identify if a string can fit within the ASCII character set. This optimization allows for faster execution using fixed-length operations.

Q: Does Delta Engine support regular expressions?
A: Yes, Delta Engine has optimized regular expressions to achieve superior performance. Regular expression operations in Delta Engine are significantly faster compared to other implementations.

Q: Can Delta Engine handle both well-modeled and not-so-well-modeled data?
A: Yes, Delta Engine is designed to deliver high performance for both well-modeled and not-so-well-modeled data. It offers agility and performance, catering to modern workload characteristics.

Q: What are the advantages of Delta Engine's caching layer?
A: Delta Engine's caching layer sits between the execution engine and cloud object storage, optimizing IO throughput. It automatically selects and caches input data for faster access, resulting in improved scan performance for all types of workloads.

Q: How does Delta Engine leverage data-level and instruction-level parallelism?
A: Delta Engine transposes data into a column-oriented format to exploit data-level parallelism. It also breaks complex loops into smaller, simpler loops to leverage instruction-level parallelism, maximizing CPU throughput and reducing idle time.