Tuesday, April 1, 2025

Accelerating Databricks Virtual Machines: 7-Fold Boost in Serverless Computing Speed

The Databricks serverless compute infrastructure launches and manages hundreds of thousands of virtual machines every day across three major cloud providers, and operating infrastructure at that scale effectively is a significant challenge. Today, we’re excited to share our latest progress toward a true Serverless experience: provisioning not just compute resources but the full underlying infrastructure for data and AI workloads, such as fully fledged Apache Spark clusters or large language model serving, in mere seconds at scale.

To date, no other Serverless platform has demonstrated the ability to stand up diverse data and AI workloads at this scale within seconds. The critical challenge lies in the time and cost of preparing a VM environment for maximum performance, which involves not only installing a diverse set of software packages but also meticulously fine-tuning the runtime environment.

Databricks Runtime (DBR), for example, requires warming up the JVM’s Just-In-Time (JIT) compiler ahead of time so that users see optimal performance from their very first query.

This blog post showcases the system-level optimizations we developed to reduce the boot time of VMs preloaded with Databricks software (Databricks VMs) from minutes to seconds, a 7x improvement since the launch of our Serverless platform, which now powers nearly all Databricks products. The optimizations span the entire software stack, from the operating system and container runtime up to the hosted applications, and they save millions of minutes of compute time daily, delivering better value to Databricks Serverless customers.

Booting a Databricks VM

When you spin up a Databricks VM on the Serverless Platform, the following boot sequence occurs:

The first step involves checking for any existing ephemeral storage that may have been created during a previous runtime. If such storage exists, it’s mounted and made available to the new runtime. This ensures that any files or data that were previously written to disk are preserved across reboots.

Next, the Databricks kernel is initialized, which involves loading the necessary dependencies and libraries required for running Apache Spark jobs. This includes configuring the Spark context, setting up the driver node, and initializing the cluster configuration.

Following this, the Spark UI is started, allowing you to monitor job progress and performance metrics in real-time. The UI provides a centralized location to view cluster status, job logs, and runtime metrics, making it easier to debug and optimize your workflows.

Finally, the Databricks Runtime environment is set up, including configuration of Python, Scala, or SQL environments as needed. This prepares the environment for running notebooks, executing Spark jobs, or performing data engineering tasks using the full range of Databricks features.

Each of the three fundamental boot stages outlined in Figure 1 takes time to complete, for the following reasons:

  1. A Databricks VM begins by booting its operating system: it initializes the kernel, starts system services, launches the container runtime, and finally establishes a connection with the cluster manager that oversees all virtual machines in the deployment.
  2. At Databricks, we package functionality in container images, which streamlines runtime resource allocation and deployment. Upon connecting to the cluster manager, the VM receives a list of container specifications and starts downloading several gigabytes’ worth of images from the container registry.

    These images encapsulate not only the latest Databricks Runtime, but also essential utilities for log processing, virtual machine health monitoring, and metric reporting, among other crucial capabilities.

  3. Finally, the VM launches the workload container, sets up its environment, and makes it ready to serve. For Databricks Runtime, this setup involves loading thousands of Java libraries and running a series of carefully crafted queries to warm up the JVM. These warm-up queries trigger the JVM’s JIT compiler to convert bytecode into native machine code for frequently executed code paths, so customers enjoy peak runtime efficiency from their very first query. Running warm-up queries across a range of query types and data-processing patterns keeps performance fast and consistent, but the more queries we run, the longer initialization takes, up to several minutes. (A sketch of such a warm-up routine follows this list.)
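The exact warm-up suite Databricks Runtime runs is not public. As a minimal sketch of the idea, assuming a PySpark driver and purely illustrative query shapes, a warm-up routine might look like this:

```python
# Illustrative warm-up routine: run cheap, representative queries so the JVM's
# JIT compiler turns hot Spark code paths into native machine code before real
# workloads arrive. The query shapes and repetition count are assumptions, not
# the actual Databricks warm-up suite.
from pyspark.sql import SparkSession

WARMUP_QUERIES = [
    # Scans, filters, aggregations, and joins each exercise different code
    # paths in Spark's execution engine.
    "SELECT COUNT(*) FROM range(1000000)",
    "SELECT id, id % 10 AS k FROM range(1000000) WHERE id % 7 = 0",
    "SELECT k, SUM(id) FROM (SELECT id, id % 10 AS k FROM range(1000000)) GROUP BY k",
    "SELECT a.id FROM range(100000) a JOIN range(100000) b ON a.id = b.id",
]

def warm_up(spark: SparkSession, rounds: int = 3) -> None:
    """Run each warm-up query several times so hot loops get JIT-compiled."""
    for query in WARMUP_QUERIES:
        for _ in range(rounds):
            # collect() forces full execution rather than just query planning.
            spark.sql(query).collect()

if __name__ == "__main__":
    spark = SparkSession.builder.appName("jit-warmup").getOrCreate()
    warm_up(spark)
    spark.stop()
```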

We significantly reduced latency for each phase by implementing the following optimizations.

A purpose-built Serverless OS

We built a custom Serverless OS for Databricks Serverless, purpose-designed for ephemeral VMs that run our entire software stack. The first goal is to keep the OS lean: we include only the software needed to run containers, and we optimize its boot sequence so that critical services come up earlier than they would on a general-purpose OS. The second goal is performance: we tune the OS to favor buffered I/O for writes and to avoid disk bottlenecks during the boot process.

Eliminating unnecessary OS components accelerates the boot process in two ways. First, fewer components need to be initialized; for example, we disable the USB subsystem, which is irrelevant for a cloud VM. Second, a leaner OS is a better fit for the cloud environment: a VM boots its operating system from a remote disk, whose contents are transferred to the physical host during booting, and cloud providers optimize this transfer with several layers of caching driven by predictions of which block sectors are likely to be accessed. A smaller OS footprint lets those caches work more effectively.

We also tailor the Serverless OS to mitigate I/O contention during boot by reducing critical file writes on the boot path. We tune the system so that more file writes can be buffered in memory before the kernel must flush them to disk, and we configure the container runtime to mitigate blocking, synchronous writes during image pulls and container creation. These optimizations are designed specifically for ephemeral, short-lived VMs, where data loss from a power outage or system failure is not a real concern.
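As a rough illustration of the kind of kernel tuning involved, the sketch below relaxes the standard Linux dirty-page writeback settings so that more writes stay buffered in memory during boot. The tunables are real kernel knobs, but the values here are illustrative assumptions rather than the settings we actually ship:

```python
# Illustrative writeback tuning for an ephemeral VM: allow more dirty pages to
# accumulate in memory before the kernel forces them out to disk. The values
# are assumptions for illustration only.
from pathlib import Path

WRITEBACK_TUNING = {
    # Percentage of RAM that may hold dirty pages before writers are blocked.
    "vm/dirty_ratio": "60",
    # Percentage of RAM at which background writeback kicks in.
    "vm/dirty_background_ratio": "30",
    # How old (in centiseconds) dirty data may become before it must be flushed.
    "vm/dirty_expire_centisecs": "6000",
    # How often (in centiseconds) the writeback threads wake up.
    "vm/dirty_writeback_centisecs": "1000",
}

def apply_writeback_tuning() -> None:
    """Write each tunable under /proc/sys; requires root privileges."""
    for key, value in WRITEBACK_TUNING.items():
        Path("/proc/sys", key).write_text(value)

if __name__ == "__main__":
    apply_writeback_tuning()
```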

A lazy container filesystem

After connecting to the cluster manager, a Databricks VM must quickly acquire gigabytes of container images before it can start the Databricks Runtime and the utility containers that handle log processing, metrics emission, and so on. Even with full network bandwidth and disk throughput, the download can take several minutes. Yet only about 24% of the image data is actually needed for containers to start working, while downloading the images accounts for a significant 76% of startup time.

Determining 2: A Lazy Container Filesystem Based Mostly on OverlayFS

Based on this observation, we enable lazy loading of the container filesystem, as shown in Figure 2. When building a container image, we add an extra step that converts the traditional gzip-compressed image format into a block-device-based format optimized for lazy loading. This lets the container image be exposed in production as a block device made up of 4 MB sectors.

Our custom container runtime fetches only the essential metadata for the container’s root filesystem, including the directory structure, file names, and permissions, and uses it to generate a virtual block device. It then mounts this virtual block device into the container, so the application can start immediately. When a file is read for the first time, the I/O request against the virtual block device triggers a callback to an image fetcher process, which retrieves the specific block contents from the remote container registry. The fetched blocks are also cached locally to avoid repeated network requests to the registry, shielding subsequent reads from variable network latency.
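The actual implementation lives inside our container runtime, but the core idea can be sketched in a few lines: resolve each read to the blocks it touches, fetch missing blocks from the registry with range requests, and serve repeated reads from a local cache. The class name, URL scheme, and HTTP layout below are illustrative assumptions:

```python
# Minimal sketch of lazy, block-granular image reads: a read against the
# virtual block device maps to 4 MiB blocks that are fetched from the remote
# registry on first access and served from a local cache afterwards.
from pathlib import Path
import urllib.request

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks, as described above

class LazyBlockReader:
    def __init__(self, registry_url: str, image_digest: str, cache_dir: str):
        self.registry_url = registry_url
        self.image_digest = image_digest
        self.cache = Path(cache_dir)
        self.cache.mkdir(parents=True, exist_ok=True)

    def read(self, offset: int, length: int) -> bytes:
        """Serve a read by stitching together the blocks it touches."""
        data = bytearray()
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block_no in range(first, last + 1):
            block = self._get_block(block_no)
            start = offset - block_no * BLOCK_SIZE if block_no == first else 0
            end = min(BLOCK_SIZE, offset + length - block_no * BLOCK_SIZE)
            data += block[start:end]
        return bytes(data)

    def _get_block(self, block_no: int) -> bytes:
        """Return one block, preferring the local cache over the registry."""
        cached = self.cache / f"{self.image_digest}.{block_no}"
        if cached.exists():
            return cached.read_bytes()
        # Range request against the registry blob (illustrative URL scheme).
        url = f"{self.registry_url}/blobs/{self.image_digest}"
        request = urllib.request.Request(url)
        request.add_header(
            "Range",
            f"bytes={block_no * BLOCK_SIZE}-{(block_no + 1) * BLOCK_SIZE - 1}")
        with urllib.request.urlopen(request) as response:
            block = response.read()
        cached.write_bytes(block)  # cache for subsequent reads
        return block
```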

The lazy container filesystem removes the need to pre-download complete container images, cutting startup latency from several minutes to mere seconds. Because image blocks are fetched over a longer window, it also spreads the load on blob storage bandwidth and avoids potential throttling.

Checkpointing/Restoring a pre-initialized container

Before declaring a VM operational and ready for service, we run a comprehensive in-container setup routine. For Databricks Runtime, we preload all essential Java libraries and run a full initialization to thoroughly warm up the Spark JVM. While this optimizes initial query performance for customers, it adds significantly to boot time, and the identical setup is repeated for every VM Databricks launches, which wastes compute resources.

To avoid repeating this costly initialization, we cache the fully initialized state: we use checkpointing technology to capture a snapshot of the pre-initialized container’s process tree, then use that snapshot as a template to launch subsequent instances of the same workload type. With this setup, containers are “restored” directly into the initialized state, bypassing the expensive setup process altogether.

Figure 3: Checkpointing and restoring a Databricks Runtime (DBR) container. Purple rectangles denote the state captured during checkpointing.

We build checkpoint and restore support directly into our custom container runtime, as outlined in Figure 3. During checkpointing, the runtime first freezes the container’s entire process tree to ensure consistency. It then writes the process state to disk: loaded libraries, open file descriptors, the entire heap state including JIT-compiled native code, and stack memory. It also preserves the writable layer of the container’s filesystem, capturing the data created or modified during container initialization. Together, these allow us to restore both the in-memory process state and the on-disk filesystem state at a later time. We package the checkpoint into an OCI/Docker-compatible container image and store and distribute it through the container registry as if it were a standard image.
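We integrate this with our own runtime, but the same pattern can be sketched with CRIU, the open-source checkpoint/restore tool for Linux process trees. Treat the snippet below as an assumption-laden illustration (the use of CRIU, the paths, and the flags are ours, not a description of the Databricks implementation):

```python
# Simplified checkpoint/restore wrapper in the spirit described above, built on
# CRIU. Whether Databricks uses CRIU directly is an assumption; the paths and
# flags are illustrative.
import subprocess
import tarfile
from pathlib import Path

def checkpoint(pid: int, image_dir: str, writable_layer: str, archive: str) -> None:
    """Freeze a process tree, dump its state, and bundle it together with the
    container's writable filesystem layer into a single archive."""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["criu", "dump", "-t", str(pid), "-D", image_dir,
         "--leave-running",      # keep the template process alive afterwards
         "--tcp-established"],   # capture established TCP connections
        check=True,
    )
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(image_dir, arcname="memory")         # process and memory state
        tar.add(writable_layer, arcname="rw-layer")  # writable container layer

def restore(archive: str, workdir: str) -> None:
    """Unpack a checkpoint archive and restore the process tree from it."""
    Path(workdir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(workdir)
    subprocess.run(
        ["criu", "restore", "-D", str(Path(workdir, "memory"))],
        check=True,
    )
```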

While this approach sounds straightforward in principle, it presents several practical challenges:

  • A single checkpoint is restored on many different VMs, but Databricks Runtime was not originally designed to handle host-specific data, such as hostnames, IP addresses, and pod names, that differs from one restore to the next. Nor could it handle the sudden jump in wall-clock time, since a checkpoint is often restored days or weeks after it was taken. To handle checkpointing and restoring seamlessly, we introduced a checkpoint/restore-compatible mode in Databricks Runtime. This mode postpones the binding of host-specific data until after restore, and it provides pre-checkpoint and post-restore hooks so components can plug their own logic into the checkpoint and restore processes. Databricks Runtime uses these hooks to handle the time jump, suspend and resume heartbeats, re-establish external network connections, and more.
  • A checkpoint captures one specific container state, determined by factors such as the Databricks Runtime version, software configuration, heap size, CPU instruction set architecture (ISA), and others. Restoring a checkpoint taken on a 64 GB VM onto a 32 GB one will likely trigger out-of-memory errors, while loading a checkpoint created on an Intel processor onto an AMD one may cause illegal-instruction faults, because the JVM’s JIT compiler generates native code specific to the ISA. The rapid pace of change in Databricks Runtime and our compute infrastructure makes it hard to design a CI/CD pipeline that keeps up. Rather than enumerating every possible signature across these dimensions, we generate a checkpoint as soon as a new signature appears in production. The generated checkpoints are then uploaded to a container registry for distribution, so future launches of workloads with the same signature anywhere in the fleet can restore from them. This both simplifies the checkpoint-generation pipeline and guarantees that every generated checkpoint is actually used in production.
  • Launching multiple containers from the same checkpoint can break uniqueness guarantees: for instance, random number generators (RNGs) seeded before the checkpoint will produce identical sequences of random numbers after restore. We therefore track RNG objects as they are created during initialization and use the post-restore hook to reseed them, restoring their uniqueness (see the sketch after this list).
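The hook APIs inside Databricks Runtime are not public. As a minimal sketch of the pattern, assuming illustrative names throughout, a pre-checkpoint/post-restore hook registry with RNG reseeding might look like this:

```python
# Illustrative pre-checkpoint / post-restore hook registry, plus a post-restore
# hook that reseeds tracked random number generators so containers restored
# from the same checkpoint do not share random sequences. All names here are
# assumptions, not the Databricks Runtime API.
import os
import random

_pre_checkpoint_hooks = []
_post_restore_hooks = []
_tracked_rngs: list[random.Random] = []

def on_pre_checkpoint(fn):
    """Register a callback to run just before the checkpoint is taken."""
    _pre_checkpoint_hooks.append(fn)
    return fn

def on_post_restore(fn):
    """Register a callback to run right after the container is restored."""
    _post_restore_hooks.append(fn)
    return fn

def track_rng(rng: random.Random) -> random.Random:
    """Record an RNG at creation time so it can be reseeded after restore."""
    _tracked_rngs.append(rng)
    return rng

@on_pre_checkpoint
def pause_heartbeats():
    ...  # e.g. suspend heartbeats and drain external connections

@on_post_restore
def reseed_rngs():
    # Fresh entropy per restored container restores uniqueness guarantees.
    for rng in _tracked_rngs:
        rng.seed(os.urandom(32))

def run_pre_checkpoint_hooks():
    for fn in _pre_checkpoint_hooks:
        fn()

def run_post_restore_hooks():
    for fn in _post_restore_hooks:
        fn()
```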

Our measurements show that this optimization cuts Databricks Runtime start-up and warm-up time from several minutes to roughly 10 seconds. It also lets us run a more thorough JVM warm-up without worrying about how long it takes, because warm-up is no longer on the critical boot path.

Conclusion

We’re committed to delivering value to our customers through continuous innovation and optimization. The sevenfold reduction in Databricks VM boot times described here is the result of a series of deep, foundational improvements. It not only enables better latency and efficiency for Serverless customers, but also lets us deliver a great user experience at lower cost.

We will keep optimizing to reduce VM boot time even further and drive down Serverless costs. Stay tuned for more details! We’d also like to thank the open-source communities whose work we built upon in achieving these optimizations. Start your free trial today and experience Databricks Serverless for yourself.
