DataPelago has emerged from stealth with a new virtualization layer that it says lets users run AI, data analytics, and ETL workloads on whatever physical processor they choose, without code changes, potentially bringing large efficiency and performance gains to data science, data analytics, data engineering, and HPC.
The emergence of generative AI has set off a scramble for high-performance processors that can handle the enormous compute demands of large language models. At the same time, companies are trying to extract more value from their existing compute investments for advanced analytics and big data pipelines, all while coping with the relentless growth of structured, semi-structured, and unstructured data.
DataPelago is responding to these market signals by building what it describes as a universal data processing engine, which decouples data-intensive workloads from the underlying compute infrastructure. That frees customers to run big data, advanced analytics, AI, and HPC workloads on whichever public cloud or on-premises system meets their price and performance requirements.
“Just as Sun created the Java Virtual Machine or VMware developed the hypervisor, we are building a virtualization layer that runs in software, not in hardware,” says DataPelago Co-founder and CEO Rajan Goyal.
The DataPelago virtualization layer sits between query engines, such as Spark, Trino, Flink, and regular SQL, and the underlying infrastructure, which consists of storage and physical processors such as CPUs, GPUs, TPUs, and FPGAs. Users and applications submit jobs as they normally would, and the DataPelago layer automatically routes each job to the most suitable processor to meet the availability, cost, and performance parameters the customer has set.
When a user or application submits a job, such as a data pipeline task or a query, the processing engine, such as Spark, converts it into a plan. DataPelago then calls an open-source layer, such as Apache Gluten, to convert that plan into an Intermediate Representation (IR) using open standards like Substrait or Velox.
The plan is sent to the worker nodes in the DataOS component of the DataPelago platform, while the IR is converted into an executable Data Flow Graph (DFG) that also runs in the DataOS component. DataVM then evaluates the nodes of the DFG and dynamically maps them to the appropriate processing elements, in line with the policies the organization has defined.
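The mapping of DFG nodes to processing elements can be illustrated with a minimal sketch. This is not DataPelago's implementation: the operator names, the device list, and the cost table below are all hypothetical, and a simple lowest-estimated-cost rule stands in for whatever policy DataVM actually applies.

```python
from dataclasses import dataclass

# Hypothetical relative cost of each operator type on each device
# (lower is better). The numbers are illustrative, not measured.
COST = {
    "scan":      {"cpu": 1.0, "gpu": 3.0, "fpga": 0.8},
    "filter":    {"cpu": 1.0, "gpu": 0.6, "fpga": 0.5},
    "hash_join": {"cpu": 4.0, "gpu": 1.0, "fpga": 2.5},
    "regex":     {"cpu": 1.5, "gpu": 2.0, "fpga": 0.4},
}


@dataclass(frozen=True)
class Node:
    """One node of a toy data flow graph (assumed structure)."""
    name: str
    op: str
    inputs: tuple = ()


def assign_devices(dfg, available):
    """Map each node to the available device with the lowest estimated cost."""
    placement = {}
    for node in dfg:
        costs = COST[node.op]
        placement[node.name] = min(available, key=lambda dev: costs[dev])
    return placement


# A small linear graph: scan -> filter -> regex match -> join.
dfg = [
    Node("read_orders", "scan"),
    Node("recent", "filter", ("read_orders",)),
    Node("match_sku", "regex", ("recent",)),
    Node("join_items", "hash_join", ("match_sku",)),
]

placement = assign_devices(dfg, available=["cpu", "gpu", "fpga"])
cpu_gpu_only = assign_devices(dfg, available=["cpu", "gpu"])
print(placement)
print(cpu_gpu_only)
```

The same graph lands on different devices when the FPGA is removed from the available list, which is the point of deferring the placement decision to runtime rather than fixing it in the job's code.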
Having an automated way to match workloads with the right processors should be a significant benefit to DataPelago customers, many of whom have not seen the performance gains they expected when they adopted accelerated compute engines, Goyal says.
“CPUs, FPGAs, and GPUs each have their own sweet spot, just as a SQL or Python workload has a variety of operators. Not all of them run efficiently on a CPU, GPU, or FPGA,” Goyal says. “We know those sweet spots. At runtime, our software maps the operators to the right processing element. It can break a massive job into many tasks, and some will run on CPUs, some on GPUs, and some on FPGAs. That adaptive mapping at runtime to the right processing element is missing in other frameworks.”
DataPelago cannot exceed the peak performance an application can reach when it is developed natively in CUDA for Nvidia GPUs, ROCm for AMD GPUs, or LLVM for high-performance CPU jobs, Goyal says. But the company’s product can get much closer to that peak than applications typically do today, while shielding users from the underlying complexity and without tying their applications to those middleware layers.
There is a significant gap between the peak performance GPUs are expected to deliver and what applications actually get. “We’re bridging that gap,” he says. “Even Spark workloads running on GPUs today typically use less than 10% of the GPUs’ peak FLOPS.”
One reason for the performance gap is I/O bandwidth, Goyal says. GPUs have their own local memory, so data must be moved from host memory to GPU memory before the GPU can use it. People often fail to factor that data movement and I/O into their expectations when they move to GPUs, but DataPelago’s software can eliminate the need to worry about it, he says.
“This virtual machine lets us fuse operators and execute data flow graphs,” Goyal says. “Things don’t move from one domain to another. There is no data movement. We run in a streaming fashion. We don’t do store and forward. As a result, I/O is significantly reduced, and we are able to drive GPUs to 80% to 90% of their peak performance. That’s the beauty of this architecture.”
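The difference between materializing each operator's output and streaming rows through combined operators can be shown with a small sketch. It is plain Python on a CPU, so it only illustrates the idea and says nothing about how DataPelago implements it on GPUs; the three operators (parse, filter, scale) are hypothetical.

```python
def store_and_forward(rows):
    """Each operator writes its full output before the next one starts."""
    parsed = [int(r) for r in rows]            # intermediate buffer 1
    kept = [x for x in parsed if x % 2 == 0]   # intermediate buffer 2
    scaled = [x * 10 for x in kept]            # intermediate buffer 3
    return sum(scaled)


def fused_streaming(rows):
    """The same three operators combined into one pass over the input.

    Each row flows through parse, filter, and scale without any
    intermediate collection being written and read back.
    """
    total = 0
    for r in rows:
        x = int(r)
        if x % 2 == 0:
            total += x * 10
    return total


rows = [str(i) for i in range(10)]
assert store_and_forward(rows) == fused_streaming(rows)
print(fused_streaming(rows))
```

Both functions return the same result, but the first version creates three intermediate lists. On an accelerator, each such intermediate can mean another round trip through memory, which is the traffic a fused, streaming design avoids.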
The company is targeting a broad range of data-intensive workloads that organizations are trying to speed up by running them on accelerated computing platforms. These include SQL queries for ad hoc analytics using Spark, Trino, and Presto, ETL workloads written in SQL or Python, and streaming data workloads that use frameworks like Flink. Generative AI workloads can benefit as well, both at the large language model (LLM) training stage and at inference, thanks to DataPelago’s ability to accelerate retrieval-augmented generation (RAG), fine-tuning, and the creation of vector embeddings for a vector database.
“So it’s a single platform for both classic lakehouse analytics and ETL, as well as the data preprocessing for GenAI,” Goyal says.
Customers can run DataPelago on-premises or in the cloud. When it runs with a cloud-based data processing service such as Amazon EMR or Google Cloud Dataproc, the system can do the work of a 100-node cluster with a 10-node cluster, Goyal says. While the queries themselves run 10 times faster with DataPelago, the overall improvement in return on investment is about 200% once licensing and maintenance costs are factored in, he says.
“Most importantly, there’s no code change required,” he says. “You might be writing Airflow or using Jupyter notebooks. Your Python or PySpark code remains untouched.”
The company has benchmarked its software against some of the fastest data lakehouse platforms available. When run against Databricks’ Photon, DataPelago showed a 3x to 4x performance improvement, according to Goyal.
According to Goyal, there is no reason why clients cannot leverage the DataPelago virtualization layer to accelerate scientific computing workloads running on high-performance computing (HPC) setups, including those involving artificial intelligence (AI) or simulation and modeling tasks.
“When you have custom code built for specific hardware, you’re optimizing for, say, an A100 GPU with its 80 GB of GPU memory, a certain number of streaming multiprocessors (SMs), and a certain number of threads,” he says. “You’re orchestrating your low-level code and kernels to maximize FLOPS, or operations per second. What we have done is provide an abstraction layer where that work is done underneath and hidden, which gives extensibility while applying the same principle.
“At the end of the day, there’s no magic here. There are only three things: compute, I/O, and storage,” he says. “As long as you architect your system with a balance of those three, so that you are not bottlenecked by any one of them, then life is good.”
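The balance Goyal describes can be expressed as a simple bottleneck model: a pipeline sustains only the rate of its slowest stage. The sketch below is an illustration under that assumption, not a description of DataPelago's software, and the throughput figures are made-up inputs rather than measurements.

```python
def pipeline_throughput(compute_gbps, io_gbps, storage_gbps):
    """Return the sustained rate and the name of the limiting stage.

    Assumes the three stages run concurrently, so the slowest one
    sets the pace for the whole pipeline.
    """
    stages = {"compute": compute_gbps, "io": io_gbps, "storage": storage_gbps}
    bottleneck = min(stages, key=stages.get)
    return stages[bottleneck], bottleneck


def compute_utilization(compute_gbps, io_gbps, storage_gbps):
    """Fraction of peak compute capacity that is actually kept busy."""
    rate, _ = pipeline_throughput(compute_gbps, io_gbps, storage_gbps)
    return rate / compute_gbps


# Hypothetical unbalanced system: fast processor, slow data path.
starved = pipeline_throughput(compute_gbps=100, io_gbps=8, storage_gbps=20)

# Hypothetical balanced system: all three stages are close to each other.
balanced = pipeline_throughput(compute_gbps=100, io_gbps=90, storage_gbps=95)

print(starved, compute_utilization(100, 8, 20))
print(balanced, compute_utilization(100, 90, 95))
```

In the unbalanced case the processor sits mostly idle no matter how fast it is, because I/O caps the rate. Raising the slowest stage, not the fastest one, is what moves utilization.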
DataPelago already has paying customers who have piloted its software, some of whom are now preparing to move into production, Goyal says. The company plans to make its software generally available in the first quarter of 2025.
The Mountain View-based company emerged from stealth with an announcement that it has raised $47 million from a group of investors that includes Eclipse, Taiwania Capital, Qualcomm Ventures, Alter Venture Partners, Nautilus Venture Partners, and Silicon Valley Bank, a division of First Citizens Bank.