Tuesday, September 30, 2025

Revolutionizing Automotive Measurement Data Storage and Analysis: Mercedes-Benz’s Petabyte-Scale Solution on the Databricks Intelligence Platform

Summary

With the rise of connected vehicles, the automotive industry is experiencing an explosion of time series data. Hundreds of Electronic Control Units (ECUs) continuously stream data across in-vehicle networks at high frequencies (1Hz-100Hz). This data offers immense potential for predictive analytics and innovation, but extracting knowledge at petabyte scale presents major technical, financial, and sustainability challenges.

In this blog post, we introduce a novel hierarchical semantic data model tailored for large-scale time series data. Leveraging the latest features (e.g. Liquid Clustering) introduced by the Databricks Intelligence Platform, it enables scalable and cost-efficient analytics – transforming raw automotive measurement data into actionable insights that drive vehicle development, performance tuning, and predictive maintenance.

Moreover, we share benchmarks based on real-world data from Mercedes-Benz and compare state-of-the-art data optimization strategies to evaluate performance across key industry use cases.

Introduction

Time series analysis in the automotive industry isn’t just number crunching; it is like reading the heartbeat of every vehicle on the road. Each data point tells a story, from the subtle vibrations of an engine to split-second decisions of autonomous driving systems and even driver-vehicle interactions. As these data points coalesce into trends and patterns, they reveal insights that can revolutionize vehicle development, enhance safety features, and even predict maintenance needs before a single warning light flashes on a dashboard.

However, the sheer volume of this data presents a formidable challenge. Modern vehicles, equipped with hundreds of ECUs, generate an enormous amount of time series data. While gathering and storing this wealth of data is essential, the true challenge, and opportunity, lies in harnessing its power to move beyond simple reporting to forward-looking predictive analytics using ML & AI.

At the heart of this challenge is the need for a universally applicable, efficient, and scalable model for representing time series data, one that supports both well-defined and emerging use cases. To meet this need, we introduce a novel hierarchical semantic data model that addresses the complexity of automotive time series analysis, transforming raw measurement data into a strategic asset.

In developing this data model, we focused on three critical aspects:

  1. Cost-efficient and Scalable Data Access: The data model should be designed to support common query patterns in time series data analysis, enabling fast and resource-efficient processing of massive datasets.
  2. Usability: Ease of use for data practitioners as well as domain experts is crucial, ensuring that working with the data is simple and intuitive regardless of scale, so insights can be gained quickly without spending hours writing queries.
  3. Data Discoverability & Data Governance: Minimizing the data model for time series data of up to thousands or millions of different signals and contextual metadata is crucial for governance and maintainability. Data from an arbitrary number of vehicle fleets can be easily registered in a few Unity Catalog tables, and users can securely discover, access, and collaborate on trusted data.

In collaboration with Mercedes-Benz AG, one of the largest premium car manufacturers, based in Stuttgart, Germany, we enhance the data model based on ASAM standards to help Mercedes-Benz develop the most desirable cars, leveraging the power of the Mercedes-Benz Operating System (MB.OS). Much like the Mercedes-Benz Vision EQXX concept car, which sets new benchmarks for electric range and efficiency, we are pushing analytics performance and efficiency to a whole new level by using cutting-edge technologies.

In this blog post, we showcase production data analytics use cases and real-world data to demonstrate the capabilities of our extended data model across various setups. Moreover, we conducted research on different optimization strategies and performed systematic benchmarks of the Z-Ordering and Liquid Clustering data layouts.

A hierarchical semantic data model addressing the three critical aspects

This data model can represent time series data of tens of thousands of signals in a single table and includes a hierarchical representation of contextual metadata. Our model therefore provides the following advantages:

  1. Efficient Filtering: The hierarchical structure allows for fast filtering across multiple dimensions, enabling analysts to quickly narrow down their search space.
  2. Semantic Relationships: By incorporating semantic relationships between samples and contextual metadata, the model facilitates more intuitive and powerful querying capabilities.
  3. Scalability: The hierarchical nature of the model supports efficient data organization as the volume grows to petabyte scale.
  4. Contextual Integration: The semantic layer allows for seamless integration of contextual metadata, enhancing the depth of analysis possible.

The Core Data Model

The core model consists of five tables that efficiently represent time series data and contextual metadata (see Figure 1 for the Entity Relationship diagram). Central to the model is the samples table, which contains time series data in a narrow format with two identifier columns: container_id and channel_id. The container_id serves as a unique identifier for a set of time series objects, while channel_id uniquely identifies each time series (or channel) within that container. This structure enables distributed analysis of the underlying time series data.

In the automotive context, a container comprises predefined channels recorded by automotive data loggers during a test drive and stored in a single file. However, multiple measurement files can be grouped into a single container if measurements from a trip are split due to size constraints. This concept also applies to continuous time series data streams (e.g., from IoT devices), where container boundaries can be defined by time (e.g., hourly or daily) or by process information, such as splitting streams based on production steps or batches.

All sample data is stored using run-length encoding (RLE), merging consecutive samples with the same value into a single row defined by a starting time (“tstart”), an ending time (“tend”), and the recorded value. The ending time is non-inclusive, marking the transition to the next value. RLE is a simple compression method that facilitates efficient analysis, such as calculating histograms by bucketing values and summing the duration (tend – tstart). Each row is indexed by container_id, channel_id, and the active timeframe. This core samples table is kept simple to minimize storage size and enhance query performance.
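
The RLE layout and the histogram computation it enables can be sketched in a few lines of Python. This is a minimal illustration under the assumption of unit-spaced samples; the function names and the choice to close the final run one time unit after the last sample are our own, not part of the production schema beyond tstart, tend, and value.

```python
def rle_encode(samples):
    """Merge consecutive samples with the same value into (tstart, tend, value) rows.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp and
    assumed to be unit-spaced; tend is non-inclusive, marking the transition
    to the next value.
    """
    rows = []
    for ts, val in samples:
        if rows and rows[-1][2] == val:
            rows[-1][1] = ts + 1          # extend the open run
        else:
            if rows:
                rows[-1][1] = ts          # close the previous run at the transition
            rows.append([ts, ts + 1, val])
    return [tuple(r) for r in rows]

def histogram(rows, bucket_width):
    """Duration-weighted histogram: sum (tend - tstart) per value bucket."""
    hist = {}
    for tstart, tend, value in rows:
        bucket = (value // bucket_width) * bucket_width
        hist[bucket] = hist.get(bucket, 0) + (tend - tstart)
    return hist

# Five 1Hz samples collapse into two RLE rows.
rows = rle_encode([(0, 50), (1, 50), (2, 50), (3, 60), (4, 60)])
speed_hist = histogram(rows, 10)
```

Note how the histogram never expands the runs back into individual samples; it only sums durations, which is what makes this layout cheap to aggregate at scale.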

In addition to the samples table, we have four tables to represent the contextual metadata:

  • “container_metrics” and “container_tags” are indexed by their given “container_id”.
  • “channel_metrics” and “channel_tags” metadata is additionally identifiable by the corresponding “channel_id”.
  • Both metrics tables have a static schema which contains useful information for pruning queries.
  • Both tags tables are used as a simple key-value-pair store that can hold any kind of metadata.

Some metadata can be extracted directly from measurement files; tags can also be enriched from external metadata sources to provide context for linked containers and signals.
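
As a minimal illustration of how a tags table supports pruning, the sketch below models container_tags as key-value rows and selects the container IDs matching a tag predicate. The row contents and tag keys here are invented for the example, not taken from the production tables.

```python
# Hypothetical container_tags rows: (container_id, key, value)
container_tags = [
    ("c1", "vehicle_series", "EQS"),
    ("c1", "test_event", "winter_trial"),
    ("c2", "vehicle_series", "EQE"),
    ("c3", "vehicle_series", "EQS"),
]

def containers_with_tag(tags, key, value):
    """Return the set of container IDs whose tag `key` equals `value`."""
    return {cid for cid, k, v in tags if k == key and v == value}

# Only containers tagged with the requested vehicle series survive the pruning step;
# the samples table is then scanned for these containers only.
eqs_containers = containers_with_tag(container_tags, "vehicle_series", "EQS")
```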

Figure 1: The Entity Relationship diagram of a data model for representing time series data and its contextual metadata

Mercedes-Benz Implementation

As a member of the Association for Standardization of Automation and Measuring Systems (ASAM) community (status August 2025), Mercedes-Benz has long utilized various technologies to analyze collected measurement data. Through our collaboration with Databricks, we recognized the immense potential of the aforementioned time series data model to support Mercedes-Benz vehicle development. Consequently, we leveraged our vehicle development expertise to enhance the data model based on the ASAM MDF standard (see Figure 2). We contributed production measurement data from development vehicles and adapted real data analytics use cases. This allowed us to validate the data model concept and demonstrate its feasibility in improving the vehicle development process and quality.

Our focus will now shift to demonstrating how this enhanced data model performs with Mercedes-Benz development vehicle measurement data:

  • Level 1 filtering via “t_test_object_metric” & “t_test_object_tag”: These two tables store business information and statistics at test object level (e.g. test vehicle). Examples include vehicle type, vehicle series, model year, vehicle configuration, etc. This information allows the data analytics use cases in the first step to focus on specific test objects among hundreds of test objects.
  • Level 2 filtering via “t_measurement_session_metric” & “t_measurement_session_tag”: These two tables store the business information and statistics at the measurement session level. Examples include test events, time zone information, and measurement start/end timestamps. The measurement start/end timestamps help the data analytics scripts in the second step narrow down the interesting hundreds of measurement sessions from millions of measurement sessions.
  • Level 3 filtering via “t_signal_metric” & “t_signal_tag”: These two tables store the business information and statistics at signal key level. Examples include vehicle speed, road type, weather condition, drive pilot signals, etc. The data analytics scripts leverage this information in the final step to identify the relevant signals for the underlying query from thousands of available signals.
  • Analytics scripts on measurement signal data tables: The actual analytics logic is executed on the measurement signal data tables, which store the time series data collected from test vehicles. However, after applying the three levels of data filtering mentioned above, typically only a small fraction of the original raw time series data needs to be processed and analyzed.
Figure 2: Simplified Measurement Data ER Model of the Mercedes-Benz implementation

Mercedes-Benz Example Use Case for working with the metadata tables

By introducing different levels of metric and tag tables as core metadata, data analytics performance has significantly improved compared to existing solutions at Mercedes-Benz. To illustrate how the core metadata enhances analytics performance, we use Automated Lane Change (ALC) system readiness detection as an example.

As highlighted in Mercedes-Benz innovation, the ALC function is an integral part of Active Distance Assist DISTRONIC with Active Steering Assist. If a slower vehicle is driving ahead, the car can initiate a lane change itself in the speed range of 80-140 km/h and overtake fully automatically if lane markings are detected and sufficient clearance is available. The prerequisite is a road with a speed limit and a vehicle equipped with MBUX Navigation. The sophisticated system requires no further impulse from the driver to execute the automated lane change. These three preconditions help the analytics script filter the relevant sessions from thousands of sessions. For clarity, we present our methodology in a logical and sequential manner (see Figure 3); it is important to note that the actual implementation can be carried out in parallel.

  • Filter sessions during the test drive date range between 2025-07-07 and 2025-07-11 from all generated measurement sessions to create session group 1. In this step, we use the measurement_first_data_point_timestamp and measurement_end_data_point_timestamp columns in the “t_measurement_session_metric” table to identify the relevant sessions from all recorded sessions of the vehicle.
  • Filter sessions within session group 1 that contain activated Active Steering Assist to create session group 2. In this step, we check the sessions where max_int_raw_value > 0 (assuming the signal integer raw value for activated Active Steering Assist is 1) in the “t_signal_metric” table to identify the relevant sessions from session group 1.
  • Filter sessions within session group 2 where the vehicle is driven on a lane-marked road to create session group 3. In this step, we check the sessions where max_int_raw_value > 2 (assuming the signal integer raw value for lane-marked road type is 3) in the “t_signal_metric” table to identify the relevant sessions from session group 2.
  • Filter sessions within session group 3 that contain vehicle speed in the range of 80-140 km/h to create session group 4. In this step, we check the sessions where max_float_value >= 80 AND min_float_value <= 140 in the “t_signal_metric” table to identify the relevant sessions from session group 3.
  • Filter the required signal IDs within session group 4. In this step, we use the combination of pdu_name, link_name, vlan_id, and signal_name to find the signal IDs of relevant signals.
  • Use the filtered signal IDs and measurement session IDs from session group 4 to join the measurement signal data point table and identify the ALC system readiness.
Figure 3: How the data filtering for ALC system readiness detection works
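
Conceptually, each filtering step intersects the surviving session set with a new metric predicate. The Python sketch below captures that cascade in simplified form; the session IDs, signal names, and metric values are invented for the example, and in production these predicates run as queries against the metric tables, often in parallel.

```python
# Invented per-session metric rows, keyed by (session_id, signal_name).
signal_metrics = {
    ("s1", "steering_assist"): {"max_int_raw_value": 1},
    ("s1", "road_type"):       {"max_int_raw_value": 3},
    ("s1", "vehicle_speed"):   {"min_float_value": 10.0, "max_float_value": 120.0},
    ("s2", "steering_assist"): {"max_int_raw_value": 0},
}

def filter_sessions(sessions, signal, predicate):
    """Keep only sessions whose metric row for `signal` satisfies `predicate`."""
    return {
        s for s in sessions
        if (s, signal) in signal_metrics and predicate(signal_metrics[(s, signal)])
    }

group = {"s1", "s2"}  # session group 1: result of the date-range filter
# Session group 2: Active Steering Assist was activated at least once.
group = filter_sessions(group, "steering_assist", lambda m: m["max_int_raw_value"] > 0)
# Session group 3: the vehicle drove on a lane-marked road.
group = filter_sessions(group, "road_type", lambda m: m["max_int_raw_value"] > 2)
# Session group 4: vehicle speed entered the 80-140 km/h window.
group = filter_sessions(group, "vehicle_speed",
                        lambda m: m["max_float_value"] >= 80 and m["min_float_value"] <= 140)
```

Only the sessions left in the final group are joined against the measurement signal data table, which is why the metadata cascade removes the bulk of the raw time series scan.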

Selecting the optimal data layout through real-world data & use case benchmarking

To demonstrate the performance and scalability of the described data model, we systematically benchmarked real-world measurement data and use cases. In our benchmark study, we evaluated various combinations of data layouts and optimization strategies. The benchmarks were designed to optimize for:

  1. Data layout & optimization strategies: We tested different data layout approaches, such as partitioning schemes, RLE, non-RLE, Z-Ordering, and Liquid Clustering, to optimize query performance.
  2. Scalability: We focused on solutions capable of handling the ever-growing volume of measurement data while maintaining efficiency.
  3. Cost-efficiency: We considered both storage costs and query performance to identify the most cost-effective approach for long-term data retention and analysis.

Since the benchmark results are crucial for selecting the future measurement data schema and format at Mercedes-Benz, we used production data and analytics scripts to evaluate the different options.

In practice, even minor optimizations can unlock major savings at scale, enabling thousands of engineers to extract insights safely and cost-efficiently. Benchmarking is crucial to validate the efficiency of a proposed solution and should be repeated regularly as larger changes are made to the system.

Benchmark Setup

The benchmark dataset contains measurement data from 21 distinct test vehicles, each equipped with modern automotive loggers to collect the measurement data. The collection features between 30,000 and 60,000 recorded signals per vehicle, which offer a wide range of data points for analysis. In total, the dataset represents 40,000 hours of recordings, with 12,500 hours specifically capturing data while the vehicles were live (ignition on). This dataset enables the study of various aspects of vehicle behavior and performance across different vehicles and operating conditions.

The following four analytical query categories were executed as part of the benchmark:

  1. Signal Distribution Analysis – We generated one-dimensional histograms for key signals (e.g. vehicle speed) to assess data distribution and frequency patterns.
  2. Signal Arithmetic Operations – We performed basic calculations (e.g. subtraction, ratios) on a few to thousands of signals.
  3. Test Case Identification – These queries identify and validate predefined operational scenarios within the dataset, defined by a sequence of events occurring in a given order.
  4. Detection of Readiness of the Automated Lane Change Assist System – This query makes extensive use of the metadata tables before the actual underlying time series data is queried.

Please note that in this blog post, we only present the results for categories 1 and 4, as the other categories yield similar performance results and do not provide additional insights.

To benchmark the scalability of the solution, we used four different cluster sizes. The memory-optimized Standard_E8d_v4 node type was chosen because of its delta cache feature and larger memory to hold the core metadata. As for the Databricks Runtime, 15.4 LTS was the latest available long-term support runtime. In our earlier investigation, the Photon feature proved to be more cost-efficient despite its higher DBU rate, so Photon was used in all benchmarks. Table 1 provides details of the chosen Databricks clusters.

T-Shirt Size   Node Type         DBR        #Nodes (driver + worker)   Photon
X-Small        Standard_E8d_v4   15.4 LTS   1 + 2                      yes
Small          Standard_E8d_v4   15.4 LTS   1 + 4                      yes
Medium         Standard_E8d_v4   15.4 LTS   1 + 8                      yes
Large          Standard_E8d_v4   15.4 LTS   1 + 16                     yes

Table 1: The benchmark cluster setups

Benchmark Results

The benchmark was executed on two main variants of the data model. The first variant uses run-length encoded (RLE) samples data (see section The Core Data Model), while the second variant does not use RLE. Additionally, we applied two different data layout optimizations to both variants of the data model. In the first optimization, we used hive-style partitioning to partition the measurement signal data table by the measurement_session_id column and applied the Z-Ordering technique on the signal_id column. In the second optimization, we used Liquid Clustering to cluster the measurement signal data table by measurement_session_id and signal_id.

Runtime Performance

Due to significant differences in absolute running times among the benchmarked setups, we decided to visualize the results as relative running times based on the Z-Ordering with non-RLE results. In general, across all tests we performed, Liquid Clustering (green bars) outperforms hive-style partitioning + Z-Ordering (blue bars). For the histogram of frequently changing signals, the RLE optimization reduces the runtime by roughly 60% for Z-Ordering, while it reduces the runtime by less than 10% for Liquid Clustering.

In the second use case, detection of readiness of the automated lane change system, RLE reduced the runtime by nearly 70% for Z-Ordering and more than 50% for Liquid Clustering. The overall results of the demonstrated use cases indicate that the combination of RLE and Liquid Clustering performs best on the data model.

Figure 4: Evaluation results on relative performance of the solution

Scalability

To evaluate the scalability of the solution, we executed all four analytical queries on a static dataset using different cluster sizes. In each benchmarking run, we doubled the cluster size compared to the previous run. Ideally, for a solution that scales perfectly, the runtime of a query should decrease by a factor of two with each doubling of the cluster size. However, technical limitations often prevent perfect scaling.

Figure 5 shows the results in absolute running times (seconds) for various benchmark setups for one use case, though we observed the very same pattern across all other use cases. The reference lines (yellow and blue dashed lines) represent the lower bound of running times (perfect scaling) for the two different benchmark setups. For the shown use case, the running time generally decreases almost perfectly as the cluster size increases from X-Small to Large. This indicates that the data model and optimization strategies are scalable, benefiting from additional nodes and processing power.

On the other hand, we can see that the running times of the RLE Liquid Clustering solution (blue line) start to drift away from the perfect scaling reference line beginning with the Medium cluster size. This gap becomes even more pronounced at the Large cluster size. However, it is important to note that the absolute running times for the RLE Liquid Clustering solution are significantly lower than those for RLE Z-Ordering. It is therefore expected that the RLE Liquid Clustering solution shows diminishing scalability improvements at larger cluster sizes, as its baseline running time is already exceptionally low at that stage.

Figure 5: Evaluation results on scalability of the solution

Storage Size

Our benchmark data were generated from 64.55 TB of proprietary MDF files, collected from 21 Mercedes-Benz MB.OS test vehicles over a five-month test period. To maximize query performance while maintaining an acceptable storage size, we use zstd compression for the Parquet files and set the Delta target file size to 32 MB, based on results of earlier investigations. Small file sizes are desirable in this scenario to prevent storing too many signals in the same physical file, making dynamic file pruning more efficient for highly selective queries.
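
The two layout settings named above map to standard Spark and Delta Lake configuration knobs. The sketch below records them as a plain dict for illustration; the property names are the commonly documented ones, and any further production tuning is not shown in this post, so treat the exact spelling and scope (session config vs. table property) as assumptions.

```python
# Sketch of the storage-related settings described in the text.
# spark.sql.parquet.compression.codec is a Spark session config;
# delta.targetFileSize is typically set as a Delta table property.
layout_settings = {
    "spark.sql.parquet.compression.codec": "zstd",  # Parquet compression codec
    "delta.targetFileSize": str(32 * 1024 * 1024),  # 32 MB target Delta file size
}
```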

All data layouts resulted in Delta tables comparable in size to the proprietary MDF data (see Table 2). In general, the compression rate from the raw file format to the Delta tables depends heavily on the characteristics of the MF4 files. The underlying dataset contains up to 60,000 signals per vehicle, and many of them were recorded on value change only. For these signals, compression methods like RLE have no effect. For other datasets with only thousands of, but continuously recorded, signals, we found that the storage size was reduced by >50% compared to the raw MDF files.

Our results showed that the Liquid Clustering tables were significantly larger than the Z-Ordered tables (+14% for the RLE data layouts). However, considering the running time benchmark results presented above, the additional storage required by the RLE Liquid Clustering layout is justified by its superior performance.

Format              Proprietary MDF File   RLE Z-Ordering   RLE Liquid Clustering
Storage size [TB]   64.55                  67.43            77.05

Table 2: Storage sizes of the raw data and of the different RLE data layouts

Conclusion

We developed a hierarchical semantic data model to efficiently store and analyze petabyte-scale time series data from connected vehicles on the Databricks Intelligence Platform. Designed for cost-efficient, scalable access, usability, and strong governance, the model makes it possible to turn raw telemetry into actionable insights.

Using real-world Mercedes-Benz data, we showed how hierarchical metadata tables improve analytics performance via multi-level filtering. In the Automated Lane Change readiness example, this structure enabled rapid identification of relevant sessions and signals, drastically reducing processing time.

Benchmarking revealed that combining run-length encoding (RLE) with Liquid Clustering delivered the best performance across analytical query types, outperforming RLE with Z-Ordering, especially in runtime. While it required more storage, the trade-off was justified by significant query speed gains. Scalability tests confirmed strong performance even as data volumes grew.

In the future, the Databricks team will publish solutions covering 1) how to convert MDF files into the newly introduced data model with Databricks Jobs, 2) how to govern complex data sets containing large fleets or other assets and allow for easy discovery while maintaining privacy, security, and growing complexity with Unity Catalog, and 3) a framework that lets engineers without a strong SQL or Python background efficiently gain insights from data by themselves.

In summary, the hierarchical semantic data model with RLE and Liquid Clustering offers a robust, governed, and scalable solution for automotive time series analytics, accelerating development at Mercedes-Benz and fostering data-driven collaboration toward a more sustainable, efficient future.
