This post is written in collaboration with Philipp Karg from BMW Group.
Companies increasingly require scalable, cost-efficient architectures to process and transform large datasets. At the BMW Group, our Cloud Efficiency Analytics (CLEA) team has developed a FinOps solution to optimize costs across over 10,000 cloud accounts. While enabling organization-wide efficiency, the team also applied these principles to its own data architecture, making sure that CLEA itself operates frugally. After evaluating various tools, we built a serverless data transformation pipeline using Amazon Athena and dbt.
This post explores our journey, from the initial challenges to our current architecture, and details the steps we took to achieve a highly efficient, serverless data transformation setup.
Challenges: Starting from a rigid and costly setup
In our early phases, we encountered several inefficiencies that made scaling difficult. We were managing complex schemas with wide tables that required significant maintenance effort. Initially, we used Terraform to create tables and views in Athena, allowing us to manage our data infrastructure as code (IaC) and automate deployments through continuous integration and delivery (CI/CD) pipelines. However, this approach slowed us down when changing data models or dealing with schema changes, requiring high development effort.
As our solution grew, we faced challenges with query performance and costs. Each query scanned large amounts of raw data, resulting in increased processing time and higher Athena costs. We used views to provide a clean abstraction layer, but this masked underlying complexity, because seemingly simple queries against these views scanned large volumes of raw data, and our partitioning strategy wasn't optimized for these access patterns. As our datasets grew, the lack of modularity in our data design increased complexity, making scalability and maintenance increasingly difficult. We needed a solution for pre-aggregating, computing, and storing the query results of computationally intensive transformations. The absence of robust testing and lineage features made it challenging to identify the root causes of data inconsistencies when they occurred.
As part of our business intelligence (BI) solution, we used Amazon QuickSight to build our dashboards, providing visual insights into our cloud cost data. However, our initial data architecture led to challenges. We were building dashboards on top of large, wide datasets, with some hitting the QuickSight per-dataset SPICE limit of 1 TB. Additionally, during SPICE ingestion, our largest datasets required 4–5 hours of processing time because they performed full scans each time, often scanning over a terabyte of data. This architecture wasn't helping us be more agile and fast while scaling up. The long processing times and storage limitations hindered our ability to provide timely insights and expand our analytics capabilities.
To address these issues, we enhanced the data architecture with AWS Lambda, AWS Step Functions, AWS Glue, and dbt. This tool stack significantly improved our development agility, empowering us to quickly modify and introduce new data models. At the same time, we improved our overall data processing efficiency with incremental loads and better schema management.
Solution overview
Our current architecture consists of a serverless and modular pipeline orchestrated by GitHub Actions workflows. We chose Athena as our primary query engine for several strategic reasons: it aligns perfectly with our team's SQL expertise, excels at querying Parquet data directly in our data lake, and removes the need for dedicated compute resources. This makes Athena an ideal fit for CLEA's architecture, where we process around 300 GB daily from a data lake of 15 TB, with our largest dataset containing 50 billion rows across up to 400 columns. Athena's ability to efficiently query large-scale Parquet data, combined with its serverless nature, allows us to focus on writing efficient transformations rather than managing infrastructure.
The following diagram illustrates the solution architecture.
Using this architecture, we've streamlined our data transformation process with dbt. In dbt, a data model represents a single SQL transformation that creates either a table or a view—essentially a building block of our data transformation pipeline. Our implementation includes around 400 such models, 50 data sources, and around 100 data tests. This setup enables seamless updates—whether creating new models, updating schemas, or modifying views—triggered simply by creating a pull request in our source code repository, with the rest handled automatically.
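For illustration, a minimal model might look like the following (a sketch only; the model, source, and column names are hypothetical rather than taken from our actual project):

```sql
-- models/semantic/account_monthly_cost.sql (hypothetical name)
{{ config(materialized='view') }}

select
    account_id,
    date_trunc('month', usage_date) as usage_month,
    sum(unblended_cost) as monthly_cost
from {{ ref('prepared_cost_usage') }}
group by 1, 2
```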
Our workflow automation includes the following features:
- Pull request – When we create a pull request, it's deployed to our testing environment first. After passing validation and being approved and merged, it's deployed to production using GitHub workflows. This setup enables seamless model creation, schema updates, or view changes—triggered simply by creating a pull request, with the rest handled automatically.
- Cron scheduler – For nightly runs or multiple daily runs to reduce data latency, we use scheduled GitHub workflows. This setup allows us to configure specific models with different update strategies based on data needs. We can set models to update incrementally (processing only new or changed data), as views (querying without materializing data), or as full loads (completely refreshing the data); a model sketch illustrating these strategies follows this list. This flexibility optimizes processing time and resource usage. We can target only specific folders—like the source, prepared, or semantic layers—and run dbt test afterward to validate model quality.
- On demand – When adding new columns or changing business logic, we need to update historical data to maintain consistency. For this, we use a backfill process, a custom GitHub workflow created by our team. The workflow lets us select specific models, include their upstream dependencies, and set parameters like start and end dates. This makes sure that changes are applied accurately across the entire historical dataset, maintaining data consistency and integrity.
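The following is a minimal sketch of how such an incremental model might be configured, assuming the dbt-athena adapter's incremental_strategy, partitioned_by, and format settings; the model, source, column, and variable names are hypothetical:

```sql
-- models/prepared/prepared_cost_usage.sql (hypothetical name)
-- materialized can be switched to 'view' or 'table' for the other update strategies
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partitioned_by=['billing_date'],
    format='parquet'
) }}

select
    account_id,
    service_name,
    usage_date,
    unblended_cost,
    billing_date
from {{ source('raw', 'cost_usage') }}
{% if is_incremental() %}
-- scheduled runs process only a recent window; the backfill workflow can
-- override start_date and end_date (for example via dbt's --vars flag)
where billing_date between date '{{ var("start_date", "2024-01-01") }}'
                       and date '{{ var("end_date", "2024-01-07") }}'
{% endif %}
```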
Our pipeline is organized into three main stages—Source, Prepared, and Semantic—each serving a specific purpose in our data transformation journey. The Source stage maintains raw data in its original form. The Prepared stage cleanses and standardizes this data, handling tasks like deduplication and data type conversions. The Semantic stage transforms the prepared data into business-ready models aligned with our analytical needs. An additional QuickSight step handles visualization requirements. To achieve low cost and high performance, we use dbt models and SQL code to manage all transformations and schema changes. By implementing incremental processing strategies, our models process only new or changed data rather than reprocessing the entire dataset with each run.
The Semantic stage (not to be confused with dbt's semantic layer feature) introduces business logic, transforming data into aggregated datasets that are directly consumable by BMW's Cloud Data Hub, internal CLEA dashboards, data APIs, or the In-Console Cloud Assistant (ICCA) chatbot. The QuickSight step further optimizes data by selecting only necessary columns through a column-level lineage solution and setting a dynamic date filter with a sliding window to ingest only relevant hot data into SPICE, avoiding unused data in dashboards or reports.
This approach aligns with BMW Group's broader data strategy, which includes streamlining data access using AWS Lake Formation for fine-grained access control.
Overall, at a high level, we've fully automated schema changes, data updates, and testing through GitHub pull requests and dbt commands. This approach enables controlled deployment with robust version control and change management. Continuous testing and monitoring workflows uphold data accuracy, reliability, and quality across transformations, supporting efficient, collaborative model iteration.
Key benefits of the dbt-Athena architecture
To design and manage dbt models effectively, we use a multi-layered approach combined with cost and performance optimizations. In this section, we discuss how our approach has yielded significant benefits in five key areas.
SQL-based, developer-friendly environment
Our team already had strong SQL skills, so dbt's SQL-centric approach was a natural fit. Instead of learning a new language or framework, developers could immediately start writing transformations using familiar SQL syntax with dbt. This familiarity aligns well with the SQL interface of Athena and, combined with dbt's added functionality, has increased our team's productivity.
Behind the scenes, dbt automatically handles synchronization between Amazon Simple Storage Service (Amazon S3), the AWS Glue Data Catalog, and our models. When we need to change a model's materialization type—for example, from a view to a table—it's as simple as updating a configuration parameter rather than rewriting code. This flexibility has reduced our development time dramatically, allowing us to focus on building better data models rather than managing infrastructure.
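For example, promoting a model from a view to a physical Parquet table is a one-parameter change in its config block (a sketch; the referenced model name is hypothetical):

```sql
-- previously: materialized='view'; switching to a table is a one-line config change
{{ config(materialized='table', format='parquet') }}

select * from {{ ref('prepared_cost_usage') }}
```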
Agility in modeling and deployment
Documentation is crucial for any data platform's success. We use dbt's built-in documentation capabilities and publish the output to GitHub Pages, which creates an accessible, searchable repository of our data models. This documentation includes table schemas, relationships between models, and usage examples, enabling team members to understand how models interconnect and how to use them effectively.
We use dbt's built-in testing capabilities to implement comprehensive data quality checks. These include schema tests that verify column uniqueness, referential integrity, and null constraints, as well as custom SQL tests that validate business logic and data consistency. The testing framework runs automatically on every pull request, validating data transformations at each step of our pipeline. Additionally, dbt's dependency graph provides a visual representation of how our models interconnect, helping us understand the upstream and downstream impacts of any change before we implement it. When stakeholders need to modify models, they can submit changes through pull requests, which, after they're approved and merged, automatically trigger the necessary data transformations through our CI/CD pipeline. This streamlined process has enabled us to create new data products within days instead of weeks and reduced ongoing maintenance work by catching issues early in the development cycle.
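Custom checks live as plain SQL files in the dbt tests directory; dbt reports a test as failed if its query returns any rows. A hypothetical example validating business logic might look like the following:

```sql
-- tests/assert_no_negative_costs.sql (hypothetical singular test)
-- any rows returned here are reported as test failures by dbt
select
    account_id,
    usage_date,
    unblended_cost
from {{ ref('prepared_cost_usage') }}
where unblended_cost < 0
```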
Athena workgroup separation
We use Athena workgroups to isolate different query patterns based on their execution triggers and purposes. Each workgroup has its own configuration and metric reporting, allowing us to monitor and optimize each one individually. The dbt workgroup handles our scheduled nightly transformations and on-demand updates triggered by pull requests across our Source, Prepared, and Semantic stages. The dbt-test workgroup executes automated data quality checks during pull request validation and nightly builds. The QuickSight workgroup manages SPICE data ingestion queries, and the Ad-hoc workgroup supports interactive data exploration by our team.
Each workgroup can be configured with specific data usage quotas, enabling teams to implement granular governance policies. This separation provides several benefits: it allows clear cost allocation, provides isolated monitoring of query patterns across different use cases, and helps enforce data governance through custom workgroup settings. Amazon CloudWatch monitoring per workgroup helps us track usage patterns, identify query performance issues, and adjust configurations based on actual needs.
Using QuickSight SPICE
QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) provides powerful in-memory processing capabilities that we've optimized for our specific use cases. Rather than loading entire tables into SPICE, we create specialized views on top of our materialized semantic models. These views are carefully crafted to include only the necessary columns, relevant metadata joins, and appropriate time filtering so that only recent data is available in dashboards.
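A simplified sketch of such a view (model and column names are hypothetical) selects only the dashboard columns and applies a sliding date window:

```sql
-- models/quicksight/qs_cost_dashboard.sql (hypothetical name)
{{ config(materialized='view') }}

select
    account_id,
    service_name,
    usage_date,
    daily_cost
from {{ ref('semantic_cost_aggregates') }}
-- sliding window: only recent "hot" data is ingested into SPICE
where usage_date >= date_add('month', -13, current_date)
```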
We've implemented a hybrid refresh strategy for these SPICE datasets: daily incremental updates keep the data fresh, and weekly full refreshes maintain data consistency. This approach strikes a balance between data freshness and processing efficiency. The result is responsive dashboards that maintain high performance while keeping processing costs under control.
Scalability and cost-efficiency
The serverless architecture of Athena eliminates manual infrastructure management, automatically scaling based on query demand. Because costs are based solely on the amount of data scanned by queries, optimizing queries to scan as little data as possible directly reduces our costs. We use Athena's distributed query execution capabilities through our dbt model structure, enabling parallel processing across data partitions. By implementing effective partitioning strategies and using the Parquet file format, we minimize the amount of data scanned while maximizing query performance.
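As an illustration of this cost model, a query that filters on a partition column scans only the matching Parquet partitions (the table and column names below are hypothetical):

```sql
-- Athena prunes partitions using the predicate on billing_date,
-- so only a few days of Parquet files are scanned and billed
select
    account_id,
    sum(unblended_cost) as cost_last_7_days
from analytics.prepared_cost_usage
where billing_date >= date_add('day', -7, current_date)
group by account_id
```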
Our architecture offers flexibility in how we materialize data: as views, full tables, or incremental tables. With dbt's incremental models and our partitioning strategy, we process only new or changed data instead of entire datasets. This approach has proven highly effective—we've observed significant reductions in data processing volume as well as data scanning, particularly in our QuickSight workgroup.
The effectiveness of these optimizations, implemented at the end of 2023, is visible in the following diagram, which shows costs by Athena workgroup.
The workgroups are illustrated as follows:
- Green (QuickSight): Shows reduced data scanning post-optimization.
- Light blue (Ad-hoc): Varies based on analysis needs.
- Dark blue (dbt): Maintains consistent processing patterns.
- Orange (dbt-test): Shows regular, efficient test execution.
The increased dbt workload costs directly correlate with decreased QuickSight costs, reflecting our architectural shift from using complex views in the QuickSight workgroup (which previously masked query complexity but led to repeated computations) to using dbt to materialize those transformations. Although this increased the dbt workload, overall cost-efficiency improved significantly because materialized tables reduced redundant computations in QuickSight. This demonstrates how our optimization strategies successfully handle growing data volumes while achieving a net cost reduction through efficient data materialization patterns.
Conclusion
Our data architecture uses dbt and Athena to provide a scalable, cost-efficient, and flexible framework for building and managing data transformation pipelines. Athena's ability to query data directly in Amazon S3 removes the need to move or copy data into a separate data warehouse, and its serverless model, together with dbt's incremental processing, minimizes both operational overhead and processing costs. Given our team's strong SQL expertise, expressing these transformations in SQL through dbt and Athena was a natural choice, enabling rapid model development and deployment. With dbt's automatic documentation and lineage, troubleshooting and identifying data issues is simplified, and the system's modularity allows for quick adjustments to meet evolving business needs.
Getting started with this architecture is quick and straightforward: all that's needed is the dbt-core and dbt-athena libraries, and Athena itself requires no setup, because it's a fully serverless service with seamless integration with Amazon S3. This architecture is ideal for teams looking to rapidly prototype, test, and deploy data models while optimizing resource usage, accelerating deployment, and providing high-quality, accurate data processing.
For those interested in a managed solution from dbt, see From data lakes to insights: dbt adapter for Amazon Athena now supported in dbt Cloud.
About the Authors
Philipp Karg is a Lead FinOps Engineer at BMW Group and has a strong background in data engineering, AI, and FinOps. He focuses on driving cloud efficiency initiatives and fostering a cost-aware culture within the company to use the cloud sustainably.
Selman Ay is a Data Architect specializing in end-to-end data solutions, architecture, and AI on AWS. Outside of work, he enjoys playing tennis and engaging in outdoor activities.
Cizer Pereira is a Senior DevOps Architect at AWS Professional Services. He works closely with AWS customers to accelerate their journey to the cloud. He has a deep passion for cloud and DevOps solutions, and in his free time, he also enjoys contributing to open source projects.