AppsFlyer develops a privacy-focused, real-time measurement solution that enables marketers to accurately assess the impact of their campaigns and connect them to the wider marketing ecosystem, processing 100 billion events daily. AppsFlyer lets app owners attribute credit to the diverse user interactions that drive app installs, using advanced analytics capabilities.
AppsFlyer’s offering includes Audiences Segmentation, a feature that lets app owners precisely target and re-engage users based on their behavior and demographics. One distinctive part of the feature provides real-time estimates of user counts within specific audience segments, referred to as the Estimation feature.
The AppsFlyer team initially used Apache HBase, a popular open-source distributed database, to serve customers real-time estimates of user counts. As the workload grew to 23 TB, however, it became necessary to revisit the HBase architecture to keep meeting service level agreements (SLAs) for response time and reliability.
This post explores how AppsFlyer modernized their Audiences Segmentation product using Amazon Athena. Athena is a powerful, versatile serverless query service that lets users analyze data in a variety of formats using SQL, and it is designed to make data stored in Amazon S3 accessible through standard SQL queries.
AppsFlyer applied a range of optimization techniques, including partition projection, sorting, parallel query execution, and query result reuse. We walk through the obstacles the team faced and the strategies they employed to unlock the full capabilities of Athena, in a use case that demanded very low latency. We also discuss the testing, monitoring, and rollout process that ensured a smooth and successful migration to the new Athena architecture.
Audiences Segmentation legacy architecture and modernization drivers
Audiences are segmented in AppsFlyer’s UI by building a hierarchical tree of set operations, with standardized atomic criteria as leaf nodes.
The following diagram shows an audience segmentation example in the AppsFlyer Audiences management console, where two atomic criteria serve as leaf nodes and a set operation between them translates into a tree structure.
Using Apache Spark, the AppsFlyer team built an efficient probabilistic data structure, a sketch (specifically, Theta Sketches), to count unique elements in real time and give customers accurate audience size estimates. These sketches greatly improve scalability and analytic capability. The sketches were initially stored in the HBase database.
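To make the idea concrete, here is a minimal, illustrative example using the Apache DataSketches Python bindings (pip install datasketches); the user IDs and criteria are invented. It shows how a sketch estimates a distinct count and how a set operation (here, an intersection) combines two leaf-node sketches, mirroring the audience tree described above.

```python
# A minimal sketch of distinct counting and set operations with Theta Sketches;
# names and sizes are illustrative, not AppsFlyer's actual data.
from datasketches import update_theta_sketch, theta_intersection

installs = update_theta_sketch()   # leaf node: users who installed
purchases = update_theta_sketch()  # leaf node: users who purchased

for user_id in range(100_000):
    installs.update(f"user-{user_id}")
    if user_id % 10 == 0:
        purchases.update(f"user-{user_id}")   # every tenth user also purchased

# Each leaf answers "how many distinct users?" with a fast, bounded-error estimate.
print(f"~{installs.get_estimate():,.0f} installers")

# A set operation combines leaves into the audience tree.
both = theta_intersection()
both.update(installs)
both.update(purchases)
print(f"~{both.get_result().get_estimate():,.0f} users installed and purchased")
```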
HBase is an open-source, distributed, column-oriented database designed to handle large data volumes on commodity hardware with scalable performance.
Original data structure
In this post, we focus on the events table, the largest table initially stored in HBase. The table had the schema

date | app-id | event-name | sketch

and was partitioned by date and app-id.
The following diagram shows the high-level architecture of AppsFlyer’s Estimations system.
The architecture featured an Airflow ETL process that triggered jobs to create sketch files from the source dataset, followed by the import of these files into HBase. Customers could then use an API service to query HBase and retrieve estimates of user counts according to the audience segments defined in the UI.
For further information on the previous HBase architecture, refer to .
As time passed, the workload grew beyond the initial design of the HBase implementation, with storage reaching 23 TB. To keep meeting AppsFlyer’s SLAs for response time and reliability, it became clear that the HBase architecture needed to be re-examined.
As mentioned, this use case involves daily interaction between clients and the UI, so the system must comply with a standard UI SLA: fast response times and the ability to handle a substantial volume of daily requests, while accommodating the current data volume and future growth.
The team therefore needed a solution that was more manageable, easier to operate, and more cost-efficient than the existing HBase infrastructure, without compromising the overall system architecture or introducing unnecessary complexity.
After team discussions and consultations with AWS experts, the team concluded that a solution using Amazon S3 and Athena was the most cost-effective and straightforward option. The primary concern was query latency, and the team was careful to avoid any detrimental effects on the overall customer experience.
The following diagram shows the new architecture built on Athena. Notice that the import-..-sketches-to-hbase flow and HBase have been removed; instead, the sketch files stay in Amazon S3 and are queried directly with Athena.
Schema design and performance optimization
In this section, we discuss the schema design in the new architecture and the performance optimization techniques the team applied, together with partition projection.
Merging data for partition reduction
To explore whether Athena could power Audiences Segmentation, the team first ran a proof of concept scoped to events from just three app-ids. Roughly 3 GB of data was partitioned by app-id and by date, using the same partitioning schema as the HBase implementation. As the team scaled to the full dataset of 10,000 app-ids over a one-month timeframe, yielding approximately 150 GB of data, they began noticing slower query execution, most noticeably for queries spanning long date ranges. Digging in, the team discovered that Athena spent considerable time during the query startup phase loading the large number of partitions (7.3 million) from the AWS Glue Data Catalog; for more information on using Athena with AWS Glue, refer to the relevant documentation.
This led the team to explore partition indexing. AWS Glue partition indexes create metadata indexes on partition columns, enabling efficient partition pruning in Athena and reducing the amount of data that needs to be read from Amazon S3. Although partition indexing sped up partition discovery during the query startup phase, the improvement was insufficient to meet the required query latency SLA.
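For reference, a partition index is defined on the partition columns of a Glue table; a hypothetical boto3 call might look like the following (the database, table, and index names are invented):

```python
# Hypothetical example: adding an AWS Glue partition index so Athena can prune
# partitions during query planning. Names are assumptions for illustration.
import boto3

glue = boto3.client("glue")
glue.create_partition_index(
    DatabaseName="audiences",            # assumed database name
    TableName="events_sketches",         # assumed table name
    PartitionIndex={
        "IndexName": "date_app_id_idx",
        "Keys": ["date", "app_id"],      # partition columns to index
    },
)
```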
To work around the limitations of partition indexing, the team explored another approach: aggregating data from daily to monthly granularity to reduce the number of partitions. Using the Theta Sketches union operation, the team merged daily sketches into monthly composites, condensing 30 daily entries into a single monthly entry and achieving a roughly 97% reduction in the number of entries.
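The rollup itself is a straightforward union. The following illustrative example (Apache DataSketches Python bindings; the data volumes and helper name are invented) merges 30 daily sketches into one monthly sketch:

```python
# Illustrative daily-to-monthly rollup using the Theta Sketches union operation.
from datasketches import theta_union, update_theta_sketch

def monthly_sketch(daily_sketches):
    """Merge ~30 daily sketches into a single monthly sketch."""
    union = theta_union()
    for sk in daily_sketches:
        union.update(sk)
    return union.get_result()  # one compact entry instead of 30

# Fake data: 30 days of partially overlapping users.
days = []
for day in range(30):
    sk = update_theta_sketch()
    for user_id in range(day * 500, day * 500 + 2_000):
        sk.update(f"user-{user_id}")
    days.append(sk)

print(f"~{monthly_sketch(days).get_estimate():,.0f} distinct users this month")
```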
This technique cut partition discovery time by about 30% from the initial 10-15 seconds, and it also reduced the amount of data that had to be scanned. Nevertheless, given the UI’s responsiveness requirements, the latency targets were still not met.
Moreover, the merging process reduced the accuracy of the estimates, so the team needed to examine alternative solutions.
Partition projection as a breakthrough
At this point, the team decided to explore Athena partition projection.
Partition projection in Athena improves query performance by computing partition metadata from configured rules rather than retrieving it from the catalog. Partitions don’t need to be pre-defined in the Data Catalog; Athena generates and discovers them on the fly. This feature is especially valuable for tables with very large numbers of partitions, or where partitions are created rapidly, as with streaming data.
As established earlier, in this use case each leaf node is translated into a query that must include a date range, an app-id, and an event-name. This led the team to define the projection columns using a date range for date, and injected values for app-id and event-name.
Rather than scanning and loading all partition metadata from the catalog, Athena can generate the partitions dynamically using the preconfigured rules and the values specified in the query. The required partition information is produced at query time, with no extra time spent retrieving and processing partition metadata from a catalog.
Partition projection mitigated the performance penalty caused by the large number of partitions and reduced latency during query runs.
Because partition projection removed the dependency between the number of partitions and query runtime, the team could experiment with an additional partition column: event-name. Partitioning by three columns (date, app-id, and event-name) reduced the amount of data scanned, yielding a 10% improvement in query performance compared to partition projection with data partitioned only by date and app-id.
The following diagram shows the high-level data flow of sketch file creation, focusing on the sketch-writing job (write-events-estimation-sketches).
However, writing the data to Amazon S3 with three partition fields took roughly twice as long as with the two-field structure, because the extra partition column produced about 20 times more sketch files. The team therefore dropped the event-name partition and compromised on two partitions, date and app-id, with objects laid out as follows:

s3://bucket/table_root/date=${day}/app_id=${app_id}
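Under these choices, a table definition with partition projection enabled might look like the following sketch, submitted through Athena with boto3. The database, table, columns, and buckets are assumptions, and the location template uses ${date} rather than the ${day} placeholder above, because projection templates must reference the partition column names.

```python
# A hedged sketch of the final table layout: two projected partition columns
# (date and app_id), Parquet storage, and a location template matching the
# S3 path pattern shown above. All names are illustrative.
import boto3

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS audiences.events_sketches (
    event_name string,
    sketch     binary
)
PARTITIONED BY (`date` string, app_id string)
STORED AS PARQUET
LOCATION 's3://bucket/table_root/'
TBLPROPERTIES (
    'projection.enabled'        = 'true',
    'projection.date.type'      = 'date',
    'projection.date.range'     = '2022-01-01,NOW',
    'projection.date.format'    = 'yyyy-MM-dd',
    'projection.app_id.type'    = 'injected',
    'storage.location.template' = 's3://bucket/table_root/date=${date}/app_id=${app_id}'
)
"""

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
)
```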
Using the Parquet file format
The team chose the Parquet file format for the new architecture. Apache Parquet is a widely used, open-source, columnar data file format designed for efficient data storage and retrieval. Each Parquet file contains metadata, such as the minimum and maximum values of each column, that allows the query engine to skip loading unneeded data. By reducing the amount of data that must be scanned, Athena can skip sections of a Parquet file that are irrelevant to the query, which substantially improves query performance.
Parquet is particularly effective for queries on sorted fields, because sorting enables Athena’s predicate pushdown optimization to quickly identify and retrieve only the relevant data segments. To learn more about this feature of the Parquet file format, refer to the documentation.
Recognizing this advantage, the team decided to sort the data by event-name, improving query performance by 10% compared to unsorted data. Initially, they had tried partitioning by event-name to improve performance, but that actually increased the time needed to write the data to Amazon S3. Sorting delivered the performance gain without the write overhead.
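A minimal PySpark sketch of that final write path, partitioning by date and app_id while sorting by event_name within each output file, under assumed paths and schema:

```python
# Writes Parquet partitioned by date/app_id with rows sorted by event_name,
# so Parquet min/max column statistics stay tight and Athena's predicate
# pushdown can skip irrelevant row groups. Paths and schema are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-events-estimation-sketches").getOrCreate()

df = spark.read.parquet("s3://bucket/input/")  # columns: date, app_id, event_name, sketch

(
    df.repartition("date", "app_id")          # group each partition's rows together
      .sortWithinPartitions("event_name")     # sort inside each output file
      .write.mode("overwrite")
      .partitionBy("date", "app_id")          # s3://.../date=.../app_id=.../
      .parquet("s3://bucket/table_root/")
)
```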
Query optimization and parallel queries
The team found that performance could be improved further through parallel query execution. Instead of issuing a single query over a long date range, they ran multiple queries over shorter intervals in parallel. Although this made assembling the response more complex, it yielded a 20% performance improvement in typical use cases.
For example, consider a request to estimate the audience size of app com.demo and event af_purchase between April 2024 and the end of June 2024. As demonstrated earlier, the timeline defined by the customer is translated into an atomic leaf, which is then decomposed into several queries depending on the date range. The following diagram shows how the initial 3-month query is split into two concurrent queries of up to 60 days each, with their results merged afterward.
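The decomposition can be sketched as follows, reusing the hypothetical table from the earlier examples: the date range is split in two, both Athena queries run concurrently, and the resulting sketches are then merged with a Theta union.

```python
# Illustrative parallel execution of two sub-range queries; the table name,
# output location, and naive polling loop are assumptions, not production code.
from concurrent.futures import ThreadPoolExecutor
from datetime import date
import time
import boto3

athena = boto3.client("athena")

def run_query(start: date, end: date) -> str:
    sql = f"""
        SELECT sketch FROM audiences.events_sketches
        WHERE app_id = 'com.demo' AND event_name = 'af_purchase'
          AND "date" BETWEEN '{start}' AND '{end}'
    """
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
    )["QueryExecutionId"]
    while True:  # simple polling; real code would time out and back off
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid
        time.sleep(0.5)

# Split the 3-month range into two sub-ranges and run them concurrently.
ranges = [(date(2024, 4, 1), date(2024, 5, 15)), (date(2024, 5, 16), date(2024, 6, 30))]
with ThreadPoolExecutor() as pool:
    query_ids = list(pool.map(lambda r: run_query(*r), ranges))
# The sketches from both result sets are then combined with a Theta union.
```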
Reducing result set size
When analyzing performance bottlenecks, the team observed distinct query patterns and run characteristics, and found that certain queries were slow to return results. The issue wasn’t the query run itself, but the data transfer from Amazon S3: query results often contained a huge number of rows, sometimes tens of thousands.
The original plan of holding all key-value attribute combinations in a single field led to a substantial up-front increase in the number of rows returned. To overcome this, the team introduced new event-attr-key and event-attr-value fields, splitting the sketches by distinct key-value pairs.
The new schema unfolded in the following structure:

date | app_id | event_name | event_attr_key | event_attr_value | sketch
This refactoring significantly reduced the number of output rows, which substantially accelerated the GetQueryResults process and improved overall query run time by 90%.
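For illustration, a query against the refactored schema might look like the following hypothetical example, assuming the earlier table now carries the two new columns (the table name and attribute values are invented). Filtering on event_attr_key and event_attr_value returns only the matching sketch rows rather than every attribute combination.

```python
# Hypothetical query against the refactored schema; only rows for one
# key-value pair come back, keeping the result set small.
QUERY = """
SELECT sketch
FROM audiences.events_sketches
WHERE "date" BETWEEN '2024-04-01' AND '2024-06-30'
  AND app_id = 'com.demo'
  AND event_name = 'af_purchase'
  AND event_attr_key = 'af_revenue'
  AND event_attr_value = '9.99'
"""
```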
Athena query result reuse
In everyday use of the Audiences Segmentation GUI, users often make small adjustments to their queries, such as refining filters or slightly shifting timeframes. To handle these efficiently, the team enabled the Athena query result reuse feature, which improves query performance and reduces costs by caching and reusing the results of previous queries. The feature is especially important in combination with the date-range splitting described earlier: reusing cached results means these small but recurring updates don’t require the queries to be fully reprocessed.
As a result, latency for successive query iterations dropped by up to 80%, giving customers faster access to insights. The optimization also reduces cost, because Athena doesn’t need to rescan data for every minor update.
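Result reuse is enabled per query through the Athena API; here is a brief boto3 sketch, where the table name and the one-hour maximum age are illustrative choices:

```python
# Enabling Athena query result reuse: if an identical query ran within the
# last hour, Athena returns the cached result instead of rescanning S3.
import boto3

athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT sketch FROM audiences.events_sketches WHERE app_id = 'com.demo'",
    ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
    ResultReuseConfiguration={
        "ResultReuseByAgeConfiguration": {"Enabled": True, "MaxAgeInMinutes": 60}
    },
)
```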
Solution rollout: Testing and monitoring
In this section, we discuss the rollout of the new architecture, along with the testing and monitoring that accompanied it.
Fixing Amazon S3 slowdown errors
During solution testing, the team built an automation framework that exercised audience segments in the system against data stored in the new schema. The framework compared the results produced by HBase and by Athena for the same segments, checking both the accuracy of the retrieved estimates and the change in latency.
During these tests, the team ran into failures when executing many queries simultaneously: the concurrent Athena queries generated a large number of GET requests to the same Amazon S3 prefix, which triggered S3 slowdown errors.
To mitigate the throttling, the team implemented a retry mechanism for query executions with exponential backoff: wait times increase between attempts, and a randomized jitter component prevents retries from firing simultaneously and causing further congestion.
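A generic version of such a retry helper, with exponential backoff capped at a maximum delay plus random jitter (the limits and the wrapped call are assumptions):

```python
# Retry a callable with exponential backoff and jitter to avoid synchronized
# retries hammering the same S3 prefix after a throttling error.
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                     # give up after the last attempt
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))  # jitter de-synchronizes clients

# Example: with_backoff(lambda: run_query(date(2024, 4, 1), date(2024, 5, 15)))
```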
Rollout preparations
Initially, the team ran a cost-effective one-month proof of concept, focusing on validating data accuracy before committing to the full two-year backfill.
The backfilling was performed by running the Spark job (write-events-estimation-sketches) over successive time ranges. The job read data from the data warehouse, built sketches from it, and wrote them in the schema the team had defined. Because the team used partition projection, they could skip updating the Data Catalog with each newly added partition.
This incremental approach let them verify the accuracy of the results before processing the entire historical dataset.
Confident in the accurate results from the initial phase, the team extended the backfill to cover the full 24 months, ensuring a smooth and reliable rollout.
Before the official release of the updated solution, the team set up a monitoring dashboard to track stability, covering key metrics such as query response times, API latency, error rates, and API uptime.
Once the data was stored in Amazon S3 in Parquet format, the deployment plan proceeded as follows:
- Run both the HBase and Athena flows in parallel, stop reading from HBase, and start reading from Athena.
- Stop writing to HBase.
- Sunset HBase.
Enhancements and optimizations with Athena
The move from HBase to Athena, with partition projection and the optimized data structures, not only improved query performance by 10% but also improved overall system stability, since Athena scans only the necessary data partitions. In addition, moving to the serverless model with Athena cut monthly costs by 80% compared to the previous infrastructure. By removing infrastructure management overhead and aligning spend with actual usage, the team positioned itself for more sustainable operations, deeper data analysis, and better business outcomes.
The following table summarizes the improvements and optimizations achieved by the team.
| Optimization | Description | Improvement |
| --- | --- | --- |
| Athena partition projection | Partition projection over a very large number of partitions, including the event_name and app_id columns | Significant improvement in query performance. This was the most crucial change, which made the solution feasible. |
| Partitioning and sorting | Partitioning by app_id and sorting event_name, with daily granularity | 100% improvement in the sketch-writing job calculations; 5% latency improvement in query performance. |
| Time range queries | Splitting long time-range queries into multiple parallel queries | 20% improvement in query performance. |
| Reducing result set size | Schema refactoring | 90% improvement in overall query performance. |
| Query result reuse | Supporting Athena query result reuse | Up to 80% improvement in latency for repeated queries within the reuse window. |
Conclusion
In this post, we showed how AppsFlyer adopted Athena as the foundation of its Audiences Segmentation feature, and the optimization techniques the team explored along the way, such as data merging, schema refactoring, parallel queries, and partition indexing and projection.
We hope our experience provides useful guidance for improving the performance of your own Athena-based workloads, and encourages you to experiment with these techniques to achieve the best results.
About the Authors
Nofar is a software team lead at AppsFlyer, currently focused on fraud protection. Before that, she led the Retargeting team at AppsFlyer, whose work is the subject of this post. In her free time, Nofar enjoys sports and is passionate about mentoring women in her field. She is a dedicated advocate for diversity in engineering, committed to encouraging more young women to pursue careers in the field and empowering them to thrive.
Matan is a backend developer with deep big data expertise on the Retargeting team at AppsFlyer. Before joining AppsFlyer, he worked as a backend developer in the Israel Defense Forces (IDF) and earned his MSc in Electrical Engineering from Ben-Gurion University (BGU), specializing in computer systems. In his free time, he enjoys surfing, yoga, traveling, and playing the guitar.
Michael is a Principal Solutions Architect at Amazon Web Services, where he works closely with key AWS customers to design and build cloud solutions that drive their digital transformation. He enjoys crafting cloud infrastructure that balances scalability, reliability, and cost-effectiveness, and he is passionate about sharing his knowledge of SaaS, analytics, and other domains to help customers elevate their cloud capabilities.
He is a Senior Technical Account Manager at Amazon Web Services, serving as a trusted advisor who helps customers achieve smooth cloud operations, streamline processes, and align AI/ML solutions with their strategic goals.