Monday, March 31, 2025

Behind the curtain: Migrating from OpenSearch to Rockset for real-time clinical trial monitoring on AWS with DynamoDB

Medical Ink is a suite of software applications used in over a thousand clinical studies to simplify data collection and management, improving trial efficiency and precision. The cloud-based platform collects and aggregates clinical trial data in real time from over 2 million patients across 110 countries, drawing on diverse sources such as electronic health records and wearable devices.

As the COVID-19 pandemic accelerated the shift toward digital clinical trials, Medical Ink emerged as a crucial enabler of remote monitoring and data collection. Rather than requiring trial participants to return onsite to report patient outcomes, trials can move monitoring into the home. As a result, trials are developed and deployed faster, and patient enrollment and retention improve.

Clinical trial sponsors turned to Medical Ink for seamless, real-time analysis of global study data, requiring a comprehensive 360-degree view of patients and outcomes across entire studies in a remote-first setting. With a centralized, real-time analytics dashboard and robust filtering capabilities, clinical teams can swiftly respond to individual patient queries and evaluate progress in real time, improving the odds of trial success. The 360-degree view was designed as a comprehensive information hub for clinical trial organizations, letting research teams move from high-level overviews down to detailed insights and track progress across diverse geographic locations.

When we received the requirements for a brand-new real-time study participant monitoring system, my team and I recognized that our existing technical infrastructure could not support millisecond-latency complex analytics on live data streams. Amazon OpenSearch, a fork of Elasticsearch that we originally used for application search, proved relatively fast but was not purpose-built for complex analytics and lacked the specialized features we needed. The cloud data warehouse our analysts relied on for business intelligence suffered from latency too high to meet the application’s speed requirements. We went back to the drawing board to design a new architecture that supports seamless data ingestion, complex analytics, and robust resilience.

The Before Architecture

Medical Ink’s architecture before user-facing analytics

Amazon DynamoDB for Operational Workloads

The Medical Ink platform securely stores personal health data from third-party vendors, web applications, mobile devices, and wearables. Amazon DynamoDB’s flexible schema makes it easy to store and retrieve data in a variety of formats, which is particularly valuable for Medical Ink’s application and its dynamic, semi-structured data. Because DynamoDB is a fully serverless database, the team did not have to manage the underlying infrastructure or worry about scaling; Amazon Web Services (AWS) handles both.
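As a rough illustration of that flexibility, the sketch below writes two differently shaped records to the same DynamoDB table with the AWS SDK for JavaScript v3. The table name, key design, and fields are hypothetical, not Medical Ink’s actual schema.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

// The document client marshals plain JavaScript objects into DynamoDB items.
const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({ region: "us-west-2" }));

async function storeSemiStructuredRecords() {
  // Record shaped like an electronic health record export.
  await ddb.send(new PutCommand({
    TableName: "patient-data", // hypothetical table name
    Item: {
      pk: "PATIENT#123",
      sk: "EHR#2025-03-01",
      source: "ehr",
      vitals: { systolicBp: 118, diastolicBp: 76 },
    },
  }));

  // Record from a wearable device with a different shape; DynamoDB only
  // enforces the key schema, so both items can coexist in one table.
  await ddb.send(new PutCommand({
    TableName: "patient-data",
    Item: {
      pk: "PATIENT#123",
      sk: "WEARABLE#2025-03-01T09:15:00Z",
      source: "wearable",
      readings: [{ metric: "heart_rate", value: 72 }],
    },
  }));
}
```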

Amazon OpenSearch for Search Workloads

While DynamoDB excels as a fast, scalable, and highly available option for transactional workloads, it is not the best fit for search or analytics. In the first generation of the Medical Ink platform, we therefore offloaded search and analytics functionality from DynamoDB to Amazon OpenSearch. As data volumes grew, we needed complex joins to support richer analytics and real-time patient monitoring. Joins are not first-class citizens in OpenSearch, so they require operationally complex workarounds: data denormalization, parent-child relationships, nested objects, and application-side joins, all of which are challenging to scale.
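To make the last of those workarounds concrete, here is a minimal sketch of an application-side join using the OpenSearch JavaScript client; the index names, fields, and result sizes are assumptions for illustration rather than Medical Ink’s actual mappings.

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://localhost:9200" });

// Application-side join: one query per index, stitched together in code.
async function patientsWithReadings(studyId: string) {
  const patients = await client.search({
    index: "patients", // assumed index name
    body: { size: 1000, query: { term: { study_id: studyId } } },
  });
  const patientHits = patients.body.hits.hits as any[];
  const patientIds = patientHits.map((h) => h._source.patient_id);

  const readings = await client.search({
    index: "device-readings", // assumed index name
    body: { size: 10000, query: { terms: { patient_id: patientIds } } },
  });

  // The "join" happens here, in application code, not in the database.
  const byPatient = new Map<string, any[]>();
  for (const hit of readings.body.hits.hits as any[]) {
    const reading = hit._source;
    if (!byPatient.has(reading.patient_id)) byPatient.set(reading.patient_id, []);
    byPatient.get(reading.patient_id)!.push(reading);
  }

  return patientHits.map((h) => ({
    ...h._source,
    readings: byPatient.get(h._source.patient_id) ?? [],
  }));
}
```

Each new access pattern tends to need another bespoke query pair like this, which is part of what made the approach hard to scale.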

We also ran into data and infrastructure operations issues. One was dynamic mapping in OpenSearch, the process of automatically detecting and mapping the types of fields in a document. Dynamic mapping was useful in our scenario because we handle many fields with varying data types and ingest data from multiple sources with different schemas. However, it also produced unexpected consequences, such as mismatched data types and mapping conflicts that forced us to reindex the data.
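The failure mode looks roughly like the sketch below: once dynamic mapping infers a numeric type for a field, a later document that supplies a string for the same field is rejected, and reconciling the types means reindexing. The index and field names are hypothetical.

```typescript
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: "https://localhost:9200" });

async function reproduceMappingConflict() {
  // First document: dynamic mapping infers "temperature" as a numeric field.
  await client.index({
    index: "observations", // assumed index name
    body: { patient_id: "123", temperature: 37.2 },
  });

  // A later document from a different source sends the same field as text.
  // OpenSearch rejects it (mapper_parsing_exception) because "temperature"
  // is already mapped as a number; fixing this requires reindexing.
  await client.index({
    index: "observations",
    body: { patient_id: "456", temperature: "37.2 C" },
  });
}
```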

Although we used managed Amazon OpenSearch, we were still responsible for node management, shard configuration, and index optimization. As our documents grew in complexity, we had to expand the cluster, a manual, labor-intensive process requiring careful planning and coordination. Because OpenSearch couples compute and storage, handling the growing workload meant scaling compute in lockstep with storage, which led to wasted resources and higher costs. And even though we could have performed the advanced analytics in OpenSearch, the data engineering and operational management it required pushed us to evaluate additional databases.

Snowflake for Data Warehousing Workloads

We also evaluated our cloud data warehouse, Snowflake, as a potential analytics serving layer for the application. Snowflake provided clinical trial sponsors with comprehensive weekly review summaries and offered the sophisticated SQL analytics the application required. However, offloading data from DynamoDB to Snowflake introduced a minimum latency of 20 minutes, far outside the time window required for this use case.

Requirements

To address the gaps in our existing architecture, we identified the essential requirements for an OpenSearch replacement as the serving layer:

  • Real-time streaming ingest: data changes from DynamoDB should be visible and queryable in the downstream database within seconds.
  • Millisecond-latency complex analytics, including joins: the database must support complex queries with joins to deliver the 360-degree view of patient data from clinical trials worldwide, along with sorting, filtering, and aggregations over large volumes of diverse entity types.
  • Reliability: the database must support continuous operations and minimize data loss across a range of failures and disruptions.
  • Scalability in the cloud: the database should scale up and down on demand, at the click of a button or with a simple API call, without compromising availability. We had already invested in a serverless architecture with Amazon DynamoDB and did not want an engineering team managing cluster-level operations.

The After Architecture

Medical Ink’s architecture for real-time clinical trial monitoring with Rockset

Rockset initially caught our attention as an OpenSearch alternative because of its support for complex analytics on low-latency data.

Both OpenSearch and Rockset use indexing to enable fast queries over large datasets. Rockset’s architecture combines a search index with columnar and row stores in its Converged Index to maximize query performance, and the Converged Index is queried with SQL, satisfying our requirement for sophisticated analytics.
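As a rough sketch of what that SQL surface looks like for our 360-degree view, the example below runs a join across two collections through Rockset’s query REST endpoint. The API host, workspace, collection names, and fields are assumptions for illustration.

```typescript
// Region-specific API host; the value below is an example, not Medical Ink's.
const ROCKSET_HOST = "https://api.usw2a1.rockset.com";

async function patientOverview(studyId: string) {
  // Join patients with their device readings and aggregate per patient.
  const sql = `
    SELECT p.patient_id, p.site,
           COUNT(r._id) AS reading_count,
           MAX(r.recorded_at) AS last_seen
    FROM commons.patients p
    LEFT JOIN commons.device_readings r ON r.patient_id = p.patient_id
    WHERE p.study_id = :study_id
    GROUP BY p.patient_id, p.site
    ORDER BY last_seen DESC
  `;

  const res = await fetch(`${ROCKSET_HOST}/v1/orgs/self/queries`, {
    method: "POST",
    headers: {
      Authorization: `ApiKey ${process.env.ROCKSET_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      sql: {
        query: sql,
        parameters: [{ name: "study_id", type: "string", value: studyId }],
      },
    }),
  });

  const { results } = await res.json();
  return results;
}
```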

Beyond the Converged Index, several other features piqued our interest and made it easy to start performance testing Rockset on our own data and queries:

  • New data written to our DynamoDB tables is replicated and queryable in Rockset within a few seconds, so Rockset fit neatly into our existing data architecture.
  • Rockset’s schemaless ingest removed the data engineering hurdles we faced with dynamic mapping in OpenSearch: schema changes no longer disrupted ingestion or query performance.
  • Rockset is also a serverless data platform, keeping resource utilization efficient and operational burden low. It let us scale ingest, compute, and storage independently, avoiding overprovisioning.

Performance Results

Once we confirmed that Rockset met the needs of our application, we moved on to evaluating its ingest and query performance. We ran the following tests against Rockset by writing a Lambda function in Node.js.

Ingest Performance

Our dominant write pattern is many small writes, ranging from 400 bytes to 2 KB, continuously written to the database. We evaluated ingest performance by generating a batch of X write operations to DynamoDB and measuring the average time in milliseconds until Rockset synced the data and made it queryable, commonly referred to as data latency.
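A minimal sketch of how such a test can be written as the Node.js Lambda mentioned above: it batch-writes tagged items to DynamoDB, then polls the mirrored Rockset collection until they are queryable and reports the elapsed data latency. The table, collection, and field names are assumptions, and the real test harness likely differs in its details.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, BatchWriteCommand } from "@aws-sdk/lib-dynamodb";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const ROCKSET_HOST = "https://api.usw2a1.rockset.com"; // assumed region host

// Lambda handler: write a batch of small items to DynamoDB, then poll the
// mirrored Rockset collection until all of them are queryable, and report
// the elapsed time (data latency) in milliseconds.
export const handler = async () => {
  const runId = `ingest-test-${Date.now()}`;
  const batch = Array.from({ length: 25 }, (_, i) => ({
    PutRequest: { Item: { pk: `${runId}#${i}`, runId, payload: "x".repeat(400 + i * 50) } },
  }));

  const started = Date.now();
  await ddb.send(new BatchWriteCommand({ RequestItems: { "patient-data": batch } })); // assumed table

  // Poll Rockset until the whole batch is visible.
  while (true) {
    const res = await fetch(`${ROCKSET_HOST}/v1/orgs/self/queries`, {
      method: "POST",
      headers: {
        Authorization: `ApiKey ${process.env.ROCKSET_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        sql: {
          query: "SELECT COUNT(*) AS n FROM commons.patient_data WHERE runId = :runId",
          parameters: [{ name: "runId", type: "string", value: runId }],
        },
      }),
    });
    const { results } = await res.json();
    if (results?.[0]?.n >= batch.length) break;
    await new Promise((r) => setTimeout(r, 100));
  }

  return { dataLatencyMs: Date.now() - started };
};
```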

For these tests we used a medium-sized Rockset virtual instance with 8 vCPUs and 64 GB of memory.

Streaming ingest performance on a Rockset virtual instance with 8 vCPUs and 64 GB of RAM

In these tests, Rockset achieved low data latency, the interval between data being written to DynamoDB and becoming available for querying in Rockset. Our load testing confirmed that we can write data to DynamoDB and have customer dashboards updated within approximately two seconds. Until then, our team had struggled to achieve consistent latency with Elasticsearch, whereas Rockset’s latency remained predictable and reliable throughout our load tests.

Query Performance

To evaluate query performance, we issued approximately X queries at irregular intervals of 10 to 60 milliseconds apart. We ran two separate tests using query sets of different complexity:

  • A straightforward query that filters on a handful of fields. The dataset contains approximately 700,000 records and is roughly 0.5 GB in size.
  • A complex query that expands arrays using unnesting and filters on the unnested fields. It joins two collections: the first contains approximately 700,000 records at about 2.5 GB, and the second around 650,000 records at roughly 3 GB. Both query shapes are sketched after this list.
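The sketch below shows the two query shapes along with the load pattern of dispatching a new query every 10 to 60 milliseconds. The SQL uses hypothetical collection and field names, and runQuery stands in for the Rockset query call shown earlier.

```typescript
// Hypothetical SQL for the two test query sets.
const simpleQuery = `
  SELECT patient_id, site, status
  FROM commons.patients
  WHERE study_id = 'STUDY-001' AND status = 'active'
  ORDER BY site
`;

const complexQuery = `
  SELECT p.patient_id, AVG(hr.value) AS avg_heart_rate
  FROM commons.device_readings d,
       UNNEST(d.heart_rate_samples AS value) AS hr,
       commons.patients p
  WHERE p.patient_id = d.patient_id
    AND hr.value > 0
  GROUP BY p.patient_id
`;

// Dispatch a new query every 10-60 ms without waiting for the previous one
// to finish, so requests overlap, and collect per-query latencies in ms.
async function loadTest(
  runQuery: (sql: string) => Promise<unknown>,
  sql: string,
  totalQueries: number
): Promise<number[]> {
  const inFlight: Promise<number>[] = [];
  for (let i = 0; i < totalQueries; i++) {
    const started = Date.now();
    inFlight.push(runQuery(sql).then(() => Date.now() - started));
    await new Promise((r) => setTimeout(r, 10 + Math.random() * 50));
  }
  return Promise.all(inFlight);
}
```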

We ran these tests on a Rockset virtual instance provisioned with 8 vCPUs and 64 GB of RAM.

Performance of the simple query, executed on a Rockset virtual instance with 8 vCPUs and 64 GB of RAM
Performance of the complex query with unnesting, executed on a Rockset virtual instance with 8 vCPUs and 64 GB of RAM

Even while handling a high volume of concurrent requests, Rockset consistently delivered query response times within a tight range.

We also assessed Rockset’s scalability by comparing query performance on a small virtual instance with 4 vCPUs and 32 GB of RAM against the medium instance. The comparison showed a substantial improvement in query latency on the larger instance, roughly 1,600% for the first query and 4,500% for the second, confirming that Rockset can scale to handle our workload.

Our evaluation also showed that Rockset’s query performance was consistent, with response times staying within roughly 20-40% of the average and queries typically executing in under 10 milliseconds, delivering results quickly enough for a responsive user experience.

Conclusion

As we move real-time clinical trial monitoring into production, we are building a modern operational data hub for clinical teams. We have been impressed by Rockset’s speed and its support for sophisticated filtering, joining, and aggregation. Rockset delivers query latencies of under 10 milliseconds for our workload while ingesting data from DynamoDB in real time, including updates, inserts, and deletes.

While OpenSearch required manual tuning for peak performance, Rockset has markedly reduced the need for operator intervention. Scaling up to larger virtual instances as we onboard additional clinical trial customers is just a click away.

In the coming year, we look forward to rolling out real-time study participant monitoring to all customers, further advancing the digital transformation of clinical trials.
