Introduction
Databricks for Good is a grassroots initiative through which Databricks provides pro bono professional services to non-profits to amplify their social impact. Through this partnership, Advantage Basis advances its mission of delivering quality healthcare worldwide by building on a modern data infrastructure.
Advantage Basis uses both static and dynamic data sources to connect doctors with volunteer opportunities. To keep that data current and accessible, the organization's data team implemented API-based data retrieval pipelines. Basic information such as organization names, websites, phone numbers, and addresses can be extracted automatically, but more specialized details, such as medical specialties and areas of practice, require significant manual effort. This reliance on manual processes limits scalability and leads to infrequent updates. In addition, the dataset's tabular format poses usability challenges for the organization's primary users, doctors and academic researchers.
In the desired state of the data model, Advantage Basis keeps its core datasets accurate, comprehensive, and easy to retrieve at all times. To realize this vision, Databricks experts designed the following components.
As the accompanying diagram shows, we build up the data layer by layer in a medallion structure. Our sources are a range of APIs and web-based feeds, which we first ingest into a bronze landing zone with batch Spark jobs. This raw data is then refined step by step: we clean it and extract metadata with incremental Spark processes, in some cases implemented with structured streaming.
Once processed, the data is sent to two production systems.
First, we build a robust, structured dataset containing key information about hospitals, non-governmental organizations (NGOs), and related entities, including their locations, contact details, and medical specialties.
Second, we use LangChain to build an ingestion pipeline that incrementally chunks and indexes raw text data into Databricks Vector Search.
The processed data is exposed through a RAG chatbot hosted in the Databricks AI Playground, giving users an interactive tool for exploring the data.
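The chunking step of the second route can be illustrated with a minimal sketch. It is plain Python rather than the LangChain pipeline used in the project, and the function names (`chunk_text`, `to_index_records`), the chunk size, and the overlap are assumptions chosen for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def to_index_records(doc_id: str, text: str, **kwargs) -> list[dict]:
    """Turn one document into records that a vector index could ingest.

    The record layout (chunk_id, text) is an assumption, not the project's schema.
    """
    return [
        {"chunk_id": f"{doc_id}-{i}", "text": chunk}
        for i, chunk in enumerate(chunk_text(text, **kwargs))
    ]
```

Overlapping windows keep a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of some duplicated text in the index.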
Fascinating Design Decisions
While most of the project used conventional ETL approaches, a few less conventional techniques proved especially valuable in this implementation.
MongoDB Bi-Directional CDC Sync
Advantage Basis uses MongoDB as the serving layer for its website. Connecting Databricks to an external database like MongoDB can be complex because of compatibility limitations: some Databricks operations are not fully supported in MongoDB and vice versa, which can complicate the flow of data transformations across the two platforms.
By implementing a bidirectional sync, we gained full control over how data from the silver layer is integrated into MongoDB. The sync keeps both copies of the data consistent, with changes reflected at intervals dictated by the configured sync trigger frequency, down to near real time. At a high level, there are two main components:
- Changes from MongoDB to Databricks: we capture all updates made in MongoDB since the last sync. With structured streaming in Azure Databricks, we apply a merge statement inside foreachBatch() to keep the Databricks tables updated with these changes.
- Changes from Databricks to MongoDB: when updates occur on the Databricks side, structured streaming's incremental processing lets us push those changes back to MongoDB. MongoDB therefore stays in sync, and the vfmatch.org website always serves the latest data.
This bi-directional setup lets data flow between Databricks and MongoDB in both directions, keeping both systems up to date and eliminating data silos.
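The per-batch merge logic can be sketched as follows. This is a plain-Python stand-in for a merge run on each micro-batch, not the project's Spark code: the event shape (`op`, `ts`, `doc`), the `_id` key, and the rule that an older or equal timestamp is skipped are all assumptions made for illustration.

```python
from typing import Any

Row = dict[str, Any]


def latest_per_key(events: list[Row]) -> list[Row]:
    """Keep only the newest event per _id so each key is merged at most once."""
    latest: dict[str, Row] = {}
    for ev in events:
        key = ev["doc"]["_id"]
        if key not in latest or ev["ts"] >= latest[key]["ts"]:
            latest[key] = ev
    return list(latest.values())


def merge_batch(target: dict[str, Row], events: list[Row]) -> dict[str, Row]:
    """Apply one micro-batch of change events to an in-memory 'table'.

    Hypothetical event shape: {"op": "upsert" | "delete", "ts": int, "doc": {...}}.
    """
    for ev in latest_per_key(events):
        key = ev["doc"]["_id"]
        current = target.get(key)
        # Skip events that are not newer than the stored row. In a two-way
        # sync this also drops changes echoed back from the other system.
        if current is not None and current["ts"] >= ev["ts"]:
            continue
        if ev["op"] == "delete":
            target.pop(key, None)
        else:
            target[key] = {**ev["doc"], "ts": ev["ts"]}
    return target
```

Deduplicating the batch first mirrors a common constraint of merge statements, where several source rows matching one target row is an error. Comparing timestamps is one simple way to keep the two directions from overwriting each other with stale data.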
Thank you to the team member who owned this piece.
GenAI-based Upsert
To integrate data effectively, we used a GenAI-based approach to extract and merge relevant hospital information from large volumes of website text. The process has two main steps:
- Extraction: using GenAI, we extract key hospital information from unstructured text across various websites. This is done with simple calls to Meta's model endpoints on the Databricks Foundation Model service.
- Matching: once the information is extracted, we create a primary key by combining the city, country, and entity name. Using embedding distance thresholds, we then determine whether the entity matches an existing entry in our production database.
Traditionally, this would have required advanced fuzzy matching techniques and complex rule-based systems. By combining embedding distance with simple deterministic rules, we were able to build a solution that is both effective and relatively simple to build and maintain.
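As a rough illustration of the key-building step, the sketch below normalizes the three components and joins them. The normalization rules, the `|` separator, and the function names are assumptions for illustration, not the project's actual key format.

```python
import re
import unicodedata


def normalize(part: str) -> str:
    """Lowercase, strip accents, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", part)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9]+", " ", ascii_only.lower())
    return " ".join(cleaned.split())


def build_key(city: str, country: str, entity_name: str) -> str:
    """Combine city, country, and entity name into one key (hypothetical format)."""
    return "|".join(normalize(p) for p in (city, country, entity_name))
```

Normalizing before joining means trivial differences in case, accents, or spacing do not produce distinct keys. Genuinely different spellings still need the similarity-based matching described in this section.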
In the current iteration of the product, we use the following matching criteria:
- Exact match.
- Fuzzy match, which tolerates minor differences in spelling or formatting.
- Embedding cosine similarity, which allows flexible matching across different renderings of a name, such as “St. John’s” and “Saint Johns”. We also include a configurable distance threshold to determine whether a human should review a change before it is merged.
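The three criteria can be combined roughly as in the sketch below. Which field each criterion applies to (country, city, name), the threshold values, and the character-trigram `toy_embedding` standing in for a real embedding model are all assumptions made so the example runs on its own.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def toy_embedding(name: str) -> Counter:
    """Character-trigram counts; a stand-in for a real embedding model."""
    s = re.sub(r"[^a-z0-9]", "", name.lower())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def fuzzy_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(candidate: dict, existing: dict, fuzzy_min: float = 0.8,
             match_max_dist: float = 0.15, review_max_dist: float = 0.5) -> str:
    """Return 'match', 'review', or 'no_match' for one candidate/existing pair."""
    if candidate["country"] != existing["country"]:            # exact match
        return "no_match"
    if fuzzy_ratio(candidate["city"], existing["city"]) < fuzzy_min:  # fuzzy match
        return "no_match"
    dist = cosine_distance(toy_embedding(candidate["name"]),
                           toy_embedding(existing["name"]))    # embedding distance
    if dist <= match_max_dist:
        return "match"
    return "review" if dist <= review_max_dist else "no_match"
```

Running the cheap deterministic checks first means the embedding comparison is only computed for plausible pairs. The middle band between the two distance thresholds is where a human reviewer would be asked to confirm the merge.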
Thank you for the excellent design concept and seeing it through to completion!
Extra Implementations
The broader infrastructure follows standard Databricks architecture and practices. Here is an overview of the key components and the team members who made them happen:
- Data ingestion: Python-based API clients and batch Spark jobs for efficient data ingestion. A huge thank you to the contributor behind this work.
- Medallion ETL: structured streaming and LLM-based entity extraction enrich the data at each layer of the medallion structure. Special thanks to the contributor for her invaluable work on this part of the project.
- RAG source table: to populate the source for Retrieval-Augmented Generation (RAG), we used LangChain together with structured streaming and Databricks agents. Kudos to the contributor who built and refined this crucial component.
- Vector store: to store the data for retrieval, we used Databricks Vector Search and the supporting Delta Live Tables (DLT) infrastructure. Thanks to the contributor who designed and built the initial version of this component.
Summary
Through our partnership with Advantage Basis, we are demonstrating how data and AI can drive meaningful impact in the healthcare sector. By combining data ingestion, entity extraction, and Retrieval-Augmented Generation, we are building a platform that brings together automation, interactivity, and enriched data. This work lays the groundwork for making data-driven decision making available to those who need it most in healthcare.
If you would like to explore similar engagements with other international non-profits, let us know.