Data catalogs and metadata catalogs have nearly identical names, and, like data scientists and AI engineers, they share some broad similarities. But there are also distinct differences between the two that data professionals should be aware of.
Metadata catalogs, also known as metastores or technical catalogs, have long been an integral part of the modern data landscape.
Regular readers may recall the significant metadata catalog news out of two major industry conferences last month, where Snowflake and Databricks each took the step of open-sourcing their respective metadata catalogs, Polaris and Unity Catalog.
Metadata catalogs, also known as metadata management systems or metadata repositories, are centralized databases that store and manage metadata across an organization. They enable users to discover, access, and utilize data from various sources, making it easier to gain insights, optimize processes, and ensure compliance.
Metadata Catalogs
A metadata catalog serves as the central repository for technical metadata, storing descriptions of the data assets held in tables within a data lake or lakehouse.
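At its core, the idea is a lookup from table names to descriptions of where and how the data is stored. A minimal, purely illustrative sketch of such a repository (every class, field, and path below is hypothetical, not any real metastore's API):

```python
from dataclasses import dataclass, field

@dataclass
class TableMetadata:
    """Technical metadata for one table in the lake (illustrative fields)."""
    name: str
    schema: dict          # column name -> column type
    location: str         # where the data files live in object storage
    fmt: str = "parquet"  # physical file format

@dataclass
class MetadataCatalog:
    """Toy central repository mapping table names to their metadata."""
    tables: dict = field(default_factory=dict)

    def register(self, meta: TableMetadata) -> None:
        # A real metastore would persist this; we just keep it in memory.
        self.tables[meta.name] = meta

    def describe(self, name: str) -> TableMetadata:
        return self.tables[name]

# Register a table the way a metastore would on CREATE TABLE.
catalog = MetadataCatalog()
catalog.register(TableMetadata(
    name="sales.orders",
    schema={"order_id": "bigint", "amount": "decimal(10,2)"},
    location="s3://lake/sales/orders/",
))
print(catalog.describe("sales.orders").location)
```

A query engine consults exactly this kind of mapping to turn a table name in a SQL statement into concrete files to scan.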
The Hive Metastore is one of the most widely used metadata catalogs, serving as the repository for metadata describing the structure and content of Apache Hive tables. Hive originally provided a relational framework that let Hadoop users query HDFS-based data with standard SQL syntax, as an alternative to writing MapReduce programs.
Although Hive and the Hive Metastore have been around for a while, they're gradually being replaced by more modern technologies. Table formats such as Apache Iceberg, Apache Hudi, and Databricks Delta Lake offer numerous advantages over traditional Hive tables, including support for transactions, which improves data consistency.
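To see why transactional support matters, it helps to sketch how these formats commit changes: a writer produces a new immutable snapshot of the table, and the commit succeeds only if no other writer got there first. The following is a simplified illustration of that optimistic-concurrency idea, not Iceberg's actual implementation or API:

```python
import threading

class TableCommitConflict(Exception):
    """Raised when another writer committed first (optimistic concurrency)."""

class TransactionalTable:
    """Toy model of a table format's snapshot-based transactional metadata.

    Formats like Iceberg keep an ordered chain of immutable snapshots;
    a write lands atomically only if the snapshot it was based on is
    still the current one. Purely illustrative.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshots = [[]]  # snapshot 0: empty list of data files

    def current_snapshot_id(self) -> int:
        return len(self.snapshots) - 1

    def commit(self, based_on: int, new_files: list) -> int:
        with self._lock:
            if based_on != self.current_snapshot_id():
                # Someone else committed since we read; caller must retry.
                raise TableCommitConflict("table changed since read")
            self.snapshots.append(self.snapshots[-1] + new_files)
            return self.current_snapshot_id()

table = TransactionalTable()
snap = table.current_snapshot_id()
table.commit(snap, ["data/part-000.parquet"])
# A second commit based on the now-stale snapshot id fails cleanly
# instead of silently corrupting the table:
try:
    table.commit(snap, ["data/part-001.parquet"])
except TableCommitConflict:
    print("conflict detected")
```

Readers always see a complete snapshot, never a half-written table, which is the consistency guarantee Hive tables historically lacked.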
To fully leverage these table formats, organizations also need a technical foundation, a metadata catalog, which serves as a gateway for discovering and governing the data contained in the tables, thereby ensuring controlled access to it. For Databricks users, that catalog is Unity Catalog. Another Iceberg-focused project, Nessie, was designed to be a "transactional catalog," facilitating access from a range of open-source and commercial tools, including Hive, Dremio, Spark, and AWS Athena, among others.
Snowflake developed Polaris as a metadata catalog for the Apache Iceberg ecosystem and has formally committed to open-sourcing it. Polaris implements Iceberg's open, REST-based API, providing access to the descriptive metadata of Parquet data stored in Iceberg tables, much as Nessie does. The REST API acts as an interface between the data stored in Iceberg tables and data processing engines, such as Snowflake's native SQL engine as well as various open-source alternatives.
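The REST API in question is a small HTTP surface: an engine asks the catalog for a table by namespace and name, and receives back metadata pointing at the table's schema, snapshots, and underlying files. A hedged sketch of how a client might address the "load table" endpoint; the path shape follows the Iceberg REST catalog's published OpenAPI spec, but the host and table names here are invented for illustration:

```python
from urllib.parse import quote

def load_table_url(base: str, namespace: str, table: str) -> str:
    """Build the URL for the Iceberg REST catalog's load-table call.

    The spec defines GET /v1/namespaces/{namespace}/tables/{table};
    the base URL below is a made-up example, not a real endpoint.
    """
    return f"{base}/v1/namespaces/{quote(namespace)}/tables/{quote(table)}"

url = load_table_url("https://polaris.example.com/api/catalog", "sales", "orders")
print(url)
# An engine would GET this URL, parse the JSON table metadata it gets
# back, and then plan its scan directly against the Parquet files.
```

Because the contract is just HTTP and JSON, any engine that speaks the spec can work against any compliant catalog, which is the interoperability point being made above.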
Data Catalogs
Firms often use third-party tools called data catalogs to aggregate and organize the data amassed within their organizations. Some of these began as tools that let users explore and discover data relevant to their interests or needs, with a data discovery component built into the catalog.
Several of these data catalogs have similarly evolved to integrate access management capabilities alongside data lineage tracking and regulatory oversight features. Vendors of data management tools that initially focused on governance and access management have likewise expanded their offerings to include data catalogs and discovery capabilities.
Metadata catalogs and enterprise data catalogs alike devote significant resources to capturing metadata, enabling effective monitoring of diverse data assets. One leading enterprise data catalog vendor prioritizes integration across distinct datasets, using a metadata "orchestrator" to synchronize information and keep business metrics aligned and consistent.
Both types of tools track metadata, and both typically carry "catalog" in their names. But metadata catalogs are essentially databases that store technical descriptions of data, whereas data catalogs are platforms that aggregate and provide governed access to an organization's data assets, with business context attached.
So What's the Difference?!
I spoke with Felix Van de Maele, CEO and co-founder of Collibra, a leading provider of data catalogs in the big data space, to get his perspective on the differences between the two types of catalogs.
"They're very different concepts," Van de Maele said. "If you look at the Polaris catalog from Snowflake, Unity Catalog from Databricks, or the cloud data warehouses, all of which have their own catalogs, it's really about having the flexibility to store your data anywhere, across any cloud provider...And I can use any data engine, whether Databricks, Snowflake, Google, AWS, and so on, to process that data."
Collibra and the other enterprise data catalogs, by contrast, operate at a different level.
"We provide much richer business context," he said. "We provide a knowledge graph, an enterprise context that lets you define and manage your policies. What are the policies that apply to this data, whether company-specific rules or regulations such as GDPR or HIPAA? Who needs to approve it? How do we capture attestations? How do we do certification? How do we build the business glossary?"
"That's very different from a Polaris catalog that sits atop Iceberg, which is really the physical metadata," he said, highlighting a genuine distinction.
Van de Maele is a proponent of the open data lakehouse architecture, which lets users store their data in open formats such as Iceberg, Delta, and Hudi and query it with any engine. His customers, many of them Fortune 500 companies operating across multiple data platforms, use Collibra's Data Intelligence platform to inform management decisions and ensure governed access to data assets.
Different Roles
While their names may sound alike at first, metadata catalogs and data catalogs serve distinct purposes.
"The way I distinguish between the two is that we define and administer policy, whereas they enforce it," Van de Maele explained. "I think that's the right framework."
Historically, metadata catalogs lacked the capability to let customers organize business policies around data access. For example, privacy regulations may bar unrestricted access to customer data, with exceptions only when specifically labeled data can be anonymized, according to Van de Maele.
"We make sure the right data is classified and masked in a consistent way wherever it's used, whether it's Databricks, Salesforce, or Google," he said. "We deploy to multiple cloud data platforms, including Databricks, Snowflake, Google Cloud, Amazon Web Services, and Microsoft Azure."
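Mechanically, this kind of policy boils down to classifying columns and applying an action per class wherever the data is read. A simplified stdlib-only sketch of the idea; the tags, actions, and column names are hypothetical and are not Collibra's actual policy model:

```python
import hashlib

# Hypothetical policy: each column classification maps to an action
# that is applied before data leaves the platform.
POLICY = {
    "email": "mask",        # PII: hidden from everyone
    "customer_id": "hash",  # pseudonymized, so rows can still be joined
    "amount": "allow",      # non-sensitive, passes through
}

def apply_policy(row: dict) -> dict:
    """Apply the masking policy to one record (illustrative only)."""
    out = {}
    for col, value in row.items():
        action = POLICY.get(col, "deny")
        if action == "allow":
            out[col] = value
        elif action == "mask":
            out[col] = "***"
        elif action == "hash":
            # Stable pseudonym: same input always yields the same token.
            out[col] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        # "deny": drop the column entirely
    return out

safe = apply_policy({"email": "ada@example.com", "customer_id": 42, "amount": 9.99})
print(safe["email"], safe["amount"])
```

The catalog's job in this division of labor is to hold the POLICY side (which columns are PII, who approved that classification); the data platform's job is to run the `apply_policy` side at query time.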
Companies could build their own data governance frameworks without tools like Collibra's, Van de Maele noted; at bottom, it all comes down to SQL. But they would then need an efficient way to track the hundreds of thousands of columns scattered across their various data platforms. By providing visibility into what data exists and where it lives, Collibra ensures that users access data in accordance with the corporation's governance policies, which is its purpose.
At the same time, Collibra relies on metadata catalogs to enforce its policies effectively. Earlier attempts at enforcement mechanisms, such as proxies and drivers, ultimately proved ineffective, Van de Maele noted.
"We think the metadata catalog approach using an open table format is going to be a much better option," he said. "We want the data platforms to support that natively, so scalability and performance are maintained."
Databricks' Unity Catalog appears to be an exception in this regard. Unity Catalog gives users granular control over technical metadata while also providing higher-level capabilities such as data governance, access management, auditing, and data lineage. In that respect, Unity Catalog appears to compete with the major enterprise data catalog providers.