Blog Post

Why Data Knowledge Management and Data Catalogs Aren't Friends

by
Nick Freund
-
January 31, 2024

In the three years since founding Workstream, one question we have gotten quite often is: “Are you a data catalog?” 

The question is understandable. Data catalogs have been a well-known solution category for over a decade and, alongside data observability, they are probably the best-known, most discussed and most purchased “meta-category” of data tools. 

Alation – often credited with creating the data catalog category – was founded in 2012. Since then, a host of data catalogs have flooded the market, especially in recent years. The new data catalogs on the block are taking on these legacy metadata management tools, vastly expanding the scope of the solutions on offer and positioning themselves as “modern” alternatives to the OGs. Demand for the category is expanding, with Gartner estimating that the market grew by 21.6% over the past year.

But if I am being blunt, I find the question downright annoying to answer. The short answer is no: we are not a data catalog, nor do we have any intention of becoming one. We believe that we are building something new, and forging a new category. And while we salute the success of those building Alation copycats, we quite frankly don’t find building a better version of the same old mousetrap that interesting or valuable. 

While you may notice one or two similar features, we believe that our vision for data knowledge management stands in stark opposition to catalogs in terms of where they fit in your tech stack, the audience they are built to serve, and the value they provide to the business. 

In fact, our views of the world are so diametrically opposed that I would posit that data catalogs and knowledge management are not even friends.

The Data Catalog Landscape

Data catalogs can trace their lineage to the reasonably long history of metadata management. What has changed in recent years is the complexity of data ecosystems, and the increasing challenges of building the architecture required to orchestrate flows of data throughout a business.

A brief caveat – since we are not building a data catalog ourselves, we do not profess to be experts in the space, and are open to any feedback or corrections. However, we do think it is helpful to lay out some of the key features of a data catalog in order to differentiate what we offer.

Some of the primary components of data catalogs are:

  • Serving as a system of record for inventorying all data assets and their components (table/column definitions, etc.)
  • Ability to trace data lineage throughout the data stack and, increasingly, machine learning capabilities to surface issues
  • Integration with developer workbenches and testing platforms
  • Helping data stewards, analysts, engineers, and data scientists locate and contextualize the datasets they use on a daily basis
  • Tools for data governance
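To make the first two components concrete, here is a minimal sketch of a catalog as a system of record with lineage tracing. It is an illustration under stated assumptions, not how any particular catalog is built: the `CatalogEntry` schema, the `trace_lineage` helper, and the three sample assets are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One inventoried data asset (hypothetical schema, for illustration only)."""
    name: str
    description: str
    columns: dict[str, str] = field(default_factory=dict)  # column name -> definition
    upstream: list[str] = field(default_factory=list)      # names of direct source assets


def trace_lineage(catalog: dict[str, CatalogEntry], asset: str) -> list[str]:
    """Return every upstream ancestor of `asset`, nearest first (breadth-first)."""
    seen: list[str] = []
    queue = list(catalog[asset].upstream)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        if current in catalog:
            queue.extend(catalog[current].upstream)
    return seen


# Made-up sample assets: a raw table, a staging table, and a reporting table.
catalog = {
    "raw_orders": CatalogEntry("raw_orders", "Orders as loaded from the source system"),
    "stg_orders": CatalogEntry("stg_orders", "Cleaned orders", upstream=["raw_orders"]),
    "fct_revenue": CatalogEntry(
        "fct_revenue",
        "Daily revenue by customer",
        columns={"revenue_usd": "Sum of order totals in USD"},
        upstream=["stg_orders"],
    ),
}

print(trace_lineage(catalog, "fct_revenue"))  # ['stg_orders', 'raw_orders']
```

Note who this record speaks to: table names, column definitions, and upstream dependencies are exactly what someone writing SQL needs, and not much else.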

Built for those who write SQL

What becomes apparent when examining these tools is that their orientation is fundamentally toward a technical audience. The data people, those who can write SQL, and anyone accessing your data warehouse directly, may need a data catalog to successfully do their work.

…and who use the data warehouse 

Another inherently limiting aspect of data catalogs comes from where they sit within your data environment. 

Because a catalog interacts primarily with your data warehouse, or the ETL tools that feed the warehouse, its value is concentrated in that part of your data ecosystem. Once you move further toward the consumption layer (i.e., to your BI or visualization tool, and beyond), the content in your data catalog becomes less and less relevant and valuable. 

While these tools might catalog a dashboard as an asset, and allow you to certify it, they do nothing to help the business user trying to use it and apply it to their daily job. 

Data engineer Ananth Packkildurai has written a great article about some of the further limitations of data catalogs, arguing that they are designed for an older, simpler data environment. Because of the complexities of the data creation process – and the many different tools where data is processed and consumed – he argues that an effective data catalog would need to embed itself in the data creation process, rather than merely sitting on top of it. 

A large crowd

The field for data catalogs is incredibly crowded, both with legacy metadata management platforms and new disruptors, and all claim some unique angle. For example, Atlan emphasizes data product deployment, while Data.world works to stand out for the depth of its knowledge graph. What remains consistent, however, is their focus on the technical teams who interact with the underlying datasets in your business.

How Data Knowledge Management is Different

Because Data Knowledge Management is a new category, it is not surprising that it is sometimes confused with more established types of tools. Data catalogs are sometimes the only reference point for people who first come across Workstream, and they are a helpful one – at the end of the day, both tools are about making interactions with data more effective, efficient, and contextually rich.

But that is the extent of the similarities. While data catalogs provide important context for the technical users building your data products, data knowledge solutions are oriented toward a wider array of users, both technical and non-technical, at the point of data consumption.

More importantly, data knowledge solutions are built to enable data teams to turn the massive scale of data into knowledge that the rest of the business can act on, empowering better business outcomes. And while the data catalog category is over a decade old, the need for data knowledge today is driven by novel problems that the modern data revolution has created over that same period. 

Modern data technologies have made it easier than ever to analyze, consume and share data. Data is now available to anyone with access to a phone or internet browser, and the need for data has moved out of the domain of just executives, and into the hands of every manager and employee.

The ever-growing scale of available data means that everyone in an organization is now expected to find value in it, and to make data-informed decisions. While teams have embraced modern data tools, there remains a gap between accessing the data and applying it to the business.

The Benefits of Data Knowledge

Anyone who consumes data – from a Success Manager looking at customer reports in your CRM, to your COO exploring weekly performance – can use a tool like Workstream to find and contextualize data. 

While a data catalog may be full of valuable information for technical users looking to query your data warehouse, most of it is not valuable to business users. 

When end users view a dashboard, they typically don’t care about its data lineage. They just need to know where to find it, how to use it, and how to get help if they get stuck. In parallel, your data team needs to understand how users are leveraging data, so they can best support them and push the data environment forward.

Enabling everyone with data knowledge

One of the biggest challenges we hear about from data teams as they build data assets is that it is often very difficult to provide the necessary enablement for their business users, who are expected to apply data in day-to-day operations. 

Key considerations of how teams might foster data knowledge include:

  • How do you train new employees on the data that is available for their role?
  • How do you enable teammates on new product features and evolutions to your data platform? 

Because data catalogs are not leveraged by business users, they can do little to help in this area. 

Building a trustworthy repository of analytics assets

No matter how good your organization is at leveraging data, it all means nothing if your team cannot find what they should use on any given day. 

In the modern organization, where every user has been enabled to consume and create data, teams experience an ever-growing sprawl of data assets that reflects the constantly evolving nature of the business. Amidst this chaos, end users struggle simply to find the data assets that are relevant to their role. 

Key considerations that teams might make in solving these problems via knowledge include:

  • How do I discover assets relevant to me based on my role, and that others similar to me might leverage?
  • How do I understand what data assets and knowledge I should use to address this specific question, at this point in time?

Counterintuitively, while data catalogs do a great job of enabling technical teams with data discovery, catalogs struggle to add value here because their core integration plane is the data warehouse. They are simply not approachable for an end user with the simple problem of: “Which of the many reports I have access to accurately reflects our team’s Sales achievement this quarter?”

Don’t believe me? Go and ask your end users how many of them have Chrome bookmarks for this use case, and then check whether they have all bookmarked the same thing. (They probably haven’t.)

Uncover actionable insights about the data your organization uses

One of the biggest impediments to coherent data enablement strategies is the lack of insight teams have into the use of their data. As more users apply data to their daily job, data teams must support more data users, and more data assets, than ever before. 

Understanding how data assets are being used and applied to the business is essential to maintaining and improving your most important assets, and to evolving the capabilities of the data platform. 

Important insights might include:

  • How do I understand not just which dashboards and self-service assets are popular, but which are trending in the last 30 days across a specific set of users (e.g., the Sales department)?
  • How can I better understand the experience of my users, and how they interact with various data points and features (filters, etc.)?
  • How can I leverage usage insights to automatically maintain my environment (e.g., by archiving unused assets)?
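As a rough sketch of the first and last questions above, the snippet below ranks assets by recent views from one department and lists assets nobody has opened in a while. Everything in it is an assumption made for illustration: the view log, the asset names, the `trending` and `archive_candidates` helpers, and the 30- and 90-day thresholds are hypothetical, and none of it describes how Workstream or any other product works. Counting recent views is also only a simple proxy for "trending"; a fuller measure would compare against an earlier period.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical view log: (asset name, viewer's department, day viewed).
TODAY = date(2024, 1, 31)
views = [
    ("Pipeline Overview", "Sales", TODAY - timedelta(days=2)),
    ("Pipeline Overview", "Sales", TODAY - timedelta(days=5)),
    ("Quota Attainment", "Sales", TODAY - timedelta(days=10)),
    ("Pipeline Overview", "Finance", TODAY - timedelta(days=1)),
    ("Churn Report", "Success", TODAY - timedelta(days=200)),
]


def trending(views, department, today, window_days=30):
    """Rank assets by views from one department within the recent window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        asset for asset, dept, day in views
        if dept == department and day >= cutoff
    )
    return counts.most_common()  # most-viewed first


def archive_candidates(views, all_assets, today, idle_days=90):
    """List assets with no views from anyone in the last `idle_days` days."""
    cutoff = today - timedelta(days=idle_days)
    recently_viewed = {asset for asset, _, day in views if day >= cutoff}
    return sorted(set(all_assets) - recently_viewed)


all_assets = ["Pipeline Overview", "Quota Attainment", "Churn Report", "Old KPI Deck"]

print(trending(views, "Sales", TODAY))
# [('Pipeline Overview', 2), ('Quota Attainment', 1)]
print(archive_candidates(views, all_assets, TODAY))
# ['Churn Report', 'Old KPI Deck']
```

The point is that both answers come from how people consume data, not from warehouse metadata.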

Summary

When we say that data catalogs and data knowledge management aren’t friends, it is not intended as a knock on the value of a data catalog. 

It is a knock on the creativity of the people building the n-th data catalog in a really crowded space. And it is a reflection of how different we believe data knowledge is, both in where it sits within an organization and in the value it provides. 

With one oriented toward technical users and those querying your datasets, and the other aimed at transforming data into knowledge for your broader organization, they each have their own – very different – place within your business.
