Robert Koch, Principal Data Engineer

I wanted to walk us through the different data terminology we work with, identify the differences, and figure out what we actually know. I'll certainly be referencing this page myself from time to time to help me remember what's what. You can probably bookmark it and use it as a reference for yourself too.

We’ve all encountered terms involving databases/data stores such as data warehouse, data lake, data mesh, data fabric, master data, and data pipeline. Let’s start with my assumptions as to what each is before I come up with a researched definition and some comments to help us remember.

Starting with my ass-u-m(e)ptions: a data warehouse, I would say, is a series of databases on a server or two where we store structured data, protected by role-based security tailored for each team that accesses it. Some data can be ingested, transformed & enhanced, and then written back to the database, probably as another table or view.
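
To make that "transform, then write back as a view" idea concrete, here's a minimal sketch using Python's built-in sqlite3 as a stand-in for a real warehouse engine; the table and column names are my own invention for illustration:

```python
import sqlite3

# sqlite3 stands in for a real warehouse engine (Redshift, Snowflake, etc.)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 120.0, "east"), (2, 75.5, "west"), (3, 220.0, "east")],
)

# "Transform & enhance, then write back" -- here as a view over the raw table,
# so downstream teams query the enriched shape without touching the source.
conn.execute("""
    CREATE VIEW orders_by_region AS
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders
    GROUP BY region
""")

for row in conn.execute("SELECT * FROM orders_by_region"):
    print(row)  # e.g. ('east', 2, 340.0) and ('west', 1, 75.5)
```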

A data lake is a file storage system where we’d dump all our structured data (CSV/JSON/etc.) and unstructured data (images, videos, raw data) into one cheap, centralized location in the cloud – think AWS S3, Google Cloud Storage, or Azure Blob Storage – and then use Hadoop or the like to extract/read the data so it makes sense.
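
As a quick sketch of the "dump it in cheap storage first" idea, here's what landing a raw file in S3 and reading it back might look like with boto3; the bucket and key names are hypothetical:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "acme-data-lake"  # hypothetical bucket name

# Land the raw file as-is; no schema, no ceremony.
s3.upload_file("events.json", BUCKET, "raw/events/2024-06-01/events.json")

# Later, a processing job (Spark, Athena, Glue, ...) reads it back.
obj = s3.get_object(Bucket=BUCKET, Key="raw/events/2024-06-01/events.json")
raw_bytes = obj["Body"].read()
print(f"fetched {len(raw_bytes)} bytes from the lake")
```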

With master data, the concept conveyed to me is a data store that keeps the data consolidated and deduplicated. I believe the concept has fallen by the wayside since cloud storage is very cheap and it’s somewhat worth it to have different “views” of the same data – i.e. analytics may want the data as raw as possible to glean as much information out of it, while those in finance might want to see it structured so it’ll fit neatly in their spreadsheets.
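
The "consolidated and deduplicated" part is essentially golden-record building. A minimal pandas sketch, with made-up customer fields, might look like this:

```python
import pandas as pd

# Two source systems, overlapping customers (fields are made up for illustration).
records = pd.DataFrame([
    {"customer_id": 1, "email": "a@example.com", "updated": "2024-01-05"},
    {"customer_id": 1, "email": "a@corp.example", "updated": "2024-03-10"},
    {"customer_id": 2, "email": "b@example.com", "updated": "2024-02-20"},
])

# Keep the most recently updated row per customer -- one "golden record" each.
golden = (
    records.sort_values("updated")
           .drop_duplicates(subset="customer_id", keep="last")
)
print(golden)
```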

When saying data mesh, and assuming it’s something like a network mesh, I’d guess it means a combination of raw unstructured data mixed in with structured data. I could use this terminology to describe a mixture of data stored in different locations and fronted by a common tool such as AWS Glue or Azure Data Factory.

Data pipeline is a term used to describe the flow of data from source to destination. The pipeline might involve extract, transform, load – some would say extract, load, and transform – depending on how you need the data and integration done.
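
The ETL-versus-ELT distinction is really just a question of where the transform step sits. A toy sketch in plain Python (the function names are mine):

```python
def extract() -> list[dict]:
    # Pull rows from a source system; hard-coded here for illustration.
    return [{"amount": "12.50"}, {"amount": "7.25"}]

def transform(rows: list[dict]) -> list[dict]:
    # Clean/enrich: cast string amounts to floats.
    return [{"amount": float(r["amount"])} for r in rows]

def load(rows: list[dict], destination: list) -> None:
    # Write to the destination store (a plain list stands in here).
    destination.extend(rows)

warehouse: list[dict] = []

# ETL: transform before loading...
load(transform(extract()), warehouse)

# ...whereas ELT would load the raw rows first and transform them
# inside the destination (e.g. with SQL or dbt models).
print(warehouse)  # [{'amount': 12.5}, {'amount': 7.25}]
```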

I do get the feeling that throwing these terms around might make you appear condescending, and might only confuse the recipient of the message and get them questioning their involvement in their next data project. ;-) It’s a given that ALL data is important – some even say it’s the “mother lode” of the company – thus it only benefits all of us to be on the same page with terminology and concepts.

Without further ado – here are the researched forms of the terms, with links to the definitions:

Data warehouse: “a central repository of information that can be analyzed to make more informed decisions.” https://aws.amazon.com/data-warehouse/ No surprise there – the catch is that a data warehouse appears to apply only to STRUCTURED data; it’s not a place we can just dump any type of data into.

Data lake: “centralized repository for all data, including structured, semi-structured, and unstructured” https://aws.amazon.com/data-warehouse/ This is the architecture we seem to be gravitating towards. The availability, resiliency, and relatively low cost of cloud storage let organizations start their data analysis without too much ceremony.

Data mesh: “Data mesh is an architectural pattern for implementing enterprise data platforms in large, complex organizations. It helps scale analytics adoption beyond a single platform and a single implementation team.” https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/cloud-scale-analytics/architectures/what-is-data-mesh Ah ha, so it means the data isn’t centralized and it’s out of control. It’s all chaos! No, seriously: each business unit has its own needs, so you shouldn’t shoehorn the organization’s datasets into one set of rules or one format. Maybe one team prefers Tableau and another prefers QuickSight. On that note, we might see one business unit using dbt while another prefers Prefect for orchestration.
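
One way I picture the mesh idea in code: each domain owns and publishes its own “data product” behind a common, minimal contract. This is just an illustrative sketch; the class, domain, and product names are all invented:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DataProduct:
    """A domain-owned dataset published behind a common contract."""
    domain: str                      # owning business unit
    name: str                        # product name within the domain
    read: Callable[[], list[dict]]   # however the domain serves its data

# Each domain implements `read` with its own stack (dbt, Prefect, Spark, ...).
catalog = [
    DataProduct("finance", "monthly_revenue",
                lambda: [{"month": "2024-06", "revenue": 1_000_000}]),
    DataProduct("marketing", "campaign_clicks",
                lambda: [{"campaign": "spring", "clicks": 42}]),
]

# Consumers discover products through the shared catalog, not a central team.
for product in catalog:
    print(product.domain, product.name, product.read())
```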

Data fabric: “Data fabric is an architecture that facilitates the end-to-end integration of various data pipelines and cloud environments through the use of intelligent and automated systems.” https://www.ibm.com/topics/data-fabric This is not what I thought it meant. Basically, organizations’ data grows exponentially, and this architecture lets teams make that data accessible using a common set of tools to extract/read it. I can imagine a team (or teams) building, say, Tableau or Power BI views to analyze the data.
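
A rough way to picture the “common access layer” part of a fabric: a catalog that maps logical dataset names to wherever they physically live, so consumers never hard-code locations. Everything below is a hypothetical sketch:

```python
# Hypothetical catalog mapping logical names to physical locations.
CATALOG = {
    "sales.orders": {"system": "warehouse", "location": "analytics.orders"},
    "web.clickstream": {"system": "lake", "location": "s3://acme-lake/raw/clicks/"},
}

def resolve(dataset: str) -> dict:
    """Return the physical location for a logical dataset name."""
    try:
        return CATALOG[dataset]
    except KeyError:
        raise LookupError(f"unknown dataset: {dataset}") from None

# Consumers ask for data by name; the fabric decides where it actually lives.
print(resolve("sales.orders"))
print(resolve("web.clickstream"))
```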

Having a data pipeline in place is an approach to getting your data into a useful state – there’ll be some massaging, some enrichment, modeling, and analytics done as the data traverses your data store(s) until it gets monetized.