Snowflake continues to set the standard for data in the cloud by removing the need to perform maintenance tasks on your data platform and giving you the freedom to choose your data modeling methodology. There are scenarios, however, where you need to consider transaction isolation, and one such scenario exists for Data Vault: the common hub table.
This post is number 8 in our “Data Vault Techniques on Snowflake” series.
We explored the concept of passive integration in the previous blog post. Integrating by, and concurrently loading to, common hub tables defined by the business object is the ideal situation for a Data Vault model. But with Snowflake using the READ COMMITTED transaction isolation level, how do we guarantee that independent hub-table loaders leave the target hub table with the same integrity after each load? Let’s explore how this is done in Snowflake and what it could mean for your Data Vault model.
As illustrated above, any landed source data is staged and loaded to the modeled hub, link, and satellite tables.
Notice how each source loads to a common hub table in the example above. Each hub-loader must leave the hub table with the same integrity it found it in: a unique list of business objects defined by business key, business key collision code, and multi-tenant ID.
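As an illustrative sketch (the table and column names here are assumptions, reused in the examples that follow), such a hub table might look like this:

create table if not exists HUB_TABLE (
    HUB_KEY            binary(20)    not null,  -- surrogate hash key over business key, collision code, and tenant id
    BUSINESS_KEY       varchar       not null,
    BK_COLLISION_CODE  varchar       not null,  -- business key collision code
    TENANT_ID          varchar       not null,  -- multi-tenant id
    LOAD_DATE          timestamp_ltz not null,
    RECORD_SOURCE      varchar       not null
);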
The problem is the same as the one described in blog post 6, Conditional Multi-Table INSERT, and Where to Use It. If multiple threads attempt to load the same target hub table at the exact same time then, due to the nature of READ COMMITTED transaction isolation, each thread sees the target table without the uncommitted transactions of the other threads. See an illustration of this below:
Duplicates break the integrity of the Data Vault model, and you will start to see the information marts, and by association the data structures built on them, suffer as well.
Snowflake does in fact allow for table locking, and the change is simple: convert the SQL INSERT statement into a SQL MERGE statement, and the target hub table defined in the MERGE statement will be locked when it is that statement’s turn to update the table. Of course, this implies that the hub-loader template changes, and all configured hub-loaders will then use the SQL MERGE INTO statement instead.
The existing hub-loader template is an INSERT with an anti-join (the column list is illustrative; HUB_KEY is the hub’s surrogate hash key):

insert into HUB_TABLE
select distinct
       stg.HUB_KEY, stg.BUSINESS_KEY, stg.BK_COLLISION_CODE,
       stg.TENANT_ID, stg.LOAD_DATE, stg.RECORD_SOURCE
from STAGED stg
where not exists
      (select 1
       from HUB_TABLE h
       where stg.HUB_KEY = h.HUB_KEY);
Converted into a MERGE statement, the same hub-loader becomes:

merge into HUB_TABLE h
using (select distinct
              HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE,
              TENANT_ID, LOAD_DATE, RECORD_SOURCE
       from STAGED) stg
on h.HUB_KEY = stg.HUB_KEY
when not matched then
     insert (HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE, TENANT_ID, LOAD_DATE, RECORD_SOURCE)
     values (stg.HUB_KEY, stg.BUSINESS_KEY, stg.BK_COLLISION_CODE,
             stg.TENANT_ID, stg.LOAD_DATE, stg.RECORD_SOURCE);
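Functionally, the WHEN NOT MATCHED branch inserts exactly the rows the NOT EXISTS anti-join would have; the difference is that the MERGE serializes concurrent hub-loaders writing to the same hub table.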
Three SQL data manipulation language (DML) statements lock Snowflake tables for updates: UPDATE, DELETE, and MERGE.
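If you want to observe this behavior for yourself, a quick sketch (not from the original post; it reuses the illustrative table and column names above) is to hold one hub-loader MERGE open in an explicit transaction while a second session runs the same MERGE:

-- Session 1: the MERGE acquires the lock on HUB_TABLE and holds it
-- until the transaction commits.
begin;
merge into HUB_TABLE h
using (select distinct HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE,
              TENANT_ID, LOAD_DATE, RECORD_SOURCE
       from STAGED) stg
on h.HUB_KEY = stg.HUB_KEY
when not matched then
     insert (HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE, TENANT_ID, LOAD_DATE, RECORD_SOURCE)
     values (stg.HUB_KEY, stg.BUSINESS_KEY, stg.BK_COLLISION_CODE,
             stg.TENANT_ID, stg.LOAD_DATE, stg.RECORD_SOURCE);

-- Session 2 (same user): running the same MERGE now queues behind the lock;
-- SHOW LOCKS lists the held and waiting locks until Session 1 commits.
show locks;

-- Back in Session 1: releasing the lock lets Session 2 proceed.
commit;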
In the grand scheme of automation, our earlier animated example can now be updated like this:
Does this approach add latency? Probably, but it should be minor because hub tables typically only process a few condensed records at a time using anti-semi joins. The beauty of this approach is that we did not need to explicitly lock the target table using externally defined semaphores. Instead, Snowflake randomly decides which thread will get processed first, and for hub tables that works just fine!
SQL MERGE statements allow for SQL INSERTs and UPDATEs, offering an opportunity to explore a Data Vault artifact that was previously deprecated: the “last seen date” column in hub and link tables. Because hub and link tables are wafer thin and tend to be short, the number of micro-partitions that make up these tables is very small. Why not allow SQL UPDATEs to the hub and link tables?
Yes, SQL UPDATEs are expensive operations, and this is also true for satellite tables, where we might see anywhere from a single descriptive column to hundreds of descriptive columns; SQL UPDATEs are still not recommended there. Let’s update our previous example to show where a last seen date can be useful:
Without a last seen date, the hub-loader only inserts new keys:

merge into HUB_TABLE h
using (select distinct
              HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE,
              TENANT_ID, LOAD_DATE, RECORD_SOURCE
       from STAGED) stg
on h.HUB_KEY = stg.HUB_KEY
when not matched then
     insert (HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE, TENANT_ID, LOAD_DATE, RECORD_SOURCE)
     values (stg.HUB_KEY, stg.BUSINESS_KEY, stg.BK_COLLISION_CODE,
             stg.TENANT_ID, stg.LOAD_DATE, stg.RECORD_SOURCE);
With a last seen date, a matched key refreshes the column instead of being ignored:

merge into HUB_TABLE h
using (select distinct
              HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE,
              TENANT_ID, LOAD_DATE, RECORD_SOURCE, LAST_SEEN_DATE
       from STAGED) stg
on h.HUB_KEY = stg.HUB_KEY
when matched then
     update set h.LAST_SEEN_DATE = stg.LAST_SEEN_DATE
when not matched then
     insert (HUB_KEY, BUSINESS_KEY, BK_COLLISION_CODE, TENANT_ID,
             LOAD_DATE, RECORD_SOURCE, LAST_SEEN_DATE)
     values (stg.HUB_KEY, stg.BUSINESS_KEY, stg.BK_COLLISION_CODE, stg.TENANT_ID,
             stg.LOAD_DATE, stg.RECORD_SOURCE, stg.LAST_SEEN_DATE);
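The only difference between the two templates is the WHEN MATCHED branch: the first leaves existing hub rows untouched, while the second rewrites the micro-partitions containing the matched keys in order to refresh LAST_SEEN_DATE each time the business object reappears.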
Certain other Data Vault artifacts were developed to track this very information as INSERT-ONLY table structures. These are:
Because “last seen date” only records the latest occurrence of the business object or unit-of-work, it is not a reliable source to check the following:
If the data source does not contain a business date tracking the effectivity of the driving entity of the relationship, or if you wish to track the effectivity of a different driving entity than the one tracked in the data source, then there is a need for an effectivity satellite. A LAST_SEEN_DATE column in the link table, however, will give you the current relationship without needing one of the most complex Data Vault patterns. Let’s explore this further using the following example:
No matter what the driving key is, utilizing the LAST_SEEN_DATE will give you the current active relationship for that driving key. It also does not require that you deploy multiple effectivity satellites for each driving key you want to track on a single link table. However, you will not be able to trace the historical movement of the driving-to-non-driving-key relationship; that remains the exclusive realm of the effectivity satellite, unless (again) the source system provides this movement.
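As a minimal sketch (the link table and column names are assumptions), the current relationship per driving key can be read straight from the link table with a window function, rather than from an effectivity satellite:

-- Latest-seen relationship per driving key; earlier pairings for the same
-- driving key are filtered out by the window function.
select DRIVING_KEY,
       NON_DRIVING_KEY,
       LAST_SEEN_DATE
from LINK_TABLE
qualify row_number() over (partition by DRIVING_KEY
                           order by LAST_SEEN_DATE desc) = 1;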
Always test these scenarios for yourself! The idea behind this technique is to take advantage of what Snowflake technology offers and keep Data Vault automation single purpose.
As quoted from the Zen of Python: “There should be one—and preferably only one—obvious way to do it.”
Leverage Snowflake technology to ensure you meet your automation goals.