A Hybrid Analytic Schema Using Snowflake DB and Looker — Chapter 5

Bring on the Modern — a ‘Hybrid’ Reporting Schema in Snowflake DB

Jeff Shukis
9 min read · Mar 17, 2021

In the previous chapters we iteratively extended an analytic schema to the point that we lost much of the understandability of a star schema design while simultaneously eroding away our database performance. Now it is time to do it right, using a novel style of database schema in Snowflake DB.

Our first realization is that we are not limited to the features of a “standard” SQL database. Cloud-scale analytic databases — and there are several of them, from Amazon Redshift to Google BigQuery — all offer unique features for analytics, at the quite reasonable cost of moving to a more specialized database schema. For our example we are going to use Snowflake. Stop here and do some Internet research if Snowflake is new to you. In brief, it is a cloud-hosted SQL analytic database with some important features, including:

  1. The usual benefits you expect from a clustered columnar analytic database with compression and a dash of partitioning — great speed and efficiency at scale for typical analytic workloads. Plus largely no database maintenance stuff to worry about.
  2. The ability to load JSON and other “unstructured” data formats VERY easily.
  3. The ability to query JSON very efficiently — often as efficiently as columnar data. The importance of this cannot be overstated. It’s like getting most of the benefits of a Data Lake and a Hadoop-ish cluster without the added complexity.
  4. The ability to create a View over unstructured data so that it looks like nice rows and columns to your BI tool. It is almost magical.
  5. Materialized Views that are never stale, plus the very beginnings of automatic query rewrite to transparently utilize those Materialized Views. Snowflake query rewrite is pretty basic right now, but is extremely helpful when you can leverage it.

Let’s build up our “Modern” schema step by step. I ask your patience here since it’ll take a while. But I also promise that you’ll learn some good stuff at every step.

JSON Storage in Snowflake

I said earlier that Snowflake loves JSON. As proof, we will create a Staging Table that contains exactly one column, into which we can load our giant JSON call documents, one row per call. Our table looks like this:

CREATE TABLE MYDB.MYSCHEMA.CALL_STAGING (
    CALL_JSON VARIANT
);

That’s it, you say? Yes! This is a one-column table, and the data type is… VARIANT, a Snowflake data type used to store semi-structured data, whether JSON, Parquet, ORC, Avro, or XML. Into this table and its single column we will insert one row per call, with the VARIANT column containing the entire JSON block for the call (see the example call JSON we showed in an earlier chapter). Snowflake also offers Snowpipe, an impressive tool for quick, automated data loading, especially if your source data is in S3, one of the Azure storage products, or (with significant limitations right now) Google Cloud Storage, but Snowpipe is a separate topic. For now let’s just say that we have our JSON loading pipeline running and are populating our single-column table with oodles of new calls as they become available. So now what?
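If you are curious what a basic bulk load looks like without Snowpipe, here is a minimal sketch using COPY INTO. The stage name MY_CALL_STAGE and the file format name are assumptions for illustration, not part of the pipeline described in this series:

// A minimal sketch of a bulk load, assuming an existing stage named MY_CALL_STAGE
// that holds one JSON call document per file
CREATE FILE FORMAT IF NOT EXISTS MYDB.MYSCHEMA.CALL_JSON_FORMAT TYPE = 'JSON';

COPY INTO MYDB.MYSCHEMA.CALL_STAGING
FROM @MYDB.MYSCHEMA.MY_CALL_STAGE
FILE_FORMAT = (FORMAT_NAME = 'MYDB.MYSCHEMA.CALL_JSON_FORMAT');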

JSON Query in Snowflake

We said that querying JSON in Snowflake is easy, and performs well, so let’s prove it. With Snowflake storing our calls in a VARIANT column as above, we can query our JSON like this:

// Show me some JSON!
select call_json from MYDB.MYSCHEMA.CALL_STAGING limit 100;

What we get is one row per call, with the column CALL_JSON containing a large JSON document for each row, as expected. But that’s not the interesting part. Snowflake lets us use “dot notation” to query the individual JSON elements like this:

// Let's use Snowflake dot notation to project JSON attributes to columns
// for easier reporting, using <column>:<json path>::<target data type>
SELECT
    CALL_JSON:metadata.Agent.Username::varchar AS agent_name,
    CALL_JSON:tags.AskProbingQuestions::int AS TAG_AskProbingQuestions,
    CALL_JSON:metrics.agent_overtalk_ratio::number AS agent_overtalk_ratio
FROM
    MYDB.MYSCHEMA.CALL_STAGING
LIMIT 1000;

While simple, the above query is very important. We have used “schema on read” techniques within a SQL database, turning our JSON call data into rows and columns dynamically. Snowflake lets us use “dot notation” to address the individual JSON elements, and lets us cast the results from quoted strings (the default for JSON data) to a specific data type using the “::<datatype>” notation. In the query above we cast the AskProbingQuestions tag to an integer since we know its JSON values are zero or one, and we cast the agent_overtalk_ratio attribute to a number. We also cast the agent name to a varchar, which simply removes the quotes.
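As a quick illustration of that casting behavior (a sketch against the same staging table, using the Username path from the query above; the sample value is made up), an uncast JSON string comes back with its quotes, while ::varchar strips them:

// Without a cast a JSON string comes back quoted; with ::varchar the quotes are gone
SELECT
    CALL_JSON:metadata.Agent.Username          AS username_variant,  -- "jsmith" (hypothetical value)
    CALL_JSON:metadata.Agent.Username::varchar AS username_string    -- jsmith
FROM MYDB.MYSCHEMA.CALL_STAGING
LIMIT 5;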

And finally, we can also use this same dot-notation method to define a database View that more persistently projects the JSON attributes as columns:

CREATE SECURE VIEW MYDB.MYSCHEMA.SIMPLE_JSON_PROJECTION AS
SELECT
    CALL_JSON:metadata.Agent.Username::varchar AS agent_name,
    CALL_JSON:tags.AskProbingQuestions::int AS TAG_AskProbingQuestions,
    CALL_JSON:metrics.agent_overtalk_ratio::number AS agent_overtalk_ratio
FROM
    CALL_STAGING;

Our single JSON column now looks like multiple columns to any SQL client. Nice.
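For example, a simple per-agent rollup against the view (a hypothetical query, using only the columns defined above) needs no JSON syntax at all:

// The view hides all of the JSON plumbing from the SQL client
SELECT agent_name, AVG(agent_overtalk_ratio) AS avg_overtalk_ratio
FROM MYDB.MYSCHEMA.SIMPLE_JSON_PROJECTION
GROUP BY agent_name;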

Another very important feature of Snowflake’s JSON query implementation is that if our SQL asks for a JSON element that does not exist for a call or for any calls, we get a NULL value as opposed to getting a SQL error. This behavior is very important for us, since it nicely handles our “sparse” data structure where a given data element won’t necessarily exist for every call.
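Here is a quick sketch of that behavior, using a made-up tag name that presumably exists on no call. The raw projection returns NULL, and COALESCE can supply a default where we want one:

// A path that exists on no call yields NULL rather than an error
SELECT
    CALL_JSON:tags.SomeTagWeHaveNotYetDefined::int              AS tag_raw,          -- NULL for every row
    COALESCE(CALL_JSON:tags.SomeTagWeHaveNotYetDefined::int, 0) AS tag_with_default  -- 0 instead of NULL
FROM MYDB.MYSCHEMA.CALL_STAGING
LIMIT 10;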

But is querying JSON via SQL performant? In most other databases it is not. Apache Drill and its cousin Amazon Athena both allow SQL over JSON data, but nobody would call either of them fast and efficient at raw data access. And while most SQL databases now allow SQL queries over JSON data, they almost always impose a large performance penalty. Snowflake really shines in this regard. Behind the scenes, JSON in a VARIANT column is “automagically” decomposed and stored internally as individual columns. There are some exceptions to this automatic columnarization (attributes with mixed data types and attributes holding actual JSON null values, for example), but in general it just works. So let’s prove it.

We will start by running a query that forces Snowflake to retrieve ALL of the data in the table:

// Grab all of the JSON from every row and aggregate it into a single output row
SELECT hash_agg(CALL_JSON) from MYDB.MYSCHEMA.CALL_STAGING;

In my smallish test database this query retrieves 680GB (roughly 0.7TB) from storage, which represents around 1 Billion calls. Now let’s query again, but ask for only a single JSON attribute from each call by using Snowflake’s SQL ‘dot notation’ described earlier. If our database is being inefficient (looking at you, MongoDB) then we’ll have to retrieve ALL 680GB of JSON from disk, and then sort through it to find the single attribute we requested. But if Snowflake really is that smart, then we’ll see far less IO:

SELECT hash_agg(call_json:metadata.Agent.CustomerHoldDuration) FROM MYDB.MYSCHEMA.CALL_STAGING;

This query retrieves all rows but only one JSON data element, and scans only 7.4GB worth of data — a nearly 99% savings in IO. So Snowflake JSON query really is efficient — because our data is secretly being stored as columns. In other testing I have found that Snowflake is able to query JSON just as quickly as the same data modeled as separate columns except in a few cases — querying JSON arrays and JSON time representations that needed to be cast to timestamps for example. But even in those cases Snowflake still never needed to retrieve ‘all’ of the JSON to perform a query, so performance was still surprisingly good compared to other databases.
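If you want to check the scan numbers yourself, the query profile in the Snowflake UI reports bytes scanned, or you can pull the same figure from the QUERY_HISTORY table function. A sketch, filtering on the hash_agg text used above:

// Check how much data recent queries actually scanned
SELECT query_text, bytes_scanned
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
WHERE query_text ILIKE '%hash_agg%'
ORDER BY start_time DESC
LIMIT 5;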

Snowflake Summary

To summarize, using Snowflake DB we can (1) Store complex semi-structured JSON in a single column, so that JSON data structure changes do not require DB schema changes, (2) Efficiently query JSON using SQL, (3) Use views to present JSON as Rows and Columns to our BI tools and, (4) Smoothly handle queries against JSON attributes that do not exist, returning nulls instead of errors.

Looking Ahead: So How Does This Help Us?

We are working to create an analytic schema that maintains understandability while avoiding the need to break down each call into hundreds or thousands of rows. The ability to dynamically project JSON attributes as columns is a key part of the solution — and we’ll see exactly how later. But I am guessing that you are seeing the benefit already.

Our Snowflake “Hybrid” Schema

With Snowflake’s really useful JSON handling in our toolbox, we can now build V3 of our analytic schema — the Snowflake-specific hybrid schema:

[Schema diagrams: Physical Schema and Database Views Schema]

Query Performance Results

Database: Snowflake “XL” cluster
Data: 100 Million calls, 100 million FACT_CALL_HYBRID rows
Clustering: by date and time
Average large analytic Scorecard query: ~2 seconds

As you can see, our excellent query performance is back, which is not surprising since we are back to having just one fact table row per call. But because our new hybrid schema leverages Snowflake’s JSON features, we have ALL of our data elements available in the table, presented as separate columns for reporting purposes, with clean handling of our sparse data and of schema changes over time. And because our physical schema is still a standard star schema centered on one fact table, we can still use the remaining database features for further improvements (materialized view aggregations, fancier clustering schemes, etc.), since we haven’t already “used up” those features fighting other performance problems.
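To make that concrete, here is a sketch of the kind of headroom this leaves us. FACT_CALL_HYBRID is the fact table from the results above; the CALL_DATE column and the daily rollup are illustrative assumptions rather than part of the schema described so far:

// Illustrative only: cluster the fact table by date, and pre-aggregate a common rollup
ALTER TABLE MYDB.MYSCHEMA.FACT_CALL_HYBRID CLUSTER BY (CALL_DATE);

CREATE MATERIALIZED VIEW MYDB.MYSCHEMA.MV_CALLS_PER_DAY AS
SELECT CALL_DATE, COUNT(*) AS CALL_COUNT
FROM MYDB.MYSCHEMA.FACT_CALL_HYBRID
GROUP BY CALL_DATE;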

But Schema Maintenance?

If you are thinking ahead, you may have already found a potential problem: While you can use Snowflake features to present your loosely structured JSON call data as nice rows and columns for easy reporting, every time a new Category or Metadata field appears, or a Scorecard changes, you will need to rebuild the associated database views, which could easily become a serious burden.

Bring on the View Automation

But there is a solution. We can automate the maintenance of the database views that we use to present our JSON attributes as rows and columns. In fact, it’s quite easy to do. Described as a step-by-step procedure, our approach looks like this:

  1. Inspect the JSON in our database table and generate a list of all unique JSON attributes and their inferred data types (see the sketch after this list). We really only need to inspect the rows that are new or changed since our last inspection, which is easy enough, but even a naïve solution that reads everything takes less than a minute for a database of a Billion rows, so the script we present here is the simple one.
  2. Use the JSON attribute inspection results from above to update the database views, adding new attributes as new columns and removing attributes that are no longer present.
  3. Optionally consult a “dictionary” where we can define fancier behaviors, like data type overrides and explicit rules for handling attributes that don’t yet exist or have been deleted; an exercise that is easy enough but is beyond the scope of this example.
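Here is a minimal sketch of the inspection in step 1, limited to the first-level attributes under the categories object. LATERAL FLATTEN walks the JSON and TYPEOF reports the inferred type; the stored procedure introduced in the next section does a more complete version of the same idea:

// Step 1 sketch: list each distinct attribute under "categories", its inferred type, and how often it appears
SELECT
    f.key           AS attribute_name,
    TYPEOF(f.value) AS inferred_type,
    COUNT(*)        AS occurrences
FROM MYDB.MYSCHEMA.CALL_STAGING,
     LATERAL FLATTEN(input => CALL_JSON:categories) f
GROUP BY 1, 2
ORDER BY 1, 2;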

View Automation Via Stored Procedure

We now need code to implement the JSON inspection algorithm described above. Since our source data is in Snowflake, using a Stored Procedure makes a lot of sense. You will likely want to write your own stored procedure to implement your own customized view automation logic, but for testing purposes, or as a starting point for your own code, you can’t do better than to start with the excellent work from Snowflake’s Craig Warman. https://www.snowflake.com/blog/authors/craig-warman/. You can find his blog article, and his stored procedure, here: https://www.snowflake.com/blog/automating-snowflakes-semi-structured-json-data-handling-part-2/. My version is highly customized, and includes the dictionary feature mentioned, but for the rest of this example we’ll use Craig’s code so that you can follow along in detail.

To create a new view CALL_VIEW_ALL_ATTRIBUTES from our example staging table CALL_STAGING, we can run Craig’s stored procedure like this:

call create_view_over_json('mydb.myschema.call_staging', 'call_json', 'mydb.myschema.call_view_all_attributes', 'match col case', 'match datatypes');

The resulting new database view has one row per “call” as before, but instead of just one column it has one column for every JSON attribute — a wide table. And now a reminder that wide tables are not really a bad thing in a columnar database. We could also create additional, more targeted views using the same script. For example, to create a view that exposes only the Categories in our call JSON, we could run:

call create_view_over_json('mydb.myschema.call_staging', 'call_json:categories', 'mydb.myschema.call_view_categories_only', 'match col case', 'match datatypes');

The difference in this version is that instead of referencing the entire column containing all of our JSON, we use the Snowflake JSON dot notation to reference just the categories object within the JSON (call_json:categories).

The resulting SQL view definition — generated and run by the stored procedure — might look something like the below, with one column per possible Category:

CREATE OR REPLACE VIEW mydb.myschema.call_view_categories_only AS
SELECT
    call_json:categories:"AskProbingQuestions"::FLOAT AS "AskProbingQuestions",
    call_json:categories:"ValueProposition"::FLOAT AS "ValueProposition",
    call_json:categories:"GreetCaller"::FLOAT AS "GreetCaller",
    call_json:categories:"TakeOwnership"::FLOAT AS "TakeOwnership",
    call_json:categories:"Cancellation"::FLOAT AS "Cancellation",
    call_json:categories:"Complaints"::FLOAT AS "Complaints",
    call_json:categories:"EmpathyOnDissatisfaction"::FLOAT AS "EmpathyOnDissatisfaction",
    call_json:categories:"RecordingDisclosure"::FLOAT AS "RecordingDisclosure"
FROM mydb.myschema.call_staging;

Note that the “0” and “1” values from our JSON, which we know are integers, are interpreted as floating point values by Craig’s script. For production we modify this behavior by enhancing the script to match our needs, or by using the Dictionary Table hinted at earlier.
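As a rough illustration of that dictionary idea (the table and column names here are hypothetical, not part of Craig’s script), the view-generation code can consult a small control table before emitting each column:

// Hypothetical "dictionary" control table for the view generator
CREATE TABLE IF NOT EXISTS MYDB.MYSCHEMA.VIEW_DICTIONARY (
    JSON_PATH          VARCHAR,  -- e.g. 'categories.AskProbingQuestions'
    DATA_TYPE_OVERRIDE VARCHAR,  -- e.g. 'INT' instead of the inferred FLOAT
    ALWAYS_INCLUDE     BOOLEAN   -- emit the column even when no call has the attribute yet
);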

So now we have a relatively simple and performant process for keeping our Snowflake SQL views up to date with the JSON data structure as it changes. This allows us to use a “wide table” style of schema without having to constantly manage physical columns, and lets us maintain “one row per call” query performance even with a thousand or more attributes per call.

Next Chapter: Chapter Six — Looker BI
