A Hybrid Analytic Schema Using Snowflake DB and Looker — Chapter 3

Star schema goodness for basic reporting

Jeff Shukis
6 min read · Mar 16, 2021

This is Chapter 3 of the series. Chapter 1 introduced our topic. In Chapter 2 we described the Contact Center Analytics use case and the data available for each call. Now we will create our first analytic database schema for calls. This is not the Snowflake Hybrid schema promised in the title — that comes in Chapter Five. This is the more typical schema you’d likely create instead, and will help illustrate how the typical approach fails to evolve and scale.

Star Schema V1 — Basic Call Reporting

Knowing nothing more than the JSON data structure from Chapter 2, a DBA could easily bang out a star schema for call-level reporting, and such a schema could power dozens of the most fundamental Contact Center reports — call volume, call length, hold times, utilization, and so on — all sliced by date, time of day, day of week, agent, manager, contact center, queue, and more. The schema might look like this:

Analytic DB Star Schema V1

To elaborate just a bit, imagine the Agent dimension (the DIM_AGENT table) with fields including agent skills and tenure, the agent’s manager, the team they work within, the contact center location, and so on — lots of good slice and dice attributes. The caller dimension might be even larger, including information about their purchase history, lifetime value, demographic and sociographic attributes, and more, potentially thousands of attributes.
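
As a concrete sketch of that design (table and column names here are illustrative assumptions, not the project's actual DDL), the V1 tables might be declared in Snowflake roughly as follows:

    -- Dimensions: smaller tables full of slice-and-dice attributes
    CREATE TABLE DIM_AGENT (
        AGENT_KEY       NUMBER PRIMARY KEY,
        AGENT_NAME      VARCHAR,
        MANAGER_NAME    VARCHAR,
        TEAM            VARCHAR,
        CONTACT_CENTER  VARCHAR,
        SKILLS          VARCHAR,
        TENURE_MONTHS   NUMBER
    );

    CREATE TABLE DIM_CALLER (
        CALLER_KEY      NUMBER PRIMARY KEY,
        LIFETIME_VALUE  NUMBER(12,2),
        SEGMENT         VARCHAR
        -- ...potentially thousands more attributes
    );

    -- One fact row per call; clustering by date and time matches
    -- the benchmark setup described below.
    CREATE TABLE FACT_CALL (
        CALL_ID         VARCHAR,
        CALL_DATE       DATE,
        CALL_TIME       TIME,
        AGENT_KEY       NUMBER,
        CALLER_KEY      NUMBER,
        QUEUE_KEY       NUMBER,
        CALL_SECONDS    NUMBER,
        HOLD_SECONDS    NUMBER
    )
    CLUSTER BY (CALL_DATE, CALL_TIME);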

We’ve just started the project and we already have some great data. And since we’ve read the books, we already know how we will extend the Star Schema later — by adding facts to the fact table, attributes to the dimension tables, and even adding new dimensions if we need to.

And the good news keeps rolling in: This is classic star schema design, good to Billions of records with very little skill required. There are no performance problems in sight so far.

V1 Schema Query Performance Results

Database: Snowflake “XL” cluster
Data: 100 million fact rows (100 million calls), dimensions are <10K rows except DIM_CALLER with ~10 million rows
Clustering: by date and time
Analytic query: ~2 seconds
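
The exact benchmark SQL isn't shown here, but a representative slice-and-dice aggregate against the sketch above looks something like this:

    -- Call volume and handle time by contact center and day, for one month
    SELECT
        a.CONTACT_CENTER,
        f.CALL_DATE,
        COUNT(*)            AS CALL_COUNT,
        AVG(f.CALL_SECONDS) AS AVG_HANDLE_SECONDS,
        AVG(f.HOLD_SECONDS) AS AVG_HOLD_SECONDS
    FROM FACT_CALL f
    JOIN DIM_AGENT a ON a.AGENT_KEY = f.AGENT_KEY
    WHERE f.CALL_DATE BETWEEN '2021-01-01' AND '2021-01-31'
    GROUP BY 1, 2
    ORDER BY 1, 2;

On the V1 schema this touches one row per call and prunes well on CALL_DATE, which is why it comes back in a couple of seconds.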

We are getting very good performance so far, but our schema does not yet include ALL of the attributes from our example call JSON, so let's extend the schema with that additional data, starting with Scorecards.

Star Schema V2 — Scorecards

Scorecards are used in the Contact Center to evaluate Agent performance, track the efficacy of call scripts, and more. They are also used outside of the Contact Center for compliance monitoring, lead scoring, and to group and tag calls for Marketing analytics. They are in fact a critical part of any contact center analytics project. In a generalized Scorecard model, each call may be “evaluated” for zero or more scorecards, with each scorecard consisting of a number of questions. Each question, when evaluated, results in an Answer (Boolean yes/no, or less commonly a numeric score) for that question for that call. Looking at exactly one call, this is a relatively simple data set, easily represented in JSON — one data element per question. But not all calls are evaluated for all Scorecards, and the Scorecards and their Questions change frequently over time — resulting in sparse data and a great deal of “schema evolution” in the database as a whole. Thus arises the complexity.

Below is our example call JSON again, showing just the (simplified) Scorecard data for the call, with four Scorecards total, each having one to five questions:

...
"Scorecards": {
"Script Adherence": {
"greeting": 100,
"identification": 100,
"discovery": 90,
"verification": 65,
"wrap Up": 50
},
"Objection Handling": {
"price objection observed": 1,
"price objection handled": 0
},
"Lead Scoring": {
"initial lead score": 15
},
"Compliance": {
"required disclosures": 100,
"affirmative acknowledgement": 100
}
},
...

You might be tempted to model Scores as additional facts, with one fact for each possible Question. The ten questions above would map to ten new facts in the fact table. But then your physical schema and your ETL would need to change every time you changed a Scorecard — which the business will do constantly. So instead, you lower the grain of your analysis (no surprise here) and create a Scorecard dimension and a new fact table with one row per question instead of one row per call. Your more granular and therefore more flexible schema now perhaps looks like this:

V2 Analytic Star Schema
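
In Snowflake terms (again with illustrative, assumed names), the V2 change boils down to a small Question dimension plus a much taller fact table:

    -- Questions and Scorecards can now be added or retired as data,
    -- with no physical schema change.
    CREATE TABLE DIM_QUESTION (
        QUESTION_KEY    NUMBER PRIMARY KEY,
        SCORECARD_NAME  VARCHAR,   -- e.g. 'Script Adherence'
        QUESTION_NAME   VARCHAR    -- e.g. 'greeting'
    );

    -- One fact row per evaluated Question per call (~100 rows per call),
    -- clustered by date, time, and question as in the V2 benchmark below.
    CREATE TABLE FACT_SCORE (
        CALL_ID         VARCHAR,
        CALL_DATE       DATE,
        CALL_TIME       TIME,
        AGENT_KEY       NUMBER,
        CALLER_KEY      NUMBER,
        QUESTION_KEY    NUMBER,
        SCORE           NUMBER
    )
    CLUSTER BY (CALL_DATE, CALL_TIME, QUESTION_KEY);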

The new Scorecarding schema looks quite reasonable, doesn’t it? It’s still a nice star schema, but now we have one row per Question instead of one row per Call. We can easily handle sparse data, where not all calls are evaluated for all questions, and we can easily add or remove questions and scorecards over time. But if a call has 100 Scorecard Questions — and it might — we have just increased our row count (and thus our DB resource utilization) by 100x — two orders of magnitude. Luckily, databases these days are fast, so our query times will probably be “just fine” — and we can always use tricks like Materialized Views later to pre-aggregate for common reports. But we have already begun our journey into trading away simplicity and efficiency to gain flexibility. Spoiler alert: It gets worse.
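
As a sketch of that materialized-view escape hatch (Snowflake materialized views are an Enterprise Edition feature and are limited to a single table with a restricted set of aggregates, which a per-question daily rollup happens to fit):

    -- Hypothetical daily pre-aggregation for common Scorecard reports;
    -- Snowflake keeps it up to date as FACT_SCORE changes.
    CREATE MATERIALIZED VIEW MV_DAILY_QUESTION_SCORE AS
    SELECT
        CALL_DATE,
        QUESTION_KEY,
        COUNT(*)   AS ANSWER_COUNT,
        SUM(SCORE) AS SCORE_TOTAL   -- average = SCORE_TOTAL / ANSWER_COUNT
    FROM FACT_SCORE
    GROUP BY CALL_DATE, QUESTION_KEY;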

So how is our performance looking now?

V2 Schema Query Performance Results

Database: Snowflake “XL” cluster
Data: 100 million calls, transformed to 10 Billion FACT_SCORE rows (100 rows/call, with one row per Scorecard question)
Clustering: by date and time, and by scorecard question
Analytic Queries: ~45 seconds (up from ~2 seconds with the V1 schema)
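
To make the slowdown concrete, a representative V2 Scorecard report (again a sketch, not the exact benchmark query) looks like this:

    -- Average 'Script Adherence' scores by agent for one month.
    -- Only five Questions are wanted, but without traditional indexes the
    -- scan can still touch much of the 10-billion-row FACT_SCORE table.
    SELECT
        a.AGENT_NAME,
        q.QUESTION_NAME,
        AVG(s.SCORE) AS AVG_SCORE
    FROM FACT_SCORE s
    JOIN DIM_QUESTION q ON q.QUESTION_KEY = s.QUESTION_KEY
    JOIN DIM_AGENT    a ON a.AGENT_KEY    = s.AGENT_KEY
    WHERE q.SCORECARD_NAME = 'Script Adherence'
      AND s.CALL_DATE BETWEEN '2021-01-01' AND '2021-01-31'
    GROUP BY 1, 2
    ORDER BY 1, 2;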

Increasing the row count has increased our query times significantly — from two seconds to nearly a minute — which is worrying. Our core problem is that we now need to process ~100 rows per call instead of just one. Many reports need only a subset of the Scorecard questions, but since all of the question scores live in a single table, and since we do not have traditional indexes in our Snowflake DB, we still may need to sift through substantially all of the table to retrieve just the questions we need. We did alter our clustering strategy to optimize for Scorecard reporting, which helped somewhat, but we just don't have enough rows to make this partitioning approach efficient; we end up with too many micro-partitions, and our clustering doesn't improve query times as much as we might hope.

And these results don't illustrate our other big problem: To report on Scorecards alongside any other call attributes means a join between two large FACT tables, which in a distributed database usually means performance-destroying data shuffling over the network. To mitigate, we will probably end up duplicating all of the call-level facts from FACT_CALL into the new FACT_SCORE table — a small ETL headache and some storage cost, but nothing serious. Disaster avoided.
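
One way to sketch that mitigation, using a hypothetical STG_SCORES staging table and assumed column names, is to widen FACT_SCORE with the call-level measures at load time so Scorecard reports never join back to FACT_CALL:

    -- Carry the call-level measures on every score row (denormalization).
    ALTER TABLE FACT_SCORE ADD COLUMN CALL_SECONDS NUMBER;
    ALTER TABLE FACT_SCORE ADD COLUMN HOLD_SECONDS NUMBER;

    -- ETL: STG_SCORES is a hypothetical staging table of parsed Scorecard JSON.
    INSERT INTO FACT_SCORE
        (CALL_ID, CALL_DATE, CALL_TIME, AGENT_KEY, CALLER_KEY,
         QUESTION_KEY, SCORE, CALL_SECONDS, HOLD_SECONDS)
    SELECT
        c.CALL_ID, c.CALL_DATE, c.CALL_TIME, c.AGENT_KEY, c.CALLER_KEY,
        q.QUESTION_KEY, s.SCORE, c.CALL_SECONDS, c.HOLD_SECONDS
    FROM STG_SCORES s
    JOIN FACT_CALL    c ON c.CALL_ID        = s.CALL_ID
    JOIN DIM_QUESTION q ON q.SCORECARD_NAME = s.SCORECARD_NAME
                       AND q.QUESTION_NAME  = s.QUESTION_NAME;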

To summarize, we built a nice star schema and performance was great. Then we extended the schema as the business requested, going more granular, and query times went up ~20x. Performance would have cratered even more dramatically due to data-shuffling joins, but we mitigated that problem with a bit of data duplication.

Next Chapter

Chapter Four: Our Unified Star Schema
