The Importance of Granularity (…to Blend or not to Blend?)
The ‘Granularity’ (or ‘Grain’) of the data is the most important thing I consider in any data analysis. When I get a new data-set, my first question is: What is the Granularity? When designing a new (Tableau) visual model, my first consideration is: What Granularity do I need? When calculating a measure (or KPI), my first questions are: What is the Granularity? and What Granularity do I need? I’d go as far as saying "Understanding the Grain of your data is 99% of the solution"…
What is Granularity?
Granularity refers to the level of detail of the data stored in a table
…or in other words
Tableau Tip: You may have come across the term ‘Level of Detail’ in many Tableau blogs and articles. It’s worth noting that, as well as the Data Level of Detail (the definition above), Tableau also has a Viz Level of Detail (VizLoD), which is fundamental to understanding ‘How Tableau Thinks’. The VizLoD is the Level of Detail brought into a Tableau Sheet/Viz; it answers the question ‘What does a single Mark represent?’ (Colour and Shape also qualify as part of a Mark, so they add to the VizLoD). In the image below the VizLoD is Category, Sub-Category.
Using a worked example, I hope to show how understanding and considering Granularity will make your data-life that much easier (and hopefully more fun!), and we’ll take this knowledge to answer one of Tableau’s age-old questions…
“Should I Join, or should I Blend?”
So we’re going to use everyone’s favourite (fictional!) on-line retailer...the 'Tableau Superstore'. This is a very typical ‘transaction’ table, and below is a sample of the data:
So what is the 'Granularity' of this data? (aka What does one row represent?). Here we need to think about what are 'facts' and what are just 'attributes' of those 'facts'…We have
- [Order ID]
- [Customer Name]
- [Ship Date]
- [Order Date]
Well, [Order ID] is a candidate fact and the other 3 fields - upon inspection - are only attributes of this fact; they do not increase our Grain. For example, [Order ID] CA-2011-112326, will always be attached to the same [Customer Name], [Order Date] and [Ship Date]; so we only need the fact [Order ID] to account for this level.
We can, however, see that for [Order ID] CA-2011-112326 there are multiple rows, one for each of the [Product Name]s in that [Order ID]. As well as [Product Name] we have [Category] and [Sub-Category], but these are just attributes of [Product Name]. For example, [Product Name] ‘GBC Standard Plastic Binding Systems Combs’ will always be [Category] = ‘Office Supplies’ and [Sub-Category] = ‘Binders’… So the Granularity of this data is [Order ID], [Product Name]; these 2 fields define what one row represents.
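The grain check above can be done mechanically: if a set of fields really defines the grain, no combination of their values appears on more than one row. Here is a minimal sketch in Python using a made-up mini-sample of Superstore-style rows (the data and field values are illustrative, not the real data-set):

```python
# Verify a candidate grain: the key fields should uniquely identify every row.
from collections import Counter

rows = [  # hypothetical mini-sample of the transaction table
    {"Order ID": "CA-2011-112326", "Product Name": "Binding Combs", "Customer Name": "Gene Hale"},
    {"Order ID": "CA-2011-112326", "Product Name": "Phone Cord",    "Customer Name": "Gene Hale"},
    {"Order ID": "CA-2011-103800", "Product Name": "Message Book",  "Customer Name": "Darren Powers"},
]

def is_grain(rows, key_fields):
    """True if key_fields uniquely identify every row (i.e. they define the grain)."""
    counts = Counter(tuple(r[f] for f in key_fields) for r in rows)
    return max(counts.values()) == 1

print(is_grain(rows, ["Order ID"]))                  # False: multiple rows per order
print(is_grain(rows, ["Order ID", "Product Name"]))  # True: this is the grain
```

The same function also confirms the ‘attribute’ argument: adding [Customer Name] to the key would not change the result, because it never varies within an [Order ID].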
Tableau Tip: A very useful thing I always do in Tableau, once the ‘Grain’ has been determined, is to rename the ‘Number of Records’ field to represent this…so here I’d rename this field ‘Order_Product_Count’
I also have another data-set, and sample of which is shown below:
I’d like to include this data in my analysis. It would be a good 'opportunities indicator' to see where the company under-trades with respect to population…and also to see in which states it over-trades (…it’s in our nature to look for negatives to fix, but equally valuable is to look for the positives to repeat; if only we can answer 'Why are we doing better here?', we can then do more of it and replicate it elsewhere!).
Should we Join, or should we use a Data-Blend?
First let’s determine the Granularity of our population data-set. Hopefully an easy one here as I only have 'facts'…so the 'Grain' of this table is [State], [Year] (as each row represents the population for each state for a particular year).
So what would happen if we were to 'join' this table to our transaction table, using Tableau's 'Connect-to-data' UI? (I also added [Year] to the Superstore data so I could join on this field, and 'hid' the [Year] and [State] fields in the Population table):
So this will bring the population count, by state, against every row in my data. Let's start looking at our 'Sales per Population Index' calculation. Below I've brought in the fields we'll need:
Now I can't use SUM(Population): this would give the wrong value. As we have this figure against every row, the SUM would be the population for each state multiplied by the number of rows for that state (remember each row is an Order/Product combination). I've got round this by using MIN (MAX or AVG would equally have done), and I can create my 'Sales Per Population Index' as
(I've multiplied by 1000, as the number is so small, but we are only using this as an index...the actual value is not important, just how it varies relative to the other states.) However, this calculation fails for my Grand Total...and you can see why. Grand Totals in Tableau are calculated at the 'Level above the Viz', so it's just taking the MIN of population across all the states. Now say I actually want to see if there is a 'Regional' pattern before looking at the state level, so I remove [State] from the view:
Oh dear...this is getting worse! Again Tableau is taking the minimum state population within each region, and using that to calculate our index. We could get round this by bringing [State] onto the Detail shelf, running the calculation by state as a table calculation, and then using a table-calculation trick to return only a single mark. But this is a level of complexity I don't want to get into, and I'd have to do it every time I want to use this measure (without [State] in my Viz); it's going to be hard work, inefficient...and generally not fun!
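To see concretely why the row-level join misbehaves, here is a sketch with made-up numbers: once population is joined onto every transaction row, SUM over-counts it, and MIN taken across a whole region just picks the region's smallest state.

```python
# Transaction rows after a row-level join of population on State + Year.
# All figures are illustrative, not the real Superstore or census values.
joined = [
    {"Region": "West", "State": "CA", "Sales": 100, "Population": 39_000_000},
    {"Region": "West", "State": "CA", "Sales": 200, "Population": 39_000_000},
    {"Region": "West", "State": "WA", "Sales": 150, "Population": 7_000_000},
]

# SUM(Population) for the West region counts California once per row:
sum_pop = sum(r["Population"] for r in joined)
print(sum_pop)  # 85_000_000 -- not the true regional total of 46_000_000

# MIN(Population) at region level returns only the smallest state:
min_pop = min(r["Population"] for r in joined)
print(min_pop)  # 7_000_000 -- Washington alone, ignoring California entirely
```

Neither aggregate gives the regional population of 46,000,000, which is exactly the failure we saw in the view above.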
OK, so what if we use a Data Blend?
Below is the same table (at regional level) with population brought in as a SUM, and the blend fields set to [State] and [Year] (we can see this by the little orange link symbol). If we had [State] and [Year] in the Viz (the VizLoD), Tableau would have picked these up automatically as blend fields, but as we don't have [State] we just need to click the 'blend symbol' next to it:
The calculation for our 'Sales per Population Index' is now
SUM([Sales])/SUM([Population-Tableau (Population by State)].[Population])*1000
...and as if by magic, everything, at every level, is working perfectly! Let's look at why...
What is Tableau doing when we Blend?
When we blend in Tableau, it creates an aggregated version of the measure we are bringing in from the secondary data source, with the level of aggregation set at the blend-field(s) level, and then joins this back to the primary data source (on those fields). This is also the reason you can only bring in blended measures and dimensions as aggregates, and not at row level, which makes blending a bad way of joining 'Master Data' attributes.
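The 'aggregate-and-join' behaviour can be sketched in a few lines of Python. This is a simplified model of what the blend does, not Tableau's actual implementation, and the figures are made up: the secondary source is aggregated at the blend-field level first, and each distinct blend key then contributes its total exactly once per group.

```python
# Minimal sketch of Tableau's aggregate-and-join blend (illustrative data).
from collections import defaultdict

transactions = [  # primary source, grain: Order/Product (simplified)
    {"Region": "West", "State": "CA", "Year": 2013, "Sales": 100},
    {"Region": "West", "State": "CA", "Year": 2013, "Sales": 200},
    {"Region": "West", "State": "WA", "Year": 2013, "Sales": 150},
]
population = [  # secondary source, grain: State/Year
    {"State": "CA", "Year": 2013, "Population": 39_000_000},
    {"State": "WA", "Year": 2013, "Population": 7_000_000},
]

def blend_index(primary, secondary, blend_fields):
    """Aggregate the secondary measure at the blend-field level, join it back,
    and compute SUM(Sales)/SUM(Population)*1000 per Region, counting each
    blend key once per region (as the blend does)."""
    agg = defaultdict(int)
    for r in secondary:
        agg[tuple(r[f] for f in blend_fields)] += r["Population"]
    region_sales, region_pop, seen = defaultdict(int), defaultdict(int), set()
    for r in primary:
        key = tuple(r[f] for f in blend_fields)
        region_sales[r["Region"]] += r["Sales"]
        if (r["Region"], key) not in seen:        # each state counted once
            seen.add((r["Region"], key))
            region_pop[r["Region"]] += agg[key]
    return {reg: region_sales[reg] / region_pop[reg] * 1000 for reg in region_sales}

print(blend_index(transactions, population, ["State", "Year"]))
# West: 450 / 46_000_000 * 1000, i.e. roughly 0.0098
```

Because the population is aggregated to the blend-field grain before the join, the regional total is 46,000,000, not the over-counted 85,000,000 a row-level join would give.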
To explain this further, look at what happens when I deselect [State] as a blend field:
So now the blend has no concept of [State]: the aggregation is blended only on [Year], so we no longer see any difference between Regions, and the population only differs between years.
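A quick sketch (again with made-up numbers) shows why: with only [Year] as a blend field, the secondary aggregate collapses across states, so every region picks up the same combined population for that year.

```python
# Blending on [Year] only: the state-level detail is lost in the aggregation.
# Population figures are illustrative.
from collections import defaultdict

population = [
    {"State": "CA", "Year": 2013, "Population": 39_000_000},
    {"State": "TX", "Year": 2013, "Population": 26_000_000},
]

agg_by_year = defaultdict(int)
for r in population:
    agg_by_year[r["Year"]] += r["Population"]   # CA and TX merge into one total

# Whether the primary row is in the West or the Central region, the blend
# now returns the same value:
print(agg_by_year[2013])  # 65_000_000 for every region
```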
Tableau Tip: Although we don't want to do this here, this is a useful trick (especially when using a self-blend; that is, creating a duplicate of your data source and blending against it). At Atheon we use this trick in one of our Retail Analytics models: the model lets the user select a series of products to see how they perform, but we also wanted to show the 'Category' performance at the same time. Although we could now do this with an LoD calculation, we built the model pre-LoD-calculations, so we used a self-blend and selected only 'Category' as the blend field.
So when should we Blend, and when should we Join?
One use-case is when the data resides in different ‘schema’. A schema, as defined in Tableau, is a single Excel workbook (where each tab forms a single schema), or a single text file (if they are in a single folder), or for a database a ‘schema’ is…well a schema! I see this as a fringe case, as it’s nearly always possible (by hook, or by crook!) to get data into a single schema (even if the tables need to be materialized into text files and those files placed in the same file location), and do a proper join (if appropriate).
The main use case, IMHO, for data blending, and for taking full advantage of Tableau’s ‘aggregate-and-join’ functionality, is when the 2 data sources have different levels of granularity. I love this about Tableau’s data-blends, as it gives an alternative to joining when the data is at different 'Grains'. I read many comments on the community asking for blending to behave like a 'real' join...that's what Joins are for; leave blending for what blending is for!
I hope this helps remove the 'frustration' of when Blends don't act like 'real' joins, and you too will come to love the Blend as much as I do!
You can find, and download, the example workbook I've been using here (as well as an Opportunities Dashboard)