I recently came across this post by Simona Loffredo in which she describes a certain type of data structure, why it wasn’t ideal for analysis, and how to reshape it using Alteryx. While I love Alteryx and really love how it can be used alongside Tableau, I get especially excited when I can do something without ever leaving Tableau. It keeps me in the flow. So, here is the approach I took to reshape the data using only Tableau (and if you want to follow along, you can download the workbook data)
Materialized Path Structures
When I saw the data structure Simona had encountered, I realized I had worked with very similar data structures before.
When hierarchies are stored in this way, it is called a materialized path. In this structure, the hierarchical path is stored in a single column that gives you the entire lineage. For example, A1 is Food and A1x will be a subcategory of food and A1xx will be a deeper level of description.
This type of structure is not uncommon in many types of industries. In fact industry codes and job types are often stored this way. A lot of government data and medical data (e.g. diagnoses) follows a similar pattern. Sometimes the path uses a separator, such as a dot or pipe (e.g. A.1.1.2 or A|2|1|2). This structure does have some benefit in that all children of an element can be selected very easily using LIKE condition in a WHERE clause of a SQL query.
In her post, Simona does a great job of explaining the problems of this structure when it comes to analysis. The major problem is that values are repeated. For example, A13: Refrigerated Food has a value of 250 which is the sum of the values of its children:
And, in turn, the 250 is counted as part of the 500 value for A1: Food. All of these values are for different levels of information. In Tableau (or any other analytics tool) you’d have an issue with separating the values which should be aggregated and the values which are already aggregates.
Another issue is that it’s very difficult to get the groupings of lower level items. You have to traverse rows of the table to get the description of a higher level, which can be a challenge!
All of this leads to the conclusion that a different structure is ideal for analysis in Tableau.
The Ideal Data Structure for Analysis in Tableau
Here is a structure that would be easy to use in Tableau:
We’ve retained the lowest level items (in most cases, that’s the third level, but it’s a ragged hierarchy and A11: Fruits and Vegetables has no children even though it’s at the second level). We’ve also included columns for the names of higher level groupings. Now the values in the data are fully aggregable – they can all be summed at any level without any concern that we’re counting some more than once.
So how can we get to this shape? Simona’s post examined the use of a 3rd party tool (and, in fact, she never said herend goal was analysis in Tableau). But if you are using Tableau 10.2 you have a way to reshape the data right in Tableau!
Using Tableau 10.2 to reshape the data
I’m going to use a series of self-joins to link each record with the levels above it. It’s going to take one join for each level, so if you have a dozen levels, it’ll get a bit tedious, but not impossible. There might be a point at which this solution becomes untenable at high volumes of data, but in all the cases where I have worked with this structure, this would not have been the case.
I’m going to simply start with the table itself. Ultimately, I will want this to be the lowest level records.
Next, I’ll join it to itself to get the top level (which I’ll call Level 1). I’ll use a left join and the new Tableau 10.2 feature Join Calculation.
The join calculation itself is LEFT([Code], 2) which will give me the first two characters of the code. This will be the top level (e.g. A1 or A2). And you can see the results of the join:
I’ll clean it up to hide the Value (Hierarchy 1) column (I could keep the code column if desired) and rename the description to Level 1. Then I’ll repeat this for Level 2. If I had more levels, I’d do it for each. In each case, I’ll add one more character to the join calculation. For Level 2 it will be LEFT([Code], 3). I won’t need to do it for a Level 3, because that is the base table.
Finally, I’ll have a table that looks like this:
The last step is to remove all records except for the lowest level. If the lowest level was consistent, I could simply check the length of the Code field and keep the longest. But in this case it’s ragged (with a Level 2 Fruit and Vegetables having no children).
I can check to see if something is the lowest with one more join. In this case, we’ll use a left join with Code on the left matched to a Join Calculation on the right. Specifically, that calculation is:
LEFT([Code (Hierarchy3)], LEN([Code (Hierarchy3)]) – 1)
What this does is match a code on the left with its children on the right. But since it’s a left join, even those that don’t find matches are kept and the NULL values on the right indicate that the record is the lowest level and those are the records what we want to keep!
We can hide all but one of the columns from the final join and rename that column something like Keep Null Filter, which is exactly how we’ll use it: as a filter to keep only the records that didn’t find any lower level matches. In fact, if we make it a data source filter, then you’ll have a nice data set that is exactly the set you want:
…leaves you with this:
With a bit of cleanup and organizing, you have a very usable data connection in Tableau that is ready for drag and drop analysis – aggregating nicely at any level you want!