Periodic Snapshots with Tableau Incremental Refresh

2017: The Year of Data

Happy New Year!

Even as 2016 was winding down, I found myself with a challenge dealing with time.  And I thought I’d share some of my thoughts about a possible approach.  Here’s the challenge:

  • You have a database with transactional data, that is data where the rows are inserted, updated and deleted based on events.  One generator goes offline and two others are brought up, the operating status for those records is updated.  Money is allocated to different departments this month, you get updates to rows for the departments.  An employee is promoted, the position value changes in the database.
  • You need to be able to see that state of the data at certain historical points in time and even see trends.  The data as it exists in the transactional system is always current: how many generators are online right now, how much money is allocated right now, how many vice presidents do you have right now.  You cannot see how many generators were online last month or track trends in the money allocation.
  • The database is locked down.  You can read the data, but you cannot create new tables or structures in the database or schedule some kind of automated process in the database.

Periodic Snapshots

If you could build a data warehouse, one possible structure for the data that you would consider would be a periodic snapshot.  That is, you’d capture the current state of the data and you’d do so periodically (hourly, daily, weekly, etc…).  Each snapshot would be stored (as fact table records) with a time-stamp, allowing you to tell when the snapshot was taken.

Periodic Snapshots of Data

It’s kind of like capturing snapshots of your kids: “And look, here’s when little Generator 3 came online for the first time!”

Point Tableau at your fact/dimension tables, and analyzing and visualizing the data is a piece of cake!  And you’ll have a rich history for trends, forecasting, and seeing the changes over time.

…But what if you don’t have the ability to build a data warehouse?

First of all, it’s 2017! Building data warehouses is easier, faster, and cheaper than ever… and there are other tools out there for blending data and building data structures… but I get it… there are times when it isn’t an option (at least in the short-run).

But guess what? It’s 2017!  And Tableau continues to add data prep and data integration features that make life easier.  And in fact, what I’m about to share isn’t even a new feature — it’s been there since version 6.  But when you pair it with some newer features (such as Level of Detail calcs, cross database joins, unions) you start to get the ability to create your own data warehouse right in Tableau! (okay, there are some limitations and even <scary> dangers </scary>, but we’ll gloss over those look at those at the end of the post.

Tableau Incremental Refresh

Incremental refresh is an option when you create a data extract:

Incremental Refresh

 

What the incremental refresh does is use the field you specify and looks for rows of data having values for that field that are greater than the values already existing in the extract.  There is no requirement that these values be unique (for example, you could have 100 rows all having a 1 value today and then a few more new rows with a 2 that would be added on the next incremental refresh — or you could have 1,2…100 and then new rows with 101,102… that would be incrementally added) So, as long as there are rows where the value in the field are greater than what you had previously, those rows would get included in an incremental refresh.

 Be aware that any new rows that have the same or lesser values as those which already exist in the extract will not be added in an incremental refresh.  Also, incremental refresh is only good for adding rows, so it won’t catch anything that was deleted or changed at the source.

Building the Periodic Snapshot in Tableau

But for the purposes of creating a periodic snapshot data structure, it’s okay if rows change or are deleted, because we’re going to capture all available data every period — so we get a snapshot of it’s current state now and then a snapshot of the current state the next period and then the next and so on.  So, we’re building a history as we go.  If records change, are deleted, or we have new ones — no problem we’ll just get what’s available each time.

The main key is to create a field  that we can use as the field that identifies the new rows.  We can’t create a new table, but there’s nothing that prevents us from using a custom SQL statement to create a new field in the SELECT.  Consider a custom SQL statement like:

SELECT 
[Generator], 
[Status],
GETDATE() as [Snapshot_Date] 
FROM Generators

What we’ll get is a snapshot of the data that includes a time-stamp of when the snapshot was taken.  Here a three such snapshots:

Periodic Snapshots - Tables

I’ve simplified the time-stamp to be just the date, but based on the SQL above, it would include the time too.

And if you use that Snapshot_Date field as your incremental refresh field, then each time you run the incremental refresh you’ll get all the rows appended to the extract with the most recent time-stamp (which of course is greater than any values already in the extract).

So, you end up with a data structure looking like this:

Period Snapshots Table

 

Now you’ve got historical data and can do all kinds of fun analysis!  Want to see how many generators were online/offline over time?  Your boss can only see the current state from the transactional database.  But you’ll impress her with the ability to quickly and seamlessly go through the entire history (at least since you’ve been capturing the snapshots)!

 

Tips and Tricks (and Dangers!)

<scary class=’very’> After your initial extract, don’t ever, ever, ever run a full extract again.  If you do, you’ll overwrite all your historical snapshots with the current snapshot of the data.  Read that again and let your blood run cold as you realize that everything you built will collapse like a house of cards with one false step. </scary>  Only run incremental refreshes!

To mitigate that risk, consider some options:

  • Don’t use the incremental refresh to build periodic snapshots in Tableau (if you have other options). I know, I just wrote a whole blog post on how and why. But if you have the option for building a data warehouse or even just exporting a periodic snapshot of the data as a series of flat-files to union together in Tableau, you’ll avoid the risk altogether.  I know those options aren’t always available which is why I detailed the incremental refresh option. And I hit an occasion recently where this was really the only way. So, if you do use this technique, do the following:
  • Backup your extracts periodically.  If you’re scheduling them on Tableau Server, you should be running a backup anyway.  But always keep a backup of your extract for the last few periods in case you need to restore it.
  • Create a view that allows you to see how many snapshots you have.  Subscribe to this view (use conditional alerts in 10.1) so you find out fast if your snapshots get wiped out.

Beyond that, consider these tips:

  • When you write your custom SQL statement, you’ll want to truncate the date to the level of the frequency you are capturing.  Don’t capture minutes and seconds if you’re capturing hourly.  Don’t capture days if you’re capturing monthly.  It makes the SQL a bit more complex, but it’ll save you from some headache if you accidentally run more than one incremental refresh per period.
  • Use filters, Level of Detail expressions (LOD), or carefully determined aggregations to roll up to time periods at a higher level than your periodic snapshots.  For example, if you’re capturing daily but want to report monthly, you won’t be able to fully aggregate measures unless you determine how monthly reporting should work.
  • Want to show the status of things as the 1st of the month?  Then a filter based on a calculated field like: DAY([Snapshot_date]) == 1 should work.
  • Want the last day of data available for each month? {FIXED [Generator], DATETRUNC(‘month’,[Snapshot_Date]) : MAX([Snapshot_Date])} == [Snapshot_Date]

 

Hope that helps!
Joshua

11 Responses so far.

  1. Sameer says:

    Brother please detect the steps to create periodic extract. There are not whole lot of options in tableau for refresh.

    One is full and other is incremental. I don’t see a way to refresh the extract to correct the older data in incremental refresh.

    • Joshua Milligan says:

      Sameer,

      That is correct, the incremental refresh does not update any values that have changed or delete any records that have been removed for older data. Only new data is appended.

      • Mitch says:

        Joshua thanks for this info on incremental refresh. I have a situation where I have a huge extract and only want to append the latest month where the incremental refresh works great.

        Now, my underlying data for the most recent month (Jan), that’s already been extracted had several corrections made to it. I want to refresh the complete months worth of data without having to refresh the entire extract.

        Can I fool Tableau into thinking January data is not in the extract so my incremental refresh will only update January?

        • Joshua Milligan says:

          Mitch,

          That’s a great question. Not really, because even if you “tricked” Tableau (maybe by using another dummy field that you forced to be larger value) it would append the data rather than replace it. Now, you could potentially use LOD expressions to filter the data so that only the most-recent, correct January was used in the data – but that could be complex and with a large data set might have some performance issues too.

  2. M says:

    I have a similar report set up. How do I publish it to Tableau Online with an incremental refresh? I have a scheduled refresh online which keeps running a full refresh and rewriting the previous rows.

  3. Mike says:

    Hi,
    This is intruiging, and potentially exactly what I am looking to do. Problem is, when I am connecting to my data with a Custom SQL Query, Tableau is telling me “function getdate() does not exist;”.
    Same thing when I try TODAY().
    Am I missing something?
    Thanks!

    • Joshua Milligan says:

      Different SQL platforms have different functions. SQL Server uses GETDATE(). MySQL uses NOW(). Oracle SYSDATE. What data source are you using?

  4. Franco Arda says:

    hi Joshua, another outstanding post. Thank you for sharing.

    I fear, that your “Danger” might scare off people who need incremental refresh + full refresh. A potential case is where you have a huge extract of sales data. The data for T-30 (last 30 days) might be edited/deleted. In this case, a daily incremental refresh and a weekly full refresh might make sense.

  5. ravi says:

    Hi Joshua,
    Thank you for this post and the tips you provided to get the latest snapshot for each time period. I have a use case where in I am trying to calculate percent difference of sum of sales between the latest snapshot of current month vs the latest snapshot of previous month based on the user selected snapshot date via a quick filter. Can a similar LOD expression be used to achieve this functionality. I am struggling to wrap my head around calculating the month over month percent difference between snapshots

    • Joshua Milligan says:

      Yes, you could use an LOD to identify the last period in each month and then potentially a table calculation to determine percent difference

  6. kavita arora says:

    Hi Joshua..!
    Need some help..!
    As I have read the blog but I have some confusion.
    Need the current month data on incremental basis but current month data can be update till the end date of the month.
    Also want the previous data i.e., from 2015 to 2019 Oct and this data is static.
    So how to do the same as I dont want to loose the previous data also and also want to do incremental refresh on current month and current month data can be update till end of the month..
    Ex- 1 to 26 days of nov has been gone and on 27th Nov any date of this month data can be update.
    So what is the possible solution in this scenario?

Leave a Reply

Your email address will not be published. Required fields are marked *