• How to remove duplicate records in Tableau Prep

    One of the questions that comes up quite often is “How do I remove duplicate records in Tableau Prep?” or “How do I dedup data in Tableau Prep?” Some data preparation tools have a specific feature to do this. At first glance, the first version of Tableau Prep doesn’t seem to have this feature. BUT it is definitely possible to remove duplicate records in Tableau Prep. And not only that, but it’s very EASY to remove duplicates. (And if you make it to the end, you’ll get a a bonus: An LOD calculation in Tableau Prep!)

    Let’s talk about three possible cases of removing duplicate records in Tableau Prep (each gets a bit harder):

    1: Exact Duplicate Records in Tableau Prep

    Let’s say you’ve got a data set that looks like this:

    You can see the Employee ID 3 has 2 exactly identical duplicate records.  Everything is the same: the ID, the Name, the Date.  It’s a true duplicate.  No matter how many duplicate records you had, you could do the following in Tableau Prep:

    All you have to do is add an aggregate and add all fields to the Grouped Fields section.  Nothing will be aggregated.  Only unique rows are retained while duplicates will vanish before your eyes.  You can continue the Tableau Prep flow with a nicely deduped data set.  (and thank you to Tom Fuller for pointing out this approach!)

     

    2: Similar, but not exactly, Duplicate Records

    Let’s modify the data set slightly and consider how we might eliminate the duplicate records in Tableau Prep:

    In this case, Walvoord was hired once in 1997 and then subsequently re-hired in 2014 (this data set doesn’t indicate the reason – did he have an intermediate job or was it hire to a new position?)  Whatever the cause for this data, let’s say we only care to know the most recent hire date.

    You’ve probably already jumped to a possible solution for removing this duplicate data.  It might look like this:

    It’s very similar to the previous solution, but here, we’re grouping only by Employee ID and Name while the Date Hired field gets aggregated as a MAX.  That means for every unique set of employees, we’ll get the max hire date.  We remove the unwanted near-duplicate record and end up with this output:

     

    3: Extended Near Duplicate Records

    I’ve already covered how to get the latest snapshot of records elsewhere, but it fits nicely here because it is a case of duplicate data that needs to be deduped.  Let’s extend the previous input data set so that not only is there a most recent hire date, but we can’t simply group and aggregate because there is additional data in the record that we need to retain.

    Now we know why Walvoord has two hire dates – it represents two different positions.  So, we need to be able to keep the record with the latest hire date but we also need to keep the other values for that record intact.  We can’t use the previous approach to GROUP by Position because we’d get two records.  And we can’t AGGREGATE by Position either because we’ll end up with an arbitrary value (should it be MIN or MAX – one of those might work in this specific case, but other combinations of Positions will have different results whatever we chose).

    Instead, we’ll extend our solution just as the data set has been extended.   The first step is the same as Case #2.  We’ll group by Employee ID and Name and find the Maximum hire date.  The output shown above is our result.  But we need to take that information and use it to match to the full data set to identify the entire records with the latest date.  It’ll look something like this:

    The Max Date step is simply a renaming of the Dedup step in case #2.  But then, we’ve joined it to the previous step in the flow!  Yes, you can join to previous steps in the flow – simply by dragging and dropping!

    And the join, as you can see above, is done on the Employee ID as well as the Date Hired – which is all dates in the original data set matched to the latest (MAX) dates from the aggregation.  And you can even see the red color of the earlier date for Employee 3 that we wanted to exclude.

     

    Bonus: LOD Calculation in Tableau Prep

    Guess what? The solution to Case #3 is an LOD calculation in Tableau Prep!  If you were wondering if you could write a FIXED LOD – you can! It’s just visual!

    Imagine you were writing a calculation to find the latest record in Tableau Desktop.  You’d write:

    {FIXED [Employee ID] : MAX([Date Hired]}

    which would give you the latest date per employee.  you could extend the calculation to match it to the date of the record and get a boolean to determine whether it was the latest or not:

    [Date Hired] = {FIXED [Employee ID] : MAX([Date Hired]}

    When the Hired Date matched the LOD result, you have the most recent record for that employee.  Filter to keep only true values and you’ve deduped your data set.

    “But wait!” you say.  “Tableau Prep doesn’t support LOD syntax (at least in version 1).”

    Ah, but we just matched the logic of the calculation with our flow in Tableau Prep.  Check it out:

    And there you go! You can now remove duplicate rows in Tableau Prep and even write LOD expressions using a Tableau Prep data flow!

3 Responsesso far.

  1. Paul O says:

    On the LOD issue of using LOD in Prep, i suspect that it should be the other way round – bringing the visualization of Prep into Vizualizer.

  2. YupengWu says:

    good article ,and thanks very much .
    From a China reader

Leave a Reply

Your email address will not be published. Required fields are marked *