• How to handle new input files in Tableau Prep

    How to handle new input files in Tableau Prep

    New Periodic Files in Tableau Prep

    Let’s say you’ve got a situation where you get new input files for Tableau Prep every month (or, maybe in your case it’s every week or every day or every so often). These aren’t files to be added to your data source in addition to existing files.  They are replacements with a brand new, up-to-date data set that needs to be processed with the same logic and flow every time.  How do you handle this scenario?  Here are some options:

    1. Stick with the same filename month after month.

    That is, literally replace the old file with the new input file of the exact same name (archiving the old ones is probably a good practice).  The first month you have a file named data.txt.  The next month you replace it with the new file also named data.txt.

    • Pro: all you have to do is just open the Tableau Prep, run the flow it and it should work fine.  In my experience, you may have to click the refresh button in Tableau Prep
    • Pro: when scheduling is released in future versions, you won’t have to do anything other than replace the file and let the flow keep running on it’s schedule.
    • Con: it’s a bit harder to keep track of the files and you may find yourself asking, “Did I replace it already?”  “Is that last month’s file or this month’s?”

    2. Add the new input files to the directory and remove the old files

    (or move the old file to archive directory)

    When you open Tableau Prep or click Refresh, Tableau Prep will detect that the source file is no longer available and prompt you to locate it.  Simply point Tableau Prep to the new file instead.

    • Pro: fairly easy
    • Pro: you know which file you just selected, so no second-guessing
    • Con: a bit more manual, scheduling of flows (when available) won’t automatically pick this up
    • Con: the input node retains it’s name (e.g. “January” even though you just selected “February”) so a bit of confusion might ensue.

    3. Add the new input files to the directory and swap out the input node

    Here, you’ll add the new connection and replace the input node with the new file.

    Swapping Inputs in Tableau Prep

    • Pro: There’s no question in your mind what you’ve done and the input node keeps the name of the file so the flow is showing you exactly which month/file you’ve used.
    • Con: Definitely a bit more involved than other options.
    • Con: Won’t allow any flow schedules to continue without intervention.

    4. Just keep adding new input files to the directory and use features of Tableau Prep to filter to latest

    The idea is that you’ll use a wildcard union to pull data from all the files, past and present. Then you can filter the File Paths field to keep only the newest:

    Filtering to keep latest file in Tableau Prep

     

    •  Pro: You can compare new data to previous months if needed because it’s all there in the flow and you may quickly spot errors or differences that you might miss otherwise
    • Con: Definitely manual, forget setting a schedule and letting it just go
    • Con: higher potential for human error (“oops, I forgot to filter”)
    • Con: potential performance issues depending on size and number of files

    4a. Modify the previous approach to automate selecting the latest file

    Let’s assume your filename or something in the file (a timestamp or date field) allows you to identify the latest file.  Now it becomes a latest snapshot type of situation:

    Automatic Filtering to keep latest file in Tableau Prep

    You use an aggregate step to identify the MAX(File Path) or MAX(Timestamp), then join it back to the previous step to keep records that match that maximum.  Now you have a flow that automatically picks up the latest file each time without ever having to come back to change the flow!

    • Pro: Fully automatic, just drop the file in and run the flow (okay, so you have to click the refresh button to get the wildcard union to pick up the new file- I hope this changes in later versions.  Please upvote Owen Price’s idea here)
    • Pro: Should work without having to modify the flow when scheduling is available.
    • Con: potential performance issues depending on size and number of files

     

    Some final thoughts on swapping new input files in Tableau Prep:

    Personally, I’ve used option 1 and it works really well in cases where it’s just me and I don’t care that it’s not fully automated.  Option 2 is also attractive.  When a scheduling feature is released, I’ll be evaluating cases where option 4 makes sense.

    With any of these approaches the assumption would be that all the fields (columns) in each new file stay the same.  If they change, you could run into the following:

    • Newly Added column – no problem.  it will just show up in the flow
    • Changed name – could break parts of the flow downstream.  You’ll have to fix it (potentially by just changing it’s name back to the original in the input node)
    • Removed column – could break parts of the flow downstream.  You’ll have to decide how to handle this one.  Might need to add a calculated field of same name and type to fill in values for the flow

    The order of the columns shouldn’t matter, unless you have specified that the first row does not contain column names and thus Prep is naming the columns automatically based on position.  In that case, the file would need to maintain the same order of columns month after month.

    What do you think?  Have you used any of these options?  What’s worked and what hasn’t?  Let me know in the comments!

    -Joshua

Leave a Reply

Your email address will not be published. Required fields are marked *