Tableau Prep Builder 2019.3 is currently in beta and I’m loving it! (along with all the other goodies: Tableau Desktop, Tableau Server, Tableau Catalog, Project McKinley). One of the coolest new features is the ability to use Python in Tableau Prep, giving you the ability to transform your data with Python Scripts. (and for those of you who love R, you can use it too!)
My goal in this post, is not to describe how to install Python, or TabPy (the Tableau Python engine) or how to use PIP to install the Pandas module — all of that documentation is available when you join the beta program and I’m sure will be available when the production version lands. Instead, I wanted to share some of the scripts I’ve created to fill in some gaps that Tableau Prep currently has (after all, it’s less than 1.5 years old). I’m sure some of the capabilities I’ve created with Python scripts will be built into Tableau Prep and available via a single click or two. But for now, Python gives me a way to overcome some of the more complex challenges I have faced with my data…
(what needs to be said first)
I’m not a Python expert. In fact, I’m just a beginner and learner. When I started testing the beta, I didn’t know what pandas was or how a dataframe worked and had only written one or two very simple scripts in Python, ever. In fact, most of the scripts I’m about to share are very simple. To me, that’s good news, because if I can accomplish some really useful things with simple scripts, then others (and me, as I grow and learn) will be able to do amazing things with more complex scripts.
Having said that, you may see some things I’ve done wrong or that could be done better. Please share any thoughts you have in the comments!
How Python in Tableau Prep Works
Starting with Tableau 2019.3, you’ll be able to add a Script step from any step in the flow:
Selecting the Script step, allows you to configure the step, connect to TabPy or Rserve (if you want to use R scripts instead) and then browse to a script file you have created containing the function you want to call (which you specify under Function Name)
The function you call takes in a dataframe as an argument, which is an object containing the table of rows and columns that come into the Script step and returns a dataframe containing the rows and columns that are returned.
And with that, let’s turn our attention to some examples of useful scripts:
Adding Row Number / Index to the Data
The ability to assign a unique, sequential value to rows in the data set has innumerable uses and sometimes a row number encodes data about the record that you need.
In fact, being able to number rows or add an index would be so useful, I’ve requested (and you can vote for) the idea of having it be a built-in part of Tableau Prep. But in the meantime, Python gives us a way to approach the problem.
The script is very simple:
What it does is to apply an index to the column Index. That of course requires such a column in the data – and it must be an integer. I can simply add the field as a calculation in a previous step, using code like 0 or INT(NULL) to give me a numeric field. It doesn’t matter what the value is, I’ll overwrite it with the Python script.
Note: in every case I tested (which wasn’t many), the index matched the order of rows from the file. But I’m very much aware that most databases don’t guarantee an order and I’m not 100% sure files read into Tableau Prep data flows come with any guarantee either. You should be able to control grouping and sorting using the applicable methods for the dataframe object.
I’ve previously written about Ranking in Tableau Prep with a workaround that is usable in some cases – but is somewhat limited. The previously mentioned idea would allow for it too. But with Python, you have a rank() method built right into the dataframe object.
Here’s the script:
As I’ve defined the function, it requires the data set to have a Rank field and a ToRank field – both integers. So in a previous Tableau Prep step I’ll create a Rank field with a hardcoded value and create a calculation of ToRank duplicates a field of values I want to use for the ranking.
In this case I decided to use the min method which treats all ties as the lowest rank
You could potentially create a TopN filter with an index or rank of a unique field, but as Python gives you a method – nlargest() – made specifically to return the largest rows, why not use it?
The function I created looks like:
The first argument is the numeric value for N. But I wanted to dynamically adjust how many rows to return in the Tableau Prep flow. So I defined a field where the max value determines the number of rows. If I define the TopN field as a calculation with the value 5, for example, I’ll get the top 5 rows.
The other field is TopN Value which references the values to compare. In this case, I simply duplicated the field I wanted to use and renamed it to TopN Value.
More to come!
This post is just the first. I’ve got a few more scripts to share featuring Python in Tableau Prep. They get a bit more complex, but enable all kinds of useful possibilities – converting text to Title Case, determining the number of business days or holidays between dates, filling in sparse data…. There’s no end to what can be done!