
Tableau Prep Builder 2019.3 is currently in beta and I’m loving it! (along with all the other goodies: Tableau Desktop, Tableau Server, Tableau Catalog, Project McKinley). One of the coolest new features is the ability to use Python in Tableau Prep, giving you the ability to transform your data with Python Scripts. (and for those of you who love R, you can use it too!)
My goal in this post, is not to describe how to install Python, or TabPy (the Tableau Python engine) or how to use PIP to install the Pandas module — all of that documentation is available when you join the beta program and I’m sure will be available when the production version lands. Instead, I wanted to share some of the scripts I’ve created to fill in some gaps that Tableau Prep currently has (after all, it’s less than 1.5 years old). I’m sure some of the capabilities I’ve created with Python scripts will be built into Tableau Prep and available via a single click or two. But for now, Python gives me a way to overcome some of the more complex challenges I have faced with my data…

Prolegomena
(what needs to be said first)
I’m not a Python expert. In fact, I’m just a beginner and learner. When I started testing the beta, I didn’t know what pandas was or how a dataframe worked and had only written one or two very simple scripts in Python, ever. In fact, most of the scripts I’m about to share are very simple. To me, that’s good news, because if I can accomplish some really useful things with simple scripts, then others (and me, as I grow and learn) will be able to do amazing things with more complex scripts.
Having said that, you may see some things I’ve done wrong or that could be done better. Please share any thoughts you have in the comments!

How Python in Tableau Prep Works
Starting with Tableau 2019.3, you’ll be able to add a Script step from any step in the flow:

Selecting the Script step, allows you to configure the step, connect to TabPy or Rserve (if you want to use R scripts instead) and then browse to a script file you have created containing the function you want to call (which you specify under Function Name)

The function you call takes in a dataframe as an argument, which is an object containing the table of rows and columns that come into the Script step and returns a dataframe containing the rows and columns that are returned.

And with that, let’s turn our attention to some examples of useful scripts:
Adding Row Number / Index to the Data
The ability to assign a unique, sequential value to rows in the data set has innumerable uses and sometimes a row number encodes data about the record that you need.
In fact, being able to number rows or add an index would be so useful, I’ve requested (and you can vote for) the idea of having it be a built-in part of Tableau Prep. But in the meantime, Python gives us a way to approach the problem.

The script is very simple:

What it does is to apply an index to the column Index. That of course requires such a column in the data – and it must be an integer. I can simply add the field as a calculation in a previous step, using code like 0 or INT(NULL) to give me a numeric field. It doesn’t matter what the value is, I’ll overwrite it with the Python script.

Note: in every case I tested (which wasn’t many), the index matched the order of rows from the file. But I’m very much aware that most databases don’t guarantee an order and I’m not 100% sure files read into Tableau Prep data flows come with any guarantee either. You should be able to control grouping and sorting using the applicable methods for the dataframe object.
Rank
I’ve previously written about Ranking in Tableau Prep with a workaround that is usable in some cases – but is somewhat limited. The previously mentioned idea would allow for it too. But with Python, you have a rank() method built right into the dataframe object.
Here’s the script:

As I’ve defined the function, it requires the data set to have a Rank field and a ToRank field – both integers. So in a previous Tableau Prep step I’ll create a Rank field with a hardcoded value and create a calculation of ToRank duplicates a field of values I want to use for the ranking.

In this case I decided to use the min method which treats all ties as the lowest rank

TopN Filter
You could potentially create a TopN filter with an index or rank of a unique field, but as Python gives you a method – nlargest() – made specifically to return the largest rows, why not use it?
The function I created looks like:

The first argument is the numeric value for N. But I wanted to dynamically adjust how many rows to return in the Tableau Prep flow. So I defined a field where the max value determines the number of rows. If I define the TopN field as a calculation with the value 5, for example, I’ll get the top 5 rows.
The other field is TopN Value which references the values to compare. In this case, I simply duplicated the field I wanted to use and renamed it to TopN Value.

More to come!
This post is just the first. I’ve got a few more scripts to share featuring Python in Tableau Prep. They get a bit more complex, but enable all kinds of useful possibilities – converting text to Title Case, determining the number of business days or holidays between dates, filling in sparse data…. There’s no end to what can be done!
This looks interesting. What would be interesting is to see how statistical models can be built using, say, TensorFlow in these scripts and have Tableau read the output.
Will keep an eye out for future posts :).
Hi Nick,
That’s a bit beyond my capabilities at present, but I too am excited about the possibilities for statistical and predictive models!
@Nick,
– You can already use TensorFlow (or Keras) with TabPy in Tableau.
– For numerical data, you get better results with statistical-based Machine Learning (e.g. Scikit-Learn) rather than Deep Leaning (e.g. TensorFlow).
– On Kaggle (Machine Learning competition) competitions for numerical data, you won’t see Deep Learning winning, but rather tree-based algorithms.
– When you need to pickle data (serialize data in Python), you get into horrible problems.
Summary: if you want to do Machine Learning with Tableau, focus on Scikit-Learn.
Hi, Joshua. What a wonderful blog post !! Thank you for sharing. Is it possible to introduce this blog to Japanese customer (with translation) ? I think I can add simple machine learning scenario. Like K-means clustering.
The other ML predict model like decision trees or Neural Net is going to be same like this..
# Define Functions
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, max_iter=30, init=”random”, n_jobs=-1)
def Clustering(df):
cluster = kmeans.fit_predict(df.values[:,0:4])
df[‘cluster’] = cluster
return df
Hi Tomohiro,
Thank you for the comment! yes, of course you may share this and definitely add your ideas!
Thank you Joshua ! I posted article to my blog ” Start Data Science From Tableau ” http://lovedata.main.jp/2019/07/13/tableau-prep-builder-tabpy/ We have Google Translate button (Orange One) on the left bottom . I hope Japanese customer will jump to your site from mine. – If you’d like to add your bio or other links let me know!!
Hey Tomohiro, feel free to link to this little introduction video I made on the use of scripts as well: https://www.youtube.com/watch?v=RWkipWk5bpM&t=2s
Kr,
Tim
Hi Joshua,
As always, fantastic post. Thank you for sharing your knowledge!
I strongly believe that by adding Python in Tableau Prep, Tableau is going to increase the value of Tableau developers.
There are several reasons why Python is so popular, one of them is Panda (kind of Excel for Python). It’s very powerful, but really hard to master (see http://www.data camp.com for some free courses on Panda). I suspect we only need Panda’s for some basic transformation, which is not that hard.
PS: for those who want to learn Pandas, don’t make the mistake of trying to study Python too long (we’re not software engineers). You need only basic Python to learn Pandas (quote from Wes McKinney’s book on Pandas).