Python in Tableau Prep: simple useful scripts

Python in Tableau Prep

Tableau Prep Builder 2019.3 is currently in beta and I’m loving it! (along with all the other goodies: Tableau Desktop, Tableau Server, Tableau Catalog, Project McKinley). One of the coolest new features is the ability to use Python in Tableau Prep, giving you the ability to transform your data with Python Scripts. (and for those of you who love R, you can use it too!)

My goal in this post, is not to describe how to install Python, or TabPy (the Tableau Python engine) or how to use PIP to install the Pandas module — all of that documentation is available when you join the beta program and I’m sure will be available when the production version lands. Instead, I wanted to share some of the scripts I’ve created to fill in some gaps that Tableau Prep currently has (after all, it’s less than 1.5 years old). I’m sure some of the capabilities I’ve created with Python scripts will be built into Tableau Prep and available via a single click or two. But for now, Python gives me a way to overcome some of the more complex challenges I have faced with my data…

Here we see the Python in its natural environment. It approaches the data, ready to shape it according to its will…

Prolegomena
(what needs to be said first)

I’m not a Python expert. In fact, I’m just a beginner and learner. When I started testing the beta, I didn’t know what pandas was or how a dataframe worked and had only written one or two very simple scripts in Python, ever. In fact, most of the scripts I’m about to share are very simple. To me, that’s good news, because if I can accomplish some really useful things with simple scripts, then others (and me, as I grow and learn) will be able to do amazing things with more complex scripts.

Having said that, you may see some things I’ve done wrong or that could be done better. Please share any thoughts you have in the comments!

My first encounter with the Python was a bit rough. I knew it was powerful, but I had no clue how to approach it.

How Python in Tableau Prep Works

Starting with Tableau 2019.3, you’ll be able to add a Script step from any step in the flow:

Adding a Script Step in Tableau Prep

Selecting the Script step, allows you to configure the step, connect to TabPy or Rserve (if you want to use R scripts instead) and then browse to a script file you have created containing the function you want to call (which you specify under Function Name)

Configuration options for Python in Tableau Prep
You can also define a getOutputSchema function if the output will contain different columns / data types than the input. None of the scripts in this post use this option.

The function you call takes in a dataframe as an argument, which is an object containing the table of rows and columns that come into the Script step and returns a dataframe containing the rows and columns that are returned.

An example Python function called from Tableau Prep

And with that, let’s turn our attention to some examples of useful scripts:

Adding Row Number / Index to the Data

The ability to assign a unique, sequential value to rows in the data set has innumerable uses and sometimes a row number encodes data about the record that you need.

In fact, being able to number rows or add an index would be so useful, I’ve requested (and you can vote for) the idea of having it be a built-in part of Tableau Prep. But in the meantime, Python gives us a way to approach the problem.

Python Indexing
Python Indexing

The script is very simple:

SetIndex() function for Python in Tableau Prep

What it does is to apply an index to the column Index. That of course requires such a column in the data – and it must be an integer. I can simply add the field as a calculation in a previous step, using code like 0 or INT(NULL) to give me a numeric field. It doesn’t matter what the value is, I’ll overwrite it with the Python script.

SetIndex() function for Python in Tableau Prep

Note: in every case I tested (which wasn’t many), the index matched the order of rows from the file. But I’m very much aware that most databases don’t guarantee an order and I’m not 100% sure files read into Tableau Prep data flows come with any guarantee either. You should be able to control grouping and sorting using the applicable methods for the dataframe object.

Rank

I’ve previously written about Ranking in Tableau Prep with a workaround that is usable in some cases – but is somewhat limited. The previously mentioned idea would allow for it too. But with Python, you have a rank() method built right into the dataframe object.

Here’s the script:

Rank() function for Python in Tableau Prep

As I’ve defined the function, it requires the data set to have a Rank field and a ToRank field – both integers. So in a previous Tableau Prep step I’ll create a Rank field with a hardcoded value and create a calculation of ToRank duplicates a field of values I want to use for the ranking.

In this case I decided to use the min method which treats all ties as the lowest rank

Python in Tableau Prep can rank data.  Pythons in the wild rank by who can swallow the others
In the wild, Pythons determine rank by which can swallow which

TopN Filter

You could potentially create a TopN filter with an index or rank of a unique field, but as Python gives you a method – nlargest() – made specifically to return the largest rows, why not use it?

The function I created looks like:

The first argument is the numeric value for N. But I wanted to dynamically adjust how many rows to return in the Tableau Prep flow. So I defined a field where the max value determines the number of rows. If I define the TopN field as a calculation with the value 5, for example, I’ll get the top 5 rows.

The other field is TopN Value which references the values to compare. In this case, I simply duplicated the field I wanted to use and renamed it to TopN Value.

More to come!

This post is just the first. I’ve got a few more scripts to share featuring Python in Tableau Prep. They get a bit more complex, but enable all kinds of useful possibilities – converting text to Title Case, determining the number of business days or holidays between dates, filling in sparse data…. There’s no end to what can be done!

5 Responses so far.

  1. Nick Gardner says:

    This looks interesting. What would be interesting is to see how statistical models can be built using, say, TensorFlow in these scripts and have Tableau read the output.

    Will keep an eye out for future posts :).

    • Joshua Milligan says:

      Hi Nick,

      That’s a bit beyond my capabilities at present, but I too am excited about the possibilities for statistical and predictive models!

  2. Hi, Joshua. What a wonderful blog post !! Thank you for sharing. Is it possible to introduce this blog to Japanese customer (with translation) ? I think I can add simple machine learning scenario. Like K-means clustering.
    The other ML predict model like decision trees or Neural Net is going to be same like this..

    # Define Functions
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=3, max_iter=30, init=”random”, n_jobs=-1)

    def Clustering(df):
    cluster = kmeans.fit_predict(df.values[:,0:4])
    df[‘cluster’] = cluster
    return df

Leave a Reply

Your email address will not be published. Required fields are marked *