You can’t beat the convenience of online grocery shopping. Especially with four young children, I find it wonderful to pull into a spot, have everyone stay in the car, and collect all the groceries that are ready to go. But with four young children, the cost of groceries is not convenient. And so, as I started collecting the emailed receipts, I saw an opportunity – to turn Tableau Prep loose on the Walmart receipt data and get it in shape for Tableau to visualize and analyze and maybe even find some savings.
As a side-note, in the summer of 2003 I worked as an intern at Walmart Information Systems as part of a team that was developing the first generation of online ordering / grocery pickup for Walmart’s British retailer, ASDA . I wonder if any of the code I wrote still lives on…
Gathering the Data
Every time I pick up an order, Walmart emails me a receipt. It includes every item, substitutions, quantities, prices – a wealth of data. But not an easy format to analyze:
There are lots of options for how to capture data from email. I’ve used IFTTT in the past to capture email data in Google Sheets. I could just copy and paste. But since I’m using Outlook, I’ve got another option. Here’s what I did:
- Create a rule to move all incoming receipt emails to a separate folder.
- Use the Import / Export wizard to export the contents of the folder to a .csv file. Turns out the Body is the only useful field to map. But the body of the email contains all the data I’ll need.
I end up with one large file containing the body of every email receipt. The only problem is that it retains the line breaks from the email:
That’s a problem for now (as of Tableau Prep 2020.1.1) because I don’t have any way to work across rows of data. So as the quantity and price of the Great Value Buttery Whole Wheat Snack Crackers are all on separate lines, means I’ll have no way to associate them in Tableau Prep. (The Tableau Prep developers are working furiously on some solutions, but for now you can lend your voice to the priority of one possible feature here)
In the meantime, I have a work-around. If I can remove all the line breaks, I’ll have all the data on a single row and then I can split it up however I want. Turns out that’s an easy PowerShell script (and I’m sure it’s easy in Python or any other scripting language too!) In PowerShell, it’s one line:
Add-Content 'WalmartOrders.txt' "Data"
(Get-Content 'WalmartOrders.csv') -join ' ' | Add-Content 'WalmartOrders.txt'
Okay, technically that’s two. The first simply adds a header so Tableau Prep will understand that it’s a single column of data. The second line takes all the text in the .csv file and outputs it to a .txt file, sans line breaks. An optional third line (not shown) could actually run the flow to automate everything as much as possible.
With that, I’m ready to tackle the Walmart receipt data with Tableau Prep!
The Tableau Prep Flow that Tackles the Walmart Receipt Data
Here’s the Tableau Prep flow:
There’s complexity, but the overall approach is simple. Take the single line of text. Find the breaks between orders, split the string and pivot so that there is now a single row per order. Then find the breaks between items in the order, split the string and pivot so we end up with a record per item per order. Here are some highlights:
- In the first clean step, I looked for a unique string or character that would separate individual orders. Turns out “We look forward to seeing you again” occurs only once per email and works great for separating the orders. However, Tableau Prep’s split feature only splits on a single character. So I first replace the nice message from Walmart with a single character that doesn’t occur elsewhere (vertical bar | works great in this case) and then the split feature gives me a separate field for each order.
- I then Pivot all those orders to rows. I don’t need the Pivot Names field.
- Once I have individual Orders, I take a moment to extract some Order Level Detail such as Tax, Total, and Promo Code Savings. Why doesn’t Outlook let me export the date of the email? I’ll never know. But fortunately the day and time is in the email body – along with the year of the copyright message.
- Then I split the order string per item. Again, I look for a pattern that divides each order. In this case, I see that every item ends with a $#.## or $#.##/LB. But the amounts can vary. So, when I replace, I use a REGEX_REPLACE() function to insert the vertical bar after the final price:
REGEXP_REPLACE([Data Split], "(\$[0-9.\/LB]+[ \t]+\$[0-9.]+)", "$1|") //I suspect I didn't get every pattern in the data because I ended up having to do another split in the flow. I probably could have caught them here and avoided a step or two.
- I pivot each item to a row, again removing the unnecessary Pivot Names field.
- I do the additional split to catch the items I missed splitting the first time and do another pivot.
- Finally, I extract the data I want from each item – name, quantity, price whether it was a substitution, out of stock, or rejected by me. And I pickup a few order level items that I missed earlier, such as the order number.
And now, I can start to dive into the data and find the items that have increased in price over time, or that are great buys during certain seasons, or that are draining our bank account. I wonder if three-year-olds can be convinced with data…