This article details the process of building two BigQuery tables for path analysis, with a focus on creating Looker Studio reports that visualize user journeys through page views and events. It accompanies a GitHub repository featuring code for automating these transformations using Google Dataform.
I’m often frustrated by GA4’s limited ability to visualize user journeys through a website, and I know many people feel the same. Understanding how users navigate from blogs to product pages, which paths lead to conversions, or where they go after viewing service pages should be straightforward. Clear insights from these paths could offer significant value.
GA4 does have the Path Exploration report, but it is a bit clunky, and the data is not exportable. So when a client recently asked me about path exploration reporting, I thought to myself, “I’ve got all the data I need in BigQuery, how hard can it be?”
It turns out, it is pretty hard. I found it difficult for various reasons, but the main one is what I dubbed “the angels are in the details” principle.
Usually, when we analyze data, we are most interested in aggregated metrics that pertain to a dimension. We tend to abstract away from individual users and look at the top traffic sources, landing pages, keywords, etcetera. But with path behavior, the most common paths are boring and teach us little. I can pretty much guarantee that the number one path on your website looks like this:
/
In other words, it’s a single-page visit to your home page. As I dug into path behavior, I realized that I could learn little from paths like that, but a lot from exploring detailed path behavior. The goal of charts and graphs is typically to highlight trends and simplify, but sometimes, the angels are in the details.
In this and subsequent articles, I will share the methods I’ve developed for finding the angels.
I’m going to be doing the BigQuery data transformation in Google Dataform. Google Dataform is part of the Google Cloud Platform – there are no fees to use Dataform itself, but the data transformations will incur some BigQuery analysis fees. What those fees are depends very much on how many events your GA4 property collects. For twooctobers.com, which is what I will be using as an example, we get about two thousand events per day and the analysis fees are pennies per month.
The reason I’m using Dataform is because it facilitates automating data transformations in BigQuery and integrates with GitHub source control. It is also very helpful for organizing projects and testing and documenting code.
I will be sharing Dataform code examples, but I will not be spending a lot of time explaining how Dataform works. If you are not familiar with it, I recommend adding it to your ‘things-I-need-to-learn’ list. Or, if you are already comfortable with dbt, converting my code samples to dbt syntax should be fairly straightforward. One feature of Dataform I will point out, because it is super-cool, is that you can define a query as ‘incremental’. This updates a table without any risk of duplicating rows. You can schedule the query to run daily, and set the date range of the query to several days to allow for the GA4 export data to settle.
(Note, if all of this sounds too complicated, hire us and we’d be happy to set it up and get it working for you. If you mostly care about reporting on user paths in Looker Studio, I also created a Looker Studio BigQuery custom query version, but be mindful that it could get expensive if you run it on high-volume GA4 properties or for long date ranges.)
I’ve shared my path analysis Dataform project on github, and will walk through the components below. To work with the code, at a bare minimum you’ll need a Google Cloud Project with billing enabled, and an export of your GA4 data in BigQuery. Assuming you have those already:
It is also possible to create a Dataform repository without connecting to github. If you want to do that, you can just manually recreate & copy my files in your repository.
In my first pass at doing this, I created a single query that generated a table suitable for reporting in Looker Studio. Then, when I started working with the data, I realized that that schema was good for that purpose, but pretty limiting otherwise, so I broke it out into multiple stages. This query extracts just the data we need for path analysis from the GA4 BigQuery export, making subsequent steps easier and saving on analysis costs. Think of it as your path-analysis foundation. I’ll show you a few things you can build on top of the foundation, but the possibilities are limitless.
This step is executed by the ‘sequenced_paths.sqlx’ file and creates/updates a table named the ‘name’ property in the ‘config’ setting at the top of the file. By default, Dataform creates tables in the ‘dataform’ dataset, but you can change that in the workflow_settings.yaml file.
Here is a data dictionary of the table it creates:
See comments in the code for additional details.
This is a User-defined Function (UDF) that is part of an excellent library of BigQuery utilities authored by Google and community contributors. The UDF is referenced in the above query and removes the domain name and URL parameters from the GA4 page_location parameter.
Whereas the previous table has one row for each step, this query builds a table where one row equals a session and each step is broken out into its own column. In my query, I am getting the first ten steps in a session, but you can easily modify the last part of the query to add more.
Here is a data dictionary of the table it creates:
Below is an example of a Looker Studio report you can build with the session_paths table.
What’s in the report?
To build a report like this:
Fuss about as much as you’d like with formatting. Personally, this is my favorite part 🙂
For ad-hoc queries, I find the sequenced_events table very useful. It’s great for answering questions like, “what are the pages that precede or follow a particular page view?” It is possible to query the GA4 export table directly to answer this, but much easier with sequenced_events.
Here’s an example of a query to report on previous and next pages for a given page:
WITH page_sequences AS (
SELECT
session_id,
seq_num,
page_path,
COALESCE(LAG(page_path) OVER (PARTITION BY session_id ORDER BY seq_num),"entrance") AS prev_page_path,
COALESCE(LEAD(page_path) OVER (PARTITION BY session_id ORDER BY seq_num),"exit") AS next_page_path
FROM `yourproject.sandbox.sequenced_events`
WHERE date BETWEEN "2024-09-10" AND "2024-09-16"
)
SELECT
page_path,
prev_page_path,
next_page_path,
COUNT(*) AS count
FROM page_sequences
WHERE page_path = '/' -- Replace with your desired page_path
GROUP BY page_path, prev_page_path, next_page_path
ORDER BY count DESC;
Note that the output of this query is great for building a sankey diagram in Looker Studio. With a few adjustments you can replace the dates and page_path with Looker Studio custom query parameters to make a dynamic dashboard!
You may also want to modify my sequenced_events and session_paths queries to add additional dimensions such as device category or geography to sessions. I wanted to keep my example relatively simple, but you could also compare paths and path lengths based on other factors.
I got good results feeding path data to ChatGPT and asking for insights. I’ll be sharing that process in another article, but here is a teaser of the sort of thing I got back.
Analyzing the user paths that include /blog/ pages on Two Octobers’ website reveals several important insights about user engagement, content effectiveness, and potential areas for improvement. Here are the key findings:
1. User Engagement with Blog Content
- Users who visit /blog/ pages are likely seeking information related to marketing trends, apprenticeship, and help with features of GA4. The presence of multiple paths involving /blog/ indicates that this content attracts significant traffic, suggesting that users find it valuable.
2. Common Paths Including /blog/
- After visiting /blog/, users often navigate to other pages, such as:
- /about-us/
- /services/
- /team/
- This indicates that users may be seeking more context about the company, its values, and leadership after consuming news content. The inclusion of /services/ is particularly notable, as it suggests an interest in the company’s offerings following blog consumption.
3. Drop-Off Rates
- Some paths show abrupt drop-offs either at /blog/ or immediately after navigating away from it. Understanding where users leave the site can help identify potential issues
- If a user accesses a specific /blog/ article but does not proceed to further pages, it could suggest that the article did not meet their expectations or that the next logical steps weren’t clear.
It keeps going, but that gives you a taste. I’m not worried about being replaced by AI (yet), but that’s a really great starting point for sharing insights with a client.
As I described at the outset, sequence behavior is really hard to visualize. One approach I am experimenting with is starburst diagrams. Using the plotly visualization library in python, you can make a dynamic starburst that allows you to expand and collapse nodes to explore paths to or from a particular content type or page. I will be sharing how to do this in a future article.
I also think it would be interesting to use this data to create nodes and weights on a network or force-directed tree diagram, but I haven’t tried that yet.
I would really appreciate feedback and suggestions on what I have shared, in the comments below or by contacting me directly. In my research, I was not able to find a lot of in-depth content on this topic. I like what I have done so far, but I feel like there is so much I have yet to explore and I would love to connect with anyone who is interested in collaborating.
Nico loves marketing analytics, running, and analytics about running. He's Two Octobers' Head of Analytics, and loves teaching. Learn more about Nico or read more blogs he has written.
Preview five great dashboards for SEO reporting and analysis, and find the one that works…
Ways Google is emphasizing the value of authenticity in organic results, the intro of search…
"It's not me, it's you", Google Ads measurement, and dynamic metrics in Looker Studio...all in…