This week I was working for a client - they had a performance issue with a Power BI report. The data on the workbook was not too large, around 400,000 rows, but the file size was 110 megabytes and the performance of the model was relatively slow given the number of records. When I looked at the report, I noticed that the report was using GUIDs between primary and foreign keys in multiple tables. Generally speaking, it is not a good practice to use a GUID to join tables, as GUIDs do not compress well and have a negative effect on the efficiency of physical 1-to-many relationships.
What is a GUID?
GUID is an acronym for Globally Unique IDentifier. In short, a GUID is a hexadecimal number that is equivalent to a decimal number with 39 digits, something like this 1,000,000,000,000,000,000,000,000,000,000,000,000,000 (a Duodecillion). In other words, a GUID is a very unique ID that is unlikely (based on probability) to reproduce even if it was generated randomly, which it is. Read more about it here:
What's wrong with using GUIDs in relationships?
A physical 1-to-many relationship in Power BI is a critical part of the underlying database structure. The Power BI engine (Vertipaq) materializes and stores these relationships in the database and then uses them to quickly propagate filters from one table to another (via the 1-to-many relationship). It is very common (even desirable) for the ratio logic to be loaded into the fast L1 or L2 cache on your PC chip so you can get your work done super fast. If the relationship does not fit in the cache, the whole process will be slower. Things that can increase the size of the relationship are the number of unique values in the columns used in the relationship, but also the data type used in the relationship. Suffice it to say that a GUID is much less efficient than an integer value as a key column in Power BI.
Enter a substitute password
One way to solve this problem is to replace the GUID with a surrogate key. A surrogate key, as the name suggests, is a new "key column" that is a "surrogate" (or replacement) for the original key column. In a perfect world where you have a SQL server on the back end and you have an IT department that can do the work for you, I suggest that you have your IT department create the surrogate key for you and make it available in a sight. so you don't have to use the GUID. But the world we live in is not perfect, therefore this article will show you how to replace the GUID with a surrogate key using Power Query.
Steps to complete
Assuming you can't get the work done at the data source, this is essentially a problem for Power Query to solve. The steps to complete this process are as follows
- create a connection to the raw dimension table
- create a branch in the power query
- add a new column of integer id (surrogate key)
- Rejoin the surrogate key table with the original dimension table and replace the GUID using the join
- repeat the step to replace the GUID in the fact table
This process assumes that your dimension table contains a complete list of keys that exist in your fact table. If this is not the case, you have a problem anyway. It is possible to vary this pattern to include the GUIDs that come from the fact table as well, however it doesn't really solve the root problem (you might have IDs in the fact table missing from the dimension table) so Therefore, I have not provided / suggested this as part of the solution. Also, doing this will certainly slow down the update time without solving the root problem.
Connect to data
As you can see in the image below, I have two queries (RawCustomer, RawSales shown as # 1 below) that connect directly to my sample data (I modified Adventure Works so that the customer number uses a GUID # 2 below). Note that my two raw data queries are simple connections. This is the technique that Ken teaches in the Power Query Academy Trainingand I think it's great practice. From there I created two test queries (shown as # 3 below) that are simple references to the raw data queries
So at this point it just looks like this
Create branch in Power Query on dimension table
- The next step I took was to right click on the customer readiness query and select "reference" to create a new branch. I called this new query CustomerGUIDs. In this query I just saved the GUID column and removed everything else. As a security step, I remove the duplicates in case there is a duplicate in the GUID column (unlikely though).
- Then I added an index column starting at 1 and named it CustomerID.
The new CustomerID column is the surrogate key. The beauty of this approach is that the surrogate keys will grow over time if the original GUID list grows (which is very likely). Also, if there are deletions from the customer table, the surrogate key will simply rebuild itself with the new dataset.
Rejoin the surrogate key table to form a new dimension table
You can see the before (# 1) and after (# 2) version of the customer table below.
Repeat the process to replace the GUID in the fact table
- I created a new query from the SalesStaging query (right click, "reference") and merged it with the CustomerGUIDs table.
- Then I extracted the surrogate key
- Then I removed the original GUID from the final sales table.
My final query dependency view looks like this.
I made sure all the tables were configured not to load apart from the Customers and Sales tables.
The demo I've shown you here is with a very small data set (19,000 rows). In my case, the surrogate key approach reduced the file size by more than 30%, and there was only one surrogate key table in my file. For my client who had around 400,000 rows of data, 3 GUID key columns, and a much larger number of unique GUIDs, the file size reduction was much larger, from 110MB to just 11MB (90% reduction with the surrogate key). Probably most importantly, there were noticeable performance improvements after the change.
Are there any negative impacts?
At this point, you might be wondering "won't this slow down the update time?" If you were thinking this, you are absolutely right, it will most likely slow down the performance of the update. However, it is much better to have a slower update time and faster runtime performance than the other way around.
What if I need my GUIDs for auditing?
A second problem is that you may need the GUIDs to be able to trace the data in your report back to the transaction on the source system. If this is a common need with your data, I suggest that you still create and use the surrogate key, but also load the GUID into the dimension table as an additional column. That way, the GUID is available but not used in the relationship. Better yet, remove the GUID from the dimension table and then bring it back later only if you have a problem that requires you to trace the source.
Conclusion and where to learn more
Here's a copy of workbooks I used in this article in case you are interested in taking a closer look.
If you want to learn how to be great with Power Query, I recommend that you take a look at the Power Query Academy online training at Skillwave.training. Ken, Miguel and I have joined forces to create the best and most comprehensive Power Query training course available.