The goal of this article is to help you understand the differences between shaping your data (using Power Query) and modeling your data (using the modeling tools within Power BI). You will need to do both to create a robust Power BI report. I first wrote this article in September 2016 and am now updating it to reflect the latest developments in Power BI.
The four phases of a self-service BI project
I like to think of a self-service BI project as having four logical phases, as shown below.
The image above shows the logic data flow from source to delivery of final reports. While the image shows that the data flows from left to right, the reality is that when you build the solution, you will likely jump back and forth through the different phases iteratively until the job is done. Two of the previous phases are part of the current theme; Data acquisition (modeling) and Data modeling. Both are IT BI terms that business users may or may not have heard before.
Shaping and modeling your data
Data acquisition using Power Query is where the "modeling" is done, and data modeling using the Power BI Analysis Services engine is where the "modeling" is done. You must do both to create a good Power BI report.
Definition of modeling
Shaping is the process of extracting your source data, transforming the data in the form you need, and then loading in Power BI. This process is often referred to as ETL (extract, transform, and load). The reason I like the word shape is that it clearly describes what you are doing in this step. The process of shaping your data includes:
- Restructure the data from your source tables into new tables that meet the reporting database requirements. The table structure in a relational database is rarely (if ever) appropriate for Power BI. Don't fall into the trap of simply loading what's in your source data.
- Selecting only the necessary columns and rows. Load everything you need and nothing you don't need. It's easy to make changes later if necessary.
- Pre-adding the data (if you want / need it), although I generally don't recommend it in most cases (there are always exceptions).
- Name / rename columns and tables to make them easy to use for business.
- Loading everything into the Power BI database (the data model).
Whenever possible, it is best to do this setup work within views in your relational database (if you have one). If this is not possible for some reason, Power Query is there to do the job for you.
Definition of modeling
Modeling is the process of design the structure of your data tables, edifice the relationships between your tables, and writing steps to extract the business information you need in your reports.
Data modeling includes:
- To decide what data it loads and what it doesn't load.
- To decide in a table schema; Is it one table, two tables, many tables? You should always prefer a star scheme if possible.
- To decide what names do you give to the columns and tables
- To decide whether your tables are wide and short (many columns) or long and narrow (fewer columns)
- Join tables using relationships (when needed)
- Write business logic in measures to extract added value from your raw data.
- Correctly format measures and columns to meet business requirements (e.g. currency, percent, appropriate number of decimal places)
- Give good trade names to measures that accurately describe what is being calculated.
Notice above, that the deciding what to load is part of modeling, but The process of structuring data is part of shaping. Clearly, these 2 concepts are closely related and mutually dependent.
Things are not always clear
One thing that creates a lot of confusion for new Power BI users is that there are at least 2 ways to add columns to your data. You can add a column using Power Query or you can add a calculated column using DAX. Of course it may be possible in a third way; instead, add the data in the source database. Regardless of the method you use, the decision on whether to add the column (or not), and what should be in the column, it is a data modeling decision. It is debatable whether the process of actually adding a calculated column It's modeling or modeling, but I think about it as a molding task.
So which one should I use?
So there are 3 places you can add a column, so which one should you use? Well, it depends on what you are trying to do. My general advice on adding columns to your tables is to make the change as close to the source as possible. So, in order of preference, you must:
- Add the column in your source data (for example, your data warehouse) with the help of your IT friends if necessary.
- Add the column using Power Query on data load.
- Add the column within Power BI using a calculated column.
The reasons for this recommendation are as follows:
If you add the column in the source, you will have access to it for your current BI needs, but also for future BI needs and other BI users.. Adding the column in the source provides the most reusability, and if you need to make a change later, it will be the least maintenance. However, the reality is that it often cannot be added (or changed) in the source for various reasons, or at least cannot be made to happen quickly, so this is often not a particularly option for BI from self service.
If you add the column within Power Query (or in source) it will load and compress with all other columns during data loading. This usually means a better overall compression of the table and therefore a good result for the overall model. It also keeps all your modeling tasks together, which can make finding things more intuitive and easier to maintain.
The most common reason you should choose to add a calculated column (using DAX) is if you need to leverage existing parts of the model to create the new data. An Adventure Works example here should help. Suppose you want to classify your customers into bands, high sales, medium sales, or low sales. You need a new column in your customer table (a modeling decision) in order to use that column in a slicer in one of your reports. If you were to try to accomplish this task within Power Query, it would be a lot of extra work. You would have to calculate the total sales for each customer in Power Query, and that would require that
- Create a join between the customer table and the sales table
- Pre-add the sales data of each customer
- Group customers into size bands using the business logic you need.
- Add the column (high, medium, low).
The important point is that items 1 and 2 above probably already exist in the data model itself. In the case of Adventure Works, the model already has a customer table, a relationship to the sales table, and a measure that aggregates the sales. These model features can be used to easily add the new calculated column using DAX. So in short, you should prefer a calculated column when you leverage your model logic (measures and relationships) so you don't have to repeat this logic within Power Query.
Now, just because you can, doesn't mean you should. Please don't fall into the trap of using too many calculated columns instead of writing measurements. I talk about it here. Use calculated columns when you need them; otherwise, avoid them.
We hope this explanation of the Shaping and Modeling feature in Power BI helped clarify a few things. If this article was helpful to you, you might want to check out the other helpful articles in my knowledge base here.