Part of my MCSA sertification preperation is reading Analyzing Data With Microsoft Power BI and Power Pivot for Excel by Alberto Ferrari and Marco Russo. The book is introduced with the topic data modeling.
I have to admit I am a bit blown away when I am reading about data modeling. I have done my fair share of work-arounds of raw-data to make sure the output is correct, but I had no idea about the strucure that is (potentially) in the background. This is going to be interesting!
Keywords from chapter 1:
- Granularity, level of detail
- Relationship between tables (and one vs many)
- Design techniques, normalization vs denormalization
- Star-schemas, dimensions and facts tables
- Ambiguity in data modeling
- Best practise of naming conventions
Different tables naturally have different level of details. A customer table has information on a customer level, while an order table has information on order level. These levels are not nessesarily the same and when aggregating your output may be incorrect.
One specific example I have seen is the use of discounts for orders above a specific level. Say the customer get a discount for placing an order above € 1000. The discount belongs in the order table. But what if you want to look into discounts given per product sold? The sales table has one row for each sale, which means potentially multiple rows for the same order. How do you assign an order discount to each of your products? Granularity!
In Power BI you set up the relationship between your tables. Setting it up requires an unique column in both tables to link the two. You must also understand which table is your source table and which table is your target table to set the correct direction in the relationship.
Reading about design techniques was really interesting. This I can relate to. When I have set up models in Excel, I have typically had loads of different tables holding different information – normalization (BTW, used in OLTP systems). I then included all the data I needed into one seperate table to set up the data needed for analysis – denormalization.
A star schema is a data-modeling technique or a set-up of related tables where you have one fact-table in the middle and multiple dimensions linked to it through relationships.
A dimension can typically be a customer, product or employee. It has attributes such as name and adresse, description, manufacturer and so on.
A fact table has metrics such as amount and quantity.
In a star schema, the facts table is on the ‘many’ side of the relationships, while the dimensions is in the ‘one’ side (example, ‘many’ sales to ‘one’ customer). Facts are related to dimensions (=> the direction of the relationship).
Ambiguity must be avoided at all cost! Let’s leave it at that.
Best practive for naming of columns.
- Use casing to seperateWords
- Facts tables are always plural (indicating the ‘many’- side of the relationship, “Sales”)
- Use singular for Dimensions (Customer)
- No acromnyms
- No long names
- Include the word ‘Key’ naming the primary key in each of the dimension tables (fex CustomerKey in the customer table)
That’s it for the introduction.