Flat file to dimension model example

This article goes through an example of turning a flat file into a dimension model in Power BI.

Detailed Information

Data can be visualized in Excel, however, there are some limitations in Excel that Power BI can fix. The below is the example when you have a flat file and make it to dimension model to perform better in Power BI.

Background of this project:

The usage of Box is needed to be visualized. Here are some issues that Excel cannot handle easily.

  • Data is too big (7 GB, 35 millions rows)
  • Level of the tree (hierarchy) is complicated
  • need to add classification (faculty/staff/ student)

Goal:

Create star schema data model based on the current flat file.

showing a visualization of the goal to convert a flat file to a star schema.

Steps:

Data set up:

Make sure data is accessible and check the data.

  1. The data is in SharePoint as csv file
  2. Connect data in PowerBI Desktop (since this data will NOT be refreshed, used get data> csv option)
  3. click Transform data
  4. check column quality box
  5. Change date column 
    1. Since there are "N/A" values, replace them to null. 
    2. change type from abc → date

Creating dimensions:

Once you figure out which fields can go to dimension table, use reference function to create dimensions.

  1. Create path Dimension
    • right-click on the base table and click "reference"
    • showing the menu that appears after you right-click on the base table.
    • rename the new table (folder)
    • remove columns
    • remove duplicate
    • split column by delimiter
  2. Create File Dimension
    • right-click on the base table and click "reference"
    • rename new table to file name
    • remove columns (file_name & file_id)
    • Remove duplicate
  3. Go back to the base table and remove the columns (path_id, file_name & file_id)
  4. Create Box Data table
    • right-click on the base table and click "reference"
    • rename new table to "Box Data"
  5. Replace slash ( / ) with pipe( | )
    • right-click path_ids column and put values on the boxes
    • showing the pop up menu for replace values already completed to replace a slash with a pipe.
    • Remove the first pipe by using "Extract" function
    • showing the Text After Delimiter pop-up window set up to remove the first pipe.
    • replace with "10" for the last characters
    • showing the extract last characters pop-up window set for 10.
    • delete 10 and type Text.Length( [path_ids ] ) 
    • showing the DAX code for Extract text after delimiter.
    • = Table.AddColumn(#"Filtered Rows", "First Characters", each Text.Start([path_ids], Text.Length([path_ids]) - 1), type text)
    • showing the DAX code to replace the characters.
    • extract > text after delimiter with 5
    • showing the location of the text after delimiter button on the extract drop down menu.
    • change the column name to "path"
  6. Go to Box data table and remove file name & folder name columns
    • remove null date
    • showing the location of the equals button on the data filters dropdown menu.
  7. Go back to "Home" and "Close and Apply"
  8. Create Calender table (moved to Create Date/ Calendar table)
    • File > Options > Data Load > uncheck the  box for auto date/time
    • Check the earliest date and latest date by dragging the field on the visualization pane
    • Go to the Modeling tab on the ribbon 
    • Go to table pane and right-click the Calendar table, then choose Mark as date table
    • add new column and create "year"
    • Go to modeling pane and create relationships
  9. Go back to the visualization pane and create a new measure
    • showing the DAX code creating a new measure.
    • create some table
      • Choose Don't summarize
    • this guy shows count=2
    • Copy value > edit query > filter row
    • showing the Filter Rows pop up window.
    • showing the original DAX code for the table.
    • change "=" to "<>"
    • showing the DAX code for the table with the "=" changed to "<>".
    • filter null
    • showing the location of the number filter to filter nulls.
    • GO back to relationships
  10. New Measure: file size
    • showing the DAX code for the file size new measure.
    • hide the original column (size) by right-clicking the field
  11. New Measure: number of files (distinct)
    • showing the DAX code for the new measure for the number of distinct files.
  12. New Measure:  size (MB) 
    • showing the DAX code for the new measure of size in MBs.
  13. Go back to visualization, and drag fields to create some visuals
  14. Go to Data view and click folder and create new column: Hierarchy depth 
    • showing the DAX code for the new column of Hierarchy Depth.
    • data looks like this
    • showing an example of what the data looks like with the addition of the Hierarchy Depth column.
    • the earlier steps (changing slash to pipe) were needed because of this
    • New Column: level 1
      • showing the DAX code to add a new Level 1 column.
      • error; insert text before the delimiter then change path_ids to path
      • Showing the DAX code to code an error to read insert text before the delimiter then change path_ids to path.
      • null is the problem
      • showing an example of a empty/null cell which will cause issues.
      • right-click the cell > replace value
      • showing the Replace Values pop up window set to replace the null values with 99999999.
      • showing the DAX code to replace the null cells.
  15. Open Chrome and search for DAX formatter(daxformatter.com)
    • copy the DAX from PBI Desktop
    • Paste the code on the website
    • copy the DAX from website and paste it on PBI Desktop
  16. create new columns: Level 2, Level 3, ....Level 23.
    • showing the DAX code to create the new columns using creating the Level 4 column as an example.
    • create a hierarchy by dragging level 2 over level1 and so on...
  17. Change the source (need full data now)
    • Edit query
    • click source
    • add new path
    • showing the DAX code to add a new path.
    • error: comma on data breaks the report without double quotes (")
      • right-click on file_owner data cell > text filter
      • showing an example of the error that comma on data breaks the report without double quotes.
      • enable data for box_data partial (2) and folder, and box data
      • showing an exaple of the possible data loss warning pop-up window that will come up when you enable data for box_data partial and folder, and box data.
      • still error
      • showing the Apply query changes error pop-up window.
      • Tried to open the data file with Notepad ++ but the file is too big to open
      • These five rows are the errors
      • showing an example of five rows that contained errors.
      • modify deleted column
      • still error
      • showing the Apply Query Changes pop-up window indicating an error still being there.
      • showing the DAX code for the error.
  18. Notes: 
    • Integer is better than text 
    • PBI changes data type based on the top 1000 rows.  Change to "based on entire data set" on the bottom left- corner of the Query Editor. Look at the data by sorting the values.
    • showing an example message of the the column profiling being based on entire data set not on the top 1000 rows.
  19. Error: 
    • click the error cell
    • data in this row got shifted due to the double quote (") in data 
    • showing an example of the column with a error cell.
    • Solutions: Type CsvStyle=CsvStyle.QuoteAlways
    • showing the DAX code for the solution to resolve the errors.

Resource:

Flat file to star schema

Environment

  • Power BI