Tuesday, January 19, 2021

Power BI reading Parquet from a Data Lake

Data lakes are becoming more common every day, and the need for tools to query them grows along with them.

While writing about querying a data lake using Synapse, I stumbled upon a Power BI feature I didn’t know was there.

When reading from a data lake, each folder works like a table. We store many files with the same structure in the folder, each file containing a piece of the data.

Data lake tools are built to handle data organized this way and read the files transparently for the user, but Power BI required us to read one specific file, not the folder. That was true until last November. If we google Power BI and Parquet files, we can find many workarounds to read Parquet files in Power BI, but no mention of the new Parquet connector released last November (https://powerbi.microsoft.com/en-us/blog/whats-new-in-power-query-dataflows-november-2020/), so I had to write about it.

The feature I’m illustrating in this article is in fact a combination of two features:

  • The ability to combine multiple files from Azure Data Lake Storage Gen 2. This went into preview in October 2019 and has been available for a while, but I was surprised I couldn’t find any article really explaining the M code used to combine the files and how to customize it.
  • The Parquet connector, which is responsible for reading Parquet files and adds this capability to Azure Data Lake Storage Gen 2. This connector was released in November 2020.

To illustrate how it works, I provided some files to be uploaded to an Azure storage account. You can download the files here. You will also need to provision a new storage account, and it will need to be an Azure Data Lake Storage Gen 2 account.

In the examples, I will use the address https://lakedemo.dfs.core.windows.net/opendatalake/trips for the storage, but you need to replace it with the DFS endpoint of your own storage account.

Let’s walk through it step by step:

  1. Open Power BI
  2. Select Get Data option on the main screen
  3. Select Azure Data Lake Storage Gen2. We will test directly with one of the most efficient options.

There are 3 storage options:

  • Azure Blob Storage
  • Data Lake Storage Gen 1
  • Azure Data Lake Storage Gen 2

It’s important to choose the correct option according to your storage type, since this affects performance.

  4. In the URL box, type this URL: https://lakedemo.dfs.core.windows.net/opendatalake/trips


  5. Click the OK button
  6. Click the Combine button


This screen has the traditional Transform and Load buttons but also has the Combine button, which has both options, Transform and Load, below it.

The traditional Transform and Load options deal with the list of files inside the Azure Storage folder. From that point on, it’s our decision what to do with each file.

The Combine button, on the other hand, brings us a pre-built M script that combines all the files in the folder. It’s easy to misjudge this feature and believe Power BI will read only the files currently present, but in fact the script is flexible enough to read every file in the folder, including files added in the future.

  7. Select the option Combine & Transform
  8. Click the Transform Data button

The M Code – How it Works

In the Power Query window, you may notice the pre-built steps in the Applied Steps pane. The way the queries were built is also very interesting: the final query is in a folder called Other Queries, while a folder called Helper Function contains a parameterized function.


Let’s analyze the M code to better understand how it works. You can access the M code through the menu View -> Advanced Editor.

This is how our M code looks:

let
  Source = AzureStorage.DataLake("https://lakedemo.dfs.core.windows.net/opendatalake/trips"),
  #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
  #"Invoke Custom Function1" = Table.AddColumn(
    #"Filtered Hidden Files1",
    "Transform File",
    each #"Transform File"([Content])
  ),
  #"Renamed Columns1" = Table.RenameColumns(#"Invoke Custom Function1", {"Name", "Source.Name"}),
  #"Removed Other Columns1" = Table.SelectColumns(
    #"Renamed Columns1",
    {"Source.Name", "Transform File"}
  ),
  #"Expanded Table Column1" = Table.ExpandTableColumn(
    #"Removed Other Columns1",
    "Transform File",
    Table.ColumnNames(#"Transform File"(#"Sample File"))
  ),
  #"Removed Columns" = Table.RemoveColumns(#"Expanded Table Column1", {"Source.Name"}),
  #"Grouped Rows" = Table.Group(
    #"Removed Columns",
    {"Month"},
    {{"Trips", each List.Sum([Trips]), type number}}
  )
in
  #"Grouped Rows"

These are the steps this code is executing:

  • Filter the files, making sure not to include hidden files
  • Use the AddColumn method to call the function “Transform File” for each row
  • Remove the additional columns, leaving only the file name and the result of the function
  • Expand the column containing the result of the function
  • Remove the Source.Name column and group the rows by Month, summing the Trips values

This main script calls the Transform File function for each file in the folder. There is no fixed file name; all the files will be transformed and returned. This means that whenever new files are added to this data lake folder, a simple refresh will bring the new data into the dashboard, keeping the solution as flexible as a client solution for a data lake needs to be.
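
Because we can see this script in the Advanced Editor, we can also customize it. For example, if the folder also contained files we don’t want to combine, a filter on the file name could be added right after the hidden-files filter. The sketch below is my own addition, not part of the generated code; the step name Filtered Parquet Files and the .parquet filter are just illustrations:

let
  Source = AzureStorage.DataLake("https://lakedemo.dfs.core.windows.net/opendatalake/trips"),
  #"Filtered Hidden Files1" = Table.SelectRows(Source, each [Attributes]?[Hidden]? <> true),
  // Extra, hypothetical step: keep only the files whose name ends with .parquet
  #"Filtered Parquet Files" = Table.SelectRows(
    #"Filtered Hidden Files1",
    each Text.EndsWith([Name], ".parquet")
  ),
  // The generated steps continue from here, now pointing to the filtered table
  #"Invoke Custom Function1" = Table.AddColumn(
    #"Filtered Parquet Files",
    "Transform File",
    each #"Transform File"([Content])
  )
in
  #"Invoke Custom Function1"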

The M code for the Transform File function is this:

let
    Source = Parquet.Document(Parameter1),
    #"Changed Type" = Table.TransformColumnTypes(Source, {{"DateID", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Month", each Text.Middle([DateID], 4, 2)),
    #"Grouped Rows" = Table.Group(#"Added Custom", {"Month"}, {{"Trips", each Table.RowCount(_), Int64.Type}}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Grouped Rows", {{"Month", Int64.Type}})
in
    #"Changed Type1"

The function uses the Parquet connector released in November to process each file.
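
The connector can also be used on its own. If you ever need to read just one specific file instead of combining the whole folder, something like the sketch below works. This is my own example, not from the generated code, and the file name 2020-01.parquet is only an illustration; use the name of a real file in your storage:

let
    Source = AzureStorage.DataLake("https://lakedemo.dfs.core.windows.net/opendatalake/trips"),
    // Pick the binary content of a single file (the file name here is hypothetical)
    SingleFile = Table.SelectRows(Source, each Text.EndsWith([Name], "2020-01.parquet")){0}[Content],
    // Parse the Parquet content into a regular table
    Data = Parquet.Document(SingleFile)
in
    Data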

Additional Transformations

We will probably want to make additional transformations to the data. For that, we have a choice to make: if we make the transformations on the main query, all the files will be combined first, and only after the combine will our transformations be executed.

On the other hand, we have the option to make the transformations inside the function. If we do so, the transformations will be applied to each file before the files are combined. By the time they are combined, each file will already contain the transformed result set.

For each transformation, we need to identify whether it performs better when executed on each file or when executed over the combined result.

Let’s compare both options.

Transformations on the Combined Result

  9. Select the main query, Query1
  10. Select the DateID column
  11. On the top bar, change the Data Type to Text


  12. Click the Add Column menu
  13. Click the Custom Column button
  14. In the New Column Name box, set the name as Month
  15. In the Custom Column Formula box, set the expression as =Text.Middle([DateID],4,2)


  16. Click OK
  17. Click the Group By button
  18. Select the Month column
  19. In the New Column Name box, type Trips
  20. Keep the default Operation, Count Rows


  21. Click OK
  22. Select the Month column
  23. On the top bar, change the Data Type to Whole Number

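For reference, after these steps the end of Query1 should contain M code along the lines of the sketch below. The step names are generated automatically and will likely differ in your file (the generated query may already end with a Changed Type step), so treat this only as a guide:

    // ...generated combine steps above, ending with #"Expanded Table Column1"...
    #"Changed Type" = Table.TransformColumnTypes(#"Expanded Table Column1", {{"DateID", type text}}),
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Month", each Text.Middle([DateID], 4, 2)),
    #"Grouped Rows" = Table.Group(#"Added Custom", {"Month"}, {{"Trips", each Table.RowCount(_), Int64.Type}}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Grouped Rows", {{"Month", Int64.Type}})
in
    #"Changed Type1"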

That’s it, our ETL is ready to be used in the dashboards. Let’s check the execution time of the ETL.

  24. Click the Tools menu


  25. Click the Start Diagnostics button
  26. Click the Home menu
  27. Open the Refresh Preview drop down
  28. Click the Refresh All menu item


  29. Click the Tools menu
  30. Click the Stop Diagnostics button

On the left side of the screen, in the Queries pane, you will find a new folder called Diagnostics with two queries inside it, holding the results of the diagnostics.

  31. Click the Diagnostics_Aggregate query


  32. On the StartTime column header, open the drop down
  33. Click the Sort Ascending menu item


  34. Take note of the time on the first record
  35. On the StartTime column header, open the drop down
  36. Click the Clear Sort menu item
  37. On the EndTime column header, open the drop down
  38. Click the Sort Descending menu item
  39. Take note of the time on the first record
  40. Calculate the time difference between the first time and the second time you took note of

In my example, the total time was 7 seconds. You may find slight differences.
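
If you prefer not to subtract the two values by hand, a small blank query can compute the total duration from the diagnostics result. The sketch below is my own addition and assumes the query and column names match the ones above (Diagnostics_Aggregate, StartTime and EndTime); in your file the diagnostics query name and column names may differ slightly:

let
    Source = Diagnostics_Aggregate,
    // Total elapsed time between the earliest start and the latest end
    TotalDuration = List.Max(Source[EndTime]) - List.Min(Source[StartTime])
in
    TotalDuration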

Transformations on Each File

Let’s build the example again, this time adding the transformations inside the function, so they are applied to each file instead of to the final combined result.

  41. Repeat steps 1-7 from the previous walkthrough
  42. Select the Transform Sample File query
  43. With the Transform Sample File query selected, repeat steps 9-23
  44. Select the main query, Query1
  45. Remove the last step in the Applied Steps window, the Changed Type step
  46. Select the first column, Source.Name


  47. Click the Remove Columns button
  48. Click the Group By button
  49. Select the Month column
  50. In the New Column Name box, type Trips
  51. In the Operation drop down, select Sum
  52. In the Column drop down, select Trips


  53. Click OK
  54. Repeat steps 24-39

In my example, the execution time was 4 seconds.

Conclusion

As you may notice, in this example the transformations made on each file performed better than the transformations made after the final combination. This illustrates how important it is to understand this structure and test the performance, deciding which approach works better for your transformations.

This new feature may not be so obvious, hidden behind a Combine button and a complex M code structure, but it’s still much better than the possible workarounds for the problem.

The post Power BI reading Parquet from a Data Lake appeared first on Simple Talk.


