Azure ML Pipelines and MLOps with GitHub Actions – Part 3

As this is Part 3, you might want to review Part 1 and Part 2 for context if you haven’t yet.

In Part 3, I will finally focus on the GitHub action.

All of the code for this can be found on my GitHub.

How to run the training process from GitHub Action

To put the "Ops" in "MLOps" the pipeline execution should be automated whenever possible. The scenario being used in this repository is when code is checked in, the training pipeline is automatically kicked off from a GitHub Action and if the newly trained model has a higher accuracy than the previous model, then it will be registered in the model repository.

To review, iris_supervised_model.py does the training and register_model.py obviously registers the model. The pipeline that runs these two steps is built and executed from train_pipeline.py. So it is this train_pipeline.py (also referred to in this repository documentation as the driver script) that needs to be executed from a GitHub action.

Creating the GitHub Action

Actions allow for automation of processes. They are YAML scripts.

I won’t pretend to be an expert on YAML or the action schema, but i pulled a sample and made some modifications. Action scripts should be placed in .github/workflows directory of the repo.

name: iristrain
on: [push]
jobs:
  run:
    runs-on: [ubuntu-latest]
    container: docker://dvcorg/cml-py3:latest
    steps:
      - uses: actions/checkout@v2
      - name: train_pipeline
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
          AZUREML_CLIENTID: ${{secrets.AZUREML_CLIENTID}}
          AZUREML_TENANTID: ${{secrets.AZUREML_TENANTID}}
          AZUREML_SECRET: ${{secrets.AZUREML_SECRET}}
          AZUREML_SUBSCRIPTION: ${{secrets.AZUREML_SUBSCRIPTION}}
          AZUREML_RESOURCE_GROUP: ${{secrets.AZUREML_RESOURCE_GROUP}}
          AZUREML_WORKSPACE: ${{secrets.AZUREML_WORKSPACE}}
        run: |
          # Your ML workflow goes here
          pip install -r requirements.txt
          python azureml/train_pipeline.py

The above code shows that on [push] so when any code is pushed to the repository, the jobs: will be performed. Unique to this example is the environment variables env: that are all populated from GitHub secrets. They are passed into the train_pipeline.py script which is invoked with the last line of the yaml file above. Any libraries required to run the python script need to be installed first on the container therefore the pip install -r requirements.txt is there. The current requirements.txt has more than what is needed to run the script, specifically azureml-sdk is the entry required for this scenario.

Service Principal Authentication to Azure ML

Unlike building a pipeline and running it from a Jupyter notebook interactively, to use automation the credentials to login to Azure ML have to be stored. Using an Azure Active Directory user principal is the right way to do this.

Creating a service principal in Azure will likely require elevated permissions in Azure. Work with your Azure administrator to enable this.

In this notebook there is a section for Service Principal Authentication that walks through the setup. Once a service principal is created, it can be used from the train_pipeline.py script.

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

svc_pr_password = os.environ.get("AZUREML_SECRET")
svc_pr = ServicePrincipalAuthentication(
    tenant_id=os.environ['AZUREML_TENANTID'],
    service_principal_id=os.environ['AZUREML_CLIENTID'],
    service_principal_password=svc_pr_password)

ws = Workspace(
    subscription_id=os.environ['AZUREML_SUBSCRIPTION'],
    resource_group=os.environ['AZUREML_RESOURCE_GROUP'],
    workspace_name=os.environ['AZUREML_WORKSPACE'],
    auth=svc_pr
    )

Note the heavy use of environment variables. This is to ensure that credentials and other private variables are not stored clear text in this repo for the world to see. These are being passed in from the action yaml shown above.

More comprehensive information on Azure ML authentication can be found here

GitHub Secrets

The above section highlights the environment variables being used. These variables are initiated from the GitHub repositories action secrets.

GitHub Secrets

These secrets are created in the repository from the "Settings" tab. In the "Secrets" section, click the "new repository secret" button to add a secret.

GitHub New Secret

It should be obvious that the secret names need to match the secrets given in the yaml action script: ${{secrets.AZUREML_CLIENTID}}.

Invoking the GitHub Action

Any change to the source code based on this simple action definition will invoke the train_pipeline.py which defines and then submits the iris_train_pipeline with the snip of code below.

iris_train_pipeline = Pipeline(workspace=ws, steps=[trainingScript,registerModelStep])
print ("Pipeline is built")

exp = Experiment(ws,experiment_name)
exp.set_tags({'automl':'no','working':'no'})

pipeline_run1 = exp.submit(iris_train_pipeline)
print("Pipeline is submitted for execution")

pipeline_run1.wait_for_completion()

This will happen from master or a branch. To see the action in action, in iris_supervised_model.py change the n_splits from 5 to 3.

code change example

Do a commit and push. Note: I have been using VSCode. The Python editor and GitHub integration is legit!

On the Actions tab of the repo, a new action is now visible with a "yellow" icon to indicate "in progress" actions list

Click on the run to see the details in progress action

In the details section click on the run itself and the logs can be reviewed in real time. First GitHub is acquiring a container image to install the necessary python configuration to run the train_pipeline.py script. action detail 1

Once the image is acquired, it kicks off the Azure ML Pipeline which can be reviewed in Azure ML Studio action pipeline

pipeline_run1.wait_for_completion() is an important line in train_pipeline.py that keeps the script from completing until the Azure ML pipeline completes. Without this line the action will finish while the pipeline is still running. This may be desired depending on the scenario.

The logged pipeline outputs will be displayed in the run output. action complete logs

When it is complete, the status will show green. acton complete

Conclusion

That is it. Now your training process is fully automated based on code check in. And yes, YAML is still the most awful definition language but you have to just shut up and deal with it ūüôā

Azure ML Pipelines and MLOps with GitHub Actions – Part 2

It is usually better to start at the beginning so you might want to head over to Part 1 if you missed it.

In Part 2 I will focus on managing the pipeline run with the “run context” and then registering the model in a way that ties the model back to the pipeline, artifacts, and code that published it. In Part 3 I will conclude with pipeline schedules and GitHub Actions.

All of the source code can be found on my GitHub so don’t be shy to give it a star ūüôā

Leveraging Run Context

Azure ML allows execution of a Python script in a container that can be sent/run on AML compute clusters instead of a local machine. This could be a data transformation script, a training script, or an inferencing script. The below examples shows how to do this for a simple training script (stand alone, in absence of a pipeline)

from azureml.core import Experiment
experiment_name = 'train-on-amlcompute'
experiment = Experiment(workspace = ws, name = experiment_name)

from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory=project_folder, 
                      script='train.py', 
                      compute_target=cpu_cluster, 
                      environment=myenv)
 
run = experiment.submit(config=src)

These "runs" are executed via a submit command from an experiment. Being able to log information to the run from within the script itself (in the above example train.py) is key.

In this repo, iris_supervised_model.py leverages run context to log metrics, tables, and properties. run = Run.get_context() This is the magic line that connects a vanilla Python script to the context of the run, inside the experiment, inside the Azure ML workspace.

Now, metrics can be logged
run.log("accuracy",best_score)
and tables
run.log_confusion_matrix('Confusion matrix '+name, confusion_matrix(Y_train, model.predict(X_train)))

See this sample notebook for all the things logging.

TIP When relying on run context of Azure ML (such as environment variables being passed in from the driver script) performing the following check early in the script can allow defaults to be set for anything that would have been passed in. This allows for local testing which is a time saver.

if (run.id.startswith('OfflineRun')):
	os.environ['AZUREML_DATAREFERENCE_irisdata'] = '.\sample_data.csv'
	os.environ['AZUREML_DATAREFERENCE_model_output'] = '.\model_output'

Managing the pipeline execution

In a pipeline, each run is at the step level, or a child of a parent run which is the pipeline itself.

Pipeline Parent Child

It may be best to log important metrics or properties at the pipeline level rather than at the step level (or both). run.parent will get the parent run context. The code below sets the two properties by passing in a dictionary as the parameter and those same values on two tags as well.

run.parent.add_properties({'best_model':best_model[0],'accuracy':best_score})
run.parent.tag("best_model",best_model[0])
run.parent.tag("accuracy",best_score)

Properties are immutable while tags are not, however tags are more predominant in the Azure ML Run UI so they are easier to read. tags

To review the added properties click "Raw JSON" under "see all properties".
see all properties

properties

Now that the results of the training are published to the parent pipeline tags (and properties), they can be used to control what happens in execution of later steps. In register_model.py, the accuracy score is going to control if this model will be registered or not.

Model Registration

The model artifact should be registered as it allows "one click" deployment for real time inferencing hosted on AKS or ACI. Even if the intention is to use it for batch inferencing with Azure ML pipelines it is a more organized way as shown below to keep full context of how the model was built verse just storing the pickle file off in a cloud storage location.

In context of this example pipeline, training has been completed in iris_supervised_model.py. The best model accuracy has been recorded in the tags of the pipeline run.

In the next step of the pipeline register_model.py, retrieve the parent pipeline run context with parentrun = run.parent and review the tags that have been set.

The below code block shows getting the accuracy score from the tag dictionary for the current pipeline run, but also an alternative method to interegate previous steps in the pipeline to retrieve the tags by using parentrun.get_children()

tagsdict = parentrun.get_tags()
if (tagsdict.get("best_model")) != None:
    model_type = tagsdict['best_model']
    model_accuracy = float(tagsdict['accuracy'])
    training_run_id = parentrun.id
else:
    for step in parentrun.get_children():
        print("Outputs of step " + step.name)
        if step.name == training_step_name:
                tagsdict = step.get_tags()
                model_type = tagsdict['best_model']
                model_accuracy = float(tagsdict['accuracy'])
                training_run_id = step.id

The model can be registered directly to the workspace, but the context of how the model was built is then disconnected from the training pipeline. Instead, the model will be registered from the pipeline run object. To do this the model artifact (model.pkl file) needs to be uploaded to the parent run.

# to register a model to a run, the file has to be uploaded to that run first.
model_output = os.environ['AZUREML_DATAREFERENCE_model_output']
parentrun.upload_file('model.pkl',model_output+'/model.pkl')

Next, see if the model name is already registered. If so, record the accuracy score of the previous model to compare against the new model. If this is the first time the model has been trained it won’t exist in the registry so set the accuracy to beat equal to 0.

try:
    model = Model(ws, model_name)
    acc_to_beat = float(model.properties["accuracy"])
except:
    acc_to_beat = 0

Compare the new model accuracy with the previous model accuracy to beat and if the model is better, register it. Note: the model is being registered via parentrun.register_model and not Model.register_model. This is important as it nicely ties the registered model and artifact back to all the context of how it was created.

if model_accuracy > acc_to_beat:
    print("model is better, registering")

    # Registering the model to the parent run (the pipeline). The entire pipeline encapsulates the training process.
    model = parentrun.register_model(
                       model_name=model_name,
                       model_path=model_path,
                       model_framework=Model.Framework.SCIKITLEARN, 
                       model_framework_version=sklearn.__version__,
                       sample_input_dataset=dataset,
                       resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5),
                       description='basic iris classification',
                       tags={'quality': 'good', 'type': 'classification'})

Set additional properties for accuracy and model_type so that the next time training is ran the current accuracy will be compared against that model (just like above)

model.add_properties({"accuracy":model_accuracy,"model_type":model_type})
model.add_tags({"accuracy":model_accuracy,"model_type":model_type})

Access the run logs, outputs, code snapshots from registered model

In the model registry, when registering from the run itself, it hyperlinks to the run id.
Model

This links back to the pipeline run.
Pipeline

Notice that when clicking on the iris_supervised_model.py step, there is access to the outputs/logs, metrics, and even the snapshots of the code used to generate the model artifact that is registered.
Snapshot

Conclusion

Registering the model from the pipeline run gives complete context of how the model was built and registered! Its sets up real time and batch inferencing deployment as next steps.

Up Next Part 3

Azure ML Pipelines and MLOps with GitHub Actions – Part 1

I have been working on customer projects with Azure ML pretty regularly over the last two years. Some common challenges:

  • Microsoft highly promotes the AKS deployment for real time inference, yet most of the time customers are still looking for an effective way to do batch scoring.
  • When customers leverage Azure ML pipelines for batch processes they struggle with the concept of pushing datasets and files between steps. This erodes the true power of splitting an ML process into steps.
  • MLOps is hard and overwhelming.

This is not a “start from the beginning” blog post. This is going to assume that you have familiarity with Azure ML If you are not, the sample notebooks are seriously EXCELLENT! However, they seem to get you 90% there but miss out on implementation details that are key for success.

The scenario I am using below and can be found on my GitHub. It is an Azure Pipeline that trains several iris classification models. It picks the best one and logs it. In the next pipeline step, if that model is better than the previous training run, it will register the model. This training pipeline can be put on a schedule or it can be triggered from a code check in. In this case, from a GitHub action.

In a later blog post, i will discuss in more detail the model registration process (some production tips there) and the GitHub action, but I will start with properly passing datasets and files between steps.

Passing datasets and files between steps

Other than a few blogs I have found on the internet, instructions on how to properly pass files or datasets between steps are hard to find.

pipeline_image.PNG

In the above image you can see that irisdata is passed into iris_supervised_model.py and then model_output is the output. When you define the pipeline in the driver script, the input data is a DataReference object and any data passed between steps is a PipelineData object.

from azureml.core.datastore import Datastore
from azureml.data.data_reference import DataReference
ds = ws.get_default_datastore()
print("Default Blobstore's name: {}".format(ds.name))

dataset_ref = DataReference(
    datastore=ds,
    data_reference_name='irisdata',
    path_on_datastore="data/sample_data.csv")
print("DataReference object created")
from azureml.pipeline.core import Pipeline, PipelineData
model_output = PipelineData("model_output",datastore=ds)
print("PipelineData object created for models")

In the PythonScriptStep, utilize the input and output parameters.

from azureml.pipeline.steps import PythonScriptStep
trainingScript = PythonScriptStep(
    script_name="iris_supervised_model.py", 
    inputs=[dataset_ref],
    outputs=[model_output],
    compute_target=aml_compute, 
    source_directory="./azureml",
    runconfig=run_config
)

Simply pass the “model_output” from outputs as input to the next step (the register_model.py that will be a focus of the next blog post) and so on.

Using these references in the script

When you submit a pipeline job to run, a container is created and all the files in the source_directory specified in the PythonScriptStep are imported into the container. The input and outputs effectively become mount points for blob storage to that container. In the iris_supervised_model.py script step this mount point is accessible via an environment variable that looks like the below.

os.environ['AZUREML_DATAREFERENCE_irisdata']

This is also the same environment variable format used for the output location (the PipelineData object) which appears to be a randomly created storage location given to you from AzureML.

mounted_output_path = os.environ['AZUREML_DATAREFERENCE_model_output']

Looking at the mounted_output_path variable above gives a location like: mnt/batch/tasks/shared/LS_root/jobs/amlworkspacesjh/azureml/715a1dca-fafc-4899-ae78-ffffffffffff/mounts/workspaceblobstore/azureml/71ab64d9-bc4c-4b74-a5a5-ffffffffffff/model_output


You should be able to treat these environment variables as a file location just like a local path. So for the irisdata which was a csv file in the data reference you can read it like normal.

df = pd.read_csv(os.environ['AZUREML_DATAREFERENCE_irisdata'], names=column_headers)

For the model_output we pickle the model file and save it to the mounted_output_path.

pkl_filename = "model.pkl"
mounted_output_path = os.environ['AZUREML_DATAREFERENCE_model_output']
with open(os.path.join(mounted_output_path, pkl_filename), 'wb') as file:
    pickle.dump(best_model[1], file)

Now look into register_model.py, we utilize the PipelineData object (model_output) as our input and reference the same environment variable as in iris_supervised_model.py

mounted_output_path = os.environ['AZUREML_DATAREFERENCE_model_output']
print("model path",model_output)
print("files in model path",os.listdir(path=model_output))

In the file list, model.pkl is there right where it was created in the training script.

Conclusion

The ability to pass data between pipeline steps is pretty easy, but the documentation on using the magic “AZUREML_DATAREFERENCE_***” environment variables is lacking in most of the sample notebooks I have found. Just remember that these are mount points and can be interacted with just like local files basically.

Up next, Part 2

The “where have I been?” footnote

It has been 21 months since my last blog post. My role at Microsoft has led me to focus much more on cloud data services for only a couple of customers. I loved to blog about Power BI but I just haven’t been in that space for awhile as my day to day responsibilities were handed over to the much more capable @notaboutthecell. I have been working a lot on real time stream processing (with Databricks / Cosmos DB / Azure Functions) and ML engineering activities with Azure ML. Blog posts have been difficult as so much of my work is implementation oriented and it is hard to recreate everything in a publicly sharable way.

Or maybe i have just been lazy ūüôā

Anyway, I am sure that the blog posts in my future are probably going to be more narrow in application and probably won’t be “marathon reads” that explain everything in detail but hopefully enough to connect the dots for the people who need it.

Power BI Governance – Approve/Reject all dataset creation in your tenant

This is a post about Power BI Governance that will not use any Power BI capabilities. We are going outside the box here to get the job done on keeping your environment governed.

If you can complete 3 simple steps that I will show below, then you can get an email like the one below every time someone adds an unauthorized dataset into your Power BI environment.

This blog post was inspired by some great work by Kent Weare and our Flow team here: https://flow.microsoft.com/en-us/blog/automate-flow-governance/ It was just close enough to what I needed to do that it worked out. In this post if I skim over a few details, refer back to this post and it might help.

Scenarios

Power BI has introduced more governance controls recently with the GA of Workspaces V2. I encourage everyone to use the out of the box features that now exist especially the contributor role discussed here.

In my scenario, the problem happens to be that V1 workspaces don’t have the same governance capabilities and any Power BI Pro user can go create a Team or O365 group and they will automatically become an administrator of the corresponding Power BI workspace (V1) that gets created. I want to monitor for any datasets getting created in V1 workspaces and be able to delete them even though I am NOT an admin in those workspaces. This is NOT possible to do without intervention I will outline below. Of course, i want to be able to contact these users and coach them about using new V2 workspaces.

Another scenario that can be done with this is to monitor for “rogue” gateway datasets. Would need a few modifications to the below workflow but would also be a popular scenario. I have talked to some organizations that have found dozens of gateways being used that they didn’t know about.

Three simple steps

I am lying as these steps aren’t so simple. They require skills that aren’t necessarily what a BI administrator may have. The good news however is that if you can get a little help from an O365 global administrator and maybe a developer friend, you can knock this out in less than a day. I have done most of the hard work on this and collectively built this solution in probably two developer days.

The three steps described below are as follows

  1. Need to register an app in Azure Active Directory to be able to use the Office 365 Management API, the Graph API, and the Power BI API. You will need a global administrator grant a set of permissions to this App before steps 2 and 3.
  2. In Microsoft Flow (or in Azure Logic Apps) import the provided “Flow” that contains all of the logic required for this.
  3. Use Postman (or REST API tool of choice) to call the O365 Management API to set the Flow created in step 2 as a “webhook” to any O365 audit log events (which of course includes “PowerBI-CreateDataset”.

Once the above is completed you will be amazed at how you now have the control that you didn’t think was possible. You no longer have to constantly audit for bad actors.

Let’s get started!

Azure Active Directory (O365 permissions) Setup

You may need a trained professional for this as I won’t be listing out all the necessary steps here. What makes it even harder is that we recently changed the user interface for this so catching up with some old docs may be confusing.

Go to the Azure Portal and from Azure Active Directory go to “App registrations” and click on¬†“New registration”. Give it a unique name and the redirect url can just be localhost

After it is registered, open it up and record/copy the following properties on the “Overview” dialog:

Application (client) ID
Directory (tenant) ID

You will need both of these GUIDs later.

Go to the “certificates & secrets” dialog and create a¬†“new client secret”. Also record/copy this value. It will be referred to as “App Secret” later in this post and we will need it.

On the “API permissions” dialog you will need to add permissions from 3 services: Office 365 Management API, Graph API, Power BI Service. These should be listed as groupings that you can select specific permissions out of. The screen shot below shows the permissions out of each group you are going to need. Pay attention to the¬†“Type” indicator of Application or Delegated.

Until this point, you might be surprised that you have been able to do all of this without a global administrator. Well, none of this will work if you don’t do a very important task of clicking the “Grant Consent” button at the bottom of this dialog page. This is going to either not exist or be grayed out if you are not an administrator.

Note: Go back to the overview dialog for this app and simply take note of the “Endpoints” button at the top. These are the different API authorization endpoints for the different services. I have provided what you need in the source code for the Flow and Postman, but it is nice to know how to get back to verify these later if something isn’t working for you.

Go back to Azure Active Directory and in the left nav go to “Users”

We need to find a Power BI Administrator that we don’t mind recording a password value for to be used in our Flow app. Often this is going to be a service account of some sort that gets used for API purposes ONLY and has different restrictions on password rotation/etc. This should have “Power BI Admin” privledges.

In my example i will be using the same account as a Power BI Admin Service Account and Power BI Admin email notification but you don’t want this in real life.¬† Get the “Object ID” from the Profile dialog and record/copy it for later use.

That’s it. Step #1 complete.

Import package to Microsoft Flow and Configure

In the source code repository¬†on GitHub I have provided both a Zip file and a JSON file that can be used. I haven’t fully tested the JSON file as that is supposed to allow you to import directly to Logic Apps. Microsoft Flow after all is built on top of logic apps so if you prefer that route go for it. For this blog post we are going to use the zip file (aka package).

Two connections that I use in this flow that you should be aware of is “Office 365” and “Flow Approvals”

When you import the package, you are going to need to use existing connections for these or create them in the “Data” section.
Note: I have no idea about Flow licensing or if your organization is setup for it. It could be possible that doing Flow Approvals requires a certain SKU or something. Literally no idea. If the below import steps fail simply because of the Approvals connector, let me know via Twitter @angryanalytics and I can create a version without approvals and just notifications. 

The import package dialog looks like this. You can use the “Action” button to make some modifications upon import (not sure why anyone wouldn’t like the name “angry governance”)

The related resources “select during import” link is where you can define new connectors if you don’t already have them in the “Data” section of your Flow app.

Once imported hover over the Flow item and you will see an “Edit” icon appear, click that.

The below is a screen shot of the not even the entire flow, only about 90% of it because it was too big to zoom out enough. You can see that it is quite complex.

I will attempt to explain each step of this Flow below. This turned into a dissertation so if you don’t care and just want to move on to making it actionable, skip below.

  1. The Flow is triggered by an O365 webhook (which we will configure later). Any time an audit event is created in O365 (which includes Power BI events) this Flow is triggered.
  2. Variables need to be set (some of the stuff I had you record earlier will need to be filled in here… i will explain below). These variables drive communication with the APIs we need that all use the permissions setup in Step #1.
  3. Call the Get Log Event API from O365 Management APIs and pass it an authorization token and the contentUri from the trigger event (this is effectively a link to lookup what happened).
  4. In parsing the response of the Log Event, there are going to be multiple records return. It enters a for each loop and if the “Operation equals CreateDataset” we do further inspection of the event, otherwise we do nothing.
    1. Note that you could modify this to look for any events listed here. The “CreateGateway” event comes to mind as a good one to notify a BI administrator on. Maybe good for the next blog post.
  5. For my scenario, If we do find a CreateDataset event I want to check if the Group created is a V1 or V2 workspace. This is done in two parts.
    1. Call the Graph API to “GET” the Group (in Power BI we call them Workspaces but in O365/Graph they are Groups).
    2. If the Workspace that the dataset was created in is a V2 workspace I don’t care about it as I already have governance controls there like a Contributor role and an admin setting that allows me to restrict users from creating workspaces. V2 workspaces do not show up in O365/Graph so therefore if I don’t get a status code of 200 (which means the call was successful) then I assume it is a V2 workspace and therefore i can stop processing.
      Note: What i did also find out late in testing is that datasets created in “My Workspace” will also throw a non 200 status code so I am not auditing for datasets there either.
  6. If this is a V1 workspace, i am now going to use the Graph API to add my Power BI Admin service account to that workspace. I have to do this because if i choose to reject this dataset as an administrator, the Power BI API WILL NOT work to delete the dataset unless the user is an admin of that workspace. Adding this account also gives the option to the Power BI administrator to login with that service account and do a manual inspection before approve/reject.
  7. I now check for a status code of 204 which indicates I added/created the user assignment successfully. If it wasn’t successful, that could actually mean that my Power BI Admin service account was already an admin of that workspace (might be rare but could happen). If this happens it throws a 400. I am not doing further inspection on a 400 error at this time so be aware that something else could trigger that and could mess up your Flow. If 400 is thrown, i set a variable via a “Compose” element to indicate the user is already a member/admin of the group. I use this later. If I didn’t get a 204 and I didn’t get a 400 I assume some other error occured and i send an email to the Power BI admin that a flow has failures
  8. Now we send a polite email warning to the user who created the unauthorized dataset in a V1 workspace and let them know it will be reviewed by an admin.
  9. We start an approval process that sends an email to the Power BI admin email address that was set in the Variables section in Step #2 with all the important information about the dataset.
    Note: For some reason the ReportName and ReportId are not working and therefore the link to the item is broken. Need to fix this.
  10. If the admin clicks the¬†“Approve” button on the email, an approval email is sent to the user who created the dataset, if not, proceed to step #11
  11. Now we have to call the Power BI API to Delete the dataset automatically. We send a rejection email to the user so they know it was deleted.
    Note: I don’t believe that V1 workspaces allow for use of user principals for authorization so this is why we had to set a Power BI Admin service account id and password in step #2. It would be cleaner to not have a user identity at all and just use the app permissions.
  12. Check the variable we set in Step #7 to see if the power bi admin service account needs to be deleted from the Group and if so we use the Graph API to do that so that there is no residual effect to that O365/Team from the Power BI admin inspection process once it is completed.

In the workflow, the only section that requires configuration is the “Set Variables” in the second action. Click on it to explode the “Scope”.

Other than the last variable for the email address to be used for the approvals, all of these are used for the API calls. They are required for authorization to do some pretty intrusive things like adding administrators to O365 groups and deleting Power BI datasets. This is why we needed an O365 global administrator to approve our app permissions earlier as this shouldn’t be taken lightly.

Let’s go through each of these and click on them to set the values:

Client Key Р(aka App Secret) You recorded this in the first section. Often these will have special characters so we need to html encode it. Use https://www.url-encode-decode.com/ and encode your client key and take that value and paste it in here
Client ID – paste exactly as copied (shouldn’t contain special characters)
Tenant ID – paste exactly as copied
Power BI Admin ID (for Graph) –¬†the Object ID for the admin service account exactly as copied
Power BI Admin email-username (for Power BI API) Рthe email address of the admin service account discussed in the first section
Power BI Admin Password (for Power BI API) Рthis is the password of the service account discussed in the first section
Power BI Admin Email (for Flow Approvals)¬†– This is an email address of a REAL PERSON who is checking emails on this account. Someone who can make the desicion to approve/reject datasets. Can it be a distribution list or security group? I don’t know for sure, try it out ūüôā

Once you have updated these variables, you have completed the configuration of your Flow (or logic app). The last thing you need to do is go to the first activity in your Flow (the trigger) and copy the HTTP POST URL. This is the “webhook” you need for Step #3.

At the top of the page, click “Save”.

That’s it. Step #2 complete.

Create the Webhook

This is by far the easiest shortest step of the 3… however it can be intimidating if you aren’t familiar with REST APIs and tools like Postman. This is where a developer friend can make very short work out of this task.

Every time an event happens in Power BI, we want our Flow to be called to inspect it. We can do this by setting up a webhook. A webhook is simply an automated endpoint that gets called every time something occurs.

One thing to note about Power BI events: It could take as long as 30 minutes after the event occurs for the event to show up in the audit logs. The webhook fires immediately after that, but when you go to test this, realize that it is pretty consistently 20-30 minutes before your Flow will get executed.

In the source code that i provided i included a Postman collection that can be used to do testing on a lot of the API calls that are used in the Flow. These are provided for troubleshooting purposes.

Import¬†the collection from the my GitHub repository. You can either download it and import or simply “Import from Link”.

In that collection is an O365 Management API folder that has an “Auth” and then a “Create Webhook” API that need to be executed to hook up your Flow with the URL we recorded at the end of Step #2.

Run the “Auth” first. When you go to run it you will notice that i have used variables for the url and the body (x-www-form-urlencoded).

Create an environment with these variables or simply replace these values with those recorded in Step #1. For the redirectUri just use http://localhost or the custom value you used when setting up your app registration in step #1.

Send the request (should be a POST) and copy the value returned in the “access_token”. This value is how you authorize subsequent calls to the O365 Management API. It is good for one hour and holds information about the permissions your app has (those api permissions you setup in step #1).

Now open the “Create Webhook” request. In the “Authorization” tab replace the “Token” value with the access_token you just recorded. In the “Body” replace the {{flowurl}} variable with the Flow trigger url you recorded at the end of Step #2. Replace the {{activityfeedauthid}} with some arbitrary name. Of course i used “angrygovernance

If you got a success then your webhook should be created and you are now ready to test.

Step #3 complete!

Testing

Upload a Power BI report to a V1 workspace in Power BI. Go to the Flow you imported and Saved. After about 30 minutes you should see a Flow that is running (it has manual approval steps to be taken so it will run for months until approved/rejected.

you will see other “succeeded” runs of your flow that are not important. These are other events occuring in O365 such as you saving your Flow… We have a condition looking for only “PowerBI-CreateDataset” events before we are kicking off approval process.

You should see emails generated to the administrator email id you specified and also back to the user that created the dataset. If you reject the dataset, it should automatically get deleted (along with the report) and the Power BI Admin user should get removed from the V1 workspace if it was not already a member of it before the Flow started.

Conclusion

I know that your scenarios may vary from mine that i have provided above, but with this Flow to use as a basis, you should be able to do some adjustments and create really kick ass governance Flows that give you more control in Power BI than you ever had before.

 

 

 

 

Sync your on-prem DW to Azure DW with 3 ADF pipelines

Most organizations are trying to move to cloud for advanced analytics scenarios, but they have one big problem: They have invested a decade in an on premises data warehouse that has too much spaghetti architecture around it to untangle. I will discuss full migration options in Part 2 of this blog post, but will be focused in this article about using Azure Data Factory to keep an on prem DW (whether that is Teradata, Netezza, or even SQL Server) synchronized to Azure SQL DW on a nightly basis. Synchronization is important because it will allow all of your “new development” to happen in the cloud while allowing the 12-24 months that you may need to do a true DW migration project to pull the legacy connections to the new environment.

This article is going to assume at least some basic understanding of Azure Data Factory V2 (ADF). It is also a good idea to be familiar with this documentation on copying data to and from Azure SQL DW with ADF.
https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-sql-data-warehouse

TL;DR

This post focuses on most difficult part of data synchronization: dealing with updates! Most data warehouses still see updated records and the ADF templates provided for “delta loads” only deal with inserts. On my github¬† https://github.com/realAngryAnalytics/adf I have provided an ARM Template that can be imported that includes the control tables and stored procedures needed to make this work.

Azure Data Factory V2 and Azure SQL DW Gen 2

These products have matured over the last couple of years so much that it has made me (an analytics guy) excited about data movement and data warehousing. With the addition of Mapping Data Flows https://docs.microsoft.com/en-us/azure/data-factory/data-flow-create ADF now has a real transformation engine that is visually aesthetic and easy to use. I have also found when using SQL DW Gen 2 my Direct Query times from tools like Power BI have significantly decreased and starts to make Direct Query a real option for BI tools.

All you need is 3 pipelines!

Of course, the devil is always in the details, however for most large enterprise data warehouses, these 3 pipelines should cover 98% of your scenarios. You may need a custom pipeline here or there but the idea is to write 3 generic pipelines and use control tables and parameters to handle 100s or 1000s of tables.

Delta Loads and dealing with updates

In this post, we will deal with the most difficult part of data synchronization which is updated records in a delta load. Azure Data Factory provides a template for delta loads but unfortunately it doesn’t deal with updated records.

Delta load template:
https://docs.microsoft.com/en-us/azure/data-factory/solution-template-delta-copy-with-control-table

Accompanying this blog post is my GitHub repository https://github.com/realAngryAnalytics/adf  where you will find all the resources needed to perform the below example.

Example Dataset

This example uses an Azure SQL Database with Adventure Works DW 2016 installed. Can be downloaded here: https://www.microsoft.com/en-us/download/details.aspx?id=49502

We are using the FactResellerSales table and will do an initial sync to Azure SQL DW and then perform inserts and updates on the source dataset and see them properly be reflected.

Source Database Prep

Run the data prep script:
https://github.com/realAngryAnalytics/adf/blob/master/sqlscripts/delta_load_w_updates/sourcedatabaseprep.sql

Two main things are occurring here:

  • Create a new schema called [ANGRY] in the source database, create FactResellerSales table and copy original data here so that we do not disturb the original sample tables that you may want to use for something else later.
  • Alter table to add a “ModifiedDate” and initially populate with the original OrderDate

Must have Modified Dates on Delta Load tables

This is the one prerequisite you will have on your data warehouse. Tables that you are unable to truncate and reload are going to be Fact/Transactional type tables and there should usually be a Modified Date concept. But sometimes there is not and that is an effort that will need to be made to make this work.

Destination Database Prep

The syntax in the below data prep script is for Azure SQL DW. There are some SQL differences from a standard SQL Server. Keep that in mind.

Run the data prep script:
https://github.com/realAngryAnalytics/adf/blob/master/sqlscripts/delta_load_w_updates/destdatabaseprep.sql

There is a lot going on in this script and worth reviewing. The basic stuff is as follows:

  • Creating new [ANGRY] schema (as done in source also)
  • Creating watermark table and update_watermark stored procedure (explained in detail in the base delta load template: https://docs.microsoft.com/en-us/azure/data-factory/solution-template-delta-copy-with-control-table)
  • Populate watermark table with initial date to use (1/1/1900 will mean our first pipeline run will replicate all data from FactResellerSales source to the destination SQL DW).
  • Create FactResellerSales table (alter with ModifiedDate)

How to handle updates

To perform updates most efficiently you should have a Staging table for every table that will be doing delta loads. It will have an identical DDL to the actual table (FactResellerSales) except it will be a HEAP table

There will also be a delta control table that will contain the primary key columns that will determine uniqueness of a record for each table. My implementation handles up to 6 key columns. If you have more, you will have to modify the table and stored procedure.
Note: Being able to handle a multi column unique constraint is very important to handle source DWs such as Teradata.

/* This is specific to the concept of being able to handle delta loads 
	that may contain updates to previously loaded data */
CREATE TABLE [ANGRY].[deltacontroltable](
	[TableName] [varchar](255) NULL,
	[WatermarkColumn] [varchar](255) NULL,
	[KeyColumn1] [varchar](255) NULL,
	[KeyColumn2] [varchar](255) NULL,
	[KeyColumn3] [varchar](255) NULL,
	[KeyColumn4] [varchar](255) NULL,
	[KeyColumn5] [varchar](255) NULL,
	[KeyColumn6] [varchar](255) NULL
) 
WITH  
  (   
    CLUSTERED INDEX (TableName)  
  ); 
  GO

  
/* Insert into deltacontroltable the metadata about the destination table and the key fields that are needed to 
	get a unique record. For FactResellerSales this is SalesOrderNumber and SalesOrderLineNumber */
INSERT INTO ANGRY.deltacontroltable
values ('ANGRY.FactResellerSales','ModifiedDate','SalesOrderNumber','SalesOrderLineNumber',null,null,null,null);

Our FactResellerSales table requires two keys to determine uniqueness: “SalesOrderNumber” and “SalesOrderLineNumber”. The insert statement is included in the code above.

Non used key columns are left null. I also have a WatermarkColumn in my implementation however I did not determine a need for it so it could be omitted.

The Magic Stored Procedure

The delta_load_w_updates¬†stored procedure (in the destdatabaseprep.sql script linked above)¬† uses the control table, sink table and staging table to build a dynamic sql that will delete records from the sink data source if they exist, and then insert all records from the staging table in bulk. Note that an example of the generated SQL below doesn’t use an INNER JOIN directly in the DELETE as SQL DW will not support it.

/* Example output of the stored procedure */
DELETE ANGRY.FactResellerSales 
WHERE  CAST(SalesOrderNumber as varchar(255)) + '|' +  CAST(SalesOrderLineNumber as varchar(255)) 
IN (SELECT  CAST(t2.SalesOrderNumber as varchar(255)) + '|' +  CAST(t2.SalesOrderLineNumber as varchar(255)) 
		FROM ANGRY.FactResellerSales_Staging t2 
		INNER JOIN ANGRY.FactResellerSales t1 ON  t2.SalesOrderNumber = t1.SalesOrderNumber AND  t2.SalesOrderLineNumber = t1.SalesOrderLineNumber)

With a stored procedure that does the above DELETE and INSERT based on records that already exist and INSERT all the data that has a modified date greater than the last watermark, we can now move on to the ADF pipeline nuances to consider between the base delta load template and this new version that will handle updates.

Deploy the Data Factory

First we will deploy the data factory and then we will review it.

In the Azure Portal (https://portal.azure.com), create a new Azure Data Factory V2 resource. I named mine “angryadf”. Remember the name you give yours as the below deployment will create assets (connections, datasets, and the pipeline) in that ADF.

The button below will deploy the data factory assets including the delta_load_w_updates pipeline. Note: it will fail if you haven’t created the data factory yet



This is a template that requires three connection strings:

  • Blob storage account connection string – will be used for staging the load to SQL DW. This can be found in your “Keys” dialog of your blob storage account
  • Source azure sql database connection string – will be the source. Found in “settings -> Connection Strings” dialog. Using SQL Authentication is probably easiest. Replace “User ID” and “Password” entries with your real values or the deployment will fail.
    • Even if you choose to use Teradata or Netezza or SQL on prem for the source, go ahead and use an azure sql database so that the template deploys, then you can modify the source. (it will cost you $5 a month)
  • Destination azure sql data warehouse connection string – will be the destination. Just like Azure SQL DB, it is found under “settings -> Connection Strings” dialog. Don’t forget to change your User ID and Password entries

Or from https://adf.azure.com you can import the azuredeploy.json file (ARM Template) from the root of my github https://github.com/realAngryAnalytics/adf/blob/master/azuredeploy.json

Review the Pipeline

This pipeline is based on the original delta load template so I will only review the main differences I have implemented.

Note: In the below screenshots there is liberal use of parameters and if you are not familiar with the “Add Dynamic Content” feature it may look more difficult than it is. Using “Add Dynamic Content” allows you to construct expressions visually. See more about expressions here.

Using Parameters to pass into the pipeline is how you can keep it generic to work with any delta load. I have added defaults to all of them to work with this example and one additional parameter has been added for the destination staging table:

In the copy activity, instead of inserting directly to the same table in the sink, it is inserting into the staging table.

Note: As you can see from the above screen shot, I also prefer to parameterize the table name for the source and sink dataset objects. ADF has a bad habit of creating individual datasets for every table which gets out of control. The use of a TableName parameter allows you to create one source dataset and one sink dataset. 

Also on the copy activity, there is a pre-copy script action that is going to truncate the staging table before every run. This means that after each run the records are left in the staging table (presumably for a day) to investigate for any errors until the next time the pipeline runs and truncates the rows before inserting the next day’s records.

Lastly the stored procedure activity I have changed from the original that just updated the watermark table to the delta_load_w_updates procedure we created above. This procedure does the work to move from staging to the main table and then updates the watermark table inline.

Initial Run (initial load of all data to SQL DW)

In the pipeline, ensure it is valid by clicking “Validate” check box and after fixing any errors click “Debug”.

Note: Default values for all pipeline parameters are entered. If you changed the schema name or used a different table, these will have to be modified.

As you are debugging you will get an output for each activity. In the “Actions” column you can retrieve the inputs, outputs, and look at the details of the copy activity by clicking on the glasses icon.

Details of the copy activity:

You can run the below SQL statements to see that all rows have been copied to the staging table and then to the main table and that the watermark table has been updated with the MAX ModifiedDate in the source (which happens to be 11/29/2013 for my old AdventureWorks dataset)

   select count(*) from [Angry].[FactResellerSales] 
   select count(*) from [Angry].[FactResellerSales_Staging] 
   select * from [Angry].[watermarktable]

Insert and Modify records in the source

Run the below script to create a very basic adventure works FactResellerSales data generator that takes the first input as number of inserted records and second input as number of updated records you want to use.
https://github.com/realAngryAnalytics/adf/blob/master/sqlscripts/delta_load_w_updates/advworks_datagenerator.sql

So let’s synthetically create 1000 new records and 500 modified records against the source dataset.

exec ANGRY.sp_generate_advworks_data 1000, 500

Verify the creation of new and modified records in the source dataset.

/* First query retrieves the inserts, second query retrieves the updates, third query should be total */
select count(*) from ANGRY.FactResellerSales where OrderDateKey = 0; 
select count(*) from ANGRY.FactResellerSales where OrderQuantity = 42;
select count(*) from Angry.FactResellerSales where ModifiedDate = (select MAX(ModifiedDate) from ANGRY.FactResellerSales);

Note: The purpose of this example is to demonstrate capability, not to have realistic data, so all inserted records have zeros for nearly all the integer values and the updates simply updated the OrderQuantity to 42.

Run the pipeline again. Note you can review the outputs of the first two lookup activities by clicking on the actions icon below.

Below, the first value is last watermark (retrieved from the watermark table) and the second value is the current watermark (retrieved from the max value in the source table)

{ "firstRow": 
    { "TableName": "ANGRY.FactResellerSales", 
      "WatermarkValue": "2013-11-29T00:00:00Z" }, 
  "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US 2)" }

{ "firstRow": 
    { "NewWatermarkValue": "2019-03-26T19:56:59.29Z" },
  "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US)" }

The copy activity is going to retrieve everything from the source table between these two dates based on the source SQL that is being constructed in the “Query” field.

Reviewing the copy activity details, you should see 1500 records

Run the same queries you ran against the source table now against the staging table to see the inserted and modified records.

Unfortunately there is not a pretty view of what happened in the stored procedure activity. All we can see are the inputs and outputs as shown below:

{ "storedProcedureName": "[ANGRY].[delta_load_w_updates]", 
  "storedProcedureParameters": 
           { "LastModifiedDatetime": 
               { "value": "2019-03-26T19:56:59.29Z", 
                 "type": "DateTime" },
             "StagingTableName": 
               { "value": "ANGRY.FactResellerSales_Staging", "type": "String" }, 
             "TableName": { "value": "ANGRY.FactResellerSales", "type": "String" } 
            } 
}

{ "effectiveIntegrationRuntime": "DefaultIntegrationRuntime (East US 2)", 
   "executionDuration": 4239 }

To implement this in production you will probably want to harden the stored procedure by adding transaction/rollback and some validation output.

Nonetheless, you should see in your destination table (by running the same queries as above) that you have successfully inserted 1000 records and updated 500 existing records.

Congratulations, you now have SQL code and an ADF pipeline that handles the hardest part of syncing a data warehouse to the cloud which is dealing with updates.

How does this change if source is Teradata or Netezza?

This example used a SQL source, however not too much should change with Teradata or Netezza. Double check the query syntax for your source, but both the “LookupCurrentWatermark” activity as well as the “DeltaCopyFromDB” activity should both be standard T-SQL, but nonetheless, worth checking if you have errors.

One thing to triple make sure however is that in the “DeltaCopyFromDB” activity under the sink tab that “Use Type default” is NOT checked.

In SQL Server, this doesn’t seem to matter but when copying null values for numeric fields from Teradata, this will cause polybase to error out with an “empty string cannot be converted to a decimal” error.

Expanding to 100’s of tables

So we have only done this for one table yet I say that with this pipeline and two others I can sink an entire data warehouse. If you are familiar with ADF (which you have to be if you endured this entire article) than you know you can create Triggers with different input parameters to handle different tables. For each new table you want to do delta loads it will require an entry in the deltacontroltable as well as a new staging table. This is simply the cost of doing business to achieve cloud success. This is fairly straight forward data entry work or through a couple days of scripting you can automate it.

One other thing to note, if you are sinking a data warehouse, remember that the source DW is downstream from existing ETL processes. Now we are adding on additional processing that needs to be completed at the end of your existing nightly loads. So even though you could create 100s of scheduled triggers in ADF with the different parameters required for each delta load, more realistically you are going to want to kick off the delta load pipeline as an event as soon as that table is loaded into the source DW. This can be achieved with the ADF APIs

https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-rest-api#create-pipeline-run

Next in Part 2

Handling updates in ADF was the part of this I really wanted to cover as it seems to be what everyone struggles with. However, there are two other pipelines that are needed to complete the synchronization scenario, so I will cover those in a little more depth based on my experience. If I would have done this in logical order I would have started with the bulk load but I am not logical.

I also want to discuss how to pull forward all of your legacy integrations with your on prem DW.

Dealing with Budget vs Actuals Closed Dates

This post is an extremely practical one targeted at business analysts. Most of you have a scenario that includes a general ledger and monthly budgets that have corresponding actuals. Often there are actuals in the system that you don’t want to include because they haven’t “closed” yet.

If you are somewhat new to Power BI, you are probably handling this by manually setting filters in your Power BI report and having to adjust them each month. When i see report authors try to automate this i typically see a single measure including a FILTER expression to account for ONLY closed actuals. The problem with this approach is that you end up writing this FILTER expression many times over if there are many measures that need to observe closed vs unclosed actuals. And this approach also doesn’t allow for adjusting axis on bar charts displaying your budget and actuals values.

In this blog post I will show an alternative option that is pretty easy and allows it to be applied to any measure that you may include in your model. For the PBIX file referenced below, you can find it on my GitHub

In the report below, notice that my main “% Variance to Budget” measure on the left side is including unwanted actuals from 2016, a few months in 2017 and 2018. I am using the same “% Variance to Budget” measure on the right side but is only including months that have a budget defined (2017 only) as well as ONLY months that actuals have closed (January through October excluding November and December).

The two visuals on the right are utilizing a calculated date table based on our “closed actual dates” instead of the full date dimension. Lets review the simple data model being utilized in this example.

I have a budget table that includes the status if that month has been “closed” or not.

Date Budget Status
01/01/2017 300 Closed
02/01/2017 500 Closed
03/01/2017 500 Closed
04/01/2017 400 Closed
05/01/2017 400 Closed
06/01/2017 500 Closed
07/01/2017 500 Closed
08/01/2017 400 Closed
09/01/2017 400 Closed
10/01/2017 400 Closed
11/01/2017 400 Open
12/01/2017 500 Open

 

Notice that it does not include any dates from 2016 or 2018 but also shows the status of November and December as “Open”.

Now in my actuals table, you can see that the dates include 2016, November and December of 2017, and 2018.

Date Actuals
01/01/2016 800
03/01/2016 100
09/12/2016 4000
11/22/2016 250
01/04/2017 100
01/28/2017 300
02/01/2017 500
03/12/2017 200
04/05/2017 400
04/22/2017 100
05/02/2017 300
05/28/2017 100
06/29/2017 500
07/01/2017 100
07/04/2017 800
07/23/2017 200
08/09/2017 400
09/21/2017 100
09/23/2017 500
10/11/2017 300
10/20/2017 100
10/31/2017 250
11/12/2017 100
11/21/2017 200
12/01/2017 500
12/22/2017 2000
12/30/2017 100
01/02/2018 200
02/02/2018 1000

I have pulled both of these tables into my data model and have also included a Date dimension table that i typically use. However,¬† have also created a “calculated table” from the “New Table” icon on the modeling tab that is going to reflect our “closed actuals dates”.

The formula for this calculated table is below:

ClosedDates = CALENDAR(
    MIN(Budget[Date]),
    CALCULATE(EOMONTH(MAX(Budget[Date]),0),Budget[Status]="Closed"))

 

The CALENDAR() DAX function expects two dates that it will use as the beginning of the date range and the end of the date range that will be produced. All dates between these two dates will be produced as a table.

The “Start Date” is pretty easy as we will use the minimum date from the budget table “Budget[Date]”. The “End Date” however is a bit more complex. A Calculate statement is used so that we can apply a FILTER to the table to ensure the Budget[Status] is “Closed”. This is the expression is the second parameter to the CALCULATE function. The first parameter is doing two things:

  • First it is getting the MAX Budget Date which in our budget table after the filter for status is applied would be 10/1/2017. The problem is that our budget is monthly while our actuals are daily. The monthly budget dates are being stored as the first date of the month.
  • So, the second function being applied is EOMONTH which stands for “End of Month”. This will give the last date in the month based on a date value in its first parameter and an offset (how many months from the date that is given) in which our case we want to use the same month so the offset is zero.

 

Now we have a calculated table that includes ONLY dates we want to use when comparing actuals vs budget. In the data model, lets connect this to the Date Dimension table as that is the table that would likely hold the relationship to the Actuals and Budget table.

You can see from my model i did add two additional calculated fields to the ClosedDates table for Month and Year. Those formulas are below:

Month = Month(ClosedDates[Date])

Year = Year(ClosedDates[Date])

 

To test this, i used a percentage difference formula that i created from a “Quick Measure” by clicking the “…” next to a table you want to create the measure in.

I named mine “% variance to budget” which automatically produced the following DAX expression:

% Variance to Budget = 
VAR __BASELINE_VALUE = SUM('Budget'[Budget])
VAR __VALUE_TO_COMPARE = SUM('Actuals'[Actuals])
RETURN
 IF(
  NOT ISBLANK(__VALUE_TO_COMPARE),
  DIVIDE(__VALUE_TO_COMPARE - __BASELINE_VALUE, __BASELINE_VALUE)
 )

 

Now that the measure is complete, add a card visual displaying it.

Initially you will get a number that contains all actuals being used as the comparison value to the budget, but by simply dragging into the FILTERS pane your ClosedDates[Date] field as a filter in the visual or even at the page level (if everything on this page is comparing actuals against budget) equal to “is not blank” it will ONLY show actuals for dates that are closed and a budget is defined.

Now your measure is only showing variance based on the dates that exist in the Closed Dates calculated table

Because we created a calculated table, it can also be utilized in a bar chart showing our “actuals” and “budget” fields without having to wrap them in measures or manually filter out the time frames that don’t have budget associated or closed actuals.

 

Conclusion

Not the flashiest blog post but I hope can help solve a real practical problem for many business analysts trying to compare budgets vs actuals or even forecasts.

Power BI Audit Log Analytics Solution

As Power BI adoption in your organization grows, it becomes more and more important to be able to track the activity in the environment.

When you start to think about deploying a Power BI Audit Log solution that is repeatable there are a few challenges that you will face.

  • Going to the O365 Audit Logs portal each time you want to extract log events is a manual process and doesn’t scale
  • When automating this process through API or PowerShell, there is a limit to how much data you can pull, therefore examples that are currently available also don’t scale very well
  • The AuditData field is a JSON format by default and although Power BI can parse JSON beautifully, when doing this over several thousand record entries may result in data load errors

Based on these factors, i have put together a PowerShell script that can be scheduled on a nightly basis that can iterate MORE than 5000 records so that no data is lost. Also, the screenshot below is of an initial template that you can use to start with to analyze your audit logs for your organization.

TL;DR

.

  • The required files can be found on my GitHub
  • Update the PowerShell script with a UserID and Password that has O365 audit log privileges
  • Use Task Scheduler to schedule the PowerShell script to run each night at midnight (run as admin).
  • At the end of the script, specify the directory you would like the script to generate CSV files in
  • In the PBIX file, it was challenging to get a parameter to work for the file location that the CSVs are in, so in the Query Editor the script for the AuditLog table needs to be manually modified to include your file path.
  • Enjoy

Quick look at the PowerShell

First, there is a PowerShell script.

Set-ExecutionPolicy RemoteSigned

#This is better for scheduled jobs
$User = "<<enter o365 admin user email here>>"
$PWord = ConvertTo-SecureString -String "<<enter password here>>" -AsPlainText -Force
$UserCredential = New-Object -TypeName "System.Management.Automation.PSCredential" -ArgumentList $User, $PWord

#This will prompt the user for credential
#$UserCredential = Get-Credential


$Session = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri https://outlook.office365.com/powershell-liveid/ -Credential $UserCredential -Authentication Basic -AllowRedirection
Import-PSSession $Session



$startDate=(get-date).AddDays(-1)
$endDate=(get-date)
$scriptStart=(get-date)

$sessionName = (get-date -Format 'u')+'pbiauditlog'
# Reset user audit accumulator
$aggregateResults = @()
$i = 0 # Loop counter
Do { 
	$currentResults = Search-UnifiedAuditLog -StartDate $startDate -EndDate $enddate `
								-SessionId $sessionName -SessionCommand ReturnLargeSet -ResultSize 1000 -RecordType PowerBI
	if ($currentResults.Count -gt 0) {
		Write-Host ("  Finished {3} search #{1}, {2} records: {0} min" -f [math]::Round((New-TimeSpan -Start $scriptStart).TotalMinutes,4), $i, $currentResults.Count, $user.UserPrincipalName )
		# Accumulate the data
		$aggregateResults += $currentResults
		# No need to do another query if the # recs returned <1k - should save around 5-10 sec per user
		if ($currentResults.Count -lt 1000) {
			$currentResults = @()
		} else {
			$i++
		}
	}
} Until ($currentResults.Count -eq 0) # --- End of Session Search Loop --- #

$data=@()
foreach ($auditlogitem in $aggregateResults) {
    $datum = New-Object ‚ÄďTypeName PSObject
    $d=convertfrom-json $auditlogitem.AuditData
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName Id ‚ÄďValue $d.Id
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName CreationTime ‚ÄďValue $auditlogitem.CreationDate
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName CreationTimeUTC ‚ÄďValue $d.CreationTime
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName RecordType ‚ÄďValue $d.RecordType
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName Operation ‚ÄďValue $d.Operation
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName OrganizationId ‚ÄďValue $d.OrganizationId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName UserType ‚ÄďValue $d.UserType
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName UserKey ‚ÄďValue $d.UserKey
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName Workload ‚ÄďValue $d.Workload
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName UserId ‚ÄďValue $d.UserId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName ClientIP ‚ÄďValue $d.ClientIP
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName UserAgent ‚ÄďValue $d.UserAgent
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName Activity ‚ÄďValue $d.Activity
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName ItemName ‚ÄďValue $d.ItemName
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName WorkSpaceName ‚ÄďValue $d.WorkSpaceName
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName DashboardName ‚ÄďValue $d.DashboardName
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName DatasetName ‚ÄďValue $d.DatasetName
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName ReportName ‚ÄďValue $d.ReportName
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName WorkspaceId ‚ÄďValue $d.WorkspaceId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName ObjectId ‚ÄďValue $d.ObjectId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName DashboardId ‚ÄďValue $d.DashboardId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName DatasetId ‚ÄďValue $d.DatasetId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName ReportId ‚ÄďValue $d.ReportId
    $datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName OrgAppPermission ‚ÄďValue $d.OrgAppPermission

    #option to include the below JSON column however for large amounts of data it may be difficult for PBI to parse
    #$datum | Add-Member ‚ÄďMemberType NoteProperty ‚ÄďName Datasets ‚ÄďValue (ConvertTo-Json $d.Datasets)

    #below is a poorly constructed PowerShell statemnt to grab one of the entries and place in the DatasetName if any exist
    foreach ($dataset in $d.datasets) {
        $datum.DatasetName = $dataset.DatasetName
        $datum.DatasetId = $dataset.DatasetId
    }
    $data+=$datum
}

$datestring = $startDate.ToString("yyyyMMdd")
$fileName = ("c:PBIAuditLogs" + $datestring + ".csv")
Write-Host (" writing to file {0}" -f $fileName)
$data | Export-csv $fileName

Remove-PSSession -Id $Session.Id

 

  • Notice that you need to enter O365 audit log privileged credentials at the top so that this can be ran automatically. If you have more clever ways to pass these credentials in so they are not exposed in the file by all means, do that
  • The Do/Until loop handles if there are more than 5000 records in the result set which would easily be the case for a large Power BI community.
  • The foreach loop extracts the AuditData column JSON format and creates an individual record for each entry. This makes the Query Editor in Power BI less complex and easier to accomplish retrieving several hundred thousand records without import errors
  • finally we create a CSV for the data with the date of the file entries (yesterdays info if this is ran at midnight every day). This dumps each file in c:PBIAuditLogs. You can obviously change this file location to wherever you want to store your CSV extracts

You can use Task Scheduler to run the above PowerShell script every night at midnight.

The PBIX file

In the Power BI file, we are connecting to the content of the entire folder shown above. I went ahead and included the PBIX file WITH the sample data so you could get an idea of what your data may look like.

This is where i have to admit that i tried to use a parameter for this but ran into some Query Editor challenges with how Power BI creates a Sample File transform to import multiple files from a single folder. If you can see what i did wrong here I would love your feedback, but for now, you can ignore the file directory parameter in the Query Editor and need to go to “Advanced Editor” on the “AuditLog” query and modify the file location to be the location you are dumping files from the PowerShell script.

Change the below file location as needed.

Once you have made this change, you should be able to “Close and Apply” and your data will now be populated in this basic audit log analytics view.

Using the file

I created a couple basic pages to get this blog post shipped and so you can start taking advantage of the solution, but it is nowhere near as complete as you can eventually make it. I have a simple overview page that was screenshotted above. It can help you determine number of active users, reports, dashboards, and datasets being used for any time period your audit log data covers.

The second page is a user activity view i created from a calculated table. It helps you determine which users in the system may be inactive so you can re-assign power bi licenses and also shows detailed activity for an individual user that you can select from the slicer.

Other things you can mine from this data:

  • Who has signed up for Free Power BI trials
  • What “Apps” are being created
  • what embed tokens are being generated and used
  • many other possibilities

The PowerShell script and the PBIX file are located on my GitHub here

 

 

Regular Expressions will save your life!

I am closing out 2017 with a refreshing project that has led me away from Power BI for a bit. However, even for the Power BI community, I think the below information is valuable because at some point, you are going to run into a file that even the M language (Power BI Query Editor) is going to really have a hard time parsing.

For many of you, its still a flat file world where much of your data is being dropped via an FTP server and then you have a process that parses it and puts it in your data store. I recently was working with a file format that I have no idea why someone thought it was a good idea, but nonetheless, i am forced to parse the data. It looks like this:

display > e1
Site Name   : Chicago IL                  Seq Number     : 111
Mile Mrkr   : 304.40                      DB Index #     : 171

Direction   : South                       Arrival        : 00:02  09-22-2017
Speed In/Out: 33/18 MPH                   Departure      : 00:03:45
Slow Speed  : 38 MPH                      Approach Speed : 0 MPH
                                          Approach Length: ~0.0 Feet

Amb Temp    : 81 F                        Battery Voltage: 12.03

Axles       : 5                           Truck Length   : 56.0 Feet
Alarms      : 0                           Cars           : 1
Integ Fails : 0                           Gate A Cnt     : 1
System Warn : 0                           Gate B Cnt     : 1
Weight      : 72000
HBD Filter  : 13 Point Median Filter
Car   Axle   Weight   Ch1   Ch2
Num    Num   (LBS)    (F)   (F)   Alarms
-------------------------------------------------------------- Weight Units = LBS
  1      1    17000.0   N/A   N/A
         2    17000.0   N/A   N/A
         3    17000.0   N/A   N/A
         4    17000.0     0     0
         5    17000.0     0     0

This data simulates truck weigh-in station data. There is a lot of “header” information followed by some “line” items.

Just think for a moment how you would approach parsing this data? Even in Power BI, this would be extremely brittle if we are counting spaces and making assumptions on field names.

What if a system upgrade effecting the fields in the file is rolled out to the truck weigh stations over the course of several months? Slight changes to format, spacing, field names, etc… could all break your process.

 Regex to the Rescue

In my career as a developer, I never bothered to understand the value of regular expressions (regex). With this formatted file I now see that they can save my life (well, that may be dramatic, but they can at least save me from a very brittle pre-processing implementation)

For anyone unfamiliar with regex, a regular expression is simply a special text string for describing a search pattern. The problem is, they are extremely cryptic and scary looking and you want to immediately run away from them and find a more understandable way to solve your text string problem. For instance, a regular expression that would find a number (integer or decimal) in a long string of characters would be defined as

d+(.d*)?

What the heck is that?

The “d” represents any decimal digit in Unicode character category [Nd]. If you are only dealing with ASCII characters “[0-9]” would be the same thing. The “+” represents at least one instance of this pattern followed by (.d*) which identifies an explicit dot “.” followed by another d but this time with a “*” indicating that 0 to n instances of this section of the pattern unlike the first section that required at least 1 instance of a digit. Therefore this should result in true for both 18 as well as 18.12345. However, regex are greedy by default, meaning it expects the full pattern to be matched. So without adding the “?” to the end of the string, the above regex would NOT recognize 18 as a number. It would expect a decimal of some sort. Because we have included the “?” it will end the match pattern as long as the first part of the match was satisfied, therefore making 18 a valid number.

So, the regex was 11 characters and it took me a large paragraph to explain what is was doing. This is why they are under utilized for string processing. But if you are looking at it the other way, it only took 11 characters to represent this very descriptive pattern. Way cool in my book!

Regex language consistency

My example above was from Python. As i am a “data guy”, i find Python to have the most potential for meeting my needs. I grew up on C# and Java however so understanding regex may have some slight variations between languages. Some interesting links on this are below:

Stack Overflow: https://stackoverflow.com/questions/12739633/regex-standards-across-languages

language comparison on Wikipedia: https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines

Building a Parser using Regex

This file has all kinds of problems. Notice the value formats below:

Temperature: 81 F
Length: 56 Feet
Datetime: 00:02 09-22-2017
Time: 00:03:45
Speed: 33/18 MPH

In addition to text and numeric values, we also have to deal with these additional formats that should be treated as either numeric or datetime values.

I am going to use Python to parse this file and will use a “tokenizer” pattern discussed in the core Python documentation for the re (regex) library:¬†https://docs.python.org/3.4/library/re.html

This pattern will allow us to assign a “type” to each pattern that is matched so we do not have to count spaces and try to look for explicitly named values which could break with any slight modifications to the file.

Below is a function that returns a named tuple with values for the type, the value, the line, and the column it was found in the string.

import re
import collections

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(line):
    token_specification = [
        ('SPEED_IN_OUT',    r'(d+(.d*)?/d+(.d*)?s{1}MPH)'),  # speed with multiple values (ex. 15/10 MPH)
        ('SPEED',           r'(d+(.d*)?s{1}MPH)'),  # speed with one value (ex. 10 MPH)
        ('LENGTH',          r'(d+(.d*)?s{1}Feet)'),  # length in feet (ex. 10 Feet)
        ('TEMP',            r'(d+(.d*)?s{1}[F])'),  # Temperature in Fahrenheit (ex. 83 F)
        ('DATETIME',        r'(d+:(d+(:d)*)*)+s+(d+-d+-d+)'),  # Datetime value (ex. 00:00:00  12-12-2017)
        ('TIME',            r'(d+:(d+(:d)*)*)+'),  # time value only (ex. 00:02   or   ex.  00:02:02)  
        ('ID_W_NBR',        r'(d+(.d*)?s([/w]+s?)+)'),  # ID that is prefixed by a number    
        ('NUMBER',  r'd+(.d*)?'),  # Integer or decimal number    
        ('ID',      r'([/w]+s?)+'), # Identifiers
        ('ASSIGN',  r': '),           # Assignment operator
        ('NEWLINE', r'n'),           # Line endings
        ('SKIP',    r'[ t]+'),       # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

    line_num = 1
    line_start = 0
    for match in re.finditer(tok_regex, line):
        kind = match.lastgroup
        value = match.group(kind)
        if kind == 'NEWLINE':
            line_start = match.end()
            line_num += 1
        elif kind == 'SKIP':
            pass
        else:
            column = match.start() - line_start
            token = Token(kind, value.strip(), line_num, column)
            yield token

In my list of token specifications, i have included the most restrictive matches first. This is so that my value for “56.0 Feet” won’t be mistaken for “56.0 F” which would have it identified as a TEMP instead of LENGTH. (I should also be accounting for Celsius and Meters too but i am being lazy)

Let’s look a bit closer at a couple more of these regex.

¬†¬†¬†¬†¬†¬†¬† (‘ASSIGN’,¬† r‘: ‘),¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† # Assignment operator

The assign operator is very important as we are going to use each instance of this to identify a rule that the NEXT token value should be ASSIGNED to the previous token value. The “little r” before the string means a “raw string literal”. Regex are heavy with “” characters, using this notation avoids having to do an escape character for everyone of them.

¬†¬†¬†¬†¬†¬†¬† (‘DATETIME’,¬†¬†¬†¬†¬†¬†¬† r(d+:(d+(:d)*)*)+s+(d+-d+-d+)),¬† # Datetime value (ex. 00:00:00¬† 12-12-2017)

Datetime is taking the numeric pattern I explained in detail above but slightly changing the “.” to a “:”. In my file, i want both 00:00 and 00:00:00 to match the time portion of the pattern, so therefore I use a nested “*” (remember that means 0 to n occurrences). The + at the end of the first section means at least 1 occurrence of the time portion, therefore simply a date field will not match this datetime regex. Then the “s” represents single or multiple line spaces (remember that regex is greedy and will keep taking spaces unless ended with “?”). Then the last section for the date will take any integer values with two dashes (“-“) in between. This means 2017-01-01 or 01-01-2017 or even 2017-2017-2017 would match the Datetime date section. This may be something I should clean up later ūüôā

¬†¬†¬† tok_regex = ‘|’.join(‘(?P<%s>%s)’ % pair for pair in token_specification)

 

I wanted to just quickly point out how cool it is that Python then allows you to take the list of regex specifications and separate them with a “|” by doing the “|”.join() notation. This will result in the crazy looking regex below:

‘(?P<SPEED_IN_OUT>(\d+(\.\d*)?/\d+(\.\d*)?\s{1}MPH))|(?P<SPEED>(\d+(\.\d*)?\s{1}MPH))|(?P<LENGTH>(\d+(\.\d*)?\s{1}Feet))|(?P<TEMP>(\d+(\.\d*)?\s{1}[F]))|(?P<DATETIME>(\d+:(\d+(:\d)*)*)+\s+(\d+-\d+-\d+))|(?P<TIME>(\d+:(\d+(:\d)*)*)+)|(?P<ID_W_NBR>(\d+(\.\d*)?\s([/\w]+\s?)+))|(?P<NUMBER>\d+(\.\d*)?)|(?P<ID>([/\w]+\s?)+)|(?P<ASSIGN>: )|(?P<NEWLINE>\n)|(?P<SKIP>[ \t]+)’

Two important things were done here. We gave each specification the ?P<name> notation which allows us to reference a match group by name later in our code. Also, each token specification was wrapped with parenthesis and separated with “|”. The bar is like an OR operator and evaluates the regex from left to right to determine match, this is why i wanted to put the most restrictive patterns first in my list.

The rest of the code iterates through the line (or string) that was given to find matches in using the tok_regex expression and yields the token value that includes the kind (or type) of the match found and the value (represented as value.strip() to remove the whitespaces from beginning and end).

Evaluating the Output

Now that our parser is defined, lets process the formatted file above. We add some conditional logic to skip the first line and any lines that have a length of zero. We also stop processing whenever we no longer encounter lines with “:”. This effectively is processing all headers and we will save the line processing for another task.

lines = list(csv.reader(open('truck01.txt',mode='r'),delimiter='t'))

counter = 0
ls = []
for l in lines:

    if len(l)==0 or counter == 0:
        counter += 1
        continue

    str = l[0]
    index = str.find(":")
    if(index == -1 and counter != 0):
        break

    print(str)
    for tok in tokenize(l[0]):
        print(tok)

    counter += 1

The first few lines processed will result in the following output from the print statements (first the line, then each token in that line)

Site Name   : Chicago IL                  Seq Number     : 111
Token(typ=’ID’, value=’Site Name’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’ID’, value=’Chicago IL’, line=1, column=14)
Token(typ=’ID’, value=’Seq Number’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’NUMBER’, value=’111′, line=1, column=59)
Mile Mrkr   : 304.40                      DB Index #     : 171
Token(typ=’ID’, value=’Mile Mrkr’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’NUMBER’, value=’304.40′, line=1, column=14)
Token(typ=’ID’, value=’DB Index’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’NUMBER’, value=’171′, line=1, column=59)
Direction   : South                       Arrival        : 00:02  09-22-2017
Token(typ=’ID’, value=’Direction’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’ID’, value=’South’, line=1, column=14)
Token(typ=’ID’, value=’Arrival’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’DATETIME’, value=’00:02¬† 09-22-2017′, line=1, column=59)
Speed In/Out: 33/18 MPH                   Departure      : 00:03:45
Token(typ=’ID’, value=’Speed In/Out’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’SPEED_IN_OUT’, value=’33/18 MPH’, line=1, column=14)
Token(typ=’ID’, value=’Departure’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’TIME’, value=’00:03:45′, line=1, column=59)
Slow Speed  : 38 MPH                      Approach Speed : 0 MPH
Token(typ=’ID’, value=’Slow Speed’, line=1, column=0)
Token(typ=’ASSIGN’, value=’:’, line=1, column=12)
Token(typ=’SPEED’, value=’38 MPH’, line=1, column=14)
Token(typ=’ID’, value=’Approach Speed’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’SPEED’, value=’0 MPH’, line=1, column=59)
Approach Length: ~0.0 Feet
Token(typ=’ID’, value=’Approach Length’, line=1, column=42)
Token(typ=’ASSIGN’, value=’:’, line=1, column=57)
Token(typ=’LENGTH’, value=’0.0 Feet’, line=1, column=60)

Notice how everything is being parsed beautifully without having to do any counting of spaces or finding explicit header names. With being able to identify “SPEED”, “TIME”, “LENGTH”, we will also be able to write a function to change these to the proper type format and add a unit of measure column if needed.

The only assumptions we are going to make to process this header information are as below:

1. skip the first line
2. end processing when a non-empty line no longer has an assignment operator of “:”
3. pattern expected for each line is 0 to n occurrences of ID ASSIGN any_type

To handle #3 above, we add the below code to the end of the for loop shown above:

    dict = {}
    id = None
    assign_next_value = False
    for tok in tokenize(l[0]):
        print(tok)
        if tok.typ == "ASSIGN":
            assign_next_value = True
        elif assign_next_value:
            dict = {id:tok.value}
            print(dict)
            ls.append(dict)
            assign_next_value = False
            id = None
            dict = {}
        else:
            id = tok.value

If you follow the logic, we are just taking the string (each line of the file) and recording the value of the first token as the id, finding the assign operator “:”, and then recording the following token value as the value of a dictionary object. It then appends that dictionary to the “ls” list that was initialized in the first code snippet.

We could then format it as JSON by adding the below line of code after the for loop

import json
jsondata = json.dumps(ls,indent=2,seperators=(",",":"))

See output below, some additional formatting work needs to be done with this as well as pre-processing my numbers and date times to not be represented as strings, but that is not the focus of this blog post.

[
  {
    "Site Name":"Chicago IL"
  },
  {
    "Seq Number":"111"
  },
  {
    "Mile Mrkr":"304.40"
  },
  {
    "DB Index":"171"
  },
  {
    "Direction":"South"
  },
  {
    "Arrival":"00:02  09-22-2017"
  },
  {
    "Speed In/Out":"33/18 MPH"
  },
  {
    "Departure":"00:03:45"
  },
  {
    "Slow Speed":"38 MPH"
  },
  {
    "Approach Speed":"0 MPH"
  },
  {
    "Approach Length":"0.0 Feet"
  },
  {
    "Amb Temp":"81 F"
  },
  {
    "Battery Voltage":"12.03"
  },
  {
    "Axles":"5"
  },
  {
    "Truck Length":"56.0 Feet"
  },
  {
    "Alarms":"0"
  },
  {
    "Cars":"1"
  },
  {
    "Integ Fails":"0"
  },
  {
    "Gate A Cnt":"1"
  },
  {
    "System Warn":"0"
  },
  {
    "Gate B Cnt":"1"
  },
  {
    "Weight":"72000"
  },
  {
    "HBD Filter":"13 Point Median Filter"
  }
]

Now What?

I hope to do a continuation of this blog post and explore a server-less architecture of taking the file from the FTP server, immediately running this pre-processing, and dumping the JSON out to a stream ingestion engine. From there, we can do all sorts of cool things like publish real time data directly to Power BI, or into a Big Data store. This follows principles of “Kappa Architecture”, a simplification of “Lambda Architecture” where everything starts from a stream and the batch processing layer goes away.

There are multiple ways to implement this, but with Cloud computing, we have an opportunity to do this entire chain of events in a “server-less” environment meaning no virtual machines or even container scripts have to be maintained. So, lets cover this next time

Conclusion

Regex are super powerful. I ignored them for years and now I feel super smart and clever for finding a better solution to file processing than i would have originally implemented without regex.

The full Python code from above as well as the formatted file can be found on my GitHub here

 

 

 

Data Driven Subscriptions in Power BI

What was that? Did I just say you could create a data driven subscription in Power BI? Like old school SSRS reporting where i can determine that these 5 sales people get an email with the information only relevant to them while these other 5 people over here get information only relevant to them?

Yep, that is exactly what i said…

Really this is pretty basic stuff and although you will see that this technique is a bit limited in options (for instance, we can ONLY send the hyperlink to the report and not a PDF or attachment) it works nonetheless.

Microsoft Flow

I have chosen to do this via Microsoft Flow, but this technique could be achieved several ways. You can use the scheduler of your choice. Flow is a business friendly workflow engine and if you are already an Office 365 shop, it is likely you can use it as part of your current licensing. In this article we are going to use Flow to iterate through a subscription table stored in SQL Server twice a day and send emails with a hyperlink to the properly filtered report.

To learn more about Flow and how to evaluate it for FREE, visit https://preview.flow.microsoft.com/en-us/

The Power BI Report

In this blog post we will use the AdventureWorksDW database and key off of the “DimSalesTerritory” table to determine which sales country should be filtered in the subscription notification. The report has two pages showing sales information. The PBIX file is available on my GitHub repository if you don’t want to create your own.

When setting up the report, add DimSalesTerritory.SalesTerritoryCountry to the Report Level Filters. Adding to the Report Level will ensure that EVERY page of the report adheres to the filtered value that comes from the subscription (described below).


Enabling Data Driven Subscriptions

To achieve this we will use url query string parameters as described here when putting the hyperlink in the email notification.

To create a custom hyperlink for each email address (to make it truly data driven), I am using a SQL Server table to store the Sales Territory each employee should see in the report. Below is the DDL for the “subscriptions” table that I am adding to the AdventureWorksDW database. However, this could easily be a SharePoint list or some other data store that can be accessed from Microsoft Flow to drive the email subscriptions

 

CREATE TABLE [dbo].[subscriptions](
[id] [int] IDENTITY(1,1) NOT NULL,
[email] [varchar](50) NOT NULL,
[groupid] [varchar](50) NOT NULL,
[reportid] [varchar](50) NOT NULL,
[filtertable] [varchar](50) NULL,
[filtercolumn] [varchar](50) NULL,
[filteroperator] [varchar](12) NULL,
[filtervalue] [varchar](50) NULL,
[modifieddt] [datetime] NULL,
[modifieduser] [varbinary](50) NULL
)
GO

 

URL Formats in Power BI

The table above will contain the email address of the recipient as well as critical sections of the Power BI URL: the group id and the report id.

There are 3 different formats of the base url that for this example you will have to determine how your report is available to users in Power BI.

If the report is shared directly in your personal workspace, the format is below
https://app.powerbi.com/groups/me/reports/{reportid}/ReportSection

Notice that in this scenario, the group identified is “me”. This indicates the report is available in your personal workspace.

If the report is shared with you from an App Workspace (formerly known as Group Workspace), the format is below
https://app.powerbi.com/groups/{groupid}/reports/{reportid}/ReportSection

As described in my previous blog post Power BI Content Workflow – with Apps¬†, the best practice however is to distribute content via an “App” and therefore the format should render as below:
https://app.powerbi.com/groups/me/apps/{groupid}/reports/{reportid}/ReportSection

In this scenario, we are using the groupid value from the SQL Server data source as the App and you can see that the group is still specified as “me”

Query String

In the guidance referenced above for including a query string, all of the above formats would have “?filter=DimSalesTerritory/SalesTerritoryCountry eq ‘United States’” for instance appended to set the report filter to only shows sales related to the United States.

SQL Server Entries

In this example, there are two entries in my subscriptions table, one for steve@angryanalytics.com that will be sent a link via email for ‘United States’ sales and another for bob@angryanalytics.com that will be sent a link for ‘France’ sales.

Setting up Microsoft Flow

In Microsoft Flow, there are two types of activities: Actions and Triggers. We will create a “Schedule” trigger in a blank flow

From the Schedule (Recurrence) trigger, select the frequency to be Daily and the proper time zone from the advanced options. Then you can select multiple hours you want the subscription to be ran

To add an additional action, use the New Step button

After that select the “SQL Server – Get Rows” action and find your “subscription” table. Note: You will need to connect to your database first if you have never connected before via Microsoft Flow.

 

Magic

And this is where the magic happens if you have never used flow before… add the “Send an email” action. Flow detects the previous action and sees that there is a “Get Rows” that has been performed. It will first create a loop (Apply to each) for every row in the table and then it will show the columns available from the selected rows so that they can be used in the “Send an email” action.

The email address can be used for the “To” location. I created a static subject for “Your Power BI report is ready!”. Click “Show advanced options” to expand the selection and make sure to set “Is HTML” to “Yes”.

Finally in the “Body” use an html tag <a href=”…”>Click Me!</a> or something more elegant of course. In the href attribute string together the base url (in my case, “https://app.powerbi.com/groups/me/apps/&#8221;) and append the rest of the variable data from the subscriptions table to make dynamic data driven link.

Now Save your flow and you can click “Run Now” to test it (this will run it now regardless of the trigger schedule you created).

Result

As expected I receive an email for both steve@angryanalytics.com and bob@angryanalytics.com

Because when we setup the Power BI report, we used a Page level filter, both pages of the report will auto filter to the expected sales territory.

 

Now, every day at the times selected for the schedule 100s of people can get the filtered report link that would show them the most up to date information after data refreshes have occurred.

Other Actions

In this example, I used a simple email, however, you could have selected Twilio to use for text messaging notifications or something like SendGrid to create highly polished email content that looks much better than a simple url link. More complex workflows could also be performed by created conditional logic or more complex filters.

Conclusion

As said, this is pretty basic stuff… but very powerful.

 

 

 

Working with Scatter Plots in Power BI

I really like some of the advancements that have been made in Power BI scatter plots over the last few months. I wanted to point out some capabilities you may not be using that maybe you should be.

Data Sampling Improvements

In the September 2017 release, you can now be confident that all of your outliers are being shown. No one can visually look at a plot and interpret several thousand data points at once, but you can interpret which of those points may be outliers. I decided to test this out myself between a Python scatter plot of 50k data points and Power BI.

In the test, I used a randomly generated normal distribution of 50k data points to ensure I had some outliers.

#Create a random dataset that has a normal distribution and then sort it (in this case, 50000 data points)
x = np.random.normal(50,25,50000)
x = np.sort(x)

#Create another dataset to put on the y axis of the scatter plot
y = np.random.normal(50,25,50000)

#plot the dataset
plt.scatter(x,y,alpha=0.3)
plt.show()

(You can see the Python notebook on my GitHub here).

Here it is in Python:

Here it is in Power BI (September desktop release)

Notice that all the outliers have been preserved. Note that in previous releases, the Power BI rendering of this would have been shown as below.

This is a great improvement. To learn more about this update, check out the official blog post on high density sampling: https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-high-density-scatter-charts/

Working with Outliers (Grouping)

Now that we know the dense sampling is preserving our outliers, we can perform some analysis on them. Power BI makes it easy to CTRL+click on multiple outliers and then right-click and add new Group

This will create a new field in your fields list for this group of outliers and will automatically include a Group for “Other” (the other 49.993 data points that weren’t selected). Note that I renamed my field to “High Performers”

As this is a random dataset with integers for x,y values there are no dimensions here that may be interesting to compare, but consider now we can always come back to this grouping for further analysis such as the bar chart below:

Clustering

You can also use “…” in the upper right of the scatter chart to automatically detect clusters. Our example again is a bit uninteresting due to it being a random normal distribution but gives you an idea of how you can cluster data that is more meaningful.

Symmetry Shading and Ratio Lines

These gems were released in the August 2017 desktop release and really helps visualize the skew of your data.

Both of these can be turned on from the analytics tab.

Instead of using our sample dataset above I will use the dataset from my last blog post on Scorecards and Heatmaps.

In the below plot I took the SalesAmount field and plotted on the y axis against the SalesAmountQuota field on the x axis. From Symmetry shading we can observe that none of our sales people are meeting their quota. From the ratio line we can see the few individuals that have a positive variance to the ratio while most are flat or below the ratio.

You can read more about these two features in the August Desktop blog post: https://powerbi.microsoft.com/en-us/blog/power-bi-desktop-august-2017-feature-summary/#symmetryShading

Conclusion

These are just a few of the recently released features that I think have made the native scatter chart in Power BI a very useful visual. I have posted the PBIX file for the normal distribution data on my GitHub if you would like to download: https://github.com/realAngryAnalytics/angryanalyticsblog/tree/master/20171002-scatterplots