Deep Learning Toolkit considerations for emerging data scientists

Overview

Disclaimer: This blog is my own opinion and not that of my employer, however it should be noted that I am a Microsoft employee and this may reflect that perspective.

Update: New version of these benchmarks is being worked on and can be tracked here: http://dlbench.comp.hkbu.edu.hk/

This post is a departure from my usual focus on Power BI. Enterprise deployment scenarios for Power BI have been a great subject for me. However, In my day job, I do work on a variety of data platform related subjects. These are my findings on deep learning toolkits and what you should know before getting too deep into them, especially pay attention to my section on Keras

Deep learning is popular for image processing (computer vision, facial recognition, emotion detection), natural language processing (sentiment analysis, translation), and even starting to find its way into areas such as customer churn. Neural networks with many layers are used to increase precision of a prediction as opposed to more statistical type algorithms such as linear regression.

There are several popular open source deep learning toolkits including Caffe, Torch, TensorFlow, CNTK (now Cognitive toolkit), and mxnet

This post will mostly reference TensorFlow and CNTK for reasons established in the section on Keras.

Python vs R

This debate will rage on for probably another decade similar to how I remember the Java vs C# debate as a developer in the early 2000’s. From what I have seen, Python appears to have more support in the area of deep learning than R. All but Torch support Python integration while only TensorFlow and mxnet support R directly.

Toolkit Performance

One of the most important aspects of a deep learning toolkit is performance.

Lets consider a couple of scenarios:

In the software development cycle a poorly indexed table could be the difference in 5 seconds and 5 minutes to call the database. This is annoying but is not the critical path to meeting a deadline. It takes many developer hours and iterations to build the code around that database call making that index issue less significant, but of course something that should be addressed.

In deep learning on the other hand, the difference between a model that performs twice as fast as another toolkit could mean the difference between 1 vs 2 days of training time. The iteration cycle is greatly impacted and retraining a model 5 times could be the difference between 1 week and 2 weeks to deliver results. This is significant!

Benchmarking Performance among leading toolkits

Benchmarking State-of-the-Art Deep Learning Software Tools is an academic journal (latest revision February 2017) comparing the most popular deep learning toolkits for CNN, FCN, and LTSM. These acronyms are neural network types you will want to familiarize yourself with if you are not already. There is a new eDX course that is just starting that you can learn all about these concepts.

Below i have included some links that may be to other frameworks but the content explanations seemed more easily understood

Convolutional Neural Network (CNN) – used primarily for image processing. Popular implementations include:
  • AlexNet – an 8 layer CNN circa 2012 that cut error rate nearly in half from previous versions
  • ResNet-50/101/152/etc – A deep residual learning network with 50/101/152/etc layers respectively circa 2015 achieving an error rate of 3.57% which was 4 times improvement from AlexNet

Fully Convolutional Neural Nework (FCN) – variation of CNN that doesn’t include the fully connected layer

Recurrent Neural Network (RNN) & Long Short Term Memory (LTSM) – widely used for natural language processing

This paper is extremely thorough and as our instincts are to scroll immediately to page 7 to start interpreting the bar charts, it is important to note how they ran these tests and gathered the results as described in pages 1-6.

One summary table that doesn’t fully represent all results is shown below.

Shaohuai Shi, Qiang Wang, Pengfei Xu, Xiaowen Chu, “BenchmarkingState-of-the-ArtDeepLearningSoftwareTool”

As you interpret these results, as well as the rest of them in the journal, you will notice three glaring observations

  • There is not one toolkit that has best performance across all neural network types. In fact, there can be wide variation in performance rank for a single toolkit based on # of CPUs or # of GPUs used.
  • Google TensorFlow is arguably the most popular of all of these toolkits, yet the results published in this paper other than in a few cases show it is quite average if not consistently poorer performing than others.
  • CNTK is orders of magnitude better than all of the competition in LTSM

Note on Google TensorFlow

CNTK performs better overall and by orders of magnitude in some cases to TensorFlow. As emerging data scientists start to pick toolkits for deep learning, TensorFlow seems to be a popular choice. In many cases, it will have desirable performance, but to put “all your eggs in one basket” so to speak, may not be the best approach here.

I actually am a fan of TensorFlow and picking a toolkit on performance alone would also not be wise. TensorFlow has some neat features one being TensorBoard that helps visualize the execution graph (note that CNTK also supports TensorBoard). Google has also recently introduced a dedicated TensorFlow processor (TPU) when running on their cloud platform that will surely speed up processing time. But if you are doing NLP (natural language processing), it is quite obvious you would want to use CNTK for performance reasons…

What is an emerging data scientist to do?

This is where Keras comes in…

Keras

Keras is an abstraction layer that allows you to run the same code on top of both TensorFlow and CNTK (as well as Theano, another deep learning toolkit) as the backend.

For Big Data people, I would make a correlation between Keras and the use of HIVE as an abstraction layer for Map/Reduce. It is rare to actually write Map/Reduce code anymore with the evolution of libraries around big data, and that is what Keras reminds me of compared to actually writing TensorFlow (or CNTK) code. For instance, TensorFlow on its own actually requires you to write the formula for Mean Squared Error to pass into the model. Although trivial, this is totally annoying and the use of Keras builds a lot of shortcuts for us that makes life much easier and reduces code often by 50%.

In the keras.json file that is created during installation, the backend can be configured by changing one line between “tensorflow” and “cntk”

{
    "floatx": "float32",
    "epsilon": 1e-07,
    "backend": "cntk",
    "image_data_format": "channels_last"
}

to verify the backend that is being used, from python simply enter

import backend from keras

or at anytime you can access the _BACKEND variable from the same library to see the result

These details are all described clearly on the keras.io site referenced above.

…and for all of the R users, there is a nice CRAN package available too:
https://cran.r-project.org/web/packages/kerasR/vignettes/introduction.html

from my somewhat limited experience, I can say that using Keras on top of TensorFlow or CNTK keeps me from pulling my hair out. Kudos to the creators and contributors to this library. Maybe we can dive deeper into this in a future post.

Transfer Learning

Transfer learning is the ability to take a preexisting model and use it as the base for another model. This allows you to take for instance a model that has classified millions of images and has trained for possibly weeks and apply it to new images that are more specific to your scenario. This allows for more rapid model development if you can build on preexisting work.

CNTK has a really nice tutorial on this technique here:
https://docs.microsoft.com/en-us/cognitive-toolkit/build-your-own-image-classifier-using-transfer-learning

TensorFlow also has it’s own “Inception” library that can be transferred.

This concept is the basis for the next section of “Deep Learning as a Service”

Deep Learning as a Service

I don’t believe this has actually become a term yet. I am just making it up as I go here 🙂

The concept of transfer learning opens some new capabilities to more easily apply your own scenarios to previously trained models.

Microsoft has developed a few interesting services that make deep learning very accessible to end users

One is the Custom Vision Service: https://www.customvision.ai/

This allows you to bring your own images to train on and allows you to reinforce in an iterative approach

Another is Q&A Maker: https://qnamaker.ai/

this allows you to build a bot in minutes to scroll through FAQ and document content on a subject that is important to your organization. This bot can then interact in an intelligent way without having to use a deep learning toolkit or a bunch of coding.

I did one using the Power BI FAQ pages and it worked really well

What is interesting about these services is that it is actually training a model on YOUR data. Not simply tapping into a pre-existing model. You are able to influence the results.

I believe we will continue to see many more services pop up like this that will continue to “democratize” AI for the masses

Conclusion

I would never claim to be a data scientist, but many of us are doing more and more data science like activities. For a person moving from data and business intelligence into machine learning and artificial intelligence, I feel like the above content would have saved me a lot of time. There is plenty of getting started content out there so start using your google/bing search skills to get deeper into it.

Power BI Routing Visual with Two Lines of R

Objective

Although the out of box Bing and ESRI maps in Power BI can visualize most business requirements, trying to visualize routes or shipping lanes for the transportation industry can be challenging. Fortunately in Power BI we can expand our visualization palette by using either custom visuals or R visuals.

In this blog post we will look at publicly available Flight Data and determine routes that have the highest likelihood of cancellation. This could easily be translated to shipping or transportation scenarios.

You can get the final solution from my github repository or you can download the “Airports” data set from http://openflights.org/data.html and the “Flights” data set from Engima.IO

NOTE: Engima.IO is a repository for public datasets. It is free but requires you to create a login to use it. I enjoy working with the enigma-us.gov.dot.rita.trans-stats.on-time-performance.2009 as it is rather large at 2.4 GB.

Although the above visual only requires two lines of R code to be written, there are two hurdles to get over first: Ensuring R and the required libraries are installed on your desktop, and doing the data prep in the query editor to create the R data frame in the format that is expected.

Installing R and ggmap

There is already well documented guidance on installing R to be used with Power BI on the Power BI Documentation site. Once this installation has been complete, we need to get ggmap and other supporting R libraries installed as well.

I prefer going to the RGui command line (just search for “RGui”) and perform the following command:

install.packages("ggmap")

Doing this in the R console will automatically download any dependent packages as well. If you performed this line in a Power BI visual directly it would not install the other required packages and you will get an error when you run the solution.

Data Prep

In Power BI, lets first bring in the airports data CSV file we downloaded from http://openflights.org/data.html. The important columns in this data set are the 3 letter airport code and the latitude and longitude of the airport. You can include the other fields for more detail as I am showing below, however they are not necessary for us to achieve the R visual above.

Next import the flight data that was downloaded from Engima.IO for 2009. This data is extremely wide and a lot of interesting data points exist, however we can simply remove a large portion of the columns that we will not be using. Scroll to the right Shift+click and right click to Remove Columns that start with “div”. Alternatively you can use the “Choose Columns” dialog to un-select.

To reduce the number of rows we will work with, filter on the flightdate column to only retrieve the last 3 months.

Shaping the Data

We now need to shape the data for the R visual. To only require 2 lines of R, the data has to be in the following format

index Direction Latitude Longitude
1 Origin 41.97 -87.9
1 Destination 33.63 -84.42
2 Origin 41.73 -71.43
2 Destination 33.63 -84.42
3 Origin 35.21 -80.94
3 Destination 33.63 -84.42

We will include other columns, however the format of alternating rows for the origination of the route and then the destination with the latitude and longitude for each is required. Having an Index row that keeps the right origin and destination values ordered appropriately will also help Power BI from making adjustments that you don’t want.

The Double Merge

Latitude and Longitude are not included in the flight data so we need to do a MERGE from the airport data set that we have.

NOTE: Finding lat/long values for zip codes, airport codes, city names, etc… is generally the hardest part of using ggmap and is why most of the time the use of a second reference data set is required to get this data

Perform a “Merge Queries” against the “origin” column of the flights data and the “Code” column (this is the 3 letter airport code from the airports data set)

Rename the newly created column as “Origin” and then click the “Expand” button to select the Latitude and Longitude columns ONLY to be merged.

Now repeat the above Merge steps a second time but instead of using the “origin” use the “dest” column from the flights data and merge against the same 3 digit code in the airports data. Call the new column “Dest” before expanding the Latitude and Longitude.

Once finished your dataset should look like this:

We have successfully acquired latitude and longitude which ggmap needs to plot the path of the line. However, we need the values of Lat/Long for origin and destination to be on separate rows as shown above. This is where we get to use my favorite M/PowerQuery function of “Unpivot”

Using Unpivot

To ensure later that the order of our rows are not out of sync when creating the dataframe for the R visual, add an index column via the “Index Column” button on the “Add Column” tab. I start at 1.

We need to have two columns to unpivot on to create two rows for each single row currently in the data set. I achieve this most simply by adding two custom columns via the “Custom Column” button on the “Add Column” tab. Just fill in the expression with the string “Origin” and then “Destination” for each new column as shown below

The data should now look like this:

Select both of the new custom columns (mine are called Direction 1 and Direction 2) right click and select “Unpivot Columns”

Now each row has been duplicated so that we can get the Origin and Destination latitude and longitude on separate rows.

The newly created “Attribute” column should be removed and in my example I have renamed “Value” to “Direction”.

NOTE: In this example I am using unpivot to manipulate my single row into two rows. A more meaningful use of unpivot would be if you have revenue data by month and each month is represented as a column (Jan, Feb, March, etc…) you could select all the month columns and “Unpivot” and you would now have a row for each month as the attribute and the sales amount as the value.

Conditional Columns

Once the unpivot has been completed, add two conditional columns for “Latitude” and “Longitude” via the “Conditional Column” button in the “Add Column” tab to get the values for Origin and Destination into a single column for each. Use the “Direction” column as the conditional and when it equals “Origin” select the “Origin.Latitude” column. Otherwise, select the “Dest.Latitude” column.

See below example for Latitude:

Be sure to change the type of the two circled buttons from Value to Column.

Repeat the above for Longitude.

Change the type of these new columns to decimal numbers.

Now remove the 4 columns of “Origin.Latitude”, “Origin.Longitude”, “Dest.Latitude”, “Dest.Longitude”.

The last 4 columns of the Flights data set should now look like this:

Data prep is complete. We can now close and Apply.

The data load will take a long time if you used the 2.4 GB file from Enigma.IO… Time for coffee 🙂

Creating the Viz

As we work with this data set, remember that we now have two rows representing each flight. So if you want to get any counts or summarizations, always divide by 2 in your measures.

Number of Flights = COUNTROWS(Flights) / 2
Number of Delayed Flights = sum(Flights[depdel15]) / 2
Number of Cancelled Flights = sum(Flights[cancelled]) / 2

With these 3 measures, we can start working on visualizations. First, simply a few card visuals and a matrix by origin/destination

We have 1.6 million flights in our data set. Creating an R visual to represent all of these routes will not be very productive and will probably crash your memory anyway. Let’s setup a filter by origincityname and only route the cancelled flights.

For the above visual, we first should add origincityname as a slicer and select “Atlanta, GA”. Then add a slicer for cancelled = 1.

To create the R visual, select the “R” visualizations icon

Pull in the following values from the “Flights” data

This automatically generates some R code in the R script editor for the Visual

NOTE: This is really important as we have eliminated potentially 100s of lines of R code to “prep” the data to make the data frame look like we need it to be entered into the ggmap function. This was all done via the Query Editor transformations we made above.

Now we simply add the 2 lines of R code underneath the generated R script section

# This function simply imports the required library (ggmap) 
# to be used in the Power BI R visual
library(ggmap)

# qmap is a wrapper for ggmap and get_map. 
# The first parameter is the locale that you want to be mapped. second is zoom level. 
# Then geom_path is used to give the details of what data is to be plotted.
qmap("united states", zoom = 4) + geom_path(aes(x = Longitude, y = Latitude), size = .1, data = dataset, colour="red", lineend = "round")

NOTE: The comments make it longer than two lines, but helps describe what is happening

The geom_path function is part of the ggplot2 library and is further explained here: http://docs.ggplot2.org/current/geom_path.html

Hopefully from this example you can see that the R code is fairly minimal in filling the requirement of routing visualization.

Measures vs Columns for this R Visual

One limitation that currently exists in Power BI Desktop is that the measures we defined earlier are not really providing value to the visualization because we need to include the “Index” column to keep the dataset ordered as expected for ggmap to plot the route from the alternating rows of data.

Because the index column is required to keep the sort order, the filter context applied to the DAX measure of “Number of Cancelled Flights” will always equal 1. This does not allow us to do much “Business Intelligence” of the data set.

EARLIER function

Until this day, I am still not fully aware of why they call this function EARLIER, but what we need to do to introduce some actual “Business Intelligence” into this R visual is to create a column with the total number of cancelled flights via a given route. This “column’s data” will be repeated over and over so beware of how you utilize it. However, it will give us a great way to ONLY retrieve the data that we want.

For the “Flights” data set, add the following column to create a unique value for each Route that exists between airports:

Unique Route ID = CONCATENATE(Flights[originairportid],Flights[destairportid])

Once that value is added, the EARLIER function can be applied to get the total number of cancellations:

Total Cancelled for Route = SUMX(FILTER('Flights',EARLIER('Flights'[Unique Route ID])='Flights'[Unique Route ID]),'Flights'[cancelled])/2

The above value is repeated in every row, so don’t use it to be summarized… use it as a page level filter or a slicer to only retrieve data that meets that requirement (example: Only show routes that have more than 25 cancellations)

Make sure your slicers are cleared and your new plot should look something like this:

Now the solution is starting to become usable to gain some insights into the data. You can download my finished solution from the github repository and see that I have duplicated the “Airports” data set and have one for Origin and one for Destination that I can use as a slicer to more quickly find routes that have frequent cancellations from/to each city or state.

Conclusion

This is just an example of the many ways Power BI can be extended to help solve business problems through visualizations. One note to make is that ggmap is not yet supported by PowerBI.com. This specific R visual is only available in the desktop but many other R visuals are supported in the service as well.

And for my next article, we will see if i am brave enough to post my real opinions on the internet about Big Data and data warehousing architectures. I have lots to “rant” about but we will see if I actually post anything 🙂