How to use Hadoop HDFS as a datasource for Power BI

I have a local Hadoop server running on my Windows laptop. To connect Power BI to this Hadoop file system I used Hadoop’s WebHDFS REST API, which is very well documented.

https://hadoop.apache.org/docs/r1.0.4/webhdfs.html

With the WebHDFS API, you build a URL that names the Hadoop file system directory you want to list files from, followed by the LISTSTATUS operation. In my case I had a folder named “myhdfsfolder” under the HDFS root:

http://localhost:50070/webhdfs/v1/myhdfsfolder?op=LISTSTATUS
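
If you want to sanity-check the URL outside Power BI, the following is a minimal Python sketch of how the LISTSTATUS URL is put together and how the JSON reply can be read. The helper names liststatus_url and path_suffixes are made up for illustration, and the sample reply is hand-written with hypothetical file names, trimmed to a few of the fields described in the WebHDFS documentation.

```python
import json
from urllib.parse import quote


def liststatus_url(host, port, hdfs_path):
    """Build a WebHDFS LISTSTATUS URL for an HDFS directory."""
    # WebHDFS URLs use the /webhdfs/v1/ prefix followed by the HDFS path.
    path = quote(hdfs_path.strip("/"))
    return f"http://{host}:{port}/webhdfs/v1/{path}?op=LISTSTATUS"


def path_suffixes(liststatus_json):
    """Return the names listed in a LISTSTATUS JSON reply."""
    statuses = json.loads(liststatus_json)["FileStatuses"]["FileStatus"]
    return [entry["pathSuffix"] for entry in statuses]


# Hand-written sample reply; the file names are hypothetical.
SAMPLE_REPLY = """
{"FileStatuses": {"FileStatus": [
  {"pathSuffix": "mydata.csv", "type": "FILE", "length": 1024},
  {"pathSuffix": "hadoop.log", "type": "FILE", "length": 2048}
]}}
"""

if __name__ == "__main__":
    print(liststatus_url("localhost", 50070, "myhdfsfolder"))
    print(path_suffixes(SAMPLE_REPLY))
```

Against a running server you would fetch the URL with any HTTP client and pass the response body to path_suffixes instead of the sample string.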

The myhdfsfolder folder contained a csv file which had previously been imported into Hadoop.

In Power BI, select “Get Data”, “Other”, then “Hadoop File (HDFS)”, which opens a window that asks you to enter a URL. The URL requested is the WebHDFS REST API URL specified above.

After entering your URL and clicking OK you should see a list of the objects in the Hadoop directory. In my case I can see my csv file along with a bunch of Hadoop log files.

Since I only want the csv file, I clicked the Power BI Edit button so I could filter the list down to just that file. This part is standard Power BI/Power Query work. If you had multiple csv files, you could filter on the csv file type, on some other feature of the file name, or on a file attribute such as the creation date.
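
In Power BI this filtering is done through the query editor’s column filters. As a rough illustration of the same idea outside Power BI, here is a Python sketch that keeps only the csv files from a list of LISTSTATUS entries. The function name only_csv_files and the sample entries are hypothetical; the “type” and “pathSuffix” fields are the ones WebHDFS returns for each entry.

```python
def only_csv_files(entries):
    """Keep entries that are regular files whose name ends in .csv."""
    return [
        entry
        for entry in entries
        if entry.get("type") == "FILE"
        and entry.get("pathSuffix", "").lower().endswith(".csv")
    ]


# Hypothetical directory listing, shaped like WebHDFS FileStatus entries.
SAMPLE_ENTRIES = [
    {"pathSuffix": "mydata.csv", "type": "FILE"},
    {"pathSuffix": "hadoop.log", "type": "FILE"},
    {"pathSuffix": "archive.csv", "type": "DIRECTORY"},
    {"pathSuffix": "REPORT.CSV", "type": "FILE"},
]

if __name__ == "__main__":
    for entry in only_csv_files(SAMPLE_ENTRIES):
        print(entry["pathSuffix"])
```

The match is case-insensitive and skips directories, so a folder that happens to be named like a csv file is not picked up.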

After filtering, only the csv file I am interested in shows in the list. There is one more step, which is also standard Power BI/Power Query work. Since the file is listed as binary content, you first need to extract its contents. To do this, click the double down arrow at the top of the Content column. This expands the binary file’s contents into a new table, which is the data you want.
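
To make the “binary content to table” step more concrete, here is a small Python sketch of the equivalent operation: WebHDFS serves a file’s bytes through the OPEN operation, and those bytes are then decoded and parsed as csv. The helper names open_url and rows_from_binary, the UTF-8 encoding, and the sample bytes are all assumptions for illustration; this is not what Power BI runs internally.

```python
import csv
import io


def open_url(host, port, hdfs_file_path):
    """Build a WebHDFS OPEN URL, which returns the file's raw bytes."""
    return f"http://{host}:{port}/webhdfs/v1/{hdfs_file_path.strip('/')}?op=OPEN"


def rows_from_binary(content, encoding="utf-8"):
    """Decode raw file bytes and parse them into a list of csv rows."""
    text = content.decode(encoding)
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


# Hypothetical file content standing in for the bytes returned by OPEN.
SAMPLE_BYTES = b"id,name\r\n1,alpha\r\n2,beta\r\n"

if __name__ == "__main__":
    print(open_url("localhost", 50070, "myhdfsfolder/mydata.csv"))
    for row in rows_from_binary(SAMPLE_BYTES):
        print(row)
```

The first row holds the column headers, which mirrors the “use first row as headers” step you would typically apply in Power Query.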

Now you can continue transforming the csv file data in Power BI/Power Query, or just load the query to start using the file’s data in Power BI data modeling and reporting!

Installing Hadoop on Windows 8.1

It turned out to be quite simple to install Hadoop on Windows 8.1.

I have been using Hadoop Virtual Machine images from Cloudera, Hortonworks or MapR when I wanted to work with Hadoop on my laptop. However, these VMs are big files and take up precious hard drive space. So I thought it would be nice to install Hadoop directly on my Windows 8.1 laptop.

After some searching I found excellent instructions for installing Hadoop on Windows 8.1 on Mariusz Przydatek’s blog.

Note that these instructions are not for complete beginners. They assume that the reader understands basic Windows environment configuration, the Java SDK, Java applications, and building binaries from source files. Beginners will want to follow more detailed step-by-step instructions and use Mariusz’s instructions as a guide to make sure they are doing things correctly.

The most important thing to note is that you do not need Cygwin to install Hadoop on Windows, even though other tutorials and how-to blog posts insist that Cygwin is required.

Also, because you need to build Hadoop on your computer, you need to have MS Visual Studio. A trial version is enough, as you don’t need it once the Hadoop binaries are built. Other tutorials and how-to blog posts vary on which version of MS Visual Studio you need, but Mariusz’s post makes it clear.

At a high level, the Hadoop installation follows these steps:

  • Download and extract Hadoop source files
  • Download and install packages and helper software
  • Make sure System Environment Path and Variables are correct
  • Make sure configuration files and package paths are ok
  • Execute command that uses packages and helper software to build Hadoop binaries
  • Copy newly created Hadoop binaries into new Hadoop production folder
  • Execute command to run the new Hadoop binaries
  • Hadoop is running and ready to be used
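
For the environment step in the list above, a quick check before starting the build can save a failed run. The following Python sketch only illustrates the idea: the function names are made up, and the variable names JAVA_HOME and HADOOP_HOME and the example directory are assumptions, so substitute whatever the instructions you follow actually require.

```python
import os


def missing_variables(required, env=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def on_path(directory, env=None):
    """Report whether `directory` appears as an entry in PATH."""
    env = os.environ if env is None else env
    entries = env.get("PATH", "").split(os.pathsep)
    wanted = os.path.normcase(os.path.normpath(directory))
    return any(
        os.path.normcase(os.path.normpath(entry)) == wanted
        for entry in entries
        if entry
    )


if __name__ == "__main__":
    # Assumed variable names; adjust to your own setup.
    print("Missing:", missing_variables(["JAVA_HOME", "HADOOP_HOME"]))
    # Hypothetical install location, used only as an example.
    print("On PATH:", on_path(os.path.join("C:\\", "hadoop", "bin")))
```

Both functions accept an optional dictionary in place of the real environment, which makes it easy to try them out without changing any system settings.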

It’s pretty obvious, but worth stating, that this tutorial installs only a single-node Hadoop cluster, which is useful for learning and development. I quickly found out that I had to increase the memory limits so it could successfully run medium-sized jobs.

After Hadoop is successfully installed, you can then install Hive, Spark, and other big data ecosystem tools that work with Hadoop.