Wednesday, September 11, 2013

Hadoop Big Data .NET – Manufacturing scenario

Purpose: This document explains how to apply the power of the Hadoop Big Data platform in a Manufacturing scenario.
 
Challenge: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Data growth challenges and opportunities are considered to be three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). In the Manufacturing space there are a number of scenarios where we can speak about Big Data. From an Engineering Modeling perspective we can simulate every aspect of the manufacturing process and get business insight when doing Demand Forecasting, Supply Chain Planning, Capacity Planning, Resource Scheduling, Inventory Optimization, OEE (Overall Equipment Effectiveness) Optimization, etc.
 
Solution: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. The Apache Hadoop platform consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and other components.
HDInsight is Microsoft's Hadoop-based service that brings a 100% Apache Hadoop-based solution to the cloud. HDInsight gives you the ability to gain the full value of Big Data with a modern, cloud-based data platform that manages data of any type, whether structured or unstructured, and of any size. With HDInsight you can seamlessly store and process data of all types through Microsoft's modern data platform that provides simplicity, ease of management, and an open Enterprise-ready Hadoop service all running in the cloud. You can analyze your Hadoop data directly in Excel using new capabilities like Power Pivot and Power View.
 
Scenario
 
In this scenario (OEE Optimization) I want to develop a Hadoop MapReduce program that analyzes Equipment Run Log file(s) and provides the business insight needed to optimize OEE.
 
A sample Equipment Run Log (file) in a structured form may look like this:
 
Time  Machine    Event    Message
8:55  Machine1   [TRACE]  8:55 Machine1 [TRACE] exit code is 546789093
9:00  Machine1   [TRACE]  9:00 Machine1 [TRACE] exit code is 775367878
9:01  Machine2   [DEBUG]  9:01 Machine2 [DEBUG] exit code is 5546774
9:03  Machine3   [TRACE]  9:03 Machine3 [TRACE] exit code is 455674443
9:03  Machine1   [INFO]   9:03 Machine1 [INFO] exit code is 99682642
9:06  Machine1   [TRACE]  9:06 Machine1 [TRACE] exit code is 56425462
9:07  Machine6   [DEBUG]  9:07 Machine6 [DEBUG] exit code is 3664526
9:10  Machine29  [TRACE]  9:10 Machine29 [TRACE] exit code is 6426342
9:10  Machine12  [TRACE]  9:10 Machine12 [TRACE] exit code is 4629422
9:10  Machine2   [DEBUG]  9:10 Machine2 [DEBUG] exit code is 7628764324
9:10  Machine6   [TRACE]  9:10 Machine6 [TRACE] exit code is 76428436284
9:15  Machine1   [TRACE]  9:15 Machine1 [TRACE] exit code is 24257443623
9:25  Machine10  [DEBUG]  9:25 Machine10 [DEBUG] exit code is 24586
9:28  Machine9   [FATAL]  9:28 Machine9 [FATAL] exit code is 2745722
 
However, the data we collect from equipment may be unstructured, semi-structured, or a combination of semi/unstructured data and structured data.

So the same sample Equipment Run Log (file) may also look like this:
 
8:55 Machine1 [TRACE] exit code is 546789093
9:00 Machine1 [TRACE] exit code is 775367878
9:01 Machine2 [DEBUG] exit code is 5546774
This is a diagnostics message: XYZ
Machine downtime - start
Machine downtime - end
9:03 Machine3 [TRACE] exit code is 455674443
9:03 Machine1 [INFO] exit code is 99682642
This is a diagnostics message: XYZ
This is a diagnostics message: XYZ
9:06 Machine1 [TRACE] exit code is 56425462
9:07 Machine6 [DEBUG] exit code is 3664526
9:10 Machine29 [TRACE] exit code is 6426342
9:10 Machine12 [TRACE] exit code is 4629422
9:10 Machine2 [DEBUG] exit code is 7628764324
This is a diagnostics message: XYZ
9:10 Machine6 [TRACE] exit code is 76428436284
9:15 Machine1 [TRACE] exit code is 24257443623
9:25 Machine10 [DEBUG] exit code is 24586
This is a diagnostics message: XYZ
This is a diagnostics message: XYZ
This is a diagnostics message: XYZ
9:28 Machine9 [FATAL] exit code is 2745722
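
To make the relationship between the two layouts concrete, here is a minimal sketch (not part of the original project) of how one line of the semi-structured log could be split back into Time, Machine, Event and Message fields. The class name RunLogParser and the regular expression are assumptions made for illustration. Unlike the table above, Message here holds only the text that follows the event type.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class RunLogParser
{
    // Time, machine name, bracketed event type, then the rest of the line.
    private static readonly Regex EntryPattern =
        new Regex(@"^(\d{1,2}:\d{2})\s+(\S+)\s+\[(\w+)\]\s+(.*)$");

    // Returns true and fills the fields when the line is a structured entry;
    // returns false for free-form lines such as diagnostics messages.
    public static bool TryParse(string line, out string time, out string machine,
                                out string eventType, out string message)
    {
        time = machine = eventType = message = "";
        Match m = EntryPattern.Match(line ?? "");
        if (!m.Success) return false;

        time = m.Groups[1].Value;
        machine = m.Groups[2].Value;
        eventType = m.Groups[3].Value;
        message = m.Groups[4].Value;
        return true;
    }
}
```

For the line "9:01 Machine2 [DEBUG] exit code is 5546774" TryParse returns true with the time 9:01, the machine Machine2 and the event type DEBUG. For "This is a diagnostics message: XYZ" it returns false, so such lines can be handled separately.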
 
To analyze the unstructured data in Equipment Run Log file(s) we will apply the MapReduce model as implemented by Hadoop. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program comprises a Map() procedure that performs filtering and sorting (such as sorting messages by type into queues, one queue for each type) and a Reduce() procedure that performs a summary operation (such as counting the number of messages in each queue, yielding type frequencies). The MapReduce system orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
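
The Map and Reduce roles described above can be tried out without a cluster. The following is a minimal in-memory sketch, assuming plain LINQ instead of Hadoop; the class LocalMapReduce and its method names are hypothetical and are not part of the .NET SDK for Hadoop. Grouping by key stands in for the shuffle step that Hadoop performs between the two phases.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical single-process illustration of the MapReduce flow.
public static class LocalMapReduce
{
    // The event type is the text inside the first pair of square brackets.
    private static readonly Regex EventType = new Regex(@"\[(.*?)\]");

    // Map: emit (event type, line) for every line that carries an event type.
    public static IEnumerable<KeyValuePair<string, string>> Map(string line)
    {
        Match m = EventType.Match(line);
        if (m.Success)
            yield return new KeyValuePair<string, string>(m.Groups[1].Value, line);
    }

    // Reduce: summarize all values collected for one key as a count.
    public static KeyValuePair<string, int> Reduce(string key, IEnumerable<string> values)
    {
        return new KeyValuePair<string, int>(key, values.Count());
    }

    // Run: map every line, group the emitted pairs by key, reduce each group.
    public static SortedDictionary<string, int> Run(IEnumerable<string> lines)
    {
        var result = new SortedDictionary<string, int>(StringComparer.Ordinal);
        var groups = lines.SelectMany(line => Map(line))
                          .GroupBy(pair => pair.Key, pair => pair.Value);
        foreach (var group in groups)
        {
            KeyValuePair<string, int> reduced = Reduce(group.Key, group);
            result[reduced.Key] = reduced.Value;
        }
        return result;
    }
}
```

Applied to the sample Equipment Run Log above, Run yields the counts DEBUG 4, FATAL 1, INFO 1 and TRACE 8; free-form lines such as the diagnostics messages carry no event type and are skipped by Map.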
 
The schema below shows how the MapReduce algorithm works
 
Walkthrough
 
For the purposes of this walkthrough I installed the .NET SDK for Hadoop (http://hadoopsdk.codeplex.com) locally, which makes it easier to work with Hadoop from .NET.

Please note that you can also leverage the Windows Azure HDInsight service in the Cloud (http://gettingstarted.hadooponazure.com), which I will also describe in this article
 
Now let's review the process step-by-step!
 
Let's install Microsoft HDInsight Developer Preview first
 
Microsoft HDInsight Developer Preview
 
 
Once Microsoft HDInsight Developer Preview is installed, you can access your local Hadoop cluster at http://localhost:8085/ (the exact URL may vary)
 
 
You can now navigate to Local cluster to see what you can do with it
 
 
Please note that there are samples which you can deploy and try out
 
 
Now we can go ahead and create a Visual Studio project to implement the MapReduce program which will analyze Equipment Run Log file(s) and extract meaningful information in order to get business insight into the types of messages we have there
 
New Project
 
 
Once the Visual Studio project has been created, we have to add references to the NuGet packages for Hadoop processing, such as Microsoft.Hadoop.MapReduce and Microsoft.AspNet.WebApi.
NuGet is the package manager for the Microsoft development platform including .NET. The NuGet client tools provide the ability to produce and consume packages. The NuGet Gallery is the central package repository used by all package authors and consumers.
 
Please find info about how to install NuGet Package Manager here: http://docs.nuget.org/docs/start-here/installing-nuget
 
Please find more info about NuGet Microsoft.Hadoop.MapReduce package here: http://www.nuget.org/packages/Microsoft.Hadoop.MapReduce
 
Please find more info about NuGet Microsoft.AspNet.WebApi package here: http://www.nuget.org/packages/Microsoft.AspNet.WebApi
 
Extensions and Updates
 
 
Add References (Install NuGet packages)
 
 
Please note that when you install the NuGet Microsoft.Hadoop.MapReduce package you may get the following error, which suggests upgrading NuGet to the latest version from the link


However, if you navigate to this link you will then see the following error


This is a known issue, and the solution is to reinstall (uninstall, then install) the NuGet Package Manager

Once we have installed the required NuGet packages, the Solution Explorer will look like this
 
Solution Explorer
 
 
Now it's time to implement the MapReduce program as shown below
 
Source code
 
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Hadoop;
using Microsoft.Hadoop.MapReduce;
using Microsoft.Hadoop.WebClient.WebHCatClient;
using System.Text.RegularExpressions;
 
namespace OEEAnalysis
{
    class Program
    {
        //Mapper
        public class MyMapper : MapperBase
        {
            public override void Map(string inputLine, MapperContext context)
            {
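                //define the pattern: the event type is the text inside square brackets, e.g. [TRACE]
                string pattern = @"\[(.*?)\]";

                Regex re = new Regex(pattern);

                //determine the event type from the first match in the line
                Match m = re.Match(inputLine);

                //skip lines without an event type (e.g. free-form diagnostics messages)
                if (!m.Success) return;

                //use the captured group so that the key is TRACE rather than [TRACE]
                string key = m.Groups[1].Value;

                //emit the event type as the key and the whole line as the value
                context.EmitKeyValue(key, inputLine);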
            }
        }
 
        //Reducer
        public class MyReducer : ReducerCombinerBase
        {
            public override void Reduce(string key, IEnumerable<string> values, ReducerCombinerContext context)
            {
                //initialize counter
                int count = 0;
                    
                //code to aggregate the occurrence
                foreach (string value in values)
                {
                    count++;
                }
 
                //output results
                context.EmitKeyValue(key, count.ToString());
            }
        }
 
        static void Main(string[] args)
        {
            //establish job configuration
            HadoopJobConfiguration myConfig = new HadoopJobConfiguration();
            myConfig.InputPath = "user/Administrator/OEEAnalysis/input/";
            myConfig.OutputFolder = "user/Administrator/OEEAnalysis/output/";
 
            //connect to cluster
            Uri myUri = new Uri("http://localhost");
            string userName = "hadoop";
            string password = null;
            IHadoop myCluster = Hadoop.Connect(myUri, userName, password);
 
            //execute mapreduce job
            MapReduceResult jobResult = myCluster.MapReduceJob.Execute<MyMapper, MyReducer>(myConfig);
 
            //write job result to console
            int exitCode = jobResult.Info.ExitCode;
 
            string exitStatus = "Failure";
            if (exitCode == 0) exitStatus = "Success";
            exitStatus = exitCode + " (" + exitStatus + ")";
 
            Console.WriteLine();
            Console.Write("Exit Code = " + exitStatus);
            Console.Read();
        }
    }
}
 
As a result, all messages are classified by type and we get a count per type

When we now execute the program (by simply pressing F5) we will see the following outcome
 
Command prompt
 
 
The MapReduce program executed successfully, and we can now get the output file from the Hadoop File System as shown below
 
Command prompt
 
 
The result will look like
 
Result
 
DEBUG  4
FATAL  1
INFO   1
TRACE  8
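
A pie chart shows each message type as a share of the total, and those shares can also be computed in code. The sketch below is an illustration only: it assumes the job output file holds one "KEY<TAB>COUNT" pair per line (the usual layout of Hadoop streaming output), and the class name ResultShares is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class ResultShares
{
    // Parses "KEY<TAB>COUNT" lines and returns each key's share of the total in percent.
    public static Dictionary<string, double> FromOutput(IEnumerable<string> outputLines)
    {
        var counts = new Dictionary<string, long>();
        foreach (string line in outputLines)
        {
            string[] parts = line.Split('\t');
            long count;
            bool isPair = parts.Length == 2 &&
                long.TryParse(parts[1].Trim(), NumberStyles.Integer,
                              CultureInfo.InvariantCulture, out count);
            if (!isPair) continue; // ignore lines that are not key/count pairs

            counts[parts[0].Trim()] = long.Parse(parts[1].Trim(), CultureInfo.InvariantCulture);
        }

        long total = counts.Values.Sum();
        return counts.ToDictionary(
            pair => pair.Key,
            pair => total == 0 ? 0.0 : 100.0 * pair.Value / total);
    }
}
```

For the result above the total is 14 messages, so TRACE accounts for 8 of 14 (roughly 57%), DEBUG for 4 of 14 (roughly 29%), and INFO and FATAL for 1 of 14 each (roughly 7%).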
 
Now we can do a quick analysis using the power of Microsoft Excel and visualize the results in a pie chart
 
Pie chart
 
 
We can also review the execution history in our Cluster
 
 
Job History
 
 
If you want to schedule a MapReduce job in the Windows Azure HDInsight portal, you will be required to provide a JAR file. For a MapReduce job written in .NET (and not Java) there are a number of other ways to schedule it in HDInsight. For example, you can leverage the MRRunner framework provided as a part of the .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=Example%20Map-Reduce%20program
 
We have now executed the MapReduce job locally and got the result, but we also want to leverage the Windows Azure HDInsight service, which can potentially provide us with much more computational power to process the terabytes or petabytes of Equipment Run Log data which we may have collected

That's why we will now go ahead and create an HDInsight cluster in Windows Azure, which is very simple to do
 
New HDInsight Cluster
 
 
Configure Cluster User
 
 
Storage Account
 
 
As a result we'll have an HDInsight cluster provisioned for us
 
HDInsight Cluster
 
 
HDInsight Cluster - Dashboard
 
 
On the dashboard you can monitor activity, review the specs and more. In this particular scenario I'm using 24 cores for computations in the cluster
 
Next step is to log into HDInsight Management Portal (Manage Cluster) 
 
HDInsight Management Portal Login
 
 
The URL for HDInsight Management Portal may look like this: https://alex.azurehdinsight.net/
 
HDInsight Management Portal
 
 
Now I can also connect my local HDInsight Developer Preview installation to the Windows Azure HDInsight cluster
 
Register Cluster
 
 
Once this is done you will see that I now have 2 clusters defined: one is the local cluster and the other is the Windows Azure HDInsight cluster
 
 
Finally, when I submit a MapReduce job for execution in the Windows Azure HDInsight cluster I can review the execution history and other details
 
 
Please note that I executed the OEE Analysis job in the Windows Azure HDInsight cluster.
In this walkthrough we reviewed how to install and set up an HDInsight cluster locally and in the Cloud, and how to approach OEE Optimization by utilizing Big Data collected from Equipment on the Shop Floor in the form of unstructured logs to get valuable business insight. Please note that we could also mash this data up with transactional data in Microsoft Dynamics AX 2012 for better business insights.
 
 
Summary: This document describes how to implement a MapReduce Hadoop job in .NET in order to do OEE Optimization for a Manufacturing organization. The Hadoop platform provides a cheaper (scales to petabytes or more), faster (parallel data processing) and better (suited for particular types of Big Data problems) way to work with unstructured, semi-structured, or a combination of semi/unstructured and structured data, and to get valuable business insight for optimization. We discussed how to utilize a local Hadoop environment as well as the Windows Azure HDInsight service available in the Cloud. Please learn more about Windows Azure here: http://www.windowsazure.com.
 
Tags: Big Data, Windows Azure, HDInsight, Hadoop, MapReduce, Manufacturing, Microsoft Dynamics AX 2012, OEE, Overall Equipment Effectiveness, .NET.
 
Note: This document is intended for information purposes only, presented as it is with no warranties from the author. This document may be updated with more content to better outline the issues and describe the solutions.
 
Author: Alex Anikiev, PhD, MCP
