|
In Part 1 of this blog series I talked about how decisions can be made using a combination of technologies, especially BPM and HANA. NBA (Next Best Action) is the key to managing customers: an enterprise needs an intelligent DNA that lets it decide within a few seconds what the best possible action is for a particular customer context.

Hadoop for processing unstructured data

Apache Hadoop is an implementation of Google's papers on MapReduce and GFS. It is extremely good at processing huge amounts of mostly unstructured data in a distributed fashion. Hadoop is not a real-time system like HANA; it is a batch processing system, meaning you submit jobs to it, which it distributes using its HDFS distributed file system, performing Map and Reduce operations on the data. Hadoop does not intend to compete with an RDBMS; instead it complements it by batch processing huge amounts of unstructured data, the results of which are fed back to the RDBMS for the actual processing and analytics.

Setting up Hadoop would require a blog of its own, which I plan to write in the near future. Here I will present the MapReduce programs and analytics that I did using Hadoop. The result of this processing is fed into HANA for real-time NBA determination in BPM.

Unstructured Data

Unstructured data is basically data without a schema, which cannot be analyzed in its raw form. Lately unstructured data has been growing at exponential rates in the enterprise as more and more systems come online and enterprises want to capture as much data as possible to make the best decisions in a matter of seconds. I will be using sample data from the logs of an ecommerce application in which customers log in and perform various actions. Here is a snapshot of the data I generated for processing with Hadoop.
Once you have your data, you can start writing MapReduce programs to process it. I wrote several programs to analyze activities, referrals, purchase amount by location, and most-browsed product models, grouping the results by location to produce a visualization of each. The MapReduce programs were written in Python using the Hadoop Streaming API. One advantage of this approach is that I can simulate the Hadoop programs locally and test them for correctness before submitting them, which saved me a lot of trouble.

Map Program

    import sys

The above program checks whether Hadoop simulation is set to True. It then emits the customer id attribute as the key, sorted, with the entire line as the value (not a good practice; only a small subset of the data should be emitted as the value). Note: this approach does not scale to huge amounts of data, so use the cut command to sample out a small amount of data for testing while using Hadoop simulation.

Reduce Program

    import sys
    logintime,logouttime,loggedintime = 0,0,0
    p_senti = ["great","outstanding","impressive","wow","brilliant"]
    fields = line.strip().split("\t")[1].split(",")
    if fields[6] == "purchasing":
    elif fields[6] == "login":
    if fields[9] == "suc:closed":
    for item in p_senti:
    if item in fields[15]:
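Since the full listings are not reproduced here, the following is only a minimal sketch of the map, sort, and reduce flow described in this post, not the original cmap.py or cred.py. The field positions 6 (activity) and 15 (comment) and the p_senti word list come from the fragments shown; everything else is an assumption made for illustration: the customer id being the first comma-separated field (KEY_COL), the function names map_lines, reduce_lines, and simulate, and the choice to count only purchases and positive comments.

```python
# Positive sentiment words, taken from the post's reduce program.
P_SENTI = ["great", "outstanding", "impressive", "wow", "brilliant"]
KEY_COL = 0  # assumption: customer id is the first comma-separated field


def map_lines(lines):
    """Emit 'customer_id<TAB>whole line' for every non-empty input line."""
    for line in lines:
        line = line.strip()
        if line:
            yield "%s\t%s" % (line.split(",")[KEY_COL], line)


def reduce_lines(sorted_lines):
    """Yield (customer_id, purchases, positive_comments) per customer.

    Relies on the input being sorted by key, which Hadoop guarantees
    between the map and reduce phases.
    """
    last_key, purchases, positives = None, 0, 0
    for line in sorted_lines:
        key, value = line.rstrip("\n").split("\t", 1)
        # A change in the sorted key means the previous customer is complete.
        if last_key is not None and key != last_key:
            yield last_key, purchases, positives
            purchases, positives = 0, 0
        last_key = key
        fields = value.split(",")
        if len(fields) <= 15:
            continue  # skip malformed lines that lack the comment field
        if fields[6] == "purchasing":
            purchases += 1
        if any(word in fields[15] for word in P_SENTI):
            positives += 1
    # Flush the final customer once the input is exhausted.
    if last_key is not None:
        yield last_key, purchases, positives


def simulate(lines):
    """Local stand-in for Hadoop: map, then sort by key, then reduce."""
    return list(reduce_lines(sorted(map_lines(lines))))
```

In a real streaming job, map_lines and reduce_lines would each live in their own script and read from sys.stdin; simulate only chains them in one process so that a small sample can be checked before a job is submitted.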
The Reduce program again reads the emitted values from STDIN and processes them line by line. Note the use of last_key to detect a change in the sorted customer id.

Executing Hadoop Jobs

To execute a Hadoop job, reference hadoop*streaming*.jar in your path and specify the map and reduce programs.

#time hadoop jar ../contrib/streaming/hadoop-*streaming*.jar -input /input -output /output_c -mapper 'cmap.py 0 1' -reducer 'cred.py' -file cmap.py -file cred.py

Here are the results:
You can see that the job took approximately 1 minute to execute. The first few lines of the output are also displayed, showing for each customer the customer id, total purchase amount, average purchase amount, total number of purchases by status, total referrals, and a very primitive sentiment analysis of the comments that customers have left. Using Excel on the output data, you can do some analysis, as shown below, of the average purchase amount and of positive and negative sentiments (p_senti and n_senti).
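For readers who prefer a script to a spreadsheet, a similar summary can be sketched in Python. This is an illustration only: it assumes the reducer output is tab-separated with exactly seven columns in the order customer id, total purchase amount, average purchase amount, number of purchases, referrals, p_senti count, n_senti count. That layout, and the names Row, parse_output, and summarize, are hypothetical and would need to be adjusted to the actual output file.

```python
from collections import namedtuple

# Assumed column layout of one reducer output line (hypothetical).
Row = namedtuple(
    "Row",
    "customer_id total_amount avg_amount purchases referrals p_senti n_senti",
)


def parse_output(lines):
    """Parse tab-separated reducer output, skipping lines with the wrong shape."""
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 7:
            continue
        rows.append(Row(parts[0], float(parts[1]), float(parts[2]),
                        int(parts[3]), int(parts[4]),
                        int(parts[5]), int(parts[6])))
    return rows


def summarize(rows):
    """Overall average purchase amount plus positive/negative sentiment totals."""
    total_amount = sum(r.total_amount for r in rows)
    total_purchases = sum(r.purchases for r in rows)
    return {
        "overall_avg_purchase": total_amount / total_purchases if total_purchases else 0.0,
        "positive": sum(r.p_senti for r in rows),
        "negative": sum(r.n_senti for r in rows),
    }
```

Calling summarize(parse_output(open(path))) on a local copy of the job output, where path is wherever that copy was saved, would give the figures that the Excel charts are built from.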
|