Archive for May, 2016

May 27 2016

Comprehensive Study of Hadoop Distributions


Anyone who is paying attention to big data will be aware of how hot Hadoop is right now. Hadoop is a powerful open source software framework that makes it possible to process large data sets across clusters of computers. This design makes it easy to scale quickly from a single server to thousands. With data sets distributed across commodity servers, companies can run Hadoop fairly economically, without the need for high-end hardware. A number of vendors have developed their own distributions, adding new functionality or improving the code base. Vendor distributions are designed to overcome issues with the open source edition and provide additional value to customers, with a focus on areas such as:

  • Reliability – Vendors react faster when bugs are detected, promptly delivering fixes and patches, which makes their solutions more stable.
  • Support – A variety of companies provide technical assistance, making it possible to adopt the platforms for mission-critical and enterprise-grade tasks.
  • Advanced system and data management tools – Additional tools and features for security, management, workflow, provisioning and coordination.
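The distributed processing Hadoop is known for rests on the MapReduce programming model. The canonical example is a word count; below is a minimal pure-Python sketch of the map, shuffle and reduce phases. It simulates the model on a single machine for illustration only; it is not Hadoop code.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "data everywhere"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"], counts["data"])  # 2 2
```

In a real cluster, map tasks run in parallel on the nodes holding each data block, and the framework performs the shuffle over the network; the programmer supplies only the map and reduce functions.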

Several infrastructure vendors such as Oracle, IBM, Cloudera, Hortonworks and EMC Greenplum also provide their own distributions and do their best to promote them by bundling a Hadoop distribution with custom-developed systems referred to as ‘engineered systems’. Engineered systems with bundled Hadoop distributions form “engineered big data systems”.

The major players in the industry are Amazon (Elastic MapReduce), Cloudera, Hortonworks, MapR, IBM, Oracle, and EMC Greenplum.


Table: Comparison of the latest Hadoop distributions

  • Amazon Web Services leads the pack with its proven, feature-rich Elastic MapReduce subscription service.
  • IBM and EMC Greenplum (now Pivotal) offer Hadoop solutions within strong enterprise data warehouse (EDW) portfolios.
  • MapR is the best choice if you are looking for a complete Hadoop stack with all features.
  • Cloudera includes components for the user interface, security and integration, making administration of your enterprise data hub simple and straightforward. Cloudera Manager lets you operate your big data platform centrally.
  • Hortonworks is the only vendor whose distribution consists entirely of open source Hadoop services.





May 27 2016

Spark Streaming vs Flink

There are plenty of distributed stream processing systems on the market, but among them Apache Spark is the most widely adopted, driven by the fundamental need for faster data processing and real-time streaming. With the rise of a new contender, Apache Flink, one begins to wonder whether it is time to shift from Spark Streaming to Flink. Let us look at the pros and cons of both tools in this article.

From its inception, Apache Spark (fig: 1.1) has provided a unified engine that supports both batch and stream processing workloads, while other systems either have a processing engine designed only for streaming, or expose similar batch and streaming APIs that compile internally to different engines. Spark Streaming discretizes the streaming data into micro-batches: it receives data and buffers it in parallel on Spark’s worker nodes. This enables both better load balancing and faster fault recovery. Each batch of data is a Resilient Distributed Dataset (RDD), the basic abstraction of a fault-tolerant dataset in Spark.


Fig: 1.1
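The discretization idea can be pictured with a small pure-Python sketch. This is a toy illustration, not the Spark API: real Spark Streaming slices the stream by a time interval, while this sketch slices by event count for simplicity.

```python
def micro_batches(stream, batch_size):
    """Discretize an unbounded event stream into fixed-size micro-batches,
    the way Spark Streaming slices its input into per-interval RDDs."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch  # each yielded batch plays the role of one RDD
            batch = []
    if batch:
        yield batch  # flush the final partial batch

events = range(7)
batches = list(micro_batches(events, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch can then be scheduled on any worker, which is what gives the micro-batch model its load balancing and recovery properties.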

Apache Flink (fig: 1.2) is a newer big data processing tool known for processing big data quickly, with low latency and high fault tolerance, on large-scale distributed systems. Its essence is the ability to process streaming data in real time, like Storm; it is primarily a stream processing framework that can also act as a batch processor. It is optimized for cyclic or iterative processing, achieved through optimized join algorithms, operator chaining and the reuse of partitioning.


Fig: 1.2

Both systems aim to be a single platform on which you can run batch, streaming, interactive, graph processing, ML and other workloads. Flink provides event-level granularity, which Spark Streaming cannot, since it is in effect a fast batch processor. Because of the intrinsic nature of batches, support for windowing is very limited in Spark Streaming, and Flink rules over it with better windowing mechanisms: windows can be based on processing time, event time or number of records, and can be customized. This flexibility makes the Flink streaming API very powerful compared to Spark Streaming.
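The event-time windowing that gives Flink its edge can be sketched in a few lines. This is an illustrative toy, not the Flink API: each event carries its own timestamp, and the window it lands in is derived from that timestamp rather than from arrival order.

```python
from collections import defaultdict

def tumbling_windows(events, window_size):
    """Assign (timestamp, value) events to fixed event-time windows:
    the window is derived from the event's own timestamp."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

# events arrive out of order, but event time decides the window
events = [(1, "a"), (12, "b"), (3, "c"), (11, "d")]
print(tumbling_windows(events, window_size=10))
# {0: ['a', 'c'], 10: ['b', 'd']}
```

A micro-batch engine, by contrast, naturally groups events by when they arrived, which is why event-time windows are harder to express there.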

While Spark Streaming follows a procedural programming model, Flink follows a distributed dataflow approach. So whenever intermediate results are required, broadcast variables are used to distribute the pre-computed results to all the worker nodes.

The similarities between Spark Streaming and Flink include exactly-once guarantees (correct results, even in failure cases), thereby eliminating duplicates, and very high throughput compared to other processing systems such as Storm. Both also provide automatic memory management.

For example, computing the cumulative sales of a shop over a specific time interval is easy for batch processing, but raising an alert the moment a value crosses its threshold is a situation much better handled by stream processing.
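The contrast can be made concrete with a toy sketch in plain Python (not Spark or Flink code): the batch job only answers after the interval closes, while the streaming job reacts per event.

```python
def batch_total(sales):
    """Batch job: compute the cumulative total over a closed interval."""
    return sum(sales)

def stream_alerts(sales, threshold):
    """Streaming job: raise an alert the moment the running total crosses
    the threshold, without waiting for the interval to close."""
    running = 0
    alerts = []
    for i, amount in enumerate(sales):
        running += amount
        if running >= threshold:
            alerts.append(i)  # record the event index that tripped the alert
            break
    return alerts

sales = [100, 250, 400, 50]
print(batch_total(sales))         # 800
print(stream_alerts(sales, 700))  # [2] -> the third event trips the alert
```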

Let us now take a deep dive and analyze the features of Spark Streaming and Flink.


Though Spark has many advantages in batch data processing, it still has ground to cover in streaming. Flink can handle batch processing, but cannot yet be compared with Spark in the same league there. At this point in time Spark is the more mature and complete framework, but Flink appears to be taking big data stream processing to the next level altogether.


May 24 2016

Intelligent Traffic Management – Applying Analytics on Internet of Things

Extending the previous article “Revolutionizing the Agriculture Industry”, this article discusses how the Internet of Things (IoT) and analytics are going to revolutionize traffic management in the modern world. Day by day, roads are getting deluged with vehicles while the road infrastructure remains unchanged. Congestion in cities is cited as the major transportation problem around the globe. According to the Texas A&M Transportation Institute, in 2011 traffic congestion cost the USA $121 billion in travel delays and 2.9 billion gallons of wasted fuel.

The world is becoming more intelligent: sensors in cars and roads are connected to the internet, and the devices exchange data with each other. Using this intelligence, fleets can avoid accidents, predict car failures, take preventive maintenance actions, and more. Let us say hello to John.

John drives to the office every day. His intelligent car receives data about on-the-road events, such as an accident ahead on his route, and guides him to a more efficient alternative route. With this intelligence John reaches his destination, and his car even informs him about an available parking slot! These capabilities not only help John save time and fuel but also give him a tension-free commute. Are any such services available already?


Meet Zenryoku Annai

Zenryoku Annai is a service provided in Japan by Nomura Research Institute (NRI). Using it, subscribers all over Japan can plot the shortest travel routes, avoid traffic snarls and estimate when they will arrive at their destinations. It combines information from satellite navigation systems linked to sensors at fixed locations along roads with traffic data derived through statistical analysis of position and speed information from subscribers, moving vehicles and even pedestrians; data from thousands of taxicabs is added to the mix. Using all this information, Zenryoku Annai analyzes road conditions and helps drivers plan routes more accurately, and over a wider range, than is possible with conventional GPS systems. As more vehicles were added over time, NRI adopted in-memory computing technology, which improved search speed by a factor of more than 1,800 over the previous relational database management system: 360 million data points can be processed in just one second.


Fig: 1.1

Fig: 1.1 In a conventional SatNav system, road conditions are known only at locations where sensors are installed, but with NRI’s probe technology road conditions can be determined much more accurately using position and speed data delivered from in-car units, mobile phones and sensors; the system can then suggest the best alternative route for the user. Zenryoku Annai is not yet complete, since not all vehicles on the roads are connected to it, but one can soon expect everything that runs on the road to be connected to the internet.


Driverless Cars: Another example!

A driverless/autonomous car is an ideal example where IoT and real-time analytics play a crucial part. When such a car travels down a road, it interacts with traffic signals and other vehicles through various sensor points, and routes itself along the shortest path by weighing congestion and distance to reach the destination in minimum time. Pilot test runs of Google’s cars (fig: 1.2) showed that a car generates about 1 gigabyte of data per minute from its surroundings, and these cars depend entirely on IoT and analytics to take decisions.


Fig: 1.2
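The routing decision described above is, at its core, a shortest-path search over a road graph whose edge weights reflect live congestion. Below is a minimal sketch using Dijkstra's algorithm; the road names, travel times and congestion factors are entirely hypothetical.

```python
import heapq

def shortest_route(graph, start, goal):
    """Dijkstra on a road graph whose edge weights are travel times scaled
    by live congestion, so the 'shortest' path is the fastest one."""
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == goal:
            break
        for nbr, base_time, congestion in graph.get(node, []):
            nd = d + base_time * congestion  # congestion multiplies travel time
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    path, node = [goal], goal          # reconstruct the route backwards
    while node != start:
        node = prev[node]
        path.append(node)
    return list(reversed(path)), dist[goal]

# the highway is faster when clear, but an accident triples its travel time,
# so the normally slower side road wins under congestion
graph = {
    "home":    [("highway", 10, 3.0), ("side", 15, 1.0)],
    "highway": [("office", 5, 1.0)],
    "side":    [("office", 8, 1.0)],
}
path, cost = shortest_route(graph, "home", "office")
print(path, cost)  # ['home', 'side', 'office'] 23.0
```

A fleet system would refresh the congestion factors from sensor data and re-run the search as conditions change.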



IoT with analytics is at a very nascent stage, and there are many barriers restricting its growth: security of the data, privacy of individuals, implementation problems and technology fragmentation. The average American commuter spent 14 hours per year in traffic in 1982; by 2010 this had surged to 34 hours per year, and if the problem remains unsolved it may reach 40 hours per year. With population density exploding and space in short supply, it is impractical to keep increasing the capacity of roads; the practically viable option is to use the power of data analytics on IoT.


Views expressed on this article are based solely on publicly available information.  No representation or warranty, express or implied, is made as to the accuracy or completeness of any information contained herein.  Aaum expressly disclaims any and all liability based, in whole or in part, on such information, any errors therein or omissions therefrom.




May 12 2016

Big lessons from big data implementation – Part I

Each day 23 billion GB of data are generated, and the rate of big data generation doubles roughly every 40 months! Apart from their business data, organizations now also have humongous data available from Google, Facebook, Amazon, etc. They wish to use all the available data to find useful information for doing business better. Let us look at the big data deployments of a few organizations and learn from their experience.

Case 1: Rabobank

Rabobank is a Dutch multinational banking and financial services company headquartered in Utrecht, Netherlands. It is a global leader in food and agribusiness financing and sustainability-oriented banking. Rabobank started developing a big data strategy in July 2011, creating a list of 67 possible big data use cases. These use cases included:

  • To signal and predict risks, and prevent fraudulent actions against the bank;
  • To identify customer behavior and obtain a 360-degree customer profile;
  • To recognize the most influential customers as well as their networks;
  • To be able to analyze mortgages;
  • To identify the channel of choice for each customer.

For each of these categories they roughly calculated the time needed to implement it as well as the value proposition. In the end, Rabobank moved forward with big data applications to improve business processes, as these offered the clearest possibility of a positive ROI. A dedicated, highly skilled, multidisciplinary team was created to start on the big data use cases, using Hadoop to analyze the data. Social data, open data and trend data were selected and integrated, so their data landscape included a deluge of semi-structured and unstructured content. Hadoop is only part of a big data strategy: the key to success was the multidisciplinary team, and the fact that they embraced uncertainty and accepted that mistakes would be made.

Problems faced during implementation

Rabobank didn’t store raw data, due to cost and capacity issues. Data quality was not consistent, and the security stakes were high. Rabobank noticed that it was often unclear who owned the data and where all of it was stored. Hadoop is different from older database and data warehousing systems, and those differences confused users.


Lessons learned

  1. Specialized knowledge as well as visualization is very important to drive big data success.
  2. Start with the basics and don’t stop at stage one: big data implementation is a continuous journey of reaping data-driven insights.
  3. Not having the right skills for the job can be a big problem.
  4. Don’t underestimate the complexity of a big data system implementation; focus on data management.


Case 2: The Patient Protection and Affordable Care Act

In 2010 the United States government introduced the Patient Protection and Affordable Care Act. The main purpose of this act was to provide the best of public and private insurance coverage to the population, thereby controlling and reducing healthcare costs, and it requires citizens to interact with the government via a website to do so. The system is in essence a big data implementation, with data being collected on a potential population in excess of 300 million people across the entire country. Unfortunately, the project has not progressed as planned and has become mired in technological controversy.

Problems faced during implementation

  • The act brought the country to the brink of default on its debt
  • Cost of Obamacare – $1.6 trillion
  • Estimated cost for 2014–2024 – $3.8 trillion

How the problems could have been anticipated:

  • Specialized knowledge as well as visualization could have prevented the losses.


Lessons learned

  1. Don’t underestimate the complexity of a big data system implementation; focus on data management.
  2. Prior analysis and prediction of data complexity can prevent failures.
  3. Much of the data collected and stored in an agency’s transaction processing systems lacks adequate integrity, so make sure that captured data meets integrity standards.
  4. Specialized knowledge as well as visualization is very important to drive big data success.
  5. Not having the right skills for the job can be a big problem.


We shall analyze a few more cases tomorrow. Keep watching this space.






May 11 2016

Revolutionizing the Agriculture industry – Applying Analytics on Internet of Things

Analytics on Internet of Things

The Internet of Things (IoT) and big data analytics may be the two most buzzed-about terms in industry over the past two years. IDC has forecast that IoT will yield $8.9 trillion in revenue by 2020, and Goldman Sachs has estimated that 28 billion devices will be connected to the internet by 2020. Each of these connected devices will send back humongous amounts of data every second, so a proper analytical tool or solution is needed for value creation.

Introduction to IoT:

IoT is a network of inter-connected objects able to collect and exchange data. This ecosystem enables entities to connect to, and control, their devices. A device performs a command and/or sends information back over the network to be analyzed and displayed remotely. For example: at 6 am John’s phone receives a mail informing him that his meeting has been pushed back; his mail service tells his smart clock to give him an extra 30 minutes of sleep and alerts him to the change once he wakes. This is how the whole system works; adding analytics to this setup enables firms to take efficient, well-informed decisions with the available data.

Analytics on IoT in Agricultural Industry:

How will this trend influence the agriculture industry? The answer: it is going to entirely revolutionize the working pattern of the industry rather than merely influence it. By 2050 the global population is expected to reach 9 billion people (34% higher than today), so food production must increase by at least 70%; on the other hand, the U.S. Department of Agriculture states that 90% of all crop losses are due to weather-related incidents. Much would be gained if we could minimize those losses and make better use of the limited fresh water supply, since drought is prevailing across the globe and 70% of the world’s fresh water is already used for agriculture. These issues can be addressed with predictive analytics, which will play a major part in value creation. Predictive analytics is the central element in forecasting the future picture, and it requires a lot of input data from many distinct variables. The basic idea is to identify and differentiate between high- and low-yielding croplands by measuring their productivity.

Flint River Valley project:

Hyper-local forecasting techniques will assist farmers in overcoming the obstacles above. The Flint River Valley is part of Georgia’s agricultural industry, contributing roughly $2 billion annually in farm-based revenue. A pilot run has been made in the Flint River Valley in the USA by researchers from the Flint River Soil and Water Conservation District, the U.S. Department of Agriculture, the University of Georgia and IBM. The primary objective is to give farmers beneficial information about the weather by analyzing data obtained from various sensors (fig: 1.1) installed in the fields. The sensors collect data such as temperature and moisture levels in the air and soil, which is blended with satellite data. This data feeds Variable Rate Irrigation technology, which lets farmers conserve water with sprinklers that turn off over areas that don’t need water and back on over areas that do.

Fig: 1.1
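The per-zone sprinkler decision described above can be sketched as a simple rule over soil-moisture readings. This is an illustrative toy, not the actual Variable Rate Irrigation system; the zone names, threshold and readings below are hypothetical.

```python
def irrigation_plan(zones, moisture_threshold, rain_probability, rain_cutoff=0.6):
    """Decide per-zone sprinkler state from soil-moisture sensor readings,
    skipping irrigation entirely when the forecast makes rain likely."""
    if rain_probability >= rain_cutoff:
        return {zone: "off" for zone in zones}  # let the forecast do the watering
    return {
        zone: ("on" if moisture < moisture_threshold else "off")
        for zone, moisture in zones.items()
    }

# hypothetical soil-moisture readings (fraction of field capacity) per zone
readings = {"north": 0.22, "south": 0.41, "east": 0.18}
print(irrigation_plan(readings, moisture_threshold=0.30, rain_probability=0.2))
# {'north': 'on', 'south': 'off', 'east': 'on'}
```

The value of the hyper-local forecast is precisely the `rain_probability` input: a reliable 72-hour outlook lets the whole plan be skipped when rain will do the watering.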


Fig: 1.2


Fig: 1.2 shows the cloud water density, i.e. the water content in a cloud, which is important in figuring out which type of cloud is going to form and helps determine the cloud formations likely to occur, making it extremely useful for weather forecasting.

According to IBM, farmers will be able to track weather conditions in 10-minute increments up to 72 hours in advance. A full 72-hour forecast creates around 320 gigabytes of data, although each individual farmer requires only a small, personalized tranche of it. IBM is also building a weather model with 1.5-kilometre resolution for the farmers. The project is estimated to save 15% of the total water used in irrigation, amounting to millions of gallons per year, at a cost of roughly $20–$40 per acre for the first three years.

 Fig: 1.3


With geospatial mapping, sensors and predictive analytics, farmers will be presented with real-time data as time series and graphs at a granular level: soil quality, field workability, details on nitrogen, pests and disease, precipitation, temperatures, and harvest projections, even predicting expected revenue in relation to the commodity’s market trend, all analyzed and reported via a smartphone (fig: 1.3), tablet or desktop. In future it may even become mandatory to use IoT and analytics in the agriculture industry for sustained growth, maximizing revenue many times over with minimal use of resources.


The IoT is on its way to becoming the next technological revolution, with $6 trillion to be invested before 2020 and a predicted ROI of $13 trillion by 2025 (cumulative over 2020–2025). Given the massive amounts of revenue and data that the IoT will generate, its impact will be felt across the entire big data universe, forcing companies to upgrade current tools and processes, and technology to evolve to accommodate the additional data volume and take advantage of the insights.




May 10 2016

Dawn Of Online Aggregators – How Business Analytics enabled them

Online aggregators are websites in the e-commerce industry that collect information about various goods and services from several competing sources and consolidate it in one place. The aggregator model helps consumers with customized, tailored offerings that cater to their needs and wants, and adds value by acting on their feedback and revamping the shopping experience.

Who are the Online Aggregators?

They have been graciously welcomed by both end players, customers and businesses alike, since they enhance sales and give products and services a good reach to customers, benefiting everyone. Aggregators have flourished across many industries such as travel, payment gateways, insurance and taxi services; some firms open up secondary markets over the internet, like letgo, Locanto and vinted, while others aggregate food ordering services, like Campus food, Gimmegrub, Diningin, GetQuick et al.

Business Analytics – As a key factor for the aggregator’s triumph

What enabled aggregator firms to bring such a dynamic yet strong change to the e-commerce industry, away from what was traditionally followed? What served as the base for these firms to venture into this space? It became possible by unlocking marketing data and turning it inside out, a job well handled by business analytics, which can be portrayed as the intersection of data science with business.

Business analytics is the study of data through statistical and operational analysis. It focuses on bringing out new insights based on the data collected and utilizing them to enhance business performance. It is closely related to management science, since it builds extensively fact-based explanatory and predictive models, using statistical analysis to drive management decision making.

Uber- A study on how Business analytics has augmented their business 

In an online aggregator like Uber, which operates across the world, business analytics plays a crucial role. Uber is an app-based technological platform linking passengers who want to hire a taxi with drivers ready to accept a ride; Uber takes 20% of the cab fare as its commission and the rest goes to the driver. The firm operates in 444 cities worldwide. In a city like New York around 14,000 Uber cabs whisk passengers around, while the unorganised sector holds 13,500 taxis. In Los Angeles, 20,000 of 22,300 cabs are registered with Uber.


Fig 1: Market share of Uber


Uber maintains a huge store of data about all its drivers and users, and details of every city in which it operates, so that it can instantly match a passenger with a nearby driver. In the USA, traditional taxi meters charge passengers based on the duration of the ride, but Uber follows a patented algorithm that uses both the distance and the duration of the trip. On top of this, Uber applies a technique called surge pricing: the fare is multiplied during surge periods, when traffic and demand overflow, for which the firm had to study traffic in New York City (Fig 2). Passengers are warned in advance about the magnified rate over the normal one. The algorithm advises drivers to stay home when rides are scarce, or encourages them to get behind the wheel to earn extra money while the city is congested.


Fig 2: The analysed traffic hours in New York City, and the surge-price notice with which the Uber app informs the passenger about the multiplied charge.
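Surge pricing of this kind can be illustrated with a toy demand/supply multiplier. This is a hypothetical sketch, not Uber's patented algorithm; the rates, cap and numbers below are invented for illustration.

```python
def surge_multiplier(ride_requests, available_drivers, cap=4.0):
    """Illustrative surge pricing: scale fares by the demand/supply ratio,
    capped so prices cannot grow without bound."""
    if available_drivers == 0:
        return cap
    ratio = ride_requests / available_drivers
    return round(min(max(ratio, 1.0), cap), 2)  # never discount below base fare

def surge_fare(distance_km, minutes, rate_km, rate_min, requests, drivers):
    """Base fare from distance and duration, scaled by the surge multiplier."""
    base = distance_km * rate_km + minutes * rate_min
    return round(base * surge_multiplier(requests, drivers), 2)

# quiet hour: supply exceeds demand, so the base fare applies
print(surge_fare(5, 12, 1.5, 0.4, requests=40, drivers=80))   # 12.3
# rush hour: demand is double the supply, so the fare doubles
print(surge_fare(5, 12, 1.5, 0.4, requests=160, drivers=80))  # 24.6
```

The same multiplier, pushed to the driver app, is what encourages drivers onto the road when the city is congested.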


In a study, Uber found that many New York passengers travel from almost the same locality to almost the same destination, and when surveyed, the majority agreed to share their cab even with a stranger. Uber Pool is a service in which a passenger is matched with another passenger who is waiting to board along the way and is headed to almost the same destination. This lets passengers share the cab with a stranger and cuts down the cost.


  Fig 3: Operation of Uber pool
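The matching behind such a pooling service can be sketched as a nearest-compatible-rider search. This is a deliberately simplified one-dimensional toy (stops along a single road), not Uber's actual matching system, which works on a full road network.

```python
def pool_match(rider, candidates, max_detour):
    """Match a waiting rider with the candidate whose pickup and drop-off
    together add the least detour, rejecting any match beyond max_detour."""
    pickup, dropoff = rider
    best, best_cost = None, float("inf")
    for other in candidates:
        cost = abs(pickup - other[0]) + abs(dropoff - other[1])
        if cost <= max_detour and cost < best_cost:
            best, best_cost = other, cost
    return best  # None when no candidate fits within the detour budget

rider = (2.0, 10.0)                      # (pickup, drop-off) positions
waiting = [(2.5, 9.5), (6.0, 11.0), (1.8, 10.2)]
print(pool_match(rider, waiting, max_detour=1.0))  # (1.8, 10.2)
```

The detour budget is the key business parameter: a tighter budget means fewer matches but happier riders.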

Uber lets the passenger rate the driver at the end of every trip based on the driver’s knowledge of the city roads, professionalism, driving ability, car quality and punctuality. The ratings are used to evaluate drivers, to coach them on skills, or even to remove them from the service.


Fig 4: Distribution of drivers by rating and self-assessment chart to the drivers from Uber app.



In Uber’s journey as an online aggregator, business analytics has come in handy to benchmark the firm among market players; to understand customer needs and driver attitudes; to create new strategies such as Uber Pool, letting passengers share a taxi and its fare; to build a rating system for managerial decisions in favour of or against drivers; and to introduce a contemporary pricing method, surge pricing, which charges passengers based on changes in demand.

A firm should turn to business analytics when making managerial decisions, making a strategic move, or launching a new product or service, because analytics replaces assumptions with firm statistical evidence for various business questions. It helps management decide faster and improves critical performance with precise data in hand. Business analytics helps a business acquire and retain customers and reduce churn. The analysis yields more insight into the market: finding target customers, evaluating the impact on them of changes in the price or service of a product, and understanding their expectations. Business analytics can thus be the brain of an organisation, enabling proactive decisions and planning the business for maximum success by looking into the future.


(The data and information used in this article are utilized from the referred sites and documents, and are not self-generated.)