
May 27 2016

Comprehensive Study of Hadoop Distributions


Anyone who is paying attention to big data will probably be aware of how hot Hadoop is right now. Hadoop is a powerful open source software framework that makes it possible to process large data sets across clusters of computers. This design makes it easy to scale quickly from a single server to thousands. With data sets distributed across commodity servers, companies can run Hadoop fairly economically and without the need for high-end hardware. A number of vendors have developed their own distributions, adding new functionality or improving the code base. Vendor distributions are designed to overcome issues with the open source edition and provide additional value to customers, with a focus on areas such as:

  • Reliability - Vendors react faster when bugs are detected, promptly delivering fixes and patches, which makes their solutions more stable.
  • Support - A variety of companies provide technical assistance, which makes it possible to adopt the platforms for mission-critical and enterprise-grade tasks.
  • Advanced system management and data management tools - Additional tools and features for security, management, workflow, provisioning, and coordination.

Several infrastructure vendors like Oracle, IBM, Cloudera, Hortonworks, EMC Greenplum, and other companies also provide their own distributions and do their best to promote them by bundling a Hadoop distribution with custom-developed systems referred to as ‘engineered systems’. Engineered systems with bundled Hadoop distributions form “engineered big data systems”.

The major players in the industry are Amazon (Elastic MapReduce), Cloudera, Hortonworks, MapR, IBM, Oracle, and EMC Greenplum.


[Figure: Comparison of the latest Hadoop distributions]

  • Amazon Web Services leads the pack due to its proven, feature-rich Elastic MapReduce subscription service.
  • IBM and EMC Greenplum (now Pivotal) offer Hadoop solutions within strong enterprise data warehouse (EDW) portfolios.
  • MapR is best if you are looking for a complete Hadoop stack with all features.
  • Cloudera includes components such as a user interface, security, and integration, and makes administration of your enterprise data hub simple and straightforward. Using Cloudera Manager you can operate your big data centrally.
  • Hortonworks is the only vendor whose distribution is built entirely from open source Hadoop components.





May 27 2016

Spark Streaming vs Flink

There is an ample number of distributed stream processing systems available on the market, but among them Apache Spark is the most widely used across organizations, perhaps due to the fundamental need for faster data processing and real-time streaming data. With the rise of a new contender, Apache Flink, one begins to wonder whether to shift from Spark Streaming to Flink. Let us understand the pros and cons of both tools in this article.

From its inception, Apache Spark (fig: 1.1) has provided a unified engine that supports both batch and stream processing workloads, whereas other systems either have a processing engine designed only for streaming, or have similar batch and streaming APIs that compile internally to different engines. Spark Streaming discretizes the streaming data into micro-batches: it receives data and buffers it in parallel on Spark's worker nodes. This enables both better load balancing and faster fault recovery. Each batch of data is a Resilient Distributed Dataset (RDD), the basic abstraction of a fault-tolerant dataset in Spark.


Fig: 1.1
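The micro-batching idea can be sketched in plain Python (a toy illustration only, not actual Spark code; the function name and batch size here are made up for the example):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Chop an unbounded event stream into fixed-size micro-batches,
    loosely mimicking how Spark Streaming discretizes a DStream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # in Spark, each batch would become one RDD

events = range(7)
print(list(micro_batches(events, 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

In real Spark Streaming the batch boundary is a time interval (the batch duration) rather than a record count, but the principle is the same: the stream is turned into a sequence of small, independently processable data sets.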

Apache Flink (fig: 1.2) is a recent big data processing tool known for processing big data quickly, with low latency and high fault tolerance, on large-scale distributed systems. Its major essence is its ability to process streaming data in real time, like Storm; it is primarily a stream processing framework that can also look like a batch processor. It is optimized for cyclic or iterative processes, achieved through optimized join algorithms, operator chaining, and reuse of partitioning.


Fig: 1.2

Both systems aim to be a single platform where you can run batch, streaming, interactive, graph processing, ML, and other workloads. Flink provides event-level granularity, which Spark Streaming does not, since it is essentially a fast batch processor. Due to the intrinsic nature of batches, support for windowing is very limited in Spark Streaming, and Flink rules over Spark Streaming with better windowing mechanisms: Flink allows windows based on processing time, event time, or number of records, all of which can be customized. This flexibility makes the Flink streaming API very powerful compared to Spark Streaming's.
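The two windowing policies mentioned above can be illustrated with a small Python sketch (a conceptual toy, not Flink's actual API; the timestamps and window sizes are invented for the example):

```python
from collections import defaultdict

def tumbling_event_time_windows(events, window_ms):
    """Group (timestamp_ms, value) pairs into tumbling windows keyed by
    event time -- the semantics Flink's event-time windows provide."""
    windows = defaultdict(list)
    for ts, value in events:
        # Round each timestamp down to the start of its window.
        windows[ts // window_ms * window_ms].append(value)
    return dict(windows)

def count_windows(events, size):
    """Windows of a fixed number of records, another policy Flink supports."""
    return [events[i:i + size] for i in range(0, len(events), size)]

events = [(5, 'a'), (12, 'b'), (14, 'c'), (27, 'd')]
print(tumbling_event_time_windows(events, 10))
# {0: ['a'], 10: ['b', 'c'], 20: ['d']}
```

The key difference from Spark Streaming is that event-time windows are driven by the timestamps carried in the data, not by when the batch happened to arrive, so late or out-of-order records can still land in the right window.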

While Spark Streaming follows a procedural programming model, Flink follows a distributed data flow approach. So, whenever intermediate results are required, broadcast variables are used to distribute the pre-calculated results to all the worker nodes.
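The broadcast idea can be simulated in plain Python (a minimal sketch, not Spark's or Flink's real broadcast mechanism; the exchange-rate table and its values are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Pre-calculated lookup table, computed once on the "driver".
# In Spark this would be wrapped in a broadcast variable so each
# worker receives one read-only copy instead of one copy per record.
EXCHANGE_RATES = {'EUR': 1.08, 'GBP': 1.27}  # hypothetical values

def convert(order, rates):
    """A worker task: reads the shared copy of the rates table,
    so the table never has to be re-shipped with every record."""
    currency, amount = order
    return round(amount * rates[currency], 2)

orders = [('EUR', 100.0), ('GBP', 10.0)]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda o: convert(o, EXCHANGE_RATES), orders))
print(results)  # [108.0, 12.7]
```

The design point is the same in both frameworks: ship a read-only, pre-computed result to every worker once, rather than attaching it to each task or record.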

Among the similarities between Spark Streaming and Flink are exactly-once guarantees (correct results, even in failure cases), which eliminate duplicates, and very high throughput compared to other processing systems like Storm. Both also provide automatic memory management.

For example, if you need to compute the cumulative sales for a shop over a specific time interval, batch processing can do it with ease; but when an alert has to be created the moment a value reaches its threshold, the situation is better tackled by stream processing.
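The batch-versus-stream contrast in that example can be sketched as follows (a toy illustration; the sales figures and threshold are invented):

```python
def cumulative_sales(batch):
    """Batch style: compute the total over a closed interval in one pass."""
    return sum(batch)

def threshold_alerts(stream, threshold):
    """Stream style: raise an alert the moment the running total crosses
    the threshold, without waiting for the interval to close."""
    total = 0
    for i, sale in enumerate(stream):
        total += sale
        if total >= threshold:
            yield (i, total)  # alert as soon as the condition holds

sales = [120, 80, 250, 40, 300]
print(cumulative_sales(sales))             # 790
print(next(threshold_alerts(sales, 400)))  # (2, 450)
```

The batch version cannot say anything until the whole interval is available, while the streaming version fires on the third record, which is exactly the property an alerting use case needs.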

Let us now take a deep dive and analyze the features of Spark Streaming and Flink.


Though Spark has a lot of advantages in batch data processing, it still has many cases to cater to in streaming. Flink can handle batch processing, but there it cannot be compared with Spark in the same league. At this point in time, Spark is a much more mature and complete framework than Flink. But it appears that Flink is taking big data stream processing to the next level altogether.


May 12 2016

Big lessons from big data implementation – Part I

Each day 23 billion GB of data are generated, and the rate of generating big data doubles roughly every 40 months! Apart from their business data, organizations now also have humongous amounts of data available from Google, Facebook, Amazon, etc. They wish they could use all the available data to find information useful for doing their business better. Let us look into the big data deployments of a few organizations and learn from their experience.

Case 1: Rabobank

Rabobank is a Dutch multinational banking and financial services company headquartered in Utrecht, Netherlands. It is a global leader in food and agro financing and sustainability-oriented banking. Rabobank started developing a big data strategy in July 2011, creating a list of 67 possible big data use cases. These use cases included:

  • To signal and predict risks and prevent fraudulent actions against the bank;
  • To identify customer behavior and obtain a 360-degree customer profile;
  • To recognize the most influential customers as well as their networks;
  • To be able to analyze mortgages;
  • To identify the channel of choice for each customer.

For each of these categories they roughly estimated the time needed for implementation as well as the value proposition. In the end, Rabobank moved forward with the big data applications that improved business processes, as these offered the possibility of a positive ROI. A dedicated, highly skilled, multidisciplinary team was created to start on the big data use cases, using Hadoop to analyze the data. Social data, open data, and trend data were also integrated, so their data approach had to deal with a deluge of semi-structured and unstructured data. Hadoop is only part of a big data strategy; the keys to success were the multidisciplinary team and the fact that they embraced uncertainty and accepted that mistakes would be made along the way.

Problems faced during implementation

Rabobank didn’t store raw data, due to cost and capacity issues. Data quality was not constant, and the security stakes were very high. Rabobank noticed that it was often unclear who owned the data, as well as where all the data was stored. Hadoop is different from older database and data warehousing systems, and those differences confused users.


Lessons learned

  1. Specialized knowledge as well as visualization is very important to drive big data success.
  2. Start with the basics and don’t stop at stage one: big data implementation is a continuous journey to reap data-driven insights.
  3. Not having the right skills for the job can be a big problem.
  4. Underestimating the complexity of a big data system implementation is dangerous, so focus on data management.


Case 2: Obamacare (Patient Protection and Affordable Care Act)

In 2010 the newly elected government of the United States introduced the Patient Protection and Affordable Care Act. The main purpose of this act was to make the best of public and private insurance coverage available to the population while controlling and reducing healthcare costs, and it requires citizens to interact with the government via a website to do so. The system is in essence a big data implementation, with data being collected on a potential population in excess of 300 million people across the entire country. Unfortunately the project did not progress as planned, and became mired in technological controversy.

Problems faced during implementation

  • The standoff over this act brought the country close to defaulting on its debt
  • Cost of Obamacare: $1.6 trillion
  • Estimated cost for 2014-2024: $3.8 trillion

How the problems could have been prevented:

  • Specialized knowledge as well as visualization could have prevented the losses.


Lessons learned

  1. Underestimating the complexity of a big data system implementation is dangerous, so focus on data management.
  2. Prior analysis and prediction of data complexity can prevent failures.
  3. Most of the data collected and stored in an agency’s transaction processing systems lacks adequate integrity, so make sure that captured data meet integrity standards.
  4. Specialized knowledge as well as visualization is very important to drive big data success.
  5. Not having the right skills for the job can be a big problem.


We shall analyze a few more cases tomorrow. Keep watching this space.


Views expressed on this article are based solely on publicly available information.  No representation or warranty, express or implied, is made as to the accuracy or completeness of any information contained herein.  Aaum expressly disclaims any and all liability based, in whole or in part, on such information, any errors therein or omissions therefrom.

