
Hadoop and MapReduce: Big Data Analytics [gartner]

Category: Hadoop

Download: http://dl.medcl.com/get.php?id=29&path=books%2Fgartner%2CHadoop+and+MapReduce+Big+Data+Analytics.7z



14 January 2011

Marcus Collins

Burton IT1 Research Note G00208798

Big data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solution to address business and technology trends that are disrupting traditional data management and processing. Enterprises can gain a competitive advantage by being early adopters of big data analytics.

Table of Contents

  • Summary of Findings
  • Analysis
    • Disruptive Changes on Traditional Data Processing
    • Increasing Data Volumes
    • Increasing Data Complexity
    • Increasing Analysis Complexity
    • Changing Analysis Model
    • Increased Acceptance of Open Source Software
    • Increasing Availability of Cost-Effective Compute and Storage
    • Data-Processing Pipeline
    • Structured Data
    • Big Data Analytics Classification
      • Speed of Decision Making
      • Processing Complexity
      • Transactional Data Volumes
      • Data Structure
      • Flexibility of Processing/Analysis
      • Throughput
      • Summary
    • Big Data Analysis Technology
      • Data Management Frameworks
        • HDFS
        • HBase
      • Processing Frameworks
        • Hadoop
        • MapReduce Processing Framework
        • Hadoop Continued
        • Oozie
      • Development Frameworks
        • Pig
        • Hive
      • Modeling Frameworks
        • Avro
      • Management Frameworks
      • Integration Frameworks
        • Data Movement Between Hadoop and Relational Databases
        • Hadoop as an ETL and Processing Engine
        • In-Database MapReduce
      • Open Source Hadoop
        • Cloudera’s Distribution for Hadoop
      • Hadoop in the Cloud
      • Hadoop Hardware
      • Hadoop Futures
      • Hadoop Use Cases
    • Big Data Analysis as a Competitive Advantage
      • Strengths
      • Weaknesses
  • Recommendations
    • Adopt Big Data Analytics and the Hadoop Open Source Project to Meet the Challenges of the Changing Business and Technology Landscape
    • Adopt a Packaged Hadoop Distribution to Reduce Technical Risk and Increase the Speed of Implementation
    • Be Selective About Which Hadoop Projects to Implement
    • Use Hadoop in the Cloud for Proof of Concept
    • The Big Data Analytics Initiative Should Be a Joint Project Involving Both IT and Business
    • Enterprises Should Not Delay Implementation Just Because of the Technical Nature of Big Data Analytics
    • Adapt Existing Architectural Principles to the New Technology


List of Tables

  • Table 1. Structured Data Definitions
  • Table 2. Cloudera’s Distribution for Hadoop
  • Table 3. Big Data Analytics Business Patterns

List of Figures

  • Figure 1. Traditional Data-Processing Pipeline
  • Figure 2. Emerging Data-Processing Pipeline
  • Figure 3. Speed of Decision Making
  • Figure 4. Processing Complexity
  • Figure 5. Transactional Data Volumes
  • Figure 6. Data Structure
  • Figure 7. Flexibility of Processing/Analysis
  • Figure 8. Throughput
  • Figure 9. Summary of Big Data Characteristics
  • Figure 10. Big Data Analysis Technology Stack
  • Figure 11. HDFS Cluster Data Redundancy
  • Figure 12. MapReduce Example
  • Figure 13. MapReduce Workflow
  • Figure 14. Example Hadoop Process Definition Language
  • Figure 15. Hadoop as an ETL Engine
  • Figure 16. Cloudera’s Distribution for Hadoop
  • Figure 17. BI Process

Summary of Findings

Bottom Line: Big data analytics and the Apache Hadoop open source project are rapidly emerging as the preferred solution to business and technology trends that are disrupting the traditional data management and processing landscape. Even though big data analytics can be technically challenging, enterprises should not delay implementation. As the Hadoop projects mature and business intelligence (BI) tool support improves, big data analytics implementation complexity will decrease, but the early-adopter competitive advantage will also wane. Technology implementation risk can be reduced by adapting existing architectural principles and patterns to the new technology and changing requirements rather than rejecting them.

Context: Key business and technology trends are disrupting the traditional data management and processing landscape. Data analysis is increasingly being viewed as a competitive advantage. An increasingly sensor-enabled and instrumented business environment is generating huge volumes of data. Data is increasing in complexity as enterprises look to exploit the value locked up in non-structured data. Analysis-model complexity is growing as an increasingly multi-faceted world perspective influences decisions. Traditional IT infrastructure is simply not able to meet the demands of this new situation. Many enterprises are asking if big data analytics and the Hadoop open source project should be the solution to this unmet commercial need.

Take-Aways: Organizations searching for big data analysis solutions should keep the following in mind:

  • Evaluate big data analytics and the Hadoop open source project’s ability to meet new analytical requirements.
  • Enterprises should consider adopting a packaged Hadoop distribution (e.g., Cloudera’s Distribution for Hadoop) to reduce the technical risk and increase speed of implementation of the Hadoop initiative.
  • Be selective about which Hadoop projects to implement:
    • The Hadoop open source project consists of a wide selection of sub-projects.
    • Be selective about the sub-projects that you implement, and have a clear rationale for each selection.
    • Consider the following projects as forming a foundation:
      • Hadoop Distributed File System (HDFS) and HBase as data management frameworks
      • MapReduce and Oozie as processing frameworks
      • Pig and Hive as development frameworks to increase programmer productivity
  • Use Hadoop in the cloud for proof of concept:
    • Hadoop in the cloud provides a cost-effective and quick implementation solution for a big data analysis proof-of-concept.
    • Once the proof of concept has been verified, bring the project in-house and build a Hadoop center of excellence to showcase the results to other business units while gaining experience with implementing and integrating MapReduce jobs for analysis and with running Hadoop in a production environment.
  • The big data analytics initiative should be a joint project involving both IT and business:
    • IT should be responsible for deploying the right big data analysis tools and implementing sound data management practices.
    • Both groups should understand that success will be measured by the value added by business improvements that are brought about by the initiative.
    • As success is achieved, showcase the results to other business units.
  • Enterprises should not delay implementation just because of the technical nature of big data analytics:
    • As the Hadoop projects mature and BI tool support becomes available, the complexity of implementing big data analytics will decrease.
    • Early adopters will gain competitive advantage and invaluable experience, which will sustain the advantage as the technology matures and gains wider acceptance.
  • Adapt existing architectural principles to the new technology:
    • The risk of implementing big data analytics and the Hadoop technology stack will be reduced by adapting existing architectural principles and patterns to the new technology and changing requirements rather than rejecting them.
  • Given the engineering challenge of building a Hadoop cluster at scale, the hardware implementation should be carefully planned.

Conclusion: Numerous business and technology trends are disrupting the traditional data management and processing landscape. Big data analytics and the Hadoop open source project are rapidly emerging as the preferred solution. Enterprises should not delay implementation just because of the technical nature of big data analytics. As the Hadoop projects mature and BI tool support improves, the complexity of implementing big data analytics will decrease. The risk of implementing this technology will be reduced by adapting existing architectural principles and patterns to the new technology and changing requirements rather than rejecting them.



Analysis

Until mid-2009, the data management landscape was simple: Online transaction processing (OLTP) systems supported the enterprise’s business processes, operational data stores (ODSs) accumulated the business transactions to support operational reporting, and enterprise data warehouses (EDWs) accumulated and transformed business transactions to support both operational and strategic decision making (see Figure 1).

Figure 1. Traditional Data-Processing Pipeline

Source: Gartner (January 2011)

A number of key business and technology trends are emerging to make the traditional model more complex.



Disruptive Changes on Traditional Data Processing

Data analysis is increasingly seen as a competitive advantage as the traditional methods become ubiquitous. Technology is readily available at low cost; business processes are easily mimicked; and increased scale through mergers and acquisitions (M&As) offers limited benefits, is extremely complex, and is slow to pay off (even given some of the spectacular failures, this approach still has proponents [e.g., United and Continental]).

Enterprises are increasingly asking how they can compete in this environment — many enterprises see data analysis as one solution. Data analysis can provide insight into customers, customer buying patterns, and supply chains. This insight leads to more timely situational awareness, lower costs, and increased agility. Data analysis can enable new business models (e.g., Amazon mining clickstream data to drive sales, Netflix mining customer preferences, and consumer package goods manufacturers analyzing point-of-sale data to gain insight into customer buying patterns).



Increasing Data Volumes

Data growth rates of 40% to 60% are not uncommon in today’s enterprises. This increase constitutes a considerable financial burden on IT organizations. Some of this growth can be attributed to increased compliance requirements, but a key driver of the dramatic increase in data volumes is the increasingly sensor-enabled and instrumented world. Examples include RFID tags, vehicles equipped with GPS sensors, low-cost remote sensing devices, instrumented business processes, and instrumented website interactions. These are having an effect across a number of industry verticals:

  • Retailers collect clickstream data from website interactions and loyalty card data from traditional retailing operations. Point-of-sale information has traditionally been used by retailers for shopping basket analysis, stock replenishment, and so on, but this data is increasingly also being provided to suppliers for customer buying analysis.
  • Healthcare has traditionally been dominated by paper-based systems, but this is changing as more healthcare data is digitized (e.g., medical imaging and patient records).
  • Science is increasingly being dominated by big science, and therefore big data, initiatives:
    • Large-scale experiments (e.g., Large Hadron Collider [LHC] at CERN generates over 15 PB of data per year; this is more data than can be stored and processed within CERN’s data center and therefore the data is immediately transferred to other laboratories for processing)
    • Increasing scale as continental-scale experiments become both politically and technologically feasible (e.g., Ocean Observatories Initiative [OOI], National Ecological Observatory Network [NEON], and USArray, a continental-scale seismic observatory)
    • Improving instrument and sensor technology (e.g., the Large Synoptic Survey Telescope [LSST] has a 3.2 Gpixel camera and will generate over 6 PB of image data per year)
  • Bioinformatics has reduced the time to sequence a genome from years to days, and has reduced costs to the point where sequencing an individual’s genome for $1,000 will be feasible, paving the way for improved diagnostics and personalized medicine.
  • Financial services are seeing larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading.



Increasing Data Complexity

There has been an implicit assumption that the majority of valuable data is structured (see the ‘Structured Data’ section of this assessment for a definition of this term). It has been assumed that the 80% of data that was non-structured was therefore of limited value. The success of search engine providers and e-retailers in unlocking the value in Web clickstream data was one of the first examples of the successful processing of large volumes of unstructured data. Other industries have taken note, and the requirement to analyze and mine this data in conjunction with existing structured data is increasingly on the agenda as enterprises look for a competitive advantage.



Increasing Analysis Complexity

With data complexity comes analysis complexity. Examples include image processing for face recognition, search engine classification of videos, and complex data integration during geospatial processing. In addition to supporting traditional transaction-based analysis (e.g., financial performance), Web clickstream data enables behavioral analysis. Behavioral analytics is designed to determine patterns of behavior from human/human and human/system interaction data and requires large volumes of data in order to build an accurate model. These patterns of behavior may provide insight into which series of actions lead to an event (e.g., a customer sale or product switch). Once these patterns have been determined, they can be used in transaction processing to influence a customer’s decision.
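To make the idea concrete, here is a toy Python sketch of the kind of pattern mining described above: counting which short action sequences most often precede a purchase event. The session data, action names, and the "purchase" event are all invented for the illustration, not taken from the research note.

```python
from collections import Counter

def preceding_patterns(sessions, target="purchase", window=2):
    """Count the action subsequences of length `window` that
    immediately precede the target event across all sessions."""
    counts = Counter()
    for actions in sessions:
        for i, action in enumerate(actions):
            if action == target and i >= window:
                counts[tuple(actions[i - window:i])] += 1
    return counts

# Hypothetical clickstream sessions (one list of actions per visitor).
sessions = [
    ["home", "search", "product", "reviews", "purchase"],
    ["home", "product", "reviews", "purchase"],
    ["search", "product", "exit"],
    ["home", "search", "product", "reviews", "purchase"],
]

print(preceding_patterns(sessions).most_common(1))
# [(('product', 'reviews'), 3)]
```

Once a pattern such as "product page, then reviews" is known to precede a sale, it can be detected during transaction processing and used to influence the customer's decision, as the text describes.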



Changing Analysis Model

Models of transactional data are well understood, and much of the value from this analysis has already been realized. Models of behavioral interactions are less well understood, and large volumes of data may be needed to build accurate models. In addition, research suggests that a sophisticated algorithm with little data is less accurate than a simple algorithm with a large volume of data. Examples include voice and handwriting recognition and crowdsourcing (see “The Unreasonable Effectiveness of Data”).



Increased Acceptance of Open Source Software

Many open source software projects have reached the required level of sophistication, stability, support, and documentation to make them applicable and appropriate for inclusion in product evaluations for production-quality and large-scale software projects. The number and breadth of open source projects make deploying a data-processing infrastructure based entirely on open source products possible — an example might comprise data analysis and statistical processing using R, data-processing framework using Hadoop, and data management using MySQL.



Increasing Availability of Cost-Effective Compute and Storage

Cloud computing, commodity hardware, and open source software are changing the economics of data processing. Computational and data storage capacity are no longer constrained by acquisition costs. This is allowing enterprises to ask what’s possible with the new “economics of compute.”



Data-Processing Pipeline

Many analysts describe the increase in both structured and non-structured data as big data. Big data expands the traditional data-processing pipeline as shown in Figure 2.

Figure 2. Emerging Data-Processing Pipeline

Source: Gartner (January 2011)

The schematic highlights three areas to show where big data is having a major influence:

  • Complex-event processing (CEP) is concerned with processing and reacting to discrete events in-flight as they enter the enterprise. High processing speed and associated decision making enable technology-adopting organizations to better see and react to changes in the business environment. CEP adopters gain a competitive advantage over organizations that perform only retrospective analysis on historic data. CEP is addressed in “Business Insight Through Real-Time Analysis on Data in Motion.”
  • BI has traditionally been concerned with analysis of transactional data contained within relational databases (i.e., structured data). Increasingly, though, the BI infrastructure should be architected as a federated system. This federated system supports deployment of specialized database management systems (DBMSs) tailored to the storage and analysis requirements of the data they hold. The specialized data structures that are appropriate to the data warehouse space are addressed in “Data Warehouses: Navigating the Maze of Technical Options.”
  • Big data analytics is concerned with the analysis of large volumes of transaction/event data and behavioral analysis of human/human and human/system interactions. Big data analytics is the focus of the rest of this assessment.

BI infrastructure spans CEP, BI, and big data analytics. Big data analytics is increasingly important to enterprises as they strive to compete. It is undergoing continuous product innovation where no clear architectural patterns are emerging and has a (relatively) constrained functional domain — high-volume transaction and behavioral analysis.

Organizations adopting big data must address the following three questions:

  • What defines big data, and how is it differentiated from traditional data-processing and management patterns?
  • Is the EDW and its associated relational databases the correct architecture for big data?
  • If not, what is the preferred architectural pattern for this processing?



Structured Data

Unstructured is an overused term when applied to data formats. Rather than divide data into two levels (i.e., structured and unstructured), a four-level classification better differentiates data across traditional relational, XML, messaging, and multimedia formats. The four levels are structured, semi-structured, quasi-structured, and unstructured. Table 1 provides examples of these types of data.

Table 1. Structured Data Definitions

  • Structured: Relational database (i.e., full atomicity, consistency, isolation, and durability [ACID] support, referential integrity, strong typing, and schema support)
  • Semi-structured: Structured data files that include metadata and are self-describing (e.g., netCDF and HDF5)
  • Semi-structured: XML data files that are self-describing and defined by an XML schema
  • Quasi-structured: Web clickstream data, which contains some inconsistencies in data values and format
  • Unstructured: Text documents amenable to text analytics
  • Unstructured: Images and video

Source: Gartner (January 2011)


With a clearer definition of data levels, how can big data analytics be differentiated from traditional data-processing architectures?



Big Data Analytics Classification

Big data analytics can be differentiated from traditional data-processing architectures along a number of dimensions:

  • Speed of decision making
  • Processing complexity
  • Transactional data volumes
  • Data structure
  • Flexibility of processing/analysis
  • Throughput



Speed of Decision Making

Data volumes have a major effect on “time to analysis” (i.e., the elapsed time between data reception, analysis, presentation, and decision-maker action). There are four architectural options (i.e., CEP, OLTP/ODS, EDW, and big data), and big data is most appropriate when addressing slow decision cycles based on large data volumes. CEP’s need to process hundreds or thousands of transactions per second requires that decision making be automated using models or business rules. OLTP and ODS support the operational reporting function, in which decisions are made at human speed and based on recent data. The EDW — with the time to integrate data from disparate operational systems, process transformations, and compute aggregations — supports historic trend analysis and forecasting. Big data analysis enables the analysis of large volumes of data — larger than can be processed within the EDW — and so supports long-term/strategic and one-off transactional and behavioral analysis.

Figure 3. Speed of Decision Making

Source: Gartner (January 2011)

In summary, big data analytics supports both long-term/strategic and one-off transactional and behavioral analysis.



Processing Complexity

Processing complexity is the inverse of the speed of decision making. In general, CEP has a relatively simple processing model — although CEP often includes the application of behavioral models and business rules that require complex processing on historic data occurring in the EDW or big data analytics phases of the data-processing pipeline. The requirement to process unstructured data at real-time speeds — for example, in surveillance and intelligence applications — is changing this model. Processing complexity increases through OLTP, ODS, and EDW. Two trends are emerging: OLTP is beginning to include an analytics component within the business process and to utilize in-database analytics. The EDW is exploiting the increasing computational power of the database engine. Processing complexities, and the associated data volumes, are so high within the big data analytics phase that parallel processing is the preferred architectural and algorithmic pattern.
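The parallel pattern referred to here is MapReduce. A minimal, single-process Python sketch of its map, shuffle, and reduce phases follows, using word count as the canonical example; a real Hadoop job would distribute the same phases across a cluster, but the data flow is the same.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for one key.
    return key, sum(values)

records = ["big data analytics", "big data"]
mapped = chain.from_iterable(map_phase(r) for r in records)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 2, 'analytics': 1}
```

Because each map call sees only one record and each reduce call sees only one key's values, both phases can run on many nodes at once, which is why this pattern suits the high processing complexity and data volumes described above.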

Figure 4. Processing Complexity

Source: Gartner (January 2011)

In summary, big data analytics has a high processing complexity, and parallel processing is the preferred architectural and algorithmic pattern.



Transactional Data Volumes

Transactional data volume is the amount of data (either the number of records/events or the event size) processed within a single transaction or analysis operation. CEP processes a small number of discrete base events to compute a complex event. OLTP is similarly concerned with transactional or atomic events. Analysis, with its requirement to process many records simultaneously, starts with the ODS, and its complexity grows within the EDW. Big data analytics — with the requirement to model long-term trends and customer behavior on Web clickstream data — processes even larger transactional data volumes.

Figure 5. Transactional Data Volumes

Source: Gartner (January 2011)

In summary, big data analytics requires a large transaction volume to support transactional analysis and behavioral model building.



Data Structure

The prevalence of non-structured data (semi-, quasi-, and unstructured) increases as the data-processing pipeline is traversed from CEP to big data. The EDW layer is becoming more heterogeneous as other, often non-structured, data sources are required by the analysis being undertaken. This is having a corresponding effect on processing complexity. The mining of structured data is advanced, and systems and products are optimized for this form of analysis. The mining of non-structured data (e.g., text analytics and image processing) is less well understood, computationally expensive, and often not integrated into commercially available analysis tools and packages. One of the primary uses of big data analysis is processing Web clickstream data, which is quasi-structured. In addition, the data is not stored within databases; rather, it is collected and stored within files.

Some examples of non-structured data that fit the big data definition include log files, clickstream data, shopping cart data, social media data, call or support center logs, and telephone call detail records (CDRs).
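A short illustration of why such data is only "quasi"-structured: fields are mostly regular but cannot be trusted to follow a fixed schema, so parsing must be defensive. The log format below is invented for the example.

```python
def parse_clickstream(line):
    """Best-effort parse of a quasi-structured log line.
    Fields may be missing, empty, or inconsistently formatted,
    so each one is extracted defensively rather than validated
    against a fixed schema."""
    record = {"raw": line}
    for part in line.strip().split():
        if "=" in part:
            key, _, value = part.partition("=")
            record[key] = value or None  # tolerate "key=" with no value
    return record

# Hypothetical log lines: note the missing user and the empty field.
lines = [
    "ts=2011-01-14T10:00:01 user=42 page=/home",
    "ts=2011-01-14T10:00:07 page=/checkout ref=",
]
records = [parse_clickstream(l) for l in lines]
print(records[1]["page"])  # /checkout
```

In a file-based big data pipeline, a tolerant parser like this typically runs inside the map phase, turning raw log files into records that downstream analysis can aggregate.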

There is an increasing requirement to process unstructured data at real-time speeds — for example in surveillance and intelligence applications — so this class of data is becoming more important in CEP processing.

Figure 6. Data Structure

Source: Gartner (January 2011)

In summary, non-structured data, which is often file based, predominates in big data analytics.



Flexibility of Processing/Analysis

Data management stakeholders understand the processing and scheduling requirements of transactional processing and operational reporting. The stakeholders’ ability to build analysis models is well proven. Peaks and troughs commonly occur across various time intervals (e.g., the overnight batch processing window or peak holiday period), but these variations have been studied through trending and forecasting. Big data analysis and a growing percentage of EDW processing are ad hoc or one-off in nature. Data relationships may be poorly understood and require experimentation to refine the analysis. Big data analysis models (“analytic heroes”) are continually being challenged by new or refined models (“challengers”) to see which has better performance or yields better accuracy. The flexibility of such processing is high, and conversely, the governance that can be applied to such processing is low.
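The challenger-versus-champion comparison described above can be sketched in a few lines of Python; the models and hold-out data here are, of course, invented for the illustration.

```python
def accuracy(model, data):
    # Fraction of labeled examples the model predicts correctly.
    return sum(model(x) == y for x, y in data) / len(data)

def evaluate(champion, challenger, holdout):
    """Promote the challenger only if it beats the current champion
    on the same held-out data; otherwise keep the champion."""
    if accuracy(challenger, holdout) > accuracy(champion, holdout):
        return challenger
    return champion

# Hypothetical models: will a visitor with n page views buy?
champion = lambda views: views > 5
challenger = lambda views: views > 3

holdout = [(2, False), (4, True), (6, True), (1, False), (7, True)]
best = evaluate(champion, challenger, holdout)
print(best is challenger)  # True
```

Running this cycle continuously (retrain, re-evaluate, promote) is exactly the kind of ad hoc, experimental workload that demands the flexible architecture the text describes.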

Figure 7. Flexibility of Processing/Analysis

Source: Gartner (January 2011)

In summary, big data analytics techniques are continually evolving and being challenged and therefore require a flexible architecture.



Throughput

Throughput, a measure of the degree of simultaneous execution of transactions, is high in transactional and reporting processing. The high data volumes and complex processing that characterize big data analysis are often hardware constrained and have low concurrency. The scheduling of big data analysis processing is not time-critical. Big data analysis is therefore not suitable for real-time or near-real-time requirements.

Figure 8. Throughput

Source: Gartner (January 2011)

In summary, big data analytics techniques are not time-critical and are therefore not suitable for real-time or near-real-time requirements.



Summary

If the enterprise has an unmet business need for:

  • Strategic decision making that . . .
  • Has a high degree of processing complexity that . . .
  • Needs to process high volumes of data that is predominantly non-structured where . . .
  • The analysis technique is continually being evolved or challenged and . . .
  • Low concurrency is acceptable.

Then it should explore the benefits of being an early adopter of big data analytics to gain a competitive advantage over more risk-averse enterprises.

Figure 9. Summary of Big Data Characteristics

Source: Gartner (January 2011)

The techniques currently being developed for big data analytics should be on the IT R&D agenda.

The tools/techniques, architectural patterns, and use cases where big data analysis is applicable are very much a work in progress. The list of constraints above is only a guide, and many organizations are exploring the use of big data analysis techniques and tools outside these areas. Two things are clear:

  1. There is a growing awareness of this technology, and enterprises are actively looking at where the application of the big data analysis tools and techniques can give them a competitive edge.
  2. Big data analytics is undergoing huge (at times chaotic) change and innovation. Big data analysis is new and has few best practices; expect to learn through failures and the successes of peers. On this, be careful about looking to the large Web properties: their systems are often legacy, were built in haste as user counts grew dramatically, and often lack sound architectural principles. Additionally, the scale they have had to address may not be applicable to your problem.



Big Data Analysis Technology

What is the big data analysis technology stack and what are the emerging architectural patterns?

The big data analysis technology stack is divided into six layers:

  • Integration frameworks
  • Management frameworks
  • Modeling frameworks (i.e., Avro)
  • Development frameworks (i.e., Pig and Hive)
  • Processing frameworks (i.e., Hadoop and Oozie)
  • Data management frameworks (i.e., HDFS and HBase)

The products (open source projects) mentioned are a snapshot of what is currently under development as of January 2011. These products are widely accepted and used within the big data community and are sufficiently feature-rich to warrant inclusion in a big data initiative.

Figure 10. Big Data Analysis Technology Stack

Source: Gartner (January 2011)


Data Management Frameworks

The data management framework provides the capability to store and access large volumes of data. The framework consists of two components — Hadoop Distributed File System (HDFS) and HBase.



HDFS

HDFS is a distributed file system that is optimized for the storage of large data files with a write-once, read-many access pattern. The HDFS architecture de-optimizes time to access and optimizes time to read: the time required to access the first record is sacrificed to accelerate the time required to read a complete file. HDFS is therefore appropriate where a small number of large files are to be stored and processed. HDFS is not appropriate where low-latency data access is required (real-time applications), where lots of small files are to be stored, or where random access to a file is required for both reads and writes (writes to HDFS always occur at the end of a file [i.e., append mode]).

An HDFS cluster has two types of nodes: a namenode and multiple datanodes. The namenode is responsible for the file system tree and file metadata. Datanodes are responsible for file block access (store and retrieve) when requested to do so by either clients or the namenode. Hadoop is written in Java, so most HDFS access uses the Java FileSystem class, although other programming languages are supported (e.g., C through the libhdfs library). In addition to this programmatic access, HDFS provides interfaces accessible via HTTP and FTP.
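The namenode/datanode split can be illustrated with a toy model of the read path. This simulates only the division of responsibility described above (metadata on the namenode, block contents on datanodes); it is not real HDFS client code, and all names and block contents are invented.

```python
# Toy model of the HDFS read path: the namenode holds only metadata
# (which blocks make up a file, and where each block's replicas
# live); the datanodes hold the actual block contents.
namenode = {
    "/logs/click.log": ["blk_1", "blk_2"],  # file -> ordered block list
}
block_locations = {
    "blk_1": ["datanode1", "datanode3"],    # block -> replica hosts
    "blk_2": ["datanode2", "datanode3"],
}
datanodes = {
    "datanode1": {"blk_1": b"first 64MB..."},
    "datanode2": {"blk_2": b"second 64MB..."},
    "datanode3": {"blk_1": b"first 64MB...", "blk_2": b"second 64MB..."},
}

def read_file(path):
    """Client asks the namenode for the block list and locations,
    then streams each block from a datanode that holds a replica."""
    data = b""
    for block in namenode[path]:
        host = block_locations[block][0]    # pick any available replica
        data += datanodes[host][block]
    return data

print(read_file("/logs/click.log"))  # b'first 64MB...second 64MB...'
```

Note that the namenode never touches file contents, which is why it can coordinate a large cluster while remaining a single (metadata-only) node.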

The terms “HDFS cluster” and “Hadoop cluster” are effectively synonymous: HDFS datanodes are also MapReduce task execution nodes. The Hadoop framework coordinates execution with data requirements, thus minimizing data movement (i.e., execution on local data).

HDFS reliability and scalability are provided through data redundancy and replication. In the default configuration, HDFS stores each data block on three separate datanodes within the cluster. Because HDFS is rack-aware, it attempts to place the replicas on datanodes in separate racks.

Figure 11. HDFS Cluster Data Redundancy


Source: Gartner (January 2011)



HBase

HBase is the Hadoop database. HBase provides the capability to perform random read/write access to data. It is architected as a distributed, versioned, column-oriented database designed to store very large tables (billions of rows with millions of columns) on a cluster of commodity hardware. HBase is layered over HDFS and exploits its distributed storage architecture, and it has the same master (i.e., coordinating) and slave node architecture as HDFS. The HBase master node is responsible for distributing the database onto regionservers and for recovering from regionserver failures. The regionservers are responsible for executing client read/write requests. HBase supports neither SQL nor ACID properties, and, as with HDFS, most access to HBase is from Java. HBase is closely integrated with MapReduce and provides a native interface so that it can be used as a source or sink in MapReduce jobs. In addition to this programmatic access, HBase provides Representational State Transfer (REST) interfaces.

Although not a relational database, HBase has the same table, row, and column constructs. Each row has a primary key, which can be of any type and determines the sort order of the rows. While HBase provides the capability for secondary indices, they are poorly supported at present. Each cell (the intersection of a row and column) is versioned and can hold any data type. Column families are defined when an HBase table is created, but columns within a family can be added dynamically; the columns in a family are stored together on the file system.
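The versioned, column-family data model described above can be illustrated with a small Python sketch. This models the storage layout only; it is not the HBase client API, and all class, method, and key names are invented:

```python
# Toy model of HBase's data layout: every cell is addressed by
# (row key, column family, column qualifier) and stores multiple
# timestamped versions of its value.
class VersionedTable:
    def __init__(self):
        # (row, family, qualifier) -> list of (timestamp, value) versions
        self._cells = {}

    def put(self, row, family, qualifier, timestamp, value):
        # Append a new version rather than overwriting the old one.
        self._cells.setdefault((row, family, qualifier), []).append((timestamp, value))

    def get(self, row, family, qualifier):
        # Return the most recent version, mirroring HBase's default read behavior.
        versions = self._cells.get((row, family, qualifier), [])
        return max(versions)[1] if versions else None

t = VersionedTable()
t.put("user#42", "profile", "status", 1, "new")
t.put("user#42", "profile", "status", 2, "active")
# t.get("user#42", "profile", "status") returns the latest value, "active"
```

The essential point is that the "schema" lives in the row keys and qualifiers supplied at write time, which is why disciplined key design matters so much in practice.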

A comment from Hadoop: The Definitive Guide, Second Edition brings the difference between HBase and traditional DBMSs into sharp relief, “We currently have tables with hundreds of millions of rows and tens of thousands of columns; the thought of storing billions of rows and millions of columns is exciting, not scary.”

Many readers, including the author, will find the idea of a table with “tens of thousands of columns” and no intrinsic metadata unusable, ungovernable, and, yes, scary! Just because the technology allows such a structure does not make it good architectural practice. Database designers should apply and adapt the best practices they employ when designing column-oriented databases when working with HBase’s more flexible data structures.



Processing Frameworks

The data-processing framework provides the capability to process the large volumes of data stored within the Hadoop distributed file system and database. The framework consists of two components — Hadoop and Oozie.



Hadoop

Hadoop is the name of the Apache project that comprises many of the projects described in this section. But the name is now widely used when describing the open source implementation of the MapReduce processing framework.



MapReduce Processing Framework

MapReduce is a programming model first developed by Google for the processing of large datasets. A MapReduce job consists of two functions, map and reduce, and a framework for running a large number of instances of these functions on commodity hardware. The map function reads a set of records from an input file, processes these records, and outputs a set of intermediate records of the generic form (Key, Data). As part of the map phase, a split function distributes the intermediate records across many buckets using a hash function. The reduce function then processes the intermediate records.

An example of MapReduce is the requirement to count the number of words that occur within a document or set of documents. The map function splits each document into words and outputs each word together with the digit “1.” The output records are therefore of the form (word, 1). The MapReduce framework groups all the records with the same key (i.e., word) and feeds them into the reduce function. The reduce function sums the input values and outputs the word and the total number of occurrences in the document(s). A simple and abbreviated schematic of this is shown in Figure 12.
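The word count flow just described can be sketched in a few lines of pure Python. This is an illustration of the programming model only, not Hadoop code; the function names are invented:

```python
from collections import defaultdict

def map_phase(document):
    # Emit an intermediate (word, 1) record for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(records):
    # Group intermediate records by key, as the MapReduce framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the grouped values to produce the total count per word.
    return {word: sum(ones) for word, ones in groups.items()}

counts = reduce_phase(shuffle(map_phase("to be or not to be")))
# counts is {"to": 2, "be": 2, "or": 1, "not": 1}
```

In a real cluster, many map_phase and reduce_phase instances run in parallel on separate nodes, and the shuffle step moves data across the network.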

Figure 12. MapReduce Example


Source: Gartner (January 2011)


The primary benefits of the MapReduce framework are its scalability, the flexibility it allows in the type of data that can be accessed, and the flexibility of the processing within the map and reduce functions. Scalability is provided through scale-out: data is stored within the Hadoop Distributed File System (HDFS), and both the map and reduce functions within a given MapReduce job are parallelized. The map function can read from standard flat files (e.g., clickstream logs), Web crawls, and database tables. There is no restriction on the programming language used or on the type of processing the map function can perform. For example, the map function could be written in Java, Perl, or Python and perform unstructured text analysis or statistical analysis. This flexibility should be compared with the limited processing supported by the SQL language, even including the increased sophistication provided by the major database vendors (in-database processing, stored procedures, and functions).



Hadoop Continued

MapReduce has the same master (coordinating) and slave node architecture as HDFS and HBase. The MapReduce JobTracker is responsible for controlling and coordinating the MapReduce job. The MapReduce TaskTrackers run the tasks that the MapReduce job has been split into. The TaskTrackers keep track of their own progress and periodically update the JobTracker with their status. This architecture makes MapReduce jobs highly resilient to failure. In the event that a TaskTracker fails or becomes unresponsive, the JobTracker will resubmit the task. Developers should therefore assume that each task will execute several times, perhaps concurrently; the code must therefore be re-entrant and idempotent.

MapReduce and HDFS are closely aligned — the map (and reduce) functions execute directly on the datanodes, thus processing data that resides locally on these servers. Data need only be transferred to a different node to execute the reduce function after a considerable data reduction has been achieved. In addition, HDFS is rack-aware and so minimizes the network bandwidth of moving data off-rack when possible.

The distributed nature of Hadoop processing makes the coordination of common configurations challenging. A component of Hadoop, called ZooKeeper, addresses these issues. ZooKeeper is a centralized coordination service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.



Oozie

Even a simple MapReduce job consists of many map and reduce functions. As the complexity of the analysis increases, a series of interdependent MapReduce jobs will be required (see Figure 13). The definition, execution, and control of this workflow is the role of Oozie, a workflow/coordination service developed by Yahoo.

Figure 13. MapReduce Workflow


Source: Gartner (January 2011)


Oozie coordinates a workflow consisting of a set of actions and flow-control nodes. An action node is a MapReduce job, Pig job, or a shell command. Flow-control nodes include start, stop, decision, fork, and so on. The workflow is organized as a collection of nodes arranged in a control dependency directed acyclic graph (DAG). The workflow definition is XML based and is called the Hadoop Process Definition Language (hPDL). For an example of hPDL consisting of two MapReduce jobs, see Figure 14.
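A minimal hPDL workflow chaining two MapReduce actions can be sketched as follows. All names are illustrative and the per-job configuration is elided:

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="first-job"/>
  <action name="first-job">
    <map-reduce>...</map-reduce>
    <ok to="second-job"/>
    <error to="fail"/>
  </action>
  <action name="second-job">
    <map-reduce>...</map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action names its successor on success (ok) and on failure (error), which is how the control dependency DAG is expressed.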

Figure 14. Example Hadoop Process Definition Language


Source: Gartner (January 2011)


The workflow capabilities provided by Oozie, and the ensuing increase in reliability and in visibility into the complex nature of the analysis, make it a key component of the big data analysis technology stack as a big data initiative migrates from proof of concept into production.



Development Frameworks

Writing a MapReduce job (with multiple map and reduce functions) is challenging, and the development framework provides tools to enable increased programmer productivity. The framework consists of two architectural components: Pig and Hive.



Pig

Pig provides a higher level of abstraction for MapReduce, or data flow, processing. Pig consists of two components: Pig Latin, a language used to express data flows, and a compiler. A Pig Latin program implements a given data flow as a series of data transformation steps. Pig Latin differs from SQL in that SQL performs a transformation within a single statement (e.g., select, group by, and having), while Pig Latin explicitly details each step of the transformation. In addition, the Pig execution environment understands the parallel nature of the MapReduce framework. Pig Latin provides support for types (e.g., long, float, and chararray), schemas, and functions, and is extensible through user-defined functions. The compiled nature of Pig allows the programmer to focus on the high-level semantics of the analysis, leaving optimization to the Pig execution environment; the compiler generates the boilerplate Java MapReduce code that would otherwise have to be written by hand, leading to higher programmer productivity and efficient execution. There is currently a performance overhead of approximately 20% over a native Java implementation; expect this overhead to shrink as the project matures.
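As an illustration of this step-by-step style, the word count analysis might be expressed in Pig Latin roughly as follows. The file paths and aliases are hypothetical:

```pig
-- Each line is one explicit transformation in the data flow.
lines  = LOAD 'input/documents.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO 'output/wordcounts';
```

The compiler translates this five-step flow into one or more MapReduce jobs, with the GROUP step mapping onto the shuffle between the map and reduce phases.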



Hive

Hive is described as a data warehouse infrastructure built on top of Hadoop. Hive implements a query language (Hive QL), based on SQL syntax, that can be used to access and transform data held within HDFS. The execution of a Hive QL statement generates a MapReduce job to transform the data as the statement requires. Translating this to the RDBMS vernacular, Hive QL can be considered part view and part stored procedure. Two differentiators between Hive QL and SQL are that Hive QL jobs are optimized for throughput (time to return all rows), not latency (time to return the first row), and that Hive QL implements a subset of the SQL language.

Hive QL is designed for the execution of analysis queries and supports aggregates and sub-queries in addition to select and join operations. Complex processing is supported through user-defined functions implemented in Java and by embedding MapReduce scripts directly within the Hive QL.
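A Hive QL analysis query might look like the following sketch. The table and column names are hypothetical; the statement compiles into a MapReduce job over data in HDFS:

```sql
-- Count page hits per URL for one day of clickstream data.
SELECT page, COUNT(*) AS hits
FROM click_logs
WHERE log_date = '2011-01-14'
GROUP BY page;
```

The GROUP BY aggregation maps naturally onto the MapReduce model: the map phase emits (page, 1) records and the reduce phase sums them, much like the word count example earlier in this assessment.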

The promise of Hive QL is the ability to access data from within HDFS, HBase, or a relational database (via Java Database Connectivity [JDBC]), thus allowing full interoperability between data within the existing BI domains and the big data analytics infrastructure. Support for Hive QL among BI tool vendors is currently poor (Pentaho for Hadoop is an early implementer of this technology); expect the number of vendors to increase as the stability and feature set of Hive mature and the adoption of Hadoop increases.



Modeling Frameworks

MapReduce programs read and write data as key/value pairs. Hadoop provides a pluggable serialization framework to support conversion of data structures or objects into a bit stream so that the data can be efficiently stored in a file system or transmitted across a network. The framework consists of a single architectural component: Avro. Avro is not the only serialization framework that can be used within MapReduce jobs — Apache Thrift and Google protocol buffers are two popular alternatives.



Avro

Avro is a serialization system that provides rich data structures that can be integrated with both dynamically and statically typed languages. Avro transmits data (over remote procedure call [RPC]) or stores data (as a container) in a compact binary data format. When Avro data is stored in a file, the schema is stored with it to support access by programs at a later date. Avro schemas are modeled using JavaScript Object Notation (JSON).
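An Avro schema is declared in JSON. A minimal record schema might look like the following; the record and field names are invented for illustration:

```json
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "user_id", "type": "long"},
    {"name": "referrer", "type": ["null", "string"]}
  ]
}
```

The union type ["null", "string"] is how Avro expresses an optional field; because the schema travels with the data file, a reader written years later can still decode the records.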



Management Frameworks

To date, the Hadoop project has not focused on the management of the Hadoop cluster, and the lack of guidance on building, configuring, administering, and managing a production-quality Hadoop cluster at scale (perhaps exceeding 1,000 nodes) will be a major impediment to the adoption of the technology. Open source tools for the monitoring of large clusters (e.g., Ganglia and Nagios) do exist, but they do not provide the comprehensive functionality required, nor are they Hadoop-aware. What is required is an integrated management framework that provides monitoring and administration at three levels:

  • Hardware: Servers, storage, and network
  • Hadoop cluster: Cluster configuration
  • Workload: Resource management of workflows (Oozie) and jobs (MapReduce)

This lack of open source support for a comprehensive management framework mirrors the early years of Linux and MySQL, where the focus was on features and not ease of implementation. Commercial vendors see this gap as an opportunity to work with enterprise customers to increase adoption by reducing technical risk, speeding implementation, and thereby shortening the time needed to achieve a positive ROI. Cloudera is an early Hadoop implementer (see the “Cloudera’s Distribution for Hadoop” section of this assessment) that has developed a management framework that includes:

  • An authorization management and provisioning tool that enables organizations to monitor and manage user activity within a Hadoop cluster
  • A resource management tool that lets administrators monitor usage of assets within a Hadoop cluster
  • An integration, configuration, and monitoring tool that eases the management of data integration points into a Hadoop cluster

Enterprises should include the evaluation of a comprehensive management framework in the Hadoop initiative as a means to reduce technical risk, speed implementation, and shorten the time needed to achieve a positive ROI.



Integration Frameworks

Hadoop is becoming an integral part of the enterprise BI infrastructure, and its integration with the traditional BI stack is growing in importance. The integration takes two forms: bidirectional movement of data between Hadoop (HDFS) and the EDW, and execution of Hadoop MapReduce jobs from within the EDW (i.e., in-database MapReduce).



Data Movement Between Hadoop and Relational Databases

Data can be moved between Hadoop and a relational database as a bulk data transfer, or relational tables can be accessed from within a MapReduce map function.

Cloudera’s Distribution for Hadoop (see the “Cloudera’s Distribution for Hadoop” section of this assessment) provides a bulk data transfer tool (i.e., Sqoop) that imports individual tables or entire databases into HDFS files. The tool also generates Java classes that support interaction with the imported data. Sqoop supports all relational databases over JDBC, and Quest Software provides a connector (i.e., OraOop) that has been optimized for access to data residing in Oracle databases.

MapReduce provides a mechanism to access relational database tables directly from within a map function, using JDBC via the DBInputFormat class. Care should be taken when invoking this mechanism: the parallelized nature of the MapReduce framework and the non-sharded structure of most databases mean that many concurrent database connections will be opened, which can make database access a performance bottleneck. Direct database access is best suited to situations in which a small number of database records are to be joined with a far larger volume of records within HDFS. A corresponding DBOutputFormat class is provided to write job output directly to a relational database.



Hadoop as an ETL and Processing Engine

One architectural pattern that is gaining prominence exploits the data transformation, data reduction, and analysis capabilities of the MapReduce framework to prepare data for loading and further analysis in either a relational EDW or HDFS/Hive data warehouse.

Figure 15. Hadoop as an ETL Engine


Source: Gartner (January 2011)


An early implementer of this approach for using Hadoop as an extraction, transformation, and loading (ETL) engine is Pentaho for Hadoop.



In-Database MapReduce

Two commercial database vendors provide the ability to execute MapReduce functions directly within the database engine — Aster Data Systems and Greenplum. Oracle 11g‘s table functions can mimic MapReduce, but the non-massively parallel processing (MPP) architecture of Oracle means that these will not be parallelized. Both Aster Data nCluster and Greenplum Database are based on MPP architecture and implement in-database MapReduce in a similar fashion. In-database MapReduce is implemented as user-written table functions that can be invoked from within the SQL FROM clause. When invoked, the map function reads data from either database tables (Aster Data and Greenplum) or files (Greenplum — although not based on HDFS). The SQL statement can then join this table function with other database tables or transform it using the group by statement. Aster Data’s table functions can be developed in Java, C#, Python, or Ruby. Greenplum’s table functions can be developed in either Python or Perl.

In-database analytics is increasing in importance as the computational power of the database engines increases. SQL is understood by all BI tools and is familiar to analysts and programmers, so the learning curve required to exploit the power of the MapReduce framework is reduced.



Open Source Hadoop

Although open source has clear cost advantages over commercial software, it also has a number of disadvantages. The project nature of Apache creates a large number of siloed products (the Hadoop ecosystem consists of Hadoop Common, HDFS, MapReduce, ZooKeeper, Avro, Chukwa, HBase, Hive, Mahout, and Pig). Product and version selection is confusing, which leads to issues with version incompatibility. Integration across multiple cooperating open source projects can lead to overly complex configurations, and there is little institutional knowledge of big data analytics and Hadoop implementation practices. The expected implementation scale of real-world projects will create a need to customize simplistic out-of-the-box configurations. In addition, monitoring and administration functions are often the last to be incorporated into emerging open source projects. Adopters should expect increased administration overhead and unreliable operations.

Current Hadoop maturity mirrors the early years of Linux and MySQL. Enterprises are again requesting a single, tested package of products with production-quality administration and monitoring functions, along with consulting and training services to speed implementation. Cloudera is an early Hadoop implementer that bundles enterprise-grade product and service offerings. Expect similar vendor offerings and support to grow as Hadoop adoption increases.



Cloudera’s Distribution for Hadoop

Cloudera’s Distribution for Hadoop (CDH) consists of the following projects in a single distribution (see Table 2).

Table 2. Cloudera’s Distribution for Hadoop

HDFS: Hadoop Distributed File System

MapReduce: Parallel data-processing framework

Hadoop Common: A set of utilities that support the Hadoop subprojects

HBase: Hadoop database for random read/write access

Hive: SQL-like queries and tables on large datasets

Pig: Data flow language and compiler

Oozie: Workflow for interdependent Hadoop jobs

Sqoop: Integration of databases and data warehouses with Hadoop

Flume: Configurable streaming data collection

ZooKeeper: Coordination service for distributed applications

Hue: User interface framework and software development kit (SDK) for visual Hadoop applications

Source: Cloudera 2010


Cloudera Enterprise offers CDH and a set of management tools designed to lower the cost and complexity of administration and production support services; this includes 24/7 problem resolution support, consultative support, and support for certified integrations.

Figure 16. Cloudera’s Distribution for Hadoop


Source: Cloudera 2010


CDH or an equivalent distribution should be evaluated as a means to reduce technical risk and increase the speed of implementation to reduce the time needed to achieve a positive ROI.



Hadoop in the Cloud

Amazon offers Amazon Elastic MapReduce as a Web service that utilizes a hosted Hadoop framework running on the Web-scale infrastructure of Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The Apache distribution of Hadoop comes with a set of utilities that makes it straightforward to deploy and run on EC2. The Amazon Elastic MapReduce cluster is initiated with a single master node (JobTracker and namenode) and multiple worker nodes (TaskTracker and datanodes) using the supplied scripts.

Running Hadoop on EC2 is particularly appropriate where the data already resides within Amazon S3. To execute the MapReduce job, a Java Archive (JAR) file is transferred to the EC2 cluster and executed; data is both read from and written to Amazon S3. If the data does not reside on Amazon S3, then it must be transferred, and the Amazon Web Services (AWS) pricing model includes charges for network transfer; Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and Amazon S3 pricing. Given the volume of data required to run big data analytics, the cost of transferring data and running the analysis should be carefully considered.

MapReduce and HDFS are closely aligned with the map functions executing directly on local server data. In Amazon Elastic MapReduce, this topology is not present. Data resides on S3, and MapReduce jobs execute on EC2 servers. The performance advantages inherent in operating on locally attached storage will not be seen when executing in the Amazon cloud environment.

Given the large data volumes inherent in big data analytics, running Hadoop in the cloud is appropriate where the data already resides in Amazon S3 or as a prototyping activity — perhaps on a sample of data or as a one-off analysis where the computational resources do not exist in-house.



Hadoop Hardware

Hadoop runs on commodity hardware; but, for anything more than a simple proof of concept, commodity does not mean a heterogeneous set of surplus machines. As Hadoop moves from proof of concept to full-blown initiative, plan for success. Data sampling is dead, and the success of a big data analytics initiative cannot be shown with a small-scale pilot. Big data analytics is about meeting unmet commercial needs that cannot be met with existing technology. Given the engineering challenge of building a Hadoop cluster at scale, the hardware implementation should be carefully planned. Many of the engineering solutions associated with cloud computing infrastructure are directly applicable to deploying and operating Hadoop clusters.

Additional space, cooling, and power may be required in existing data centers, and because few hardware vendors provide a Hadoop cluster as a single stock-keeping unit (SKU), consistency of build is important. Server selection should focus on the differing requirements of core nodes (JobTracker, namenodes), data nodes (datanodes, TaskTracker) and edge nodes (data movement in/out of the cluster). Core and edge nodes will require increased reliability, redundancy, and support. Hadoop assumes data nodes will fail, so support for these servers can be reduced.

No standard hardware configuration exists for data nodes. When the initiative starts, it will be unclear whether the MapReduce job profile is data or compute centric. Provide headroom in memory, cores, and disk so that the hardware can be adjusted as the job profile becomes better understood. The data-aware nature of HDFS and MapReduce means that a physical, rather than virtualized, deployment model is preferred.

Network bandwidth may become an issue. Hadoop is rack-aware and reduces network traffic as it moves data around during the reduce phase. Consider whether increasing network bandwidth or provisioning additional nodes is the more cost-effective approach.

Installing, configuring, and implementing Hadoop requires sophisticated instance management, network security engineering, data partitioning, and job coordination. Consider deploying a cluster management utility to reduce the administration effort involved in providing consistency of build and cluster operations (i.e., monitoring and administration).

A detailed analysis of these topics is contained in the Intel white paper “Optimizing Hadoop Deployments.”



Hadoop Futures

Hadoop should be considered an IT-centric tool. Installing, configuring, and administering a production-scale Hadoop cluster requires considerable system administration expertise. Interacting with Hadoop requires a detailed knowledge of programming languages. This is less of a problem with Pig and Hive, but even these should not be considered end-user tools. Widespread adoption of Hadoop requires innovation in two areas:

  • Close integration with existing BI tools: Support for interaction with Hadoop data and processing using the Hive and Pig projects is improving. Expect this interaction model to mature, improve, and gain additional functionality as adoption increases. The ability of the BI tools to support exploration of extremely large volumes of data is limited, so Gartner expects existing BI tools and new entrants to focus on the second area.
  • A non-technical user interaction (UI) model tailored for the manipulation of extremely large volumes of data: Extreme data volumes, combined with the weaker semantic understanding of non-structured data relative to structured data, mean that data exploration and data visualization techniques will increase in importance.

Data exploration builds on the ideas developed in “The Business Intelligence Investment: Realizing the Benefits.” This research developed a structured BI process (see Figure 17) as a way to yield predictable and repeatable results, to be more efficient to execute, and to provide the necessary transparency required by governance processes.

Figure 17. BI Process


Source: Gartner (January 2011)

Back to List of Figures

Back to Table of Contents

Exploration emphasizes the inner process flow (frame problem, design analysis, gather data, and execute/interpret). Design analysis should allow users to develop their ideas using a common UI metaphor (e.g., a workflow through tabs on a Web page or linked worksheets in a spreadsheet). Gather data must provide a data catalog that includes both structured data in relational and non-relational databases (e.g., Hive) and non-structured datasets contained in an HDFS file system. Execute/interpret will include the ability to execute MapReduce jobs and view the result-set in a variety of visualization forms appropriate to the data and analysis type.

Few tools exist today that provide this exploratory user-interaction model combined with strong visualization support. One that deserves mention is IBM BigSheets, which builds on the spreadsheet interaction model to support data gathering, exploration, and processing. Various visualization tools are supported, including IBM Many Eyes, which offers network diagrams, scatter plots, and matrix charts to see relationships among data points; bar charts, histograms, and bubble charts to compare sets of values; and word trees and tag clouds to analyze text.



Hadoop Use Cases

Hadoop is being used as an analysis tool in a wide array of business situations and industries. Table 3 is a snapshot of the types of business problems for which big data analytics is currently being applied.

Table 3. Big Data Analytics Business Patterns

Financial services:

  • Discover fraud patterns based on multiple years’ worth of credit card transactions, on a time scale that does not allow new patterns to accumulate significant losses.
  • Measure transaction-processing latency across many business processes by processing and correlating system log data.

Internet retailer:

  • Discover fraud patterns in Internet retailing by mining Web click logs. Assess risk by product type and session/Internet Protocol (IP) address activity.

Retailers:

  • Perform sentiment analysis by analyzing social media data.

Drug discovery:

  • Perform large-scale text analytics on publicly available information sources.

Healthcare:

  • Analyze medical insurance claims data for financial analysis, fraud detection, and preferred patient treatment plans.
  • Analyze patient electronic health records to evaluate patient care regimes and drug safety.

Mobile telecom:

  • Discover mobile phone churn patterns based on analysis of call detail records (CDRs) and correlation with activity in subscribers’ networks of callers.

IT technical support:

  • Perform large-scale text analytics on help desk support data and publicly available support forums to correlate system failures with known problems.

Scientific research:

  • Analyze scientific data to extract features (e.g., identify celestial objects from telescope imagery).

Internet travel:

  • Improve product ranking (e.g., of hotels) by analysis of multiple years’ worth of Web click logs.

Source: Gartner (January 2011)


For a larger list of use cases, see the “PoweredBy” section on the Apache Hadoop website.



Big Data Analysis as a Competitive Advantage

The starting point for a big data analytics initiative should be an unmet commercial need that cannot be met with existing technology (e.g., EDW and its associated in-database processing). Even if Hadoop and its related projects are version 0.x, they are a foundation for a production quality mechanism to address these needs.

The big data analytics model is different from the traditional analytical model. Big data analytics applies a simple model to a large volume of data, whereas the traditional approach applies a complex analytical model to a small volume of data (see “The Unreasonable Effectiveness of Data”). The algorithm is therefore no longer a competitive advantage; increasingly sophisticated analysis algorithms are provided directly by the DBMS vendors. Competitive advantage is gained by the application of the algorithm and the effectiveness of the decision-making process that acts on the analysis.

The notion that sampling is no longer required or preferred has implication throughout the analysis process:

  • Analysis at scale will be required to see meaningful results.
  • Data visualization is the only way to make sense of the data and the results as the data volume increases.
  • Experimentation will increase in importance because existing analyses will be continually challenged (champion vs. challenger models).
  • Edge cases will become normal; adapt testing regimes to accommodate these situations.

The big data analytics initiative should be a joint project involving both IT and business. IT should be responsible for deploying the right big data analysis tools and implementing sound data management practices. Both groups should understand that success will be measured by the value added by business improvements that are brought about by the initiative. As success is achieved, showcase the results to other business units.

Big data analytics requires large volumes of data. For many organizations this will not be an issue. For others, additional instrumentation of business processes will be required.

Big data analytics is an emerging field, and insufficient skilled people exist within many organizations or in the job market. Your team should actively train individuals within the organization. The skills shortage will be particularly acute within government projects for which security clearance is required.

Existing data analysts have a variety of toolsets to assist with the data analysis process. The technical nature of the MapReduce framework (which requires developing Java code), the lack of institutional knowledge on how to architect the Hadoop infrastructure, and the lack of analysis tool support for data residing within HDFS or HBase are causing enterprises to delay the adoption of many of these technologies. There are many examples of the business benefits of exploiting this technology to address an unmet requirement, so enterprises should not delay its implementation. As the Hadoop projects mature and BI tool support improves, the cost, time, and complexity of implementing big data analytics will fall, but so too will the competitive advantage of being an early adopter.
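To make the programming burden concrete, the sketch below mimics the map, shuffle, and reduce phases of a word count (the canonical MapReduce example) in plain Python. It illustrates the programming model only; it is not the actual Hadoop Java API, and all names here are illustrative.

```python
from collections import defaultdict

def map_phase(record):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group intermediate values by key, as the framework
    # does automatically between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: sum the counts emitted for one word.
    return (key, sum(values))

lines = ["Hadoop scales out", "Hadoop stores data in HDFS"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts["hadoop"])  # -> 2
```

Even this trivial aggregation requires the developer to think in terms of key-value pairs and distinct processing phases, which is the gap that higher-level tools such as Pig and Hive aim to close.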

It would be wrong to assume that the architectural principles developed over many years and successive technologies do not apply to big data analytics. Adapt existing architectural principles and practices to the new technology and changing requirements. This is especially so when the technology stack is predominantly open source: the siloed nature of these projects often leads to an overly complex architecture.

Back to Table of Contents


Strengths

Big data analytics and associated technology stacks (i.e., Hadoop and MapReduce) directly address challenges posed by a changing business landscape. The cost-performance curve of commodity servers and storage is putting seemingly intractable complex analysis on extremely large volumes of data within the budget of an increasingly large number of enterprises. Many Hadoop projects are in their infancy, but a vibrant community of contributors means that the feature set and stability of the projects are rapidly improving. This technology promises to be a significant competitive advantage to early adopter organizations.

Back to Table of Contents


Weaknesses

As with many emerging technologies, few comprehensive standards or established architectural principles and little institutional knowledge exist for implementing the Hadoop and MapReduce technology stacks. The technical nature of the MapReduce framework when compared to traditional analysis tools and techniques is a valid cause for concern but not sufficient to delay adoption. Many enterprises will be tempted to throw out their established architectural principles when adopting the predominantly open source technology stack of Hadoop. If this occurs, there is a high probability that it will lead to an overly complex configuration with associated cost and reliability issues. Installing, configuring, and implementing Hadoop requires sophisticated instance management, network security engineering, data partitioning, and job coordination. Given the engineering challenge of building a Hadoop cluster at scale, the hardware implementation should be carefully planned in conjunction with the data center operations staff.

Back to Table of Contents


Recommendations

The following recommendations provide advice on key factors to consider when adopting big data analysis and the Hadoop projects.

Back to Table of Contents


Adopt Big Data Analytics and the Hadoop Open Source Project to Meet the Challenges of the Changing Business and Technology Landscape

A number of key business and technology trends are emerging that disrupt the traditional data management and processing landscape. Data analysis is increasingly being viewed as a competitive advantage; an increasingly sensor-enabled and instrumented business environment is generating huge volumes of data; data is increasing in complexity as enterprises look to exploit the value locked up in non-structured data; and the complexity of the analysis required to make decisions in an increasingly multifaceted world is growing. Traditional IT infrastructures are simply not able to meet the demands of this new situation. Consider adopting big data analytics and the Hadoop open source project to meet the challenges of the changing business and technology landscape.

Back to Table of Contents


Adopt a Packaged Hadoop Distribution to Reduce Technical Risk and Increase the Speed of Implementation

While open source has clear cost advantages over commercial software, it also has a number of disadvantages. The open source project structure often creates a large number of siloed products, so product and version selection is confusing, and integrating these software products can lead to an overly complex architecture. There is little institutional knowledge of implementing Hadoop, and given the scale of the implementations, out-of-the-box configurations are unlikely to be appropriate. In addition, monitoring and administration functions are often the last to be developed, which leads to increased administration overhead and unreliable operations. Enterprises should consider adopting a packaged Hadoop distribution to reduce technical risk, increase the speed of implementation, and shorten the time to achieve ROI.

Back to Table of Contents


Be Selective About Which Hadoop Projects to Implement

The Hadoop open source project consists of a wide selection of sub-projects. Be selective about the projects that you implement, and have a clear rationale for each selection. Consider the following projects as forming a foundation:

  • HDFS and HBase as data management frameworks
  • MapReduce and Oozie as processing frameworks
  • Pig and Hive as development frameworks to increase programmer productivity
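As a rough sketch of the productivity gap Pig and Hive close: an analyst writes a short declarative query, while the same aggregation would otherwise be hand-coded as a MapReduce job. The HiveQL below is shown only as a string (a real Hive installation would compile it into MapReduce jobs), and the table, column, and sample data are hypothetical; the procedural equivalent is simulated over an in-memory sample.

```python
from collections import Counter

# What an analyst would write in Hive (hypothetical table and column names):
hive_query = "SELECT page, COUNT(*) FROM clicks GROUP BY page"

# The aggregation a developer would otherwise hand-code as a MapReduce job,
# simulated here over an in-memory sample of click records:
clicks = ["home", "search", "home", "checkout", "home", "search"]
page_counts = Counter(clicks)
print(page_counts.most_common(1))  # -> [('home', 3)]
```

One line of SQL-like text versus an entire compiled job is the essence of the programmer-productivity argument for adopting these development frameworks.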

Back to Table of Contents


Use Hadoop in the Cloud for Proof of Concept

Hadoop in the cloud provides a cost-effective and quick implementation solution for a big data analysis proof of concept. Once the proof of concept has been verified, bring the project in-house and build a Hadoop center of excellence to showcase the results to other business units while gaining experience with implementing and integrating MapReduce jobs for analysis and with running Hadoop in a production environment.

Back to Table of Contents


The Big Data Analytics Initiative Should Be a Joint Project Involving Both IT and Business

As noted above, the big data analytics initiative should be a joint project involving both IT and business: IT is responsible for deploying the right big data analysis tools and implementing sound data management practices, and both groups should measure success by the business value the initiative delivers. As success is achieved, showcase the results to other business units.

Back to Table of Contents


Enterprises Should Not Delay Implementation Just Because of the Technical Nature of Big Data Analytics

Enterprises should not delay implementing the Hadoop open source projects just because of the technical nature of big data analytics. As the Hadoop projects mature and BI tool support improves, the complexity of implementing big data analytics will reduce. Early adopters will gain an early competitive advantage and invaluable experience with this technology that, if exploited, will sustain the advantage as the technology matures and gains wider acceptance.

Back to Table of Contents


Adapt Existing Architectural Principles to the New Technology

The risk of implementing big data analytics and the Hadoop technology stack will be reduced by adapting existing architectural principles and patterns to the new technology and changing requirements rather than rejecting them.

Back to Table of Contents

© 2011 Gartner, Inc. and/or its affiliates. All rights reserved. Gartner is a registered trademark of Gartner, Inc. or its affiliates. This publication may not be reproduced or distributed in any form without Gartner’s prior written permission. The information contained in this publication has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information and shall have no liability for errors, omissions or inadequacies in such information. This publication consists of the opinions of Gartner’s research organization and should not be construed as statements of fact. The opinions expressed herein are subject to change without notice. Although Gartner research may include a discussion of related legal issues, Gartner does not provide legal advice or services and its research should not be construed or used as such. Gartner is a public company, and its shareholders may include firms and funds that have financial interests in entities covered in Gartner research. Gartner’s Board of Directors may include senior managers of these firms or funds. Gartner research is produced independently by its research organization without input or influence from these firms, funds or their managers. For further information on the independence and integrity of Gartner research, see “Guiding Principles on Independence and Objectivity” on its website, http://www.gartner.com/technology/about/ombudsman/omb_guide2.jsp

Acronym Key and Glossary Terms

ACID

atomicity, consistency, isolation, and durability

AWS

Amazon Web Services

BI

business intelligence

CDH

Cloudera’s Distribution for Hadoop

CDR

call data record

CEP

complex-event processing

DAG

directed acyclic graph

DBMS

database management system

EC2

Elastic Compute Cloud

EDW

enterprise data warehouse

ETL

extraction, transformation, and loading

HDFS

Hadoop Distributed File System

hPDL

Hadoop Process Definition Language

IP

Internet Protocol

JAR

Java Archive

JDBC

Java Database Connectivity

JSON

JavaScript Object Notation

LHC

Large Hadron Collider

LSST

Large Synoptic Survey Telescope

M&A

merger and acquisition

MPP

massively parallel processing

NEON

National Ecological Observatory Network

ODS

operational data store

OLTP

online transaction processing

OOI

Ocean Observatories Initiative

REST

Representational State Transfer

RPC

remote procedure call

S3

Simple Storage Service

SDK

software development kit

SKU

stock-keeping unit

UI

user interface
