Apache Spark Ebook Free Download Pdf
What is Apache Hadoop?
Apache Hadoop® is an open source software framework that provides highly reliable distributed processing of large data sets using simple programming models. Hadoop, known for its scalability, is built on clusters of commodity computers, providing a cost-effective solution for storing and processing massive amounts of structured, semi-structured and unstructured data with no format requirements.
A data lake architecture including Hadoop can offer a flexible data management solution for your big data analytics initiatives. Because Hadoop is an open source software project and follows a distributed computing model, it can offer a lower total cost of ownership for a big data software and storage solution.
Hadoop can also be installed on cloud servers to better manage the compute and storage resources required for big data. Leading cloud vendors such as Amazon Web Services (AWS) and Microsoft Azure offer hosted Hadoop services. Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.
The Hadoop ecosystem
The Hadoop framework, built by the Apache Software Foundation, includes:
- Hadoop Common: The common utilities and libraries that support the other Hadoop modules. Also known as Hadoop Core.
- Hadoop HDFS (Hadoop Distributed File System): A distributed file system for storing application data on commodity hardware. It provides high-throughput access to data and high fault tolerance. The HDFS architecture features a NameNode to manage the file system namespace and file access and multiple DataNodes to manage data storage.
- Hadoop YARN: A framework for managing cluster resources and scheduling jobs. YARN stands for Yet Another Resource Negotiator. It lets workloads beyond MapReduce, such as interactive SQL, advanced modeling and real-time streaming, run on the same cluster.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
- Hadoop Ozone: A scalable, redundant and distributed object store designed for big data applications.
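The division of labor in HDFS described above, with one NameNode holding metadata and many DataNodes holding replicated blocks, can be illustrated with a toy model. This is a minimal sketch, not Hadoop code: the class `ToyNameNode` and its methods are hypothetical names, and the round-robin placement is a simplification (real HDFS placement is rack-aware). The 128 MB block size and replication factor of 3 match the usual HDFS defaults.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual HDFS default block size
REPLICATION = 3                 # the usual HDFS default replication factor


class ToyNameNode:
    """Hypothetical stand-in for a NameNode: it stores metadata only,
    recording which DataNodes hold the replicas of each block."""

    def __init__(self, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
        if replication > len(datanodes):
            raise ValueError("replication factor exceeds number of DataNodes")
        self.datanodes = list(datanodes)
        self.block_size = block_size
        self.replication = replication
        self.namespace = {}  # file path -> list of replica lists, one per block
        self._next = 0       # index of the DataNode that gets the next first replica

    def add_file(self, path, size_bytes):
        """Split a file into blocks and assign each block to DataNodes.

        Placement is simple round-robin (an assumption for illustration).
        """
        n_blocks = math.ceil(size_bytes / self.block_size)
        blocks = []
        for _ in range(n_blocks):
            replicas = [
                self.datanodes[(self._next + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            self._next = (self._next + 1) % len(self.datanodes)
            blocks.append(replicas)
        self.namespace[path] = blocks
        return blocks

    def readable_after_failure(self, path, failed):
        """A file stays readable while every block keeps one live replica."""
        return all(
            any(dn not in failed for dn in replicas)
            for replicas in self.namespace[path]
        )


if __name__ == "__main__":
    nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
    for i, replicas in enumerate(nn.add_file("/logs/app.log", 300 * 1024 * 1024)):
        print(f"block {i}: {replicas}")
    print(nn.readable_after_failure("/logs/app.log", {"dn1", "dn2"}))
```

With three replicas per block, the file in this sketch survives the loss of any two DataNodes, which is the kind of fault tolerance the HDFS description refers to.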
Supporting Apache projects
Enhance Hadoop with additional open source software projects
- Ambari: A web-based tool for provisioning, managing and monitoring Hadoop clusters
- Avro: A data serialization system
- Cassandra: A scalable, NoSQL database designed to have no single point of failure
- Chukwa: A data collection system for monitoring large distributed systems; built on top of HDFS and MapReduce
- Flume: A service for collecting, aggregating and moving large amounts of streaming data into HDFS
- HBase: A scalable, non-relational distributed database that supports structured data storage for very large tables
- Hive: A data warehouse infrastructure for data query and analysis in a SQL-like interface
- Mahout: A scalable machine learning and data mining library
- Oozie: A Java-based workload scheduler to manage Hadoop jobs
- Pig: A high-level data flow language and execution framework for parallel computation
- Sqoop: A tool for efficiently transferring data between Hadoop and structured data stores such as relational databases
- Submarine: A unified AI platform for running machine learning and deep learning workloads in a distributed cluster
- Tez: A generalized data flow programming framework, built on YARN; being adopted within the Hadoop ecosystem to replace MapReduce
- ZooKeeper: A high performance coordination service for distributed applications
Hadoop for developers
Apache Hadoop was written in Java, but depending on the big data project, developers can program in their choice of language, such as Python, R or Scala. The included Hadoop Streaming utility allows developers to create and execute MapReduce jobs with any script or executable as the mapper or the reducer.
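As a rough illustration of the Hadoop Streaming contract, the sketch below implements a word count whose mapper and reducer are plain Python functions over lines of text. Streaming feeds each stage its input on standard input and expects tab-separated key/value lines on standard output, with the reducer's input sorted by key. The file name `wordcount.py` and the `map`/`reduce` command-line arguments are assumptions made for this example, not part of Hadoop.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per word.

    Assumes the input is already sorted by key, which the shuffle phase
    guarantees on a cluster and `sort` provides in a local dry run.
    """
    pairs = (
        line.rstrip("\n").split("\t", 1)
        for line in sorted_lines
        if line.strip()  # skip blank lines
    )
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical entry point: `python wordcount.py map` or `python wordcount.py reduce`.
if __name__ == "__main__" and len(sys.argv) > 1:
    stage = {"map": map_lines, "reduce": reduce_lines}.get(sys.argv[1])
    if stage is None:
        sys.exit(f"unknown stage: {sys.argv[1]}")
    for record in stage(sys.stdin):
        print(record)
```

The same script can be dry-run locally with a shell pipeline such as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `sort` stands in for the shuffle phase. On a cluster, the two commands would be passed to the Streaming utility through its `-mapper` and `-reducer` options.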
Spark versus Hadoop
Apache Spark is often compared to Hadoop as it is also an open source framework for big data processing. In fact, Spark was initially built to improve the processing performance and extend the types of computations possible with Hadoop MapReduce. Spark keeps intermediate data in memory where possible, whereas MapReduce writes intermediate results to disk between stages, so Spark is often much faster, particularly for iterative workloads.
While Hadoop is best for batch processing of huge volumes of data, Spark supports both batch and real-time data processing and is well suited to streaming data and graph computations. Both Hadoop and Spark have machine learning libraries, but again, because of the in-memory processing, Spark's machine learning is typically much faster.
Use cases for Hadoop
Data-driven decisions
Better data access
Data offload
IBM Hadoop solutions
IBM and Cloudera, better together
Support predictive and prescriptive analytics for today's AI. Combine Cloudera's enterprise-grade Hadoop distribution with a single ecosystem of integrated products and services from both IBM and Cloudera to improve data discovery, testing, ad hoc and near real-time queries. Take advantage of the collaboration between IBM and Cloudera to deliver enterprise Hadoop solutions.
Related products
IBM® Db2® Big SQL
Use an enterprise-grade, hybrid ANSI-compliant, SQL-on-Hadoop engine to deliver massively parallel processing (MPP) and advanced data query.
IBM Big Replicate
Replicate data as it streams in so files don't need to be fully written or closed before transfer.
Open source databases
Capitalize more cost effectively on big data with open source databases from leading vendors such as MongoDB and EDB.
Resources
IBM + Cloudera
See how they are driving advanced analytics with an enterprise-grade, secure, governed, open source-based data lake.
How to connect more data
Add a data lake to your data management strategy to integrate more unstructured data for deeper insights.
A robust, governed data lake for AI
Explore the storage and governance technologies needed for your data lake to deliver AI-ready data.
Data lake governance
See how proven governance solutions can drive better data integration, quality and security for your data lakes.
Big data analytics courses
Choose your learning path, based on skill level, from no-cost courses in data science, AI, big data and more.
Open source community
Join the IBM community for open source data management for collaboration, resources and more.
Get started with Hadoop
Talk to an IBM big data specialist for 30 minutes at no cost.
Posted by: gastongastoncrittone0272141.blogspot.com
Source: https://www.ibm.com/analytics/hadoop