Apache Hadoop is an open-source framework that efficiently stores and processes large datasets, from gigabytes to petabytes in scale. Instead of relying on one large machine, Hadoop clusters commodity hardware and massively parallelizes processing workloads across it. Hadoop consists of four main modules:
- Hadoop Distributed File System (HDFS) – a distributed file system that resides on the cluster and provides high data throughput and fault tolerance.
- Yet Another Resource Negotiator (YARN) – manages cluster resources (CPU, memory) and schedules the jobs that run on them.
- MapReduce – a framework that lets programs perform parallel computation on the data: map tasks transform input into intermediate key-value pairs, and reduce tasks aggregate them (see the sketch after this list).
- Hadoop Common – common Java libraries that can be used across all modules.
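To make the MapReduce model concrete, below is a minimal sketch of the canonical word-count job written against the `org.apache.hadoop.mapreduce` Java API. The class name and the input/output paths taken from `args` are illustrative placeholders; a real job would be packaged into a JAR and submitted to the cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in an input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Using the reducer as a combiner pre-aggregates counts on each mapper node.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

YARN schedules the resulting map and reduce tasks across the cluster, while HDFS supplies the input splits and stores the output, which is how the three modules above fit together in practice.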
Popular applications that store, process, analyze, and manage big data on top of Hadoop include Spark, Presto, Hive, and HBase.