Hadoop Pig Tutorial: What is, Architecture, Example

Created by danielrodrigues Jun 29, 2020

Hadoop Pig Tutorial: What is, Architecture, Example

What is PIG?
Pig is a high-level programming language useful for analyzing big data sets. A pig was a result of development effort at Yahoo!
In a MapReduce framework, programs need to be translated into a series of Map and Reduce stages. However, this is not a programming model which data analysts are familiar with. So, in order to bridge this gap, an abstraction called Pig was built on top of Hadoop.
Apache Pig enables people to focus more on analyzing bulk data sets and to spend less time writing Map-Reduce programs. Similar to Pigs, who eat anything, the Pig programming language is designed to work upon any kind of data. That's why the name, Pig!
Pig Architecture
Pig consists of two components:

  1. Pig Latin, which is a language
  2. A runtime environment, for running PigLatin programs.

A Pig Latin program consists of a series of operations or transformations which are applied to the input data to produce output. These operations describe a data flow which is translated into an executable representation, by Pig execution environment. Underneath, results of these transformations are series of MapReduce jobs which a programmer is unaware of. So, in a way, Pig allows the programmer to focus on data rather than the nature of execution.
PigLatin is a relatively stiffened language which uses familiar keywords from data processing e.g., Join, Group and Filter.
Execution modes:
Pig has two execution modes:

  1. Local mode: In this mode, Pig runs in a single JVM and makes use of local file system. This mode is suitable only for analysis of small datasets using Pig
  2. Map Reduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster (cluster may be pseudo or fully distributed). MapReduce mode with the fully distributed cluster is useful of running Pig on large datasets.