Data matters. Every day we generate huge volumes of data. Processing all of this data presents challenges for many people.
Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently. Best of all, it supports many relational features, making it easy to join, group, and aggregate data. If you think this sounds a lot like an ETL tool, you’d be right. Pig has many things in common with ETL tools, if those ETL tools ran on many servers simultaneously.
Where would you use Pig?
Case 1 – Time Sensitive Data Loads
Loading data is a key part of many businesses. Data comes in from outside of the database in text, XML, CSV, or some other arbitrary file format. The data then has to be processed into a different format and loaded into a database for later querying. Sometimes there are a lot of steps involved and sometimes the data has to be translated into an intermediate format, but most of the time it gets into the database. Some failure is to be expected, right?
Loading large volumes of data can become a problem as the volume of data increases: the more data there is, the longer it takes to load. To get around this problem people routinely buy bigger, faster servers and add more fast disks. There comes a point when you can’t add more CPUs or RAM to a server and increasing the I/O capacity won’t help. Parallelizing ETL processes can be hard even on one machine, and scaling ETL out across several machines is harder still.
Pig is built on top of Hadoop, so it’s able to scale across multiple processors and servers, which makes it easy to process massive data sets. Many ETL processes lend themselves to being decomposed into manageable chunks, and jobs written in Pig are no exception. Pig builds MapReduce jobs behind the scenes to spread load across many servers. By taking advantage of the simple building blocks of Hadoop, data professionals are able to build simple, easily understood scripts to process and analyze massive quantities of data in a massively parallel environment.
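To make the idea of decomposing a job into chunks a little more concrete, here is a minimal single-machine sketch of the map and reduce pattern in Python. It is not how Pig or Hadoop are implemented; the function names, the chunk size, and the sample web log records are all made up for illustration.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Break the input into fixed-size chunks, one per hypothetical worker."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count page hits within a single chunk, independently of the others."""
    return Counter(page for page, _status in chunk)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def count_hits(records, chunk_size=2):
    """Run the map step on every chunk, then fold the partial results together."""
    partials = [map_chunk(chunk) for chunk in split_into_chunks(records, chunk_size)]
    return dict(reduce(reduce_counts, partials, Counter()))


# Hypothetical (page, HTTP status) records standing in for a web server log.
records = [
    ("/home", 200),
    ("/about", 200),
    ("/home", 404),
    ("/home", 200),
    ("/contact", 200),
]

print(count_hits(records))
```

Because each chunk is counted on its own, the map step could run on different servers, and the answer does not depend on how the input was split.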
An advantage of being able to scale out across many servers is that doubling throughput is often as easy as doubling the number of servers working on a problem. If one server can solve a problem in 12 hours, 24 servers should be able to solve it in 30 minutes.
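The arithmetic behind that estimate can be sketched in a couple of lines of Python. This assumes perfectly linear scaling, which real clusters only approximate; the function name is made up for this example.

```python
def ideal_runtime_hours(single_server_hours, servers):
    """Runtime under perfectly linear scaling (an idealized assumption)."""
    if servers < 1:
        raise ValueError("at least one server is required")
    return single_server_hours / servers


# 12 hours of work spread evenly across 24 servers.
hours = ideal_runtime_hours(12, 24)
print(f"{hours * 60:.0f} minutes")
```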
Case 2 – Processing Many Data Sources
Knowing the effectiveness of an advertisement is big business. For people buying ad space, it’s critical to know just how effective their advertising is in both the physical and virtual space. By combining advertising information from multiple sources and mixing it together with web server traffic, IP geo-location, and click-through metrics, it’s possible to gain a deeper understanding of customer behavior and judge just how effective certain ads are in certain parts of the country.
Pig isn’t just designed to scale out over many servers. Pig can be used to build complex data flows and extend them with custom code. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users’ IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.
Creating this rich view of data is possible because Pig supplies complex features like joins, sorting, grouping, and aggregation. The syntax is different from what developers are used to, but Pig’s focus on data flow makes it easy to write complex jobs. Rather than creating complex logic in SQL, developers can create jobs that walk through data step by step to deliver the best results. It’s easy to rapidly prototype these procedural jobs and performance tuning can be accomplished with relative ease.
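The step by step style is easier to see with an example. The sketch below uses plain Python rather than Pig's own syntax, and every name and value in it (the click records, the IP to region lookup, the ad names) is invented for illustration. Each step takes the output of the previous one, which is the shape a data flow job tends to have.

```python
from collections import defaultdict

# Hypothetical inputs: ad clicks keyed by IP address, and an IP-to-region lookup.
clicks = [
    {"ip": "10.0.0.1", "ad": "spring_sale"},
    {"ip": "10.0.0.2", "ad": "spring_sale"},
    {"ip": "10.0.0.1", "ad": "free_shipping"},
    {"ip": "10.0.0.3", "ad": "spring_sale"},
]
geo = {"10.0.0.1": "Oregon", "10.0.0.2": "Ohio", "10.0.0.3": "Oregon"}

# Step 1: join each click to its region (an inner join on the IP address).
located = [{**click, "region": geo[click["ip"]]} for click in clicks if click["ip"] in geo]

# Step 2: group the joined rows by (region, ad).
groups = defaultdict(list)
for row in located:
    groups[(row["region"], row["ad"])].append(row)

# Step 3: aggregate each group down to a click count.
click_counts = {key: len(rows) for key, rows in groups.items()}

# Step 4: sort by descending count, then by key so ties come out in a stable order.
report = sorted(click_counts.items(), key=lambda item: (-item[1], item[0]))

for (region, ad), count in report:
    print(region, ad, count)
```

Each intermediate result has a name, so it can be inspected on its own while prototyping.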
Case 3 – Analytic Insight Through Sampling
Even in case 2, we’ve seen how Pig can provide some analytical insight into the massive quantities of data that are generated every day in the datacenter. It’s easy to fall into the trap of thinking that Pig is an ETL glue that moves data from a log file, processes it, and drops it off for another database to consume. Pig is more than just an ETL tool.
One of Pig’s strengths is its ability to perform sampling of large data sets. As Pig manipulates data, it’s easy to reduce the set of data that we’re operating on using sampling. By sampling with a random distribution of data, we can reduce the amount of data that needs to be analyzed and still deliver meaningful results.
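As a rough illustration of why sampling works, the Python sketch below keeps each row with a fixed probability and compares the average of the sample to the average of the full data set. The data is synthetic, the function name and the 1% fraction are assumptions made for the example, and this is a sketch of the general technique rather than of Pig's internals.

```python
import random


def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Synthetic response times in milliseconds, standing in for a large log.
response_times_ms = [100 + (i % 50) for i in range(100_000)]

# Keep roughly 1% of the rows.
sample = sample_rows(response_times_ms, fraction=0.01, seed=42)

full_mean = sum(response_times_ms) / len(response_times_ms)
sample_mean = sum(sample) / len(sample)

print(f"rows: {len(response_times_ms)} full, {len(sample)} sampled")
print(f"mean: {full_mean:.1f} ms full, {sample_mean:.1f} ms sampled")
```

The sampled average lands close to the full average while touching only about one row in a hundred, which is the trade that makes sampling attractive on very large inputs.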
Summing Up
Pig isn’t a replacement for SQL Server Integration Services. Their use cases overlap for many tasks, but they also solve very different problems. Using Pig for all ETL processes will be overkill when the data can reasonably be handled within a single SQL Server instance. On the flip side, there are problems that are too large to quickly solve within a single SSIS process or package. In either situation you should pick the best tool for the job.


