What Good is a Pig?

Data matters. Every day we generate huge volumes of data. Processing all of this data presents challenges for many people.

Pig is a data flow language. It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently. Best of all, it supports many relational features, making it easy to join, group, and aggregate data. If you think this sounds a lot like an ETL tool, you’d be right. Pig has many things in common with ETL tools, if those ETL tools ran on many servers simultaneously.

Where would you use Pig?

Case 1 – Time Sensitive Data Loads

Loading data is a key part of many businesses. Data comes in from outside of the database in text, XML, CSV, or some other arbitrary file format. The data then has to be processed into a different format and loaded into a database for later querying. Sometimes there are a lot of steps involved, sometimes the data has to be translated into an intermediate format, but most of the time it gets into the database. Some failure is to be expected, right?

Loading large volumes of data can become a problem as the volume of data increases: the more data there is, the longer it takes to load. To get around this problem people routinely buy bigger, faster servers and add more, faster disks. There comes a point when you can’t add more CPUs or RAM to a server and increasing the I/O capacity won’t help. Parallelizing ETL processes can be hard even on one machine, and scaling ETL out across several machines is harder still.

Pig is built on top of Hadoop, so it’s able to scale across multiple processors and servers, which makes it easy to process massive data sets. Many ETL processes lend themselves to being decomposed into manageable chunks, and Pig jobs are no exception. Pig builds MapReduce jobs behind the scenes to spread load across many servers. By taking advantage of the simple building blocks of Hadoop, data professionals are able to build simple, easily understood scripts to process and analyze massive quantities of data in a massively parallel environment.
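To make the "MapReduce jobs behind the scenes" idea a little more concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases. It is not what Pig actually generates. The function names and the two-field log format are made up for illustration, and on a real cluster the map and reduce steps would run on different servers.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (status, 1) for one log line. The "<path> <status>" format is hypothetical."""
    _path, status = line.split()
    yield status, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)

def run_job(lines):
    # Each line can be mapped independently, which is what makes the work easy to split up.
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

lines = ["/index.html 200", "/missing 404", "/about.html 200"]
status_counts = run_job(lines)
print(status_counts)
```

Because no map call depends on any other, the input can be cut into chunks and handed to as many servers as are available; only the shuffle step needs to bring matching keys back together.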

Parallel rockets make a single pig faster

An advantage of being able to scale out across many servers is that doubling throughput is often as easy as doubling the number of servers working on a problem. If one server can solve a problem in 12 hours, 24 servers should be able to solve it in 30 minutes.

Case 2 – Processing Many Data Sources

Knowing the effectiveness of an advertisement is big business. For people buying ad space, it’s critical to know just how effective their advertising is in both the physical and virtual space. By combining advertising information from multiple sources and mixing it together with web server traffic, IP geo-location, and click-through metrics, it’s possible to gain a deeper understanding of customer behavior and judge just how effective certain ads are in certain parts of the country.

Pig isn’t just designed to scale out over many servers. Pig can be used to build complex data flows and extend them with custom code. A job can be written to collect web server logs, use external programs to fetch geo-location data for the users’ IP addresses, and join the new set of geo-located web traffic to click maps stored as JSON, web analytic data in CSV format, and spreadsheets from the advertising department to build a rich view of user behavior overlaid with advertising effectiveness.

Creating this rich view of data is possible because Pig supplies complex features like joins, sorting, grouping, and aggregation. The syntax is different from what developers are used to, but Pig’s focus on data flow makes it easy to write complex jobs. Rather than creating complex logic in SQL, developers can create jobs that walk through data step by step to deliver the best results. It’s easy to rapidly prototype these procedural jobs, and performance tuning can be accomplished with relative ease.
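The step-by-step style is easier to see with an example. The sketch below is plain Python rather than Pig, and the click and geo-location records are invented for illustration, but it walks through the same kind of flow: join, group, aggregate, then sort, with each step producing a named intermediate result.

```python
from collections import defaultdict

# Hypothetical inputs: (ip, ad_id) clicks and (ip, region) geo-location lookups.
clicks = [
    ("10.0.0.1", "ad-1"),
    ("10.0.0.2", "ad-1"),
    ("10.0.0.1", "ad-2"),
    ("10.0.0.3", "ad-1"),
]
geo = [("10.0.0.1", "Midwest"), ("10.0.0.2", "Midwest"), ("10.0.0.3", "Northeast")]

# Step 1: join clicks to geo data on IP address (inner join).
region_by_ip = dict(geo)
joined = [(ad_id, region_by_ip[ip]) for ip, ad_id in clicks if ip in region_by_ip]

# Step 2: group the joined rows by (ad, region).
groups = defaultdict(list)
for ad_id, region in joined:
    groups[(ad_id, region)].append(1)

# Step 3: aggregate each group into a click count.
clicks_per_ad_region = {key: sum(ones) for key, ones in groups.items()}

# Step 4: sort by click count, highest first, breaking ties by key.
ranked = sorted(clicks_per_ad_region.items(), key=lambda item: (-item[1], item[0]))

for (ad_id, region), count in ranked:
    print(ad_id, region, count)
```

Each intermediate name (joined, groups, ranked) can be inspected on its own, which is a large part of why data flow jobs are quick to prototype and debug.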

Case 3 – Analytic Insight Through Sampling

Even in case 2, we’ve seen how Pig can provide some analytical insight into the massive quantities of data that are generated every day in the datacenter. It’s easy to fall into the trap of thinking that Pig is an ETL glue that moves data from a log file, processes it, and drops it off for another database to consume. Pig is more than just an ETL tool.

One of Pig’s strengths is its ability to perform sampling of large data sets. As Pig manipulates data, it’s easy to reduce the set of data that we’re operating on using sampling. By sampling with a random distribution of data, we can reduce the amount of data that needs to be analyzed and still deliver meaningful results.
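Here is a rough illustration of why sampling works, written in Python rather than Pig. It keeps each row independently with a fixed probability and compares the mean of the sample to the mean of the full data set. The page-load numbers are randomly generated stand-ins, and the function name and seeds are assumptions made for this sketch.

```python
import random

def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

# Hypothetical data: 100,000 page-load times in milliseconds.
data_rng = random.Random(42)
load_times = [data_rng.uniform(50, 500) for _ in range(100_000)]

# Keep roughly 1% of the rows.
sampled = sample_rows(load_times, fraction=0.01, seed=7)

full_mean = sum(load_times) / len(load_times)
sample_mean = sum(sampled) / len(sampled)
print(len(sampled), round(full_mean, 1), round(sample_mean, 1))
```

The sample is about a hundredth of the size of the original, yet its mean should land close to the full mean. How close is close enough depends on the question being asked, so the sampling fraction is something to tune rather than a fixed rule.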

Summing Up

Pig isn’t a replacement for SQL Server Integration Services. Their use cases overlap for many tasks, but they also solve very different problems. Using Pig for all ETL processes will be overkill when the data can reasonably be handled within a single SQL Server instance. On the flip side, there are problems that are too large to quickly solve within a single SSIS process or package. In either situation you should pick the best tool for the job.

Jeremiah Peschka

Jeremiah Peschka has worked as a database and emerging technology expert at Quest Software where he researched new trends and technologies in the world of data storage. Over the course of his career he’s worked with companies across many industries as a system administrator, developer, and DBA. He’s been involved with all aspects of application development and deployment. He likes cheesecake, coffee, and ice cream.


Third Normal Form is Snake Oil

Step right up, ladies and gentlemen, and I will sell you the solution to all of your database needs. That’s right, it’s Doctor Codd’s Third Normal Form, guaranteed to cure all ailments of the schemata, pragmata, and performata. Doctor Codd’s Form will make your data performant and compressive. Accept no substitutes or imitators; Doctor Boyce’s lamentable attempts cannot soothe your aches and pains like Doctor Codd’s Third Normal Form. Why, with just a simple application of Doctor Codd’s Third Normal Form, thrice daily, you’ll be jumping around to the tune of normal forms and transactions in no time!

Sound Familiar?

Anyone pushing a single idea is pushing snake oil, plain and simple. They’re selling you a warm and fuzzy feeling that you’ll make your problems go away by following their simple prescriptions. Deviation from the remedy will, of course, result in problems, failure, and potentially phlebitis.

Can I Cure You?

No, I can’t. Well, I can, but I’m not going to. Not yet, at least. You need to pay attention, first.

The Forms, Both Magnificent and Normal, Are Not A Panacea

Slavish adherence to normalization is bad for your health. There are as many reasons to not normalize data as there are reasons to normalize your data. Don’t believe me? What if I asked you to design a database to persist items that we might sell in a store?

It’s easy, at first, to design an items table with a few columns to describe the main properties that we want to persist about an item in our store. Problems begin when different departments in the store need to save different properties. Different parts of our IT systems will need different views of the data. While adding a column is trivial on small databases, adding a column in a large database is decidedly non-trivial. Eventually the database boils down to an items and item_properties table and at that point the database becomes impossible to query reasonably.

A Solution Most Surprising

We can solve this problem a few ways, but with Microsoft’s Hadoop announcements, it makes sense to look at what the non-relational world can offer. HBase is a real-time column-oriented database that runs on top of Hadoop.

HBase is helpful for modeling dynamic properties because of its flexible data model. While HBase does have tables, rows, and columns, there are some powerful differences. HBase’s columns are split up into column families, which are logical groupings of columns. Columns can be added on the fly once a column family has been created.

Jumping back to our example, instead of modeling an items and item_properties table, we can create an items table and create column families to store properties specific to a department or for a common purpose. Rather than create many tables, we can add a shipping_info column family, an accounting column family, and a sales_promotion column family. Over time this flexible data model can be used to populate reporting tables in an enterprise data warehouse. Rather than focus initial efforts on building a robust general purpose schema in an RDBMS, it’s easy to create a flexible schema in HBase and pull out the data we need for reporting at a later time.
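The shape of that items table can be sketched with a toy in-memory model. This is plain Python and deliberately not the HBase client API: the ToyTable class, its methods, the row keys, and the column names are all invented for illustration. The point it shows is that column families are fixed when the table is defined, while individual columns inside a family appear whenever a row writes them.

```python
class ToyTable:
    """In-memory stand-in for a column-family table. NOT the HBase client API."""

    def __init__(self, name, column_families):
        self.name = name
        self.column_families = set(column_families)  # declared up front
        self.rows = {}  # row key -> {family -> {column -> value}}

    def put(self, row_key, family, column, value):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        # A new column inside an existing family needs no schema change.
        row.setdefault(family, {})[column] = value

    def get(self, row_key, family):
        return self.rows.get(row_key, {}).get(family, {})

items = ToyTable("items", ["shipping_info", "accounting", "sales_promotion"])

# Hypothetical rows: each item stores only the properties it actually has.
items.put("sku-1001", "shipping_info", "weight_kg", 2.5)
items.put("sku-1001", "accounting", "unit_cost", 14.0)
items.put("sku-1002", "shipping_info", "fragile", True)  # a column sku-1001 never had

print(items.get("sku-1001", "shipping_info"))
print(items.get("sku-1002", "shipping_info"))
```

Each department writes to its own family without touching the others, and a row that has no accounting data simply has nothing stored there instead of a column full of NULLs.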

A Final Commentary on Data

Denormalization doesn’t have to be a dirty word. There are many reasons to denormalize data. Ultimately, the process of shredding data apart should depend not on blind adherence to the principles of normalization but on the needs of the applications that consume the data. If you have a log file processing application, does it make sense to read log files from disk into a relational database? Shredding every log entry into multiple columns doesn’t make sense when log files are only infrequently processed and used to produce aggregations.

Even when you eventually need to query the log file data, there are tools suited to performing SQL-like operations across flat files. Hive provides a SQL-like querying layer on top of the Hadoop framework, making it possible to run bulk queries across large volumes of data stored in flat files and spread across many servers.
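To show what aggregating straight from flat files looks like at small scale, here is a Python sketch that computes the equivalent of a GROUP BY with a COUNT over raw log lines, without ever loading them into a table. It is not Hive, and the log format, sample lines, and function name are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical log format: "<timestamp> <level> <message>"
log_lines = [
    "2011-11-01T10:00:00 ERROR disk full",
    "2011-11-01T10:00:05 INFO request served",
    "2011-11-01T10:00:09 ERROR disk full",
    "2011-11-01T10:01:00 WARN slow response",
]

def count_by_level(lines):
    """Count log entries per level, reading each raw line exactly once."""
    counts = Counter()
    for line in lines:
        # Split only as far as needed; the message is left untouched.
        _timestamp, level, _message = line.split(" ", 2)
        counts[level] += 1
    return counts

level_counts = count_by_level(log_lines)
print(dict(level_counts))
```

The lines are parsed only at the moment they are read and only as far as the question requires, which is the opposite of shredding every entry into columns up front.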

Know how data is used; know the problem that the business wants to solve. Let the principle of consumption drive the structure of your information. You will thank me, some day, for freeing you from the false rigor of normalization.

Jeremiah Peschka
