Top Pig Interview Questions
Answer: COGROUP is a Pig operator that works on tuples from several relations. It can be applied to statements that involve two or more relations, up to 127 relations at a time. When you use the operator on two tables, Pig first groups both tables and then joins them on the grouped columns.
Answer: Pig is an Apache open source project that runs on Hadoop and provides an engine for parallel data flows on Hadoop. It includes a language called Pig Latin for expressing these data flows. Pig Latin includes operations such as join, sort, and filter, and also lets users write User Defined Functions (UDFs) for processing, reading, and writing data. Pig uses both HDFS and MapReduce, i.e., for storage and for processing.
Answer: The BloomMapFile is a class that extends MapFile. So its functionality is similar to MapFile.
BloomMapFile uses dynamic Bloom filters to provide a quick membership test for the keys. It is used in the HBase table format.
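The membership test mentioned above can be illustrated with a small sketch. This is not BloomMapFile's actual implementation: it is a fixed-size Bloom filter in Python, and the class name, bit count, and hash scheme are all assumptions chosen for illustration. The point it shows is that a Bloom filter can answer "definitely absent" or "possibly present" without reading the underlying file.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (a simplified, hypothetical stand-in)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch easy to read.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("row-1", "row-2", "row-3"):
    bloom.add(row_key)
```

A key that was added is always reported as possibly present. A key that was never added is usually rejected, but false positives can occur, which is why a real lookup still follows up with a read of the file.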
Answer: Pig Latin is a procedural counterpart to SQL. Pig has some similarities to SQL but more differences. SQL is a query language in which the user asks a question in query form; it states what answer is wanted but does not say how to compute it. If a user wants to perform multiple operations on tables, they have to write multiple queries and use temporary tables to store intermediate results. SQL does support subqueries, but many SQL users find them confusing and difficult to form properly, and subqueries create an inside-out design in which the first step in the data pipeline is the innermost query. Pig is designed with a long series of data operations in mind, so there is no need to write the data pipeline as an inverted set of subqueries or to worry about storing data in temporary tables.
Answer: Pig goes through several steps when a Pig Latin script is converted into MapReduce jobs. After performing the basic parsing and semantic checking, it produces a logical plan, which describes the logical operators that Pig has to execute. After this, Pig produces a physical plan, which describes the physical operators needed to execute the script.
Answer: No, illustrate will not launch any MapReduce job; it works on the internal data. On the console, illustrate does not run any job. It just shows the output of each stage and not the final output.
Answer: In MapReduce, the group-by operation is performed on the reducer side, while filter and projection can be implemented in the map phase. Pig Latin also provides standard operations similar to those in MapReduce, such as order by, filter, and group by. We can analyze a Pig script to understand its data flow and to catch errors early. Pig Latin is also much cheaper to write and maintain than Java code for MapReduce.
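The split between map-side and reduce-side work can be sketched in a few lines of Python. This is a toy, single-process simulation rather than Hadoop or Pig code; the record layout (user, url, bytes), the threshold, and the sample values are made up for illustration.

```python
from collections import defaultdict


def map_phase(records, min_bytes):
    """Filter and project each record independently, as a mapper would."""
    for user, url, nbytes in records:
        if nbytes >= min_bytes:   # filter: drop small requests
            yield user, nbytes    # projection: keep only (key, value)


def reduce_phase(pairs):
    """Group values by key and aggregate, as happens on the reducer side."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical weblog records: (user, url, bytes).
logs = [
    ("alice", "/a", 120),
    ("bob", "/b", 30),
    ("alice", "/c", 80),
    ("bob", "/d", 200),
]
totals = reduce_phase(map_phase(logs, min_bytes=50))
print(totals)
```

Filtering and projecting need only one record at a time, so they fit in the map phase; grouping needs every value for a key in one place, which is why it happens after the shuffle, on the reducer.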
Answer: Yes, the keyword ‘DEFINE’ is like a function name. Once you have registered a jar, you have to define the function. The logic you have written in your Java program is exported as a jar, and that jar is registered by you. The compiler will then check for the function in the exported jar. When the function is not present in the library, it looks into your jar.
Answer: No, the keyword ‘FUNCTIONAL’ is not a User Defined Function (UDF). When writing a UDF, we have to override certain functions, and the work has to be done through those functions only. The keyword ‘FUNCTIONAL’, however, is a built-in, i.e., predefined, function, so it does not work as a UDF.
Answer: Pig can be used in three categories: 1) ETL data pipelines, 2) research on raw data, and 3) iterative processing.
The most common use case for Pig is the data pipeline. For example, web-based companies collect weblogs, and before storing the data in a warehouse they apply transformations to it, such as cleaning and aggregation.
Answer: Pig is a high-level platform that makes many Hadoop data analysis issues easier to execute. The language we use for this platform is: Pig Latin. A program written in Pig Latin is like a query written in SQL, where we need an execution engine to execute the query. So, when a program is written in Pig Latin, Pig compiler will convert the program into MapReduce jobs. Here, MapReduce acts as the execution engine.
Answer: The scalar data types are:
- int (4 bytes)
- float (4 bytes)
- double (8 bytes)
- long (8 bytes)
- chararray
- bytearray
Answer: There are 3 modes of execution available in Pig:
- Interactive Mode (Also known as Grunt Mode)
- Batch Mode
- Embedded Mode
Answer: Let us take a scenario where we want to count the population in two cities. I have a data set and sensor list of different cities, and I want to count the population of two cities using one MapReduce job. Let us assume that one is Bangalore and the other is Noida. I need to treat the key for Bangalore the same as the key for Noida, so that the population data of these two cities reaches one reducer. The idea is that I somehow have to instruct the MapReduce program: whenever you find a city named ‘Bangalore’ or a city named ‘Noida’, create an alias that serves as a common key for both cities, so that both get passed to the same reducer. For this, we have to write a custom partitioner.
In MapReduce, when you create a key for a city, you have to use ‘city’ as the key. So, whenever the framework comes across a different city, it treats it as a different key. Hence, we need a customized partitioner. MapReduce has a provision for writing your own partitioner, in which you can specify that if city = bangalore or noida, the same hash code is returned. However, we cannot create a custom partitioner in Pig. As Pig is not a framework, we cannot direct the execution engine to customize the partitioner. In such scenarios, MapReduce works better than Pig.
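The partitioning idea above can be sketched as follows. This is a plain Python simulation, not Hadoop's Partitioner API: the alias table, the function names, and the population figures are hypothetical, and `zlib.crc32` is used only because Python's built-in string hash changes between runs.

```python
import zlib

# Hypothetical alias table: both cities map to one common key.
CITY_ALIAS = {
    "bangalore": "bangalore+noida",
    "noida": "bangalore+noida",
}


def partition(city, num_reducers):
    """Return the reducer index for a city, sending aliased cities together."""
    key = CITY_ALIAS.get(city.lower(), city.lower())
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(records, num_reducers):
    """Sum the population that lands on each reducer."""
    totals = [0] * num_reducers
    for city, population in records:
        totals[partition(city, num_reducers)] += population
    return totals


# Made-up counts, for illustration only.
records = [("Bangalore", 10), ("Noida", 5), ("Bangalore", 7)]
per_reducer = route(records, num_reducers=4)
```

Because both city names resolve to the same alias before hashing, they always land on the same reducer regardless of how many reducers there are; any other city falls back to hashing its own name.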
Answer: Pig Latin is case sensitive in some places and not in others. For example, keywords are not case sensitive: Load is equivalent to load.
Relation names are case sensitive: A = load 'b' is not equivalent to a = load 'b'.
UDF names are also case sensitive: count is not equivalent to COUNT.
Answer: dump displays the output on the screen.
dump processed;
Answer: No, Pig will not show any warning if there is no matching field or if there is a mismatch. Even if Pig did give such a warning, it would be difficult to find in the log file. When a mismatch is found, Pig assumes a null value.
Answer: Pig interactive shell is known as Grunt Shell. It provides a shell for users to interact with HDFS.
Answer: Co-group joins data sets by grouping each one separately on a common field. It groups the elements by their common field and then returns a set of records containing two separate bags. The first bag consists of the records of the first data set that share the common field value, and the second bag consists of the records of the second data set that share it.
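The two-bag result described above can be sketched in Python. This imitates the behavior only and is not Pig code; the function name, the tuple layouts, and the sample data are assumptions. Keys present in only one data set get an empty bag on the other side.

```python
from collections import defaultdict


def cogroup(left, right, left_key, right_key):
    """Group two lists of tuples by a key field, keeping one bag per input."""
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[record[left_key]][0].append(record)   # first bag
    for record in right:
        groups[record[right_key]][1].append(record)  # second bag
    return dict(groups)


# Hypothetical inputs: (owner, pet) and (person, friend).
owners = [("alice", "dog"), ("bob", "cat"), ("alice", "fish")]
friends = [("alice", "carol"), ("dave", "bob")]

result = cogroup(owners, friends, left_key=0, right_key=0)
for key, (first_bag, second_bag) in result.items():
    print(key, first_bag, second_bag)
```

Unlike a plain join, the records are not flattened into combined rows: each output record keeps the matching tuples of each input in its own bag.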
Answer: They are:
- foreach
- order by
- filter
- group
- distinct
- join
- limit