Top 50 Hive Interview Questions

raju2006
June 15, 2016 0 Comments

Top 50 Hive Interview Questions

Q1 What is Hive ?

Answer: Hive is a data warehouse software which is used for facilitates querying and managing large data sets residing in distributed storage.Hive language almost look like SQL language called HiveQL.Hive also allows traditional map reduce programs to customize mappers and reducers when it is inconvenient or inefficient to execute the logic in HiveQL (User Defined Functions UDFS).

 

Q2 How Facebook Uses Hadoop,Hive and Hbase ?

Answer: Facebook data stored on HDFS,everyday millions of photos uploaded into facebook with the help of Hadoop Facebook Messages,Likes and statues updates running on top of Hbase Hive to generate reports for third-party developers and advertisers who need to track the success of their applications or campaigns.

 

Q3 What is the difference between HBase and Hive?

Answer: Both hive and hbase can be used in different technologies that are based on Hadoop. Hive happens to be a infrastructure warehouse of data which is used on Hadoop whereas HBase is NoSQL. The key value stores which run on Hadoop themselves. Hive will also help those who know about SQL run a few jobs in MapReduce when Hbase will also support 4 of the operations such as put, get, scan and delete. The Hbase happens to be good for querying for data but Hive on the other hand is good for querying data is analytical and is collected over a while.

 

Q5 What is Hive Metastore ?

Answer: Hive Meta store is a database that stores metadata of your hive tables like table name,column name,data types,table location,number of buckets in the table etc.

 

Q6 Hive new version supported Hadoop Versions ?

Answer: This release works with Hadoop 0.20.x, 0.23.x.y, 1.x.y, 2.x.y

 

Q7 Which companies are mostly using Hive ?

Answer: Facebook,Netflix

 

Q8 Wherever (Different Directory) I run hive query, it creates new metastore_db, please explain the reason for it?

Answer: Whenever you run the hive in embedded mode, it creates the local metastore. And before creating the metastore it looks whether metastore already exist or not. This property is defined in configuration file hive – site.xml. Property is “javax.jdo.option.ConnectionURL” with default value “jdbc:derby:;databaseName=metastore_db;create=true”. So to change the behavior change the location to absolute path, so metastore will be used from that location.

 

Q9 Is it possible to use same metastore by multiple users, in case of embedded hive?

Answer: No, it is not possible to use metastore in sharing mode. It is recommended to use standalone “real” database like MySQL or PostGresSQL.

 

Q10 What is the functionality of Query Processor in Apached Hive ?

Answer: This component implements the processing framework for converting SQL to a graph of map/reduce jobs and the execution time framework to run those jobs in the order of dependencies.

 

Q11 Is multi line comment supported in HIVE Script?

Answer: NO

 

Q12 What is the functionality of Query Processor in Apache Hive?

Answer: This components implements the processing framework for converting SQL to graph of map/reduce jobs and the execution time framework to run those jobs in the order od dependencies.

 

Q13 what is a Hive Metastore?

Answer: Hive Metastore is a central repository that stores metadata in external database.

 

Q14 Explain about the SMB Join in Hive.

Answer: In SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table and then a merge sort join is performed. Sort Merge Bucket (SMB) join in hive is mainly used as there is no limit on file or partition or table join. SMB join can best be used when the tables are large. In SMB join the columns are bucketed and sorted using the join columns. All tables should have the same number of buckets in SMB join.

 

Q15 What is ObjectInspector functionality?

Answer: ObjectInspector is used to analyze the structure of individual columns and the internal structure of the row objects. ObjectInspector in Hive provides access to complex objects which can be stored in multiple formats.

 

Q16 Is it possible to use same metastore by multiple users, in case of embedded hive?

Answer: No, it is not possible to use metastore in sharing mode. It is recomended to use standalone “real” database like MySQL or PostGreSQL.

 

Q17 Explain about the different types of join in Hive.

Answer: HiveQL has 4 different types of joins – JOIN- Similar to Outer Join in SQL

  1. FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
  2. LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
  3. RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.

 

Q18 Is it possible to change the default location of Managed Tables in Hive, if so how?

Answer: Yes, we can change the default location of Managed tables using the LOCATION keyword while creating the managed table. The user has to specify the storage path of the managed table as the value to the LOCATION keyword.

 

Q19 How can you connect an application, if you run Hive as a server?

Answer: When running Hive as a server, the application can be connected in one of the 3 ways-

  1. ODBC Driver-This supports the ODBC protocol
  2. JDBC Driver- This supports the JDBC protocol
  3. Thrift Client- This client can be used to make calls to all hive commands using different programming language like PHP, Python, Java, C++ and Ruby.

 

Q20 Which classes are used by the Hive to Read and Write HDFS Files

Answer: Following classes are used by Hive to read and write HDFS files:

  • TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
  • SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.

 

Q21 What are the types of tables in Hive?

Answer: There are two types of tables.

  1. Managed tables.
  2. External tables.

Only the drop table command differentiates managed and external tables. Otherwise, both type of tables are very similar.

 

Q22 Is it possible to create multiple table in hive for same data?

Answer: As hive creates schema and append on top of an existing data file. One can have multiple schema for one data file, schema will be saved in hive’s metastore and data will not be parsed or serialized to disk in given schema. When we will try to retrieve data, schema will be used. For example if we have 5 column (name, job, dob, id, salary) in the data file present in hive metastore then, we can have multiple schema by choosing any number of columns from the above list. (Table with 3 columns or 5 columns or 6 columns).

 

Q23 What kind of datawarehouse application is suitable for Hive?

Answer: Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. Hive is most suited for data warehouse applications, where

1) Relatively static data is analyzed,

2) Fast response times are not required, and

3) When the data is not changing rapidly.

Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing.So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

 

Q24 What is the maximum size of string data type supported by Hive?

Answer: Maximum size is 2 GB.

 

Q25 What are the Binary Storage formats supported in Hive?

Answer: By default Hive supports text file format, however hive also supports below binary formats.

Sequence Files, Avro Data files, RCFiles, ORC files, Parquet files

  • Sequence files: General binary format. splittable, compressible and row oriented. a typical example can be. if we have lots of small file, we may use sequence file as a container, where file name can be a key and content could stored as value. it support compression which enables huge gain in performance.
  • Avro datafiles: Same as Sequence file splittable, compressible and row oriented except support of schema evolution and multilingual binding support.
  • RCFiles: Record columnar file, it’s a column oriented storage file. it breaks table in row split. in each split stores that value of first row in first column and followed sub subsequently.
  • ORC Files: Optimized Record Columnar files.

 

Q26 CONCAT function in Hive with Example?

Answer: CONCAT function will concat the input strings. You can specify any number of strings separated by comma.

 

Q27 is HQL case sensitive?

Answer: HQL is not case sensitive.

 

Q28 REPEAT function in Hive with example?

Answer: REPEAT function will repeat the input string n times specified in the command.

 

Q29 Describe REVERSE function in Hive with example?

Answer: REVERSE function will reverse the characters in a string.

 

Q30 LOWER or LCASE function in Hive with example?

Answer: LOWER or LCASE function will convert the input string to lower case characters.

 

Q31 UPPER or UCASE function in Hive with example?

Answer: UPPER or UCASE function will convert the input string to upper case characters.

 

Q33 Rename a table in Hive – How to do it?

Answer: Using ALTER command, we can rename a table in Hive.

ALTER TABLE hive_table_name RENAME  TO new_name;

 

Q34 Difference between order by and sort by in hive?

Answer: SORT BY will sort the data within each reducer. You can use any number of reducers for SORT BY operation.

ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in hive uses single reducer.

ORDER BY guarantees total order in the output while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give partially ordered final results.

 

Q35 RLIKE in Hive?

Answer: RLIKE (Right-Like) is a special function in Hive where if any substring of A matches with B then it evaluates to true. It also obeys Java regular expression pattern. Users don’t need to put % symbol for a simple match in RLIKE.

 

Q36 Difference between external table and internal table in HIVE ?

Answer: Hive has a relational database on the master node it uses to keep track of state. For instance, when you CREATE TABLE FOO(foo string) LOCATION ‘hdfs://tmp/’;, this table schema is stored in the database. If you have a partitioned table, the partitions are stored in the database(this allows hive to use lists of partitions without going to the filesystem and finding them, etc). These sorts of things are the ‘metadata’.

When you drop an internal table, it drops the data, and it also drops the metadata. When you drop an external table, it only drops the meta data. That means hive is ignorant of that data now. It does not touch the data itself.

 

Q37 Does Hive support record level Insert, delete or update?

Answer: Hive does not provide record-level update, insert, or delete. Henceforth, Hive does not provide transactions too. However, users can go with CASE statements and built in functions of Hive to satisfy the above DML operations. Thus, a complex update query in a RDBMS may need many lines of code in Hive.

 

Q38 Is Hive suitable to be used for OLTP systems? Why?

Answer: No Hive does not provide insert and update at row level. So it is not suitable for OLTP system.

 

Q39 What kind of datawarehouse application is suitable for Hive?

Answer: Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.

Hive is most suited for data warehouse applications, where

  •  Relatively static data is analyzed,
  •  Fast response times are not required, and
  •  When the data is not changing rapidly.

 

Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing.So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

 

 

Q40 Can we change the data type of a column in a hive table?

 

Answer: Using REPLACE column option

 

ALTER TABLE table_name REPLACE COLUMNS ……

 

 

Q41 TRIM function in Hive with example?

 

Answer: TRIM function will remove the spaces associated with a string. xample:

 

TRIM(‘  Hadoop  ‘);

 

Output: Hadoop.

 

 

Q42 Why do we need Hive?

 

Answer: Hive is a tool in Hadoop ecosystem which provides an interface to organize and query data in a databse like fashion and write SQL like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax.

 

 

Q43 Is there a date data type in Hive?

 

Answer: Yes. The TIMESTAMP data types stores date in java.sql.timestamp format

 

 

Q44 What is a Hive variable? What for we use it?

 

Answer: The hive variable is variable created in the Hive environment that can be referenced by Hive scripts. It is used to pass some values to the hive queries when the query starts executing.

 

 

Q45 While loading data into a hive table using the LOAD DATA clause, how do you specify it is a hdfs file and not a local file ?

 

Answer: By Omitting the LOCAL CLAUSE in the LOAD DATA statement.

 

 

Q46 What does the “USE” command in hive do?

 

Answer: With the use command you fix the database on which all the subsequent hive queries will run.

 

 

Q47 How can you delete the DBPROPERTY in Hive?

 

Answer: There is no way you can delete the DBPROPERTY.

 

 

Q48 Does the archiving of Hive tables give any space saving in HDFS?

 

Answer: No. It only reduces the number of files which becomes easier for namenode to manage.

 

 

Q49 What is the usefulness of the DISTRIBUTED BY clause in Hive?

 

Answer: It controls ho wthe map output is reduced among the reducers. It is useful in case of streaming data.

 

 

Q50 Can a partition be archived? What are the advantages and Disadvantages?

 

Answer: Yes. A partition can be archived. Advantage is it decreases the number of files stored in namenode and the archived file can be queried using hive. The disadvantage is it will cause less efficient query and does not offer any space savings.

 

*********All The Best*********