
Top Sqoop Interview Questions

Answer: Incremental data load in Sqoop synchronizes the modified or newly added data (often referred to as delta data) from the RDBMS to Hadoop. The delta data is loaded using the incremental option of the Sqoop import command.

Incremental load can be performed using the Sqoop import command, or by loading the data into Hive without overwriting it. The different attributes that need to be specified during an incremental load in Sqoop are listed below, with an example command after the list-

  • Mode (incremental) – defines how Sqoop determines which rows are new. The mode can have the value append or lastmodified.
  • Col (check-column) – specifies the column that should be examined to find the rows to be imported.
  • Value (last-value) – denotes the maximum value of the check column from the previous import operation.
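
For example, an append-mode incremental import might look like the following; the connection string, table, and column names here are illustrative placeholders:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --incremental append --check-column order_id --last-value 10000 \
  --target-dir /data/orders

Here only rows whose order_id is greater than 10000 (the last imported value) are brought into HDFS.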

Answer: Include the Sqoop JAR on the classpath of the Java program. After that, the Sqoop.runTool() method must be invoked, and the necessary parameters should be passed to Sqoop programmatically, just as they would be on the command line.

Answer: To get the output files of a Sqoop import in a format other than .gz, such as .bz2, use the --compression-codec parameter (together with --compress) and supply the desired codec class.
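
A sketch of such an import using Hadoop's built-in BZip2 codec; the connection string, table, and target directory are illustrative:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --compress --compression-codec org.apache.hadoop.io.compress.BZip2Codec \
  --target-dir /data/orders_bz2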

Answer: Sqoop provides the capability to store large-sized data in a single field, based on the type of data. Sqoop supports the ability to store-

  • CLOBs – Character Large Objects
  • BLOBs – Binary Large Objects

Large objects in Sqoop are handled by importing them into a file referred to as a "LobFile", i.e. a Large Object File. The LobFile has the ability to store records of huge size; thus each record in the LobFile is a large object.

Answer: The native utilities used by databases to support faster (direct-mode) loads do not work with binary data formats such as SequenceFile.
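
As an illustration (connection details are placeholders), a direct-mode import therefore produces plain text output; combining it with a binary format such as SequenceFile is not supported:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --direct --target-dir /data/orders_text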

Answer: Yes, Sqoop supports two types of incremental imports-

  1. Append
  2. Last Modified

Append should be used in the import command to insert new rows only, while lastmodified should be used to capture both newly inserted and updated rows.

Answer: The command to check the list of all tables present in a single database using Sqoop is as follows-

sqoop list-tables --connect jdbc:mysql://localhost/user

Answer: The --num-mappers (or -m) parameter controls the number of mappers executed by a Sqoop command. Start with a small number of map tasks and then scale up gradually, because choosing a high number of mappers initially may degrade performance on the database side.
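
For instance, the following import (with an illustrative connection string and table) runs with four parallel map tasks:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --num-mappers 4 --target-dir /data/orders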

Answer: We can run a filtering query on the database and save the result to a temporary table in the database.

Then use the sqoop import command on that temporary table, without the --where clause.
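
A minimal sketch of the second step, assuming the filtered rows were saved to a hypothetical staging table named orders_filtered:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders_filtered \
  --target-dir /data/orders_filtered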

Answer: Sqoop offers two approaches for this.

a − Use the --incremental parameter with the append option, where the value of a check column is examined and rows with new values since the last import are imported as new rows.

b − Use the --incremental parameter with the lastmodified option, where a date column in the source is checked for records that have been updated after the last import.
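
A sketch of approach b, with illustrative table and column names; --merge-key lets Sqoop fold updated rows into the data already present in the target directory:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --incremental lastmodified --check-column last_updated --last-value "2024-01-01 00:00:00" \
  --merge-key order_id --target-dir /data/orders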

Answer: The Sqoop metastore is a tool with which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) stored in this metastore.

Clients must be configured to connect to the metastore, either in sqoop-site.xml or with the --meta-connect argument.
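
A sketch of working with a shared metastore; the host, port, job name, and connection string are illustrative:

# run on the metastore host to start the shared repository
sqoop metastore
# saved from a client; the job definition is stored in the shared metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
  --create daily_orders_import -- import --connect jdbc:mysql://db.example.com/sales --table orders
# executed later by any client configured to reach the metastore
sqoop job --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop --exec daily_orders_import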

Answer: Sqoop allows us to use free-form SQL queries with the import command. The import command should be used with the -e or --query option to execute free-form SQL queries. When using -e or --query with the import command, the --target-dir value must be specified.
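
For example (the query, split column, and paths are illustrative; the literal token $CONDITIONS is required by Sqoop in free-form queries):

sqoop import --connect jdbc:mysql://db.example.com/sales \
  --query 'SELECT o.*, c.name FROM orders o JOIN customers c ON o.customer_id = c.id WHERE $CONDITIONS' \
  --split-by o.order_id --target-dir /data/orders_enriched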

Answer:

  • --append
  • --columns
  • --where

These options are the most frequently used when importing RDBMS data (see the example command below).
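
A combined example with illustrative table and column names:

sqoop import --connect jdbc:mysql://db.example.com/sales --table orders \
  --columns "order_id,customer_id,amount" --where "amount > 100" \
  --append --target-dir /data/orders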

Answer: MySQL, Oracle, PostgreSQL, IBM, Netezza and Teradata. Each database connects through its JDBC driver.

Answer: The merge tool combines two datasets, where entries in the newer dataset overwrite entries in the older dataset, preserving only the newest version of each record between the two datasets.
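
A sketch of the merge tool; the paths, key column, and the record class/jar (normally produced by sqoop codegen) are illustrative:

sqoop merge --new-data /data/orders_incr --onto /data/orders_base \
  --target-dir /data/orders_merged --merge-key order_id \
  --jar-file orders.jar --class-name orders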

Answer: BLOB and CLOB columns are the common large object types. By default, objects smaller than 16 MB are stored inline with the rest of the data and materialized in memory for processing. Larger objects are stored in files under the _lobs subdirectory of the import target directory and are handled in a streaming fashion. If you set the inline LOB limit to 0, all large objects are placed in external storage.
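
For example, to force all large objects into external storage, the inline limit can be set to 0 (connection details are placeholders):

sqoop import --connect jdbc:mysql://db.example.com/media --table documents \
  --target-dir /data/documents --inline-lob-limit 0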

Answer: The eval tool allows the user to run sample SQL queries against the database and preview the results on the console. It helps determine what data can be imported and whether the imported data will be the desired data.
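
For example, to preview a few rows before running the actual import (the connection string, user, and query are illustrative):

sqoop eval --connect jdbc:mysql://db.example.com/sales --username retail_user -P \
  --query "SELECT * FROM orders LIMIT 10"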