Top 50 Sqoop Interview Questions
Q1 What is the process to perform an incremental data load in Sqoop?
Answer: The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.
Incremental load can be performed by using Sqoop import command or by loading the data into hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-
- Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last Modified.
- Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.
- Value (last-value) –This denotes the maximum value of the check column from the previous import operation.
Q2 How Sqoop can be used in a Java program?
Answer: The Sqoop jar in classpath should be included in the java code. After this the method Sqoop.runTool () method must be invoked. The necessary parameters should be created to Sqoop programmatically just like for command line.
Q3 What is the significance of using –compress-codec parameter?
Answer: To get the out file of a sqoop import in formats other than .gz like .bz2 we use the –compress -code parameter.
Q4 How are large objects handled in Sqoop?
Answer: Sqoop provides the capability to store large sized data into a single field based on the type of data. Sqoop supports the ability to store-
- CLOB ‘s – Character Large Objects
- BLOB’s –Binary Large Objects
Large objects in Sqoop are handled by importing the large objects into a file referred as “LobFile” i.e. Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object.
Q5 What is a disadvantage of using –direct parameter for faster data load by sqoop?
Answer: The native utilities used by databases to support faster load do not work for binary data formats like Sequence File.
Q6 Is it possible to do an incremental import using Sqoop?
Answer: Yes, Sqoop supports two types of incremental imports-
- Append
- Last Modified
To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command.
Q7 How can you check all the tables present in a single database using Sqoop?
Answer: The command to check the list of all tables present in a single database using Sqoop is as follows-
Sqoop list-tables –connect jdbc: mysql: //localhost/user;
Q8 How can you control the number of mappers used by the sqoop command?
Answer: The Parameter –num-mappers is used to control the number of mappers executed by a sqoop command. We should start with choosing a small number of map tasks and then gradually scale up as choosing high number of mappers initially may slow down the performance on the database side.
Q9 What is the standard location or path for Hadoop Sqoop scripts?
Answer: /usr/bin/Hadoop Sqoop.
Q10 How can we import a subset of rows from a table without using the where clause?
Answer: We can run a filtering query on the database and save the result to a temporary table in database.
Then use the sqoop import command without using the –where clause.
Q11 When the source data keeps getting updated frequently, what is the approach to keep it in sync with the data in HDFS imported by sqoop?
Answer: qoop can have 2 approaches.
a − To use the –incremental parameter with append option where value of some columns are checked and only in case of modified values the row is imported as a new row.
b − To use the –incremental parameter with lastmodified option where a date column in the source is checked for records which have been updated after the last import.
Q12 What is a sqoop metastore?
Answer: It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the –meta-connect argument.
Q13 Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Answer: Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified.
Q14 Tell few import control commands:
Answer: –Append
–Columns
–Where
These command are most frequently used to import RDBMS Data.
Q15 Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Answer: Sqoop allows us to use free form SQL queries with the import command. The import command should be used with the –e and – query options to execute free form SQL queries. When using the –e and –query options with the import command the –target dir value must be specified.
Q16 How can you see the list of stored jobs in sqoop metastore?
Answer: sqoop job –list
Q17 What type of databases Sqoop can support?
Answer: MySQL, Oracle, PostgreSQL, IBM, Netezza and Teradata. Every database connects through jdbc driver.
Q18 What is the purpose of sqoop-merge?
Answer: The merge tool combines two datasets where entries in one dataset should overwrite entries of an older dataset preserving only the newest version of the records between both the data sets.
Q19 HOw sqoop can handle large objects?
Answer: Blog and Clob columns are common large objects. If the object is less than 16MB, it stored inline with the rest of the data. If large objects, temporary stored in_lob subdirectory. Those lobs processes in a streaming fashion. Those data materialized in memory for processing. IT you set LOB to 0, those lobs objects placed in external storage.
Q20 What is the importance of eval tool?
Answer: It allows user to run sample SQL queries against Database and preview the results on the console. It can help to know what data can import? The desired data imported or not?
Q21 What is the default extension of the files produced from a sqoop import using the –compress parameter?
Answer: .gz
Q22 Can we import the data with “Where” condition?
Answer: Yes, Sqoop has a special option to export/import a particular data.
Q23 What are the limitations of importing RDBMS tables into Hcatalog directly?
Answer: There is an option to import RDBMS tables into Hcatalog directly by making use of –hcatalog –database option with the –hcatalog –table but the limitation to it is that there are several arguments like –as-avro file , -direct, -as-sequencefile, -target-dir , -export-dir are not supported.
Q24 what are the majorly used commands in sqoop?
Answer: In Sqoop Majorly Import and export command are used. But below commands are also useful sometimes. codegen, eval, import-all-tables, job, list-database, list-tables, merge, metastore.
Q25 What is the usefulness of the options file in sqoop.
Answer: The options file is used in sqoop to specify the command line values in a file and use it in the sqoop commands.
For example the –connect parameter’s value and –user name value scan be stored in a file and used again and again with different sqoop commands.
Q26 what are the common delimiters and escape character in sqoop?
Answer: The default delimiters are a comma(,) for fields, a newline(\n) for records
Escape characters are \b,\n,\r,\t,\”, \\’,\o etc
Q27 What are the two file formats supported by sqoop for import?
Answer: Delimited text and Sequence Files.
Q28 while loading table from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do?
Answer: We need to use -direct argument in import command to use direct import fast path and this -direct can be used only with MySQL and PostGreSQL as of now.
Q29 How can you sync a exported table with HDFS data in which some rows are deleted?
Answer: Truncate the target table and load it again.
Q30 Differentiate between Sqoop and distCP.
Answer: DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.
Q31 How can you import only a subset of rows form a table?
Answer: By using the WHERE clause in the sqoop import statement we can import only a subset of rows.
Q32 How do you clear the data in a staging table before loading it by Sqoop?
Answer: By specifying the –clear-staging-table option we can clear the staging table before it is loaded. This can be done again and again till we get proper data in staging.
Q33 What is Sqoop?
Answer: Sqoop is an open source project that enables data transfer from non-hadoop source to hadoop source. It can be remembered as SQL to Hadoop -> SQOOP. It allows user to specify the source and target location inside the Hadoop.
Q34 Is it possible to do an incremental import using Sqoop?
Answer: Yes, Sqoop supports two types of incremental imports-
- Append
- Last Modified
To insert only rows Append should be used in import command and for inserting the rows and also updating Last-Modified should be used in the import command.
Q35 How can you export only a subset of columns to a relational table using sqoop?
Answer: By using the –column parameter in which we mention the required column names as a comma separated list of values.
Q36 Which database the sqoop metastore runs on?
Answer: Running sqoop-metastore launches a shared HSQLDB database instance on the current machine.
Q37 How will you update the rows that are already exported?
Answer: The parameter –update-key can be used to update existing rows. In it a comma-separated list of columns is used which uniquely identifies a row. All of these columns is used in the WHERE clause of the generated UPDATE query. All other table columns will be used in the SET part of the query.
Q38 You have a data in HDFS system, if you want to put some more data to into the same table, will it append the data or overwrite?
Answer: No it can’t overwrite, one way to do is copy the new file in HDFS.
Q39 Where can the metastore database be hosted?
Answer: The metastore database can be hosted anywhere within or outside of the Hadoop cluster.
Q40 Which is used to import data in Sqoop ?
Answer: In SQOOP import command is used to import RDBMS data into HDFS. Using import command we can import a particular table into HDFS.
Q41 What is the role of JDBC driver in a Sqoop set up?
Answer: To connect to different relational databases sqoop needs a connector. Almost every DB vendor makes this connecter available as a JDBC driver which is specific to that DB. So Sqoop needs the JDBC driver of each of the database it needs to interact with.
Q42 How to import only the updated rows form a table into HDFS using sqoop assuming the source has last update timestamp details for each row?
Answer: By using the lastmodified mode. Rows where the check column holds a timestamp more recent than the timestamp specified with –last-value are imported.
Q43 What is InputSplit in Hadoop?
Answer: When a hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called Input Split.
Q44 Hadoop sqoop word came from ?
Answer: Sql + Hadoop = sqoop
Q45 What is the work of Export In Hadoop sqoop ?
Answer: Export the data from HDFS to RDBMS
Q46 Use of Codegen command in Hadoop sqoop ?
Answer: Generate code to interact with database records
Q47 Use of Help command in Hadoop sqoop ?
Answer: List available commands
Q48 How can you schedule a sqoop job using Oozie?
Answer: Oozie has in-built sqoop actions inside which we can mention the sqoop commands to be executed.
Q49 What are the two file formats supported by sqoop for import?
Answer: Delimited text and Sequence Files.
Q50 What is a sqoop metastore?
Answer: It is a tool using which Sqoop hosts a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the –meta-connect argument.