Top 50 Interview Questions for HDFS

raju2006
May 05, 2016

 


Q1. What does the ‘jps’ command do?

Answer: It shows the status of the daemons that run the Hadoop cluster. The output lists the status of the Namenode, Datanode, Secondary Namenode, Jobtracker and Tasktracker processes.
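
For example, a run on a pseudo-distributed Hadoop 1.x node might look like this (the process IDs are illustrative):

  $ jps
  4528 NameNode
  4662 DataNode
  4789 SecondaryNameNode
  4901 JobTracker
  5032 TaskTracker
  5147 Jps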

 

Q2. What if a Namenode has no data?

Answer: It cannot be part of the Hadoop cluster.

 

Q3. What happens to the job tracker when the Namenode is down?

Answer: When the Namenode is down, the whole cluster is effectively off, because the Namenode is the single point of failure in HDFS.

 

Q4. What is a Namenode?

Answer: The Namenode is the master node on which the job tracker runs; it holds the metadata for HDFS. It maintains and manages the blocks that are present on the datanodes. It should be a highly reliable machine, because it is the single point of failure in HDFS.

 

Q5. Replication causes data redundancy, so why is it pursued in HDFS?

Answer: HDFS works with commodity hardware (systems with average configurations) that has a high chance of crashing at any time. To make the entire system highly fault-tolerant, HDFS therefore replicates data and stores it in different places. By default, any data on HDFS is stored at 3 different locations. So even if one copy is corrupted and another is unavailable for some time for any reason, the data can still be accessed from the third. Hence there is little chance of losing the data. This replication factor is what gives Hadoop its fault tolerance.
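
As a sketch, the replication factor of an existing file can be checked and changed from the shell (the path here is hypothetical):

  # the second column of the listing is the replication factor
  $ hadoop fs -ls /data/input.txt
  # raise it to 4 and wait (-w) until re-replication completes
  $ hadoop fs -setrep -w 4 /data/input.txt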

 

Q6. What is a Datanode?

Answer: Datanodes are the slaves, deployed on each machine, that provide the actual storage. They are responsible for serving read and write requests from clients.

 

Q7. Why do we use HDFS for applications having large data sets and not when there are lots of small files?

Answer: HDFS is more suitable for a large amount of data in a single file than for the same data spread across many small files. This is because the Namenode holds the metadata for every file and block in its main memory, and it is a very expensive, high-performance system, so it is not prudent to fill its memory with the unnecessarily large amount of metadata that many small files generate. For example, a 640 MB data set stored as one file of ten 64 MB blocks needs only a handful of metadata entries, while the same data stored as 10,000 small files needs at least 10,000 entries. So when there is a large amount of data in a single file, the Namenode occupies less memory. Hence, for optimized performance, HDFS favors large data sets over many small files.

 

Q8. Explain the major difference between an HDFS block and an InputSplit.

Answer: In simple terms, a block is the physical representation of data, while a split is the logical representation of the data present in the blocks. A split acts as an intermediary between a block and a mapper. Suppose we have two blocks:

Block 1: intell

Block 2: ipaat

Now, a map reading the first block will read from ‘in’ till ‘ll’, but it does not know how to process the second block at the same time. Here the split comes into play: it forms a logical group of Block 1 and Block 2 as a single unit. The framework then forms key-value pairs using the InputFormat and record reader and hands each map its InputSplit for further processing. If you have limited resources, you can increase the split size to limit the number of maps. For instance, if a file has 10 blocks of 64 MB each (640 MB in total) and resources are limited, you can set the split size to 128 MB. This forms logical groups of 128 MB, so only 5 maps execute. However, if splitting is disabled, the whole file forms one InputSplit and is processed by a single map, which consumes more time when the file is big.
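
As a hedged sketch, the split size can be raised from the command line, assuming the job parses generic options via ToolRunner (the jar, class and paths are hypothetical; on Hadoop 2.x the property is mapreduce.input.fileinputformat.split.minsize):

  # ask for 128 MB minimum splits so two 64 MB blocks form one split
  $ hadoop jar wordcount.jar WordCount \
      -D mapred.min.split.size=134217728 \
      /data/in /data/out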

 

Q9. What is a ‘block’ in HDFS?

Answer: A ‘block’ is the minimum amount of data that can be read or written. In HDFS, the default block size is 64 MB, in contrast to a block size of 8192 bytes in Unix/Linux. Files in HDFS are broken down into block-sized chunks, which are stored as independent units. HDFS blocks are large compared to disk blocks, chiefly to minimize the cost of seeks. If a particular file is 50 MB, will the HDFS block still consume 64 MB as the default size? No, not at all! 64 MB is just the maximum unit in which data is stored. In this situation, only 50 MB is consumed by the HDFS block, and the remaining 14 MB is free to store something else. It is the master node (Namenode) that does data allocation in an efficient manner.
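
To see this in practice, you can list a file's blocks and where they live (the path is hypothetical):

  $ hadoop fsck /data/input.txt -files -blocks -locations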

 

Q10. Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value 3.

Answer: The replication factor is a property of HDFS that can be set for the entire cluster to adjust the number of times blocks are replicated, ensuring high data availability. For every block stored in HDFS, the cluster holds n-1 duplicate blocks. So if the replication factor during the PUT operation is set to 1 instead of the default value 3, there will be only a single copy of the data, and if the DataNode holding that copy crashes for any reason, the data on that block is lost.
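
A minimal sketch of such a PUT, overriding the replication factor for a single command (the paths are hypothetical):

  $ hadoop fs -D dfs.replication=1 -put localfile.txt /data/localfile.txt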

 

Q11. What are the most common Input Formats in Hadoop?

Answer: There are three most common input formats in Hadoop:

  • Text Input Format: Default input format in Hadoop
  • Key Value Input Format: used for plain text files where the files are broken into lines
  • Sequence File Input Format: used for reading files in sequence

 

Q12. What is commodity hardware?

Answer: Commodity hardware refers to inexpensive systems that do not have high availability or high quality. Commodity hardware still includes RAM, because there are specific services that need to execute in RAM. Hadoop can run on any commodity hardware and does not require supercomputers or high-end hardware configurations to execute jobs.

 

Q13. What are the port numbers for the NameNode, Secondary NameNode, DataNodes, TaskTracker and JobTracker?

Answer: The default web UI (HTTP) ports are listed below; an example of fetching them follows the list.

  • NameNode 50070
  • Secondary NameNode 50090
  • DataNodes 50075
  • JobTracker 50030
  • TaskTracker 50060
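
Because these are HTTP ports, the status pages can be fetched directly; a sketch with hypothetical host names (the page names are the Hadoop 1.x ones):

  $ curl http://namenode-host:50070/dfshealth.jsp
  $ curl http://jobtracker-host:50030/jobtracker.jsp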

 

Q14. Explain the process of inter-cluster data copying.

Answer: HDFS provides a distributed data copying facility, DistCp, that copies data from a source to a destination. When the copy runs between two different Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires the source and destination to have the same or a compatible version of Hadoop.
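
A sketch of an inter-cluster copy (the NameNode host names and paths are hypothetical):

  $ hadoop distcp hdfs://nn1:8020/source/path hdfs://nn2:8020/dest/path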

 

Q15. What is a heartbeat in HDFS?

Answer: A heartbeat is a signal indicating that a node is alive. A datanode sends heartbeats to the Namenode, and a task tracker sends heartbeats to the job tracker. If the Namenode or the job tracker stops receiving heartbeats, it concludes that there is some problem: the datanode is down, or the task tracker is unable to perform its assigned tasks.

 

Q16. Explain the difference between NAS and HDFS.

Answer:

  • NAS runs on a single machine, so there is no data redundancy, whereas HDFS runs on a cluster of different machines and has data redundancy because of the replication protocol.
  • NAS stores data on dedicated hardware, whereas in HDFS all the data blocks are distributed across the local drives of the machines.
  • In NAS, data is stored independently of the computation, so Hadoop MapReduce cannot be used for processing, whereas HDFS works with Hadoop MapReduce because the computations are moved to the data.

 

Q17. Explain the indexing process in HDFS.

Answer: The indexing process in HDFS depends on the block size. HDFS stores the last part of the data, which in turn points to the address where the next part of the data chunk is stored.

 

Q18. What is rack awareness, and on what basis is data stored in a rack?

Answer: All the data nodes put together form a storage area, and the physical location of a group of data nodes is referred to as a rack in HDFS. The NameNode acquires the rack information, i.e. the rack id of each data node. The process of selecting closer data nodes based on this rack information is known as rack awareness. The contents of a file are divided into data blocks as soon as the client is ready to load the file into the Hadoop cluster. After consulting the NameNode, the client gets 3 data nodes for each data block. For each data block, two copies exist in one rack and the third copy is placed in another rack, to ensure that if an entire rack fails we still have one copy in another rack. This is generally referred to as the Replica Placement Policy.

 

Q19. How does the NameNode handle data node failures?

Answer: Through heartbeats and block reports: every data node periodically sends a heartbeat to the NameNode, and if heartbeats stop arriving, the NameNode marks that data node as dead and re-replicates its blocks elsewhere. Data corruption, in turn, is caught through checksums: every piece of data has a record followed by a checksum, and if the checksum does not match the original, a data-corruption error is reported.

 

Q20. What is HDFS?

Answer: The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project. HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

 

Q21. What are the key features of HDFS?

Answer: HDFS is highly fault-tolerant, provides high throughput, is suitable for applications with large data sets, offers streaming access to file system data, and can be built out of commodity hardware.

 

Q22. What is throughput? How does HDFS achieve good throughput?

Answer: Throughput is the amount of work done in unit time. It describes how fast data can be accessed from the system, and it is usually used to measure the performance of a system. In HDFS, when we want to perform a task or an action, the work is divided and shared among different systems, so all the systems execute their assigned sub-tasks independently and in parallel, and the work completes in a very short period of time. In this way HDFS gives good throughput: by reading data in parallel, we decrease the actual time to read data tremendously.

 

Q23. What is data integrity in HDFS?

Answer: HDFS transparently checksums all data written to it and, by default, verifies the checksums when reading data. A separate checksum is created for every chunk of data (the default chunk is 512 bytes; a CRC-32 checksum is only 4 bytes, so the overhead is small). Datanodes are responsible for verifying the data they receive before storing the data and its checksums. It is possible to disable verification by passing false to the setVerifyChecksum() method on the FileSystem before using the open() method to read a file.
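
From the shell, checksums can be inspected or skipped when reading; a sketch (the path is hypothetical, and -checksum requires Hadoop 2.x or later):

  # print the stored checksum of a file
  $ hadoop fs -checksum /data/input.txt
  # copy to local disk without client-side checksum verification
  $ hadoop fs -get -ignoreCrc /data/input.txt ./input.txt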

 

Q24. What modes can Hadoop be run in?

Answer: Hadoop can run in three modes (a quick way to check which mode an installation is in follows the list):

  1. Standalone Mode: The default mode of Hadoop; it uses the local file system for input and output operations. This mode is mainly used for debugging, and it does not support the use of HDFS. Further, in this mode no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster than the other modes.
  2. Pseudo-Distributed Mode (Single Node Cluster): In this case you need configuration in all three files mentioned above. All daemons run on one node, so the Master and Slave nodes are the same machine.
  3. Fully Distributed Mode (Multiple Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes of a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
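
As a sketch, the configured default filesystem reveals the mode (on Hadoop 1.x the property is fs.default.name rather than fs.defaultFS):

  # file:/// means standalone; hdfs://localhost:... means pseudo-distributed
  $ hdfs getconf -confKey fs.defaultFS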

 

Q25. What are the core components of Hadoop?

Answer: Core components of Hadoop are HDFS and MapReduce. HDFS is basically used to store large data sets and MapReduce is used to process such large data sets.

 

Q26. What is metadata?

Answer: Metadata is the information about the data stored in datanodes, such as the location of a file, the size of the file and so on.

 

Q27. What happens when two clients try to write into the same HDFS file?

Answer: HDFS supports exclusive writes only. When the first client contacts the name-node to open the file for writing, the name-node grants a lease to that client to create the file. When the second client tries to open the same file for writing, the name-node sees that the lease for the file has already been granted to another client and rejects the open request of the second client.

 

Q28. What is a daemon?

Answer: A daemon is a process or service that runs in the background. The word is generally used in the UNIX environment. The equivalent of a daemon in Windows is a “Service” and in DOS a “TSR”.

 

Q29. What are file permissions in HDFS?

Answer: HDFS has a permission model for files and directories that is much like POSIX. There are three types of permissions:

  • read permission (r)
  • write permission (w)
  • execute permission (x)

Each file and directory has an owner, a group and a mode.
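
A minimal sketch of inspecting and changing permissions (the paths, user and group are hypothetical):

  # the first columns of a listing show the mode, owner and group
  $ hadoop fs -ls /data
  # owner gets read/write, everyone else read-only
  $ hadoop fs -chmod 644 /data/input.txt
  # change the owner and group
  $ hadoop fs -chown alice:analysts /data/input.txt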

 

Q30. What does Data Locality mean?

Answer: Data locality means processing the data where it resides. It simply means that Hadoop MapReduce does its best to schedule map tasks and reduce tasks so that most tasks read their input data from the local machine. In certain scenarios, mainly in the reduce phase, exceptions to data locality may be needed.

 

Q31. What is the process to change files at arbitrary locations in HDFS?

Answer: HDFS does not support modifications at arbitrary offsets in a file, nor multiple writers; files are written by a single writer in append-only fashion, i.e. writes to a file in HDFS are always made at the end of the file.
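
The append-only model is visible from the shell; Hadoop 2.x exposes it through the appendToFile command (the file names are hypothetical):

  # add local content to the end of an existing HDFS file
  $ hadoop fs -appendToFile morelines.txt /data/log.txt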

 

Q32. What is the process of indexing in HDFS?

Answer: Once data is stored, HDFS depends on the last part of each block to find out where the next part of the data is stored.

 

Q33. What is the difference between Hadoop fs -copyFromLocal and Hadoop fs -moveFromLocal?

Answer: Hadoop fs -put and Hadoop fs -copyFromLocal do the same thing: they copy data from the local file system to HDFS and the local copy remains afterwards, so they work like copy & paste. Hadoop fs -moveFromLocal works like cut & paste: it moves the file from the local file system to HDFS, and no local copy remains.
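
A sketch of the two commands side by side (the file names are hypothetical):

  # copy: report.txt is still on the local disk afterwards
  $ hadoop fs -copyFromLocal report.txt /data/report.txt
  # move: report2.txt is deleted locally after a successful upload
  $ hadoop fs -moveFromLocal report2.txt /data/report2.txt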

 

Q34. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?

Answer: A file will appear in the namespace as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.

 

Q35. What is the Secondary NameNode?

Answer: Its main role is to periodically merge the namespace image with the edit log, to prevent the edit log from becoming too large. It is not a substitute for the Namenode, so if the Namenode fails, the entire Hadoop system still goes down.

 

Q36. What is the default block size in HDFS?

Answer: In Hadoop 1.x releases the default block size in HDFS is 64 MB; from Hadoop 2.x onwards it is 128 MB.
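
As a sketch, the configured value can be read and overridden per upload (dfs.blocksize is the Hadoop 2.x property name; the paths are hypothetical):

  # print the configured default block size in bytes
  $ hdfs getconf -confKey dfs.blocksize
  # upload one file with a 128 MB block size regardless of the default
  $ hadoop fs -D dfs.blocksize=134217728 -put bigfile.dat /data/bigfile.dat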

 

Q37. What are the limitations of the HDFS file system?

Answer: HDFS supports reads, writes, appends and deletes efficiently, but it does not support file updates. HDFS is also not suitable for a large number of small files; it best suits large files, because the file system namespace maintained by the Namenode is limited by the Namenode's main memory capacity (the namespace is held in memory), and a large number of files results in a big fsimage file.

 

Q38. Is there an easy way to see the status and health of a cluster?

Answer: Yes. There are web-based interfaces to both the JobTracker (MapReduce master) and the NameNode (HDFS master) that display status pages about the state of the entire system, by default on the ports listed in Q13. The JobTracker status page shows the state of all nodes, as well as the job queue and the status of all currently running jobs and tasks. The NameNode status page shows the state of all nodes and the amount of free space, and lets you browse the DFS via the web.

 

Q39. How do you debug a performance issue or a long-running job?

Answer: This is an open-ended question, and the interviewer is trying to gauge the level of hands-on experience you have in solving production issues. Use your day-to-day work experience to answer it. At a very high level you will follow these steps:

  • Understand the symptom
  • Analyze the situation
  • Identify the problem areas
  • Propose solution

 

Q40. What is a sequence file in Hadoop?

Answer: A sequence file is used to store binary key/value pairs. Sequence files support splitting even when the data inside the file is compressed, which is not possible with a regular compressed file. You can choose record-level compression, in which the value of each key/value pair is compressed, or block-level compression, where multiple records are compressed together. Consider this scenario: in a MapReduce system the HDFS block size is 64 MB, the input format is FileInputFormat, and we have 3 files of size 64 KB, 65 MB and 127 MB. How many input splits will the Hadoop framework make?

Hadoop will make 5 splits, as follows:

  • 1 split for the 64 KB file
  • 2 splits for the 65 MB file
  • 2 splits for the 127 MB file

 

Q41. What happens when a datanode fails?

Answer: When a datanode fails:

  • The Jobtracker and the Namenode detect the failure
  • All tasks that were running on the failed node are re-scheduled on other nodes
  • The Namenode replicates the user's data to another node

 

Q42. What is the benefit of the distributed cache? Why can't we just have the file in HDFS and let the application read it?

Answer: The distributed cache is much faster: it copies the file to every task tracker once, at the start of the job. If a task tracker then runs 10 or 100 mappers or reducers, they all use that same local copy. On the other hand, if your code reads the file from HDFS inside the MR job, every mapper accesses it from HDFS, so a task tracker running 100 map tasks reads the file 100 times from HDFS. HDFS is also not very efficient when used this way.

 

Q43. What happens to a NameNode that has no data?

Answer: There is no such thing as a NameNode without data. If it is a NameNode, it will have some sort of data (the file system metadata) in it.

 

Q44. What are a block and the block scanner in HDFS?

Answer: Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default size of a block in HDFS is 64 MB. Block Scanner – The block scanner tracks the list of blocks present on a DataNode and verifies them to find checksum errors. Block scanners use a throttling mechanism to conserve disk bandwidth on the datanode.

 

Q45. Why is a block in HDFS so large?

Answer: HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly longer than the time to seek to the start of the block, so reading a large file is dominated by transfer time rather than seek time.

 

Q46. What is HDFS High-Availability?

Answer: The 2.x release series of Hadoop adds support for HDFS high-availability (HA). In this implementation there is a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.

 

Q47. What are some typical functions of the JobTracker?

Answer: The following are typical tasks of the JobTracker:

  • Client applications submit MapReduce jobs to the JobTracker
  • The JobTracker talks to the NameNode to determine the location of the data
  • The JobTracker locates TaskTracker nodes with available slots at or near the data
  • The JobTracker submits the work to the chosen TaskTracker nodes
  • The TaskTracker nodes are monitored; if they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker
  • When the work is completed, the JobTracker updates its status
  • Client applications can poll the JobTracker for information

 

Q48. How does one switch off “SAFEMODE” in HDFS?

Answer: You use the command: hadoop dfsadmin -safemode leave
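
The related sub-commands, as a sketch (on Hadoop 2.x the same commands are available via hdfs dfsadmin):

  $ hadoop dfsadmin -safemode get      # report whether safe mode is on
  $ hadoop dfsadmin -safemode leave    # switch safe mode off
  $ hadoop dfsadmin -safemode enter    # switch it back on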

 

Q49. What is streaming access?

Answer: As HDFS works on the principle of ‘Write Once, Read Many’, streaming access is extremely important in HDFS. HDFS focuses not so much on storing the data as on how to retrieve it at the fastest possible speed, especially while analyzing logs. In HDFS, reading the complete data set quickly is more important than the time taken to fetch a single record from it.

 

Q50. Is the Namenode also a commodity?

Answer: No. The Namenode can never be commodity hardware, because the entire HDFS relies on it. It is the single point of failure in HDFS, so the Namenode has to be a high-availability machine.