Top 48 HBase Interview Questions

raju2006
July 02, 2016

Q1 What are the different types of tombstone markers in HBase for deletion?

Answer:

There are three types of tombstone markers in HBase for deletion:

  • Family Delete Marker: marks all columns of a column family.
  • Version Delete Marker: marks a single version of a column.
  • Column Delete Marker: marks all versions of a column.
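
The three marker types can be illustrated with a small toy model. This is only a sketch in plain Python, not the HBase client API: the `cells` layout, the marker tuples and the `visible` helper are all hypothetical, and it assumes that family and column markers mask every version at or below the marker's timestamp.

```python
# Toy model of HBase delete markers; NOT the HBase client API.
# A cell is keyed by (family, qualifier, timestamp).
cells = {
    ("cf", "name", 1): "alice",
    ("cf", "name", 2): "alicia",
    ("cf", "age", 1): "30",
    ("meta", "tag", 1): "x",
}

def visible(cells, markers):
    """Return the cells that are not masked by any tombstone marker.

    Each marker is a hypothetical tuple (kind, family, qualifier, timestamp).
    """
    def masked(family, qualifier, ts):
        for kind, m_family, m_qualifier, m_ts in markers:
            if m_family != family:
                continue
            # Family marker: hides every column of the family (assumed: ts <= marker ts).
            if kind == "family" and ts <= m_ts:
                return True
            # Column marker: hides all versions of one column (assumed: ts <= marker ts).
            if kind == "column" and m_qualifier == qualifier and ts <= m_ts:
                return True
            # Version marker: hides exactly one version of one column.
            if kind == "version" and m_qualifier == qualifier and ts == m_ts:
                return True
        return False

    return {key: value for key, value in cells.items() if not masked(*key)}

version_marker = [("version", "cf", "name", 2)]
column_marker = [("column", "cf", "name", 2)]
family_marker = [("family", "cf", None, 2)]
```

With the version marker only `("cf", "name", 2)` disappears, with the column marker both versions of `cf:name` disappear, and with the family marker everything in `cf` is hidden while the `meta` family is untouched.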

 

Q2 When should you use HBase and what are the key components of HBase?

Answer: HBase should be used when the big data application has:

  • A variable schema
  • Data stored in the form of collections
  • A need for key-based access when retrieving data

Key components of HBase are –

  • Region: contains the in-memory data store (MemStore) and the HFile.
  • Region Server: hosts and serves the Regions assigned to it.
  • HBase Master: responsible for monitoring the Region Servers.
  • ZooKeeper: takes care of the coordination between the HBase Master component and the client.
  • Catalog Tables: the two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.

 

Q3 Explain the difference between HBase and Hive.

Answer: HBase and Hive are completely different Hadoop-based technologies. Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports four primary operations: put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.

Q4 What is Row Key?

Answer: Every row in an HBase table has a unique identifier known as the RowKey. It is used for grouping cells logically, and it ensures that all cells with the same RowKey are co-located on the same server. Internally, the RowKey is treated as a byte array.

 

Q5 Explain the difference between RDBMS data model and HBase data model.

Answer:

  • RDBMS is a schema-based database, whereas HBase has a schema-less data model.
  • RDBMS does not have built-in support for partitioning, whereas HBase partitions data automatically.
  • RDBMS stores normalized data, whereas HBase stores de-normalized data.

 

Q6 What are the different operational commands in HBase at record level and table level?

Answer: Record-level operational commands in HBase are put, get, increment, scan and delete.

Table-level operational commands in HBase are describe, list, drop, disable and scan.

 

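Q7 Explain the different catalog tables in HBase.

Answer: The two important catalog tables in HBase are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system. Note that this describes older releases: from HBase 0.96 onward the ROOT table was removed, and the location of the META table (hbase:meta) is kept in ZooKeeper instead.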

 

Q8 Explain the process of row deletion in HBase.

Answer: On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.

 

Q9 What are column families? What happens if you alter the block size of a column family on an already populated database?

Answer: The logical division of data is represented by a key known as the column family. Column families are the basic unit of physical storage, on which features such as compression can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size, whereas new data that comes in takes the new block size. When compaction takes place, the old data is rewritten with the new block size, so the existing data continues to be read correctly.

 

Q10 Explain about HLog and WAL in HBase.

Answer: All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions served by that Region Server. WAL stands for Write Ahead Log; the HLog is HBase’s WAL, and all edits are written to it immediately. In the case of deferred log flush, WAL edits remain in memory until the flush period elapses.

 

Q11 What is NoSQL?

Answer: Apache HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

Q12 What is region server?

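Answer: A region server is the worker process that hosts regions and serves the reads and writes for them. The description “a file which lists the known region server names” applies to the regionservers file in the HBase conf directory, which lists the hosts that should run a region server.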

Q13 Name the key components of HBase.

Answer: The key components of HBase are Zookeeper, RegionServer, Region, Catalog Tables and HBase Master.

 

Q14 What is the reason for using HBase?

Answer: HBase is used because it provides random read and write operations and can perform a large number of operations per second on large data sets.

Q15 Define standalone mode in HBase.

Answer: It is the default mode of HBase. In standalone mode, HBase does not use HDFS (it uses the local filesystem instead), and it runs all HBase daemons and a local ZooKeeper in the same JVM process.

 

Q16 Which operating system is supported by HBase?

Answer: HBase runs on operating systems that support Java, such as Linux and Windows.

 

Q17 What are the main features of Apache HBase?

Answer: Apache HBase supports both linear and modular scaling. HBase tables are distributed across the cluster via regions, and regions are automatically split and re-distributed as your data grows (automatic sharding). HBase also supports a Block Cache and Bloom Filters for high-volume query optimization.
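
To show why a Bloom filter helps with lookups, here is a minimal sketch of the general technique in plain Python. It is not HBase's implementation: the class name, the bit-array size and the SHA-256 based hashing are all assumptions made for illustration. The idea is that a filter can answer "definitely not present" cheaply, so a store file that cannot contain a row never has to be read.

```python
import hashlib

class ToyBloomFilter:
    """Generic Bloom filter sketch; sizes and hash choice are illustrative assumptions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

bloom = ToyBloomFilter()
bloom.add(b"row-42")
print(bloom.might_contain(b"row-42"))  # True: an added key is always reported
```

A key that was added is always reported as possibly present; a key that was never added is usually, but not always, reported as absent, which is the trade-off a Bloom filter makes for its small size.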

 

Q18 What is the difference between HDFS/Hadoop and HBase?

Answer: HDFS does not provide fast lookup of individual records in a file, whereas HBase provides fast record lookups for large tables.

 

Q19 What are data model operations in HBase?

Answer:

  • Get: returns attributes for a specified row. Gets are executed via HTable.get.
  • Put: either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists). Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer).
  • Scan: allows iteration over multiple rows for specified attributes.
  • Delete: removes a row from a table. Deletes are executed via HTable.delete.

 

HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compaction.

 

Q20 How many filters are available in Apache HBase?

Answer: The following 18 filters are supported by HBase:

  • ColumnPrefixFilter
  • TimestampsFilter
  • PageFilter
  • MultipleColumnPrefixFilter
  • FamilyFilter
  • ColumnPaginationFilter
  • SingleColumnValueFilter
  • RowFilter
  • QualifierFilter
  • ColumnRangeFilter
  • ValueFilter
  • PrefixFilter
  • SingleColumnValueExcludeFilter
  • ColumnCountGetFilter
  • InclusiveStopFilter
  • DependentColumnFilter
  • FirstKeyOnlyFilter
  • KeyOnlyFilter

 

Q21 Does HBase support SQL?

Answer: Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. Apache Phoenix can be used to retrieve data from HBase with SQL queries.

 

Q22 Is there any difference between HBase data model and RDBMS data model?

Answer: In HBase, data is stored in tables (which have rows and columns), similar to an RDBMS, but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
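
The multi-dimensional map view can be sketched with nested Python dictionaries. This is only an illustration of the idea, not how HBase stores data: the `table` literal and the `get` helper are hypothetical, and the nesting order row key → column family → column qualifier → timestamp → value is the assumed model.

```python
# An HBase table viewed as a nested map (illustrative only):
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    b"row1": {
        "cf": {
            "name": {2: b"alicia", 1: b"alice"},
            "age": {1: b"30"},
        },
    },
}

def get(table, row, family, qualifier, ts=None):
    """Return the value at a timestamp, or the newest version if ts is None."""
    versions = table[row][family][qualifier]
    if ts is None:
        ts = max(versions)  # the newest version has the largest timestamp
    return versions[ts]

print(get(table, b"row1", "cf", "name"))        # b'alicia'
print(get(table, b"row1", "cf", "name", ts=1))  # b'alice'
```

Reading without a timestamp returns the newest version, which is why an update in this model is just another entry in the innermost map rather than an in-place change.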

 

Q23 What is Apache HBase?

Answer: Apache HBase originated as a sub-project of Apache Hadoop and was designed as a NoSQL database (the Hadoop database): a distributed, scalable big data store. Use Apache HBase when you need random, real-time read/write access to your big data, for example tables with billions of rows by millions of columns hosted atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modelled after Google’s Bigtable. It provides Bigtable-like capabilities on top of Hadoop and HDFS.

 

Q24 What is the use of shutdown command?

Answer: It is used to shut down the cluster.

 

Q25 How do you delete a table with the shell?

Answer: To delete a table, first disable it with the disable command, then remove it with the drop command.

 

Q26 What is the full form of MSLAB?

Answer: MSLAB stands for Memstore-Local Allocation Buffer.

 

Q27 What is REST?

Answer: REST stands for Representational State Transfer, which defines the semantics so that the protocol can be used in a generic way to address remote resources. It also provides support for different message formats, offering many choices for a client application to communicate with the server.

 

Q28 What Is The Difference Between HBase and Hadoop/HDFS?

Answer: HDFS is a distributed file system that is well suited for the storage of large files. Its documentation states that it is not, however, a general-purpose file system, and that it does not provide fast individual record lookups in files.

HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed “StoreFiles” that exist on HDFS for high-speed lookups.

 

Q29 How many operational commands are there in HBase?

Answer: There are five main commands in HBase:

  1. Get
  2. Put
  3. Delete
  4. Scan
  5. Increment

 

Q30 Why can’t I iterate through the rows of a table in reverse order?

Answer: Because of the way HFile works: for efficiency, column values are put on a disk with the length of the value written first and then the bytes of the actual value written second. To navigate through these values in reverse order, these length values would need to be stored twice (at the end as well) or in a side file. A robust secondary index implementation is the likely solution here to ensure the primary use case remains fast.
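
The length-first layout described above can be sketched in a few lines of Python. The 4-byte big-endian length prefix and the helper names are assumptions chosen for illustration; this is not the actual HFile format. The point is that a reader can only walk forward, because it must read each length before it knows where the next value starts.

```python
import struct

def encode(values):
    """Write each value as a 4-byte big-endian length followed by its bytes."""
    return b"".join(struct.pack(">I", len(v)) + v for v in values)

def decode_forward(buf):
    """Walk the buffer front to back: read a length, then that many bytes."""
    out, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        out.append(buf[pos:pos + length])
        pos += length
    return out

buf = encode([b"a", b"bcd", b""])
print(decode_forward(buf))  # [b'a', b'bcd', b'']
```

Starting from the end of `buf` there is no way to tell where the last value begins without a trailing copy of its length, which is the extra storage the answer refers to.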

 

Q31 What is HBase?

Answer: HBase is a column-oriented database management system which runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.

In HBase, a master node manages the cluster, and region servers store portions of the tables and perform the work on the data.

 

Q32 How do you connect to HBase?

Answer: A connection to HBase can be established interactively through the HBase shell, which is a command-line interface, or programmatically through the Java client API.

 

Q33 Why is HBase described as schemaless?

Answer: Other than the column family name, HBase doesn’t require you to tell it anything about your data ahead of time. That’s why HBase is often described as a schema-less database.

 

Q34 What is an HFile?

Answer: All columns in a column family are stored together in the same low-level storage file, called an HFile.

 

Q35 How is data written into HBase?

Answer: When data is updated it is first written to a commit log, called a write-ahead log (WAL) in HBase, and then stored in the in-memory memstore. Once the data in memory has exceeded a given maximum value, it is flushed as an HFile to disk. After the flush, the commit logs can be discarded up to the last unflushed modification.
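
The write path above can be sketched as a toy class. Everything here is hypothetical and simplified: `ToyStore` is not an HBase class, the flush threshold counts entries rather than bytes, and the "HFile" is just a sorted dictionary.

```python
class ToyStore:
    """Toy write path: WAL first, then memstore, then flush to an immutable file."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # commit log: appended to before anything else
        self.memstore = {}   # in-memory store of recent edits
        self.hfiles = []     # flushed, immutable "files" (oldest first)
        self.flush_threshold = flush_threshold  # assumed: entry count, not bytes

    def put(self, row, value):
        self.wal.append((row, value))  # 1. record the edit in the log
        self.memstore[row] = value     # 2. apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()               # 3. flush once the memstore is full

    def flush(self):
        # Persist the memstore as a sorted file, then drop the log entries
        # that are now covered by data on disk.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()
        self.wal.clear()

store = ToyStore(flush_threshold=2)
store.put("b", "1")
store.put("a", "2")   # reaches the threshold and triggers a flush
print(store.hfiles)   # [{'a': '2', 'b': '1'}]
```

After the flush the memstore and the log are both empty, because every logged edit now exists in a flushed file.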

 

Q36 How is data read back from HBase?

Answer: Reading data back involves a merge of what is stored in the memstores, that is, the data that has not been written to disk, and the on-disk store files. Note that the WAL is never used during data retrieval, but solely for recovery purposes when a server has crashed before writing the in-memory data to disk.
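
A minimal sketch of that merge, assuming each source maps a row key to a `(timestamp, value)` pair and that the newest timestamp wins. The `read` function and the data shapes are hypothetical and much simpler than HBase's real scanners; note that the WAL does not appear anywhere in the read path.

```python
def read(row, memstore, store_files):
    """Merge the memstore and on-disk store files; the newest timestamp wins.

    Each source maps row -> (timestamp, value). The WAL is not consulted.
    """
    candidates = [src[row] for src in [memstore, *store_files] if row in src]
    if not candidates:
        return None
    return max(candidates)[1]  # tuples compare by timestamp first

store_files = [{"r1": (1, "old"), "r2": (1, "kept")}, {"r1": (2, "newer")}]
memstore = {"r1": (3, "newest")}

print(read("r1", memstore, store_files))  # newest
print(read("r2", memstore, store_files))  # kept
```

A row that exists only on disk is still found, and a row present in several places resolves to its most recent version.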

 

Q37 What is the role of ZooKeeper in HBase?

Answer: ZooKeeper maintains configuration information, provides distributed synchronization, and also maintains the communication between clients and region servers.

 

Q38 What are the different types of filters used in HBase?

Answer: Filters are used to get specific data from an HBase table rather than all the records.

They are of the following types:

  • Column Value filters
  • Column Value comparators
  • KeyValue Metadata filters
  • RowKey filters

 

Q39 How does HBase provide high availability?

Answer: HBase uses a feature called region replication. With this feature, each region of a table has multiple replicas that are opened on different RegionServers. The load balancer ensures that the replicas of a region are not co-hosted on the same RegionServer.

 

Q40 What is the row key?

Answer: The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
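
Because row keys are compared as raw bytes, the sort order an application gets depends on how it encodes the key. The short Python sketch below uses made-up key names to show the usual surprise with numeric suffixes, and how zero-padding (one possible key design, chosen here only as an example) restores numeric order.

```python
# Row keys are byte arrays, so they sort byte by byte, not numerically.
row_keys = [b"row10", b"row2", b"row1"]
print(sorted(row_keys))  # [b'row1', b'row10', b'row2']

# Zero-padding the numeric part makes byte order match numeric order.
padded = sorted(f"row{n:04d}".encode() for n in (10, 2, 1))
print(padded)  # [b'row0001', b'row0002', b'row0010']
```

The same reasoning applies to any composite key: whatever the application puts first in the byte array dominates the ordering.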

 

Q41 What are the different compaction types in HBase?

Answer: There are two types of compaction: major and minor. In a minor compaction, adjacent small HFiles are merged into a single HFile, without removing deleted data. The files to be merged are chosen by HBase’s compaction policy rather than by the user.

In a major compaction, all the HFiles of a column family are merged and a single HFile is created. Deleted data is discarded. Major compaction is generally triggered manually.
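
The difference between the two compaction types can be sketched with dictionaries standing in for HFiles. This is a toy model under stated assumptions: `TOMBSTONE` is a hypothetical sentinel for a delete marker, files are listed oldest first, and a newer entry simply overrides an older one for the same key.

```python
TOMBSTONE = object()  # hypothetical sentinel standing in for a delete marker

def minor_compact(hfiles):
    """Merge files (oldest first) into one, keeping delete markers."""
    merged = {}
    for hfile in hfiles:
        merged.update(hfile)  # newer files override older entries
    return merged

def major_compact(hfiles):
    """Merge all files into one and drop deleted entries and their markers."""
    merged = minor_compact(hfiles)
    return {key: value for key, value in merged.items() if value is not TOMBSTONE}

older = {"a": "1", "b": "2"}
newer = {"a": TOMBSTONE, "c": "3"}

print(sorted(minor_compact([older, newer])))  # ['a', 'b', 'c']
print(major_compact([older, newer]))          # {'b': '2', 'c': '3'}
```

After the minor compaction the marker for `a` is still present, so the deleted value stays hidden; only the major compaction actually reclaims the space.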

 

Q42 What is TTL (Time to live) in HBase?

Answer: TTL is a data retention technique by which a version of a cell is preserved for a specified time period. Once that period has elapsed, the version is removed.
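
The expiry rule can be written as a one-line check. The function below is a hypothetical sketch, assuming the TTL is given in seconds and cell timestamps are in milliseconds since the epoch.

```python
def is_expired(cell_ts_ms, ttl_seconds, now_ms):
    """A cell version is expired once its age exceeds the TTL.

    Assumes timestamps in milliseconds and the TTL in seconds.
    """
    return now_ms - cell_ts_ms > ttl_seconds * 1000

now = 1_000_000
print(is_expired(now - 5_000, ttl_seconds=10, now_ms=now))   # False: 5 s old
print(is_expired(now - 15_000, ttl_seconds=10, now_ms=now))  # True: 15 s old
```

Each version is checked on its own timestamp, so an old version of a cell can expire while a newer version of the same cell is still retained.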

 

Q43 What is log splitting in HBase?

Answer: A WAL file is shared by all the regions of a RegionServer. When a region is opened, for example during recovery after a server crash, the edits in the WAL file which belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that particular sets can be replayed to regenerate the data in a particular region. The process of grouping the WAL edits by region is called log splitting.
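
The grouping step itself is simple to sketch. The function and the `(region, edit)` tuple shape below are hypothetical; the real WAL entries carry much more information, but the idea of bucketing interleaved edits by region while preserving their order is the same.

```python
from collections import defaultdict

def split_log(wal_entries):
    """Group interleaved WAL edits by region, preserving their original order."""
    by_region = defaultdict(list)
    for region, edit in wal_entries:
        by_region[region].append(edit)
    return dict(by_region)

wal = [
    ("region-A", "put r1"),
    ("region-B", "put r9"),
    ("region-A", "delete r1"),
]
print(split_log(wal))
# {'region-A': ['put r1', 'delete r1'], 'region-B': ['put r9']}
```

Keeping the edits in their original order within each group matters, because replaying a delete before the put it follows would produce different data.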

 

Q45 Why is MultiWAL needed?

Answer: With a single WAL per RegionServer, the RegionServer must write to the WAL serially, because HDFS files must be sequential. This causes the WAL to be a performance bottleneck. MultiWAL addresses this by allowing a RegionServer to write multiple WAL streams in parallel.

 

Q46 What are the different Block Caches in HBase?

Answer: HBase provides two different BlockCache implementations: the default on-heap LruBlockCache and the BucketCache, which is (usually) off-heap.
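
As a rough illustration of the least-recently-used idea behind an LRU block cache, here is a plain LRU cache in Python. It is a generic sketch, not HBase's LruBlockCache: the class name, the capacity counted in blocks and the string block ids are assumptions made for the example.

```python
from collections import OrderedDict

class ToyLruBlockCache:
    """Plain LRU cache sketch; capacity is a block count (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used entry comes first

    def get(self, block_id):
        if block_id not in self.blocks:
            return None                    # cache miss
        self.blocks.move_to_end(block_id)  # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used

cache = ToyLruBlockCache(capacity=2)
cache.put("block-a", b"A")
cache.put("block-b", b"B")
cache.get("block-a")        # touch a, so b becomes the eviction candidate
cache.put("block-c", b"C")  # evicts block-b
print(list(cache.blocks))   # ['block-a', 'block-c']
```

Blocks that keep being read stay in memory, while blocks that have not been touched recently are the first to be evicted when space runs out.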

 

Q47 Can you create an HBase table without assigning a column family?

Answer: No. The column family also determines how the data is stored physically in the HDFS file system, so a table must always have at least one column family. Column families can also be altered after the table has been created.

 

Q48 What is an HFile?

Answer: The HFile is the underlying storage format for HBase.

HFiles belong to a column family and a column family can have multiple HFiles.

But a single HFile can’t have data for multiple column families.