Top Flume Interview Questions

What are the core components of Apache Flume?

Answer: The core components of Flume are –

  • Event – A single log entry or unit of data that is transported.
  • Source – The component through which data enters Flume workflows.
  • Sink – The component responsible for transporting data to the desired destination.
  • Channel – The conduit between the Source and the Sink that buffers events.
  • Agent – Any JVM process that runs Flume.
  • Client – The component that transmits events to a source operating within the agent.
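
For illustration, here is a minimal single-agent configuration wiring these components together; the agent and component names (a1, r1, c1, k1) are hypothetical:

    # Hypothetical agent "a1": a netcat source feeding an in-memory
    # channel that a logger sink drains.
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # Source: listens for events on a TCP port.
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 44444
    a1.sources.r1.channels = c1

    # Channel: buffers events between the source and the sink.
    a1.channels.c1.type = memory

    # Sink: writes events to the agent's log.
    a1.sinks.k1.type = logger
    a1.sinks.k1.channel = c1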

What is Apache Flume?

Answer: Flume is a distributed service for collecting, aggregating, and moving large amounts of log data.

Which channel type in Flume offers the highest reliability?

Answer: The FILE channel is the most reliable of the three channel types: JDBC, FILE, and MEMORY.

How can Apache Flume be used with HBase?

Answer: Apache Flume can be used with HBase through one of its two HBase sinks –

  • HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the new HBase IPC introduced in HBase 0.96.
  • AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) offers better performance than the HBase sink because it makes non-blocking calls to HBase.

Working of the HBaseSink –

In HBaseSink, a Flume event is converted into HBase Increments or Puts. The serializer implements the HBaseEventSerializer interface and is instantiated when the sink starts. For every event, the sink calls the serializer's initialize method; the serializer then translates the Flume event into HBase Increments and Puts, which are sent to the HBase cluster.

Working of the AsyncHBaseSink –

AsyncHBaseSink uses a serializer that implements the AsyncHBaseEventSerializer interface. The sink calls the serializer's initialize method only once, when it starts. For each event, the sink invokes setEvent and then calls the getIncrements and getActions methods, much like the HBase sink. When the sink stops, it calls the serializer's cleanUp method.
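
As a sketch of the serializer contract described above, a minimal implementation that writes each event body into a single column might look like the following. Note that in the Flume 1.x codebase the interface is spelled HbaseEventSerializer, and the class, column, and row-key scheme here are made up for illustration; exact signatures may vary slightly between releases:

    import java.util.Collections;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.conf.ComponentConfiguration;
    import org.apache.flume.sink.hbase.HbaseEventSerializer;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Row;
    import org.apache.hadoop.hbase.util.Bytes;

    // Hypothetical serializer: stores each event body in one "payload" column.
    public class PayloadColumnSerializer implements HbaseEventSerializer {
      private byte[] columnFamily;
      private Event currentEvent;

      @Override
      public void initialize(Event event, byte[] columnFamily) {
        // Called by the sink for every event, before getActions/getIncrements.
        this.currentEvent = event;
        this.columnFamily = columnFamily;
      }

      @Override
      public List<Row> getActions() {
        // Translate the buffered event into an HBase Put keyed by time.
        Put put = new Put(Bytes.toBytes(System.currentTimeMillis()));
        put.add(columnFamily, Bytes.toBytes("payload"), currentEvent.getBody());
        return Collections.<Row>singletonList(put);
      }

      @Override
      public List<Increment> getIncrements() {
        // This sketch maintains no counters.
        return Collections.emptyList();
      }

      @Override
      public void close() { }

      @Override
      public void configure(Context context) { }

      @Override
      public void configure(ComponentConfiguration conf) { }
    }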

What is an agent in Flume?

Answer: A process that hosts Flume components such as sources, channels, and sinks, and thus has the ability to receive, store, and forward events to their destination.

How can Flume data be loaded into Apache Solr?

Answer: Data from Flume can be extracted, transformed, and loaded in real time into Apache Solr servers using MorphlineSolrSink.
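
As a sketch, a MorphlineSolrSink is configured on an agent like this; the agent and component names and the morphline file path are hypothetical:

    # Hypothetical sink "k1" on agent "a1" loading events into Solr.
    a1.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    a1.sinks.k1.channel = c1
    # Path to the morphline configuration defining the ETL commands.
    a1.sinks.k1.morphlineFile = /etc/flume-ng/conf/morphline.conf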

What does a channel do in Flume?

Answer: It stores events. Events are delivered to the channel via sources operating within the agent. An event stays in the channel until a sink removes it for further transport.

What are the built-in channel types available in Flume?

Answer:

The three built-in channel types available in Flume are –

  • MEMORY Channel – Events are read from the source into memory and passed to the sink.
  • JDBC Channel – Events are stored in an embedded Derby database.
  • FILE Channel – Contents are written to a file on the file system after the event is read from a source. The file is deleted only after the contents are successfully delivered to the sink.

The MEMORY channel is the fastest of the three but carries the risk of data loss. Which channel to choose depends entirely on the nature of the big data application and the value of each event. A sample configuration of each type is sketched below.
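
The channel names and paths in this hedged sketch are made up; FILE channel directories must exist and be writable by the agent:

    # MEMORY channel: fastest, but events are lost if the agent dies.
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000
    a1.channels.c1.transactionCapacity = 1000

    # JDBC channel: events persisted in an embedded Derby database.
    a1.channels.c2.type = jdbc

    # FILE channel: events persisted to local disk until the sink
    # confirms delivery.
    a1.channels.c3.type = file
    a1.channels.c3.checkpointDir = /var/flume/checkpoint
    a1.channels.c3.dataDirs = /var/flume/data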

What is an interceptor in Flume?

Answer: An interceptor can modify or even drop events based on any criteria chosen by the developer.
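
For example, Flume's built-in regex_filter interceptor can drop events whose body matches a pattern; the agent and component names and the regex below are hypothetical:

    # Attach an interceptor chain to source "r1" on agent "a1".
    a1.sources.r1.interceptors = i1
    # Drop every event whose body matches the regex.
    a1.sources.r1.interceptors.i1.type = regex_filter
    a1.sources.r1.interceptors.i1.regex = ^DEBUG.*
    a1.sources.r1.interceptors.i1.excludeEvents = true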

What are channel selectors?

Answer: Channel selectors are used to handle multiple channels. Based on a Flume header value, an event can be written to a single channel or to multiple channels. If no channel selector is specified for the source, the Replicating selector is used by default; it writes the same event to every channel in the source's channel list. The Multiplexing channel selector is used when the application has to send different events to different channels, as sketched below.
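
A hedged multiplexing example, routing on a hypothetical header named "state" (all names are made up):

    # Route events by the value of the "state" header.
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = state
    # Events with state=CA go to c1, state=NY to c2,
    # everything else to the default channel c3.
    a1.sources.r1.selector.mapping.CA = c1
    a1.sources.r1.selector.mapping.NY = c2
    a1.sources.r1.selector.default = c3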

Does Apache Flume support third-party plug-ins?

Answer: Yes. Flume has a fully plug-in-based architecture: it can load data from external sources and ship it to external destinations that are separate from Flume itself. This is why most big data analysts use it for streaming data.

What is the difference between the HDFS FileSink and the FileRollSink?

Answer: The major difference between the HDFS FileSink and the FileRollSink is that the HDFS File Sink writes events into the Hadoop Distributed File System (HDFS), whereas the File Roll Sink stores events on the local file system.
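
A side-by-side sketch of the two sink configurations (paths and names are hypothetical):

    # HDFS sink: events land in HDFS.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events

    # File Roll sink: events land on the local file system.
    a1.sinks.k2.type = file_roll
    a1.sinks.k2.channel = c2
    a1.sinks.k2.sink.directory = /var/flume/spool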

How do you set up a multi-hop agent in Apache Flume?

Answer: The Avro RPC bridge mechanism is used to set up a multi-hop agent in Apache Flume.
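
For illustration, a two-hop chain where agent1's Avro sink forwards to agent2's Avro source; hostnames, ports, and component names are hypothetical:

    # Agent 1: forwards its events to the next hop over Avro RPC.
    agent1.sinks.k1.type = avro
    agent1.sinks.k1.channel = c1
    agent1.sinks.k1.hostname = collector.example.com
    agent1.sinks.k1.port = 4545

    # Agent 2: receives events from agent 1 on the same port.
    agent2.sources.r1.type = avro
    agent2.sources.r1.channels = c1
    agent2.sources.r1.bind = 0.0.0.0
    agent2.sources.r1.port = 4545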

Why is Flume used?

Answer: Hadoop developers most often use this tool to get log data from social media sites. It was developed by Cloudera for aggregating and moving very large amounts of data. Its primary use is to gather log files from different sources and asynchronously persist them in the Hadoop cluster.

How would you describe Flume?

Answer: A real-time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You'll want to get started with FlumeNG, which improves on the original Flume.

Does Flume provide 100% reliability for the data flow?

Answer: Yes, it provides end-to-end reliability for the flow. By default, it uses a transactional approach in the data flow.

The sources and sinks are encapsulated in transactions provided by the channels, and the channels are responsible for passing events reliably from end to end. This is how Flume provides 100% reliability for the data flow. A sketch of the transaction API follows.
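
As a sketch of that transactional approach, using Flume's core Channel and Transaction API (the surrounding agent setup is omitted, and the channel parameter is assumed to be an already-configured channel):

    import java.nio.charset.StandardCharsets;

    import org.apache.flume.Channel;
    import org.apache.flume.Event;
    import org.apache.flume.Transaction;
    import org.apache.flume.event.EventBuilder;

    public class TransactionalPutSketch {
      // Puts a single event into a channel inside a transaction. The put
      // becomes visible to sinks only on commit and is undone on rollback.
      static void putEvent(Channel channel, String body) {
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
          Event event = EventBuilder.withBody(body, StandardCharsets.UTF_8);
          channel.put(event);
          tx.commit();   // event is now safely in the channel
        } catch (RuntimeException e) {
          tx.rollback(); // undo the put on failure
          throw e;
        } finally {
          tx.close();
        }
      }
    }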