Hadoop's main use is to process large amounts of data, which often requires copying large data sets into the Hadoop Distributed File System (HDFS). This is where Flume comes into the picture: it is an Apache project within the Hadoop ecosystem.
Flume is designed for ingesting large volumes of event data into Hadoop. A typical example is collecting log files from web servers and moving them into Hadoop for processing.
Download a stable release from the official Flume download page, then unpack the tarball on your Ubuntu machine with the command below:
tar xzf apache-flume-x.y.z-bin.tar.gz
Edit .bashrc to add Flume's bin directory to the PATH variable.
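For example, assuming Flume was unpacked into your home directory (the install path and version below are placeholders, not required values), you could add lines like these to .bashrc:

```shell
# Assumed install location -- adjust to wherever you unpacked the tarball
export FLUME_HOME=$HOME/apache-flume-x.y.z-bin
# Put the flume-ng launcher on the PATH
export PATH=$PATH:$FLUME_HOME/bin
```

Run `source ~/.bashrc` (or open a new terminal) for the change to take effect.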
To use Flume we need a Flume agent, a long-running JVM process that moves events from a source to a destination. Each Flume agent contains three kinds of components: source(s), channel(s) and sink(s). Let's discuss these three components before going further.
A Flume source is configured within an agent. The source listens for events generated by an external system and reads them, then hands the events to a channel.
A channel connects sources to sinks, acting as a buffer between them within an agent.
Channels store events until they are consumed by a sink. A channel can be memory-backed (events are held in an in-memory queue) or file-backed (events are persisted to a file on the local file system).
A sink pulls events from the channel and stores them in HDFS (the HDFS sink) or forwards them to another running Flume agent. Other built-in sinks include the Avro sink, the logger sink and many more.
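As a sketch, a sink that forwards events to another Flume agent would use the Avro sink type; the hostname and port below are made-up values for illustration:

```
# Forward events to a second Flume agent listening on collector-host:4545
# (hypothetical host and port -- the receiving agent must run an Avro source there)
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = collector-host
agent1.sinks.sink1.port = 4545
```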
Note: apart from the built-in components, you can also build your own custom sources, channels and sinks in Java.
Configuration and running Flume
A Flume agent can be started with the flume-ng command:
flume-ng agent --conf conf --conf-file test.conf --name testagent
The Flume configuration file is passed after --conf-file and the agent name after --name.
The Flume configuration file should contain the details of the source(s), channel(s) and sink(s).
Let's create a sample Flume configuration that collects events from a local directory and prints them on the console.
We create a single agent called agent1 with a single source, sink and channel, named source1, sink1 and channel1:
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
Next we have to wire the source and the sink to a channel, which requires the configuration below:
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
source1 and sink1 are now both connected to channel1.
Now we configure the type property for the source, channel and sink:
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir
agent1.sinks.sink1.type = logger
agent1.channels.channel1.type = file
If the sink output needs to be stored in HDFS, replace the logger sink line (agent1.sinks.sink1.type = logger) with the lines below:
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
Files that are still being written have a .tmp extension, for instance _events.1111296780136.log.tmp; the sink generates a timestamp and inserts it into the file name. By default a file is kept open for 30 seconds or until it reaches 1024 bytes, whichever comes first. This can be controlled with the hdfs.rollInterval and hdfs.rollSize properties.
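For example, to roll files less aggressively you might set something like the following (the values here are illustrative, not recommendations):

```
# Roll every 60 seconds or at 128 MB, whichever comes first (illustrative values)
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.rollSize = 134217728
# Disable rolling by event count
agent1.sinks.sink1.hdfs.rollCount = 0
```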
We are done with the configuration; let's save it as test.conf.
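Assembled from the fragments above, the complete test.conf for the console (logger sink) version looks like this:

```
# Name the components of agent1
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1

# Wire source and sink to the channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1

# Source: watch a local spooling directory
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/spooldir

# Sink: print events to the console
agent1.sinks.sink1.type = logger

# Channel: persist events to the local file system
agent1.channels.channel1.type = file
```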
Create a directory called /tmp/spooldir on your local file system and start flume-ng from your terminal:
% flume-ng agent \
  --conf-file test.conf \
  --name agent1 \
  --conf $FLUME_HOME/conf
From another terminal, copy some data into the spool directory so that flume-ng picks it up. The file is written under a hidden name first and then renamed, so Flume only sees it once it is complete:
% echo "Hello Flume" > /tmp/spooldir/.file1.txt
% mv /tmp/spooldir/.file1.txt /tmp/spooldir/file1.txt
The message will now be displayed on the flume-ng console.
Reference: Hadoop: The Definitive Guide by Tom White