Surveillance cameras installed in enterprise facilities and public places produce large volumes of video data every day. Typically, these surveillance clips are stored as compressed video files, each corresponding to several hours of footage. There is a need to process such huge video datasets to enable a quick summary of “interesting” events that happened during a specified time frame in a particular location. For instance, one might be interested in anomalous trajectories of objects within a scene.
To process such large-scale video datasets, we use the Hadoop MapReduce framework. An earlier blog post by my colleague, Dr. Victor Fang, reviewed many use cases that can be supported by our video analytics framework. In this blog post, I will dive further into the distributed video transcoder part of the framework, which ingests video into Hadoop, decodes the bitstream chunks in parallel, and produces a sequence file (a format much more amenable to video analytics in Hadoop).
The Hadoop framework stores large files in a distributed file system (HDFS: Hadoop Distributed File System) as chunks of a fixed block size (typically 64 MB) spread across a cluster of commodity machines. When the large input file to be processed is a text file split into 64 MB chunks, each Mapper process can access the lines in its split independently. However, when the input file is a video bitstream split into many chunks, each Mapper process needs to interpret its bitstream chunk appropriately to provide access to the individual decoded video frames for subsequent analysis. In the following sections, we describe how each split (64 MB chunk) of a video bitstream can be transcoded into a sequence of JPEG images that can then be processed by video analytics MapReduce jobs.
Architecture of Distributed Video Transcoder
Popular video compression formats such as MPEG-2 and H.264 have a hierarchical bitstream structure. This hierarchical structure is what makes it possible to decode arbitrary input chunks. For concreteness, we choose MPEG-2 as our example input video format; it is also the format used by DVDs and Digital Video Broadcast, and a widely used surveillance video recording format.
Our prototype distributed video transcoder breaks the task of decoding each chunk into the following two MapReduce jobs.
Video Sequence Header MapReduce Job: This job extracts the video sequence level information that is present only in the first chunk of a large video file.
Video Decoder MapReduce Job: This job uses the output of the previous job to configure a video decoder object for each chunk, and writes the decoded frames as <key,value> pairs in the Hadoop-friendly SequenceFile format.
In the following subsections, we describe some elements of the MPEG-2 bitstream at a high level, and then the two MapReduce jobs in detail.
MPEG-2 Bitstream Structure
Since the MPEG-2 bitstream is hierarchically organized, it permits random access into the video. The bitstream begins with a sequence header containing information such as picture height and width, followed by data for a Group of Pictures (“GOP”). The GOP bitstream element typically corresponds to 15 video frames and is repeated as many times as necessary to represent all the frames in the video sequence, as shown below:
Figure 1: MPEG-2 Bitstream structure
This repetitive GOP structure is what enables us to start decoding from any point in a video bitstream. For instance, if a large video file is split into many 64 MB chunks, the block boundaries may not align with GOP boundaries in the bitstream. A GOP boundary plays the same role for video that a newline character plays for text: the record reader for text data reads beyond the 64 MB block boundary until it sees a newline character, and similarly, our record reader for video bitstreams reads beyond the block boundary until it sees a GOP boundary.
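To make the GOP-boundary search concrete, here is a minimal sketch in plain Java of scanning a byte buffer for the next GOP header. In MPEG-2, start codes are the byte pattern 0x00 0x00 0x01 followed by a code byte, and 0xB8 identifies a GOP header. The class and method names are illustrative, not part of our actual implementation.

```java
public class GopScanner {
    // MPEG-2 start codes are 0x00 0x00 0x01 followed by a code byte;
    // 0xB8 marks a Group of Pictures (GOP) header.
    static final int GOP_START_CODE = 0xB8;

    /** Returns the byte offset of the first GOP start code at or after
     *  'from', or -1 if none is found in the buffer. */
    public static int findGopBoundary(byte[] buf, int from) {
        for (int i = Math.max(from, 0); i + 3 < buf.length; i++) {
            if (buf[i] == 0x00 && buf[i + 1] == 0x00 && buf[i + 2] == 0x01
                    && (buf[i + 3] & 0xFF) == GOP_START_CODE) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] stream = new byte[] {
            0x12, 0x34,                     // arbitrary payload bytes
            0x00, 0x00, 0x01, (byte) 0xB8,  // GOP header at offset 2
            0x55
        };
        System.out.println(findGopBoundary(stream, 0)); // prints 2
    }
}
```

A record reader built on this idea would keep reading past the end of its 64 MB block and hand the Mapper everything up to the offset returned by such a scan.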
Video Sequence Header MapReduce Job
The goal of this MapReduce job is to extract the sequence level information from the first 64 MB chunk of the MPEG-2 video file and write it as a text file in HDFS. The sequence information appears only once in the bitstream and is essential for setting up an MPEG-2 video decoder object.
Since our input is a video file, we implemented a new FileInputFormat with its own record reader. This record reader provides each Map process with a <LongWritable, BytesWritable> key-value pair, where the key is the byte offset into the file and the BytesWritable value is a byte array holding the video bitstream for the whole chunk. The record reader reads beyond the block boundary until it sees a GOP boundary, which ensures that all frames are decoded.
Within each Map process, the key is compared against zero to determine whether this is the first chunk of the video file. If it is, the bitstream is parsed to extract the sequence level information, which is written to an HDFS file named “input_video_filename_sequence_info.txt”. There is no need for a Reduce phase, since we obtain the sequence information entirely in the Map phase.
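To illustrate what this parsing involves, here is a hedged sketch in plain Java of reading the picture width and height from an MPEG-2 sequence header (start code 0x000001B3, followed by a 12-bit horizontal size and a 12-bit vertical size). In our actual implementation this parsing is done by libmpeg2 via JNI; the class below is only a hand-rolled illustration of the bitstream layout.

```java
public class SequenceHeaderParser {
    /** Parses picture width/height from an MPEG-2 sequence header
     *  (start code 0x00 0x00 0x01 0xB3). Returns {width, height},
     *  or null if the buffer does not begin with a sequence header. */
    public static int[] parseSize(byte[] buf) {
        if (buf.length < 7 || buf[0] != 0x00 || buf[1] != 0x00
                || buf[2] != 0x01 || (buf[3] & 0xFF) != 0xB3) {
            return null;
        }
        int b4 = buf[4] & 0xFF, b5 = buf[5] & 0xFF, b6 = buf[6] & 0xFF;
        int width  = (b4 << 4) | (b5 >> 4);     // 12-bit horizontal_size
        int height = ((b5 & 0x0F) << 8) | b6;   // 12-bit vertical_size
        return new int[] { width, height };
    }
}
```

For example, a PAL sequence header carrying 720x576 would yield {720, 576}; the Map task would write such values into the sequence-info text file for the decoder job to read.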
The parsing of the bitstream to extract sequence information can be done in either Java or C. In our implementation, we used libmpeg2 (an open source MPEG-2 video decoder) to parse the bitstream through the Java Native Interface (JNI). The Hadoop framework supports JNI by allowing the user to submit a shared object file (*.so file) through the “-files” command line option when the job is submitted.
Video Decoder MapReduce Job
The goal of this MapReduce job is to decode the individual 64 MB chunks and create a sequence file containing the decoded video frames from each chunk as JPEG images. The FileInputFormat and record reader are the same as in the MapReduce job described in the previous section. Hence, each Mapper’s input <key,value> pair is <LongWritable, BytesWritable>.
Each Mapper reads the sequence information file in HDFS (the output of the previous MapReduce job) and passes that information, along with the bitstream buffer received as the input BytesWritable, to the decoder. The actual decoding of the bitstream is done in C using libmpeg2 through JNI.
Then, in the Map process, we convert the decoded frames to JPEG images and emit the following <key,value> pair as the Map output. The output key encodes the input video filename and chunk number as “input_video_filename_chunkid”. The corresponding output value is a BytesWritable that encodes the decoded frame number and the JPEG bitstream of the decoded frame. Note that the chunk id can be derived from the input key, which is the byte offset into the video file.
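The derivation of the chunk id and output key can be sketched in a few lines of plain Java; the class and method names here are illustrative only, assuming the default 64 MB block size described above.

```java
public class ChunkKey {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB HDFS block

    /** Derives the chunk id from the Mapper's input key (the byte offset
     *  into the video file) and builds the Map output key in the
     *  "input_video_filename_chunkid" form described above. */
    public static String outputKey(String fileName, long byteOffset) {
        long chunkId = byteOffset / BLOCK_SIZE;
        return fileName + "_" + chunkId;
    }
}
```

With this scheme, a Mapper whose input key is byte offset 67108864 (exactly 64 MB, i.e., the second block) of “example.m2v” would emit the output key “example.m2v_1”.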
Each Reducer simply writes the decoded frames from all the chunks into a sequence file. Here is an example illustrating the input and output <key,value> pairs for this MapReduce job, assuming the input video file is named “example.m2v”.
Input <key,value> for Mapper: <LongWritable, BytesWritable>, e.g., <67108864, BytesWritable>
Output <key,value> from Mapper: <Text, BytesWritable>, e.g., <example.m2v_1, BytesWritable>
The Reducer takes in these <Text, BytesWritable> pairs and writes them into a sequence file. Figure 2 below illustrates the Video Decoder MapReduce job.
Figure 2: Parallel, Distributed Video Transcoding in Hadoop using Map-Reduce
Key Takeaways
In this blog, we have shown how we can ingest video into Hadoop and perform parallel, distributed transcoding to create a sequence file containing JPEG images of the individual video frames. To give you an idea of how quickly this process runs: we can decode ~10 GB of surveillance video (~2 hrs of content) on our 8-node cluster in under 2 minutes 22 seconds!
This Hadoop-friendly sequence file can then be reused across multiple runs of video analytics jobs supporting the different use cases outlined in our earlier blog by Victor Fang. Although we demonstrated the framework with MPEG-2 as the example video format, it is easy to imagine extending it to support a variety of formats with the help of corresponding decoders called from the Mappers.