Video Engineering

Video-compression techniques typically exploit the correlations between nearby frames. For example, if a background does not change between two frames, those sections of the image don’t need to be repeated in the bitstream.

Motion Compensated Prediction

If part of the image is in motion, the encoder can still take advantage of the correlation. An encoder spends most of its CPU time searching for such correlations to reduce the size of the output. These techniques cause compressed frames to depend on one another in a way that makes it impossible to start decoding in midstream.
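The search for correlations described above can be illustrated with a toy block-matching routine. This is a minimal sketch, not how any production encoder is written: the function names, the 4×4 block size, the ±2 pixel search window, and the use of the sum of absolute differences (SAD) as the cost are all assumptions chosen for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, n=4, search=2):
    """Exhaustively try every displacement within +/- `search` pixels and
    return (cost, dx, dy) for the candidate with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose block would fall outside the reference.
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + n > width or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Even this tiny search evaluates up to 25 candidate positions per block, which hints at why motion search dominates encoder CPU time. Real encoders use much larger windows, sub-pixel precision, and heuristics that avoid the exhaustive scan.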

Stream Access Point

Also known as an Instantaneous Decoder Refresh (IDR) frame or key frame.

Applications often want to start in midstream. For example, a television viewer may change the channel, or a YouTube or Netflix client may want to switch to a higher-quality stream. To allow this, video encoders insert Stream Access Points in the compressed bitstream (one at the beginning, and additional ones in the middle). A Stream Access Point resets the stream and serves as a dependency barrier: the decoder can begin decoding the bitstream at a Stream Access Point without needing access to earlier portions of the bitstream.

Key frames incur a significant cost. A key frame must encode the whole picture without reference to any earlier frame, so it is much larger than a frame that can reuse what the decoder already has. As an example, a raw frame of 4K video is about 12 megabytes (3840 × 2160 pixels at 1.5 bytes per pixel with 8-bit 4:2:0 sampling). Spurious insertion of key frames significantly increases the compressed bitrate. To achieve good compression, key frames should be inserted only rarely. For example, systems like YouTube use an interval of four or five seconds.
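The arithmetic behind these numbers can be sketched in a few lines. The raw-frame calculation assumes 8-bit 4:2:0 sampling, which gives roughly 12.4 million bytes (about 11.9 MiB) for a 3840 × 2160 frame. The compressed sizes used in the bitrate example (a 1,000,000-byte key frame, 20,000-byte interframes, 24 frames per second) are illustrative assumptions, not measurements, and both function names are hypothetical.

```python
def raw_frame_bytes(width, height, bits_per_sample=8):
    """Size of one raw 4:2:0 frame: a full-resolution luma plane plus two
    chroma planes subsampled by two in each direction."""
    luma = width * height
    chroma = 2 * (width // 2) * (height // 2)
    return (luma + chroma) * bits_per_sample // 8


def average_bitrate(key_bytes, inter_bytes, fps, key_interval_s):
    """Average bits per second when one key frame starts every
    `key_interval_s` seconds and all other frames are interframes."""
    frames = round(fps * key_interval_s)
    total_bytes = key_bytes + (frames - 1) * inter_bytes
    return total_bytes * 8 / key_interval_s


print(raw_frame_bytes(3840, 2160))               # 12441600
print(average_bitrate(1_000_000, 20_000, 24, 4))  # 5800000.0
print(average_bitrate(1_000_000, 20_000, 24, 1))  # 11680000.0
```

Under these assumed sizes, shortening the key-frame interval from four seconds to one roughly doubles the bitrate, which is the trade-off the paragraph above describes.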


A compressed frame is a bitstring, representing either a key frame or an “interframe.” A key frame resets and initializes the decoder’s state. An interframe, by contrast, depends on the decoder’s state, because it reuses portions of the reference images stored there (three of them in VP8). The goal of an interframe is to be as short as possible, by exploiting correlations between the intended output image and the contents of the three reference slots in the decoder’s state.


Traditional video encoders and decoders maintain a substantial amount of opaque internal state. The decoder starts decoding a sequence of compressed frames at the first key frame. This resets the decoder’s internal state. From there, the decoder produces a sequence of raw images as output. The decoder’s state evolves with the stream, saving internal variables and copies of earlier decoded images so that new frames can reference them to exploit their correlations. There is generally no way to import or export that state to resume decoding in midstream.

Encoders also maintain internal state, also with no interface to import or export it. A traditional encoder begins its output with a key frame that initializes the decoder.


A VP8 decoder’s state consists of the probability model—tables that track which values are more likely to be found in the video and therefore consume fewer bits of output—and three reference images, raw images that contain the decoded output of previous compressed frames.
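The shape of that state can be sketched as a small data structure. This is a toy model under loud assumptions: images are flat lists of integers, the probability model is a placeholder dictionary, and the frame fields (`type`, `image`, `ref_slot`, `residual`, `update_slot`) are hypothetical names. It only illustrates that a key frame initializes the state and an interframe both reads and updates it; it does not follow the VP8 bitstream format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

Image = List[int]  # toy stand-in for a raw image


@dataclass
class DecoderState:
    probability_model: Dict[str, float] = field(default_factory=dict)
    references: List[Image] = field(default_factory=list)  # three slots


class ToyDecoder:
    def __init__(self) -> None:
        self.state: Optional[DecoderState] = None  # nothing until a key frame

    def decode(self, frame: dict) -> Image:
        if frame["type"] == "key":
            # A key frame resets everything: a fresh probability model and
            # all three reference slots filled with the decoded image.
            image = list(frame["image"])
            self.state = DecoderState(
                probability_model={"p_zero": 0.5},  # placeholder value
                references=[list(image) for _ in range(3)],
            )
            return image
        if self.state is None:
            raise ValueError("interframe received before any key frame")
        # An interframe predicts from one reference slot, adds a residual,
        # and stores the result back into a slot for later frames to use.
        ref = self.state.references[frame["ref_slot"]]
        image = [r + d for r, d in zip(ref, frame["residual"])]
        self.state.references[frame["update_slot"]] = list(image)
        return image
```

Feeding this decoder an interframe first raises an error, which mirrors why decoding cannot begin at an arbitrary point in the stream.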


In the context of video codecs, PPS refers to Picture Parameter Set and SPS refers to Sequence Parameter Set.

Let me give you a bit of background before we look at what these contain.

Parameter sets were introduced in H.264/AVC in response to the devastating effects of losing the sequence header or picture header when a picture is partitioned into multiple segments (i.e., slices) and those segments are transported in their own transport units (e.g., RTP packets), which is desirable for MTU size matching. The loss of the first packet of a picture, which carries not only the first picture segment data but also the picture header (and sometimes also the GOP and sequence header), might lead to a completely incorrectly reconstructed picture (and sometimes also incorrect following pictures), even if no other packets were lost. Some decoder implementations would not even attempt to decode the received packets of a picture if the packet with the picture header was lost.

To combat this vulnerability, transport-layer mechanisms were introduced. For example, the RTP payload format for H.263, specified in RFC 2429 [10], allowed for carrying a redundant copy of the picture header in as many packets as the encoder/packetizer chooses. During the design of H.264/AVC, it was recognized that the vulnerability of a picture header is an architectural issue of the video codec itself rather than a transport problem, and therefore the parameter set concept was introduced as a fix for the issue.

Parameter sets can be either part of the video bitstream or can be received by a decoder through other means (including out-of-band transmission using a reliable channel, hard coding in encoder and decoder, and so on).

A PPS contains parameters that apply to the decoding of one or more individual pictures.

An SPS contains parameters that are common to all the pictures in a coded video sequence (a run of pictures that begins at an IDR picture).

Bit Stream Filters

Let me explain by example. FFmpeg video decoders typically work by converting one video frame per call to avcodec_decode_video2 (since deprecated in favor of avcodec_send_packet and avcodec_receive_frame). So the input is expected to be "one image" worth of bitstream data. Let's consider this issue of going from a file (an array of bytes on disk) to images for a second.

For "raw" (Annex B) H.264 (.h264/.bin/.264 files), the individual NAL unit data (SPS/PPS header bitstreams or entropy-coded slice data) is concatenated in a sequence of NAL units, each preceded by a start code (00 00 01, or 00 00 00 01). The byte that follows the start code is the NAL unit header, and its low five bits give the NAL unit type. (To prevent the NAL data itself from containing 00 00 01, the encoder escapes it by inserting emulation prevention bytes.) So an H.264 frame parser can simply cut the file at start code markers. It searches for successive packets that start with and include 00 00 01, up to but excluding the next occurrence of 00 00 01. Then it parses the NAL unit type and slice header to find which frame each packet belongs to, and returns the set of NAL units making up one frame as input to the H.264 decoder.
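The cutting step can be sketched as follows. This is a simplified illustration rather than FFmpeg's parser: the function names are hypothetical, it splits only on start codes, and it does not group NAL units into frames or remove emulation prevention bytes. It reads the NAL unit type from the low five bits of the first byte after the start code.

```python
START_CODE = b"\x00\x00\x01"


def split_annexb(data: bytes):
    """Return the NAL units found in an Annex B byte stream, without
    their start codes."""
    nals = []
    pos = data.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = data.find(START_CODE, start)
        end = len(data) if nxt == -1 else nxt
        # A NAL unit never ends in a zero byte, so trailing zeros are either
        # padding or the leading zero of a four-byte start code 00 00 00 01.
        nal = data[start:end].rstrip(b"\x00")
        if nal:
            nals.append(nal)
        pos = nxt
    return nals


def nal_unit_type(nal: bytes) -> int:
    """Low five bits of the NAL unit header byte."""
    return nal[0] & 0x1F


stream = bytes.fromhex("0000000167 42001f 00000168ce 0000016588 84".replace(" ", ""))
for nal in split_annexb(stream):
    print(nal_unit_type(nal), nal.hex())
```

In H.264, type 7 is an SPS, type 8 is a PPS, and type 5 is a slice of an IDR picture, so the sample stream above holds one of each.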

H.264 data in .mp4 files is different, though. You can imagine that the 00 00 01 start code can be considered redundant if the muxing format already has length markers in it, as is the case for MP4. So, instead of the 00 00 01 prefix, each NAL unit in an MP4 sample carries a length field (typically four bytes; the size is signaled in the file header). MP4 also puts the SPS/PPS in the file header instead of prepending them before the first frame, and these also lack their 00 00 01 prefixes. So, if I were to input this into the H.264 decoder, which expects the prefixes for all NAL units, it wouldn't work. The h264_mp4toannexb bitstream filter fixes this by identifying the SPS/PPS in the extracted parts of the file header (FFmpeg calls this "extradata"), prepending a start code to these and to each NAL unit from the individual frame packets, and concatenating them back together before inputting them into the H.264 decoder.
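The core of that conversion can be sketched in a few lines. This is not FFmpeg's h264_mp4toannexb code: the function names are hypothetical, the sketch assumes every NAL unit in a sample is preceded by a big-endian length field (four bytes by default here), and it takes the SPS and PPS as already-extracted byte strings instead of parsing the extradata.

```python
START_CODE = b"\x00\x00\x00\x01"


def length_prefixed_to_annexb(sample: bytes, length_size: int = 4) -> bytes:
    """Replace each big-endian length prefix in an MP4 sample with a
    start code, yielding Annex B data."""
    out = bytearray()
    pos = 0
    while pos < len(sample):
        if pos + length_size > len(sample):
            raise ValueError("truncated length field")
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if pos + nal_len > len(sample):
            raise ValueError("NAL unit extends past end of sample")
        out += START_CODE + sample[pos:pos + nal_len]
        pos += nal_len
    return bytes(out)


def prepend_parameter_sets(sps_list, pps_list, annexb_frame: bytes) -> bytes:
    """Put start-code-prefixed SPS and PPS NAL units in front of a frame."""
    headers = b"".join(START_CODE + ps for ps in [*sps_list, *pps_list])
    return headers + annexb_frame
```

A decoder that starts at a key frame needs the parameter sets first, which is why the sketch offers a separate helper to place them in front of the converted frame data.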

You might now feel that there's a very fine line between a "parser" and a "bitstream filter". This is true. I think the official definition is that a parser takes a sequence of input data and splits it into frames without discarding or adding any data. The only thing a parser does is change packet boundaries. A bitstream filter, on the other hand, is allowed to actually modify the data. I'm not sure this definition is entirely true (see e.g. vp9 below), but it's the conceptual reason mp4toannexb is a BSF, not a parser (because it adds 00 00 01 prefixes).


  • Digital Video Basics

  • AV and FFmpeg Basics


  • Closed Captions (CEA 608/708)


  • 360 / VR