How much do you need to understand about digital media (video and audio)? The answer to that question very much depends on how deep you want to go. This article focuses on the main concepts that we think are necessary to get to grips with online streaming, and with the broadpeak.io products.
The information contained in a digitised video signal is typically defined by:
- A resolution: the number of pixels (width and height) that make up each frame of the video (each frame is essentially a grid of pixels, just like a photograph). For example, an HD resolution is typically 1920 x 1080 pixels.
- An aspect ratio: the ratio of width to height, usually expressed as “w:h” - 2 numbers separated by a colon. Most of the time the aspect ratio will be horizontal (with a width bigger than the height); 16:9 is a fairly common example.
- A frame rate: the number of frames captured per second of video, expressed therefore in frames per second (fps). Most standard video used in streaming uses 24, 25 or 30 fps.
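These dimensions determine the raw data rate of the signal, which is why the compression discussed below is essential. A minimal sketch, assuming uncompressed 24-bit RGB pixels (a simplification; real pipelines use chroma subsampling, which lowers this figure):

```python
def raw_video_bitrate_bps(width: int, height: int, fps: float, bits_per_pixel: int = 24) -> int:
    """Data rate of an uncompressed video signal, in bits per second."""
    return int(width * height * bits_per_pixel * fps)

# A 1080p signal at 25 fps, before any compression
rate = raw_video_bitrate_bps(1920, 1080, 25)
print(f"{rate / 1_000_000:.0f} Mbps")  # 1244 Mbps - over a gigabit per second
```

Compare that figure with the compressed bitrates mentioned later in this article to see how much work the codec does.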
The information contained in a digitised audio signal is typically defined by:
- The sample rate (or sampling frequency): the number of samples of audio recorded every second, expressed in Hertz (Hz). A typical figure is 44,100 samples per second (or 44.1 kHz).
- The channels: audio streams are usually composed of multiple channels. You’ll be familiar with this concept already from the listening side of things: headphones render stereo channels (left and right), whereas most home cinema systems support 5.1 surround sound made of 6 separate audio channels to provide a convincing experience that puts you in the middle of the scene.
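The same kind of raw-rate estimate works for audio. A small sketch, assuming 16-bit samples (a common bit depth, though not a figure given above):

```python
def raw_audio_bitrate_bps(sample_rate: int, bit_depth: int, channels: int) -> int:
    """Data rate of uncompressed PCM audio, in bits per second."""
    return sample_rate * bit_depth * channels

# CD-quality stereo: 44.1 kHz, 16-bit samples, 2 channels
print(raw_audio_bitrate_bps(44_100, 16, 2))  # 1411200, i.e. about 1.4 Mbps
```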
Alongside video and audio, you will often also handle associated signals, such as subtitles/captions, or timed metadata. There isn’t really any specific dimension associated with those worth talking about here.
If digital media directly contained the raw signals captured, they would end up with a huge size that would be very slow and expensive to store or transfer. What almost always happens (regardless of where those media come from) is that they are made by compressing the raw signal into what is usually referred to as a stream. A couple of key concepts are needed to understand this operation:
- Codecs are the algorithms (think of them as recipes) that define how the signal can be compressed (or “coded”) and decompressed (or “decoded”) again. There are different codecs for audio and video, and a wide range of codecs available, suited to different applications. The ones most commonly used in the streaming industry are called “lossy”, in that they remove some information from the original signal in order to compress it further. Codecs differ in their ability to compress without affecting perceived quality, and in how fast they can do so. The most common ones used in online streaming are H.264/AVC and H.265/HEVC for video, and AAC or Dolby Digital for audio.
- A key measurement in the application of the codec is the bitrate: it defines how much signal is conveyed per unit of time. Since the signal is made of bits (1s and 0s), it is usually measured in bits per second (bps or b/s). The bitrate chosen influences how big the media files are, and therefore how much storage they take and how long they take to transfer. A high bitrate produces bigger files that are slower to transfer, but it allows much more information to be retained during compression, and therefore usually preserves more of the quality of the captured signal. In standard online streaming applications, you will see bitrates somewhere between 100 kbps and 15 Mbps for video, and between 64 kbps and 300 kbps for audio.
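To make the storage and transfer impact concrete, here is a small sketch that converts a bitrate and a duration into an approximate file size (the figures are purely illustrative):

```python
def stream_size_bytes(bitrate_bps: int, duration_s: int) -> int:
    """Approximate stream size: bits per second times seconds, divided by 8."""
    return bitrate_bps * duration_s // 8

# A 2-hour movie at 5 Mbps for video plus 128 kbps for audio
total = stream_size_bytes(5_000_000 + 128_000, 2 * 3600)
print(f"{total / 1_000_000_000:.1f} GB")  # 4.6 GB
```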
Before a stream can be played back, there is one last thing that is needed: wrapping the stream in a container. The container provides a mechanism to combine (often called multiplexing or muxing) one or more streams together, with metadata to describe those streams and how to access them. Here again, the industry has a wide variety of containers available, with different functionality. The most common ones are the Transport Stream container (or “TS”) and the MPEG-4 container format (variations of which are sometimes referred to as ISOBMFF or CMAF).
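Containers can often be recognised from their first few bytes. A deliberately simplified sketch (real detection is more involved): TS packets are 188 bytes long and each starts with the sync byte 0x47, while ISOBMFF/MP4 files carry an "ftyp" box near the start of the file.

```python
def sniff_container(data: bytes) -> str:
    """Very rough container detection based on well-known byte signatures."""
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "MP4 (ISOBMFF)"
    if len(data) >= 189 and data[0] == 0x47 and data[188] == 0x47:
        return "MPEG Transport Stream"
    return "unknown"

print(sniff_container(b"\x00\x00\x00\x20ftypisom"))  # MP4 (ISOBMFF)
```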
The whole process of capturing and compressing raw signals is usually called encoding. Its output is a file or stream that is ready to be used as the source for a media content workflow, and/or playback.
Most of the time in a media flow, a file or stream is repurposed to change some of its dimensions to make it usable in a downstream system, such as changing the resolution, reducing the bitrate, or swapping to a different container format. This process of conversion is often called transcoding, although you’ll often see the term “encoding” used interchangeably with it (because technically transcoding = decoding + re-encoding).
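In practice, transcoding is usually delegated to a tool such as ffmpeg. A sketch that only assembles the command line, without executing anything (the exact choice of flags depends on your workflow; the file names here are placeholders):

```python
def build_transcode_cmd(src: str, dst: str, width: int, height: int,
                        video_bitrate: str, audio_bitrate: str) -> list[str]:
    """Build an ffmpeg command that rescales and re-encodes a file."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={width}:{height}",          # change the resolution
        "-c:v", "libx264", "-b:v", video_bitrate,  # re-encode video at a new bitrate
        "-c:a", "aac", "-b:a", audio_bitrate,      # re-encode audio
        dst,
    ]

cmd = build_transcode_cmd("source.mp4", "out.mp4", 1280, 720, "3M", "128k")
print(" ".join(cmd))
```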
Most encoders will wrap their output into a container (a necessary step to be able to pass it to the next component in the flow). However, in the context of digital video, streaming, and delivery over the internet, the more specific term packaging tends to be used to refer to the process of preparing and organizing the media. It involves a series of steps that are designed to ensure that the content can be transmitted efficiently and played back correctly in video players. It often refers to the following sub-processes:
- Segmentation: Long video files are typically divided into smaller segments to facilitate streaming.
- Packaging: Each segment is then "packaged" with associated metadata into a streaming format such as HLS or DASH. You’ll learn more about this in the next chapter on Adaptive Streaming.
- Encryption and DRM: If the content is to be protected from unauthorized copying or viewing, it can be encrypted during the packaging process. This often involves the use of Digital Rights Management (DRM) systems, which control access to copyrighted material.
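The segmentation step above is mostly arithmetic: the content duration and a target segment length determine how many segments the packager produces, and the bitrate determines how big each one is. A small sketch with illustrative figures:

```python
import math

def segment_plan(duration_s: float, segment_len_s: float, bitrate_bps: int) -> tuple[int, int]:
    """Number of segments, and the approximate size of each segment in bytes."""
    count = math.ceil(duration_s / segment_len_s)
    seg_bytes = int(segment_len_s * bitrate_bps) // 8
    return count, seg_bytes

# A 2-hour asset cut into 6-second segments at 5 Mbps
print(segment_plan(7200, 6, 5_000_000))  # (1200, 3750000)
```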
Packagers are either
- Offline packagers, which create their output once, which then gets stored on the origin server
- Just-in-Time (JIT) packagers, which create those outputs on demand, from the output of the encoder
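The difference between the two can be sketched as a caching pattern (the names below are hypothetical, purely for illustration):

```python
# Offline: everything is packaged up front and stored on the origin.
# Just-in-Time: a segment is only packaged the first time it is requested.
packaged: dict[str, bytes] = {}

def get_segment_jit(segment_id: str, package_fn) -> bytes:
    """Package a segment on demand, caching the result for later requests."""
    if segment_id not in packaged:
        packaged[segment_id] = package_fn(segment_id)
    return packaged[segment_id]

calls = []
get_segment_jit("seg1", lambda s: calls.append(s) or b"data")
get_segment_jit("seg1", lambda s: calls.append(s) or b"data")
print(len(calls))  # 1: the second request is served from the cache
```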
The player (often called a video player, even though it naturally also plays audio) is fundamentally in charge of:
- Parsing the file (or files) it receives, retrieving one or multiple streams from them, decoding them back into video frames, audio samples, or subtitle cues, and rendering them, that is: creating images on screen, playing audio through the speakers, or overlaying subtitles as text.
- Enabling the viewer to interact with the timeline, such as pausing playback, seeking to a different point, etc.
- Always keeping the various streams synchronised in time, including subtitles and timed metadata.
Players will usually have many other features, but they all need to support these core functions.
In the case of a streaming solution, in which a single file is not usually involved, the player is also in charge of gradually downloading the streams whilst they are being played, in such a way that:
- it offers the highest possible visual quality for the users’ internet connection speed (technically: its bandwidth), if necessary switching to higher or lower qualities as the bandwidth fluctuates to ensure smooth playback
- it always has the next few seconds of content ready to decode and play in a buffer, so that it does not need to interrupt playback for lack of content if the users' internet speed is or becomes too low (which is usually called “stalling”).
- for live content, especially sports, it has the lowest possible latency, so that the user can watch content as close in time as possible to when it is being captured
- all that without encountering errors.
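The bandwidth-driven quality switching described above boils down to picking the highest rendition that fits within the measured throughput, with a safety margin. A simplified sketch (the bitrate ladder and the 0.8 safety factor are illustrative; real players use more sophisticated heuristics that also consider buffer levels):

```python
def pick_rendition(bandwidth_bps: float, ladder_bps: list[int], safety: float = 0.8) -> int:
    """Highest bitrate that fits within a fraction of the measured bandwidth."""
    usable = bandwidth_bps * safety
    fitting = [b for b in sorted(ladder_bps) if b <= usable]
    return fitting[-1] if fitting else min(ladder_bps)

ladder = [500_000, 1_500_000, 3_000_000, 6_000_000]
print(pick_rendition(4_000_000, ladder))  # 3000000
```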
Depending on the value of your content, you may want the ability to control how people can consume it and prevent them from sharing it. Digital Rights Management (DRM) systems are one of the key mechanisms to do this.
Digital Rights Management is a technically complex and deep topic. We will not go into any technical detail about it in this guide (there are excellent resources to be found online), but suffice it to say that it usually consists of at least 2 aspects:
- Encryption of the stream using standard algorithms (usually a version of AES encryption) that make use of keys: complex pieces of information that cannot be easily guessed. The key used for encrypting the stream is then also needed to decrypt it.
- Secure delivery of licenses to players, which wrap the decryption key, with the ability to impose additional constraints on the way the content can be played back, such as whether it can be used offline, how many times it can be watched, what types of devices it can be streamed on and at what quality levels, etc.
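The symmetry between the two aspects can be illustrated with a toy example. Real DRM uses AES; the XOR cipher below is a stand-in used only to show that the same key both encrypts and decrypts, and that the player cannot render the content until a license delivers that key:

```python
import secrets

def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR toy cipher: applying it twice with the same key restores the data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

content_key = secrets.token_bytes(16)        # held by the license server
encrypted = toy_cipher(b"media segment", content_key)
# The player obtains the key via a license, then decrypts:
print(toy_cipher(encrypted, content_key))    # b'media segment'
```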
There are 3 main DRM systems used in the streaming industry:
- FairPlay: invented by Apple
- Widevine: maintained by Google
- PlayReady: invented by Microsoft
The three technologies have a number of differences, but in practice the main one is that they are supported by different playback devices. For the most part, your streaming solution is therefore likely to need to adopt a multi-DRM approach. To simplify this, a number of DRM vendors offer products that allow you to work with all three systems within a single solution.