A breakdown of the PCM audio format
Pulse-code modulation (PCM) is a method used to digitally represent analog signals. It works simply by capturing the amplitude of an audio signal over time. It is a standard form of digital audio in computers and CDs. I’m writing this to demystify the format and to understand how it works.
PCM works by sampling the amplitude of an analog signal at regular intervals. You can imagine this as a surveyor capturing the height (“amplitude”) at a moment in time, then again and again and again. Do this enough times and you have a digital representation that is close to, but not a perfect copy of, the analog original.
Each measured amplitude is rounded to the nearest of a fixed set of discrete values and coded as a binary number. This is what they mean when they say the samples are “quantized”, and the rounding is where a tiny amount of information is lost. The number of bits used to represent each sample determines the accuracy of the digital signal. The more bits you use, the closer to the real signal you get.
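To make sampling and quantization concrete, here is a minimal Python sketch. It assumes the "analog" signal is a pure sine wave and that samples are stored as signed integers; the function name `sample_and_quantize` and its parameters are made up for illustration.

```python
import math

def sample_and_quantize(freq_hz, sample_rate, bit_depth, n_samples):
    """Sample a unit-amplitude sine wave and round each sample to a signed integer."""
    max_level = 2 ** (bit_depth - 1) - 1  # largest positive value, e.g. 32767 for 16-bit
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                               # time of the n-th sample, in seconds
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # stand-in for the analog value, in [-1.0, 1.0]
        samples.append(round(amplitude * max_level))      # quantize: snap to the nearest integer level
    return samples

# One cycle of a 1 kHz tone sampled at 8 kHz, at two different bit depths.
coarse = sample_and_quantize(1000, 8000, 4, 8)
fine = sample_and_quantize(1000, 8000, 16, 8)
print(coarse)
print(fine)
```

Both lists describe the same wave. The 4-bit version only has a handful of levels to round to, so it follows the sine much more roughly than the 16-bit version.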
The sampling rate is the number of samples taken per second. Each sample is stored as a fixed-size chunk in the PCM data. Here are some common sample rates:
Rate | Reference |
---|---|
8,000 Hz | Adequate for human speech but without sibilance. Used in telephone/walkie-talkie. |
11,025 Hz | Used for lower-quality PCM, MPEG audio and for audio analysis of subwoofer bandpasses. |
16,000 Hz | Used in most VoIP and VVoIP, extension of telephone narrowband. |
22,050 Hz | Used for lower-quality PCM and MPEG audio and for audio analysis of low frequency energy. |
44,100 Hz | Audio CD, most commonly used rate with MPEG-1 audio (VCD, SVCD, MP3). Covers the 20 kHz bandwidth. |
48,000 Hz | Standard sampling rate used by professional digital video equipment. Can reconstruct frequencies up to 22 kHz. |
88,200 Hz | Used by some professional recording equipment (mixers, EQs, compressors, reverb, crossovers and recording devices) when the destination is CD. |
96,000 Hz | DVD-Audio, LPCM DVD tracks, Blu-ray audio tracks, HD DVD audio tracks. |
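A quick way to get a feel for these rates is to work out how far apart the samples are in time, and the theoretical ceiling on the frequencies a rate can capture, which is half the sample rate. The sketch below is only an illustration with made-up helper names. Real equipment reaches a little less than the theoretical ceiling because of filtering.

```python
SAMPLE_RATES_HZ = [8_000, 11_025, 16_000, 22_050, 44_100, 48_000, 88_200, 96_000]

def nyquist_limit_hz(sample_rate_hz):
    """Theoretical upper bound on the frequency a sample rate can represent."""
    return sample_rate_hz / 2

def sample_interval_us(sample_rate_hz):
    """Time between two consecutive samples, in microseconds."""
    return 1_000_000 / sample_rate_hz

for rate in SAMPLE_RATES_HZ:
    print(f"{rate:>6} Hz: one sample every {sample_interval_us(rate):7.2f} us, "
          f"theoretical limit {nyquist_limit_hz(rate):8.1f} Hz")
```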
The bit depth is the number of bits used to represent each sample. The more bits used, the more accurately the digital signal can represent the original analog signal. Common bit depths are 8-bit, 16-bit, 24-bit, and 32-bit.
Depth | Levels | Reference |
---|---|---|
8-bit | 256 | Used in telephony and early computer audio. |
16-bit | 65,536 | Audio CDs, most commonly used for audio processing. |
24-bit | 16,777,216 | Used in professional audio equipment and high-quality audio. |
32-bit | 4,294,967,296 | Used in high-end audio equipment and audio processing. |
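The number of levels is just 2 raised to the bit depth. Here is a small sketch to check the figures, assuming samples are stored as signed integers. Some formats store 8-bit samples unsigned instead, and the helper names are hypothetical.

```python
def quantization_levels(bit_depth):
    """Number of distinct values a sample of this bit depth can take."""
    return 2 ** bit_depth

def signed_range(bit_depth):
    """Smallest and largest value when samples are stored as signed integers."""
    half = 2 ** (bit_depth - 1)
    return -half, half - 1

for bits in (8, 16, 24, 32):
    low, high = signed_range(bits)
    print(f"{bits:>2}-bit: {quantization_levels(bits):,} levels, "
          f"signed range {low:,} to {high:,}")
```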
PCM audio can be mono (one channel), stereo (two channels), or multi-channel (more than two channels). Each channel is a separate stream of samples, and the channels are combined into a single audio stream. For example, stereo audio has two channels, one for the left speaker and one for the right, and they are usually interleaved: a left sample, then the right sample for the same instant, then the next left sample, and so on.
Here’s an example of a 16-bit stereo audio file with a sampling rate of 48 kHz, in binary and decimal form:
Sample 0: 0000100011011110 (decimal: 2270)
Sample 1: 0000100101100000 (decimal: 2400)
Sample 2: 0000100110111001 (decimal: 2489)
Sample 3: 0000100101100101 (decimal: 2405)
….
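Sample 50: 1111111100101110 (decimal: 65326 unsigned, -210 as signed 16-bit)
Sample 51: 1111111011000110 (decimal: 65222 unsigned, -314 as signed 16-bit)
Sample 52: 1111111000100111 (decimal: 65063 unsigned, -473 as signed 16-bit)
Sample 53: 1111110111100010 (decimal: 64994 unsigned, -542 as signed 16-bit)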
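16-bit PCM is normally stored as signed two's-complement numbers, so a word whose top bit is set stands for a negative amplitude. The sketch below decodes a few of the bit patterns above that way, then splits a stream into channels. It assumes the samples alternate left, right, left, right starting with the left channel, which the listing itself does not state, and the function names are hypothetical.

```python
def to_signed_16(bits):
    """Interpret a 16-character bit string as a two's-complement integer."""
    value = int(bits, 2)
    return value - 0x10000 if value >= 0x8000 else value

def deinterleave(samples, channels=2):
    """Split [L0, R0, L1, R1, ...] into one list per channel."""
    return [samples[ch::channels] for ch in range(channels)]

def interleave(*channel_lists):
    """Merge per-channel lists back into frame order."""
    return [sample for frame in zip(*channel_lists) for sample in frame]

# Samples 0 to 3 from the listing, treated as two stereo frames (an assumption).
words = ["0000100011011110", "0000100101100000",
         "0000100110111001", "0000100101100101"]
decoded = [to_signed_16(w) for w in words]
left, right = deinterleave(decoded)
print(left, right)                        # [2270, 2489] [2400, 2405]

# Sample 50 has its top bit set, so it decodes to a negative amplitude.
print(to_signed_16("1111111100101110"))   # -210
```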
The size of raw PCM data in bits = (sampling rate) * (bit depth) * (channels) * (time in seconds). Divide by 8 to get bytes.
As an example, a 16-bit stereo audio file with a sampling rate of 48 kHz and 1 minute of audio:
(48000 samples/sec) * (16 bits/sample) * (2 channels) * (60 sec) = 92,160,000 bits = 11,520,000 bytes, or about 11.52 MB
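The same arithmetic as a small Python helper. Note that the product of rate, depth, channels and time comes out in bits, so it has to be divided by 8 to get bytes. This sketch assumes raw, headerless PCM (container formats add a small header on top), and `pcm_size_bytes` is a made-up name.

```python
def pcm_size_bytes(sample_rate, bit_depth, channels, seconds):
    """Size of raw PCM data in bytes, for whole-number inputs."""
    bits = sample_rate * bit_depth * channels * seconds
    return bits // 8  # 8 bits per byte

size = pcm_size_bytes(48_000, 16, 2, 60)
print(size)                # 11520000
print(size / 1_000_000)    # 11.52 (megabytes, counting 1 MB as 1,000,000 bytes)
```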
If PCM data is just amplitude information, how do we hear different types of sound? PCM is just a digital conversion of an analog signal, and the analog signal is just variations in air pressure over time. So how exactly do we tell different sounds apart? That comes down to the characteristics of the sound itself:
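Pitch (frequency) is represented by how quickly the amplitude values oscillate up and down. For example, a 440 Hz tone sampled at 48 kHz completes one full cycle roughly every 109 samples, while an 880 Hz tone takes about half that.
Volume (loudness) is represented by the range of the amplitude values: samples that swing between large positive and negative values sound louder than samples that stay close to zero.
Timbre (sound quality) comes from the complex pattern of the amplitude variations: two instruments playing the same note at the same volume trace differently shaped waves.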
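All three characteristics can be seen in plain sample values. The sketch below generates a tenth of a second of a few tones at 48 kHz, using sine waves as a stand-in for real recordings. The `tone` and `zero_crossings` helpers and the harmonic weights are invented for illustration.

```python
import math

SAMPLE_RATE = 48_000  # samples per second

def tone(freq_hz, peak, n_samples, harmonics=(1.0,)):
    """Integer samples of a tone built from weighted harmonics.

    harmonics[k] is the relative strength of the (k+1)-th multiple of freq_hz.
    """
    total = sum(abs(w) for w in harmonics)  # normalise so samples stay within +/- peak
    out = []
    for n in range(n_samples):
        t = n / SAMPLE_RATE
        value = sum(w * math.sin(2 * math.pi * freq_hz * (k + 1) * t)
                    for k, w in enumerate(harmonics))
        out.append(round(peak * value / total))
    return out

def zero_crossings(samples):
    """Count sign changes; more of them in the same time span means a higher pitch."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

low = tone(220, 10_000, 4800)     # lower pitch
high = tone(440, 10_000, 4800)    # higher pitch: oscillates twice as fast
quiet = tone(440, 1_000, 4800)    # same pitch, smaller amplitude range
rich = tone(440, 10_000, 4800, harmonics=(1.0, 0.5, 0.25))  # same pitch, different wave shape

print("pitch:  ", zero_crossings(low), "vs", zero_crossings(high), "zero crossings")
print("volume: ", max(quiet), "vs", max(high), "peak value")
print("timbre: ", high[:6], "vs", rich[:6])
```

The pitch shows up as how often the values change sign, the volume as how large the values get, and the timbre as the shape traced by consecutive values.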
Our ears and brain process these amplitude patterns as the sound travels through the different parts of the ear, and they are remarkably good at extracting meaningful information from just these simple pressure variations.