James Routley

You double-click a file and it opens in the "right" app. Usually we take that for granted.

But why did that work?

.txt, .png, and .pdf are just suffixes in a filename. The real question is what is inside the file, and how software turns those raw bytes into text, images, or audio.

A file is bytes. A format is the rulebook for interpreting those bytes.

A file is just bytes

Every file on your computer is a sequence of bytes. A byte is a number from 0 to 255. Documents, photos, songs, executables: all of them are number sequences on disk.

Those numbers do not carry meaning by themselves. Meaning comes from interpretation rules. The same bytes can be read as text, pixel values, audio samples, or machine instructions.

Picture one byte stream viewed as numbers, as text, and as pixel values. The bytes stay the same. Only the interpretation changes.
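As a rough illustration of that idea, the sketch below reads one short byte string in four ways. The eight byte values are made up for this example, and the choice of views (ASCII text, 32-bit little-endian integers, three-byte pixel triples) is an assumption for demonstration, not something a real format prescribes.

```python
import struct

# Eight arbitrary bytes, chosen only for this demonstration.
data = bytes([72, 105, 33, 10, 255, 128, 0, 64])

# View 1: raw numbers, one per byte (each in the range 0..255).
as_numbers = list(data)

# View 2: text, decoding the first four bytes as ASCII.
as_text = data[:4].decode("ascii")

# View 3: two unsigned 32-bit integers, little-endian.
as_uint32 = struct.unpack("<2I", data)

# View 4: pixels, grouping the first six bytes into three-byte triples.
as_pixels = [tuple(data[i:i + 3]) for i in range(0, 6, 3)]

print(as_numbers)
print(repr(as_text))
print(as_uint32)
print(as_pixels)
```

Nothing in `data` says which view is "correct". That decision lives in the program doing the reading.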

Extensions are just labels

In report.txt, .txt is the extension. It is a hint to the operating system, not proof.

You can rename any file to almost anything. Rename a PNG image to document.txt and the bytes remain identical.

Extensions can lie: what a filename claims and what the bytes actually say can differ.

If the OS trusted extensions alone, a PNG renamed to document.txt would open in a text editor.

Bytes that identify themselves

Many binary formats begin with a fixed byte pattern called a magic number (or file signature). Those first bytes identify the format regardless of filename.

Plain text files often lack a reliable signature. Binary formats usually have one.

Here are common magic bytes and how they look at the start of a file:

Format	Extension	Magic bytes	ASCII
PNG	`.png`	`89 50 4E 47 0D 0A 1A 0A`	`\x89PNG\r\n\x1a\n`
JPEG	`.jpg`	`FF D8`	`\xff\xd8`
GIF	`.gif`	`47 49 46 38 39 61`	`GIF89a`
BMP	`.bmp`	`42 4D`	`BM`
PDF	`.pdf`	`25 50 44 46 2D`	`%PDF-`
ZIP	`.zip`	`50 4B 03 04`	`PK\x03\x04`
GZIP	`.gz`	`1F 8B`	`\x1f\x8b`
MP3 (ID3v2 tag)	`.mp3`	`49 44 33`	`ID3`
ELF	(none)	`7F 45 4C 46`	`\x7fELF`

PNG includes "PNG" in its signature. PDF starts with "%PDF-" and then a version marker.

JPEG begins with FF D8 (SOI, Start of Image), then another marker that depends on subtype: JFIF commonly continues with FF E0, Exif with FF E1.

MP3 starts with ID3 only when an ID3v2 metadata tag exists. Without that tag, the file starts directly with an MPEG frame sync.

ZIP starts with 50 4B (ASCII "PK", from Phil Katz). Most ZIP archives begin with PK\x03\x04, but empty archives and some self-extracting variants can start with other PK signatures.

The Unix file command works this way: it checks a file's contents against a signature database and ignores the filename.
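Signature matching can be sketched in a few lines. The version below is a minimal illustration, not how the file command is implemented: the `SIGNATURES` list and the `sniff` function are names invented here, and the table only holds the signatures listed above.

```python
# Signatures copied from the table above. This is not a complete database.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8", "JPEG"),
    (b"GIF89a", "GIF"),
    (b"BM", "BMP"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP"),
    (b"\x1f\x8b", "GZIP"),
    (b"ID3", "MP3 (ID3v2 tag)"),
    (b"\x7fELF", "ELF"),
]


def sniff(head: bytes) -> str:
    """Return the first format whose signature is a prefix of `head`."""
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return "unknown"


# The filename plays no part: only the opening bytes are checked.
print(sniff(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # PNG
print(sniff(b"%PDF-1.7\n"))                       # PDF
print(sniff(b"hello, world\n"))                   # unknown
```

Short signatures are weak evidence. A text file that happens to begin with "BM" would be reported as BMP here, which is one reason real tools check more than a two-byte prefix.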

Magic numbers solve identification for many binary formats. For plain text, things are fuzzier, so encoding matters.

How text becomes bytes

Text needs a mapping from characters to numbers. ASCII (1963) is the classic example: 128 characters, including letters, digits, punctuation, and control characters such as newline.

In ASCII, each character maps to one byte. "A" is 65 (hex 41), "a" is 97 (hex 61), and space is 32 (hex 20). Uppercase and lowercase letters differ by one bit.

ASCII works for basic English text, but 128 symbols are not enough for global writing systems. It has no built-in space for characters like "é", "中", or "😀".
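The ASCII values mentioned above are easy to check. This short sketch uses only Python built-ins; the variable names are arbitrary.

```python
text = "A a"
codes = list(text.encode("ascii"))
print(codes)                        # [65, 32, 97]
print([f"{c:02X}" for c in codes])  # ['41', '20', '61']

# Upper and lower case differ in exactly one bit (value 0x20).
case_bit = ord("A") ^ ord("a")
print(hex(case_bit))                # 0x20
print(chr(ord("A") | case_bit))     # a

# Characters outside the 128-symbol range cannot be encoded as ASCII.
try:
    "é".encode("ascii")
except UnicodeEncodeError as error:
    print("not representable in ASCII:", error.object)
```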

150,000 characters in one standard

Unicode assigns a unique number, called a code point, to each character it covers: nearly every writing system in use, plus thousands of symbols and emoji. The current version defines over 150,000 characters.

Unicode is a numbering system, not a byte layout. You still need an encoding.

UTF-8 is the common one: one byte for ASCII characters, then two, three, or four bytes for other code points.

UTF-8 is self-synchronizing because the prefix bits carry structure. If you jump into the middle of a stream, you can scan forward until a valid start byte appears.

The leading bits of each byte say what role it plays: 0 means a one-byte character, 110 starts a two-byte sequence, 1110 starts three bytes, and 11110 starts four. Continuation bytes begin with 10.

These patterns let software guess UTF-8 from raw bytes, though short strings and pure ASCII are ambiguous because they are valid in several encodings.
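The prefix rules can be turned into a small classifier. This is a sketch for illustration: `classify` and `next_start` are hypothetical helper names, and the code only inspects leading bits, so it does not perform full UTF-8 validation.

```python
def classify(byte: int) -> str:
    """Classify one byte by its leading bits."""
    if byte >> 7 == 0b0:
        return "single"        # 0xxxxxxx
    if byte >> 6 == 0b10:
        return "continuation"  # 10xxxxxx
    if byte >> 5 == 0b110:
        return "lead-2"        # 110xxxxx
    if byte >> 4 == 0b1110:
        return "lead-3"        # 1110xxxx
    if byte >> 3 == 0b11110:
        return "lead-4"        # 11110xxx
    return "invalid"


def next_start(buf: bytes, pos: int) -> int:
    """Skip continuation bytes until the next possible character start."""
    while pos < len(buf) and classify(buf[pos]) == "continuation":
        pos += 1
    return pos


for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", encoded.hex(" "), [classify(b) for b in encoded])

# Landing in the middle of "中" (3 bytes), the next start is at index 3.
print(next_start("中a".encode("utf-8"), 1))  # 3
```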

In practice, UTF-8 won: it is used by the overwhelming majority of web pages.

Text stores characters through an encoding. Binary formats store structure too.

How binary files describe themselves

Binary files usually follow a layout: header first, payload second.

The header carries metadata such as dimensions, bit depth, and compression mode.

BMP starts with "BM", then includes file size and the pixel-data offset. Next comes the DIB header with width, height, and bit depth. Pixel bytes come after those headers.

Because field sizes and offsets are fixed by spec, a parser knows exactly where each value lives.
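Fixed offsets make parsing short. The sketch below assumes the common 40-byte BITMAPINFOHEADER variant of the DIB header (other variants exist), and the helpers `make_bmp` and `parse_bmp` are hypothetical names. The file is built in memory so the example has something to parse.

```python
import struct

FILE_HEADER = struct.Struct("<2sIHHI")       # magic, file size, 2 reserved, pixel offset
INFO_HEADER = struct.Struct("<IiiHHIIiiII")  # 40-byte BITMAPINFOHEADER (assumed variant)


def make_bmp(width: int, height: int) -> bytes:
    """Build a tiny all-black 24-bit BMP in memory, for demonstration only."""
    row_size = (width * 3 + 3) // 4 * 4      # rows are padded to a multiple of 4 bytes
    pixels = bytes(row_size * height)
    offset = FILE_HEADER.size + INFO_HEADER.size
    file_header = FILE_HEADER.pack(b"BM", offset + len(pixels), 0, 0, offset)
    info_header = INFO_HEADER.pack(
        INFO_HEADER.size, width, height, 1, 24, 0, len(pixels), 2835, 2835, 0, 0
    )
    return file_header + info_header + pixels


def parse_bmp(data: bytes) -> dict:
    """Read the fields named in the text from their fixed positions."""
    magic, file_size, _, _, pixel_offset = FILE_HEADER.unpack_from(data, 0)
    if magic != b"BM":
        raise ValueError("not a BMP file")
    fields = INFO_HEADER.unpack_from(data, FILE_HEADER.size)
    return {
        "file_size": file_size,
        "pixel_offset": pixel_offset,
        "width": fields[1],
        "height": fields[2],
        "bits_per_pixel": fields[4],
    }


info = parse_bmp(make_bmp(2, 2))
print(info)
```

The format strings use `<` because BMP stores its integers in little-endian order.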

Most binary formats follow this same pattern: magic bytes, then a metadata header, then the payload.

Pixels as bytes

In uncompressed 24-bit BMP, each pixel is three bytes in BGR order (blue, green, red), not RGB.

FF means full intensity (255). 00 means zero. Rows are padded to 4-byte boundaries, so some rows end with extra padding bytes.

A red pixel in BMP is 00 00 FF (blue=0, green=0, red=255). White is FF FF FF. Black is 00 00 00.

A 1920x1080 image at 3 bytes per pixel needs 6,220,800 bytes (a little over 6 MB) for pixel data alone, before headers. At that width each row is 5,760 bytes, already a multiple of 4, so no padding is added.
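The byte order and size arithmetic can be checked with a short sketch. The function names `bgr` and `row_size` are made up for this example.

```python
def bgr(red: int, green: int, blue: int) -> bytes:
    """Pack one 24-bit pixel in BMP order: blue, green, red."""
    return bytes([blue, green, red])


def row_size(width: int, bytes_per_pixel: int = 3) -> int:
    """Bytes per stored row, rounded up to a multiple of 4."""
    return (width * bytes_per_pixel + 3) // 4 * 4


print(bgr(255, 0, 0).hex(" "))    # 00 00 ff  (red)
print(row_size(3))                # 12: 9 pixel bytes plus 3 padding bytes
print(row_size(1920) * 1080)      # 6220800
```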

JPEG and PNG exist largely to compress this data.

JPEG usually uses lossy compression, discarding details that are less visible to human vision. (JPEG does define a lossless mode, but it is uncommon.)

PNG uses lossless compression, so decoded pixels match the original bit for bit.
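To see what "bit for bit" means, here is a small round trip through Python's zlib module, which implements DEFLATE, the compression method PNG uses. This is not a PNG encoder: a real PNG adds chunks and per-row filters that the sketch leaves out. The image dimensions and colour are arbitrary.

```python
import zlib

width, height = 256, 256
raw = bytes([0, 0, 255]) * (width * height)  # solid red, three bytes per pixel

packed = zlib.compress(raw)
restored = zlib.decompress(packed)

print(len(raw), len(packed))   # the compressed size is far smaller for flat colour
print(restored == raw)         # True: nothing was lost
```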

So far, each file format has held one main payload. Containers are different.

Containers: files within files

Container formats hold many files inside one outer file. ZIP is the familiar example.

Each entry can be compressed separately, and the archive includes an index so tools can jump to a specific file.

Each ZIP entry has a local header (PK\x03\x04) followed by that entry's data, usually compressed. After all entries, ZIP stores a central directory with each file's name, size, and offset.

That directory is at the end, which makes appending practical: write new entry data, then rewrite the directory.
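Python's standard zipfile module exposes this layout. The sketch below builds a small archive in memory and reads the central directory back. The member names are made up. The byte searches for PK\x01\x02 (central directory entry) and PK\x05\x06 (end-of-central-directory record) are a naive scan used here only to show where those structures sit, since the same bytes could in principle occur inside compressed data.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("hello.txt", "hello " * 100)
    archive.writestr("notes/todo.txt", "buy milk\n")

raw = buffer.getvalue()
print(raw[:4])                      # b'PK\x03\x04': first local header
directory_at = raw.find(b"PK\x01\x02")
end_record_at = raw.rfind(b"PK\x05\x06")
print(directory_at, end_record_at)  # both sit after the entry data

with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    entries = archive.infolist()    # read from the central directory
    for entry in entries:
        print(entry.filename, entry.file_size, entry.compress_size, entry.header_offset)
    todo = archive.read("notes/todo.txt")  # jump straight to one member
    print(todo)
```

Each `header_offset` points back at that member's local header, which is how a tool can extract one file without reading the rest.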

DOCX, JAR, EPUB, and APK all use ZIP containers with their own internal conventions. If you rename a .docx file to .zip, you can inspect the XML and media files directly.

Binary formats like ZIP, PNG, and BMP all have headers that declare how to read them. Plain text files don't.

When encoding goes wrong

Text files have an annoying weakness: encoding ambiguity.

A PNG declares structure in headers. A plain text file is often just bytes with no explicit encoding marker. If someone writes UTF-8 and another tool reads Latin-1, multi-byte sequences get decoded as separate Latin-1 characters. That artifact is mojibake.

The UTF-8 BOM (EF BB BF) can help because it signals encoding at the start of a file. But BOMs are optional and many tools omit them.
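Both effects are easy to reproduce. The sketch below writes UTF-8, reads it back as Latin-1 to produce mojibake, then shows the BOM that Python's "utf-8-sig" codec adds. The sample word is arbitrary.

```python
import codecs

original = "café"
utf8_bytes = original.encode("utf-8")
print(utf8_bytes.hex(" "))                  # 63 61 66 c3 a9

# Reading UTF-8 bytes as Latin-1 turns the two-byte "é" into two characters.
garbled = utf8_bytes.decode("latin-1")
print(garbled)                              # cafÃ©

# The bytes were never damaged, so reversing the wrong decode recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)                 # True

# The "utf-8-sig" codec prepends the BOM when writing and strips it when reading.
with_bom = original.encode("utf-8-sig")
print(with_bom.startswith(codecs.BOM_UTF8)) # True
print(with_bom[:3].hex(" "))                # ef bb bf
```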

The practical fix has been social, not technical: default to UTF-8 almost everywhere.

So what happens at open time?

How the OS identifies files

Operating systems combine multiple checks, trading speed for certainty.

The first pass is usually the extension: very fast, sometimes wrong.

Next comes signature checking (magic bytes), which is much more reliable for binary formats.

If that still is not enough, the OS or app may inspect deeper structure. That is how software can tell a DOCX (internally ZIP + Word-specific files) from a generic ZIP.

Systems weigh these signals differently. Windows leans on extensions and registry mappings. macOS uses Uniform Type Identifiers (UTIs) derived from extensions, declared mappings, and sometimes content checks. Linux tooling often relies on magic-byte databases (for example, the file command). Browsers use server Content-Type headers and may sniff content when headers are missing or incorrect.

No single method is enough. Extensions are cheap but fragile. Magic bytes are reliable but require I/O. Deep sniffing is slower but resolves edge cases.
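The three layers can be sketched as one function that reports what each check concludes. Everything named here is hypothetical: the `BY_EXTENSION` table, the `identify` function, and the `fake_docx` helper. Treating the presence of word/document.xml as the mark of a DOCX is a simplifying assumption based on the usual layout of such files. No operating system is claimed to work exactly this way.

```python
import io
import os
import zipfile

# Hypothetical extension table. A real OS keeps a much larger registry.
BY_EXTENSION = {".txt": "text", ".png": "png", ".zip": "zip", ".docx": "docx"}


def identify(name: str, data: bytes) -> dict:
    """Run three checks of increasing cost and report each result."""
    # Check 1: the extension. Cheap, needs no file I/O, easy to fool.
    result = {"extension": BY_EXTENSION.get(os.path.splitext(name)[1].lower())}

    # Check 2: magic bytes at the start of the content.
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        result["magic"] = "png"
    elif data.startswith(b"PK\x03\x04"):
        result["magic"] = "zip"
    else:
        result["magic"] = None

    # Check 3: look inside ZIP containers for a Word-specific member.
    result["deep"] = result["magic"]
    if result["magic"] == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            if "word/document.xml" in archive.namelist():
                result["deep"] = "docx"
    return result


def fake_docx() -> bytes:
    """Build a ZIP that merely looks like a DOCX, for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("word/document.xml", "<document/>")
    return buffer.getvalue()


print(identify("report.zip", fake_docx()))
print(identify("photo.txt", b"\x89PNG\r\n\x1a\n"))
```

In the first call the extension and the magic bytes both say "zip", and only the deep check finds the DOCX. In the second, the extension says "text" while the content says "png".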

What exactly is a file format? An interactive guide