
An initial analysis of the discovered Unix V4 tape

Several news outlets reported the discovery of a 1970s Fourth Edition Research Unix magnetic tape at the University of Utah in July 2025 and its successful restoration. This is a significant find, because up to now only the Fourth Edition’s manual was thought to have survived. Over the past few days I incorporated the tape’s source code into the Unix History Repository hosted on GitHub (see it here) and studied the code’s composition.

The Fourth Research Edition Unix came out of the famous AT&T Bell Laboratories in November 1973. A significant development it introduced was the rewriting of large parts of the system’s kernel in a high-level language (early C) rather than PDP-11 assembly language. The tape contains a complete system dump, including both source code and the compiled binaries and kernel. For inclusion in the Unix history repository, I removed the binaries, to match what is normally put under source code version control.

find $dir -name '*.[oa]' | xargs rm
rm -rf $dir/bin $dir/usr/bin $dir/usr/games $dir/lib $dir/dev
rm $dir/etc/{lpd,init,msh,getty,mkfs,mknod,glob,update,umount,mount}
rm $dir/unix
rm $dir/usr/mdec/[tm]boot $dir/usr/sys/conf/mkconf $dir/usr/fort/fc1
rm $dir/usr/c/cvopt $dir/usr/lib/suftab

The discovered Fourth Edition tape (photo credit Rob Ricci)

As with other source code snapshots included in the Unix history repository, the (synthetic) Git commit timestamps are derived from the file timestamps, while the commit authors are derived from a manually created map file. I updated the existing V4 author map file based on information I had gathered for the preceding and following Research Unix editions. To mark missing details, I explicitly assigned ken,dmr (Ken Thompson and Dennis Ritchie, the system’s main developers) to all source code files for which I lacked author information (this is also the default, introduced via a .* regular expression). Two members of the original Bell Labs Unix development team kindly provided information to fill in some details, such as the developer of the SNOBOL III interpreter (Ken Thompson) and the implementer of the math library and emulator (Robert H. Morris).
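To illustrate how a regular-expression author map with a catch-all default can work, here is a minimal sketch. The two-column "regex author" layout, the function name author_for, and the sample paths are assumptions made for this example; the repository's actual map file may be organized differently.

```shell
# Print the author for a path using the first matching "regex author" line.
# The two-column map layout is an assumption made for illustration; it is
# not necessarily the format used by the Unix history repository.
author_for()
{
  # $1: file path, $2: map file
  awk -v path="$1" 'NF >= 2 && path ~ ("^" $1 "$") { print $2; exit }' "$2"
}
```

Because the first match wins, specific patterns must precede the `.*` line, which then acts as the default for every path not claimed earlier.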

Some have claimed that the tape’s contents are very close to the Fifth Edition rather than to what really was the Fourth Edition. The reason for this claim is that, in contrast to the Unix manual editions (which were formally numbered and give the Unix Research Editions their name), distributed software tapes were mostly a copy of whatever was at the time on the (single) Unix development computer. I set out to examine the differences between the two versions. First, I looked at the base file names included in the two.

normalize()
{
  sed 's|.*/||' | sort -u
}

comm  -3 \
  <(git ls-tree -r --name-only Research-V4-Snapshot-Development | normalize) \
  <(git ls-tree -r --name-only Research-V5-Snapshot-Development | normalize)

The above command, which outputs files whose base file name occurs only in one of the two releases, shows only the following files introduced in the Fifth Edition.

        c13.c
        c21.c
        c2h.c
        cmp.c
        ldfps.s

So, the C compiler grew by a few files, and the cmp (compare) utility was written in C.
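The base-name comparison can be tried on small inputs without the repository. The sketch below applies the same normalize step to two path lists stored in files; the function name only_in_one and the use of temporary files in place of process substitution are choices made for this example.

```shell
# Reduce each path to its base name and remove duplicates.
normalize()
{
  sed 's|.*/||' | sort -u
}

# Print base names that occur in only one of two path lists.
# $1, $2: files holding one path per line (stand-ins for the
# git ls-tree output of the two snapshots).
only_in_one()
{
  a=$(mktemp) b=$(mktemp)
  normalize <"$1" >"$a"
  normalize <"$2" >"$b"
  # Column 1: only in the first list; column 2 (tab-indented): only in the second.
  comm -3 "$a" "$b"
  rm -f "$a" "$b"
}
```

The indentation of the file names listed above is therefore meaningful: comm places names found only in its second input in the indented second column. Note that comparing base names alone cannot detect files that merely moved between directories, nor distinguish files that share a name.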

To dig deeper, I then ran git blame on each file of the two editions to see which parts of preceding editions they incorporated.

# For each edition
for ref in Research-V4-Snapshot-Development \
  Research-V5-Snapshot-Development ; do
  echo $ref
  # For all the edition's files
  git ls-tree -r --name-only $ref |
    # Exclude administrative files introduced in the repo.
    grep -Ev 'README|LICENSE|\.pdf|\.ref' |
    # Run git-blame on each.
    xargs -I '{}' git blame -M -M -C -C $ref -- '{}' |
    sort |
    # Sum lines for each commit
    uniq -c |
    # Obtain lines and provenance of each commit; output totals.
    awk '{("git show " $2 "| awk '\''/Synthesized-from:/{print $2}'\''") | getline ver; total[ver] += $1 }
      END {for (v in total) print v, total[v]}'
done

The output gave me the following composition of the Fourth Edition in terms of lines of code:

v4 75676
v3 6590
v2 168

This shows a lot of new material, with about 8% of the lines coming from earlier editions.
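The shares can be checked directly from the totals. The following sketch, in which the function name shares is made up for this example, converts "edition lines" pairs into percentages of the overall line count.

```shell
# Print each edition's share of the total line count.
# stdin: "edition lines" pairs, one per line.
shares()
{
  awk '{ n[$1] = $2; total += $2 }
    END { for (v in n) printf "%s %.1f%%\n", v, 100 * n[v] / total }' | sort
}

shares <<'EOF'
v4 75676
v3 6590
v2 168
EOF
# Prints:
# v2 0.2%
# v3 8.0%
# v4 91.8%
```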

The corresponding output for the Fifth Edition is as follows.

v5 11181
v4 52238
v3 3296
v2 168

This shows that 52 thousand lines of the Fourth Edition are indeed part of the Fifth Edition, but the Fifth Edition also introduces about eleven thousand new lines of code. This is not an insignificant amount.
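The per-edition totals in both listings come from the final awk stage of the blame pipeline, which adds each count to the edition named in the commit's Synthesized-from: line. The grouping can be sketched in isolation by replacing the git show call with a lookup table. The function name tally_by_edition, the table file, and the short commit identifiers used to exercise it are all invented for illustration.

```shell
# Sum line counts per edition.
# stdin: "count commit" pairs, the layout the awk stage reads as $1 and $2.
# $1: file of "commit edition" pairs (a hypothetical stand-in for looking up
#     the Synthesized-from: line of each commit with git show).
tally_by_edition()
{
  awk 'NR == FNR { edition[$1] = $2; next }   # first file: load the lookup table
    { total[edition[$2]] += $1 }              # stdin: add count to its edition
    END { for (e in total) print e, total[e] }' "$1" - | sort
}
```

For instance, with a table mapping commits aaa to v3 and bbb and ccc to v4, the input lines "10 aaa", "5 bbb", and "7 ccc" yield "v3 10" and "v4 12".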

Finally, I also looked at the average timestamp of the files included in each release.

# For each Research Edition
for v in $(seq 1 7) ; do
  ref=Research-V$v-Snapshot-Development
  printf '%s\t' $ref
  # List all files
  git ls-tree -r --name-only $ref |
    # Exclude administrative files introduced in the repo.
    grep -Ev 'README|LICENSE|\.pdf|\.ref' |
    # Output each file's commit time.
    xargs -I@ git log -1 --format=%at $ref -- @ |
    # Obtain average value and format it as a date.
    date -I -d @$(awk '{s += $1} END {printf("%.0f", s / NR)}')
done

Here are the results.

V1        1972-06-20
V2        1972-05-31
V3        1973-03-10
V4        1974-03-06
V5        1974-11-28
V6        1975-06-15
V7        1979-01-25

The results indicate that the Fourth Edition precedes the Fifth Edition by almost nine months, a significant period given the pace at which the system was evolving at the time. (The results also show that I need to examine the apparent timing mismatch between the First and Second Editions.)
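The gap between the two average dates can be computed rather than estimated by eye. This is a small sketch assuming GNU date, as in the date -I -d call above; the function name days_between is made up for this example.

```shell
# Print the number of whole days between two ISO dates.
# GNU date is assumed (the -d option is not portable to all systems).
days_between()
{
  # -u avoids daylight-saving shifts affecting the difference.
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  echo $(( (end - start) / 86400 ))
}

days_between 1974-03-06 1974-11-28   # prints 267
```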
