Putting the "Portable" into documents

Episode No. 2   •   January 30, 2025   •   8 min, 32 seconds

When PDF was introduced in 1993, one of the most persistent problems in mainstream computing was that reliably publishing documents (either literally via printing, or simply electronically distributing them for others to view) was hard.

There were a lot of hurdles:

  • Simply moving a document (whether an office document, PostScript file, or something else) from one computer to another could result in an unreadable or unpleasant display.
  • Printers (from consumer models up to high-end typesetters) each had their own proprietary formats and requirements.
  • Many document formats were tied to a single vendor, or a single operating system.

One of PDF's initial design criteria and fundamental promises was to address this family of problems, so that one could distribute and use documents with any display, any operating system, and any print device, with confidence that the result would remain faithful to the author's intent. This was such a pressing, unmet concern that it gave the file type its name: the Portable Document Format. Let's talk about how that portability is accomplished.

Documents are heterogeneous…

Most document formats focus on text: oftentimes its logical structure, sometimes some aspects of its appearance, and occasionally some metadata. However, for a document to be faithfully rendered away from its author's computer, a host of other data is needed: fonts, images (if any), vector graphics, essential auxiliary data, and so on. Documents are definitionally heterogeneous, and missing any part of a document's data or dependencies can render it useless.

Web content handles this by referring to these data as external resources, with the expectation that browsers will fetch and integrate them appropriately. Most non-PDF document formats are structured the same way: for example, files in PostScript, PDF's predecessor, refer to fonts and images in much the same way as HTML does (though using names and sometimes hard-coded relative file paths instead of URLs), and those resources have to be carried around alongside the document(s) that refer to them. But if a PostScript or HTML file refers to resources that aren't available or have moved unexpectedly, the document's rendering will be fundamentally broken.

…so every PDF carries what it needs

PDF's solution to this problem is to avoid referring to external resources entirely[1]. Instead, PDF documents are self-contained: all of the data needed to render the document is included, from fonts to images to metadata to interactive elements and auxiliary data. Satisfying this most basic premise (knowing that a document's resources will always travel with it) clears the lowest bar of portability.
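To make the contrast concrete, here is a minimal Python sketch of the two approaches. It is only an illustration, not how PDF is actually encoded: the class names (`LinkedDocument`, `SelfContainedDocument`), the `embed` helper, and the resource names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LinkedDocument:
    """Text plus the *names* of resources stored elsewhere (the HTML/PostScript style)."""
    text: str
    resource_names: list[str]

    def render(self, available: dict[str, bytes]) -> str:
        # Rendering depends on whatever happens to exist on this machine.
        missing = [name for name in self.resource_names if name not in available]
        if missing:
            raise LookupError(f"cannot render, missing: {missing}")
        return f"rendered {self.text!r} with {len(self.resource_names)} resources"


@dataclass
class SelfContainedDocument:
    """Text plus the resource bytes themselves (the self-contained style)."""
    text: str
    resources: dict[str, bytes] = field(default_factory=dict)

    def render(self) -> str:
        # Everything needed travels inside the document, so no lookup can fail.
        return f"rendered {self.text!r} with {len(self.resources)} resources"


def embed(doc: LinkedDocument, available: dict[str, bytes]) -> SelfContainedDocument:
    """Copy every referenced resource into the document at authoring time."""
    return SelfContainedDocument(
        doc.text, {name: available[name] for name in doc.resource_names}
    )
```

On the author's machine, where every named resource exists, both kinds of document render. On a machine with none of those resources, `LinkedDocument.render` raises `LookupError`, while a `SelfContainedDocument` produced by `embed` still renders, because the cost of gathering resources was paid once, at authoring time.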

Next time, I'll talk about the (very cool) fundamental structures within every PDF document, and how they are designed to support including all of these disparate data types and resources in a single container file.

Rendering documents to different devices is hard…

At the time of PDF's introduction, document rendering was done in a bespoke way by each individual application, and often was tied to the particular operating system and output device being targeted. That is, a word processing program would need to use a completely different rendering approach when rendering to a display on Windows vs. a display on a Mac vs. sending a document to a printer.

…so PDF uses an abstract rendering model for all of them

Adobe changed that by introducing[2] (as part of PostScript) what would later come to be known as the Adobe Imaging Model, a high-level procedural rendering approach that provided abstractions over the details of operating system and output device. The model includes command primitives for drawing text, lines, shapes, and images, and for setting fonts, colors, clipping paths, and so on. PDF adopted most of the PostScript graphics model's semantics, and then extended it over the decades to support new features, media types, and usage patterns.

It was a good abstraction, in large part because it neatly separated concerns between groups with different incentives and requirements: applications could target a relatively high-level rendering model, a far simpler task than needing to know the details of each class of display or printer they might render to; and groups responsible for implementing displays (usually operating system vendors) and manufacturing printers could focus on distilling those high-level graphics commands into concrete actions to color pixels, move print heads, and so on.
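The separation described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Adobe Imaging Model itself: the `Device` interface, the single `fill_rect` primitive, and both device classes are hypothetical, and the only detail borrowed from PDF is the convention of measuring in points (1/72 inch).

```python
from typing import Protocol


class Device(Protocol):
    """What a display or printer implementer provides (hypothetical interface)."""

    def fill_rect(self, x: float, y: float, w: float, h: float) -> None: ...


class RasterDevice:
    """Maps abstract units (points, 1/72 inch) to pixels at a given resolution."""

    def __init__(self, width_pt: float, height_pt: float, dpi: int) -> None:
        self.scale = dpi / 72  # pixels per point
        self.cols = round(width_pt * self.scale)
        self.rows = round(height_pt * self.scale)
        self.pixels = [[0] * self.cols for _ in range(self.rows)]

    def fill_rect(self, x: float, y: float, w: float, h: float) -> None:
        s = self.scale
        for row in range(round(y * s), min(self.rows, round((y + h) * s))):
            for col in range(round(x * s), min(self.cols, round((x + w) * s))):
                self.pixels[row][col] = 1

    def coverage(self) -> float:
        """Fraction of the page's pixels that have been painted."""
        return sum(map(sum, self.pixels)) / (self.rows * self.cols)


class RecordingDevice:
    """A second backend that stores commands instead of painting pixels."""

    def __init__(self) -> None:
        self.commands: list[tuple] = []

    def fill_rect(self, x: float, y: float, w: float, h: float) -> None:
        self.commands.append(("fill_rect", x, y, w, h))


def draw_page(device: Device) -> None:
    """Application side: knows nothing about pixels, DPI, or print heads."""
    device.fill_rect(18, 18, 36, 36)  # a half-inch square, a quarter inch from the corner
```

Calling `draw_page` on a `RasterDevice(72, 72, dpi=72)` and on a `RasterDevice(72, 72, dpi=144)` paints a different number of pixels, but the square covers the same fraction of the page in both cases. The application code never changes; only the device decides what "36 points" means in concrete terms.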

This imaging model was such a successful abstraction that it effectively redefined how 2D graphics are programmed and rendered. If you've done any graphics programming in the last 30 years, you've benefitted from the results of that progress, as you've surely used a library or API that provides a similar abstraction; the Adobe Imaging Model was the direct precursor to the most widely used modern 2D graphics APIs like Java's Graphics2D, .NET's System.Drawing, Skia's Canvas, and the web-standard canvas API. We'll talk a lot about this graphics model in future posts.

Proprietary document formats actively prevent portability…

Before PDF, most document formats were proprietary, and choices were regularly made by vendors to use document formats as competitive leverage, usually to the detriment of users' interests.

Microsoft Word was a particularly notorious offender, as there was not a single "Word document format", but rather a matrix of format variations depending on the version of Word and the operating system being used, each with its own quirks and limitations when it came to importing other variants. While this was a great benefit to Microsoft's Word and Windows businesses, it was a nightmare for users who needed to share documents with others using different programs or operating systems.

When Adobe first introduced PDF in 1993, it could have kept the format strictly proprietary, so that only Adobe and its designated partners could implement PDF generators, viewers, and so on. After all, other peer companies and file formats (e.g. Microsoft with Word, Apple with QuickTime) had taken that approach, to great commercial success.

…so PDF was "open" from the start

Instead of introducing yet another proprietary file format, Adobe did two things with PDF that were quite unusual:

  1. They published a detailed specification of the format in 1993, including the algorithms and data structures used to encode and decode PDF documents. Further, they explicitly encouraged software vendors, printer manufacturers, and others to adopt and implement PDF. This was a big deal: it meant that anyone could write software to read or write PDF documents, without needing to reverse engineer the format. This made it possible for a wide variety of software to support PDF, from word processors and web browsers to printers and image editors.
  2. Later, in 2008, Adobe submitted the PDF specification to the International Organization for Standardization (ISO), where it was accepted as an open standard, and has since been further refined and expanded in collaboration among a diverse group of interested vendors. As part of this, Adobe also issued a public patent license[3], in which they explicitly swore off any claim to enforce patents that covered technologies within the PDF standard[4].

If Adobe had treated PDF as a strictly proprietary format, existing only to enrich themselves and provide them with a unique competitive advantage, I don't think PDF would be as widely used as it is today. More importantly, though, without a coordinated expectation of "openness" (however vaguely defined or informal in the early days), and then the tangible commitment to remove all remaining proprietary interests from the PDF landscape[5], it's likely that other vendors and groups would have attempted to create mutually incompatible PDF variants over time.

Such fragmentation would have significantly degraded the real-world portability of PDF documents: just imagine if Microsoft or Apple or Google had successfully pushed their own incompatible PDF variants (or some other wholly different document format[6]), to the extent that "real" PDF documents were no longer guaranteed to render correctly on Windows, or Mac, or iPhone or Android devices. The promise of PDF's portability would have been broken.


PDF effectively solved the problem of document portability by addressing these three fundamental issues: structurally guaranteeing that document resources would always move with the document; disentangling document rendering from any particular display, device, or operating system via an abstract rendering model; and being first an open and then a standardized specification that anyone could implement. This accomplishment did not come without its own set of tradeoffs, which we'll come back to in later posts.

Footnotes

  1. PDF documents are allowed to refer to certain types of resources using file paths, but this rare practice is a concession to certain specialized workflows where it would be extremely costly to repeatedly embed frequently-updated resources on every edit.

  2. The actual graphics model was first introduced in a 1982 paper, published well before Adobe was ever founded. 'A device independent graphics imaging model for use with raster devices' is a short, easy-to-read paper, and is well worth reading to better understand the design decisions that underpin the graphics model, and thus, PDF itself.

  3. https://www.adobe.com/pdf/pdfs/ISO32000-1PublicPatentLicense.pdf

  4. Prior to this, Adobe had made informal assurances about their disinterest in enforcing PDF-related patents against third-party vendors and open source projects that implemented PDF support. Those assurances were not legally binding, so the formal patent grant took the legal risk associated with implementing PDF software off the table for good.

  5. This is not to say that Adobe has not benefitted from making PDF an open standard. They have, and continue to do so, in many ways. However, the point is that the benefits of making PDF an open standard have been widely distributed, and have accrued to many parties, not just Adobe.

  6. Microsoft did try to push their own document format, XPS, as a competitor to PDF. It never gained significant traction, and Microsoft has since deprecated it.