You don't know XPT files

...unless you belong to the magical world of clinical trials! ¹ ²

Every clinical trial headed for the FDA produces a fresh pile of XPT files. The format is nearly 40 years old, stores its numbers in IBM hexadecimal floating point, and caps variable names at 8 characters. It is also not going anywhere, because regulators still mandate it.

Show them to me!

First of all, would you like to open an XPT file? Well, it just so happens that I launched a web app that does exactly that (among many other wonderful things, of course), and that this might be the blog of the website that tries to promote it, in some way. If you're not interested in the app but you stick around, you might still learn a thing or two from this post.

The app is Bedevere Wise, named after the acute and science-fond knight of the Round Table. Here is how it works: you take your XPT file, you drop it into the app, and voilà, magic!

Well, you can also try to drop many other kinds of files used in the clinical trial world, like CSV, SAS7BDAT, Stata and even Excel! Or you can let the app browse a folder for you. And everything stays in your browser! For the pickiest among you, a desktop version of the app is on its way too, perfect even for the most air-gapped, strictly-secure, highly-regulated environments!

In case you are curious to try, but you don't have an XPT file handy, here is a GitHub repo with a full clinical trial dataset and plenty of XPT files that you can download.

Or, if you just want a couple of tiny files to poke at, grab these unimpressive, synthetic, test samples:

So, what are these damn files?

I'd like to start by saying, "According to Wikipedia...", but unfortunately, there's no page on Wikipedia about the topic. However, there is a page about the SAS Transport File Format (XPORT) Family in the Digital Preservation section of The Library of Congress of the United States. The page says:

The SAS Transport File Format is an openly documented specification maintained by SAS, a commercial company with a variety of software products for statistics and business analytics, including the application now known as SAS/STAT, which originated in the late 1960s as SAS (an acronym for Statistical Analysis System) at North Carolina State University. The transport format was originally developed in the late 1980s when the corporate entity was known as SAS Institute, Inc. and the software as SAS, to support data transfers between statistical software systems, especially between SAS applications running on different operating systems. SAS considers it non-proprietary. ³

Basically, it's a dinosaur of the computer age! An almost 40-year-old tabular data format, designed to store things that can be represented as tables, with headers and columns that are either text or numbers.

And given its age, it comes as no surprise that the XPT format is on the Library's preservation list: over its long life it simply became ubiquitous in the pharma/clinical trials world. Hence the reason to publicly preserve the format.

And the reason they became — and stay — so ubiquitous? Regulation. If you want to submit the data of a clinical trial to the FDA (and it's a similar story for Japan's PMDA), you don't really get to choose your file format: submission datasets have to be provided as SAS Transport Format Version 5. Yes, the .xpt we've been talking about. So every study headed for a regulatory submission produces a fresh pile of XPT files. And that pile isn't shrinking any time soon.

It shouldn't surprise you, then, that many other, more familiar file formats are in the Library of Congress too, like PDF, JSON or ZIP. The reason is quite simple: those formats have become essential in our everyday lives, not only for people in IT or software engineers, but for everyone. Even though most people don't know all the technical details and the specifics of the format, everyone relies on them, and there must be ways to know how to interact with those formats. They became public protocols, like the ones we use to send internet packets or to open web pages.

And XPT files are no less than that! You — small pharma company — want to establish a formal communication with the FDA? Then, you speak their language and submit XPT files.

Wait, I thought it was all IEEE 754!

If you managed to stick around up to this point, you might either be a bit of a masochist, an LLM crawler, or simply interested in some more technical details. And well, from a software engineering perspective, there is at least one worth mentioning: numeric values in the XPT format are stored using the IBM float representation (yes, there is a page on Wikipedia about it). So, in order to be used by modern programming languages, they must be converted into the more convenient IEEE 754 floating point standard. So no, even in 2026 it's not all IEEE 754. ⁴

If you speak Python, here is the code used by Pandas, which I first ~~copied~~ re-implemented in one of my Go libraries. Since that implementation is a bit too long and detailed, I'll use a shorter version here:

# A numeric whose first byte is one of these *and* whose remaining
# 7 bytes are all zero is a SAS missing value rather than a real number:
#     '.'      -> 0x2E         ordinary missing
#     '_'      -> 0x5F         special missing  ._
#     'A'..'Z' -> 0x41..0x5A   special missings .A .. .Z
MISSING_FIRST = {0x2E, 0x5F} | set(range(0x41, 0x5A + 1))


def ibm_to_double(raw):
    """
    Decode up to 8 bytes of IBM hex float.
    
    Returns a Python float for real numbers, or None for a SAS missing
    value. (The specific missing code, '.'/'_'/'A'..'Z', is available
    as the byte `raw[0]` if you need to distinguish them.)
    """
    
    # short numerics carry only the high-order bytes
    b = raw.ljust(8, b"\x00")
    
    # SAS missing value: a marker byte followed by nothing but zeros.
    if b[1:] == b"\x00" * 7 and b[0] in MISSING_FIRST:
        return None
    
    n = int.from_bytes(b, "big")
    
    # check for 0
    if n == 0:
        return 0.0
    
    sign = -1.0 if n >> 63 else 1.0
    exponent = (n >> 56) & 0x7F            # excess-64, power of 16
    fraction = n & 0x00FFFFFFFFFFFFFF      # low 56 bits = integer mantissa
    
    return sign * fraction * 16.0 ** (exponent - 78)

ibm1 = bytes.fromhex("41 50 00 00 00 00 00 00")
print(ibm_to_double(ibm1)) # 5.0

ibm2 = bytes.fromhex("C1 50 00 00 00 00 00 00")
print(ibm_to_double(ibm2)) # -5.0

ibm3 = bytes.fromhex("41 13 c0 83 12 6e 97 8e")
print(ibm_to_double(ibm3)) # 1.2345000000000002

ibm4 = bytes.fromhex("3d 81 72 5b 67 2e e3 40")
print(ibm_to_double(ibm4)) # 0.00012345

Ok, even if you are familiar with Python, that might look a bit too low-level, but that's how it is: moving from one numeric format to another means getting your hands dirty shifting bits around. Still, it's not as long and complicated as the Pandas implementation; here we are going straight to the point.

Decoding a number, bit by bit

First of all, how is the numeric value actually represented? Looking at Wikipedia, we know that it can be represented with 4 bytes (32 bits), laid out like this:

Sign: 1 bit · Exponent: 7 bits · Fraction: 24 bits

However, there might be cases in which the fraction uses 4 more bytes (56 bits in total), because the value is stored using the double precision format⁵:

Sign: 1 bit · Exponent: 7 bits · Fraction: 56 bits

Nothing really changes, since we read the most significant digits from left to right, and that's why we can just pad the bytes on the right with 0s until we get 8 bytes in total⁶:

b = raw.ljust(8, b"\x00")

Next step: we check whether the number is a SAS missing value (the format's take on Not-a-Number). That's as simple as checking whether the first byte is one of the MISSING_FIRST special values while the remaining 7 bytes are all 0s. Then we check whether the number is exactly 0 — nothing special here either, all bits must be 0. To check that, we read the value as a big-endian integer, which will also come in handy for the next step.
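If you want to poke at those special cases in isolation, here is a minimal sketch. The helper name `is_sas_missing` is made up for this example (it is not a Pandas or SAS function), and it repeats the padding step so that it also works on short numerics.

```python
# Marker bytes for SAS missing values: '.', '_', and 'A'..'Z'.
MISSING_FIRST = {0x2E, 0x5F} | set(range(0x41, 0x5A + 1))


def is_sas_missing(raw: bytes) -> bool:
    """Hypothetical helper: True if `raw` encodes a SAS missing value."""
    b = raw.ljust(8, b"\x00")  # pad short numerics back to 8 bytes
    return b[0] in MISSING_FIRST and b[1:] == b"\x00" * 7


print(is_sas_missing(bytes.fromhex("2e 00 00 00 00 00 00 00")))  # True: ordinary missing '.'
print(is_sas_missing(bytes.fromhex("41 00 00 00")))              # True: special missing .A, 4-byte numeric
print(is_sas_missing(bytes.fromhex("41 50 00 00 00 00 00 00")))  # False: that is the number 5.0
print(int.from_bytes(bytes(8), "big") == 0)                      # True: all-zero bytes mean 0.0
```

Note how 0x41 alone is not enough to call a value missing: the other seven bytes have to be zero too, otherwise 5.0 would be mistaken for .A.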

Now that we handled the special cases, we can do the actual conversion. From Wikipedia, we get:

(-1)^{\text{sign}} \times 0.\text{significand} \times 16^{\text{exponent} - 64}

which is almost what we have in the last instruction of our ibm_to_double function:

return sign * fraction * 16.0 ** (exponent - 78)

Where does that 78 come from, instead of the 64 in the formula? It's the same thing, just accounted for differently. Wikipedia treats the 56 fraction bits as 0.significand, i.e. a binary fraction between 0 and 1. We, on the other hand, are lazy and read those same 56 bits as a plain integer (that's the big-endian int.from_bytes from before). Reading 56 bits as an integer instead of a fraction means multiplying by 2^56, which is 16^14, so we have to compensate by lowering the exponent by 14, i.e. by subtracting 64 + 14 = 78 instead of 64. No magic, just moving the same factor from one place to another.
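Written out as a chain of equalities, the bookkeeping looks like this. The symbol F is my own shorthand (not Wikipedia's) for the 56 fraction bits read as an integer, so that 0.significand = F / 2^56:

```latex
0.\text{significand} \times 16^{\text{exponent} - 64}
  = \frac{F}{2^{56}} \times 16^{\text{exponent} - 64}
  = F \times 16^{-14} \times 16^{\text{exponent} - 64}
  = F \times 16^{\text{exponent} - 78}
```

The only fact used is 2^56 = (2^4)^14 = 16^14.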

Sign. If the first bit is 1, the number is negative, so we shift away everything but that first bit and use it in an if/else. Take the first example in the code, 41 50 00 00 00 00 00 00, which represents 5.0. The first byte 41 is 0100 0001 in binary. Shift everything 63 places to the right with the >> operator, and we're left with 63 0s plus that original first bit — still a 0 — so the number is positive. If we consider the second example instead, C1 50 00 00 00 00 00 00, the first byte C1 in binary is 1100 0001, and since the only thing that changed was the first bit, we get -5.0.

Exponent. We've already seen the shift used to select the first bit of our value. This expression

exponent = (n >> 56) & 0x7F

selects the first byte instead (>> 56) and ignores the first bit (& 0x7F), which is the sign and isn't needed here. If we take the first byte from the first and the second examples we have 0100 0001 and 1100 0001; after the bitwise AND with 0x7F (0111 1111), both give 0100 0001, which in base 10 is 65. Then 65 - 78 is -13, which doesn't really make sense on its own, since the number we want is 5.0 or -5.0.

Fraction. That -13 only starts to make sense once we bring in the fraction — everything but the first byte, obtained with & 0x00FFFFFFFFFFFFFF: 00 50 00 00 00 00 00 00, which in base 10 is 22517998136852480.

So, if we put everything together, we get

1.0 \times 22517998136852480 \times 16^{-13}

which is exactly 5.0! The magic of floating point representation!
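The walkthrough above can be replayed step by step in a Python shell. This is just the same three expressions from ibm_to_double applied to the first example, with the mask written as (1 << 56) - 1 so there are no F's to miscount:

```python
n = int.from_bytes(bytes.fromhex("41 50 00 00 00 00 00 00"), "big")

sign_bit = n >> 63                  # top bit: 0 means positive
exponent = (n >> 56) & 0x7F         # first byte without the sign bit
fraction = n & ((1 << 56) - 1)      # low 56 bits, read as an integer

print(sign_bit)                             # 0
print(exponent)                             # 65
print(fraction)                             # 22517998136852480
print(fraction * 16.0 ** (exponent - 78))   # 5.0

# Flipping only the top bit (41 -> C1) changes the sign and nothing else.
m = n | (1 << 63)
print(m >> 63, (m >> 56) & 0x7F, (m & ((1 << 56) - 1)) == fraction)  # 1 65 True
```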

My brain hurts

Everything else

Good! Now that we've sorted that out, everything else about the XPT format is quite straightforward. There are a handful of version numbers floating around (5, 6, 8, 9), but really just two families: the classic v5/6 — 8-character variable names, which is what we'll parse here — and v8/9, which mostly exists to allow longer variable names (through a few extra header records). Either way it's the same idea: a stack of fixed-layout records that look more or less like this (without the newline characters):

HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!000000000000000000000000000000
SAS     SAS     SASLIB  9.4     X64_10PR                        25NOV23:15:18:17
25NOV23:15:18:17
HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!000000000000000001600000000140
HEADER RECORD*******DSCRPTR HEADER RECORD!!!!!!!000000000000000000000000000000
SAS     VALUES  SASDATA 9.4     X64_10PR                        25NOV23:15:18:17
25NOV23:15:18:17
HEADER RECORD*******NAMESTR HEADER RECORD!!!!!!!000000000400000000000000000000

Take that second record — the one starting with SAS — it's the library's real header, and it decodes into a row of fixed-width fields:

Field     Content             Width (bytes)
marker    SAS                 8
marker    SAS                 8
marker    SASLIB              8
version   9.4                 8
OS        X64_10PR            8
padding   (blank)             24
created   25NOV23:15:18:17    16

The real header record: 80 bytes total — most fields are 8 bytes wide, with 24 bytes of padding and a 16-byte timestamp.
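Slicing a record like that is nothing more than walking a list of widths. Here is a small sketch, assuming the widths listed above; the labels in LAYOUT and the helper name `split_fixed` are illustrative names of mine, not official field names.

```python
# Rebuild the 80-byte record shown above; ljust keeps each field at its fixed width.
record = (
    "SAS".ljust(8) + "SAS".ljust(8) + "SASLIB".ljust(8)
    + "9.4".ljust(8) + "X64_10PR" + " " * 24 + "25NOV23:15:18:17"
).encode("ascii")

# (label, width) pairs; the labels are made up for this example.
LAYOUT = [("marker1", 8), ("marker2", 8), ("marker3", 8),
          ("version", 8), ("os", 8), ("padding", 24), ("created", 16)]


def split_fixed(rec: bytes, layout):
    """Slice a fixed-width record into a dict of stripped ASCII strings."""
    fields, pos = {}, 0
    for label, width in layout:
        fields[label] = rec[pos:pos + width].decode("ascii").strip()
        pos += width
    return fields


header = split_fixed(record, LAYOUT)
print(len(record))        # 80
print(header["version"])  # 9.4
print(header["os"])       # X64_10PR
print(header["created"])  # 25NOV23:15:18:17
```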

Apart from the dates (creation and last update), operating system, and SAS version, what matters here are the 4th and the last records:

...
HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!000000000000000001600000000140
...
HEADER RECORD*******NAMESTR HEADER RECORD!!!!!!!000000000400000000000000000000

The number 140 at the end of the 4th record tells us the number of bytes for the variable descriptor structure, while the last record tells us how many variables to expect (4 in this case). If that number 4, floating among a bunch of 0s, looks weird to you, you're not alone. Why did they decide to put it there, and not at the very end? Who knows...
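In code, pulling those two numbers out is a matter of slicing at fixed offsets. A minimal sketch, assuming the records look exactly like the ones printed above (trailing blanks stripped); the offsets are my reading of those two lines, and the function names are made up for this example.

```python
MEMBER = "HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!000000000000000001600000000140"
NAMESTR = "HEADER RECORD*******NAMESTR HEADER RECORD!!!!!!!000000000400000000000000000000"


def descriptor_size(member_record: str) -> int:
    """Size in bytes of one variable descriptor (the trailing 0140)."""
    return int(member_record[74:78])


def variable_count(namestr_record: str) -> int:
    """Number of variables, stored in the 4-digit field at offset 54."""
    return int(namestr_record[54:58])


print(descriptor_size(MEMBER))   # 140
print(variable_count(NAMESTR))   # 4
```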

Ok, now that we know how many variables we need to read next, we can start parsing them into this structure:

struct NAMESTR {
    short ntype;    /* VARIABLE TYPE: 1=NUMERIC, 2=CHAR */
    short nhfun;    /* HASH OF NNAME (always 0) */
    short nlng;     /* LENGTH OF VARIABLE IN OBSERVATION */
    short nvar0;    /* VARNUM */
    char8 nname;    /* NAME OF VARIABLE*/
    char40 nlabel;  /* LABEL OF VARIABLE */
    char8 nform;    /* NAME OF FORMAT */
    short nfl;      /* FORMAT FIELD LENGTH OR 0 */
    short nfd;      /* FORMAT NUMBER OF DECIMALS */
    short nfj;      /* 0=LEFT JUSTIFICATION, 1=RIGHT JUST */
    char nfill[2];  /* (UNUSED, FOR ALIGNMENT AND FUTURE) */
    char8 niform;   /* NAME OF INPUT FORMAT */
    short nifl;     /* INFORMAT LENGTH ATTRIBUTE */
    short nifd;     /* INFORMAT NUMBER OF DECIMALS */
    long npos;      /* POSITION OF VALUE IN OBSERVATION */
    char rest[52];  /* remaining fields are irrelevant */
};

I know, that's C, a language that's too scary for many programmers nowadays!

This is fine

But that's how it's reported in the technical documents, and if you want to read an XPT file in a different language (or even C itself), you need to know how short a short is and how long a long is.
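For the impatient: in this layout a short is 2 bytes and a long is 4, and as far as I can tell from the spec and from existing parsers they are stored big-endian. Under those assumptions the whole descriptor maps onto a single format string for Python's struct module. The AGE variable below is synthetic, built only to have something to unpack.

```python
import struct

# One big-endian NAMESTR: short = 2 bytes, long = 4 bytes, 140 bytes in total.
NAMESTR_FMT = ">hhhh8s40s8shhh2s8shhl52s"

# Synthetic descriptor for a numeric variable called AGE (made up for this example).
raw = struct.pack(
    NAMESTR_FMT,
    1, 0, 8, 1,                   # ntype, nhfun, nlng, nvar0
    b"AGE".ljust(8),              # nname, blank-padded to 8 bytes
    b"Age in years".ljust(40),    # nlabel
    b"".ljust(8), 0, 0, 0,        # nform, nfl, nfd, nfj
    b"\x00\x00",                  # nfill
    b"".ljust(8), 0, 0,           # niform, nifl, nifd
    0,                            # npos
    b"\x00" * 52,                 # rest
)

(ntype, nhfun, nlng, nvar0, nname, nlabel, nform, nfl, nfd, nfj,
 nfill, niform, nifl, nifd, npos, rest) = struct.unpack(NAMESTR_FMT, raw)

print(struct.calcsize(NAMESTR_FMT))      # 140
print(ntype, nlng, npos)                 # 1 8 0
print(nname.decode("ascii").rstrip())    # AGE
```

The 140 printed by calcsize is the same 140 found at the end of the MEMBER header record, which is a handy sanity check that the format string covers the whole struct.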

After we read the variable descriptors, we get another header record:

HEADER RECORD*******OBS     HEADER RECORD!!!!!!!000000000000000000000000000000

and after that, if we got the type and size of each variable right, we can finally read the observations, row by row!
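To give an idea of what "row by row" means in practice, here is a rough sketch. It assumes a hypothetical two-variable layout (a numeric AGE and a character SEX), rows packed back to back, and blank padding at the end of the data; the decoder is a condensed copy of the one from earlier, and `read_rows` is a name I made up.

```python
MISSING_FIRST = {0x2E, 0x5F} | set(range(0x41, 0x5A + 1))


def ibm_to_double(raw):
    """Condensed version of the decoder from earlier in the post."""
    b = raw.ljust(8, b"\x00")
    if b[0] in MISSING_FIRST and b[1:] == b"\x00" * 7:
        return None
    n = int.from_bytes(b, "big")
    if n == 0:
        return 0.0
    sign = -1.0 if n >> 63 else 1.0
    return sign * (n & ((1 << 56) - 1)) * 16.0 ** (((n >> 56) & 0x7F) - 78)


# Hypothetical layout: (name, ntype, nlng, npos), as read from two NAMESTR entries.
VARIABLES = [("AGE", 1, 8, 0), ("SEX", 2, 8, 8)]
ROW_LENGTH = sum(nlng for _, _, nlng, _ in VARIABLES)


def read_rows(data: bytes):
    """Yield one dict per observation from a block of packed rows."""
    for start in range(0, len(data) - ROW_LENGTH + 1, ROW_LENGTH):
        row = data[start:start + ROW_LENGTH]
        if row.strip(b" ") == b"":   # treated here as trailing blank padding
            break
        obs = {}
        for name, ntype, nlng, npos in VARIABLES:
            cell = row[npos:npos + nlng]
            obs[name] = ibm_to_double(cell) if ntype == 1 else cell.decode("ascii").rstrip()
        yield obs


data = (bytes.fromhex("41 50 00 00 00 00 00 00") + b"M".ljust(8)
        + bytes.fromhex("2e 00 00 00 00 00 00 00") + b"F".ljust(8)
        + b" " * 48)                 # pad the data out to 80 bytes
for obs in read_rows(data):
    print(obs)
# {'AGE': 5.0, 'SEX': 'M'}
# {'AGE': None, 'SEX': 'F'}
```

The all-blank check is a simplification: a dataset made only of character variables could contain a genuinely blank row, so a real parser needs to be more careful about where the data ends.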

Final Thoughts

If you made it this far, you now know more about IBM hexadecimal floats than you probably ever wanted to. The good news: the whole point of Bedevere Wise is that you don't have to. Drop your XPT file in and read it right away — no bit-shifting required. And it's about to get better: the next release makes reading XPT files faster and lets you export to XPT as well, so you can round-trip them instead of just opening them.

There are some well-known limitations though. If you pay attention to the variable descriptor structure, you might notice this:

struct NAMESTR {
    /* ... */
    char8 nname;    /* NAME OF VARIABLE*/
    /* ... */
};

which means that variable names can't be longer than 8 characters — at least in the classic v5/6 files (the ones regulators want), since v8/9 lifts the limit, as we saw. Yes, that's quite a constraint, but that's how it was designed! And, if you're familiar with clinical trials, I'm sure you've seen cryptic names like these: R2A1LO, AENTMTFL, or LBNRIND. Now you know why.

Did I mention XPT files enough?

¹ Actually, if .xpt rings a bell and you've never been near a clinical trial, you might know it from Mozilla: XPCOM used .xpt files (XPConnect Typelibs) to store compiled interface metadata. If that's you: I'm sorry, this post really isn't for you. :(
² Or maybe you belong to an economics department.
³ I was quite amazed myself to discover how easily I could grab the technical documents that I needed when I started implementing my first XPT parser in Go, before the LLM/agentic Age (2 BVC, Before Vibe Coding, to be precise; here is the commit).
⁴ Ok, I already know what to expect here: yes, for some applications, there might be more appropriate floating point representations.
⁵ Strictly speaking, XPT always uses the 8-byte IBM double layout and just truncates it to fewer bytes. A truncated double happens to be bit-for-bit identical to IBM single precision (both share the 1-bit sign and 7-bit exponent, and the double simply carries more fraction bits), which is exactly why right-padding the missing low bytes with zeros is lossless.
⁶ Numeric values aren't always 4 or 8 bytes: a variable can be any length from 2 to 8 bytes, set per variable by the nlng field in the NAMESTR struct you'll meet below. Shorter numerics keep only the high-order bytes, so padding back out to 8 recovers the full double.