Thursday, December 31, 2015

An Aside on Endianness

When I first created this blog, I chose the title as sort of a whimsical phrase that reflected my life as a software developer, even though I had no plans to discuss software in this blog. I did not anticipate, though I certainly should have, that this choice would cause people searching for information on big- and little-endianness to find this blog, where I am sure they have not found what they are looking for. As a bit of penance, this post is for those people who find their way here by searching for information on endianness. Update: I am post-dating this to move it to the top of the first page. Originally posted 1/27/2007.

Computers store numbers as a sequence of binary digits, or "bits". Bits, in a fit of cleverness, are organized in groups called "bytes". Modern computers, to my knowledge, all use eight-bit bytes, though I am fairly certain that some older computers used other sizes. Occasionally, you will also hear of half-byte entities (four bits), naturally termed "nibbles".

An eight-bit byte can represent 256 values (28). This can be either an unsigned number from 0 to 255 or a signed number from -128 to 127, at least using the most common number representation, "two's complement". You can read more about the various representations of signed numbers at Wikipedia. This clearly does not handle large enough numbers for many problems, so groups of bytes are used to represent still larger numbers. The maximum number of bytes that can be grouped together and handled as a unit by a CPU varies by the type of processor, and may in fact vary in different functional units in the same computer. For instance, many modern processors have functional units devoted to floating point operations that process data in larger chunks than the integer-based functional units. (See Computer numbering formats for the differences between integer, fixed-point and floating-point numbers.)

For purposes of this post, I will discuss 32-bit (4 byte) integers, though the same concepts will clearly apply to both smaller and larger representations. I will also examine only unsigned integers, since that makes it easier to explain the differences between big and little endian and other byte-orderings.

When you write a decimal number like 42357, that is shorthand for 4 * 104 + 2 * 103 + 3 * 102 + 5 * 101 + 7 * 100. That is, each numeral is multiplied by a power of ten according to its position in the number. Likewise, a binary number like 10110 is shorthand for 1 * 24 + 1 * 03 + 1 * 22 + 1 * 21 + 0 * 20. We write decimal numbers and binary numbers with the highest non-zero coefficient on the left and the lowest on the right. A 32-bit binary number would have the co-efficient of 231 first, on the left, and the co-efficient of 20 on the right. If we break that 32-bit binary number into bytes, the first byte has the co-efficients of 231 through 224, the next has 223 through 216, then 215 through 28 and finally 27 through 20. Let's give these four bytes (groups of co-efficients) the names A, B, C and D. "A" is also called the MSB (most significant byte) and D is the LSB (least significant byte).

A computer that represents 4-byte integers in ABCD order is said to be "big-endian", because the "big" end of the number comes first: ABCD. But this is not the only possible order. We could represent 4-byte integers as BDAC or CBDA, but that would be confusing. But storing them in reverse order, DCBA, while perhaps a bit unusual compared to our usual notation when writing, is quite feasible and in fact has some advantages (and some disadvantages) when representing numbers in a digital device. This reverse order with the byte containing the coefficients of the least significant bits first is called "little-endian". And while the unusual orders that I suggested earlier (BDAC and CBDA) are not actually used, there are (or have been) computers that used BADC, known as "middle-endian".

The terms "big-endian" and "little-endian" are actually taken from "Gulliver's Travels" by Jonathan Swift, wherein they were used to describe two factions that differed over which end of a soft-boiled egg ought to be cracked.

Intel and other x86 processors are little-endian. Some processors (for instance, the MIPS R2000) have been able to process integers with either byte order. The Motorola 68000 processors and their descendants were big-endian. The PDP-11 was middle-endian.

When computers communicate, the representation of multi-byte integers in the data must be specified. Computers that use a different representation from that used in the communicated data must re-arrange the bytes before processing. For instance, the suite of protocols under the umbrella of TCP/IP that are used to communicate across the internet define a "network byte order", which is big-endian.

You can read more about endianness at Wikipedia.

As one additional contribution to atone for my choice of blog title, here is a C program to print the byte-order of the computer on which it is run.

#include <stdio.h>

void main(void)
  unsigned value = 0;
  char     ch    = 'A';
  int      i;

  for (i = 0; i < sizeof(unsigned); i++)
    value = (value << 8) + ch;

  char order[sizeof(unsigned) + 1];
  char *p = (char *) &value;

  for (i = 0; i < sizeof(unsigned); i++)
    order[i] = *p++;

  order[sizeof(unsigned)] = '\0';

  printf("Byte order: %s\n", order);

This should work regardless of how many bytes are in an integer, though I do not actually have a C compiler handy to test it. Again, for 32-bit integers, "ABCD" is big-endian, "DCBA" is little-endian and "BADC" is middle-endian.

Update: If you are willing to assume that little-endian and big-endian are the only two possibilities, this function can be used to determine the endianness of the processor.

int is_little_endian(void)
  unsigned value = 1;

  return (*(char *)&value == 0x01);