Squashfs Binary Format (WIP)

The superblock

The superblock is the first section of a squashfs archive, and contains important information about the archive, including the locations of other sections of the archive.

Name Type Description
magic u32 Must match the value of 0x73717368 to be considered a squashfs archive
inode_count u32 The number of inodes stored in the inode table
modification_time u32 The number of seconds (not counting leap seconds) since 00:00, Jan 1 1970 UTC when the archive was created (or last appended to). This is unsigned, so it expires in the year 2106 (as opposed to 2038).
block_size u32 The size of a data block in bytes. Must be a power of two between 4096 and 1048576 (1 MiB)
fragment_entry_count u32 The number of entries in the fragment table
compression_id u16 The ID of the compressor in use:
1 - GZIP
2 - LZMA
3 - LZO
4 - XZ
5 - LZ4
6 - ZSTD
block_log u16 The log2 of block_size. If block_size and block_log do not agree, the archive is considered corrupt
flags u16 (Flags) See Superblock Flags
id_count u16 The number of entries in the id lookup table
version_major u16 The major version of the squashfs file format. Should always equal 4
version_minor u16 The minor version of the squashfs file format. Should always equal 0
root_inode_ref u64 (InodeRef) A reference to the inode of the root directory of the archive
bytes_used u64 The number of bytes used by the archive. Because squashfs archives are often padded to 4KiB, this can often be less than the file size
id_table_start u64 The byte offset at which the id table starts
xattr_id_table_start u64 The byte offset at which the xattr id table starts
inode_table_start u64 The byte offset at which the inode table starts
directory_table_start u64 The byte offset at which the directory table starts
fragment_table_start u64 The byte offset at which the fragment table starts
export_table_start u64 The byte offset at which the export table starts
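
As an illustration of the table above, here is a minimal Python sketch that unpacks the 96 bytes of the superblock and applies the magic and block_size/block_log checks. The names SUPERBLOCK, Superblock and parse_superblock are hypothetical, and little-endian byte order is an assumption, since this section does not state the byte order.

```python
import struct
from collections import namedtuple

# Field order follows the superblock table; "<" (little-endian, no padding)
# is an assumption that the text above does not spell out.
SUPERBLOCK = struct.Struct("<5I6H8Q")

Superblock = namedtuple("Superblock", [
    "magic", "inode_count", "modification_time", "block_size",
    "fragment_entry_count", "compression_id", "block_log", "flags",
    "id_count", "version_major", "version_minor", "root_inode_ref",
    "bytes_used", "id_table_start", "xattr_id_table_start",
    "inode_table_start", "directory_table_start", "fragment_table_start",
    "export_table_start",
])


def parse_superblock(data):
    """Unpack the superblock from the start of `data` and sanity-check it."""
    sb = Superblock._make(SUPERBLOCK.unpack_from(data, 0))
    if sb.magic != 0x73717368:
        raise ValueError("not a squashfs archive (bad magic)")
    if not 4096 <= sb.block_size <= 1048576:
        raise ValueError("block_size out of range")
    if sb.block_size != 1 << sb.block_log:
        raise ValueError("block_size and block_log do not agree")
    return sb
```

A caller would typically read the first SUPERBLOCK.size bytes of the archive and pass them to parse_superblock before touching any other section.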

Superblock Flags

Name Value Description
UNCOMPRESSED_INODES 0x0001 Inodes are stored uncompressed. For backward compatibility reasons, UID/GIDs are also stored uncompressed.
UNCOMPRESSED_DATA 0x0002 Data are stored uncompressed
CHECK 0x0004 Unused in squashfs 4+. Should always be unset
UNCOMPRESSED_FRAGMENTS 0x0008 Fragments are stored uncompressed
NO_FRAGMENTS 0x0010 Fragments are not used. Files smaller than the block size are stored in a full block.
ALWAYS_FRAGMENTS 0x0020 If the last block of a file is smaller than the block size, it will be instead stored as a fragment
DUPLICATES 0x0040 Identical files are recognized, and stored only once
EXPORTABLE 0x0080 Filesystem has support for export via NFS (The export table is populated)
UNCOMPRESSED_XATTRS 0x0100 Xattrs are stored uncompressed
NO_XATTRS 0x0200 Xattrs are not stored
COMPRESSOR_OPTIONS 0x0400 The compression options section is present
UNCOMPRESSED_IDS 0x0800 UID/GIDs are stored uncompressed. The UNCOMPRESSED_INODES flag also has this effect, so if that flag is set, this flag has no additional effect. This flag is currently only available on master in git; no released version of squashfs supports it yet.
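
The flag values above map naturally onto a bit-flag enumeration. The following Python sketch shows one possible way to decode the flags field; the class name SuperblockFlags and the helper ids_stored_uncompressed are hypothetical names chosen for illustration.

```python
from enum import IntFlag


class SuperblockFlags(IntFlag):
    UNCOMPRESSED_INODES = 0x0001
    UNCOMPRESSED_DATA = 0x0002
    CHECK = 0x0004
    UNCOMPRESSED_FRAGMENTS = 0x0008
    NO_FRAGMENTS = 0x0010
    ALWAYS_FRAGMENTS = 0x0020
    DUPLICATES = 0x0040
    EXPORTABLE = 0x0080
    UNCOMPRESSED_XATTRS = 0x0100
    NO_XATTRS = 0x0200
    COMPRESSOR_OPTIONS = 0x0400
    UNCOMPRESSED_IDS = 0x0800


def ids_stored_uncompressed(flags):
    """UID/GIDs are uncompressed if either of the two relevant flags is set."""
    mask = SuperblockFlags.UNCOMPRESSED_INODES | SuperblockFlags.UNCOMPRESSED_IDS
    return bool(SuperblockFlags(flags) & mask)
```

For example, a flags value of 0x00C0 decodes to DUPLICATES and EXPORTABLE.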

Metadata Blocks

Metadata is compressed in blocks of 8KiB. A metadata block is prefixed by a u16 header. The highest bit of the header is set if the block is stored uncompressed (this will happen if the block grew when compressed, or e.g. the UNCOMPRESSED_INODES superblock flag is set). The lower 15 bits specify the size of the metadata block (not including the header) on disk.

To read a metadata block, read a u16 header. If the highest bit is set ((header & 0x8000) != 0), the following data is uncompressed. Mask out the highest bit to get the size of the block data on disk (this should always be <= 8KiB). Read that many bytes. If the data is compressed, uncompress the data. In pseudocode:

header = read_u16(offset=offset)
data_size = header & 0x7FFF
compressed = !(header & 0x8000)
data = read(offset=offset+2, len=data_size)
if(compressed) {
    data = uncompress(data)
}
return data

Neither the size on disk nor the uncompressed size should exceed 8KiB. The uncompressed size should always be equal to 8KiB, with the exception of the last metadata block of a section, which may have an uncompressed size less than 8KiB.
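
A runnable counterpart to the pseudocode above might look like the following Python sketch. It assumes a little-endian header and uses zlib as the default decompressor, which only fits archives whose compression_id is GZIP; the function name read_metadata_block and its decompress parameter are hypothetical.

```python
import struct
import zlib

METADATA_MAX = 8192  # 8KiB limit for both on-disk and uncompressed size


def read_metadata_block(image, offset, decompress=zlib.decompress):
    """Return (uncompressed data, offset of the next metadata block)."""
    (header,) = struct.unpack_from("<H", image, offset)
    size = header & 0x7FFF                      # lower 15 bits: size on disk
    stored_uncompressed = bool(header & 0x8000)  # highest bit: no compression
    if size > METADATA_MAX:
        raise ValueError("metadata block larger than 8KiB on disk")
    payload = image[offset + 2:offset + 2 + size]
    if len(payload) != size:
        raise ValueError("truncated metadata block")
    data = payload if stored_uncompressed else decompress(payload)
    if len(data) > METADATA_MAX:
        raise ValueError("metadata block larger than 8KiB uncompressed")
    return data, offset + 2 + size
```

Returning the offset of the next block makes it easy to walk a run of consecutive metadata blocks, which is how the inode and directory tables are laid out.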

Compression Options

If the COMPRESSOR_OPTIONS flag is set, this section will be present immediately after the superblock, otherwise this section will not be present. If this section is present, it consists of a single metadata block, which is always uncompressed. The data is interpreted differently based on the compressor (compression_id).

For LZ4, the compressor options always have to be present.

LZMA

LZMA does not support any compression options

GZIP

Name Type Description
compression_level i32 Should be in range 1…9 (inclusive). Defaults to 9.
window_size i16 Should be in range 8…15 (inclusive) Defaults to 15.
strategies i16 A bitfield describing the enabled strategies. If no flags are set, the default strategy is implicitly used. Flags:
Default 0x01
Filtered 0x02
Huffman Only 0x04
Run Length Encoded 0x08
Fixed 0x10

XZ

Name Type Description
dictionary_size i32 Should be > 8KiB, and must be either a power of two, or the sum of two sequential powers of two (2^n or 2^n + 2^(n+1))
executable_filters i32 A bitfield describing the additional enabled filters attempted to better compress executable code. Flags:
x86 0x01
powerpc 0x02
ia64 0x04
arm 0x08
armthumb 0x10
sparc 0x20
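
The shape constraint on dictionary_size can be checked with a little bit arithmetic, since 2^n + 2^(n+1) equals 3 * 2^n. The Python sketch below tests only that shape and leaves the lower size bound to the caller; the function name is hypothetical.

```python
def has_valid_xz_dictionary_shape(size):
    """True if size is 2^n or 2^n + 2^(n+1) (that is, 3 * 2^n)."""
    if size <= 0:
        return False
    lowest_bit = size & -size                     # isolate the lowest set bit
    odd_part = size >> (lowest_bit.bit_length() - 1)  # strip trailing zeros
    return odd_part in (1, 3)
```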

LZ4

Name Type Description
version i32 The only supported value is 1 (LZ4_LEGACY)
flags i32 A bitfield describing the enabled LZ4 flags. There is currently only one possible flag:
Use LZ4 High Compression(HC) mode 0x01

ZSTD

Name Type Description
compression_level i32 Should be in range 1..22 (inclusive). The real maximum is the zstd defined ZSTD_maxCLevel()

LZO

Name Type Description
algorithm i32 Which variant of LZO to use (default is lzo1x_999):
lzo1x_1 0
lzo1x_1_11 1
lzo1x_1_12 2
lzo1x_1_15 3
lzo1x_999 4
level i32 Compression level. For lzo1x_999, this can be a value between 0 and 9 (defaults to 8). Has to be 0 for all other algorithms.

Datablocks and Fragments

Datablocks and fragments hold the contents of the files in this archive. A single file's data is stored in a number of data blocks, which are stored sequentially in this section. Because blocks are stored sequentially, the inode for a file only needs to store the position of the first block, and the compressed sizes of each block. Every data block, except possibly the last block of a file, uncompresses to exactly block_size bytes.

If the size of a file is not equally divisible by block_size, the final chunk can either be stored in a short block that does not uncompress to full size, or it can be stored in a fragment, if fragments are enabled (NO_FRAGMENTS is not set).

Fragments of multiple files are combined into data blocks of at most size block_size, and compressed as a single block (unless compression fails to shrink the fragment block).

Datablocks do not have headers. Information about the size and position of datablocks is stored in the inode of the file to which the datablocks belong. Information about the size and position of fragment blocks are stored in the Fragment Table, and the size and offset of fragments within the blocks are stored in the inode of the file to which the fragment belongs.

In both the fragment table, and file inodes, the size of a data block is represented by a u32. If the 1 << 24 bit is set, the data block is stored uncompressed. The size of the block on disk is described by this u32 when this bit is masked out, though the value should always be less than the max block size (1MiB).

Sparse files are handled at block_size granularity. If an entire block is found to be full of zero bytes, the block isn't written to disk. Instead a size of zero is stored in the inode.
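
The u32 size encoding described above can be decoded as in this Python sketch. The names BlockSize and decode_block_size are hypothetical, and treating any masked size of zero as a sparse block is an interpretation of the sparse-file paragraph.

```python
from collections import namedtuple

UNCOMPRESSED_BIT = 1 << 24

BlockSize = namedtuple("BlockSize", "on_disk_size stored_uncompressed sparse")


def decode_block_size(raw):
    """Split a u32 block size into its size, compression bit and sparse flag."""
    size = raw & ~UNCOMPRESSED_BIT & 0xFFFFFFFF
    return BlockSize(size, bool(raw & UNCOMPRESSED_BIT), size == 0)
```

A reader would substitute block_size zero bytes for a sparse block instead of reading anything from disk.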

Inode Table

The inode table starts at inode_table_start and ends at directory_table_start. In this range are stored enough metadata blocks to contain all inodes. All metadata blocks in the table (except for the last block) should have an uncompressed size of 8KiB.

Inodes are packed into metadata blocks. Inodes are not aligned to block boundaries, and can therefore span the boundary between metadata blocks. To maximise compression there are different inodes for each item type (regular file, directory, device, etc.), the inode contents and length varying with the type.

To further maximise compression, inodes come in two flavors: simple inode types optimised for frequently occurring items, and extended inode types where extra information has to be stored.

If the UNCOMPRESSED_INODES flag is set, all metadata blocks should be stored uncompressed. If the flag is not set, metadata blocks will be stored compressed if compression decreases the size of the block as described in the metadata block section.

Inode References

Entries in the Inode table are referenced (for example, in directory entries) with u64 values. The upper 16 bits are unused. The next 32 bits are the position of the first byte of the metadata block, relative to the start of the inode table. The lower 16 bits describe the (uncompressed) offset within the metadata block where the inode begins (remember that an inode may straddle the border between two metadata blocks). For example, an inode reference with the value 0x0000_000001FF_01A0 will be located in the metadata block that starts at byte inode_table_start+0x000001FF. After decompressing the block, the inode itself then starts at the byte at index 0x01A0 (the 417th byte) of the uncompressed data.
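
The bit layout of an inode reference can be unpacked with two shifts and masks. This Python sketch uses the hypothetical name split_inode_ref and reproduces the worked example from the paragraph above.

```python
def split_inode_ref(ref):
    """Return (metadata block position relative to the inode table, offset)."""
    block_start = (ref >> 16) & 0xFFFFFFFF  # middle 32 bits
    offset = ref & 0xFFFF                   # lower 16 bits
    return block_start, offset


block_start, offset = split_inode_ref(0x0000_000001FF_01A0)
```

Adding inode_table_start to block_start gives the absolute position of the metadata block to read.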

Inode Header

All Inodes share a common header, which contains some common information, as well as describing the type of Inode which follows. This header has the following structure:

Name Type Description
inode_type u16 (InodeType) The type of item described by the inode which follows this header.
permissions u16 (Permissions) A bitmask representing the permissions for the item described by the inode. The values match with the permission values of mode_t (the mode bits, not the file type)
uid_idx u16 The index of the user id in the UID/GID Table
gid_idx u16 The index of the group id in the UID/GID Table
modified_time u32 The unsigned number of seconds (not counting leap seconds) since 00:00, Jan 1 1970 UTC when the item described by the inode was last modified
inode_number u32 The position of this inode in the full list of inodes. The value should be in the range [1, inode_count] (from the superblock). This can be treated as a unique identifier for this inode, and can be used as a key to recreate hard links: when processing the archive, remember the visited values of inode_number. If an inode number has already been visited, this inode is a hard link to the one seen earlier.
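
The common header is 16 bytes long and can be unpacked in one step. In this Python sketch, INODE_HEADER, InodeHeader and parse_inode_header are hypothetical names, and little-endian byte order is assumed.

```python
import struct
from collections import namedtuple

# inode_type, permissions, uid_idx, gid_idx (u16 each),
# then modified_time and inode_number (u32 each); little-endian is assumed.
INODE_HEADER = struct.Struct("<4H2I")

InodeHeader = namedtuple(
    "InodeHeader",
    "inode_type permissions uid_idx gid_idx modified_time inode_number",
)


def parse_inode_header(data, offset=0):
    """Unpack the common inode header found at `offset` in uncompressed data."""
    return InodeHeader._make(INODE_HEADER.unpack_from(data, offset))
```

The type-specific fields described in the following sections start at offset + INODE_HEADER.size.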

Inode Types

There are seven types of inodes, and each type comes in two variants: a basic variant, which is smaller and contains only the most-used properties, and an extended variant, which has more properties and is used when those less-used properties are required (e.g. xattrs). Note that some inode types are variable-sized (symlink targets, sizes of file data blocks, etc).

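The value of inode_type (in the inode header) can have any of the following values:

Value Name
1 Basic Directory
2 Basic File
3 Basic Symlink
4 Basic Block Device
5 Basic Char Device
6 Basic Fifo
7 Basic Socket
8 Extended Directory
9 Extended File
10 Extended Symlink
11 Extended Block Device
12 Extended Char Device
13 Extended Fifo
14 Extended Socket
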
Basic Directory
Name Type Description
dir_block_start u32 The location of the block in the Directory Table where the directory entry information starts
hard_link_count u32 The number of hard links to this directory. Note that for historical reasons, the hard link count of a directory includes the number of entries in the directory and is initialized to 2 for an empty directory. I.e. a directory with N entries has at least N + 2 link count.
file_size u16 Total (uncompressed) size in bytes of the entries in the Directory Table, including headers plus 3. The extra 3 bytes are for a virtual "." and ".." item in each directory which is not written, but can be considered to be part of the logical size of the directory.
block_offset u16 The (uncompressed) offset within the block in the Directory Table where the directory entry information starts
parent_inode_number u32 The inode_number of the parent of this directory. If this is the root directory, this will be 1
Extended Directory
Name Type Description
hard_link_count u32 The number of hard links to this directory. Note that for historical reasons, the hard link count of a directory includes the number of entries in the directory and is initialized to 2 for an empty directory. I.e. a directory with N entries has at least N + 2 link count.
file_size u32 Total (uncompressed) size in bytes of the entries in the Directory Table, including headers plus 3. The extra 3 bytes are for a virtual "." and ".." item in each directory which is not written, but can be considered to be part of the logical size of the directory.
dir_block_start u32 The location of the block in the Directory Table where the directory entry information starts
parent_inode_number u32 The inode_number of the parent of this directory. If this is the root directory, this will be 1
index_count u16 The number of directory index entries following the inode structure
block_offset u16 The (uncompressed) offset within the block in the Directory Table where the directory entry information starts
xattr_idx u32 An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes
index dir_index_t[index_count] A list of directory index entries for faster lookup in the directory table
Basic File
Name Type Description
blocks_start u32 The offset from the start of the archive where the data blocks are stored
fragment_block_index u32 The index of a fragment entry in the fragment table which describes the data block the fragment of this file is stored in. If this file does not end with a fragment, this should be 0xFFFFFFFF
block_offset u32 The (uncompressed) offset within the fragment data block where the fragment for this file starts. Information about the fragment can be found at fragment_block_index. The size of the fragment can be found as file_size % superblock.block_size. If this file does not end with a fragment, the value of this field is undefined (probably zero)
file_size u32 The (uncompressed) size of this file
block_sizes u32[] A list of block sizes. If this file ends in a fragment, the size of this list is the number of full data blocks needed to store file_size bytes. If this file does not have a fragment, the size of the list is the number of blocks needed to store file_size bytes, rounded up. Each item in the list describes the (possibly compressed) size of a block. See datablocks & fragments for information about how to interpret this size.
Extended File
Name Type Description
blocks_start u64 The offset from the start of the archive where the data blocks are stored
file_size u64 The (uncompressed) size of this file
sparse u64 The number of bytes saved by omitting blocks of zero bytes. Used in the kernel for sparse file accounting
hard_link_count u32 The number of hard links to this node
fragment_block_index u32 The index of a fragment entry in the fragment table which describes the data block the fragment of this file is stored in. If this file does not end with a fragment, this should be 0xFFFFFFFF
block_offset u32 The (uncompressed) offset within the fragment data block where the fragment for this file starts. Information about the fragment can be found at fragment_block_index. If this file does not end with a fragment, the value of this field is undefined (probably zero)
xattr_idx u32 An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes
block_sizes u32[] A list of block sizes. If this file ends in a fragment, the size of this list is the number of full data blocks needed to store file_size bytes. If this file does not have a fragment, the size of the list is the number of blocks needed to store file_size bytes, rounded up. Each item in the list describes the (possibly compressed) size of a block. See datablocks & fragments for information about how to interpret this size.
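
The length of the block_sizes list follows from file_size, the superblock's block_size, and whether the file ends in a fragment. A small Python sketch of that rule, with hypothetical function names:

```python
NO_FRAGMENT = 0xFFFFFFFF


def block_list_length(file_size, block_size, fragment_block_index):
    """Number of entries in block_sizes for a basic or extended file inode."""
    if fragment_block_index != NO_FRAGMENT:
        # The tail lives in a fragment, so only full blocks are listed.
        return file_size // block_size
    # No fragment: round up so the short final block is listed too.
    return -(-file_size // block_size)


def fragment_size(file_size, block_size):
    """Size of the tail stored in a fragment block."""
    return file_size % block_size
```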
Basic Symlink
Name Type Description
hard_link_count u32 The number of hard links to this symlink
target_size u32 The size in bytes of the target_path this symlink points to
target_path u8[target_size] The target path this symlink points to. This path is target_size bytes long. There is no trailing null byte
Extended Symlink
Name Type Description
hard_link_count u32 The number of hard links to this symlink
target_size u32 The size in bytes of the target_path this symlink points to
target_path u8[target_size] The target path this symlink points to. This path is target_size bytes long. There is no trailing null byte
xattr_idx u32 An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes
Basic Device (Block/Char)
Name Type Description
hard_link_count u32 The number of hard links to this device
device u32 The encoded device number. To extract the major device number, use (device & 0xfff00) >> 8. To extract the minor device number, use (device & 0xff) | ((device >> 12) & 0xfff00)
Extended Device (Block/Char)
Name Type Description
hard_link_count u32 The number of hard links to this device
device u32 The encoded device number. To extract the major device number, use (device & 0xfff00) >> 8. To extract the minor device number, use (device & 0xff) | ((device >> 12) & 0xfff00)
xattr_idx u32 An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes
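
The two expressions given for the device field of the device inodes translate directly into code. In this Python sketch the function name split_device is hypothetical.

```python
def split_device(device):
    """Return (major, minor) from the encoded u32 device field."""
    major = (device & 0xFFF00) >> 8
    minor = (device & 0xFF) | ((device >> 12) & 0xFFF00)
    return major, minor
```

The minor number is split across the low 8 bits and the bits above the major number, which is why it needs two masked pieces.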
Basic IPC (Fifo/Socket)
Name Type Description
hard_link_count u32 The number of hard links to this ipc item
Extended IPC (Fifo/Socket)
Name Type Description
hard_link_count u32 The number of hard links to this ipc item
xattr_idx u32 An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes

Directory Table

For each directory inode, the directory table stores a list of all entries stored inside, with references back to the inodes that describe those entries.

The directory inodes store the total, uncompressed size of the entire listing, including headers. Using this size, a SquashFS reader can determine if another header with further entries should be following once it reaches the end of a run.

The entry list itself is sorted ASCIIbetically by entry name. To save space, a delta encoding is used to store the inode number, i.e. the list is preceded by a header with a reference inode number and all entries store the difference to that. Furthermore, the header also includes the location of the metadata block that contains the inodes of all of the following entries. The entries just store an offset into the uncompressed metadata block.

Directory Header

Name Type Description
count u32 One less than the number of entries following the header
start u32 The starting byte offset of the block in the Inode Table where the inodes are stored
inode number u32 An arbitrary inode number. The entries that follow store their inode number as a difference to this. Typically the inode numbers are allocated in a continuous sequence for all children of a directory and the header simply stores the first one. Hard links of course break the sequence and require a new header if they are further away than +/- 32k of this number. Inode number allocation and picking of the reference could of course be optimized to prevent this.

Every time the inode block changes or the difference of the inode number can no longer be encoded in 16 bits, a new header is emitted.

A header must have at least one entry. A header must not be followed by more than 256 entries. If there are more entries, a new header is emitted.

The file names are stored without trailing null bytes. Since a zero length name makes no sense, the name length is stored off-by-one, i.e. the value 0 cannot be encoded.

Directory Entry

Name Type Description
offset u16 An offset into the uncompressed inode metadata block
inode offset i16 The difference of this inode's number to the reference stored in the header
type u16 The inode type. For extended inodes, the corresponding basic type is stored here instead
name_size u16 One less than the size of the entry name
name u8[name_size + 1] The file name of the entry without a trailing null byte

The basic and extended inode types both have a size field that stores the uncompressed size of all the directory entries (including all headers) belonging to the inode. This field is used to deduce if more data is following while iterating over directory entries, even without knowing how many headers and partial lists there will be.
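
Putting the header and entry layouts together, a directory listing can be walked as in the Python sketch below. It assumes that data already holds the uncompressed directory bytes starting at the directory's block_offset, that file_size is the value from the directory inode (so file_size - 3 bytes are actually stored), and that fields are little-endian. The names parse_directory, HEADER and ENTRY are hypothetical.

```python
import struct

HEADER = struct.Struct("<III")  # count, start, inode number
ENTRY = struct.Struct("<HhHH")  # offset, inode offset (signed), type, name_size


def parse_directory(data, file_size):
    """Return a list of (name, inode_number, type, inode_block_start, offset)."""
    entries = []
    pos = 0
    end = file_size - 3  # the 3 extra bytes for "." and ".." are not stored
    while pos < end:
        count, start, base_inode = HEADER.unpack_from(data, pos)
        pos += HEADER.size
        for _ in range(count + 1):  # count is stored off-by-one
            offset, delta, inode_type, name_size = ENTRY.unpack_from(data, pos)
            pos += ENTRY.size
            name = data[pos:pos + name_size + 1]  # name_size is off-by-one too
            pos += name_size + 1
            entries.append((name, base_inode + delta, inode_type, start, offset))
    return entries
```

Each returned tuple carries enough information to build an inode reference: the start of the inode metadata block from the header and the offset from the entry.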

Directory Index

To speed up lookups on directories with lots of entries, the extended directory inode can store an index table, holding the locations of all directory headers and the name of the first entry after the header.

To allow for fast lookups, a new directory header should be emitted every time the entry list crosses a metadata block boundary.

Name Type Description
index u32 This stores a byte offset from the first directory header to the current header, as if the uncompressed directory metadata blocks were laid out in memory consecutively.
start u32 Start offset of a directory table metadata block
name_size u32 One less than the size of the entry name
name u8[name_size + 1] The name of the first entry following the header without a trailing null byte

Fragment Table

Fragments are combined into fragment blocks of at most block_size bytes long. This table describes the location and size of these fragment blocks, not the fragments within them.

This table is stored in two levels: The fragment block entries are stored in metadata blocks, and the file offsets to these metadata blocks are stored at the offset specified by the fragment_table_start field of the superblock.

Each metadata block can store 512 fragment block entries (16 bytes per fragment block entry), so there will be ceil(fragment_entry_count / 512.0) metadata blocks (and the same number of u64 offsets stored at fragment_table_start)

To read the list of fragment block entries, read ceil(fragment_entry_count / 512.0) u64 offsets starting at fragment_table_start, then read the metadata blocks at the offsets read, interpreting the data of the metadata blocks as a packed array of fragment block entries.

Fragment Block Entry

Name Type Description
start u64 The offset within the archive where the fragment block starts
size u32 This stores two pieces of information. If the block is uncompressed, the 0x1000000 (1<<24) bit will be set. The remaining bits describe the size of the fragment block on disk. Because the max value of block_size is 1 MiB (1<<20), and the size of a fragment block should be less than block_size, the uncompressed bit will never be set by the size.
_unused u32 This field is unused
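
As a sketch of how the fragment table might be decoded, the Python below computes the number of index offsets and parses fragment block entries from the concatenated, already uncompressed metadata payload. The function names are hypothetical and little-endian byte order is assumed.

```python
import struct

FRAGMENT_ENTRY = struct.Struct("<QII")  # start, size, _unused (16 bytes)
ENTRIES_PER_BLOCK = 8192 // FRAGMENT_ENTRY.size  # 512
UNCOMPRESSED_BIT = 1 << 24


def fragment_index_length(fragment_entry_count):
    """Number of u64 offsets stored at fragment_table_start."""
    return -(-fragment_entry_count // ENTRIES_PER_BLOCK)


def parse_fragment_entries(payload, count):
    """Return a list of (start, on_disk_size, stored_uncompressed)."""
    entries = []
    for i in range(count):
        start, size, _unused = FRAGMENT_ENTRY.unpack_from(
            payload, i * FRAGMENT_ENTRY.size)
        on_disk = size & ~UNCOMPRESSED_BIT & 0xFFFFFFFF
        entries.append((start, on_disk, bool(size & UNCOMPRESSED_BIT)))
    return entries
```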

Export Table

To support NFS exports, squashfs needs a fast way to resolve an inode number to an inode structure.

For this purpose, a squashfs archive can optionally contain an export table, which is basically a flat array of 64 bit inode references with the inode number being used as an index into the array.

Because the inode number 0 is not used (as it is reserved as a sentinel value in Linux and other UNIX-like OS kernels), the array actually starts at inode number 1 and the index is thus inode_number - 1.

The array itself is stored in a series of metadata blocks. Since each block can store 1024 references (8 bytes per reference), there will be ceil(inode_count / 1024.0) metadata blocks for the entire array.

To locate the metadata blocks, a secondary list is used, containing the absolute, on-disk locations of the blocks. This list is stored uncompressed and starts at export_table_start.
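
Locating the reference for a given inode number is then simple arithmetic. This Python sketch, with the hypothetical name export_slot, returns which entry of the secondary list to follow and the byte offset of the reference inside that uncompressed metadata block.

```python
REFS_PER_BLOCK = 8192 // 8  # 1024 inode references per metadata block


def export_slot(inode_number):
    """Return (index into the block location list, byte offset in the block)."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    index = inode_number - 1  # the array starts at inode number 1
    return index // REFS_PER_BLOCK, (index % REFS_PER_BLOCK) * 8
```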

UID/GID Table

UIDs and GIDs are both stored as u32s. Both are treated simply as IDs: if a file is owned by user 1000 and group 1000, the ID 1000 will be stored only once. UID/GIDs in inodes are stored as u16 indexes into this table (and thus, a maximum of 65535 unique IDs may be stored).

This table is stored in two levels: The IDs are stored in metadata blocks, and the file offsets to these metadata blocks are stored at the offset specified by the id_table_start field of the superblock.

Each metadata block can store 2048 UID/GIDs (4 bytes per ID), so there will be ceil(id_count / 2048.0) metadata blocks used for IDs (and the same number of u64 offsets stored at id_table_start).

To read the list of IDs, read ceil(id_count / 2048.0) u64 offsets starting at id_table_start, then read the metadata blocks at the offsets read.
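
The two-level lookup can be sketched in Python as follows. The sketch works on an in-memory image, assumes little-endian fields, and uses zlib for compressed metadata blocks, which only fits GZIP archives. The names read_id_table and _metadata_payload are hypothetical.

```python
import struct
import zlib

IDS_PER_BLOCK = 8192 // 4  # 2048 IDs per metadata block


def _metadata_payload(image, offset):
    """Read one metadata block and return its uncompressed contents."""
    (header,) = struct.unpack_from("<H", image, offset)
    raw = image[offset + 2:offset + 2 + (header & 0x7FFF)]
    return raw if header & 0x8000 else zlib.decompress(raw)


def read_id_table(image, id_table_start, id_count):
    """Return the list of u32 IDs referenced by uid_idx and gid_idx."""
    block_count = -(-id_count // IDS_PER_BLOCK)
    offsets = struct.unpack_from("<%dQ" % block_count, image, id_table_start)
    payload = b"".join(_metadata_payload(image, off) for off in offsets)
    return list(struct.unpack_from("<%dI" % id_count, payload, 0))
```

The fragment table and export table follow the same pattern with different entry sizes.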

Xattr Table

Extended attributes are arbitrary key value pairs attached to inodes. The key names use dots as separators to create a hierarchy of namespaces.

Squashfs uses multiple levels of indirection to store Xattr key value pairs associated with inodes. To save space, the topmost namespace prefix is removed and encoded as an integer ID instead. This approach limits squashfs xattr support to the following, commonly used namespaces:

0 - user.
1 - trusted.
2 - security.

This means that squashfs can store SELinux labels or capabilities, since those are stored in the security.* namespace, but cannot store ACLs, which are stored in system.posix_acl_access, because it has no way to encode the system. prefix yet.

The key value pairs of all inodes are stored consecutively in metadata blocks. The values can either be stored inline, i.e. an Xattr Key Entry is directly followed by an Xattr Value Entry, or out of line to deduplicate identical values.

If a value is stored out of line, the value entry structure holds, instead of a string, a 64 bit reference that specifies the location of the value string. This is similar to an inode reference, but relative to the first metadata block containing the key value pairs.

Typically, the first occurrence of a value is stored in line and every consecutive use of the same value uses an out of line value to refer back to the first one.

Xattr Key Entry

Name Type Description
type u16 The ID of the key prefix. If the value that follows is stored out of line, the flag 0x0100 is ORed to the type ID
name_size u16 The size of the key name excluding the omitted prefix and the trailing null byte
name u8[name_size] The remainder of the key without the prefix and without trailing null byte

Xattr Value Entry

Name Type Description
value_size u32 The size of the value string. If the value is stored out of line, this is always 8, i.e. the size of an unsigned 64 bit integer
value u8[value_size] The raw value assigned to the key without trailing null byte, or if the value is stored out of line, a reference to the location of the value string
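
One key entry followed by its value entry could be decoded as in the Python sketch below. It assumes that name_size counts only the bytes actually stored (the key without its prefix, as the name_size description states), that the low byte of type holds the prefix ID, and that fields are little-endian. The names PREFIXES and parse_xattr_pair are hypothetical.

```python
import struct

PREFIXES = {0: b"user.", 1: b"trusted.", 2: b"security."}
OUT_OF_LINE = 0x0100


def parse_xattr_pair(data, pos=0):
    """Return (full key, value or u64 reference, out_of_line, next position)."""
    type_, name_size = struct.unpack_from("<HH", data, pos)
    pos += 4
    key = PREFIXES[type_ & 0xFF] + data[pos:pos + name_size]
    pos += name_size
    (value_size,) = struct.unpack_from("<I", data, pos)
    pos += 4
    value = data[pos:pos + value_size]
    pos += value_size
    out_of_line = bool(type_ & OUT_OF_LINE)
    if out_of_line:
        # The 8 value bytes are a reference to where the real value is stored.
        (value,) = struct.unpack("<Q", value)
    return key, value, out_of_line, pos
```

When out_of_line is true, the caller still has to follow the returned reference and read a value entry at that location.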

To actually address a block of key value pairs associated with an inode, a lookup table is used that specifies the start and size of a block of key value pairs.

All an inode needs to store is a 32 bit index into this table. If two inodes have the identical xattrs (e.g. they have the same SELinux labels and no other attributes), the key/value block is only written once, there is only one lookup table entry and both inodes have the same index.

Xattr Lookup Table

Name Type Description
xattr_ref u64 A reference to the start of the key value block. Similar to inode references, the bits 16 to 48 hold the absolute position of the first metadata block and the lower 16 bits hold an offset into the uncompressed block.
count u32 The number of key value pairs
size u32 The exact, uncompressed size in bytes of the entire block of key value pairs, counting only what has been written to disk and including the key/value entry structures

In order to locate both tables on disk, an approach similar to ID and fragment tables is used. The following data structure is stored directly in the archive (i.e. uncompressed and without additional headers).

The xattr_id_table_start in the superblock stores the absolute position of this table.

Xattr ID Table

Name Type Description
xattr_table_start u64 The absolute position of the first metadata block holding the key/value pairs.
xattr_ids u32 The number of entries in the Xattr Lookup Table
_unused u32 This field is unused
table u64[] The absolute locations of each metadata block of the Xattr Lookup Table. Each entry is 16 bytes in size, so a block can hold at most 512 entries. Thus this array has ceil(xattr_ids / 512.0) entries.