Squashfs Binary Format (WIP)
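A squashfs filesystem consists of a maximum of nine parts, packed together on a byte alignment: the superblock, the compression options, the datablocks and fragments, the inode table, the directory table, the fragment table, the export table, the UID/GID table, and the xattr table.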
The superblock
The superblock is the first section of a squashfs archive, and contains important information about the archive, including the locations of other sections of the archive.
Name | Type | Description |
---|---|---|
magic | u32 | Must match the value of 0x73717368 to be considered a squashfs archive |
inode_count | u32 | The number of inodes stored in the inode table |
modification_time | u32 | The number of seconds (not counting leap seconds) since 00:00, Jan 1 1970 UTC when the archive was created (or last appended to). This is unsigned, so it expires in the year 2106 (as opposed to 2038). |
block_size | u32 | The size of a data block in bytes. Must be a power of two between 4096 and 1048576 (1 MiB) |
fragment_entry_count | u32 | The number of entries in the fragment table |
compression_id | u16 | The ID of the compressor used: 1 - GZIP, 2 - LZMA, 3 - LZO, 4 - XZ, 5 - LZ4, 6 - ZSTD |
block_log | u16 | The log2 of block_size. If block_size and block_log do not agree, the archive is considered corrupt |
flags | u16 (Flags) | See Superblock Flags |
id_count | u16 | The number of entries in the id lookup table |
version_major | u16 | The major version of the squashfs file format. Should always equal 4 |
version_minor | u16 | The minor version of the squashfs file format. Should always equal 0 |
root_inode_ref | u64 (InodeRef) | A reference to the inode of the root directory of the archive |
bytes_used | u64 | The number of bytes used by the archive. Because squashfs archives are often padded to 4KiB, this can often be less than the file size |
id_table_start | u64 | The byte offset at which the id table starts |
xattr_id_table_start | u64 | The byte offset at which the xattr id table starts |
inode_table_start | u64 | The byte offset at which the inode table starts |
directory_table_start | u64 | The byte offset at which the directory table starts |
fragment_table_start | u64 | The byte offset at which the fragment table starts |
export_table_start | u64 | The byte offset at which the export table starts |
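As a rough illustration of the layout above, the following Python sketch unpacks the superblock fields in table order. It assumes that all fields are little-endian and packed without padding (neither is stated in the table itself), and the names `parse_superblock`, `Superblock` and `SUPERBLOCK_FORMAT` are invented for this example.

```python
import struct
from collections import namedtuple

# Assumption: little-endian fields in table order with no padding
# (5 x u32, 6 x u16, 8 x u64).
SUPERBLOCK_FORMAT = "<5I6H8Q"
SQUASHFS_MAGIC = 0x73717368

Superblock = namedtuple("Superblock", [
    "magic", "inode_count", "modification_time", "block_size",
    "fragment_entry_count", "compression_id", "block_log", "flags",
    "id_count", "version_major", "version_minor", "root_inode_ref",
    "bytes_used", "id_table_start", "xattr_id_table_start",
    "inode_table_start", "directory_table_start", "fragment_table_start",
    "export_table_start",
])


def parse_superblock(raw):
    """Unpack a superblock and apply the consistency checks from the table."""
    size = struct.calcsize(SUPERBLOCK_FORMAT)
    if len(raw) < size:
        raise ValueError("buffer too short for a superblock")
    sb = Superblock._make(struct.unpack_from(SUPERBLOCK_FORMAT, raw))
    if sb.magic != SQUASHFS_MAGIC:
        raise ValueError("not a squashfs archive (bad magic)")
    if sb.block_size != 1 << sb.block_log:
        raise ValueError("block_size and block_log do not agree")
    return sb
```

A caller would typically read the first `struct.calcsize(SUPERBLOCK_FORMAT)` bytes of the archive and pass them to `parse_superblock`.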
Superblock Flags
Name | Value | Description |
---|---|---|
UNCOMPRESSED_INODES | 0x0001 | Inodes are stored uncompressed. For backward compatibility reasons, UID/GIDs are also stored uncompressed. |
UNCOMPRESSED_DATA | 0x0002 | Data are stored uncompressed |
CHECK | 0x0004 | Unused in squashfs 4+. Should always be unset |
UNCOMPRESSED_FRAGMENTS | 0x0008 | Fragments are stored uncompressed |
NO_FRAGMENTS | 0x0010 | Fragments are not used. Files smaller than the block size are stored in a full block. |
ALWAYS_FRAGMENTS | 0x0020 | If the last block of a file is smaller than the block size, it will be instead stored as a fragment |
DUPLICATES | 0x0040 | Identical files are recognized, and stored only once |
EXPORTABLE | 0x0080 | Filesystem has support for export via NFS (The export table is populated) |
UNCOMPRESSED_XATTRS | 0x0100 | Xattrs are stored uncompressed |
NO_XATTRS | 0x0200 | Xattrs are not stored |
COMPRESSOR_OPTIONS | 0x0400 | The compression options section is present |
UNCOMPRESSED_IDS | 0x0800 | UID/GIDs are stored uncompressed. Note that the UNCOMPRESSED_INODES flag also has this effect. If that flag is set, this flag has no effect. This flag is currently only available on master in git; no released version of squashfs supports it yet. |
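The flag values above map naturally onto a bitfield type. The sketch below uses Python's `enum.IntFlag`; the class name `SuperblockFlags` and the helper `ids_stored_uncompressed` are made up for illustration and simply encode the rule that either UNCOMPRESSED_INODES or UNCOMPRESSED_IDS leaves the IDs uncompressed.

```python
from enum import IntFlag


class SuperblockFlags(IntFlag):
    UNCOMPRESSED_INODES = 0x0001
    UNCOMPRESSED_DATA = 0x0002
    CHECK = 0x0004
    UNCOMPRESSED_FRAGMENTS = 0x0008
    NO_FRAGMENTS = 0x0010
    ALWAYS_FRAGMENTS = 0x0020
    DUPLICATES = 0x0040
    EXPORTABLE = 0x0080
    UNCOMPRESSED_XATTRS = 0x0100
    NO_XATTRS = 0x0200
    COMPRESSOR_OPTIONS = 0x0400
    UNCOMPRESSED_IDS = 0x0800


def ids_stored_uncompressed(raw_flags):
    """UID/GIDs are uncompressed if either of the two relevant flags is set."""
    flags = SuperblockFlags(raw_flags)
    mask = SuperblockFlags.UNCOMPRESSED_INODES | SuperblockFlags.UNCOMPRESSED_IDS
    return bool(flags & mask)
```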
Metadata Blocks
Metadata blocks are compressed in 8KiB blocks. A metadata block is prefixed by a u16 header. The highest bit of the header is set if the block is stored uncompressed (this will happen if the block grew when compressed, or e.g. the UNCOMPRESSED_INODES superblock flag is set). The lower 15 bits specify the size of the metadata block (not including the header) on disk.
To read a metadata block, read a u16. If the highest bit is set ((header & 0x8000) == 0x8000) the following data is uncompressed. Mask out the highest bit to get the size of the block data on disk (this should always be <= 8KiB). Read that many bytes. If the data is compressed, uncompress the data. In pseudocode:
header = read_u16(offset=offset)
data_size = header & 0x7FFF
compressed = !(header & 0x8000)
data = read(offset=offset+2, len=data_size)
if(compressed) {
data = uncompress(data)
}
return data
Neither the size on disk nor the uncompressed size should exceed 8KiB. The uncompressed size should always be equal to 8KiB, with the exception of the last metadata block of a section, which may have an uncompressed size less than 8KiB.
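The pseudocode above could be written in Python roughly as follows. This is only a sketch: it assumes a little-endian header, works on an in-memory buffer, and defaults to `zlib.decompress`, which is only appropriate for GZIP-compressed archives; the function name `read_metadata_block` and its `decompress` parameter are chosen for this example.

```python
import struct
import zlib

METADATA_BLOCK_SIZE = 8192


def read_metadata_block(buf, offset, decompress=zlib.decompress):
    """Return (uncompressed data, offset of the next block header)."""
    (header,) = struct.unpack_from("<H", buf, offset)
    size_on_disk = header & 0x7FFF
    stored_uncompressed = bool(header & 0x8000)
    start = offset + 2
    data = bytes(buf[start:start + size_on_disk])
    if len(data) != size_on_disk:
        raise ValueError("metadata block is truncated")
    if not stored_uncompressed:
        # Assumption: the caller passes the decompressor matching compression_id.
        data = decompress(data)
    if len(data) > METADATA_BLOCK_SIZE:
        raise ValueError("metadata block larger than 8KiB")
    return data, start + size_on_disk
```

Returning the offset of the next header makes it easy to walk consecutive metadata blocks within a table.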
Compression Options
If the COMPRESSOR_OPTIONS flag is set, this section will be present immediately after the superblock, otherwise this section will not be present. If this section is present, it consists of a single metadata block, which is always uncompressed. The data is interpreted differently based on the compressor (compression_id).
For LZ4, the compressor options always have to be present.
LZMA
LZMA does not support any compression options
GZIP
Name | Type | Description |
---|---|---|
compression_level | i32 | Should be in range 1…9 (inclusive). Defaults to 9. |
window_size | i16 | Should be in range 8…15 (inclusive). Defaults to 15. |
strategies | i16 | A bitfield describing the enabled strategies. If no flags are set, the default strategy is implicitly used. |
XZ
Name | Type | Description |
---|---|---|
dictionary_size | i32 | Should be > 8KiB, and must be either a power of two, or the sum of two sequential powers of two (2^n or 2^n + 2^(n+1)) |
executable_filters | i32 | A bitfield describing the additional enabled filters attempted to better compress executable code. |
LZ4
Name | Type | Description |
---|---|---|
version | i32 | The only supported value is 1 (LZ4_LEGACY) |
flags | i32 | A bitfield describing the enabled LZ4 flags. There is currently only one possible flag. |
ZSTD
Name | Type | Description |
---|---|---|
compression_level | i32 | Should be in range 1..22 (inclusive). The real maximum is the zstd defined ZSTD_maxCLevel() |
LZO
Name | Type | Description |
---|---|---|
algorithm | i32 | Which variant of LZO to use (default is lzo1x_999) |
level | i32 | Compression level. For lzo1x_999, this can be a value between 0 and 9 (defaults to 8). Has to be 0 for all other algorithms. |
Datablocks and Fragments
Datablocks and fragments hold the contents of the files in this archive. A single file's data is stored in a number of data blocks, which are stored sequentially in this section. Because blocks are stored sequentially, the inode for a file only needs to store the position of the first block, and the compressed sizes of each block. Every data block uncompresses to exactly block_size bytes, with the possible exception of the last block of a file.
If the size of a file is not evenly divisible by block_size, the final chunk can either be stored in a short block that does not uncompress to full size, or it can be stored in a fragment, if fragments are enabled (NO_FRAGMENTS is not set).
Fragments of multiple files are combined into data blocks of at most size block_size, and compressed as a single block (unless compression fails to shrink the fragment block).
Datablocks do not have headers. Information about the size and position of datablocks is stored in the inode of the file to which the datablocks belong. Information about the size and position of fragment blocks is stored in the Fragment Table, and the size and offset of fragments within the blocks are stored in the inode of the file to which the fragment belongs.
In both the fragment table and file inodes, the size of a data block is represented by a u32. If the 1 << 24 bit is set, the data block is stored uncompressed. The size of the block on disk is described by this u32 when this bit is masked out, though the value should never exceed the max block size (1MiB).
Sparse files are handled at block_size granularity. If an entire block is found to be full of zero bytes, the block isn't written to disk. Instead a size of zero is stored in the inode.
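The u32 size word described above can be decoded with a few bit operations. The following Python sketch is illustrative only; the names `decode_block_size` and `BlockSize` are not part of the format.

```python
from collections import namedtuple

UNCOMPRESSED_BIT = 1 << 24

BlockSize = namedtuple("BlockSize", ["size_on_disk", "uncompressed", "sparse"])


def decode_block_size(word):
    """Split a block size word into its on-disk size and its flags."""
    size_on_disk = word & ~UNCOMPRESSED_BIT & 0xFFFFFFFF
    uncompressed = bool(word & UNCOMPRESSED_BIT)
    # A size of zero marks a block of zero bytes that was never written.
    sparse = size_on_disk == 0
    return BlockSize(size_on_disk, uncompressed, sparse)
```

A reader that meets a sparse entry would produce a block of zero bytes instead of reading from disk.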
Inode Table
The inode table starts at inode_table_start and ends at directory_table_start. In this range are stored enough metadata blocks to contain all inodes. All metadata blocks in the table (except for the last block) should have an uncompressed size of 8KiB.
Inodes are packed into metadata blocks. Inodes are not aligned to block boundaries, and can therefore span the boundary between metadata blocks. To maximise compression there are different inodes for each item type (regular file, directory, device, etc.), the inode contents and length varying with the type.
To further maximise compression, inodes come in two flavors: simple inode types optimised for frequently occurring items, and extended inode types where extra information has to be stored.
If the UNCOMPRESSED_INODES flag is set, all metadata blocks should be stored uncompressed. If the flag is not set, metadata blocks will be stored compressed if compression decreases the size of the block, as described in the metadata block section.
Inode References
Entries in the Inode table are referenced (for example, in directory entries) with u64 values. The upper 16 bits are unused. The next 32 bits are the position of the first byte of the metadata block, relative to the start of the inode table. The lower 16 bits describe the (uncompressed) offset within the metadata block where the inode begins (remember that an inode may straddle the border between two metadata blocks). For example, an inode reference with the value 0x0000_000001FF_01A0 will be located in the metadata block that starts at byte inode_table_start + 0x000001FF. After decompressing the block, the inode itself then starts at the byte at index 0x01A0 (the 417th byte) of the uncompressed data.
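Splitting an inode reference into its two parts can be sketched as below, using the example value from the text. The function name `split_inode_ref` is an invention of this example.

```python
def split_inode_ref(ref):
    """Return (metadata block position relative to the inode table, offset in block)."""
    block_start = (ref >> 16) & 0xFFFFFFFF
    offset = ref & 0xFFFF
    return block_start, offset


# The example reference from the text: block at inode_table_start + 0x1FF,
# inode at index 0x1A0 of the uncompressed block.
block_start, offset = split_inode_ref(0x0000_000001FF_01A0)
```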
Inode Header
All Inodes share a common header, which contains some common information, as well as describing the type of Inode which follows. This header has the following structure:
Name | Type | Description |
---|---|---|
inode_type | u16 (InodeType) | The type of item described by the inode which follows this header. |
permissions | u16 (Permissions) | A bitmask representing the permissions for the item described by the inode. The values match with the permission values of mode_t (the mode bits, not the file type) |
uid_idx | u16 | The index of the user id in the UID/GID Table |
gid_idx | u16 | The index of the group id in the UID/GID Table |
modified_time | u32 | The unsigned number of seconds (not counting leap seconds) since 00:00, Jan 1 1970 UTC when the item described by the inode was last modified |
inode_number | u32 | The position of this inode in the full list of inodes. Value should be in the range [1, inode_count] (from the superblock). This can be treated as a unique identifier for this inode, and can be used as a key to recreate hard links: when processing the archive, remember the visited values of inode_number. If an inode number has already been visited, this inode is hardlinked |
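The common header is 16 bytes long and can be unpacked in one step. The sketch below assumes little-endian fields and that `data` is the uncompressed inode table content; `parse_inode_header` and `InodeHeader` are hypothetical names.

```python
import struct
from collections import namedtuple

# Assumption: little-endian, fields in table order (4 x u16, 2 x u32).
INODE_HEADER_FORMAT = "<4H2I"

InodeHeader = namedtuple("InodeHeader", [
    "inode_type", "permissions", "uid_idx", "gid_idx",
    "modified_time", "inode_number",
])


def parse_inode_header(data, pos=0):
    """Return the header and the position of the type-specific part."""
    header = InodeHeader._make(struct.unpack_from(INODE_HEADER_FORMAT, data, pos))
    return header, pos + struct.calcsize(INODE_HEADER_FORMAT)
```

The type-specific structure selected by `inode_type` begins at the returned position.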
Inode Types
There are seven types of inodes, and each type comes in two variants: a basic variant which is smaller and contains only the most-used properties, and an extended variant which has more properties and is used when those less-used properties are required (e.g. xattrs). Note that some inode types are variable sized (symlink targets, sizes of file data blocks, etc).
The value of inode_type (in the inode header) selects one of the following inode layouts:
Basic Directory
Name | Type | Description |
---|---|---|
dir_block_start | u32 | The location of the block in the Directory Table where the directory entry information starts |
hard_link_count | u32 | The number of hard links to this directory. Note that for historical reasons, the hard link count of a directory includes the number of entries in the directory and is initialized to 2 for an empty directory. I.e. a directory with N entries has a link count of at least N + 2. |
file_size | u16 | Total (uncompressed) size in bytes of the entries in the Directory Table, including headers plus 3. The extra 3 bytes are for a virtual "." and ".." item in each directory which is not written, but can be considered to be part of the logical size of the directory. |
block_offset | u16 | The (uncompressed) offset within the block in the Directory Table where the directory entry information starts |
parent_inode_number | u32 | The inode_number of the parent of this directory. If this is the root directory, this will be 1 |
Extended Directory
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this directory. Note that for historical reasons, the hard link count of a directory includes the number of entries in the directory and is initialized to 2 for an empty directory. I.e. a directory with N entries has a link count of at least N + 2. |
file_size | u32 | Total (uncompressed) size in bytes of the entries in the Directory Table, including headers plus 3. The extra 3 bytes are for a virtual "." and ".." item in each directory which is not written, but can be considered to be part of the logical size of the directory. |
dir_block_start | u32 | The location of the block in the Directory Table where the directory entry information starts |
parent_inode_number | u32 | The inode_number of the parent of this directory. If this is the root directory, this will be 1 |
index_count | u16 | The number of directory index entries following the inode structure |
block_offset | u16 | The (uncompressed) offset within the block in the Directory Table where the directory entry information starts |
xattr_idx | u32 | An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes |
index | dir_index_t[index_count] | A list of directory index entries for faster lookup in the directory table |
Basic File
Name | Type | Description |
---|---|---|
blocks_start | u32 | The offset from the start of the archive where the data blocks are stored |
fragment_block_index | u32 | The index of a fragment entry in the fragment table which describes the data block the fragment of this file is stored in. If this file does not end with a fragment, this should be 0xFFFFFFFF |
block_offset | u32 | The (uncompressed) offset within the fragment data block where the fragment for this file starts. Information about the fragment can be found at fragment_block_index. The size of the fragment can be found as file_size % superblock.block_size. If this file does not end with a fragment, the value of this field is undefined (probably zero) |
file_size | u32 | The (uncompressed) size of this file |
block_sizes | u32[] | A list of block sizes. If this file ends in a fragment, the size of this list is the number of full data blocks needed to store file_size bytes. If this file does not have a fragment, the size of the list is the number of blocks needed to store file_size bytes, rounded up. Each item in the list describes the (possibly compressed) size of a block. See datablocks & fragments for information about how to interpret this size. |
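The length of the block_sizes list is not stored explicitly and has to be derived from the other fields. A minimal Python sketch of that rule follows; the helper names `block_list_length` and `fragment_size` are made up for this example.

```python
NO_FRAGMENT = 0xFFFFFFFF


def block_list_length(file_size, block_size, fragment_block_index):
    """Number of entries in block_sizes for a file inode."""
    if fragment_block_index == NO_FRAGMENT:
        # No fragment: the tail is stored in a short final block (round up).
        return -(-file_size // block_size)
    # With a fragment: only the full blocks are listed.
    return file_size // block_size


def fragment_size(file_size, block_size, fragment_block_index):
    """Size of the tail stored in a fragment, or 0 if there is no fragment."""
    if fragment_block_index == NO_FRAGMENT:
        return 0
    return file_size % block_size
```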
Extended File
Name | Type | Description |
---|---|---|
blocks_start | u64 | The offset from the start of the archive where the data blocks are stored |
file_size | u64 | The (uncompressed) size of this file |
sparse | u64 | The number of bytes saved by omitting blocks of zero bytes. Used in the kernel for sparse file accounting |
hard_link_count | u32 | The number of hard links to this node |
fragment_block_index | u32 | The index of a fragment entry in the fragment table which describes the data block the fragment of this file is stored in. If this file does not end with a fragment, this should be 0xFFFFFFFF |
block_offset | u32 | The (uncompressed) offset within the fragment data block where the fragment for this file starts. Information about the fragment can be found at fragment_block_index. If this file does not end with a fragment, the value of this field is undefined (probably zero) |
xattr_idx | u32 | An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes |
block_sizes | u32[] | A list of block sizes. If this file ends in a fragment, the size of this list is the number of full data blocks needed to store file_size bytes. If this file does not have a fragment, the size of the list is the number of blocks needed to store file_size bytes, rounded up. Each item in the list describes the (possibly compressed) size of a block. See datablocks & fragments for information about how to interpret this size. |
Basic Symlink
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this symlink |
target_size | u32 | The size in bytes of the target_path this symlink points to |
target_path | u8[target_size] | The target path this symlink points to. This path is target_size bytes long. There is no trailing null byte |
Extended Symlink
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this symlink |
target_size | u32 | The size in bytes of the target_path this symlink points to |
target_path | u8[target_size] | The target path this symlink points to. This path is target_size bytes long. There is no trailing null byte |
xattr_idx | u32 | An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes |
Basic Device (Block/Char)
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this device |
device | u32 | To extract the major device number, use (device & 0xfff00) >> 8. To extract the minor device number, use (device & 0xff) | ((device >> 12) & 0xfff00) |
Extended Device (Block/Char)
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this device |
device | u32 | To extract the major device number, use (device & 0xfff00) >> 8. To extract the minor device number, use (device & 0xff) | ((device >> 12) & 0xfff00) |
xattr_idx | u32 | An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes |
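The two expressions for the device field translate directly into code. In this Python sketch the function name `split_device` is invented; the bit operations are the ones given in the table.

```python
def split_device(device):
    """Return (major, minor) for the device field of a device inode."""
    major = (device & 0xFFF00) >> 8
    minor = (device & 0xFF) | ((device >> 12) & 0xFFF00)
    return major, minor
```

Note that the low 8 bits of the minor number sit below the major number, while its remaining bits are stored above it.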
Basic IPC (Fifo/Socket)
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this ipc item |
Extended IPC (Fifo/Socket)
Name | Type | Description |
---|---|---|
hard_link_count | u32 | The number of hard links to this ipc item |
xattr_idx | u32 | An index into the xattr lookup table. Set to 0xFFFFFFFF if the inode has no extended attributes |
Directory Table
For each directory inode, the directory table stores a list of all entries stored inside, with references back to the inodes that describe those entries.
The directory inodes store the total, uncompressed size of the entire listing, including headers. Using this size, a SquashFS reader can determine if another header with further entries should be following once it reaches the end of a run.
The entry list itself is sorted ASCIIbetically by entry name. To save space, a delta encoding is used to store the inode number, i.e. the list is preceded by a header with a reference inode number and all entries store the difference to that. Furthermore, the header also includes the location of a metadata block that the inodes of all of the following entries are in. The entries just store an offset into the uncompressed metadata block.
Directory Header
Name | Type | Description |
---|---|---|
count | u32 | One less than the number of entries following the header |
start | u32 | The starting byte offset of the block in the Inode Table where the inodes are stored |
inode number | u32 | An arbitrary inode number. The entries that follow store their inode number as a difference to this. Typically the inode numbers are allocated in a continuous sequence for all children of a directory and the header simply stores the first one. Hard links of course break the sequence and require a new header if they are further away than +/- 32k of this number. Inode number allocation and picking of the reference could of course be optimized to prevent this. |
Every time the inode block changes or the difference of the inode number cannot be encoded in 16 bits anymore, a new header is emitted.
A header must have at least one entry. A header must not be followed by more than 256 entries. If there are more entries, a new header is emitted.
The file names are stored without trailing null bytes. Since a zero length name makes no sense, the name length is stored off-by-one, i.e. the value 0 cannot be encoded.
Directory Entry
Name | Type | Description |
---|---|---|
offset | u16 | An offset into the uncompressed inode metadata block |
inode offset | i16 | The difference of this inode's number to the reference stored in the header |
type | u16 | The inode type. For extended inodes, the corresponding basic type is stored here instead |
name_size | u16 | One less than the size of the entry name |
name | u8[name_size + 1] | The file name of the entry without a trailing null byte |
The basic and extended inode types both have a size field that stores the uncompressed size of all the directory entries (including all headers) belonging to the inode. This field is used to deduce if more data is following while iterating over directory entries, even without knowing how many headers and partial lists there will be.
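Iterating over a directory listing, with its alternating headers and entry runs, might look like the Python sketch below. It assumes little-endian fields and that `data` already holds the uncompressed directory table bytes starting at the first header of the listing; `listing_size` is the inode's file_size minus the 3 virtual bytes. The names `parse_directory` and `DirEntry` are hypothetical.

```python
import struct
from collections import namedtuple

DirEntry = namedtuple("DirEntry", [
    "name", "inode_block_start", "inode_offset", "inode_number", "inode_type",
])


def parse_directory(data, listing_size):
    """Walk all headers and entries in the first listing_size bytes of data."""
    entries = []
    pos = 0
    while pos < listing_size:
        # Header: count (stored minus one), inode block start, reference number.
        count, start, ref_number = struct.unpack_from("<III", data, pos)
        pos += 12
        for _ in range(count + 1):
            offset, delta, inode_type, name_size = struct.unpack_from(
                "<HhHH", data, pos)
            pos += 8
            name = bytes(data[pos:pos + name_size + 1])  # size is stored minus one
            pos += name_size + 1
            entries.append(
                DirEntry(name, start, offset, ref_number + delta, inode_type))
    return entries
```

Each entry's `inode_block_start` and `inode_offset` together form the location of the inode in the inode table.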
Directory Index
To speed up lookups on directories with lots of entries, the extended directory inode can store an index table, holding the locations of all directory headers and the name of the first entry after the header.
To allow for fast lookups, a new directory header should be emitted every time the entry list crosses a metadata block boundary.
Name | Type | Description |
---|---|---|
index | u32 | This stores a byte offset from the first directory header to the current header, as if the uncompressed directory metadata blocks were laid out in memory consecutively. |
start | u32 | Start offset of a directory table metadata block |
name_size | u32 | One less than the size of the entry name |
name | u8[name_size + 1] | The name of the first entry following the header without a trailing null byte |
Fragment Table
Fragments are combined into fragment blocks of at most block_size bytes. This table describes the location and size of these fragment blocks, not the fragments within them.
This table is stored in two levels: the fragment block entries are stored in metadata blocks, and the file offsets to these metadata blocks are stored at the offset specified by the fragment_table_start field of the superblock.
Each metadata block can store 512 fragment block entries (16 bytes per fragment block entry), so there will be ceil(fragment_entry_count / 512.0) metadata blocks (and the same number of u64 offsets stored at fragment_table_start).
To read the list of fragment block entries, read ceil(fragment_entry_count / 512.0) u64 offsets starting at fragment_table_start, then read the metadata blocks at the offsets read, interpreting the data of the metadata blocks as a packed array of fragment block entries.
Fragment Block Entry
Name | Type | Description |
---|---|---|
start | u64 | The offset within the archive where the fragment block starts |
size | u32 | This stores two pieces of information. If the block is uncompressed, the 0x1000000 (1<<24) bit will be set. The remaining bits describe the size of the fragment block on disk. Because the max value of block_size is 1 MiB (1<<20), and the size of a fragment block should be less than block_size, the uncompressed bit will never be set by the size. |
_unused | u32 | This field is unused |
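To make the arithmetic concrete, the following Python sketch computes the number of metadata blocks for a given fragment_entry_count and decodes a packed array of fragment block entries. It assumes little-endian fields and that `data` is the concatenated, uncompressed content of the metadata blocks; the names used are illustrative only.

```python
import struct
from collections import namedtuple

ENTRY_SIZE = 16
ENTRIES_PER_BLOCK = 8192 // ENTRY_SIZE  # 512

FragmentEntry = namedtuple("FragmentEntry", ["start", "size_on_disk", "uncompressed"])


def fragment_metadata_block_count(fragment_entry_count):
    """ceil(fragment_entry_count / 512) using integer arithmetic."""
    return -(-fragment_entry_count // ENTRIES_PER_BLOCK)


def parse_fragment_entries(data, fragment_entry_count):
    """Decode a packed array of fragment block entries."""
    entries = []
    for i in range(fragment_entry_count):
        start, size, _unused = struct.unpack_from("<QII", data, i * ENTRY_SIZE)
        uncompressed = bool(size & (1 << 24))
        entries.append(FragmentEntry(start, size & ~(1 << 24), uncompressed))
    return entries
```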
Export Table
To support NFS exports, squashfs needs a fast way to resolve an inode number to an inode structure.
For this purpose, a squashfs archive can optionally contain an export table, which is basically a flat array of 64 bit inode references with the inode number being used as an index into the array.
Because the inode number 0 is not used (as it is reserved as a sentinel value in Linux and other UNIX-like OS kernels), the array actually starts at inode number 1 and the index is thus inode_number - 1.
The array itself is stored in a series of metadata blocks. Since each block can store 1024 references (8 bytes per reference), there will be ceil(inode_count / 1024.0) metadata blocks for the entire array.
To locate the metadata blocks, a secondary list is used, containing the absolute, on-disk locations of the blocks. This list is stored uncompressed and starts at export_table_start.
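Finding the inode reference for a given inode number then comes down to simple index arithmetic, sketched here in Python. The function names `export_slot` and `export_block_count` are invented for this example.

```python
REFS_PER_BLOCK = 8192 // 8  # 1024 references of 8 bytes each


def export_block_count(inode_count):
    """Number of metadata blocks holding the export table."""
    return -(-inode_count // REFS_PER_BLOCK)


def export_slot(inode_number):
    """Return (index into the block location list, byte offset in that block)."""
    if inode_number < 1:
        raise ValueError("inode numbers start at 1")
    index = inode_number - 1
    return index // REFS_PER_BLOCK, (index % REFS_PER_BLOCK) * 8
```

The first value selects one of the u64 locations stored at export_table_start, and the second is the position of the inode reference inside the uncompressed metadata block found there.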
UID/GID Table
UID/GIDs are both stored as u32s. Both UIDs and GIDs are treated as IDs: if a file is owned by user 1000 and group 1000, the ID 1000 will be stored only once. UID/GIDs in inodes are stored as u16 indexes into this table (and thus, a maximum of 65535 unique IDs may be stored).
This table is stored in two levels: the IDs are stored in metadata blocks, and the file offsets to these metadata blocks are stored at the offset specified by the id_table_start field of the superblock.
Each metadata block can store 2048 UID/GIDs (4 bytes per ID), so there will be ceil(id_count / 2048.0) metadata blocks used for IDs (and the same number of u64 offsets stored at id_table_start).
To read the list of IDs, read ceil(id_count / 2048.0) u64 offsets starting at id_table_start, then read the metadata blocks at the offsets read.
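The two-level read described above could be sketched as follows. The example assumes little-endian values and takes a caller-supplied `read_block(offset)` callable that returns the uncompressed content of the metadata block at that archive offset; both that callable and the name `read_id_table` are assumptions of this sketch.

```python
import struct

IDS_PER_BLOCK = 8192 // 4  # 2048


def read_id_table(image, id_table_start, id_count, read_block):
    """Return the list of UID/GID values stored in the ID table."""
    n_blocks = -(-id_count // IDS_PER_BLOCK)
    # First level: uncompressed list of u64 metadata block locations.
    offsets = struct.unpack_from("<%dQ" % n_blocks, image, id_table_start)
    ids = []
    for offset in offsets:
        # Second level: each metadata block is a packed array of u32 IDs.
        data = read_block(offset)
        ids.extend(struct.unpack_from("<%dI" % (len(data) // 4), data))
    return ids[:id_count]
```

An inode's uid_idx and gid_idx are then plain indexes into the returned list.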
Xattr Table
Extended attributes are arbitrary key value pairs attached to inodes. The key names use dots as separators to create a hierarchy of namespaces.
Squashfs uses multiple levels of indirection to store Xattr key value pairs associated with inodes. To save space, the topmost namespace prefix is removed and encoded as an integer ID instead. This approach limits squashfs xattr support to the following, commonly used namespaces:
This means that on the one hand squashfs can store SELinux labels or capabilities since those are stored in the security.* namespaces, but cannot store ACLs which are stored in system.posix_acl_access because it has no way to encode the system. prefix yet.
The key value pairs of all inodes are stored consecutively in metadata blocks. The values can either be stored inline, i.e. an Xattr Key Entry is directly followed by an Xattr Value Entry, or out of line to deduplicate identical values.
If a value is stored out of line, the value entry structure holds a 64 bit reference instead of a string. This reference specifies the location of the value string, similar to an inode reference, but relative to the first metadata block containing the key value pairs.
Typically, the first occurrence of a value is stored in line and every consecutive use of the same value uses an out of line value to refer back to the first one.
Xattr Key Entry
Name | Type | Description |
---|---|---|
type | u16 | The ID of the key prefix. If the value that follows is stored out of line, the flag 0x0100 is ORed to the type ID |
name_size | u16 | The size of the key name excluding the omitted prefix and the trailing null byte |
name | u8[name_size] | The remainder of the key without the prefix and without trailing null byte |
Xattr Value Entry
Name | Type | Description |
---|---|---|
value_size | u32 | The size of the value string. If the value is stored out of line, this is always 8, i.e. the size of an unsigned 64 bit integer |
value | u8[value_size] | The raw value assigned to the key without trailing null byte, or if the value is stored out of line, a reference to the location of the value string |
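Reading one key entry together with the value entry that follows it might be sketched as below. The example assumes little-endian fields, assumes that name_size counts exactly the bytes stored on disk (the key without its prefix), and treats the low byte of the type field as the prefix ID; the names `parse_xattr_pair` and `XattrPair` are hypothetical.

```python
import struct
from collections import namedtuple

OUT_OF_LINE_FLAG = 0x0100

XattrPair = namedtuple("XattrPair", ["prefix_id", "name", "out_of_line", "value"])


def parse_xattr_pair(data, pos=0):
    """Return (pair, position after the pair) for a key entry plus value entry."""
    type_field, name_size = struct.unpack_from("<HH", data, pos)
    pos += 4
    # Assumption: name_size bytes follow, holding the key without its prefix.
    name = bytes(data[pos:pos + name_size])
    pos += name_size
    (value_size,) = struct.unpack_from("<I", data, pos)
    pos += 4
    value = bytes(data[pos:pos + value_size])
    pos += value_size
    out_of_line = bool(type_field & OUT_OF_LINE_FLAG)
    if out_of_line:
        # The 8 value bytes are a u64 reference to the actual value string.
        (value,) = struct.unpack("<Q", value)
    return XattrPair(type_field & 0x00FF, name, out_of_line, value), pos
```

For an out of line pair, the returned value is the 64 bit reference rather than the value string itself.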
To actually address a block of key value pairs associated with an inode, a lookup table is used that specifies the start and size of a block of key value pairs.
All an inode needs to store is a 32 bit index into this table. If two inodes have the identical xattrs (e.g. they have the same SELinux labels and no other attributes), the key/value block is only written once, there is only one lookup table entry and both inodes have the same index.
Xattr Lookup Table
Name | Type | Description |
---|---|---|
xattr_ref | u64 | A reference to the start of the key value block. Similar to inode references, the bits 16 to 48 hold the absolute position of the first metadata block and the lower 16 bits hold an offset into the uncompressed block. |
count | u32 | The number of key value pairs |
size | u32 | The exact, uncompressed size in bytes of the entire block of key value pairs, counting only what has been written to disk and including the key/value entry structures |
In order to locate both tables on disk, an approach similar to ID and fragment tables is used. The following data structure is stored directly in the archive (i.e. uncompressed and without additional headers).
The xattr_id_table_start field in the superblock stores the absolute position of this table.
Xattr ID Table
Name | Type | Description |
---|---|---|
xattr_table_start | u64 | The absolute position of the first metadata block holding the key/value pairs. |
xattr_ids | u32 | The number of entries in the Xattr Lookup Table |
_unused | u32 | This field is unused |
table | u64[] | The absolute locations of each metadata block of the Xattr Lookup Table. Each entry is 16 bytes in size, so a block can hold at most 512 entries. Thus this array has ceil(xattr_ids / 512.0) entries. |
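Reading the Xattr ID Table can be sketched in a few lines of Python. The example assumes little-endian fields and an in-memory archive image; `parse_xattr_id_table` and `XattrIdTable` are names chosen for illustration.

```python
import struct
from collections import namedtuple

LOOKUP_ENTRIES_PER_BLOCK = 8192 // 16  # 512

XattrIdTable = namedtuple("XattrIdTable", ["xattr_table_start", "xattr_ids", "table"])


def parse_xattr_id_table(image, xattr_id_table_start):
    """Read the fixed fields, then the list of lookup table block locations."""
    xattr_table_start, xattr_ids, _unused = struct.unpack_from(
        "<QII", image, xattr_id_table_start)
    n_blocks = -(-xattr_ids // LOOKUP_ENTRIES_PER_BLOCK)
    table = struct.unpack_from(
        "<%dQ" % n_blocks, image, xattr_id_table_start + 16)
    return XattrIdTable(xattr_table_start, xattr_ids, list(table))
```

Each location in `table` points at a metadata block holding up to 512 Xattr Lookup Table entries.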