What the Linux File System Actually Is
When most engineers say "file system" they mean the format on disk — ext4, XFS, Btrfs. But Linux uses the term in a broader sense, and if you conflate the two meanings you'll confuse yourself badly the first time you encounter something like proc or tmpfs. A Linux file system is any hierarchical namespace that exposes data through the standard file operations: open, read, write, stat, readdir. The storage backend is irrelevant. Disk, RAM, kernel data structures, network — it doesn't matter. If you can mount it and navigate it with a path, it's a file system as far as Linux is concerned.
The piece that makes this possible is the Virtual File System, or VFS. VFS is a kernel abstraction layer that sits between system calls and the concrete file system drivers. When your process calls open("/var/log/syslog", O_RDONLY), the kernel doesn't know or care whether /var/log sits on an ext4 partition, an NFS share, or an overlay mount. VFS translates the call into driver-specific operations and returns a file descriptor. This is why "everything is a file" isn't just a philosophy — it's a kernel engineering decision with real consequences for how you build and manage systems.
Under VFS, four key data structures do the heavy lifting. The superblock represents a mounted file system instance and stores global metadata: block size, inode count, flags, and a pointer to the file system operations table. The inode stores per-file metadata — permissions, ownership, timestamps, and block pointers — but notably not the file name. The dentry (directory entry) maps a name to an inode and is cached aggressively in the dentry cache for performance. Finally, the file object represents an open file descriptor in a process and tracks the current position within the file. Understanding that names live in dentries and not inodes is the key to understanding hard links, which I'll come back to later.
How Mount Points Work
A mount point is simply a directory in the existing tree where a new file system is grafted. When you run mount, the kernel calls the mount(2) syscall, which allocates a new superblock for the target file system, creates a mount structure, and attaches it to the mount tree at the specified path. From that moment on, any path resolution that reaches that directory gets handed off to the new file system's driver rather than continuing through the parent. The directory itself — the mount point — isn't deleted or modified. It's hidden behind the newly attached tree. Unmount it and the original directory reappears, contents intact.
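A minimal sketch of that hide-and-reappear behavior, runnable without root on kernels that allow unprivileged user namespaces (the /tmp/mnt-demo path is a hypothetical example):

```shell
# Mounting over a directory hides its contents; unmounting brings them back.
mkdir -p /tmp/mnt-demo
echo hidden > /tmp/mnt-demo/before.txt
unshare --mount --map-root-user sh -c '
  mount -t tmpfs tmpfs /tmp/mnt-demo   # graft a tmpfs over the directory
  ls /tmp/mnt-demo                     # empty: before.txt is hidden, not gone
  umount /tmp/mnt-demo
  cat /tmp/mnt-demo/before.txt         # original contents reappear
'
```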
Linux maintains a per-namespace mount tree. In the early days this was a single global tree, but since Linux 2.4.19 every process can have its own mount namespace, which is fundamental to how containers work. When you run unshare --mount or create a new mount namespace via clone(2) with CLONE_NEWNS, the child gets a copy of the parent's mount tree. Changes made inside the namespace — new mounts, unmounts — don't affect the parent unless you're using shared propagation, which I'll get to in a moment.
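You can inspect namespace identity directly through /proc. This sketch assumes util-linux's unshare is available; the fallback message is only for kernels that disable unprivileged user namespaces:

```shell
# Each process's mount namespace is exposed as a symlink in /proc; processes
# that print the same mnt:[inode] share one mount table.
readlink /proc/self/ns/mnt
# A child created with unshare gets a different mount namespace:
unshare --mount --map-root-user readlink /proc/self/ns/mnt 2>/dev/null \
  || echo "new namespace unavailable without privileges"
```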
Mount propagation is one of those topics that reads like simple documentation until you break production with it. There are four propagation types. Shared mounts propagate events bidirectionally between peer groups — mount something inside a shared mount, and it appears in all peers. Slave mounts receive events from a master but don't send them back. Private mounts have no propagation at all. Unbindable mounts are private and additionally can't be bind-mounted. In my experience, the default on most modern distributions is shared propagation for the root file system, which means that if you mount something inside a container's namespace without thinking about this, it can leak back to the host. Always check with findmnt -o TARGET,PROPAGATION before assuming isolation.
# Show mount tree with propagation flags
findmnt --tree -o TARGET,SOURCE,FSTYPE,OPTIONS,PROPAGATION
# Sample output excerpt
TARGET SOURCE FSTYPE OPTIONS PROPAGATION
/ /dev/sda1 ext4 rw,relatime shared
├─/sys sysfs sysfs rw,nosuid,nodev,noexec shared
├─/proc proc proc rw,nosuid,nodev,noexec shared
├─/dev devtmpfs devtmpfs rw,nosuid shared
│ ├─/dev/pts devpts devpts rw,nosuid,noexec shared
│ └─/dev/shm tmpfs tmpfs rw,nosuid,nodev shared
└─/data /dev/sdb1 xfs rw,relatime shared
The Filesystem Hierarchy Standard and Why It's Laid Out That Way
The FHS isn't arbitrary. Every major directory split exists because of a real operational concern — either separability for independent mounting, performance characteristics, or administrative convenience. /usr was historically a separate partition because it held read-only user programs that could be shared over NFS across workstations. /var holds variable data — logs, spool files, package databases — and belongs on its own partition so that a runaway log file can't fill the root and take down the system. /tmp is volatile by design. /boot often needs to live on a partition that the BIOS or UEFI firmware can access before the kernel is running, which can constrain the file system type and encryption options.
In production I've seen the root file system fill to 100% more times than I can count, and it's almost always /var/log that's responsible. A properly partitioned server carves /var off the root so that log growth never threatens system stability. Similarly, on database servers I'll put /var/lib/postgresql or /var/lib/mysql on a dedicated XFS partition with mount options tuned for write throughput. Keeping the database data path on its own mount point means you can snapshot it cleanly, remount it read-only for a consistent backup, or replace the underlying block device without touching anything else.
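For concreteness, a dedicated database data mount might look like the fstab line below — the UUID, mount point, and option values are hypothetical illustrations and a starting point, not a universal tuning recipe:

```
# Hypothetical dedicated PostgreSQL data partition (tune options per workload)
UUID=1234abcd-5678-90ef-1234-567890abcdef /var/lib/postgresql xfs noatime,logbufs=8,logbsize=256k 0 0
```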
/etc/fstab — The Configuration File You Must Not Ignore
Every persistent mount is defined in /etc/fstab. Each line has six fields: device, mount point, file system type, mount options, dump frequency, and fsck pass order. The device field accepts block device paths, but you should almost never use bare /dev/sdX paths in production. Device names are not stable across reboots — a disk that enumerates as /dev/sdb today might come up as /dev/sdc tomorrow if another disk is added or the enumeration order changes. Use UUIDs or labels instead.
# Get UUID and label info for all block devices
blkid
# /dev/sda1: UUID="a1b2c3d4-e5f6-7890-abcd-ef1234567890" TYPE="ext4" LABEL="root"
# /dev/sdb1: UUID="f9e8d7c6-b5a4-3210-fedc-ba9876543210" TYPE="xfs" LABEL="data"
# Correct fstab entries using UUID
UUID=a1b2c3d4-e5f6-7890-abcd-ef1234567890 / ext4 defaults,relatime 0 1
UUID=f9e8d7c6-b5a4-3210-fedc-ba9876543210 /data xfs defaults,relatime,noatime 0 2
tmpfs /tmp tmpfs mode=1777,nosuid,nodev 0 0
The last two fields trip up junior engineers constantly. The dump field (fifth column) should be 0 for virtually everything — it controls the legacy dump utility, which almost nobody uses anymore. The fsck pass field (sixth column) controls whether and when fsck checks the file system at boot: 0 means skip, 1 means check first (root only), 2 means check after root. If you have XFS or Btrfs file systems, set this to 0 — those file systems use their own recovery mechanisms, and running fsck on them is either a no-op or actively harmful.
Mount Options That Actually Matter in Production
The options field is where you harden your mounts and tune performance. Don't leave everything as defaults and call it done. Let me walk through the options I actually configure on production systems.
noexec prevents execution of binaries directly from the mount point. I put this on /tmp, /var/tmp, and any partition that doesn't need to run executables. It won't stop a determined attacker, but it meaningfully raises the cost of exploiting a write vulnerability to execute a payload. nosuid ignores the setuid and setgid bits on files in that mount, which prevents privilege escalation through setuid binaries dropped onto world-writable mounts. nodev prevents device file interpretation, which matters on any mount that untrusted users can write to.
relatime versus noatime is a performance decision. With strict atime behavior (the historical default), every file read updates the access time (atime) on the inode, which turns every read into a write and can thrash your storage on read-heavy workloads. noatime disables atime updates entirely. relatime — the kernel default since 2.6.30 — only updates atime when it's older than mtime or ctime, or at most once a day, which satisfies most applications that check atime while eliminating the worst-case write amplification. On busy log servers and database hosts, switching from strict atime to relatime or noatime can make a measurable difference in I/O utilization.
# Hardened /tmp in fstab
tmpfs /tmp tmpfs rw,nosuid,nodev,noexec,relatime,size=2G,mode=1777 0 0
tmpfs /var/tmp tmpfs rw,nosuid,nodev,noexec,relatime,size=1G,mode=1777 0 0
# Verify options are applied after mount
grep '/tmp' /proc/mounts
Bind Mounts and Their Practical Uses
A bind mount takes an existing directory (or file) in the tree and makes it appear at a second location simultaneously. Both paths point to the same underlying data — there's no copy. The kernel simply attaches the source's dentry tree at the target location. Bind mounts are powerful because they let you reshape the namespace without moving data on disk.
The most common production use case I reach for is exposing a specific subdirectory into a chroot or container. If a service runs in a chroot at /srv/chroot/dns and it needs access to /etc/resolv.conf, you don't copy the file — you bind-mount it in. When the host updates /etc/resolv.conf, the chroot sees the change immediately because there's only one inode (as long as the file is edited in place rather than replaced with a new one).
# Bind-mount a single file into a chroot
mount --bind /etc/resolv.conf /srv/chroot/dns/etc/resolv.conf
# Make it read-only inside the chroot (remount the existing bind read-only)
mount -o remount,ro,bind /srv/chroot/dns/etc/resolv.conf
# Bind-mount in /etc/fstab
/etc/resolv.conf /srv/chroot/dns/etc/resolv.conf none bind,ro 0 0
Another case where bind mounts shine is testing. When I need to test a new configuration directory structure without modifying the live path, I'll bind-mount the test directory over the live one for the duration of the test and unmount when done. No file moves, no risk of leaving behind a symlink.
Special File Systems: proc, sysfs, devtmpfs, and tmpfs
These aren't storage file systems — they're kernel interfaces that happen to speak the VFS protocol. proc exposes process information and kernel state. /proc/mounts is what the kernel actually has mounted, which makes it the authoritative source — more reliable than parsing /etc/fstab, which is just a wishlist. /proc/meminfo, /proc/cpuinfo, /proc/net/ — all dynamically generated from kernel data structures on every read. No disk I/O happens.
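A quick way to convince yourself of this: every read below comes straight from kernel state, with no block device involved.

```shell
# Generated at read time from kernel data structures -- no disk I/O:
grep MemAvailable /proc/meminfo
# The kernel's live mount table, versus the /etc/fstab wishlist:
head -n 3 /proc/mounts
```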
sysfs, mounted at /sys, exports kernel object hierarchies — devices, drivers, buses, power management. It's the kernel's preferred modern interface for device configuration. When you write to /sys/block/sda/queue/scheduler to change the I/O scheduler for a disk, you're triggering a kernel function through what looks like a file write.
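For instance, reading the scheduler attribute is an ordinary file read, and selecting a scheduler is an ordinary write (shown commented out since it requires root; sda is a placeholder device name):

```shell
# List each block device's available I/O schedulers; the active one is
# shown in brackets. Device names vary by machine.
cat /sys/block/*/queue/scheduler 2>/dev/null
# Changing it is just a file write (root required; sda is a placeholder):
# echo mq-deadline > /sys/block/sda/queue/scheduler
```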
devtmpfs manages the device nodes under /dev dynamically, creating and removing nodes as devices appear and disappear. tmpfs is a real file system backed by virtual memory — RAM and swap. It performs extremely well because there's no disk, but its contents disappear on unmount. I use it for /tmp, /run, and, on systems with enough RAM, for scratch space in data pipelines where intermediate results don't need to survive a reboot. Always set a size limit on tmpfs mounts. The default limit is half of physical RAM, and I've seen servers run out of memory because an application hammered /tmp with temporary files and nothing enforced a ceiling.
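You can usually see the half-of-RAM default by comparing /dev/shm — a tmpfs mount that typically carries no explicit size= — against MemTotal:

```shell
# tmpfs without an explicit size= is capped at half of physical RAM;
# /dev/shm usually demonstrates that default.
df -h /dev/shm
grep MemTotal /proc/meminfo
```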
How Inodes Connect to All of This
On ext4, the inode table is sized when the file system is created and fixed thereafter; XFS, by contrast, allocates inodes dynamically. Running out of inodes is a different failure mode from running out of disk space, and it can be just as fatal. You'll see "No space left on device" even with gigabytes free on disk. This happens most commonly when an application creates enormous numbers of small files — mail spools, PHP session directories, package manager caches.
# Check inode usage per mount point
df -i
# Filesystem Inodes IUsed IFree IUse% Mounted on
# /dev/sda1 3932160 312000 3620160 8% /
# /dev/sdb1 6553600 6553590 10 100% /data
# Find directories with massive inode consumption
find /data -xdev -printf '%h\n' | sort | uniq -c | sort -rn | head -20
Hard links are a direct consequence of inode architecture. A hard link is a dentry that points to an existing inode — a second name for the same underlying data. The inode has a link count field that tracks how many dentries reference it. The data is only freed when the link count drops to zero and no process has the file open. Symbolic links, by contrast, are their own inodes containing a path string. They can cross file system boundaries because they're just path pointers; hard links cannot, because dentries are only meaningful within a single superblock's namespace.
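The link-count mechanics above can be watched directly with stat; the paths under /tmp are hypothetical examples:

```shell
# Hard link vs symlink in action.
rm -f /tmp/link-orig /tmp/link-hard /tmp/link-soft
echo data > /tmp/link-orig
ln /tmp/link-orig /tmp/link-hard      # new dentry, same inode -> link count 2
ln -s /tmp/link-orig /tmp/link-soft   # new inode containing a path string
stat -c '%n inode=%i links=%h' /tmp/link-orig /tmp/link-hard
rm /tmp/link-orig                     # count drops to 1; data stays reachable
cat /tmp/link-hard                    # prints "data"
cat /tmp/link-soft 2>/dev/null || echo "dangling symlink"
```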
Common Misconceptions
The biggest one I hear: "mounting replaces the directory." It doesn't. The directory still exists on the underlying file system — it's just hidden while the mount is active. Unmount the file system and the directory, including any files you might have accidentally left there before mounting, comes back. I've seen engineers create files in what they thought was a mounted directory, only to realize later they were writing to the hidden layer underneath. Always verify with findmnt or mount | grep target before writing to a path you expect to be mounted.
Second misconception: "NFS mounts work like local mounts." They do at the VFS interface level, but semantics differ in ways that will ruin your day. NFS has weaker consistency guarantees, atime behavior can differ based on server settings, file locking on NFSv3 depends on separate daemons (rpcbind/statd), and a network partition will cause your mount to hang indefinitely by default unless you use the soft and timeo options — which introduce their own tradeoffs around silent data corruption on write failures. NFS deserves its own article, but know that mounting it and treating it as ext4 is a mistake.
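To make the tradeoff concrete, a hedged illustration of an NFS fstab entry using those options (server name, export, and mount point are made up; timeo is in tenths of a second, so this gives up after roughly 15-second timeouts rather than hanging forever):

```
# Hypothetical NFS mount trading hangs for possible write errors
nfs01:/export/data /mnt/data nfs4 rw,soft,timeo=150,retrans=2,_netdev 0 0
```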
Third: "/etc/mtab is authoritative." On modern systems, /etc/mtab is a symlink to /proc/self/mounts, which is the kernel's own view of your mount table — and that's the correct behavior. On older systems where /etc/mtab was a real file, it could drift out of sync with actual mounts if a mount operation failed to update it cleanly — a particularly nasty failure mode after a crash. If you're troubleshooting mounts, always read /proc/mounts directly and treat it as ground truth.
# The authoritative mount table
cat /proc/mounts
# Or with more human-readable output
findmnt
# Find what's mounted on a specific path
findmnt /data
# TARGET SOURCE FSTYPE OPTIONS
# /data /dev/sdb1 xfs rw,relatime,attr2,inode64,logbufs=8,noquota
In my experience, engineers who invest the time to understand VFS, mount propagation, and inode semantics stop treating storage as a black box and start making better decisions — about partition layout, about mount options, about how containers interact with the host file system. It's foundational knowledge that pays dividends every time you're debugging a full disk, a permission problem, or an unexpected mount behavior in a containerized environment. Get comfortable with findmnt, understand what's in /proc/mounts, and never use bare device paths in fstab again.
