Everything Is a File — But What Does That Actually Mean?
You've heard the phrase a hundred times: in Linux, everything is a file. Processes, sockets, hardware devices, kernel tunables — the operating system presents all of it through a unified file interface. That abstraction only works because of the Virtual File System layer, and the mount system that hangs concrete storage onto it. If you've ever wondered what actually happens when you run mount, or why your container sees a completely different root than the host, this is the article for you.
I'm going to walk through this the way I'd explain it to a colleague who's comfortable with Linux administration but hasn't dug into the internals. We'll go from the VFS layer down to fstab, bind mounts, and mount namespaces — and I'll flag the misconceptions I see most often in the field.
The Virtual File System: One Interface, Many Backends
The kernel's VFS is a software layer that sits between user-space applications and actual file system implementations. When a process calls
open(),
read(), or
stat(), it talks to the VFS. The VFS then dispatches that call to the appropriate concrete file system driver — whether that's ext4, XFS, tmpfs, or even a network file system like NFS.
The VFS defines four core objects: superblocks, inodes, dentries, and file objects. The superblock holds metadata about the mounted file system as a whole — its type, block size, and state. An inode represents a single file or directory, tracking permissions, timestamps, and pointers to the actual data blocks. A dentry (directory entry) maps a filename to an inode — it's the kernel's cache of the namespace tree. And a file object represents an open file descriptor in a running process, tying together the inode, current position, and flags.
This design is why you can do things like stat /proc/1/status and get sensible output even though there are no actual disk blocks backing that file. The procfs driver implements the VFS interface and generates content on the fly when the kernel services the read call. The application doesn't need to know or care.
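You can watch that dispatch happen with nothing but coreutils. A procfs file reports a size of zero to stat() because there is no stored data; the content only comes into existence when something actually reads it:

```shell
# procfs files have no backing blocks: stat reports size 0,
# but a read produces content the kernel generates on the fly.
stat -c 'size=%s' /proc/self/status   # prints: size=0
wc -c < /proc/self/status             # a nonzero byte count
```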
What a Mount Point Actually Is
A mount point is just a directory. That's it. When you mount a file system onto a directory, the kernel attaches the root of that file system to that directory entry in the namespace tree. Anything that was previously in that directory becomes temporarily invisible — hidden behind the mounted file system's own root. This is not a copy, not a symlink, not a bind: the directory's dentry is marked as a mount point, and path lookup crosses over into the mounted file system's root instead of descending into the original contents.
The kernel tracks all active mounts in an internal structure accessible via /proc/mounts and the more detailed /proc/self/mountinfo. The findmnt tool reads these and formats them usably:
[infrarunbook-admin@sw-infrarunbook-01 ~]$ findmnt
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda2 ext4 rw,relatime
├─/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime
│ ├─/sys/kernel/security securityfs securityfs rw,nosuid,nodev,noexec,relatime
│ └─/sys/fs/cgroup cgroup2 cgroup2 rw,nosuid,nodev,noexec,relatime
├─/proc proc proc rw,nosuid,nodev,noexec,relatime
├─/dev devtmpfs devtmpfs rw,nosuid,size=8192k,nr_inodes=4096
│ ├─/dev/pts devpts devpts rw,nosuid,noexec,relatime
│ └─/dev/shm tmpfs tmpfs rw,nosuid,nodev
├─/run tmpfs tmpfs rw,nosuid,nodev,mode=755
└─/data /dev/sdb1 xfs rw,relatime,attr2,inode64
Notice the tree structure. Every indented entry is a child mount — a mount whose mount point lives inside the parent mount's directory tree. This hierarchy is called the mount tree, and it's how the kernel resolves paths. When you stat /sys/fs/cgroup, the kernel walks the path components, checks the dentry cache, and follows the mount attachment to reach cgroup2's own inode space.
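You can ask findmnt to do this resolution for you: given any path, its --target option walks up to the mount that contains it, which is handy when you're not sure which file system a file actually lives on:

```shell
# Resolve which mount a path belongs to, and its file system type
findmnt --target /proc/self/status -o TARGET,FSTYPE
```

For a procfs path like this one, the resolved target is /proc with type proc.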
File System Types You'll Actually Encounter
Not all file systems store data on disk. Linux mounts a mix of storage-backed, memory-backed, and kernel-virtual file systems during a normal boot. Understanding what each one is for saves you a lot of confusion when you're staring at /proc/mounts wondering why there are fifteen entries before you even get to your disk.
ext4 is still the most common root file system type on general-purpose Linux. It's journaled, mature, and has excellent fsck tooling. XFS is a better choice for large files and high-throughput workloads — it scales better under heavy parallelism, which is why RHEL defaults to it. Both support extended attributes and ACLs out of the box.
tmpfs is memory-backed storage. It behaves exactly like a disk-backed file system from a user-space perspective, but its data lives in RAM (and optionally swap). The kernel uses it for /dev/shm, /run, and often /tmp. Its size ceiling defaults to half of physical RAM unless you set one with the size= option, and within that ceiling it grows and shrinks dynamically as files come and go. I've seen junior engineers try to diagnose "disk full" errors in /run and spend twenty minutes looking for a disk device that doesn't exist.
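When a path's backing isn't obvious, ask the VFS instead of guessing: stat -f reports the file system type of whatever mount the path lives on. Assuming a normal system with /dev/shm mounted:

```shell
# -f queries the file system containing the path, not the file itself
stat -f -c 'type=%T' /dev/shm   # prints: type=tmpfs
df -h /dev/shm                  # shows the tmpfs ceiling and current usage
```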
proc and sysfs are pseudo-file systems that expose kernel data structures. /proc is process and system information. /sys is the kernel object model — device trees, driver parameters, and hardware state. Neither has any on-disk presence. If you're ever tuning kernel parameters at runtime with echo into /sys/block/sdb/queue/scheduler, you're talking directly to the kernel through the sysfs interface — the write is handled by a kernel callback, not stored anywhere.
devtmpfs provides the /dev tree. The kernel auto-populates it with device nodes as drivers register hardware. Without it mounted, you'd have no /dev/sda, no /dev/null, nothing. udev then manages the naming and permissions on top of this.
overlay (overlayfs) is what Docker and most container runtimes use for image layers. It stacks a read-only lower directory and a writable upper directory, presenting a merged view. Writes go to the upper layer. The lower layers are never modified. This is how you can spin up fifty containers from the same base image without duplicating gigabytes of data.
How fstab Works — and Where People Get It Wrong
The file /etc/fstab is a static table that tells the system what to mount at boot, and how. Each line has six fields: device, mount point, file system type, options, dump frequency, and fsck pass order. Here's a representative example from a server I work on:
# /etc/fstab on sw-infrarunbook-01
# <device> <mount> <type> <options> <dump> <pass>
UUID=3f2e1a4b-8c7d-4e56-9f01-2b3c4d5e6f7a / ext4 defaults,errors=remount-ro 0 1
UUID=a1b2c3d4-e5f6-7890-abcd-ef1234567890 /data xfs defaults,noatime 0 2
UUID=dead1234-beef-cafe-0000-111122223333 /boot ext4 defaults 0 2
tmpfs /tmp tmpfs defaults,size=2G,mode=1777 0 0
The device field should almost always use a UUID, not /dev/sda1. Device names aren't stable — if you add a disk or the kernel enumerates storage in a different order, /dev/sdb might become /dev/sda after a reboot. UUIDs don't change. Get them with blkid or lsblk -o NAME,UUID.
The options field is where the real control lives. noexec prevents execution of binaries from that mount — useful on /tmp and /home to limit attacker surface. nosuid ignores setuid bits, which matters if users can put files on a mount. nodev prevents interpretation of device files — you generally want this on any non-root partition. relatime is the modern default for atime handling: it only updates access time if the current atime is older than the modification time (or more than a day old), reducing write overhead without completely disabling atime tracking like noatime does.
The dump and fsck fields trip people up. Dump is almost universally 0 these days — the dump utility is rarely used. The fsck pass field tells fsck when to check the file system at boot. The root file system should be 1. All other file systems that need checking should be 2 (they run after root, in parallel if possible). A value of 0 means skip fsck entirely, which is correct for tmpfs, proc, sysfs, and any network file system.
Bind Mounts: The Same Data, Different Path
A bind mount takes an existing directory (or file) and mounts it at a second location. Both paths show the same inode tree. Changes through either path are immediately visible through the other because they're backed by the same in-memory structures.
[infrarunbook-admin@sw-infrarunbook-01 ~]$ mkdir /mnt/bindtest
[infrarunbook-admin@sw-infrarunbook-01 ~]$ mount --bind /data/shared /mnt/bindtest
[infrarunbook-admin@sw-infrarunbook-01 ~]$ findmnt /mnt/bindtest
TARGET SOURCE FSTYPE OPTIONS
/mnt/bindtest /dev/sdb1 xfs rw,relatime,attr2,inode64
Bind mounts are how container runtimes expose host paths inside a container's mount namespace. When you run a container with -v /data/configs:/etc/app, the runtime bind-mounts /data/configs into the container's private namespace at /etc/app. The container process sees it as a regular directory. The host directory is unchanged.
You can also bind-mount individual files, which is useful for injecting a single config file into a container without exposing an entire directory. And you can make a bind mount read-only even if the source is writable:
[infrarunbook-admin@sw-infrarunbook-01 ~]$ mount --bind /data/shared /mnt/readonly-view
[infrarunbook-admin@sw-infrarunbook-01 ~]$ mount -o remount,ro,bind /mnt/readonly-view
To make bind mounts persistent across reboots, add them to fstab with the bind option:
/data/shared /mnt/bindtest none bind 0 0
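One thing /proc/mounts won't tell you is whether a mount is a bind. /proc/self/mountinfo will: its fourth field is the mount's root within its file system, and any value other than / means you're looking at a bind of a subtree (or a single file):

```shell
# Field 4 = root of the mount inside its file system, field 5 = mount point.
# Binds of a subtree show a non-"/" root; prints nothing if there are none.
awk '$4 != "/" {print $4, "->", $5}' /proc/self/mountinfo
```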
Mount Namespaces: Isolation at the Kernel Level
Mount namespaces are the mechanism that lets containers have an entirely different file system view from the host. When a process creates a new mount namespace (via unshare(2) or clone(2) with CLONE_NEWNS), it gets its own private copy of the mount tree. Changes inside that namespace — mounting, unmounting, bind-mounting — are invisible to the parent namespace and to other namespaces.
In my experience, this is where a lot of engineers get confused when debugging containers. They'll shell into a host and try to find a mount they're certain they made inside a container, only to find it's not there. Of course it isn't — the container is running in its own mount namespace. The host's /proc/mounts shows the host namespace. The container's namespace is separate.
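You can check which mount namespace a process is in by reading its ns symlink; two processes share a namespace exactly when the inode numbers in the link match (the specific number varies by system):

```shell
# The link target encodes the namespace's inode number
readlink /proc/self/ns/mnt   # e.g. mnt:[4026531841]
readlink /proc/$$/ns/mnt     # same shell, same namespace, same value
```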
You can inspect another process's mount namespace by reading /proc/<pid>/mounts or by using nsenter:
[infrarunbook-admin@sw-infrarunbook-01 ~]$ nsenter --mount --target 4821 findmnt
TARGET SOURCE FSTYPE OPTIONS
/ overlay overlay rw,relatime,lowerdir=...,upperdir=...,workdir=...
├─/proc proc proc rw,nosuid,nodev,noexec,relatime
├─/dev tmpfs tmpfs rw,nosuid,size=65536k,mode=755
└─/etc/resolv.conf /dev/sda2 ext4 rw,relatime
That last line is particularly revealing — a single file bind-mounted from the host's ext4 root into the container's overlay namespace. That's exactly how container runtimes inject DNS configuration without affecting anything else in the container's file system.
Shared subtrees complicate this further. A mount can be marked as shared, slave, private, or unbindable. A shared mount propagates mount events to its peers. A slave mount receives propagation from its master but doesn't send it back. Private mounts don't propagate at all. This controls whether new mounts inside a namespace become visible outside it, and it's what the --mount=type=bind,propagation=rprivate syntax in container tooling is controlling.
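You don't have to decode the optional fields in mountinfo by hand to see this; findmnt exposes the propagation state as a column:

```shell
# PROPAGATION shows shared/private/slave for each mount
findmnt -o TARGET,PROPAGATION
```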
systemd and Mount Units
On any systemd-based distribution, fstab entries are automatically converted into mount units at boot. You can also write mount units directly. A .mount unit's name must match the escaped mount point path — so a mount at /data/shared becomes data-shared.mount.
# /etc/systemd/system/data-shared.mount
[Unit]
Description=Shared Data Volume
After=network.target
[Mount]
What=/dev/disk/by-uuid/a1b2c3d4-e5f6-7890-abcd-ef1234567890
Where=/data/shared
Type=xfs
Options=defaults,noatime
[Install]
WantedBy=multi-user.target
Paired with an .automount unit, systemd can mount file systems on demand — only when something actually accesses the mount point. This is useful for NFS shares that shouldn't block boot when the network isn't ready, or for removable storage.
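A matching automount unit is short. Here's a sketch for the /data/shared mount above; the TimeoutIdleSec value is an arbitrary choice, not a recommendation:

```ini
# /etc/systemd/system/data-shared.automount (sketch)
[Unit]
Description=Automount for Shared Data Volume

[Automount]
Where=/data/shared
TimeoutIdleSec=600

[Install]
WantedBy=multi-user.target
```

With the automount enabled instead of the mount unit, the file system isn't mounted until the first access to /data/shared, and it's unmounted again after ten idle minutes.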
Common Misconceptions I Keep Seeing
The first one: mounting doesn't copy data. I've had people tell me they're worried that mounting a directory over another directory will destroy the contents of the lower directory. Nothing is destroyed. The contents of the mount point directory are hidden while something is mounted on top of it, but they come right back when you unmount. Run umount /mountpoint and the original directory contents are exactly where you left them.
The second: proc and sys are not optional. I've seen people strip down containers to the point where /proc isn't mounted, then wonder why ps shows nothing, why top refuses to start, and why half the tooling is broken. Many system utilities read process and system state directly from procfs. If it's not mounted, they fail silently or with cryptic errors.
The third: /etc/mtab is not the source of truth. On modern systems, /etc/mtab is a symlink to /proc/self/mounts. The kernel's mount table is the authoritative record. Don't hand-edit mtab — it's auto-generated. If your mount isn't in /proc/mounts, it didn't work, regardless of what's in fstab.
The fourth: lazy unmount is not a safe default. umount -l performs a lazy unmount — it detaches the file system from the namespace tree immediately but doesn't actually release the resources until all open file descriptors against it are closed. This sounds convenient, but I've seen it cause real problems: processes continue reading and writing to what they think is the file system long after you thought you unmounted it, and then you try to run fsck on what you believe is an idle device. Use lazy unmount intentionally, not as a workaround for a busy device. The right approach is to find what has files open with lsof +f -- /mountpoint and deal with those processes first.
The fifth: tmpfs data doesn't survive reboots — obviously — but it also doesn't survive unmounts. This trips people up occasionally when they're testing something in /dev/shm and manually unmount and remount. The data is gone. tmpfs is volatile by design. Don't confuse it with a ramdisk image that you can serialize and reload.
Practical Checks for Day-to-Day Work
When you're investigating a storage problem, findmnt is almost always the right starting point. It reads /proc/self/mountinfo and gives you the full picture, including propagation flags and bind sources. lsblk -f maps block devices to file system types and UUIDs. df -Th shows usage with file system types included — the -T flag is something I forget to add half the time and then wonder why I'm looking at unlabeled columns.
[infrarunbook-admin@sw-infrarunbook-01 ~]$ df -Th
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda2 ext4 50G 18G 30G 38% /
tmpfs tmpfs 7.8G 0 7.8G 0% /dev/shm
/dev/sdb1 xfs 500G 220G 280G 44% /data
tmpfs tmpfs 2.0G 1.2M 2.0G 1% /tmp
If you suspect a mount point is hiding something (i.e., something mounted over a non-empty directory), you can temporarily move the mount aside with mount --move to inspect what's underneath, or bind-mount the parent directory somewhere else — a plain bind doesn't recurse into child mounts, so the bind shows the underlying directory contents without the mounts on top.
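For the quick "is anything mounted here?" question, util-linux ships a dedicated mountpoint tool whose exit status makes it script-friendly:

```shell
# Exit 0 if the path is a mount point, 1 otherwise
mountpoint /proc                               # prints: /proc is a mountpoint
mountpoint -q /etc || echo "just a directory"  # -q suppresses the message
```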
For anything involving mount namespaces and containers, lsns -t mnt lists all mount namespaces on the system with their owning process and PID. Combined with nsenter, you can walk into any process's mount namespace and inspect its exact view of the file system tree without affecting it.
[infrarunbook-admin@sw-infrarunbook-01 ~]$ lsns -t mnt
NS TYPE NPROCS PID USER COMMAND
4026531841 mnt 142 1 root /sbin/init
4026532198 mnt 4 4821 infrarunbook-admin /usr/bin/containerd-shim-runc-v2
4026532301 mnt 1 4835 100000 nginx: master process nginx
The Linux file system layer is one of those areas where understanding the internals pays compounding dividends. Once you understand what a mount point actually does at the kernel level, the behavior of containers, bind mounts, chroots, and namespace isolation stops being magic and starts being obvious. And when something breaks at 2 AM, obvious beats magic every single time.
