System administrators want to understand the applications that run on their
systems. You can't tune a machine unless you know what the machine is
doing! It's fairly easy to monitor a machine's physical resources: CPU
(mpstat, top), memory (vmstat), disk I/O (iotop, blktrace, blkiomon) and
network bandwidth (ip, nettop).
Logical resources are just as important—if not more important—yet
the tools to monitor them either don't exist or aren't
exactly "user-friendly". For example, the ps
command can report the RSS
(resident set size) for a process. But how much of that is shared library
and how much is application? Or executable code vs. data space? Those are
questions that must be answered if a system administrator wants to
calculate an application's memory footprint.
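For example, here's a quick way to see the RSS (and the virtual size, VSZ) that ps reports for your current shell; the column list shown is just one reasonable choice:

ps -o pid,rss,vsz,comm -p $$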
To answer these questions, and others, I'll describe how to extract information from the /proc filesystem. First, let's look at terminology relevant to Linux memory management. If you want an exhaustive look at memory management on Linux, consider Mel Gorman's seminal work Understanding the Linux Virtual Memory Manager. His book is an oldie but a goodie; the hardware he describes hasn't changed much over the intervening years, and the changes that have occurred have been minor. This means the concepts he describes, and much of the code used to implement those concepts, are still spot-on.
Before going into the nuts and bolts of answering those questions, you first need to understand the context in which they're answered. So let's start with a high-level overview.
Your computer system has some amount of physical RAM installed. RAM is needed to run all software, because the CPU will fetch instructions and data from RAM and nowhere else. When a system doesn't have enough RAM to satisfy all processes, some of the process memory is written to an external storage device and that RAM then can be freed for use by other processes. This is called either swapping, when the RAM being freed is anonymous memory (meaning that it isn't associated with file data, such as shared memory or a process's heap space), or paging (which applies to things like memory-mapped files).
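If you're curious about the swap situation on your own system, /proc/meminfo reports it directly (the field names below are standard; the values will of course vary):

grep -E '^(SwapTotal|SwapFree)' /proc/meminfo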
(By the way, a process is simply an application that's currently running. While the application is executing, it has a current directory, user and group credentials, a list of open files and network connections, and so on.)
Some types of memory don't need to be written out before they can be freed and reused. For example, the executable code of an application is stored in memory and protected as read-only. Since it can never be changed, when Linux wants to use that memory for something else, it just takes it! If the application ever needs that memory back again, Linux can reload it from the original application executable on disk. Also, since this memory is read-only, it can be used by multiple processes at the same time. And, this is where the confusion comes in regarding calculating how much memory a process is using—what if some of that memory is being shared with other processes? How do you account for it?
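You can see this sharing in action by counting how many running processes currently map the C library. A rough sketch (the library's exact filename varies by distribution, and an unprivileged user typically can read only their own processes' maps files):

grep -l libc /proc/[0-9]*/maps 2>/dev/null | wc -l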
Before getting to that, I need to define a few other terms. The first is pinned memory. Most memory is pageable, meaning that it can be swapped or paged out when the system is running low on RAM. But pinned memory is locked in place and can't be reused. This is obviously good for performance—the memory never can be taken away, so you never have to wait for it to be brought back in. The problem is that such memory can never be reused, even if the system is running critically low on RAM. Pinned memory reduces the system's flexibility when it comes to managing memory, and no one likes to be boxed into a corner.
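You can check how much memory a process has pinned by looking at the VmLck field in its status file; for a typical login shell, it will be zero:

grep VmLck /proc/$$/status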
I made reference above to read-only memory, memory that is shared, memory used for heap space, and so on. Below is some sample output that shows how memory is being used by my Bash shell (I want to emphasize that this output has been trimmed to fit into the allotted space, but all relevant fields are still represented. You can run the two commands you see on your own system and look at real data, if you wish. You'll see full pathnames instead of "..." as shown below, for example):
fedwards@local:~$ cd /proc/$$
fedwards@local:/proc/3867$ cat maps
00400000-004f4000 r-xp 00000000 08:01 260108 /bin/bash
006f3000-006f4000 r--p 000f3000 08:01 260108 /bin/bash
006f4000-006fd000 rw-p 000f4000 08:01 260108 /bin/bash
006fd000-00703000 rw-p 00000000 00:00 0
00f52000-01117000 rw-p 00000000 00:00 0 [heap]
f4715000-f4720000 r-xp 00000000 08:01 267196 /.../libnss_files-2.23.so
f4720000-f491f000 ---p 0000b000 08:01 267196 /.../libnss_files-2.23.so
f491f000-f4920000 r--p 0000a000 08:01 267196 /.../libnss_files-2.23.so
f4920000-f4921000 rw-p 0000b000 08:01 267196 /.../libnss_files-2.23.so
f4921000-f4927000 rw-p 00000000 00:00 0
f4f55000-f5914000 r--p 00000000 08:01 139223 /.../locale-archive
f6329000-f6330000 r--s 00000000 08:01 396945 /.../gconv-modules.cache
f6332000-f6333000 rw-p 00000000 00:00 0
fd827000-fd848000 rw-p 00000000 00:00 0 [stack]
fd891000-fd894000 r--p 00000000 00:00 0 [vvar]
fd894000-fd896000 r-xp 00000000 00:00 0 [vdso]
ff600000-ff601000 r-xp 00000000 00:00 0 [vsyscall]
fedwards@local:/proc/3867$
Each line of output represents one vm_area. A vm_area is a data structure
inside the Linux kernel that keeps track of how one region of virtual
memory is being used inside a process. The sample output has /bin/bash on
the first three lines, because Linux has created three ranges of virtual
memory that refer to the executable program. The first region has
permissions r-xp, because it is executable code (r = read, x = execute and
p = private; the dash means write permission is turned off). The second
region refers to read-only data within the application and has permissions
r--p (the two dashes represent write and execute permission). The third
region represents variables that are given initial values in the
application's source code, so it must be loaded from the executable, but
it can be changed at runtime (hence the permissions rw-p, which show only
execute turned off). These regions can be any size, but they are made up
of pages, which are 4K each on most Linux systems. The term page means the
smallest allocatable unit of virtual memory. (In technical documentation,
you'll see two other terms: frame and slot. Frames and slots are the same
size as pages, but frames refer to physical memory and slots refer to swap
space.)
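You can confirm the page size on your own machine with getconf; on x86 systems, it will almost always report 4096 (bytes):

getconf PAGESIZE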
You know from my previous discussion that read-only regions are shared
with other processes, so why does "p" show up in the permissions for the
first region? Shouldn't it be a shared region? You have a good eye to spot
that! Yes, it should. And in fact, it is shared. The reason it shows up as
"p" here is that there are actually 14 different permissions and room for
only four letters, so some condensing had to be done. The "p" means
private, because although the memory is currently marked read-only, the
application could change that permission and make it read-write, and if it
did make that change and then modified the memory, you would not want
other processes to see those changes! That would be similar to one process
changing directory, and every other process on the system changing at the
same time! Oops! So the letter "p" that marks the region as private really
means copy-on-write. All of the memory starts out being shared among all
processes using that region, but if any part of it is modified later, that
one tiny piece is copied into another part of RAM so that the change
applies only to the one process that attempted the write. In essence, it's
private, even though 99% of the time, the memory in that region will be
shared with other processes. Such copying happens on a page-by-page basis,
not across the entire vm_area. Now you can begin to see the difficulty in
calculating how much memory a process actually consumes.
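A preview of where that accounting lives: each line of maps gets a detailed breakdown in /proc/pid/smaps, including how many kilobytes of the region are currently shared and how many are private. To peek at the fields for the first region of your shell (the exact field list varies by kernel version):

head -n 20 /proc/$$/smaps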
But while I'm on this topic, there's a region in the list that has an "s" in the permission field. That region is a memory-mapped file, meaning that the data blocks on disk are mapped to the virtual memory addresses shown in the listing. Any references the process makes to those memory addresses are translated automatically into reads and writes of the corresponding data blocks on disk. The memory used by this region is actually shared by all processes that map the file into memory, so those processes pay no duplicated memory cost for file access.
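To pick out just the shared mappings in your own shell, filter the maps file on the last character of the permission field:

awk '$2 ~ /s$/' /proc/$$/maps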
Just because a region represents some given size of virtual memory does not necessarily mean that there are physical frames of RAM behind every virtual page. In fact, pages without backing frames are the common case. Imagine an application that allocates 100MB of memory. Should the operating system actually allocate 100MB right then? UNIX systems do not—they allocate a region of virtual memory like those above, but no physical RAM. As the process tries to access those virtual addresses, page faults are generated, and the operating system allocates the memory at that time. Deferring memory allocation until the last possible moment is one way Linux optimizes the use of memory, but it complicates the task of determining how much memory an application is using.
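You can see the gap between what has been allocated and what is actually resident by comparing two fields in a process's status file; VmSize (virtual) is routinely much larger than VmRSS (resident):

grep -E '^Vm(Size|RSS)' /proc/$$/status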
A process's address space is broken up into regions called vm_areas. These
vm_areas are unique to each process, but the frames of memory referred to
by the pages within the vm_area might be shared across processes. If the
memory is read-only (like executable code), all processes share the frame
equally. Any attempt to write to virtual pages that are read-only triggers
a page fault that is converted into a SIGSEGV, and the process is killed.
(You may have seen the message pop up on your terminal screen,
"Segmentation fault." That means the process was killed by SIGSEGV.)
Memory that is read/write also can be shared; shared memory segments are a
common example. If multiple processes can write to the frames of the
vm_area equally, some form of synchronization inside the application will
be necessary, or multiple processes could write at the same time, possibly
corrupting the contents of that shared memory. (Most applications use some
kind of mutex lock for this, but synchronization and locking are outside
the scope of this article.)
So, determining how much memory a process consumes is difficult. You could
add up the space allocated to the vm_areas, but that's virtual memory, not
physical; large portions of that space could be unused or swapped out.
This number is not a true representation of the amount of memory being
used by the process.
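Here's a sketch of that first approach anyway, summing the size of every region listed in maps (this assumes GNU awk, whose strtonum function handles the hexadecimal arithmetic):

# requires GNU awk (gawk) for strtonum
awk -F'[ -]' '{ ttl += strtonum("0x" $2) - strtonum("0x" $1) }
    END { printf "%d kB of virtual address space\n", ttl / 1024 }' /proc/$$/maps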
You could add up only the frames that are used by this process and not shared. (This information is available in /proc/pid/smaps.) You might call this the "USS" (Unique Set Size), as it defines how much memory will be freed when an application terminates (as a performance optimization, shared libraries typically stay in RAM even when no process is currently using them, in case they are needed again soon). But this isn't the true memory cost of a process either, as the process likely uses one or more shared libraries. For example, if an application is executed and it uses a shared library that isn't already in memory, that library must be loaded—surely some part of that cost should be charged to the new process, right?
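As a sketch, you can approximate the USS of your shell by summing the private (unshared) page counts reported in smaps:

awk '/^Private_(Clean|Dirty):/ { ttl += $2 } END { print ttl " kB" }' /proc/$$/smaps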
The ps command reports the "RSS" (Resident Set Size), which includes all
frames used by the process, regardless of whether they're shared.
Unfortunately, this number overstates memory use when all processes are
summed up—adding up the RSS of every process running on the system counts
the shared libraries multiple times, greatly inflating the apparent memory
requirement.
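You can watch that inflation happen by adding up the RSS of every process on the system; on most machines, the total will comfortably exceed the physical memory actually in use:

ps -eo rss= | awk '{ ttl += $1 } END { print ttl " kB" }'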
The /proc/pid/smaps file includes yet another memory category, PSS (Proportional Set Size). This is the amount of memory unique to one process (the USS), plus a proportional share of the memory it shares with other running processes. For example, let's assume the USS for a process is 2MB and it uses another 4MB of shared libraries, but those shared libraries are also used by three other processes. Since four processes are using the shared libraries, each should be charged for only 25% of the overall library size. That makes the PSS of the process 2MB + (4MB / 4) = 3MB. If you now add together the PSS values of all processes on the system, the shared library memory is accounted for exactly once, meaning the whole is equal to the sum of its parts.
It's not perfect—when one of those processes terminates, the memory returned to the system will be its USS, and because there's one less process using the shared libraries, the PSS of all the other processes will appear to increase! A naïve system administrator might wonder why the memory usage of the remaining processes has suddenly spiked, but in truth, it hasn't. In this example, each remaining process's share of the libraries goes from 4MB/4 to 4MB/3, so its PSS rises by about 0.33MB.
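To see the PSS of a single process (your shell, for instance), sum the Pss lines in its smaps file:

awk '/^Pss:/ { ttl += $2 } END { print ttl " kB" }' /proc/$$/smaps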
As the last step, I'm going to demonstrate a command that performs these calculations.
The one-line command shown below will accumulate all of the PSS values for all processes on the system:
awk '/^Pss:/ { ttl += $2 }; END { print ttl }' /proc/[0-9]*/smaps 2>/dev/null
Note that stderr is redirected to /dev/null. This is because the shell
replaces the wildcard string with a list of all filenames that match and
then executes the awk command. By the time awk is running, some of those
processes may already have terminated, which would cause awk to print an
error message about a non-existent file; redirecting stderr avoids that.
(Astute readers will note that this command line will never factor in the
memory consumed by the awk command itself!)
Many of the processes that the awk command will be reading are not
accessible to an unprivileged account, so system administrators should
consider using sudo to run the command. (Inaccessible processes produce
error messages that are then redirected to /dev/null, so without sudo the
command reports a total only for the processes that are accessible—in
other words, those owned by the current user.)
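Putting it all together, a privileged, system-wide total looks like this:

sudo awk '/^Pss:/ { ttl += $2 }; END { print ttl }' /proc/[0-9]*/smaps 2>/dev/null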
I've covered a lot of ground in this article, from terminology (pages,
frames, slots) and background information on how virtual memory is
organized (vm_areas), to details on how memory usage is reported to
userspace (the maps and smaps files under /proc). I've barely scratched
the surface of the type of information that the Linux kernel exposes to
userspace, but hopefully, this has piqued your interest enough that you'll
explore it further.
My favorite source for technical details is LWN.net when I'm looking for discussion and analysis, but I frequently go straight to the Linux source code when I'm looking for implementation details. See "ELC: How much memory are applications really using?" for the discussion around adding PSS to smaps, and see "Tracking actual memory utilization" for a discussion of memory that a process uses but that belongs to the kernel (something this article doesn't touch upon).