Memory Management

  • Tuning the memory sub-system can be a complex process.

  • First of all, note that memory usage and I/O throughput are intrinsically related: in most cases, the bulk of memory is used to cache the contents of files on disk. Thus, changing memory parameters can have a large effect on I/O performance, and changing I/O parameters can have an equally large converse effect on the virtual memory sub-system.

    free -m
                total        used        free      shared  buff/cache   available
    Mem:            7763        3178         646        1022        3938        3262
    Swap:           7762        1034        6728
    
    cat /proc/meminfo 
    MemTotal:        7949804 kB
    MemFree:          669748 kB
    MemAvailable:    3355456 kB
    Buffers:              28 kB
    Cached:          3777140 kB
    SwapCached:        13160 kB
    Active:          2357428 kB
    Inactive:        3249488 kB
    Active(anon):    1659132 kB
    Inactive(anon):  1201760 kB
    Active(file):     698296 kB
    Inactive(file):  2047728 kB
    Unevictable:      583624 kB
    Mlocked:             220 kB
    SwapTotal:       7949308 kB
    ...
    UTILITY   PURPOSE                                                                 PACKAGE
    free      Brief summary of memory usage                                           procps
    vmstat    Detailed virtual memory statistics and block I/O, dynamically updated   procps
    pmap      Process memory map                                                      procps
  • The pseudofile /proc/meminfo contains a wealth of information about how memory is being used.
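
  • As a quick illustration of the pmap utility listed in the table above, one can examine the memory map of a running process; using the current shell's PID ($$) here is just one possibility:

    pmap -x $$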

/proc/sys/vm

  • The /proc/sys/vm directory contains many tunable knobs to control the Virtual Memory system.

  • Values can be changed either by writing directly to the entry or by using the sysctl utility, as shown in the example after the listing below.

  • When tweaking parameters in /proc/sys/vm, the usual best practice is to adjust one thing at a time and look for effects. The primary (inter-related) tasks are:

  • Controlling flushing parameters; i.e., how many pages are allowed to be dirty and how often they are flushed out to disk

  • Controlling swap behavior; i.e., to what extent pages that reflect file contents are allowed to remain in memory, as opposed to those that must be swapped out because they have no other backing store

  • Controlling how much memory overcommission is allowed, since many programs never need the full amount of memory they request, particularly because of copy on write (COW) techniques

  • Memory tuning can be subtle: what works in one system situation or load may be far from optimal in other circumstances.

  • Exactly what appears in this directory will depend somewhat on the kernel version. Almost all of the entries are writable (by root).

    ls /proc/sys/vm/
    admin_reserve_kbytes         dirty_ratio                legacy_va_layout           min_unmapped_ratio       numa_zonelist_order       panic_on_oom                   watermark_boost_factor
    compaction_proactiveness     dirtytime_expire_seconds   lowmem_reserve_ratio       mmap_min_addr            oom_dump_tasks            percpu_pagelist_high_fraction  watermark_scale_factor
    compact_memory               dirty_writeback_centisecs  max_map_count              mmap_rnd_bits            oom_kill_allocating_task  stat_interval                  zone_reclaim_mode
    compact_unevictable_allowed  drop_caches                memfd_noexec               mmap_rnd_compat_bits     overcommit_kbytes         stat_refresh
    dirty_background_bytes       extfrag_threshold          memory_failure_early_kill  nr_hugepages             overcommit_memory         swappiness
    dirty_background_ratio       hugetlb_optimize_vmemmap   memory_failure_recovery    nr_hugepages_mempolicy   overcommit_ratio          unprivileged_userfaultfd
    dirty_bytes                  hugetlb_shm_group          min_free_kbytes            nr_overcommit_hugepages  page-cluster              user_reserve_kbytes
    dirty_expire_centisecs       laptop_mode                min_slab_ratio             numa_stat                page_lock_unfairness      vfs_cache_pressure
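
  • For example, these parameters can be read and set as follows; the value of 10 for vm.swappiness is purely illustrative, not a recommendation:

    # Read a value with sysctl, or by examining the entry directly:
    sysctl vm.swappiness
    cat /proc/sys/vm/dirty_ratio

    # Change a value (as root) by writing to the entry ...
    echo 10 > /proc/sys/vm/swappiness

    # ... or equivalently with sysctl:
    sysctl -w vm.swappiness=10

    # To make a setting persist across reboots, add a line such as
    # vm.swappiness = 10 to /etc/sysctl.conf or a file under /etc/sysctl.d/.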

vmstat

  • vmstat is a multi-purpose tool that displays information about memory, paging, I/O, processor activity and processes.
vmstat [options] [delay] [count]

If a delay (in seconds) is given, the report is repeated at that interval count times; if count is not given, vmstat will keep reporting statistics indefinitely, until it is killed by a signal, such as the one generated by Ctrl-C.

vmstat 2 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0 1048576 910912     28 4061280    6   16    62    42   52  151  4  2 94  0  0
 0  0 1048576 940816     28 4040172    0    0     0   266 2874 5571  3  2 94  0  0
 0  0 1048576 939220     28 4042236    0    0     0    44 2850 5257  3  2 95  0  0
 0  0 1048576 938500     28 4042236    0    0     0     0 2695 5135  3  2 95  0  0

vmstat -SM -a 2 4
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
 2  0   1024    825   3128   2305    0    0    62    42   55  157  4  2 94  0  0
 0  0   1024    824   3122   2305    0    0     0    38 2829 5162  3  3 94  0  0
 0  0   1024    836   3122   2305    0    0     0  3438 2983 5966  3  2 94  0  0
 1  0   1024    841   3122   2305    0    0     0    44 2672 5199  3  2 95  0  0

vmstat -p /dev/sda3 2 4
sda3            reads      read sectors      writes  requested writes
               258262          26944496      303063          13001080
               258262          26944496      303083          13001448
               258262          26944496      303108          13001744
               258263          26944504      303137          13004376
  • If the option -S M is given (as in the second example above), memory statistics will be displayed in MiB instead of the default KiB.

  • With the -a option, vmstat displays information about active and inactive memory.

  • Active memory pages are those which have been recently used; they may be clean (disk contents are up to date) or dirty (need to be flushed to disk eventually).

  • By contrast, inactive memory pages have not been recently used; they are more likely to be clean and are released sooner under memory pressure.

  • If you just want some quick statistics on a single partition, use the -p option, as in the third example above.

Using SWAP

Linux employs a virtual memory system, in which the operating system can function as if it had more memory than it really does. This kind of memory overcommission functions in two ways:

  • Many programs do not actually use all the memory they are given permission to use. Sometimes, this is because child processes inherit a copy of the parent’s memory regions utilizing a COW (Copy On Write) technique, in which the child only obtains a unique copy (on a page-by-page basis) when there is a change.
  • When memory pressure becomes significant, less active memory regions may be swapped out to disk, to be recalled only when needed again.

Such swapping is usually done to one or more dedicated partitions or files; Linux permits multiple swap areas, so the needs can be adjusted dynamically. Each area has a priority, and lower priority areas are not used until higher priority areas are filled.

In most situations, the recommended swap size is the total RAM on the system. You can see what your system is currently using for swap areas by looking at the /proc/swaps file, and you can report on current usage with free.

The commands involving swap are:

  • mkswap: format swap partitions or files

  • swapon: activate swap partitions or files

  • swapoff: deactivate swap partitions or files

    cat /proc/swaps
    Filename		  Type		     Size		   Used		    Priority
    /dev/zram0    partition	   7949308	 1111296		100
    
    free -m
            total    used     free    shared  buff/cache   available
    Mem:     7763     3104     1528    707     3129         3532
    Swap:    7762     1085     6677
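
As a sketch of how these commands fit together, a swap file could be created and activated as follows; the path, size, and priority below are purely illustrative:

    # Create a 1 GiB file, restrict its permissions, and format it as swap:
    dd if=/dev/zero of=/swapfile bs=1M count=1024
    chmod 600 /swapfile
    mkswap /swapfile

    # Activate it with a low priority, so it is used only after higher
    # priority areas fill up:
    swapon -p 5 /swapfile

    # Verify, then deactivate when it is no longer needed:
    swapon --show
    swapoff /swapfile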

At any given time, most memory is in use for caching file contents, to avoid going to the disk any more than necessary, or in a sub-optimal order or timing. Such pages of memory are never swapped out, as the backing store is the files themselves, so writing them out to swap would be pointless; instead, dirty pages (memory containing updated file contents that no longer reflect the stored data) are flushed out to disk.
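
One way to observe this flushing (a simple sketch; the counters will vary from run to run) is to check the dirty-page statistics in /proc/meminfo, force a flush with sync, and check them again:

    grep -E '^(Dirty|Writeback):' /proc/meminfo
    sync
    grep -E '^(Dirty|Writeback):' /proc/meminfo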

In Linux, memory used by the kernel itself, as opposed to application memory, is never swapped out, in distinction to some other operating systems.

OOM (Out of Memory) Killer

  • Simplest way to handle memory pressure: Permit memory allocations until all memory is exhausted, then fail.
  • Second simplest way: Use swap space on disk to free up some resident memory. Total available memory is RAM + swap space.
  • Linux allows memory overcommitment, granting memory requests beyond RAM + swap, as many processes don’t use all requested memory.
  • Example:
    • An example would be a program that allocates a 1 MB buffer, and then uses only a few pages of the memory.
    • Another example is that every time a child process is forked, it receives a copy of the entire memory space of the parent. Because Linux uses the COW (copy on write) technique, unless one of the processes modifies memory, no actual copy needs to be made. However, the kernel has to assume that the copy might need to be done.
  • Kernel allows overcommitment only for user process pages; kernel pages are not swappable and are allocated at request time.
  • OOM (Out of Memory) killer selects which processes to terminate during severe memory pressure.

OOM Killer Algorithms:

  • Overcommission can be modified and even turned off by setting the value of /proc/sys/vm/overcommit_memory:
    • 0 (default): Permit overcommission but refuse obvious overcommits. Root users get more memory allocation than normal users.
    • 1 : Allow all memory requests to overcommit.
    • 2 : Turn off overcommission. Memory requests fail when total memory commit reaches swap space + a configurable percentage of RAM (/proc/sys/vm/overcommit_ratio).
  • The heuristic algorithm used is not meant for normal operations, but as a last resort to permit graceful retrenchment or shutdown under severe memory pressure.
  • Process selection is based on a badness value, exposed as /proc/[pid]/oom_score for each process.
  • Adjust /proc/[pid]/oom_score_adj for a task to raise or lower its chances of being selected, as in the sketch below.
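
  • For illustration, the commands below read and change the overcommit policy, then inspect and adjust the badness score of the current shell; the values shown (mode 1, an adjustment of -500) are arbitrary examples, not recommendations:

    # Current overcommit policy and ratio:
    cat /proc/sys/vm/overcommit_memory
    cat /proc/sys/vm/overcommit_ratio

    # Switch to "always overcommit" (mode 1) as root; writing 0 restores the default:
    echo 1 > /proc/sys/vm/overcommit_memory

    # Inspect the badness score of the current shell, then (as root) make it a
    # less likely OOM victim; the range is -1000 to +1000, and -1000 exempts
    # the process entirely:
    cat /proc/$$/oom_score
    echo -500 > /proc/$$/oom_score_adj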