{"id":301,"date":"2022-09-28T19:52:28","date_gmt":"2022-09-29T03:52:28","guid":{"rendered":"https:\/\/stasosphere.com\/entrepreneur-being\/?p=301"},"modified":"2025-07-22T21:12:07","modified_gmt":"2025-07-23T05:12:07","slug":"mmap-memory-leak-investigation","status":"publish","type":"post","link":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/","title":{"rendered":"A Deep Investigation into MMAP Not Leaking Memory"},"content":{"rendered":"\n<p>This write-up demonstrates that while <code>mmap<\/code>&#8217;ed file IO looks like it&#8217;s leaking memory, it actually is not.<\/p>\n<p>HuggingFace&#8217;s <a href=\"https:\/\/github.com\/huggingface\/datasets\">datasets<\/a> project uses MMAP to make datasets available to multiple processes in an efficient way. This is very important since a machine learning training program will typically use a Dataloader with multiple workers, or the same dataset is simply accessed by multiple processes.<\/p>\n<p>An <a href=\"https:\/\/github.com\/huggingface\/datasets\/issues\/4883\">issue was posted<\/a> suggesting that a <code>datasets<\/code>-based program leaks memory with each iteration. This triggered an extensive investigation which established that MMAP doesn&#8217;t leak memory and brought a much deeper understanding of the different components used under the hood of <code>datasets<\/code>.<\/p>\n<p>If you&#8217;d like to gain a deeper understanding of why and how, please read on.<br \/><!--more--><br \/><\/p>\n<h2 id=\"emulating-a-computer-with-just-1gb-of-memory\">Emulating a computer with just 1GB of memory<\/h2>\n<p>Since we don&#8217;t want to crash our computer while debugging memory issues, we are going to emulate a computer with just 1GB of memory and no swap memory. 
Unless a computer is protected against programs using more memory than it has, it will usually start <a href=\"https:\/\/en.wikipedia.org\/wiki\/Thrashing_(computer_science)\">thrashing<\/a> and eventually crash.<\/p>\n<p>To accomplish that we are going to start a cgroups-controlled shell which will kill any program started from it that consumes more than 1GB of memory (and we give it no swap memory either):<\/p>\n<pre>$ systemd-run --user --scope -<span class=\"hljs-selector-tag\">p<\/span> MemoryHigh=<span class=\"hljs-number\">1<\/span>G -<span class=\"hljs-selector-tag\">p<\/span> MemoryMax=<span class=\"hljs-number\">1<\/span>G -<span class=\"hljs-selector-tag\">p<\/span> MemorySwapMax=<span class=\"hljs-number\">0<\/span>G --setenv=<span class=\"hljs-string\">\"MEMLIMIT=1GB\"<\/span> bash\n<\/pre>\n<p>I&#8217;m setting the <code>MEMLIMIT=1GB<\/code> env variable so that at any moment I can check whether I&#8217;m in the right shell by printing it:<\/p>\n<pre>$ <span class=\"hljs-built_in\">echo<\/span> <span class=\"hljs-variable\">$MEMLIMIT<\/span>\n1GB<\/pre>\n<p>Let&#8217;s validate that this shell allows a program to allocate under 1GB of RSS RAM, but will kill it if it tries to allocate more than that:<\/p>\n<pre># <span class=\"hljs-number\">7<\/span> * <span class=\"hljs-number\">128<\/span>M chars\n<span class=\"hljs-string\">$ <\/span>python -c <span class=\"hljs-comment\">\"import sys, os, psutil; a='a'*7*2**27; print(f'{psutil.Process(os.getpid()).memory_info().rss &gt;&gt; 20}MB');\"<\/span>\n<span class=\"hljs-number\">908<\/span>MB\n\n# <span class=\"hljs-number\">8<\/span> * <span class=\"hljs-number\">128<\/span>M chars\n<span class=\"hljs-string\">$ <\/span>python -c <span class=\"hljs-comment\">\"import sys, os, psutil; a='a'*8*2**27; print(f'{psutil.Process(os.getpid()).memory_info().rss &gt;&gt; 20}MB');\"<\/span>\n<span class=\"hljs-type\">Killed<\/span>\n<\/pre>\n<p>So we can see that allocating under ~1GB works, but 
an allocation that asks for more than 1GB of resident memory gets killed.<\/p>\n<p>In the rest of this write-up let&#8217;s use shell A, which is unlimited (or rather limited to the memory actually available on your computer), and shell B, where a program started from it can only allocate 1GB of resident memory.<\/p>\n<p>Sidenote: Linux memory management and reporting is super-complicated and one could probably easily write a whole book about it. Resident Set Size (RSS) is typically the easiest metric to use for measuring the approximate actual memory usage of a program. It doesn&#8217;t tell you the whole truth, but most of the time it&#8217;s good enough to detect memory leaks. Therefore this is the metric we are going to use in this write-up.<\/p>\n<h2 id=\"simple-io-debug-program\">Simple IO debug program<\/h2>\n<p>Now let&#8217;s write a simple debug program that creates a file with a few very large lines and then reads them sequentially using normal IO. If we pass <code>--mmap<\/code> it switches to memory-mapped IO via the <code>mmap<\/code> module.<\/p>\n<p>Additionally, if the <code>--accumulate<\/code> flag is passed, the program will accumulate the lines it reads into a single string.<\/p>\n<pre>$ cat mmap-no-leak-debug.py\n<span class=\"hljs-keyword\">import<\/span> gc\n<span class=\"hljs-keyword\">import<\/span> mmap\n<span class=\"hljs-keyword\">import<\/span> os\n<span class=\"hljs-keyword\">import<\/span> psutil\n<span class=\"hljs-keyword\">import<\/span> sys\n\nPATH = <span class=\"hljs-string\">\".\/tmp.txt\"<\/span>\n<span class=\"hljs-comment\"># create a large data file with a few long lines<\/span>\n<span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-keyword\">not<\/span> os.path.exists(PATH):\n    <span class=\"hljs-keyword\">with<\/span> open(PATH, <span class=\"hljs-string\">\"w\"<\/span>) <span class=\"hljs-keyword\">as<\/span> fh:\n        s = <span class=\"hljs-string\">'a'<\/span>* <span 
class=\"hljs-number\">2<\/span>**<span class=\"hljs-number\">27<\/span> + <span class=\"hljs-string\">\"\\n\"<\/span> <span class=\"hljs-comment\"># 128MB<\/span>\n        <span class=\"hljs-comment\"># write ~2GB file<\/span>\n        <span class=\"hljs-keyword\">for<\/span> i <span class=\"hljs-keyword\">in<\/span> range(<span class=\"hljs-number\">16<\/span>):\n            fh.write(s)\n\nproc = psutil.Process(os.getpid())\n<span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">mem_read<\/span><span class=\"hljs-params\">()<\/span>:<\/span>\n    gc.collect()\n    <span class=\"hljs-keyword\">return<\/span> proc.memory_info().rss \/ <span class=\"hljs-number\">2<\/span>**<span class=\"hljs-number\">20<\/span>\n\nprint(f<span class=\"hljs-string\">\"{'idx':&gt;4} {'RSS':&gt;10}   {'\u0394 RSS':&gt;12}   {'\u0394 accumulated':&gt;10}\"<\/span>)\n\ncontent = <span class=\"hljs-string\">''<\/span>\nmem_after = mem_before_acc = mem_after_acc = mem_before = proc.memory_info().rss \/ <span class=\"hljs-number\">2<\/span>**<span class=\"hljs-number\">20<\/span>\nprint(f<span class=\"hljs-string\">\"{0:4d} {mem_after:10.2f}MB {mem_after - 0:10.2f}MB {0:10.2f}MB\"<\/span>)\n\nmmap_mode = <span class=\"hljs-keyword\">True<\/span> <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-string\">\"--mmap\"<\/span> <span class=\"hljs-keyword\">in<\/span> sys.argv <span class=\"hljs-keyword\">else<\/span> <span class=\"hljs-keyword\">False<\/span>\n\n<span class=\"hljs-keyword\">with<\/span> open(PATH, <span class=\"hljs-string\">\"r\"<\/span>) <span class=\"hljs-keyword\">as<\/span> fh:\n\n    <span class=\"hljs-keyword\">if<\/span> mmap_mode:\n        mm = mmap.mmap(fh.fileno(), <span class=\"hljs-number\">0<\/span>, access=mmap.ACCESS_READ)\n\n    idx = <span class=\"hljs-number\">0<\/span>\n    <span class=\"hljs-keyword\">while<\/span> <span class=\"hljs-keyword\">True<\/span>:\n        idx += <span 
class=\"hljs-number\">1<\/span>\n        mem_before = mem_read()\n        line = mm.readline() <span class=\"hljs-keyword\">if<\/span> mmap_mode <span class=\"hljs-keyword\">else<\/span> fh.readline()\n        <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-keyword\">not<\/span> line:\n            <span class=\"hljs-keyword\">break<\/span>\n\n        <span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-string\">\"--accumulate\"<\/span> <span class=\"hljs-keyword\">in<\/span> sys.argv:\n            mem_before_acc = mem_read()\n            content += str(line)\n            mem_after_acc = mem_read()\n\n        mem_after = mem_read()\n\n        print(f<span class=\"hljs-string\">\"{idx:4d} {mem_after:10.2f}MB {mem_after - mem_before:10.2f}MB {mem_after_acc - mem_before_acc:10.2f}MB\"<\/span>)<\/pre>\n<p>The four output columns are:<\/p>\n<pre> <span class=\"hljs-attribute\">idx<\/span>        RSS          \u0394 RSS   \u0394 accumulated<\/pre>\n<ol>\n<li>the line number (starting from 1)<\/li>\n<li>the total RSS reported at the end of each iteration<\/li>\n<li>the RSS delta of each iteration<\/li>\n<li>the accumulated buffer delta<\/li>\n<\/ol>\n<p>And as you can see we force Python&#8217;s garbage collection via <code>gc.collect()<\/code> before taking RSS (Resident Set Size) measurements. 
This is a crucial step when debugging memory usage, and leaks in particular, especially if you delete some objects and want to make sure that their memory is actually freed, since Python&#8217;s garbage collection is not immediate.<\/p>\n<h2 id=\"normal-io-diagnostics\">Normal IO diagnostics<\/h2>\n<p>First, let&#8217;s run normal IO without accumulating any strings, simply discarding each line after it is read.<\/p>\n<pre>shell A $ python mmap-no-leak-debug.py\n idx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.37<\/span>MB      <span class=\"hljs-number\">12.37<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">269.66<\/span>MB     <span class=\"hljs-number\">257.29<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">269.68<\/span>MB       <span class=\"hljs-number\">0.02<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">269.68<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">4<\/span>     <span class=\"hljs-number\">269.69<\/span>MB       <span class=\"hljs-number\">0.01<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">5<\/span>     <span class=\"hljs-number\">269.69<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">6<\/span>     <span class=\"hljs-number\">269.70<\/span>MB       <span class=\"hljs-number\">0.01<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">7<\/span>     <span class=\"hljs-number\">269.70<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span 
class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">8<\/span>     <span class=\"hljs-number\">269.70<\/span>MB       <span class=\"hljs-number\">0.01<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">9<\/span>     <span class=\"hljs-number\">269.70<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">10<\/span>     <span class=\"hljs-number\">269.71<\/span>MB       <span class=\"hljs-number\">0.01<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">11<\/span>     <span class=\"hljs-number\">269.71<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">12<\/span>     <span class=\"hljs-number\">269.71<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">13<\/span>     <span class=\"hljs-number\">269.71<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">14<\/span>     <span class=\"hljs-number\">269.71<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">15<\/span>     <span class=\"hljs-number\">269.71<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">16<\/span>     <span class=\"hljs-number\">145.96<\/span>MB    <span class=\"hljs-number\">-123.75<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB<\/pre>\n<p>In a loop, we read a 128MB line and discard it.<\/p>\n<p>We can see that the memory usage is low and steady, with some fluctuations when Python decides to release some memory. 
The program allocates more than 128MB due to a new line character in the string &#8211; this is a peculiar Python behavior.<\/p>\n<p>The bottom line is that the program doesn&#8217;t appear to be leaking any memory.<\/p>\n<h2 id=\"mmap-ed-io-diagnostics\">MMAP&#8217;ed IO diagnostics<\/h2>\n<p>Now let&#8217;s do the exact same operation but this time using <code>mmap<\/code>&#8216;s IO:<\/p>\n<pre>shell A $ python mmap-no-leak-debug.py --mmap\nidx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.39<\/span>MB      <span class=\"hljs-number\">12.39<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">268.25<\/span>MB     <span class=\"hljs-number\">255.87<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">396.47<\/span>MB     <span class=\"hljs-number\">128.22<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">524.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">4<\/span>     <span class=\"hljs-number\">652.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">5<\/span>     <span class=\"hljs-number\">780.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">6<\/span>     <span class=\"hljs-number\">908.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">7<\/span>    <span class=\"hljs-number\">1036.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span 
class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">8<\/span>    <span class=\"hljs-number\">1164.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">9<\/span>    <span class=\"hljs-number\">1292.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">10<\/span>    <span class=\"hljs-number\">1420.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">11<\/span>    <span class=\"hljs-number\">1548.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">12<\/span>    <span class=\"hljs-number\">1676.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">13<\/span>    <span class=\"hljs-number\">1804.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">14<\/span>    <span class=\"hljs-number\">1932.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">15<\/span>    <span class=\"hljs-number\">2060.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">16<\/span>    <span class=\"hljs-number\">2188.47<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB<\/pre>\n<p>Whoah! It looks like there is a major leak here. On each iteration the program keeps on growing by 128MB despite us discarding the read data. 
What&#8217;s going on?<\/p>\n<p>The theoretical explanation is simple: MMAP was designed to make IO faster and shareable between multiple processes, so if there is a lot of available RAM, the MMAP API will use as much of it as it can and, in order to speed things up, it won&#8217;t normally release it back to the OS. For example, if you have two programs reading the same sections of the same MMAP&#8217;ed file, only the first program will incur the delay of copying the data from disk to RAM. The other program will read it directly from RAM. Since MMAP doesn&#8217;t know which sections will be accessed next, it simply keeps everything it has read in memory as long as there is enough of it.<\/p>\n<p>You might say that this is a terrible design. But wait: it only keeps the data in memory if nobody else wants that memory, and it releases the unused memory back to the operating system as soon as such demand arises.<\/p>\n<h2 id=\"proof-that-there-is-no-leak\">Proof that there is no leak<\/h2>\n<p>To show that the memory does get released as soon as it&#8217;s needed, let&#8217;s re-run this same program in shell B, where only 1GB of memory is allowed to be allocated.<\/p>\n<pre>shell B $ systemd-run --user --scope -p MemoryHigh=<span class=\"hljs-number\">1<\/span>G -p MemoryMax=<span class=\"hljs-number\">1<\/span>G -p MemorySwapMax=<span class=\"hljs-number\">0<\/span>G --setenv=<span class=\"hljs-string\">\"MEMLIMIT=1GB\"<\/span> bash\nshell B $ python mmap-no-leak-debug.py --mmap\n idx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.48<\/span>MB      <span class=\"hljs-number\">12.48<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">268.51<\/span>MB     <span class=\"hljs-number\">256.03<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span 
class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">396.73<\/span>MB     <span class=\"hljs-number\">128.22<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">524.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">4<\/span>     <span class=\"hljs-number\">652.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">5<\/span>     <span class=\"hljs-number\">780.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">6<\/span>     <span class=\"hljs-number\">908.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">7<\/span>    <span class=\"hljs-number\">1036.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">8<\/span>    <span class=\"hljs-number\">1164.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">9<\/span>    <span class=\"hljs-number\">1292.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">10<\/span>    <span class=\"hljs-number\">1420.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">11<\/span>    <span class=\"hljs-number\">1548.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">12<\/span>    <span class=\"hljs-number\">1676.73<\/span>MB     <span 
class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">13<\/span>    <span class=\"hljs-number\">1804.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">14<\/span>    <span class=\"hljs-number\">1932.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">15<\/span>    <span class=\"hljs-number\">2060.73<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n  <span class=\"hljs-number\">16<\/span>    <span class=\"hljs-number\">2188.69<\/span>MB     <span class=\"hljs-number\">127.95<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB<\/pre>\n<p>Surprise! The program appears to have allocated &gt;2GB of memory, even though we verified earlier that it should have been killed as soon as it reached 1GB of RSS, since we limited the shell to allow only &lt;1GB of memory allocation!<\/p>\n<p>We will understand better shortly what&#8217;s going on, but it&#8217;s clear that cgroups, which controls the memory usage, is aware that while the MMAP&#8217;ed memory is counted towards the RSS of the program, the program itself isn&#8217;t using most of this memory!<\/p>\n<p>Interim observation: we can&#8217;t rely on RSS memory stats to diagnose memory leaks when MMAP is used.<\/p>\n<h2 id=\"let-s-create-memory-pressure\">Let&#8217;s create memory pressure<\/h2>\n<p>This is where our <code>--accumulate<\/code> flag comes in. 
It&#8217;s going to help us to see that RSS is &#8220;misreporting&#8221; the actual memory used by the program.<\/p>\n<p>First we run it with normal IO:<\/p>\n<pre>shell A $ python mmap-no-leak-debug.py --accumulate\n idx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.30<\/span>MB      <span class=\"hljs-number\">12.30<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">269.60<\/span>MB     <span class=\"hljs-number\">257.29<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">525.49<\/span>MB     <span class=\"hljs-number\">255.89<\/span>MB     <span class=\"hljs-number\">127.93<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">653.49<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">4<\/span>     <span class=\"hljs-number\">781.50<\/span>MB     <span class=\"hljs-number\">128.01<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">5<\/span>     <span class=\"hljs-number\">909.50<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">6<\/span>    <span class=\"hljs-number\">1037.51<\/span>MB     <span class=\"hljs-number\">128.01<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">7<\/span>    <span class=\"hljs-number\">1165.51<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">8<\/span>    <span class=\"hljs-number\">1293.52<\/span>MB     <span class=\"hljs-number\">128.01<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span 
class=\"hljs-number\">9<\/span>    <span class=\"hljs-number\">1421.52<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">10<\/span>    <span class=\"hljs-number\">1549.53<\/span>MB     <span class=\"hljs-number\">128.01<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">11<\/span>    <span class=\"hljs-number\">1677.53<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">12<\/span>    <span class=\"hljs-number\">1805.53<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">13<\/span>    <span class=\"hljs-number\">1933.53<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">14<\/span>    <span class=\"hljs-number\">2061.53<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">15<\/span>    <span class=\"hljs-number\">2189.53<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n  <span class=\"hljs-number\">16<\/span>    <span class=\"hljs-number\">2193.78<\/span>MB       <span class=\"hljs-number\">4.25<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB<\/pre>\n<p>Here RSS correctly reports about <code>128*16 = 2048<\/code>MB, plus some more for the other bits of the program, so the ballpark matches.<\/p>\n<p>Now let&#8217;s activate MMAP and re-run:<\/p>\n<pre>shell A $ python mmap-no-leak-debug.py --mmap --accumulate\n idx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.37<\/span>MB      <span class=\"hljs-number\">12.37<\/span>MB       <span 
class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">396.39<\/span>MB     <span class=\"hljs-number\">384.02<\/span>MB     <span class=\"hljs-number\">128.13<\/span>MB\n   <span class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">652.48<\/span>MB     <span class=\"hljs-number\">256.09<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">908.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">4<\/span>    <span class=\"hljs-number\">1164.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">5<\/span>    <span class=\"hljs-number\">1420.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">6<\/span>    <span class=\"hljs-number\">1676.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">7<\/span>    <span class=\"hljs-number\">1932.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">8<\/span>    <span class=\"hljs-number\">2188.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">9<\/span>    <span class=\"hljs-number\">2444.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">10<\/span>    <span class=\"hljs-number\">2700.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">11<\/span>    <span 
class=\"hljs-number\">2956.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">12<\/span>    <span class=\"hljs-number\">3212.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">13<\/span>    <span class=\"hljs-number\">3468.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">14<\/span>    <span class=\"hljs-number\">3724.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">15<\/span>    <span class=\"hljs-number\">3980.48<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n  <span class=\"hljs-number\">16<\/span>    <span class=\"hljs-number\">4236.46<\/span>MB     <span class=\"hljs-number\">255.98<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB<\/pre>\n<p>Here we can see that RSS reports about twice as much memory as the program actually uses.<\/p>\n<p>And now let&#8217;s create pressure using our 1GB-limited shell B and use normal IO with accumulation:<\/p>\n<pre>shell B $ systemd-run --user --scope -p MemoryHigh=<span class=\"hljs-number\">1<\/span>G -p MemoryMax=<span class=\"hljs-number\">1<\/span>G -p MemorySwapMax=<span class=\"hljs-number\">0<\/span>G --setenv=<span class=\"hljs-string\">\"MEMLIMIT=1GB\"<\/span> bash\nshell B $ python mmap-no-leak-debug.py --accumulate\n idx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.38<\/span>MB      <span class=\"hljs-number\">12.38<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">269.41<\/span>MB     <span class=\"hljs-number\">257.04<\/span>MB       <span 
class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">525.55<\/span>MB     <span class=\"hljs-number\">256.14<\/span>MB     <span class=\"hljs-number\">127.93<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">653.55<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">4<\/span>     <span class=\"hljs-number\">781.56<\/span>MB     <span class=\"hljs-number\">128.01<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\n   <span class=\"hljs-number\">5<\/span>     <span class=\"hljs-number\">909.56<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB     <span class=\"hljs-number\">127.87<\/span>MB\nKilled<\/pre>\n<p>As you can easily see, the program gets killed once it reaches 1GB of RSS. It managed to complete 5 iterations; on iteration 6 it tries to hold the accumulated <code>6*128=768<\/code>MB plus the current <code>readline<\/code> read of 128MB, plus the memory used by the rest of the program, which crosses 1GB, so it gets killed before finishing iteration 6.<\/p>\n<p>It might also be useful to compare this with the same run in shell A. You can see that, up to the point where the program gets killed, the RSS of the shell B run closely matches that of shell A. 
The reported RSS grows at the same rate in both shells.<\/p>\n<p>Now let&#8217;s run the MMAP&#8217;ed version:<\/p>\n<pre>shell B $ systemd-run --user --scope -p MemoryHigh=<span class=\"hljs-number\">1<\/span>G -p MemoryMax=<span class=\"hljs-number\">1<\/span>G -p MemorySwapMax=<span class=\"hljs-number\">0<\/span>G --setenv=<span class=\"hljs-string\">\"MEMLIMIT=1GB\"<\/span> bash\nshell B $ python mmap-no-leak-debug.py --mmap --accumulate\n idx        RSS          \u0394 RSS   \u0394 accumulated\n   <span class=\"hljs-number\">0<\/span>      <span class=\"hljs-number\">12.51<\/span>MB      <span class=\"hljs-number\">12.51<\/span>MB       <span class=\"hljs-number\">0.00<\/span>MB\n   <span class=\"hljs-number\">1<\/span>     <span class=\"hljs-number\">396.52<\/span>MB     <span class=\"hljs-number\">384.00<\/span>MB     <span class=\"hljs-number\">128.13<\/span>MB\n   <span class=\"hljs-number\">2<\/span>     <span class=\"hljs-number\">652.60<\/span>MB     <span class=\"hljs-number\">256.08<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">3<\/span>     <span class=\"hljs-number\">908.60<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">4<\/span>    <span class=\"hljs-number\">1164.60<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\n   <span class=\"hljs-number\">5<\/span>    <span class=\"hljs-number\">1420.60<\/span>MB     <span class=\"hljs-number\">256.00<\/span>MB     <span class=\"hljs-number\">128.00<\/span>MB\nKilled\n<\/pre>\n<p>You can see that while the RSS numbers are bigger than those of the normal IO run, the program gets killed in the exact same iteration, 
which tells us the actual memory usage with normal IO and mmap&#8217;ed IO is either very similar or very likely exactly the same.<\/p>\n<h2 id=\"enter-huggingface-datasets\">What about PyArrow?<\/h2>\n<p>Originally this whole research started from this <a href=\"https:\/\/github.com\/huggingface\/datasets\/issues\/4883\">Issue<\/a> in the <a href=\"https:\/\/github.com\/huggingface\/datasets\"><code>datasets<\/code><\/a> repo. It looked like a dataset loaded via <code>pyarrow<\/code> leaked on every iteration.<\/p>\n<p><a href=\"https:\/\/github.com\/lhoestq\">Quentin Lhoest<\/a> reduced it to <a href=\"https:\/\/github.com\/huggingface\/datasets\/issues\/4883#issuecomment-1242034985\">a simple <code>pyarrow<\/code> program<\/a><\/p>\n<pre>$ cat mmap-no-leak-<span class=\"hljs-built_in\">debug<\/span>-pyarrow.py\n<span class=\"hljs-keyword\">import<\/span> psutil\n<span class=\"hljs-keyword\">import<\/span> <span class=\"hljs-built_in\">os<\/span>\n<span class=\"hljs-keyword\">import<\/span> gc\n<span class=\"hljs-keyword\">import<\/span> pyarrow as pa\n\nARROW_PATH = <span class=\"hljs-string\">\"tmp.arrow\"<\/span>\n\n<span class=\"hljs-keyword\">if<\/span> <span class=\"hljs-keyword\">not<\/span> <span class=\"hljs-built_in\">os<\/span>.path.exists(ARROW_PATH):\n    arr = pa.array([b<span class=\"hljs-string\">\"a\"<\/span> * (<span class=\"hljs-number\">200<\/span> * <span class=\"hljs-number\">1024<\/span>)] * <span class=\"hljs-number\">1000<\/span>)  # ~<span class=\"hljs-number\">200<\/span>MB\n    <span class=\"hljs-built_in\">table<\/span> = pa.<span class=\"hljs-built_in\">table<\/span>({<span class=\"hljs-string\">\"a\"<\/span>: arr})\n\n    with open(ARROW_PATH, <span class=\"hljs-string\">\"wb\"<\/span>) as <span class=\"hljs-name\">f<\/span>:\n        writer = pa.RecordBatchStreamWriter(f, schema=<span class=\"hljs-built_in\">table<\/span>.schema)\n        writer.write_table(<span class=\"hljs-built_in\">table<\/span>)\n        writer.close()\n\ndef 
memory_mapped_arrow_table_from_file(<span class=\"hljs-name\">filename<\/span>: str) -&gt; pa.<span class=\"hljs-name\">Table<\/span>:\n    memory_mapped_stream = pa.memory_map(filename)\n    opened_stream = pa.ipc.open_stream(memory_mapped_stream)\n    pa_table = opened_stream.read_all()\n    <span class=\"hljs-keyword\">return<\/span> pa_table\n\n\n<span class=\"hljs-built_in\">table<\/span> = memory_mapped_arrow_table_from_file(ARROW_PATH)\narr = <span class=\"hljs-built_in\">table<\/span>[<span class=\"hljs-number\">0<\/span>]\n\n<span class=\"hljs-built_in\">print<\/span>(f<span class=\"hljs-string\">\"{'idx':&gt;8} {'RSS':&gt;10} {'\u0394 RSS':&gt;15}\"<\/span>)\n\nmem_before = psutil.Process(<span class=\"hljs-built_in\">os<\/span>.getpid()).memory_info().rss \/ (<span class=\"hljs-number\">1024<\/span> * <span class=\"hljs-number\">1024<\/span>)\n<span class=\"hljs-keyword\">for<\/span> idx, x <span class=\"hljs-keyword\">in<\/span> enumerate(arr):\n    <span class=\"hljs-keyword\">if<\/span> idx % <span class=\"hljs-number\">100<\/span> == <span class=\"hljs-number\">0<\/span>:\n        gc.collect()\n        mem_after = psutil.Process(<span class=\"hljs-built_in\">os<\/span>.getpid()).memory_info().rss \/ (<span class=\"hljs-number\">1024<\/span> * <span class=\"hljs-number\">1024<\/span>)\n        <span class=\"hljs-built_in\">print<\/span>(f<span class=\"hljs-string\">\"{idx:4d}  {mem_after:12.4f}MB {mem_after - mem_before:12.4f}MB\"<\/span>)\n<\/pre>\n<p>which when run produced the familiar leak-like pattern:<\/p>\n<pre>$ python mmap-no-leak-debug-pyarrow.py\n     idx        RSS           \u0394 RSS\n   <span class=\"hljs-number\">0<\/span>       <span class=\"hljs-number\">51.3164<\/span>MB       <span class=\"hljs-number\">2.5430<\/span>MB\n <span class=\"hljs-number\">100<\/span>       <span class=\"hljs-number\">69.9805<\/span>MB      <span class=\"hljs-number\">21.2070<\/span>MB\n <span class=\"hljs-number\">200<\/span>       <span 
class=\"hljs-number\">90.6055<\/span>MB      <span class=\"hljs-number\">41.8320<\/span>MB\n <span class=\"hljs-number\">300<\/span>      <span class=\"hljs-number\">107.1055<\/span>MB      <span class=\"hljs-number\">58.3320<\/span>MB\n <span class=\"hljs-number\">400<\/span>      <span class=\"hljs-number\">127.7305<\/span>MB      <span class=\"hljs-number\">78.9570<\/span>MB\n <span class=\"hljs-number\">500<\/span>      <span class=\"hljs-number\">148.3555<\/span>MB      <span class=\"hljs-number\">99.5820<\/span>MB\n <span class=\"hljs-number\">600<\/span>      <span class=\"hljs-number\">164.8555<\/span>MB     <span class=\"hljs-number\">116.0820<\/span>MB\n <span class=\"hljs-number\">700<\/span>      <span class=\"hljs-number\">185.4805<\/span>MB     <span class=\"hljs-number\">136.7070<\/span>MB\n <span class=\"hljs-number\">800<\/span>      <span class=\"hljs-number\">206.1055<\/span>MB     <span class=\"hljs-number\">157.3320<\/span>MB\n <span class=\"hljs-number\">900<\/span>      <span class=\"hljs-number\">226.7305<\/span>MB     <span class=\"hljs-number\">177.9570<\/span>MB\n<\/pre>\n<p>But if we run it from a shell that is only allowed 100MB of allocated memory:<\/p>\n<pre>$ systemd-run --user --scope -p MemoryHigh=<span class=\"hljs-number\">0.1<\/span>G -p MemoryMax=<span class=\"hljs-number\">0.1<\/span>G -p MemorySwapMax=<span class=\"hljs-number\">0<\/span>G --setenv=<span class=\"hljs-string\">\"MEMLIMIT=0.1GB\"<\/span> bash\n$ python mmap-no-leak-debug-pyarrow.py\n     idx        RSS           \u0394 RSS\n   <span class=\"hljs-number\">0<\/span>       <span class=\"hljs-number\">51.2852<\/span>MB       <span class=\"hljs-number\">2.4609<\/span>MB\n <span class=\"hljs-number\">100<\/span>       <span class=\"hljs-number\">70.4102<\/span>MB      <span class=\"hljs-number\">21.5859<\/span>MB\n <span class=\"hljs-number\">200<\/span>       <span class=\"hljs-number\">86.9102<\/span>MB      <span class=\"hljs-number\">38.0859<\/span>MB\n <span 
class=\"hljs-number\">300<\/span>      <span class=\"hljs-number\">107.5352<\/span>MB      <span class=\"hljs-number\">58.7109<\/span>MB\n <span class=\"hljs-number\">400<\/span>      <span class=\"hljs-number\">128.1602<\/span>MB      <span class=\"hljs-number\">79.3359<\/span>MB\n <span class=\"hljs-number\">500<\/span>      <span class=\"hljs-number\">148.7852<\/span>MB      <span class=\"hljs-number\">99.9609<\/span>MB\n <span class=\"hljs-number\">600<\/span>      <span class=\"hljs-number\">165.2852<\/span>MB     <span class=\"hljs-number\">116.4609<\/span>MB\n <span class=\"hljs-number\">700<\/span>      <span class=\"hljs-number\">185.9102<\/span>MB     <span class=\"hljs-number\">137.0859<\/span>MB\n <span class=\"hljs-number\">800<\/span>      <span class=\"hljs-number\">206.5352<\/span>MB     <span class=\"hljs-number\">157.7109<\/span>MB\n <span class=\"hljs-number\">900<\/span>      <span class=\"hljs-number\">227.1602<\/span>MB     <span class=\"hljs-number\">178.3359<\/span>MB<\/pre>\n<p>So it reports it allocated ~200MB of RSS, yet it runs just fine without getting killed.<\/p>\n<p>There is no leak here.<\/p>\n<h2 id=\"what-about-datasets-\">What about HuggingFace datasets?<\/h2>\n<p>In another <a href=\"https:\/\/github.com\/huggingface\/datasets\/issues\/4528\">Issue<\/a> a very similar datasets-iterator-is-leaking report was submitted.<\/p>\n<p>So let&#8217;s use a similar <code>datasets<\/code> reproduction example here but we will use a larger dataset.<\/p>\n<pre>$ cat mmap-no-leak-debug-datasets.py\n<span class=\"hljs-keyword\">from<\/span> datasets <span class=\"hljs-keyword\">import<\/span> load_dataset\n<span class=\"hljs-keyword\">import<\/span> gc\n<span class=\"hljs-keyword\">import<\/span> os\n<span class=\"hljs-keyword\">import<\/span> psutil\n<span class=\"hljs-keyword\">import<\/span> sys\n\nkeep_in_memory = <span class=\"hljs-keyword\">True<\/span> <span class=\"hljs-keyword\">if<\/span> <span 
class=\"hljs-string\">\"--in-mem\"<\/span> <span class=\"hljs-keyword\">in<\/span> sys.argv <span class=\"hljs-keyword\">else<\/span> <span class=\"hljs-keyword\">False<\/span>\n\nproc = psutil.Process(os.getpid())\n<span class=\"hljs-function\"><span class=\"hljs-keyword\">def<\/span> <span class=\"hljs-title\">mem_read<\/span><span class=\"hljs-params\">()<\/span>:<\/span>\n    gc.collect()\n    <span class=\"hljs-keyword\">return<\/span> proc.memory_info().rss \/ <span class=\"hljs-number\">2<\/span>**<span class=\"hljs-number\">20<\/span>\n\ndataset = load_dataset(<span class=\"hljs-string\">\"wmt19\"<\/span>, <span class=\"hljs-string\">'cs-en'<\/span>, keep_in_memory=keep_in_memory, streaming=False)[<span class=\"hljs-string\">'train'<\/span>]\nprint(f<span class=\"hljs-string\">\"Dataset len={len(dataset)}\"<\/span>)\n\nprint(f<span class=\"hljs-string\">\"{'idx':&gt;8} {'RSS':&gt;10} {'\u0394 RSS':&gt;15}\"<\/span>)\nstep = <span class=\"hljs-number\">1<\/span>_000_000\nmem_start = <span class=\"hljs-number\">0<\/span>\n<span class=\"hljs-keyword\">for<\/span> idx, i <span class=\"hljs-keyword\">in<\/span> enumerate(range(<span class=\"hljs-number\">0<\/span>, len(dataset), step)):\n    <span class=\"hljs-keyword\">if<\/span> idx == <span class=\"hljs-number\">4<\/span>: <span class=\"hljs-comment\"># skip the first few iterations while things get set up<\/span>\n        mem_start = mem_read()\n    mem_before = mem_read()\n    x = dataset[i:i+step]\n    mem_after = mem_read()\n    print(f<span class=\"hljs-string\">\"{idx:8d} {mem_after:12.4f}MB {mem_after - mem_before:12.4f}MB \"<\/span>)\nmem_end = mem_read()\n\nprint(f<span class=\"hljs-string\">\"Total diff: {mem_end - mem_start:12.4f}MB \"<\/span>)<\/pre>\n<p>Let&#8217;s run it in a normal shell first:<\/p>\n<pre>$ python mmap-no-leak-debug-datasets.py\nDataset len=7270695\n     idx        RSS           \u0394 RSS\n      <span class=\"hljs-number\"> 0 <\/span>    775.7773MB     609.9805MB\n      
<span class=\"hljs-number\"> 1 <\/span>    849.6016MB      73.8242MB\n      <span class=\"hljs-number\"> 2 <\/span>    876.1445MB      26.5430MB\n      <span class=\"hljs-number\"> 3 <\/span>    941.3477MB      65.2031MB\n      <span class=\"hljs-number\"> 4 <\/span>    984.9570MB      43.6094MB\n      <span class=\"hljs-number\"> 5 <\/span>   1053.6445MB      68.6875MB\n      <span class=\"hljs-number\"> 6 <\/span>   1164.2852MB     110.6406MB\n      <span class=\"hljs-number\"> 7 <\/span>   1252.5312MB      88.2461MB\n      <span class=\"hljs-number\"> 8 <\/span>   1368.6523MB     116.1211MB\n      <span class=\"hljs-number\"> 9 <\/span>   1445.7266MB      77.0742MB\n     <span class=\"hljs-number\"> 10 <\/span>   1564.5195MB     118.7930MB\n     <span class=\"hljs-number\"> 11 <\/span>   1678.7500MB     114.2305MB\n     <span class=\"hljs-number\"> 12 <\/span>   1729.9844MB      51.2344MB\n     <span class=\"hljs-number\"> 13 <\/span>   1866.1953MB     136.2109MB\nTotal diff:    1700.3984MB<\/pre>\n<p>You can see the mid-column of total RSS memory keeps on growing in MBs. 
The last column is by how much it has grown during a single iteration of the script (0.5M items).<\/p>\n<p>And now let&#8217;s run in a 1GB limited shell:<\/p>\n<pre>$ systemd-run --user --scope -p MemoryHigh=1G -p MemoryMax=1G -p MemorySwapMax=0G --setenv=\"MEMLIMIT=1GB\" bash\n$ python mmap-no-leak-debug-datasets.py\nDataset len=7270695\n     idx        RSS           \u0394 RSS\n      <span class=\"hljs-number\"> 0 <\/span>    775.8516MB     610.1797MB\n      <span class=\"hljs-number\"> 1 <\/span>    849.5820MB      73.7305MB\n      <span class=\"hljs-number\"> 2 <\/span>    876.1328MB      26.5508MB\n      <span class=\"hljs-number\"> 3 <\/span>    941.3281MB      65.1953MB\n      <span class=\"hljs-number\"> 4 <\/span>    984.9375MB      43.6094MB\n      <span class=\"hljs-number\"> 5 <\/span>   1053.6328MB      68.6953MB\n      <span class=\"hljs-number\"> 6 <\/span>   1164.0273MB     110.3945MB\n      <span class=\"hljs-number\"> 7 <\/span>   1252.5273MB      88.5000MB\n      <span class=\"hljs-number\"> 8 <\/span>   1368.3906MB     115.8633MB\n      <span class=\"hljs-number\"> 9 <\/span>   1445.7188MB      77.3281MB\n     <span class=\"hljs-number\"> 10 <\/span>   1564.2656MB     118.5469MB\n     <span class=\"hljs-number\"> 11 <\/span>   1678.7383MB     114.4727MB\n     <span class=\"hljs-number\"> 12 <\/span>   1729.7227MB      50.9844MB\n     <span class=\"hljs-number\"> 13 <\/span>   1866.1875MB     136.4648MB\nTotal diff:    1700.5156MB<\/pre>\n<p>No problem at all.<\/p>\n<p>So we now know there is no leak there; the OS simply includes in RSS memory that will be released as soon as it&#8217;s needed.<\/p>\n<h2 id=\"debbuging-real-leak-while-using-mmap\">How to debug real memory leaks while using MMAP<\/h2>\n<p>So how does one debug an actual memory leak that might be elsewhere in the code while using MMAP?<\/p>\n<p>Well, you have to disable MMAP for the duration of your debug session and then re-enable it when you want high 
performance.<\/p>\n<p>As you have seen at the beginning of this article switching from <code>mmap<\/code> to normal IO is very simple to do.<\/p>\n<p>In the case of <code>datasets<\/code> you&#8217;d turn MMAP functionality off with <code>keep_in_memory=True<\/code> as in:<\/p>\n<pre><code><span class=\"hljs-attr\">dataset<\/span> = load_dataset(<span class=\"hljs-string\">\"wmt19\"<\/span>, <span class=\"hljs-string\">'cs-en'<\/span>, keep_in_memory=<span class=\"hljs-literal\">True<\/span>, streaming=<span class=\"hljs-literal\">False<\/span>)[<span class=\"hljs-string\">'train'<\/span>]\n<\/code><\/pre>\n<p>This loads the dataset in RAM, and now you should be able to debug your potential leak.<\/p>\n<p>Let&#8217;s test after modifying our last program:<\/p>\n<pre>- dataset = load_dataset(<span class=\"hljs-string\">\"wmt19\"<\/span>, <span class=\"hljs-symbol\">'cs<\/span>-en', keep_in_memory=<span class=\"hljs-literal\">False<\/span>, streaming=<span class=\"hljs-literal\">False<\/span>)[<span class=\"hljs-symbol\">'train<\/span>']\n+ dataset = load_dataset(<span class=\"hljs-string\">\"wmt19\"<\/span>, <span class=\"hljs-symbol\">'cs<\/span>-en', keep_in_memory=<span class=\"hljs-literal\">True<\/span>, streaming=<span class=\"hljs-literal\">False<\/span>)[<span class=\"hljs-symbol\">'train<\/span>']<\/pre>\n<p>Now in the normal unlimited shell we run:<\/p>\n<pre>$ python mmap-no-leak-debug-datasets.py --in-mem\nDataset len=7270695\n     idx        RSS           \u0394 RSS\n      <span class=\"hljs-number\"> 0 <\/span>   1849.5391MB     469.5781MB\n      <span class=\"hljs-number\"> 1 <\/span>   1833.0391MB     -16.5000MB\n      <span class=\"hljs-number\"> 2 <\/span>   1803.4609MB     -29.5781MB\n      <span class=\"hljs-number\"> 3 <\/span>   1811.5312MB       8.0703MB\n      <span class=\"hljs-number\"> 4 <\/span>   1803.9531MB      -7.5781MB\n      <span class=\"hljs-number\"> 5 <\/span>   1811.7734MB       7.8203MB\n      <span class=\"hljs-number\"> 6 
<\/span>   1836.0391MB      24.2656MB\n      <span class=\"hljs-number\"> 7 <\/span>   1839.5938MB       3.5547MB\n      <span class=\"hljs-number\"> 8 <\/span>   1855.9688MB      16.3750MB\n      <span class=\"hljs-number\"> 9 <\/span>   1850.5430MB      -5.4258MB\n     <span class=\"hljs-number\"> 10 <\/span>   1865.3398MB      14.7969MB\n     <span class=\"hljs-number\"> 11 <\/span>   1876.2461MB      10.9062MB\n     <span class=\"hljs-number\"> 12 <\/span>   1853.0469MB     -23.1992MB\n     <span class=\"hljs-number\"> 13 <\/span>   1881.4453MB      28.3984MB\nTotal diff:     501.4844MB<\/pre>\n<p>The RSS memory is more stable but fluctuates because the records are different, and the dataset can be huge to load into memory.<\/p>\n<h2 id=\"using-synthetic-mmap-disabled-dataset-to-debug-memory-leaks\">Using synthetic MMAP-disabled dataset to debug memory leaks<\/h2>\n<p>Therefore the easiest approach is to create a synthetic dataset of desired length with all records being the same. That way the data is no longer a factor in the memory usage patterns as it&#8217;s always the same.<\/p>\n<pre>$ <span class=\"hljs-keyword\">cat<\/span> <span class=\"hljs-keyword\">ds<\/span>-synthetic-<span class=\"hljs-keyword\">no<\/span>-mmap.<span class=\"hljs-keyword\">py<\/span>\nfrom datasets import load_from_disk, Dataset\nimport gc\nimport sys\nimport os\nimport psutil\n\nproc = psutil.Process(os.<span class=\"hljs-built_in\">getpid<\/span>())\ndef mem_read():\n    gc.collect()\n    <span class=\"hljs-keyword\">return<\/span> proc.memory_info().rss \/ <span class=\"hljs-number\">2<\/span>**<span class=\"hljs-number\">20<\/span>\n\nDS_PATH = <span class=\"hljs-string\">\"synthetic-ds\"<\/span>\n<span class=\"hljs-keyword\">if<\/span> not os.path.<span class=\"hljs-built_in\">exists<\/span>(DS_PATH):\n    records = <span class=\"hljs-number\">1<\/span>_000_000\n    <span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-string\">\"Creating a synthetic 
dataset\"<\/span>)\n    row = dict(foo=[dict(<span class=\"hljs-keyword\">a<\/span>=<span class=\"hljs-string\">'a'<\/span>*<span class=\"hljs-number\">500<\/span>, <span class=\"hljs-keyword\">b<\/span>=<span class=\"hljs-string\">'b'<\/span>*<span class=\"hljs-number\">1000<\/span>)])\n    <span class=\"hljs-keyword\">ds<\/span> = Dataset.from_dict({<span class=\"hljs-keyword\">k<\/span>: [v] * records <span class=\"hljs-keyword\">for<\/span> <span class=\"hljs-keyword\">k<\/span>, v in row.<span class=\"hljs-built_in\">items<\/span>()})\n    <span class=\"hljs-keyword\">ds<\/span>.save_to_disk(DS_PATH)\n    <span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-string\">\"Done. Please restart the program\"<\/span>)\n    sys.<span class=\"hljs-keyword\">exit<\/span>()\n\ndataset = load_from_disk(DS_PATH, keep_in_memory=True)\n<span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-keyword\">f<\/span><span class=\"hljs-string\">\"Dataset len={len(dataset)}\"<\/span>)\n\n<span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-keyword\">f<\/span><span class=\"hljs-string\">\"{'idx':&gt;8} {'RSS':&gt;10} {'\u0394 RSS':&gt;15}\"<\/span>)\nmem_start = <span class=\"hljs-number\">0<\/span>\nstep = <span class=\"hljs-number\">50<\/span>_000\nwarmup_iterations = <span class=\"hljs-number\">4<\/span>\n<span class=\"hljs-keyword\">for<\/span> idx, i in enumerate(<span class=\"hljs-built_in\">range<\/span>(<span class=\"hljs-number\">0<\/span>, <span class=\"hljs-built_in\">len<\/span>(dataset), step)):\n    <span class=\"hljs-keyword\">if<\/span> idx == warmup_iteration<span class=\"hljs-variable\">s:<\/span> # skip the <span class=\"hljs-keyword\">first<\/span> few iterations <span class=\"hljs-keyword\">while<\/span> things <span class=\"hljs-built_in\">get<\/span> <span class=\"hljs-keyword\">set<\/span> <span class=\"hljs-keyword\">up<\/span>\n        mem_start = mem_read()\n    mem_before = mem_read()\n    _ = dataset[i:i+step]\n    mem_after = 
mem_read()\n    <span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-keyword\">f<\/span><span class=\"hljs-string\">\"{i:8d} {mem_after:12.4f}MB {mem_after - mem_before:12.4f}MB\"<\/span>)\nmem_end = mem_read()\n\n<span class=\"hljs-keyword\">print<\/span>(<span class=\"hljs-keyword\">f<\/span><span class=\"hljs-string\">\"Total diff: {mem_end - mem_start:12.4f}MB (after {warmup_iterations} warmup iterations)\"<\/span>)\n<\/pre>\n<p>We run this program once to create the dataset, and then the second time to profile its memory usage:<\/p>\n<pre>$ python ds-synthetic-no-mmap<span class=\"hljs-selector-class\">.py<\/span>\nCreating <span class=\"hljs-selector-tag\">a<\/span> synthetic dataset\nDone. Please restart the program<\/pre>\n<pre>$ python ds-synthetic-no-mmap.py\nDataset len=1000000\n     idx        RSS           \u0394 RSS\n      <span class=\"hljs-number\"> 0 <\/span>   1649.6055MB      95.1992MB\n  <span class=\"hljs-number\"> 50000 <\/span>   1728.4961MB      78.8906MB\n <span class=\"hljs-number\"> 100000 <\/span>   1728.7109MB       0.2148MB\n <span class=\"hljs-number\"> 150000 <\/span>   1729.2539MB       0.5430MB\n <span class=\"hljs-number\"> 200000 <\/span>   1729.0039MB      -0.2500MB\n <span class=\"hljs-number\"> 250000 <\/span>   1729.5039MB       0.5000MB\n <span class=\"hljs-number\"> 300000 <\/span>   1729.2539MB      -0.2500MB\n <span class=\"hljs-number\"> 350000 <\/span>   1729.7539MB       0.5000MB\n <span class=\"hljs-number\"> 400000 <\/span>   1729.5039MB      -0.2500MB\n <span class=\"hljs-number\"> 450000 <\/span>   1730.0039MB       0.5000MB\n <span class=\"hljs-number\"> 500000 <\/span>   1729.7539MB      -0.2500MB\n <span class=\"hljs-number\"> 550000 <\/span>   1730.2539MB       0.5000MB\n <span class=\"hljs-number\"> 600000 <\/span>   1730.0039MB      -0.2500MB\n <span class=\"hljs-number\"> 650000 <\/span>   1730.5039MB       0.5000MB\n <span class=\"hljs-number\"> 700000 <\/span>   1730.2539MB      -0.2500MB\n <span 
class=\"hljs-number\"> 750000 <\/span>   1730.7539MB       0.5000MB\n <span class=\"hljs-number\"> 800000 <\/span>   1730.5039MB      -0.2500MB\n <span class=\"hljs-number\"> 850000 <\/span>   1731.0039MB       0.5000MB\n <span class=\"hljs-number\"> 900000 <\/span>   1730.7539MB      -0.2500MB\n <span class=\"hljs-number\"> 950000 <\/span>   1731.2539MB       0.5000MB\nTotal diff:       2.0000MB (after<span class=\"hljs-number\"> 4 <\/span>warmup iterations)\n<\/pre>\n<p>This is much better. There are still tiny fluctuations due to Python, and you can see in the code that I skipped the first few iterations while things are being set up.<\/p>\n<p>But otherwise you can now easily debug the rest of your code for any memory leaks, since <code>datasets<\/code> is in non-MMAP mode and the record size doesn&#8217;t fluctuate.<\/p>\n<p>Of course, do not forget to flip <code>load_from_disk(..., keep_in_memory=True)<\/code> to <code>False<\/code> when the debugging process is over so that you get back the speedup provided by MMAP.<\/p>\n<p>I wrote these notes mainly for myself to ensure I have a good understanding of this complex use-case. And I hope you have gained some understanding from it as well.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A step-by-step demonstration of mmap not leaking memory even though it appears to be leaking memory like there is no tomorrow. 
<!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":303,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[28,48],"tags":[51,31,49,52,50],"class_list":["post-301","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","category-software-engineering","tag-datasets","tag-machine-learning","tag-mmap","tag-pyarrow","tag-python"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.1.1 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>A Deep Investigation into MMAP Not Leaking Memory - Entrepreneur Being<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"A Deep Investigation into MMAP Not Leaking Memory - Entrepreneur Being\" \/>\n<meta property=\"og:description\" content=\"A step-by-step demonstration of mmap not leaking memory even though it appears to be leaking memory like there is no tomorrow.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\" \/>\n<meta property=\"og:site_name\" content=\"Entrepreneur Being\" \/>\n<meta property=\"article:published_time\" content=\"2022-09-29T03:52:28+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-23T05:12:07+00:00\" \/>\n<meta 
property=\"og:image\" content=\"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png\" \/>\n\t<meta property=\"og:image:width\" content=\"940\" \/>\n\t<meta property=\"og:image:height\" content=\"788\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"stas\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"stas\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\"},\"author\":{\"name\":\"stas\",\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/#\/schema\/person\/554642dec8ca3206478ceaefac2b48ac\"},\"headline\":\"A Deep Investigation into MMAP Not Leaking Memory\",\"datePublished\":\"2022-09-29T03:52:28+00:00\",\"dateModified\":\"2025-07-23T05:12:07+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\"},\"wordCount\":1814,\"commentCount\":2,\"image\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png\",\"keywords\":[\"datasets\",\"machine learning\",\"mmap\",\"pyarrow\",\"python\"],\"articleSection\":[\"Machine Learning\",\"Software 
Engineering\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\",\"url\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\",\"name\":\"A Deep Investigation into MMAP Not Leaking Memory - Entrepreneur Being\",\"isPartOf\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png\",\"datePublished\":\"2022-09-29T03:52:28+00:00\",\"dateModified\":\"2025-07-23T05:12:07+00:00\",\"author\":{\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/#\/schema\/person\/554642dec8ca3206478ceaefac2b48ac\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage\",\"url\":\"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png\",\"contentUrl\":\"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png\",\"width\":940,\"height\":788,\"caption\":\"A Deep Investigation into MMAP Not Leaking 
Memory\"},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/#website\",\"url\":\"https:\/\/stasosphere.com\/entrepreneur-being\/\",\"name\":\"Entrepreneur Being\",\"description\":\"What can be done without a safety net\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/stasosphere.com\/entrepreneur-being\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/stasosphere.com\/entrepreneur-being\/#\/schema\/person\/554642dec8ca3206478ceaefac2b48ac\",\"name\":\"stas\",\"sameAs\":[\"https:\/\/stasosphere.com\/\"],\"url\":\"https:\/\/stasosphere.com\/entrepreneur-being\/author\/stas\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"A Deep Investigation into MMAP Not Leaking Memory - Entrepreneur Being","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/","og_locale":"en_US","og_type":"article","og_title":"A Deep Investigation into MMAP Not Leaking Memory - Entrepreneur Being","og_description":"A step-by-step demonstration of mmap not leaking memory even though it appears to be leaking memory like there is no tomorrow.","og_url":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/","og_site_name":"Entrepreneur 
Being","article_published_time":"2022-09-29T03:52:28+00:00","article_modified_time":"2025-07-23T05:12:07+00:00","og_image":[{"width":940,"height":788,"url":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png","type":"image\/png"}],"author":"stas","twitter_misc":{"Written by":"stas","Est. reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#article","isPartOf":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/"},"author":{"name":"stas","@id":"https:\/\/stasosphere.com\/entrepreneur-being\/#\/schema\/person\/554642dec8ca3206478ceaefac2b48ac"},"headline":"A Deep Investigation into MMAP Not Leaking Memory","datePublished":"2022-09-29T03:52:28+00:00","dateModified":"2025-07-23T05:12:07+00:00","mainEntityOfPage":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/"},"wordCount":1814,"commentCount":2,"image":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage"},"thumbnailUrl":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png","keywords":["datasets","machine learning","mmap","pyarrow","python"],"articleSection":["Machine Learning","Software Engineering"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/","url":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/","name":"A Deep Investigation into MMAP Not Leaking Memory - Entrepreneur 
Being","isPartOf":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/#website"},"primaryImageOfPage":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage"},"image":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage"},"thumbnailUrl":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png","datePublished":"2022-09-29T03:52:28+00:00","dateModified":"2025-07-23T05:12:07+00:00","author":{"@id":"https:\/\/stasosphere.com\/entrepreneur-being\/#\/schema\/person\/554642dec8ca3206478ceaefac2b48ac"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/stasosphere.com\/entrepreneur-being\/301-mmap-memory-leak-investigation\/#primaryimage","url":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png","contentUrl":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-content\/uploads\/2022\/09\/A-Deep-Investigation-into-MMAP-Not-Leaking-Memory.png","width":940,"height":788,"caption":"A Deep Investigation into MMAP Not Leaking Memory"},{"@type":"WebSite","@id":"https:\/\/stasosphere.com\/entrepreneur-being\/#website","url":"https:\/\/stasosphere.com\/entrepreneur-being\/","name":"Entrepreneur Being","description":"What can be done without a safety 
net","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/stasosphere.com\/entrepreneur-being\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/stasosphere.com\/entrepreneur-being\/#\/schema\/person\/554642dec8ca3206478ceaefac2b48ac","name":"stas","sameAs":["https:\/\/stasosphere.com\/"],"url":"https:\/\/stasosphere.com\/entrepreneur-being\/author\/stas\/"}]}},"_links":{"self":[{"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/posts\/301","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/comments?post=301"}],"version-history":[{"count":12,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/posts\/301\/revisions"}],"predecessor-version":[{"id":348,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/posts\/301\/revisions\/348"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/media\/303"}],"wp:attachment":[{"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/media?parent=301"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/categories?post=301"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/stasosphere.com\/entrepreneur-being\/wp-json\/wp\/v2\/tags?post=301"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}