Numpy.memmap: Bogus Memory Allocation
Solution 1:
There's nothing 'bogus' about the fact that you are generating 10 TB files. You are asking for arrays of size
2 ** 37 * 10 = 1374389534720 elements
A dtype of 'i8' means an 8 byte (64 bit) integer, therefore your final array will have a size of
1374389534720 * 8 = 10995116277760 bytes
or
10995116277760 / 1E12 = 10.99511627776 TB
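The same arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation above; the variable names are illustrative, and the last line adds the binary (TiB) figure for comparison with the decimal one.

```python
import numpy as np

n_elements = 2 ** 37 * 10            # number of array elements requested
itemsize = np.dtype('i8').itemsize   # bytes per 64-bit integer
n_bytes = n_elements * itemsize

print(n_elements)           # 1374389534720
print(n_bytes)              # 10995116277760
print(n_bytes / 1e12)       # size in decimal terabytes (TB)
print(n_bytes / 2 ** 40)    # size in binary tebibytes (TiB)
```

The file is therefore exactly 10 TiB, which tools such as ls -h report as "10T".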
If you only have 250 GB of free disk space then how are you able to create a "10 TB" file?
Assuming that you are using a reasonably modern filesystem, your OS will be capable of generating almost arbitrarily large sparse files, regardless of whether or not you actually have enough physical disk space to back them.
For example, on my Linux machine I'm allowed to do something like this:
# I only have about 50GB of free space...
~$ df -h /
Filesystem Type Size Used Avail Use% Mounted on
/dev/sdb1 ext4 459G 383G 53G 88% /
~$ dd if=/dev/zero of=sparsefile bs=1 count=0 seek=10T
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236933 s, 0.0 kB/s
# ...but I can still generate a sparse file that reports its size as 10 TB
~$ ls -lah sparsefile
-rw-rw-r-- 1 alistair alistair 10T Dec 1 21:17 sparsefile
# however, this file uses zero bytes of "actual" disk space
~$ du -h sparsefile
0 sparsefile
Try calling du -h on your np.memmap file after it has been initialized to see how much actual disk space it uses.
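If you would rather make the same comparison from Python, something like the following might do. It is a minimal sketch that assumes a POSIX system, where st_blocks is counted in 512-byte units; the helper name and the temporary file name are made up for illustration.

```python
import os
import tempfile

import numpy as np


def apparent_and_allocated_bytes(path):
    """Return (size reported by ls, bytes actually allocated on disk)."""
    st = os.stat(path)
    # st_blocks is in units of 512 bytes on POSIX systems (not available on Windows)
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, 'sparse_demo.mmap')
    arr = np.memmap(fname, dtype='i8', mode='w+', shape=(2 ** 20,))
    apparent, allocated = apparent_and_allocated_bytes(fname)
    print(apparent)    # apparent size: 2 ** 20 elements * 8 bytes
    print(allocated)   # typically far smaller on a filesystem with sparse-file support
    del arr            # drop the mapping before the directory is removed
```

On a filesystem without sparse-file support the two numbers should be roughly equal.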
As you start actually writing data to your np.memmap file, everything will be OK until you exceed the physical capacity of your storage, at which point the process will terminate with a Bus error. This means that if you needed to write < 250GB of data to your np.memmap array then there might be no problem (in practice this would probably also depend on where you are writing within the array, and on whether it is row or column major).
How is it possible for a process to use 10 TB of virtual memory?
When you create a memory map, the kernel allocates a new block of addresses within the virtual address space of the calling process and maps them to a file on your disk. The amount of virtual memory that your Python process is using will therefore increase by the size of the file that has just been created. Since the file can also be sparse, the virtual memory can exceed not only the total amount of RAM available but also the total physical disk space on your machine.
How can you check whether you have enough disk space to store the full np.memmap array?
I'm assuming that you want to do this programmatically in Python.
1. Get the amount of free disk space available. There are various methods given in the answers to this previous SO question. One option is os.statvfs:

import os

def get_free_bytes(path='/'):
    st = os.statvfs(path)
    return st.f_bavail * st.f_bsize

print(get_free_bytes())
# 56224485376

2. Work out the size of your array in bytes:

import numpy as np

def check_asize_bytes(shape, dtype):
    return np.prod(shape) * np.dtype(dtype).itemsize

print(check_asize_bytes((2 ** 37 * 10,), 'i8'))
# 10995116277760

3. Check whether 2. > 1.
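Put together, the three steps might look like the sketch below. The name memmap_fits_on_disk is a hypothetical helper, not part of NumPy, and math.prod (Python 3.8+) is swapped in for np.prod so that the element count is computed with arbitrary-precision Python integers.

```python
import math
import os

import numpy as np


def get_free_bytes(path='/'):
    """Bytes available to unprivileged users on the filesystem holding `path`."""
    st = os.statvfs(path)  # Unix only
    return st.f_bavail * st.f_bsize


def array_nbytes(shape, dtype):
    """Size in bytes of an array with the given shape and dtype."""
    # math.prod works on Python ints, so very large shapes cannot overflow
    return math.prod(shape) * np.dtype(dtype).itemsize


def memmap_fits_on_disk(path, shape, dtype):
    """Hypothetical helper: True if the full array would fit in the free space."""
    return array_nbytes(shape, dtype) <= get_free_bytes(path)


print(memmap_fits_on_disk('/', (2 ** 37 * 10,), 'i8'))
```

Bear in mind that this is only a snapshot: free space can change between the check and the moment you actually write to the array. On platforms without os.statvfs, shutil.disk_usage(path).free should give a comparable figure.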
Update: Is there a 'safe' way to allocate an np.memmap file, which guarantees that sufficient disk space is reserved to store the full array?
One possibility might be to use fallocate to pre-allocate the disk space, e.g.:
~$ fallocate -l 1G bigfile
~$ du -h bigfile
1.1G bigfile
You could call this from Python, for example using subprocess.check_call:

import subprocess
import numpy as np

def fallocate(fname, length):
    # reserve `length` bytes of real disk space for fname
    return subprocess.check_call(['fallocate', '-l', str(length), fname])

def safe_memmap_alloc(fname, dtype, shape, *args, **kwargs):
    nbytes = np.prod(shape) * np.dtype(dtype).itemsize
    fallocate(fname, nbytes)
    return np.memmap(fname, dtype, *args, shape=shape, **kwargs)

mmap = safe_memmap_alloc('test.mmap', np.int64, (1024, 1024))

print(mmap.nbytes / 1E6)
# 8.388608

print(subprocess.check_output(['du', '-h', 'test.mmap']))
# 8.0M test.mmap
I'm not aware of a platform-independent way to do this using the standard library, but there is a fallocate Python module on PyPI that should work for any Posix-based OS.
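On Unix systems where Python exposes it, os.posix_fallocate (Python 3.3+) may be another way to reserve the space without shelling out. It is not platform-independent either, and as far as I know it is missing on some platforms such as macOS. The sketch below assumes it is available; preallocated_memmap is a hypothetical helper name.

```python
import math
import os
import tempfile

import numpy as np


def preallocated_memmap(fname, dtype, shape):
    """Hypothetical helper: reserve real disk space, then map the file."""
    nbytes = math.prod(shape) * np.dtype(dtype).itemsize
    fd = os.open(fname, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # raises OSError (e.g. ENOSPC) if the space cannot be reserved
        os.posix_fallocate(fd, 0, nbytes)
    finally:
        os.close(fd)
    # the file now exists at full size, so open it in read/write mode
    return np.memmap(fname, dtype=dtype, mode='r+', shape=shape)


with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, 'prealloc_demo.mmap')
    arr = preallocated_memmap(fname, np.int64, (1024, 1024))
    file_size = os.path.getsize(fname)
    arr_shape = arr.shape
    print(arr.nbytes, file_size)
    del arr  # drop the mapping before the directory is removed
```

If the allocation fails, the error surfaces as an ordinary Python exception at allocation time rather than as a Bus error later on.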
Solution 2:
Based on the answer of @ali_m I finally came to this solution:
# must be called with an argument giving the array size in GB
import sys, numpy, tempfile, subprocess
size = (2 ** 27) * int(sys.argv[1])
tmp_primary = tempfile.NamedTemporaryFile('w+')
array = numpy.memmap(tmp_primary.name, dtype = 'i8', mode = 'w+', shape = size)
tmp = tempfile.NamedTemporaryFile('w+')
# capture stderr so that a failed copy can be detected below
check = subprocess.Popen(['cp', '--sparse=never', tmp_primary.name, tmp.name], stderr = subprocess.PIPE)
stdout, stderr = check.communicate()
if stderr:
    sys.stderr.write(stderr.decode('utf-8'))
    sys.exit(1)
del array
tmp_primary.close()
array = numpy.memmap(tmp.name, dtype = 'i8', mode = 'r+', shape = size)
array[0] = 666
array[size-1] = 777
print('File: {}. Array size: {}. First cell value: {}. Last cell value: {}'.\
      format(tmp.name, len(array), array[0], array[size-1]))
while True:
    pass
The idea is to copy the initially generated sparse file to a new, non-sparse one. For this, cp with the option --sparse=never is employed.
When the script is called with a manageable size parameter (say, 1 GB), the array gets mapped to a non-sparse file. This is confirmed by the output of the du -h command, which now shows a size of ~1 GB. If there is not enough disk space, the script exits with the error:
cp: ‘/tmp/tmps_thxud2’: write failed: No space left on device