The problem is that I have a ~1 GB NumPy super-array in which each item goes through an algorithm and is then stored in another super-array with append, whose way of working I don't like. Let's see: if I tell it to add an item to the end of the array, why does it return a modified copy instead of modifying the original? Why duplicate the array unnecessarily? I just don't get it; if NumPy is going for performance, they've done a great job with this one. This is the code:
import numpy as np

a_original  # original one-dimensional NumPy array with the data, dtype=np.uint8
a_final = np.array([], dtype=np.uint8)  # array the items are copied into

for i, item in enumerate(a_original):
    ''' Here the item goes through the algorithm '''
    a_final = np.append(a_final, item)  # append it to the end of the array
What I do is overwrite the a_final variable so as not to double the memory. But the way append works slows the process down far too much with large files. Is there another way to do this?
An array, in NumPy just as in C or Fortran, is by definition a set of data occupying contiguous positions in memory. This is important because it allows operations on the data to be performed very efficiently; for example, it is essential for using pointer arithmetic.

When you define an array of N elements of a data type of size M, NumPy asks the operating system to allocate a memory chunk of at least N x M bytes to hold the array.

The OS allocates that space wherever it sees fit: there may be 2 GB of free (unallocated) memory right after the array, or none at all, and this can change at any moment.

When you want to add a new element to the NumPy array (or to a C array), M more bytes are needed right after the last element, which may or may not be possible depending on whether there is contiguous unallocated memory there. If it is not possible, the only option is to ask the OS for a new memory block of (N + 1) x M bytes and copy the entire array to the new location.
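You can see this from Python itself; a small sketch (ndarray.ctypes.data holds the address of the underlying data buffer):

import numpy as np

a = np.zeros(10, dtype=np.uint8)
print(hex(a.ctypes.data))      # address of the original buffer

b = np.append(a, np.uint8(1))  # np.append builds a brand-new array
print(hex(b.ctypes.data))      # a different address: all the data was copied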
On the other hand, numpy.append is designed to always return a copy of the array, never to add elements in place. If we want to grow the array in place we can use numpy.ndarray.resize. This does not guarantee that the array will never have to be copied in memory, as explained above; there is no way around that, whether we work in NumPy, in C or anywhere else, unless we reserve excess memory from the start. What it does guarantee is that it will try to enlarge the array without copying anything whenever possible.

To test the different pieces of code I am going to use a random array like this:
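(A sketch; the size of 100,000 elements is an assumption on my part, just to keep the timings short.)

import numpy as np

# 100,000 random bytes as test data
a_original = np.random.randint(0, 256, size=10**5, dtype=np.uint8)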
Using numpy.append:
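(A minimal sketch; the real algorithm is replaced here by a simple pass-through, and the timing uses time.perf_counter.)

import time

import numpy as np

t0 = time.perf_counter()
a_final = np.array([], dtype=np.uint8)
for item in a_original:
    # here the item would go through the algorithm
    a_final = np.append(a_final, item)  # builds a new array and copies everything, every time
print('np.append:', time.perf_counter() - t0, 's')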
As expected, each call to append allocates memory for the new array and copies everything to the new location.

You can try several things:
Use numpy.ndarray.resize:
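(A sketch of the same loop growing the array in place; refcheck=False skips NumPy's reference check and should be used with care.)

import time

import numpy as np

t0 = time.perf_counter()
a_final = np.empty(0, dtype=np.uint8)
for i, item in enumerate(a_original):
    a_final.resize(i + 1, refcheck=False)  # try to grow in place; copies only if it must
    a_final[i] = item                      # the item after passing through the algorithm
print('ndarray.resize:', time.perf_counter() - t0, 's')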
An intermediate solution between this approach and the next one is to use another intermediate array as a buffer, reducing the number of calls to resize.
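For example, something like this (the buffer size of 4096 is an arbitrary assumption):

import time

import numpy as np

t0 = time.perf_counter()
BUF = 4096
a_final = np.empty(0, dtype=np.uint8)
buffer = np.empty(BUF, dtype=np.uint8)
used = 0
for item in a_original:
    buffer[used] = item                   # the item after the algorithm
    used += 1
    if used == BUF:                       # buffer full: one resize for BUF items
        a_final.resize(a_final.size + used, refcheck=False)
        a_final[-used:] = buffer
        used = 0
if used:                                  # flush whatever is left over
    a_final.resize(a_final.size + used, refcheck=False)
    a_final[-used:] = buffer[:used]
print('buffered resize:', time.perf_counter() - t0, 's')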
If you know that, at most, your final array will be the size of the original (or twice that, and so on), you can reserve memory for that maximum and then apply a resize at the end to keep only what was actually used. This may waste RAM at first, but you avoid costly copy operations. Also, the CPU cost of declaring an uninitialized array is negligible: it just allocates memory and nothing more.
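A sketch of that idea, assuming the final array is at most as large as the original:

import time

import numpy as np

t0 = time.perf_counter()
a_final = np.empty(a_original.size, dtype=np.uint8)  # reserve the maximum up front, uninitialised
n = 0
for item in a_original:
    a_final[n] = item   # the item after passing through the algorithm
    n += 1
a_final.resize(n, refcheck=False)  # keep only the part that was actually used
print('preallocation:', time.perf_counter() - t0, 's')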
In this case, the CPU time is spent almost entirely in the for loop itself; if we cannot vectorize the operation (in this example we obviously can), we have no choice but to rely on the flexibility of Python at the cost of losing efficiency.
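As a hypothetical illustration, if the "algorithm" were simply halving each value, the whole loop disappears:

import time

import numpy as np

t0 = time.perf_counter()
a_final = a_original // 2   # hypothetical vectorised "algorithm"
print('vectorised:', time.perf_counter() - t0, 's')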