The problem is that I have a ~1 GB NumPy super-array in which each item goes through an algorithm and is then stored in another super-array with append, whose way of working I don't like. Let's see: if I tell it to add an item to the end of the array, why does it return a modified copy instead of modifying the original? Why duplicate the array unnecessarily? I just don't get it; if NumPy is going for performance, they've done a great job with this one. This is the code:
import numpy as np

a_original  # original one-dimensional NumPy array with the data, dtype=np.uint8
a_final = np.array([], dtype=np.uint8)  # array the items are copied into

for i, item in enumerate(a_original):
    ''' Here the item goes through the algorithm '''
    a_final = np.append(a_final, item)  # append it to the end of the array
What I do is overwrite the a_final variable so as not to double the memory. But the way append works slows the process down far too much with large files. Is there another way to do this?
An array, in NumPy just as in C or Fortran, is by definition a set of data occupying contiguous positions in memory. This is important because it allows operations on the data to be performed very efficiently; for example, it is essential for using pointer arithmetic.

When you define an array of N elements of a data type of size M, NumPy asks the operating system to allocate a memory chunk of at least N x M bytes to hold the array.

The OS allocates that space wherever it sees fit: there may be 2 GB of free (unallocated) memory right after the array, or none at all, and this can change at any moment.

When you want to add a new element to the NumPy array (or to a C array), M more bytes are needed right after the last element, which may or may not be possible depending on whether there is contiguous unallocated memory there. If it is not possible, the only option is to ask the OS for a new memory block of (N + 1) x M bytes and copy the entire array to the new location.
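You can see this from Python itself; a small sketch (ndarray.ctypes.data holds the address of the underlying data buffer):

import numpy as np

a = np.zeros(10, dtype=np.uint8)
print(hex(a.ctypes.data))      # address of the original buffer

b = np.append(a, np.uint8(1))  # np.append builds a brand-new array
print(hex(b.ctypes.data))      # a different address: all the data was copied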
On the other hand, numpy.append is designed to always return a copy of the array, never to add elements in place. If we want to grow the array in place we can use numpy.ndarray.resize. This does not guarantee that the array will never have to be copied in memory, as explained above; there is no way around that, whether we work in NumPy, in C or anywhere else, unless we reserve excess memory from the start. What it does guarantee is that it will try to enlarge the array without copying anything whenever possible.

To test the different pieces of code I am going to use a random array like this:
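(A sketch; the size of 100,000 elements is an assumption on my part, just to keep the timings short.)

import numpy as np

# 100,000 random bytes as test data
a_original = np.random.randint(0, 256, size=10**5, dtype=np.uint8)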
Using numpy.append:
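(A minimal sketch; the real algorithm is replaced here by a simple pass-through, and the timing uses time.perf_counter.)

import time

import numpy as np

t0 = time.perf_counter()
a_final = np.array([], dtype=np.uint8)
for item in a_original:
    # here the item would go through the algorithm
    a_final = np.append(a_final, item)  # builds a new array and copies everything, every time
print('np.append:', time.perf_counter() - t0, 's')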
As expected, each call to append allocates memory for the new array and copies everything to the new location.

You can try several things:
Use numpy.ndarray.resize:
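(A sketch of the same loop growing the array in place; refcheck=False skips NumPy's reference check and should be used with care.)

import time

import numpy as np

t0 = time.perf_counter()
a_final = np.empty(0, dtype=np.uint8)
for i, item in enumerate(a_original):
    a_final.resize(i + 1, refcheck=False)  # try to grow in place; copies only if it must
    a_final[i] = item                      # the item after passing through the algorithm
print('ndarray.resize:', time.perf_counter() - t0, 's')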
An intermediate solution between this approach and the next one is to use another intermediate array as a buffer, reducing the number of calls to resize.
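For example, something like this (the buffer size of 4096 is an arbitrary assumption):

import time

import numpy as np

t0 = time.perf_counter()
BUF = 4096
a_final = np.empty(0, dtype=np.uint8)
buffer = np.empty(BUF, dtype=np.uint8)
used = 0
for item in a_original:
    buffer[used] = item                   # the item after the algorithm
    used += 1
    if used == BUF:                       # buffer full: one resize for BUF items
        a_final.resize(a_final.size + used, refcheck=False)
        a_final[-used:] = buffer
        used = 0
if used:                                  # flush whatever is left over
    a_final.resize(a_final.size + used, refcheck=False)
    a_final[-used:] = buffer[:used]
print('buffered resize:', time.perf_counter() - t0, 's')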
If you know that, at most, your final array will be the size of the original (or twice that, and so on), you can reserve memory for that maximum and then apply a resize at the end to keep only what was actually used. This may waste RAM at first, but you avoid costly copy operations. Also, the CPU cost of declaring an uninitialized array is negligible: it just allocates memory and nothing more.
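A sketch of that idea, assuming the final array is at most as large as the original:

import time

import numpy as np

t0 = time.perf_counter()
a_final = np.empty(a_original.size, dtype=np.uint8)  # reserve the maximum up front, uninitialised
n = 0
for item in a_original:
    a_final[n] = item   # the item after passing through the algorithm
    n += 1
a_final.resize(n, refcheck=False)  # keep only the part that was actually used
print('preallocation:', time.perf_counter() - t0, 's')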
In this case, the CPU time is spent almost entirely in the for loop itself; if we cannot vectorize the operation (in this example we obviously can), we have no choice but to rely on the flexibility of Python at the cost of losing efficiency.
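As a hypothetical illustration, if the "algorithm" were simply halving each value, the whole loop disappears:

import time

import numpy as np

t0 = time.perf_counter()
a_final = a_original // 2   # hypothetical vectorised "algorithm"
print('vectorised:', time.perf_counter() - t0, 's')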