I would like to know what is the most efficient way to compare the contents of two arrays, to do this in the least amount of time and to use the least amount of resources.
I present this example in Python of how I do it on a daily basis. For this example, there are few elements, it takes very little time, but as we add elements, the times increase exponentially.
If you want to give an example in another welcome language, the idea is to know the logic of how to do it over the language.
Here the example:
#!/usr/bin/python
# -*- coding: utf-8 -*-
arreglo1 = [1,2,3,4,5,6]
arreglo2 = [0,2,4,6,8,10]
repetidos = []
for x in arreglo1:
for y in arreglo2:
if x == y:
repetidos.append(x)
print "Los Repetidos son"
for z in repetidos:
print z
The output is: The Repeats are 2 4 6
Since the question has the tag I
lenguaje-agnóstico
will answer the question as a computational problem and from an algorithmic analysis.The computational problem that it tries to solve is the following: given an array A of n elements and an array B of m elements with n and m nonnegative integers, return an array C with k elements where each kth element of this array C belongs to both to array A and to array B. It is clear that if the array C is empty it means that there are no elements common to both arrays.
A trivial solution to this problem in pseudocode is the following:
The trivial solution consists in comparing all the elements of the array B with each one of the elements of the array A and adding those that are repeated in the array C and returning them. This solution is essentially the same as the one proposed in the Python code that accompanies the question.
Let us clarify the previous statement. From the pseudocode above we find a nested double loop. The first loop (line 2.) is executed n times (because array A has n elements), for every time that line is executed line 3. is executed m times (because array B has m elements. This means that the number of comparisons carried out by this procedure is n times m.The analysis of the algorithm then shows that its complexity is O(nm)(This complexity is an abstraction that, under a defined computational model, allows us to determine the execution time of the algorithm for a large input size). If we assume that both arrays have the same number of elements n , the complexity of the algorithm is O(n^2) . That is, the complexity of the algorithm is not 'exponential' as it colloquially mentions, but quadratic with respect to n , the size of the input of the problem.
Given the above context the question now is: is there an algorithm better than O(n^2) to compare two arrays of size n? . The answer is yes.
If you add the elements of the first array to a data structure called Hash and then iterate through the elements of the second array to see if they are in the structure, if so, you add it to array C.
The operations of adding and querying elements to a hash structure have computational complexity O(1) , an amortized analysis shows that in average cases the insertion and query take constant time . For the above algorithm, line 2 is executed n times (we need to add n elements of array A to hash D , for each execution of line 2, line 3 is executed 1 time (because inserting into structure takes constant time) Line 4 is executed mtimes and for each execution we take constant time to verify if the m-th element of the array is in the hash structure, if so, it is clear that the element is repeated and we add it to array C.
From the above it is clear that the computational complexity of the new algorithm is O(n) , that is, it is linear with respect to the size of the array elements. This is a substantial improvement to the previous algorithm with complexity O(n^2) , remember that the goal of setting the complexity is to identify the execution time when the size of the input tends to larger and larger values.
Then a question arises, is it possible to overcome the previous algorithm? , the answer is no . Clearly, in order to identify which elements are repeated in both arrays, we must go through some of the arrays at least once, so at least O(n) operations must be performed.
Other algorithm implementations allow for decent computational complexities. One possible strategy is to sort both arrays and do a linear traversal comparing each of the elements (not in pairs like the first algorithm, but using two iterators that guarantee to traverse the two arrays in O(n+m) , or O(n If we assume that we use an ordering algorithm of order O(nlogn) , the complexity of this algorithm is precisely O(nlogn) .Another possibility is to assume that both A and B are ordered, so the computational complexity is , using the algorithm above, O(n) .
In short: the most efficient way to compare two arrays, without making any additional assumptions, is to use a Hash, which guarantees a computational complexity in most cases of O(n) .
There is no general method that is the most efficient. It will depend on many factors, but a very important one is going to be the language used. For each language it is possible that there is a more efficient way than another. A Python list is not the same as a C array or a C++ Vector. There are always general ideas such as avoiding the number of complete iterations over the array as much as possible, using hash tables if possible, etc. but it will depend on each language and even on the data type.
Speaking specifically of Python and based on your example, the problem with using that method is that for each element
arreglo1
you loop through thearreglo2
integer. In Python, a much more efficient way to see the elements that are present in two iterables simultaneously is to useintersection()
the sets(set
) method, which takes two sets and returns another set with the elements present in both:Edition:
I add a small comparison using two lists of 100,000 elements each. I compare your code, with a more efficient alternative using
in
and list comprehensions and withset.intersection()
. All three functions return a list.To what has been said before, it must be added that the method
append
of lists subtracts even more efficiency, if possible. The results leave little doubt:in
PHP
existsarray_intersect
in
Python
existsintersection
In php there is a function that does it is array_diff() . What it does is compare two arrays and remove the ones that don't match. Ex:
This is how I would do it:
I tried both algorithms with array1 = 1-6000 from 1 to 1 and array2 = 1-10000 from 2 to 2.
Your implementation takes: 6,003 seconds, the one I propose takes: 2,001 seconds.
All the best.