I have to process a lot of sentences. In fact, the number of them is unlimited : they are obtained from a source on the Internet.
The goal is to count how many times a certain phrase is repeated. Since the number of them is, as I say, unlimited , it is obvious that I cannot store all of them in memory.
Additionally, certain conditions are required of me: the complexity of any operation must be O(1). There 's no mention of collisions, so I'm neglecting them for now.
An example of the expected behavior :
4096 sentences are read, all of them different . Coincidentally, neither generates hash collision with another.
The hash table is filled . The
4096
gaps are occupied by the sentences read.A new sentence is read. Its hash is the same as one of the previous phrases, but the phrase is different from all the
4096
previous ones.I remove from the table the item that has been unused the longest . In this case, it would be the first one I inserted . I don't remove the element with matching hash value, because it is more modern .
EDIT
Currently, I have this:
#include <string.h>
typedef struct phrase_s {
char text[256];
unsigned long long count;
struct phrase_s *prev;
struct phrase_s *next;
} Phrase;
Phrase List[4096];
// Devuelve la hash de una cadena, limitada a 12 bits ( 0 - 4095 ).
size_t makeHash( const char * );
// Devuelve la cadena MAS ANTIGUA, aquella que hay que eliminar.
Phrase *old( void );
void addString( const char *str ) {
size_t hash = makeHash( str );
if( !( List[hash].count ) ) {
// Caso fácil. No hay cadenas con hash coincidente.
memcpy( &( List[hash].text ), str, strlen( str ) );
List[hash].count = 1;
List[hash].prev = NULL;
List[hash].next = NULL;
return;
}
// Caso INTERESANTE. La hash COINCIDE con otra.
// POR HACER.
}
How do I implement point 4 above? Can I continue from the code I already have, or are other modifications necessary?
I reiterate the issue of complexity. The number of operations to perform must be independent of the size of the table.
Note: This is a question that was posted a couple days ago and I found it interesting. It is what I could understand with some sense of it, which has been eliminated by its author.
EDIT 2
I still find it an interesting question; surely there are more good answers, which show the logic to follow in these cases. I add the language C++
to the allowed responses.
EDIT 3
As I have been told, the term must be specified O( 1 )
. Normally, it would indicate a constant complexity . Since we're talking about a hash table, we'll add the mean tagline ; that is, the chosen algorithm must have, on average , complexity O( 1 )
.
Also, to be able to use std::unordered_map
, responses in C++11 are supported.
The problem, as you have stated, has no solution:
The simplest implementation, in my opinion, is a circular buffer. How is it implemented? Very easy:
Well, we have already defined the buffer:
Buffer initialization: Simple function...both integers to 0.
Function to increment the index: I don't like to repeat code.
Insert an element in the buffer: What we are going to do now is replace the value pointed to by the variable
indice
... then we shift the index one position. As the buffer is circular, if we reach the end of it we have to move the pointer to the beginning. The element counter will have to be updated until we reach the maximum number of elements... once that limit is reached, the oldest elements will begin to be replaced and the number of elements will remain constantHow is it used? Easy:
As you can see, inserting an element has an O(1) complexity since it does not require loops of any kind... and the number of elements stored does not matter. In addition, it will always overwrite the oldest values and its management is very simple.
Edition motivated by the always attentive @Trauma... I didn't take into account that they didn't admit duplicates... the solution then goes through integrating, as you comment in the question, a hash system to the circular buffer system:
A possible hash for the strings could be ( source ), I have sloppily adapted it to
unsigned short
... collisions will be more possible but we reduce the field to cover:Okay, now that we can generate a hash, it's time to consider how to combine the two worlds... it occurs to me to have two vectors: the circular one that we have already seen and one of hashes. In this way knowing if a hash is already occupied is as simple as looking at the hash table. How are the two lists related? I think it's better to keep it simple... the hash table has one element per possible hash (hence I made the table of type
unsigned short
). Initially, all values are 0. When an element is going to be inserted, it is checked if the position given by the hash of the string is at 0 or 1... if it is at 1, a string already exists and it is not inserted in the buffer while that if it is 0 we have a free pass.To correctly manage the hash table, the library must be loaded
limits.h
. Thus, the size of the hash table will be correct regardless of the system in which the program is compiled:Now we have to update the initialization... passing a bit of performance (the improvement would be negligible), it already makes sense to use
memset
:We already have all the bytes of the buffer at 0 (including the hash table)... Let's modify the function to add elements:
The logic is simple:
And to top it off we are missing a function that I had missed before. We need a function that gives us the initial index of the buffer, which will be:
Said with code:
And now it only remains to correct the presentation loop of the
main
to use this index as initial instead of what was before:C++11 version
In this case I have tried to force the machine so that it is not a simple translation of the C code:
std::unordered_map
. This map stores the values found in the list and a pointer to the node that contains the value.std::unique_ptr
. Using raw pointers was too easy and the idea was to come up with something totally different... and yes, withstd::unique_ptr
you can create a doubly linked list.And well, here I leave you the result of typing for a while. The code can be complex to follow if you don't have a good level of C++.
To see the working example: link
There are easier ways to do this but my intention is not to do it the easy way. I hope other answers will explore new options.