I am reading a file and counting the words to show how many times each word is repeated.
readme = sc.textFile("README.md")
wordCounts = (
    readme.flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
wordCounts.collect()
Returns the following:
[('#', 1),
 ('Apache', 1),
 ('Spark', 14),
 ('is', 6),
 ('It', 2),
 ('provides', 1),
 ('high-level', 1),
 ('APIs', 1),
 ('in', 5),
 ('Scala,', 1),
 ('Java,', 1),
 ('an', 3),
 ('optimized', 1),
 ('engine', 1),
 ('supports', 2),
 ('computation', 1),
 ('analysis.', 1),
 ('set', 2),
 ('of', 5),
 ('tools', 1),
 ('SQL', 2),
 ('MLlib', 1),
 ('machine', 1),
 ('learning,', 1),
 ('graphX', 1),
 ('graph', 1),
 ('processing,', 1),
 ('Documentation', 1),
 ... (there is more data, but I cut it here)]
As you can see, the most frequent word is Spark, with 14 occurrences.
How can I show only that word and its count?
I have since found that this can be done with the takeOrdered function, using a lambda as the key to order by the last element of each pair (the total).
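For anyone who lands here, this is roughly what the takeOrdered approach looks like. It is only a sketch: the helper name most_frequent is made up for illustration, and it assumes wordCounts is the RDD of (word, count) pairs built above. takeOrdered returns the smallest elements according to the key, so the count is negated to get the largest first.

```python
def most_frequent(word_counts, n=1):
    """Return the n (word, count) pairs with the highest counts.

    word_counts is assumed to be an RDD of (word, count) pairs,
    such as the wordCounts RDD from the question.
    """
    # takeOrdered orders ascending by the key, so negate the count
    # (the last element of each pair) to put the largest counts first.
    return word_counts.takeOrdered(n, key=lambda pair: -pair[1])


# Usage in the PySpark shell, where wordCounts already exists:
#   most_frequent(wordCounts)
```

If Spark really is the most frequent word in the file, this should return a one-element list holding the ('Spark', 14) pair. When several words tie for the top count, which of them comes back is not something I would rely on.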
It can also be done this way: