What is the conceptual problem with the following snippet of Apache Spark
code, which is meant to work on very large data? Note that the collect()
function returns a Java collection, and Java collections (from Java 8 onwards)
can be processed with map and reduce functions via the streams API.
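
The original snippet is not reproduced here; a hypothetical reconstruction of the pattern the question describes might look like the following (the input path, class name, and the character-count computation are placeholders, not the textbook's actual code):

    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CollectProblem {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("CollectProblem");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input path.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

                // collect() ships the ENTIRE RDD to the driver as a List...
                List<String> collected = lines.collect();

                // ...so this Java 8 stream map/reduce runs on the driver alone.
                int totalChars = collected.stream()
                                          .map(String::length)
                                          .reduce(0, Integer::sum);
                System.out.println(totalChars);
            }
        }
    }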
The problem with the code is that the collect() function gathers the entire
RDD at a single node (the driver). The map and reduce functions are then
executed on that single node rather than in parallel across the cluster as
intended, and with very large data the collected result may not even fit in
the driver's memory.
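
A minimal sketch of the intended distributed version, under the same assumptions as the reconstruction above: map and reduce are invoked on the RDD itself, so each partition is processed in parallel on the executors, and only the single reduced value is sent back to the driver.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DistributedVersion {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("DistributedVersion");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input path.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

                // RDD map() and reduce(): partitions are processed in parallel
                // on the executors; only the reduced value reaches the driver.
                int totalChars = lines.map(String::length)
                                      .reduce(Integer::sum);
                System.out.println(totalChars);
            }
        }
    }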