What is the conceptual problem with the following snippet of Apache Spark
code, which is meant to work on very large data? Note that the collect()
function returns a Java collection, and Java collections (from Java 8 onwards)
can be processed with map and reduce functions via the streams API.
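
The original snippet is not reproduced here; a hypothetical reconstruction of the pattern the question describes might look like the following (the input path, class name, and the character-count computation are placeholders, not the textbook's actual code):

    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CollectProblem {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("CollectProblem");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input path.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

                // collect() ships the ENTIRE RDD to the driver as a List...
                List<String> collected = lines.collect();

                // ...so this Java 8 stream map/reduce runs on the driver alone.
                int totalChars = collected.stream()
                                          .map(String::length)
                                          .reduce(0, Integer::sum);
                System.out.println(totalChars);
            }
        }
    }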
The problem with the code is that the collect() function gathers the entire
RDD at a single node (the driver). The map and reduce functions are then
executed on that single node rather than in parallel across the cluster as
intended, and with very large data the collected result may not even fit in
the driver's memory.
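
A minimal sketch of the intended distributed version, under the same assumptions as the reconstruction above: map and reduce are invoked on the RDD itself, so each partition is processed in parallel on the executors, and only the single reduced value is sent back to the driver.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DistributedVersion {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("DistributedVersion");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical input path.
                JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

                // RDD map() and reduce(): partitions are processed in parallel
                // on the executors; only the reduced value reaches the driver.
                int totalChars = lines.map(String::length)
                                      .reduce(Integer::sum);
                System.out.println(totalChars);
            }
        }
    }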