Description
From reading
http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
a few observations:
First, it appears you had fewer Kafka partitions than you had worker nodes. Because there's a 1:1 relationship between Spark partitions and Kafka partitions, you're not going to get full utilization of your workers without a repartition. As you noted, shuffling can be expensive, so you're better off doing the partitioning at the time you produce into Kafka (i.e., add more Kafka partitions).
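To make the 1:1 mapping concrete, here's a minimal sketch using the Spark 1.x direct Kafka API that was current at the time; the broker addresses, topic name, and partition counts are made up for illustration, not taken from the benchmark code:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("partition-sketch"), Seconds(1))

    // Hypothetical brokers and topic.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("ad-events")

    // With the direct stream, each batch RDD gets exactly one Spark
    // partition per Kafka partition -- so with, say, 5 Kafka partitions
    // and 10 worker cores, half the cores sit idle unless you shuffle.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // repartition() restores parallelism but pays a full shuffle every
    // batch; adding Kafka partitions at produce time avoids that cost.
    // val widened = stream.repartition(10)

    ssc.start()
    ssc.awaitTermination()
  }
}
```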
Second, groupByKey is almost always a bad idea in Spark. You want reduceByKey (or a similar combining operation) to get aggregation work done before the shuffle; this is analogous to a Hadoop combiner.
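A toy standalone sketch of the difference (not the benchmark's actual code; the key/count data is invented):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ReduceVsGroup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("reduce-vs-group").setMaster("local[2]"))

    // Toy stand-in for the benchmark's (campaign_id, event) pairs.
    val events = sc.parallelize(
      Seq("a" -> 1L, "b" -> 1L, "a" -> 1L, "a" -> 1L, "b" -> 1L))

    // groupByKey ships every (key, value) pair across the shuffle,
    // then aggregates on the reduce side.
    val grouped = events.groupByKey().mapValues(_.sum)

    // reduceByKey pre-aggregates within each partition first, so only
    // one partial sum per key per partition crosses the network
    // (the Hadoop-combiner analogy).
    val reduced = events.reduceByKey(_ + _)

    println(grouped.collect().toMap) // Map(a -> 3, b -> 2)
    println(reduced.collect().toMap) // Map(a -> 3, b -> 2)
    sc.stop()
  }
}
```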
Finally, it's not clear from the code whether the JSON and Redis code paths are identical across the different benchmarks. For instance, there's a comment in the code indicating that Redis caching is not being done in the Spark case but is in the Storm case. It doesn't seem like a fair comparison can be made without controlling for variables like these.
If you want assistance with anything, let me know.