Some performance suggestions for Spark #4

Open
@koeninger

Description

From reading http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at

First, it appears you had fewer Kafka partitions than worker nodes. Because there's a 1:1 relationship between Spark partitions and Kafka partitions, you're not going to get full utilization of your workers. As you noted, shuffling can be expensive, so you're better off doing the partitioning at the time you produce into Kafka (i.e. add more Kafka partitions).
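Here's a minimal sketch of what I mean, assuming the 0.8 direct stream API and made-up broker/topic names: each Kafka partition maps to exactly one Spark partition, so the only way to get more parallelism after consuming is `repartition()`, which pays a full shuffle every batch.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// local[2] is just for running the sketch locally
val ssc = new StreamingContext(
  new SparkConf().setAppName("partition-sketch").setMaster("local[2]"), Seconds(1))

// Broker address and topic name are placeholders
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("ad-events"))

stream.foreachRDD { rdd =>
  // This always equals the topic's Kafka partition count; if that's lower
  // than your total worker cores, some cores sit idle every batch.
  println(s"spark partitions = ${rdd.partitions.length}")
}

// stream.repartition(n) would spread the load across more cores, but at the
// cost of a shuffle per batch; creating the topic with more partitions moves
// that cost to produce time instead.
```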

Second, groupByKey is almost always a bad idea in Spark. You want e.g. reduceByKey to get aggregation work done before the shuffle (this is similar to a Hadoop combiner).
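A quick illustration (hypothetical campaign-count pairs, not the benchmark's actual code): both transformations below produce the same per-key sums, but reduceByKey combines values map-side before the shuffle, while groupByKey ships every individual record across the network first.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("agg-sketch").setMaster("local[2]"))
val events = sc.parallelize(Seq(("campaignA", 1L), ("campaignB", 1L), ("campaignA", 1L)))

// Shuffles every record, then sums on the reduce side.
val viaGroup = events.groupByKey().mapValues(_.sum)

// Emits one partial sum per key per partition before the shuffle,
// like a Hadoop combiner, so far less data crosses the network.
val viaReduce = events.reduceByKey(_ + _)
```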

Finally, it's not clear from the code whether the JSON and Redis code paths are identical across the different benchmarks. For instance, there's a comment in the code that Redis caching is not being done in the Spark case, but is in the Storm case. A fair comparison can't be made without controlling for variables like these.

If you want assistance with anything, let me know.
