Spark Tips and Tricks

February 25, 2015 - BigData / Spark

Introduction

In this post, I provide tips on how best to use Spark for specific use cases, along with some lessons learned (tricks) that may be useful. The content is grouped by topic.

Enjoy!

Data Loading

Text File Data

In cases where your data is split across many different text files and you wish to run an aggregate analysis, use SparkContext.wholeTextFiles, which returns a pair RDD of (filename, contents), so that each file’s contents stay together during the aggregate analysis. [1]
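
For illustration, here is a minimal Scala sketch of a per-file aggregate. The input path and the record format (one file of comma-separated numbers each) are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PerFileAverages"))

// wholeTextFiles yields (filename, fileContents) pairs -- one record per file
val files = sc.wholeTextFiles("hdfs:///data/readings/*.txt") // hypothetical path

// Per-file aggregate: the average of the comma-separated numbers in each file
val avgPerFile = files.mapValues { contents =>
  val nums = contents.split(",").map(_.trim.toDouble)
  nums.sum / nums.length
}

avgPerFile.collect().foreach { case (file, avg) => println(s"$file -> $avg") }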

Spark Streaming

Performance

Monitor the performance of your Streaming application to gather useful statistics, such as batch processing times, by navigating to http://<driver>:4040 while your application is running.

To reduce batch processing times and increase the level of parallelism [2]:

  • Increase the number of receivers so that the incoming data is split across more receiver nodes. To do this, establish multiple input DStreams (each input DStream corresponds to exactly one receiver) and merge them using union.
  • Explicitly repartition the input stream using DStream.repartition. This is useful if you have no way to increase the number of receivers.
  • Use method-specific arguments to increase the level of parallelism (e.g. reduceByKey()’s second parameter). A sketch combining all three techniques follows this list.
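
Here is a minimal Scala sketch of the three techniques above; the host name, ports, and partition counts are placeholder assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("ParallelIngest"), Seconds(1))

// 1. Several receivers (one per input DStream), merged with union
val streams = (1 to 4).map(i => ssc.socketTextStream("feed-host", 9000 + i)) // hypothetical source
val unified = ssc.union(streams)

// 2. Explicitly repartition the merged stream before heavy processing
val repartitioned = unified.repartition(8)

// 3. Method-level parallelism: reduceByKey's second argument sets the task count
val counts = repartitioned.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, 8)

counts.print()
ssc.start()
ssc.awaitTermination()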

Sometimes Java garbage collection can introduce long pauses in stream processing. If that is the case, use Java’s Concurrent Mark-Sweep garbage collector. Doing so will reduce pause times but consume more resources overall. For example [3]:

spark-submit --conf spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC App.jar

Caching RDDs in serialized form also reduces Java GC pressure, especially if you use Kryo serialization.
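
A short Scala sketch of serialized caching with Kryo; MyRecord stands in for whatever class your records actually use:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("SerializedCaching")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord])) // MyRecord is a hypothetical record class

// ... build the RDD as usual, then cache it in serialized (rather than deserialized) form:
records.persist(StorageLevel.MEMORY_ONLY_SER)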

Use the Spark configuration parameter spark.cleaner.ttl to specify how long cached/persisted RDDs should stay in memory before being evicted. Lower values can reduce Java GC pressure.
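
For example, to have Spark clean up data older than one hour (spark.cleaner.ttl is given in seconds), a submission might look like:

spark-submit --conf spark.cleaner.ttl=3600 App.jar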

Fault-tolerance

By default, received data is replicated across two nodes, so Spark Streaming can tolerate single-worker failures. [4]

Checkpointing can be used to reduce the time needed to recover from data loss after a node failure, by writing stream state to a reliable file system such as HDFS. A suggested checkpointing interval is every 5-10 batches of data.

To make your driver program fault-tolerant and able to resume processing without data loss, use the StreamingContext.getOrCreate method to establish a StreamingContext, as opposed to simply constructing a new StreamingContext object. This method behaves exactly like new StreamingContext() the first time it runs, but if the driver is restarted after a crash, it reinitializes from the specified checkpoint directory. [5]
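
A minimal Scala sketch of this recovery pattern; the checkpoint directory and the streaming computation itself are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///spark/checkpoints/myApp" // hypothetical directory

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("FaultTolerantApp"), Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the streaming computation here ...
  ssc
}

// First run: builds a fresh context. After a driver crash: rebuilds from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()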

To have your driver restart automatically (if using the standalone cluster manager), use the --supervise and --deploy-mode cluster arguments when submitting your job. For example:

./bin/spark-submit --deploy-mode cluster --supervise --master spark://.../ App.jar

Stateless Transformations

Stateless transformations, like map, filter, reduceByKey, etc., are applied to each RDD (i.e. batch) within a DStream independently at each time step.
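
For instance, a trivial sketch, assuming lines is a DStream[String]:

// Each batch's RDD is transformed independently of every other batch
val lengths = lines.map(_.length)           // map applied per batch
val longLines = lines.filter(_.length > 80) // filter applied per batch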

The transform method is useful for reapplying batch code you already have, written against a single RDD, in a streaming context. It provides arbitrary RDD-to-RDD operations. [6]
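
For example, a sketch in which extractOutliers stands in for an existing batch function and readings is assumed to be a DStream[Double]:

import org.apache.spark.rdd.RDD

// Existing batch function, written against a single RDD
def extractOutliers(rdd: RDD[Double]): RDD[Double] =
  rdd.filter(x => math.abs(x) > 3.0)

// Reuse it on every batch of the stream via transform
val outliers = readings.transform(rdd => extractOutliers(rdd))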

Stateful Transformations

Use stateful transformations when your computation needs data from multiple batches, for example windowed aggregations (reduceByKeyAndWindow) or state tracked across time (updateStateByKey).

Stateful transformations require checkpointing, so make sure to set up a reliable file system such as HDFS across your cluster.
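
As a sketch, a running count per key via updateStateByKey, assuming pairs is a DStream[(String, Int)] and checkpointing is already enabled on the context:

// Carry a running count per key across batches
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)

val runningCounts = pairs.updateStateByKey[Int](updateFunc)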

Other

The foreachRDD method is an output operation, not a transformation. It is similar to the transform method in that it gives you access to each individual RDD (batch), but it does not return a new DStream; instead it lets you run arbitrary actions on each RDD. This is useful for external tasks like writing to a database. [7]
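
A common pattern is one connection per partition rather than per record; createConnection and save below are hypothetical stand-ins for your database client:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = createConnection() // hypothetical connection factory
    records.foreach(record => conn.save(record)) // hypothetical save call
    conn.close()
  }
}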

Do not run Spark Streaming programs locally with the master configured as “local” or “local[1]”. This allocates only one CPU core for tasks, so if a receiver is running on it, there are no resources left to process the received data. Use at least “local[2]” so that there are enough cores. [8]
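
For example, when testing locally:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least two cores: one for the receiver, one left over for processing
val conf = new SparkConf().setMaster("local[2]").setAppName("LocalStreamingTest")
val ssc = new StreamingContext(conf, Seconds(1))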

References

1. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “File Formats: Loading Text Files.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
2, 3. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “Spark Streaming: Performance Considerations.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
4. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “Spark Streaming: Architecture and Abstraction.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
5. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “Spark Streaming: 24/7 Operation.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
6. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “Spark Streaming: Transformations.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
7. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “Spark Streaming: Output.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
8. Karau, Holden, Andy Konwinski, Patrick Wendell, and Matei Zaharia. “Spark Streaming: Input Sources.” Learning Spark. 1st ed. Sebastopol: O’Reilly Media, 2015. 274. Print.
