Performance Tuning of an Apache Kafka/Spark Streaming System (2023)

Editor’s Note: MapR products and solutions sold prior to the acquisition of such assets by Hewlett Packard Enterprise Company in 2019 may have older product names and model numbers that differ from current solutions. For information about current offerings, which are now part of HPE Ezmeral Data Fabric, please visit

Original Post Information:

"authorDisplayName": "Mathieu Dumoulin","publish": "2017-01-17T06:00:00.000Z","tags": "apache-hive"

Real-world case study in the telecom industry

Debugging a real-life distributed application can be a pretty daunting task. Most common Google searches don't turn out to be very useful, at least at first. In this blog post, I will give a fairly detailed account of how we managed to accelerate an Apache Kafka/Spark Streaming/Apache Ignite application by almost 10x and turn a development prototype into a useful, stable streaming application that eventually exceeded the performance goals set for it.

The lessons learned here are fairly general and extend easily to similar systems using MapR Event Store as well as Kafka.

This project serves as a concrete case for the need for a converged platform, one that integrates the full software stack required by this system: real-time streams plus distributed big data processing and persistence. As of this writing, the MapR Data Platform is the only production-ready implementation of such a platform.

Goal of the system

To meet the needs of the telecom company, the goal of the application is to join together the log data from three separate systems. When the data is joined, it becomes possible to correlate the network conditions to a particular call for any particular customer, thus allowing customer support to provide accurate and useful information to customers who are unsatisfied with their phone service. The application has great additional value if it can do this work in real time rather than as a batch job, since call quality information that is 6 hours old has no real value for customer service or network operations.

Basically, this is a fairly straight-up ETL job that would normally be done as a batch job for a data warehouse but now has to be done in real time as a streaming distributed architecture.

More concretely, the overall picture is to stream the input data from a remote server into a distributed cluster, do some data cleaning and augmentation, join the records from the three logs, and persist the joined data as a single table into a database.

The problems with the original system

The original system had several issues centered around performance and stability.

First, the streaming application was not stable. In a Spark Streaming application, the stream is said to be stable if the processing time of each micro-batch is equal to or less than the batch time. In this case, the streaming part of the application was receiving data in 30-second windows but taking between 4.5 and 6 minutes to process each batch.
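The stability condition can be stated as a simple inequality; a minimal sketch, using the figures from this system:

```python
def is_stable(batch_interval_s: float, processing_time_s: float) -> bool:
    """A Spark Streaming app is stable only if each micro-batch is
    processed within the batch interval; otherwise batches queue up."""
    return processing_time_s <= batch_interval_s

# This system: 30-second windows, 4.5-6 minutes of processing per batch.
print(is_stable(30, 4.5 * 60))   # False: the stream falls ever further behind
```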

Second, there is a batch process to join data one hour at a time that was targeted to run in 30 minutes but was taking over 2 hours to complete.

Third, the application was randomly crashing after running for a few hours.

The cluster hardware, software stack, and input data

The cluster hardware is pretty good: 12 enterprise-grade server nodes, each equipped with two Xeon E5 CPUs (16 physical cores each), 256GB of memory, and eight 6TB spinning HDDs, connected by 10Gb Ethernet.

The technology stack selected for this project is centered around Kafka 0.8 for streaming the data into the system, Apache Spark 1.6 for the ETL operations (essentially a bit of filtering and transformation of the input, then a join), and Apache Ignite 1.6 as an in-memory shared cache to make it easy to connect the streaming input part of the application with the part that joins the data. Apache Hive is also used as a disk backup for Ignite in case of failure, and for a separate analytics application.

The initial cluster was configured as follows:

[Cluster services layout table lost in extraction: it mapped the services (such as the Mesos Master and the Spark Workers) to the 12 nodes of the cluster.]
The cluster is running Apache Hadoop's HDFS as a distributed storage layer, with resources managed by Mesos 0.28. Finally, HBase is used as the ultimate data store for the final joined data. It will be queried by other systems outside the scope of this project.


The performance requirement of the system is to handle an input throughput of up to 3GB/min, or 150,000-200,000 events/second, representing the known peak data throughput plus an additional margin. The ordinary throughput is about half that, or 1.5GB/min and 60,000-80,000 events/second.
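Back-of-the-envelope arithmetic on these figures gives an average event size (a sketch, assuming 1 GB = 1024³ bytes):

```python
peak_bytes_per_s = 3 * 1024**3 / 60          # 3 GB/min -> ~53.7 MB/s
for events_per_s in (150_000, 200_000):
    # at peak, each event averages roughly 270-360 bytes
    print(round(peak_bytes_per_s / events_per_s), "bytes/event")
```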

The raw data sources are the logs of three remote systems, labeled A, B, and C here: Log A comprises about 84-85% of the entries, Log B about 1-2%, and Log C about 14-15%. This imbalance in the data is one of the (many) sources of difficulty in this application.

Both Spark applications are coded in Scala 2.10 and use Kafka’s direct approach (no receivers). Apache Ignite has a really nice Scala API with a magic IgniteRDD that allows applications to share in-memory data, a key feature for this system to reduce coding complexity.

The application architecture


The raw data is ingested into the system by a single Kafka producer into Kafka running on 6 servers. The producer reads the various logs and adds each log's records into its own topic. As there are three logs, there are three Kafka topics. Each topic is split into 36 partitions. Most likely, there are 36 partitions because there are 6 nodes, each with 6 disks assigned to HDFS, and the Kafka documentation seems to recommend about one partition per physical disk as a guideline.

The data is consumed by a Spark Streaming application which picks up each topic and then does a simple filter to cut out unnecessary fields, a map operation to transform the data, and a foreachRDD operation (each micro-batch generates an RDD in Spark Streaming) that saves the data to Ignite and to Hive.
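In plain Python, the per-batch logic looks roughly like this (a sketch with hypothetical field names; the real job is the Scala DStream filter/map/foreachRDD pipeline described above):

```python
ignite_cache = []   # stands in for the Ignite (and Hive) sink

def process_batch(records):
    # filter: drop records missing a call ID and cut unnecessary fields
    trimmed = ({"call_id": r["call_id"], "ts": r["ts"], "metric": r["metric"]}
               for r in records if r.get("call_id"))
    # map: transform/augment each record
    transformed = [dict(r, metric=float(r["metric"])) for r in trimmed]
    # foreachRDD equivalent: persist the micro-batch
    ignite_cache.extend(transformed)

process_batch([{"call_id": "a", "ts": 1, "metric": "3.5"},
               {"call_id": None, "ts": 2, "metric": "1.0"}])
print(len(ignite_cache))  # 1: the record without a call ID was filtered out
```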


The streaming app is very straightforward: map, filter, and foreach partition to save to Ignite

A second "regular" Spark application runs on the data stored in-memory by Ignite to join the records from the three separate logs into a single table in batches of 1 hour. This job is done using Spark's DataFrame API, which is ideally suited to the task. The second part involves no more than 100GB worth of data, and the cluster hardware is properly sized to handle that amount of data.

Three hours of data are accumulated into Ignite, because the vast majority of calls last for less than an hour, and we want to run the join on one hour’s worth of data at a time. Since some calls will start in one batch and finish in another, the system keeps three hours and only processes the middle one-hour batch, thus the join can succeed on close to 100% of the records.
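A minimal sketch of this windowing rule (pure Python with hypothetical record shapes; the real join uses Spark DataFrames over data held in Ignite):

```python
from collections import defaultdict

HOUR = 3600

def join_middle_hour(records, window_start):
    """records: (source, call_id, ts, payload) tuples spanning 3 hours.
    Join a call only if its log-A entry falls in the middle hour; its
    B/C entries may spill into the adjacent hours."""
    mid_lo, mid_hi = window_start + HOUR, window_start + 2 * HOUR
    by_call = defaultdict(dict)
    for source, call_id, ts, payload in records:
        by_call[call_id][source] = (ts, payload)
    return [cid for cid, parts in by_call.items()
            if {"A", "B", "C"} <= parts.keys()
            and mid_lo <= parts["A"][0] < mid_hi]

recs = [("A", "x", 4000, {}), ("B", "x", 3500, {}), ("C", "x", 7300, {}),
        ("A", "y", 100, {})]          # "y" started before the middle hour
print(join_middle_hour(recs, 0))      # ['x']
```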

It’s worth noting that a better all-streaming architecture could have avoided the whole issue with the intermediate representation in the first place. This is an illustrative, real-world case: spending more time and thought up-front can finish the entire project faster than rushing headlong into coding the first working solution that comes to mind.

Performance tuning

The main issues for these applications were caused by trying to run a development system's code, tested on AWS instances, on a physical, on-premises cluster running against real data. The original developer was never given access to the production cluster or the real data.

Apache Ignite was a huge source of problems, principally because it is such a new project that nobody had any real experience with it and also because it is not a very mature project yet.

First target: Improve Spark Streaming performance

The Spark Streaming application was taking about 4.5 minutes per batch, and the project goal was about 30 seconds. We needed to find improvements worth a 9x speedup, and due to time constraints, we couldn’t afford to change any code!

The system had to be ready for production testing within a week, so the code from the architecture and algorithm point of view was assumed to be correct and good enough that we could reach the performance requirement only with tuning.

Fix RPC timeout exceptions


We found the correct solution from somebody who had the same problem, as seen in SPARK-14140 in JIRA: increase spark.executor.heartbeatInterval from 10s to 20s.

I think this problem may be caused by nodes getting busy from disk or CPU spikes because of Kafka, Ignite, or garbage collector pauses. Since Spark runs on all nodes, the issue was random. (See the cluster services layout table in the first section.)

The configuration change fixed this issue completely. We haven’t seen it happen since.
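Expressed as Spark configuration key/values, the fix looks like this (a sketch; the property name is standard, and the value is the one recommended in SPARK-14140):

```python
conf = {
    # Default is 10s; raising it tolerates executors briefly stalled by
    # disk/CPU spikes or GC pauses without triggering RPC timeouts.
    "spark.executor.heartbeatInterval": "20s",
}
print(conf["spark.executor.heartbeatInterval"])
```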

Increase driver and executor memory

Out of memory issues and random crashes of the application were solved by increasing the memory from 20g per executor to 40g per executor as well as 40g for the driver. Happily, the machines in the production cluster were heavily provisioned with memory. This is a good practice with a new application, since you don’t know how much you will need at first.

The issue was difficult to debug with precision, lacking accurate information, since the Spark UI reports very little memory consumption. In practice, as this setting is easy to change, we empirically settled on 40g being the smallest memory size for the application to run stably.
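As Spark configuration, the sizing we settled on would look like this (a sketch using the standard memory property names):

```python
memory_conf = {
    "spark.driver.memory": "40g",
    "spark.executor.memory": "40g",   # up from 20g per executor
}
```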

Increase parallelism: increase number of partitions in Kafka

The input data was unbalanced, and most of the application processing time was spent processing Topic 1 (with 85% of the throughput). Kafka partitions are matched 1:1 with the number of partitions in the input RDD, leading to only 36 partitions, meaning we can only keep 36 cores busy on this task. To increase the parallelism, we need to increase the number of partitions. So we split topic 1 into 12 topics, each with 6 partitions, for a total of 72 partitions. We made a simple modification to the producer to divide the data evenly from the first log into 12 topics, instead of just one. Zero code needed to be modified on the consumer side.

We also right-sized the number of partitions for the two other topics, in proportion to their relative importance in the input data, so we set topic 2 to 2 partitions and topic 3 to 8 partitions.
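The producer-side split can be sketched as follows (the topic names and the CRC32-based routing are illustrative assumptions, not the project's actual code):

```python
import zlib

NUM_LOG_A_TOPICS = 12       # 12 topics x 6 partitions = 72 partitions total
PARTITIONS_PER_TOPIC = 6

def topic_for_log_a(record_key: str) -> str:
    # Deterministically spread log A's records across the 12 topics.
    return f"logA-{zlib.crc32(record_key.encode()) % NUM_LOG_A_TOPICS}"

print(NUM_LOG_A_TOPICS * PARTITIONS_PER_TOPIC)   # 72 cores can now stay busy
print(topic_for_log_a("call-0042"))              # one of logA-0 .. logA-11
```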


Running more tasks in parallel. Before tuning, each stage always had 36 partitions!

Right-size the executors

The original application was running only 3 executors with 72 total cores. We configured the application to run with 80 cores at a maximum of 10 cores per executor, for a total of 8 executors. Note that with 16 real cores per node on a 10-node cluster, we’re leaving plenty of resources for Kafka brokers, Ignite, and HDFS/NN to run on.
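In Spark standalone terms, this sizing corresponds to the following (a sketch with standard property names):

```python
cores_conf = {
    "spark.cores.max": "80",        # total cores for the application
    "spark.executor.cores": "10",   # cores per executor
}
executors = (int(cores_conf["spark.cores.max"])
             // int(cores_conf["spark.executor.cores"]))
print(executors)  # 8 executors
```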

Increase the batch window from 30s to 1m

The data is pushed into Kafka by the producer as batches every 30s, as it is gathered by FTP batches from the remote systems. Such an arrangement is common in telecom applications due to a need to deal with equipment and systems from a bewildering range of manufacturers, technologies, and ages.

This meant that the input stream was very lumpy, as shown in the screenshot of Spark UI's Streaming tab:


Increasing the window to 1m allowed us to smooth out the input and gave the system a chance to process the data in 1 minute or less and still be stable.

To make sure of it, the team generated test data that simulated the known worst-case data, and with the new settings, the Spark Streaming job was now indeed stable. The team could also switch easily between the test data and the real production data stream, and throttle the producers to configure how much data to let into the system. This was extremely helpful for quickly testing various configurations and seeing whether we had made progress.

Drop requirement to save to Hive, only use Ignite


Discussion with the project managers revealed that Hive was not actually part of the requirements for the streaming application! Mainly, this is because the data in HBase could just as easily be used by the analytics; also, in the context of this application, each individual record doesn't actually need to be processed with a 100% guarantee.

Indeed, in light of the goal of the system, the worst-case scenario for missing data is that a customer's call quality information cannot be found... which is already the case. In other words, the risk of data loss is not a deal-breaker, and the upside to gaining data is additional insights. As long as the great majority of the data is processed and stored, the business goals can be reached.

Results of all optimizations

The streaming application finally became stable, with an optimized runtime of 30-35s.

As it turns out, cutting out Hive also sped up the second Spark application that joins the data together, so that it now ran in 35m, which meant that both applications were now well within the project requirements.

With the improvements from the next part, the final processing time of the Spark Streaming job went down into the low-20-second range, for a final speedup of a bit over 12x.

Second target: Improve System Stability

We had to work quite hard on stability. Several strategies were required, as we will explain below.

Make the Spark Streaming application stable

The work we did to fix the performance had a direct impact on system stability. If both applications are stable themselves and running on right-sized resources, then the system has the best chance to be stable overall.

Remove Mesos and use Spark Standalone

The initial choice of Mesos to manage resources was forward-looking, but ultimately we decided to drop it from the final production system. At the outset, the plan was to have Mesos manage all the applications. But the team never could get Kafka and Ignite to play nicely with Mesos, so they were running in standalone mode, leaving only Spark managed by Mesos. With more time, there is little doubt that all the applications could have been properly configured to work with Mesos.

Proposing to remove Mesos was a bit controversial, as Mesos is much more advanced and cool than Spark running in standalone mode.


But the issue with Mesos was twofold:

  1. Control over executor size and number was poor, a known issue (SPARK-5095) with Spark 1.6 and fixed in Spark 2.0.
  2. Ignite and Kafka weren’t running inside Mesos, just Spark. Because of schedule pressure, the team had given up on trying to get those two services running in Mesos.

Mesos can only ever allocate resources well if it actually controls resources. In the case of this system, Kafka and Ignite are running outside of Mesos’ knowledge, meaning it’s going to assign resources to the Spark applications incorrectly.

In addition, it’s a single-purpose cluster, so we can live with customizing the sizing of the resources for each application with a global view of the system’s resources. There is little need for dynamic resource allocations, scheduling queues, multi-tenancy, and other buzzwords.

Change the Ignite memory model

It is a known issue that when the heap controlled by the JVM gets very big (>32GB), the cost of garbage collection is quite large. We could indeed see this problem when the join application runs: the stages with 25GB shuffle had some rows with spikes in GC time, ranging from 10 seconds up to more than a minute.

The initial configuration of Ignite ran ONHEAP_TIERED with 48GB worth of data cached on heap, with overflow spilling to 12GB of off-heap memory. That setting was changed to the OFFHEAP_TIERED model. While slightly slower due to serialization costs, OFFHEAP_TIERED doesn't result in big garbage collections. It still runs in memory, so we estimated it would be a net gain.

With this change, the run time for each batch dutifully came down by about five seconds, from 30 seconds down to about 25 seconds. In addition, successive batches tended to have much more similar processing time with a delta of 1-3 seconds, whereas it would previously vary by over 5 to 10 seconds.

Update the Ignite JVM settings

We followed the recommended JVM options found in the performance tuning section of the Ignite documentation.

Improve the Spark code

Some parts of the code assumed reliability, like queries to Ignite, when in fact there was a possibility of the operations failing. These problems can be fixed in the code, which now handles exceptions more gracefully, though there is probably work left to increase the robustness of the code. We can only find these spots by letting the application run now.
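One simple way to harden such call sites is a bounded retry wrapper, sketched here in Python (the project's actual fix was in its Scala code; the flaky query below is a stand-in for a real Ignite call):

```python
import time

def with_retry(fn, attempts=3, backoff_s=0.0):
    """Call fn(); on failure, retry up to `attempts` times, then re-raise."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(backoff_s)

calls = {"n": 0}
def flaky_ignite_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("cache not ready")
    return "row"

print(with_retry(flaky_ignite_query))  # prints 'row' after two failed attempts
```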

Reassign ZooKeeper to nodes 10-12

Given that the cluster is medium-sized, it’s worth spreading the services as much as possible. We moved the ZooKeeper services from nodes 1-3 to nodes 10-12.


Tuning this application took about one week of full-time work. Our main sources of information were the Spark UI and the Spark logs, which are easily accessible from the UI. The views of Jobs and Stages, as well as the streaming UI, are really very useful.

What I learned

  • Migrating a streaming application from a prototype on AWS to an on-premises cluster requires scheduling time for testing
  • Not testing the AWS prototype with realistic data was a big mistake
  • Including many “bleeding-edge” OSS components (Apache Ignite and Mesos) with expectations of very high reliability is unrealistic
  • A better architecture design could have simplified the system tremendously
  • Tuning a Kafka/Spark Streaming application requires a holistic understanding of the entire system. It’s not simply about changing the parameter values of Spark; it’s a combination of the data flow characteristics, the application goals and value to the customer, the hardware and services, the application code, and then playing with Spark parameters.
  • MapR Data Platform would have cut the development time, complexity, and cost for this project.

The project is a first for this particular telecom company, and they decided to go all-out on such an advanced, 100% open-source platform. They should be applauded for their pioneering spirit. But a better choice of platform and application architecture would have made their lives a lot easier.

The need for a converged big-data platform is now

In fact, the requirements for this project show the real-world business need for a state-of-the-art converged platform with a fast distributed file system, a high-performance key-value store for persistence, and real-time streaming capabilities.


A MapR solution could probably skip the requirement for a still-speculative open-source project like Ignite, since the full software stack required by the architecture is already built in and fully supported. Given that this system is heading into production for a telecom operator with 24/7 reliability expectations, such an advantage is considerable.


How can I improve my Spark streaming speed? ›

Use DataFrame/Dataset over RDD. For Spark jobs, prefer using Dataset/DataFrame over RDD as Dataset and DataFrame's includes several optimization modules to improve the performance of the Spark workloads. In PySpark use, DataFrame over RDD as Dataset's are not supported in PySpark applications.

What method does Spark use to perform streaming operations? ›

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.

How does Kafka improve performance? ›

Increasing the number of partitions and the number of brokers in a cluster will lead to increased parallelism of message consumption, which in turn improves the throughput of a Kafka cluster; however, the time required to replicate data across replica sets will also increase.

How does Spark streaming process the data from Kafka? ›

This approach uses a Receiver to receive the data. The Receiver is implemented using the Kafka high-level consumer API. As with all receivers, the data received from Kafka through a Receiver is stored in Spark executors, and then jobs launched by Spark Streaming processes the data.

What is Spark performance tuning? ›

What is Spark Performance Tuning? Spark Performance Tuning refers to the process of adjusting settings to record for memory, cores, and instances used by the system. This process guarantees that the Spark has a flawless performance and also prevents bottlenecking of resources in Spark.

How do I improve streaming performance? ›

Here's some things you can do to achieve that:
  1. Close other applications and programs. ...
  2. Pause the stream for a few moments. ...
  3. Reduce video quality. ...
  4. Speed up your internet connection. ...
  5. Remove other devices connected to your network. ...
  6. Update graphics card drivers. ...
  7. Try a wired Ethernet connection. ...
  8. Clean up your browser settings.
30 Jan 2020

How is streaming implemented in Spark explain with examples 7? ›

How does Spark Streaming works? In Spark Streaming divide the data stream into batches called DStreams, which internally is a sequence of RDDs. The RDDs process using Spark APIs, and the results return in batches. Spark Streaming provides an API in Scala, Java, and Python.

How is Kafka used as a stream processing? ›

Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics (or calls to external services, or updates to databases, or whatever). It lets you do this with concise code in a way that is distributed and fault-tolerant.

What are some of the ways of processing streaming data in Apache spark? ›

Spark Streaming comes with several API methods that are useful for processing data streams. There are RDD-like operations like map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey , and join. It also provides additional API to process the streaming data based on window and stateful operations.

How do I check Kafka performance? ›

You can do this using the load generation tools that ship with Kafka, kafka-producer-perf-test, and kafka-consumer-perf-test. The kafka-*-perf-test tools can be used in several ways. In general, it is expected that these tools be used on a test or development cluster.

How do you improve Kafka producer performance? ›

  1. Provision your Kafka cluster.
  2. Initialize the project.
  3. Write the cluster information into a local file.
  4. Download and setup the Confluent CLI.
  5. Create a topic.
  6. Run a baseline producer performance test.
  7. Run a producer performance test with optimized throughput.
  8. Teardown Confluent Cloud resources.

What makes Apache Kafka so fast? ›

Kafka uses many other techniques apart from the ones mentioned above to make systems much faster and efficient: Batching of data to reduce network calls, and also converting a lot of random writes into sequential ones. Compression of batches (and not individual messages) using LZ4, SNAPPYor GZIP codecs.

When should I use Spark Streaming? ›

You should use Spark Structured Streaming for your streaming applications and pipelines.

What is the difference between Kafka and Spark Streaming? ›

Kafka analyses the events as they unfold. As a result, it employs a continuous (event-at-a-time) processing model. Spark, on the other hand, uses a micro-batch processing approach, which divides incoming streams into small batches for processing.

Which component can be used to perform streaming data analysis in Spark? ›

Advanced Analytics: Apache Spark also supports “Map” and “Reduce” that has been mentioned earlier. However, at the side of MapReduce, it supports Streaming data, SQL queries, Graph algorithms, and Machine learning. Thus, Apache Spark may be used to perform advanced analytics.

What are the types of performance tuning? ›

There are two distinct types of tuning: Proactive Monitoring. Bottleneck Elimination.

How do you perform a performance tune? ›

Systematic tuning follows these steps:
  1. Assess the problem and establish numeric values that categorize acceptable behavior.
  2. Measure the performance of the system before modification.
  3. Identify the part of the system that is critical for improving the performance. ...
  4. Modify that part of the system to remove the bottleneck.

How do you measure Spark performance? ›

The most useful metrics for Spark performance analysis include:
  1. Average time spent on tasks and assignments.
  2. The amount of memory used.
  3. The amount of CPU used compared to the CPU used by garbage collection.
  4. The number of data records written and retrieved to/from disk in shuffle operations.
  5. Disk I/O statistics.
30 Mar 2022

What improves stream quality? ›

If you want to improve your live streaming quality, use an Ethernet cable connection instead of Wi-Fi, as Wi-Fi connections can be unstable, which will affect your results. If you have access to a cable connection, you'll have more stability, and your stream quality will increase.

What affects streaming performance? ›

Streaming affects your FPS for one simple reason, and that is the amount of resources required for streaming. Not only is it resource intensive on your GPU, but it can be quite a struggle for your CPU, ram and even your harddrive.

How can I stream without losing performance? ›

How to Stop Dropped Frames in OBS
  1. Reduce the resolution of your streams. In general, the higher the stream resolution, the higher the CPU usage. ...
  2. Reduce the bitrate of your streams. ...
  3. If live streaming over WiFi try using an ethernet cable. ...
  4. Check your firewall settings. ...
  5. Broadcast to a server that is closest to you.

Do Spark Streaming programs run continuously? ›

Users specify a streaming computation by writing a batch computation (using Spark's DataFrame/Dataset API), and the engine automatically incrementalizes this computation (runs it continuously).

What is the difference between Spark and Spark Streaming? ›

Generally, Spark streaming is used for real time processing. But it is an older or rather you can say original, RDD based Spark structured streaming is the newer, highly optimized API for Spark. Users are advised to use the newer Spark structured streaming API for Spark. Hope this will clear your doubt.

What is batch duration in Spark Streaming? ›

A batch interval tells spark that for what duration you have to fetch the data, like if its 1 minute, it would fetch the data for the last 1 minute. source: So the data would start pouring in a stream in batches, this continuous stream of data is called DStream.

How fast is Kafka streams? ›

Written By
Peak Throughput (MB/s)605 MB/s305 MB/s
p99 Latency (ms)5 ms (200 MB/s load)25 ms (200 MB/s load)
21 Aug 2020

What is Kafka stream time? ›

Kafka Streams assigns a timestamp to every data record via so-called timestamp extractors. These per-record timestamps describe the progress of a stream with regards to time (although records may be out-of-order within the stream) and are leveraged by time-dependent operations such as joins.

Why do we need Kafka streaming? ›

Kafka Streams greatly simplifies the stream processing from topics. Built on top of Kafka client libraries, it provides data parallelism, distributed coordination, fault tolerance, and scalability.

What is correct flow for Spark Streaming architecture? ›

Architecture of Spark Streaming: Discretized Streams

As we know, continuous operator processes the streaming data one record at a time. Despite, processing one record at a time, it discretizes data into tiny, micro-batches. We can also say, spark streaming's receivers accept data in parallel.

What are the main 6 Hadoop Streaming components require for execution of Hadoop Streaming? ›

Hadoop Streaming architecture
  • Input Reader/Format.
  • Key Value.
  • Mapper Stream.
  • Key-Value Pairs.
  • Reduce Stream.
  • Output Format.
  • Map External.
  • Reduce External.
13 Jun 2022

What type of processing does Apache Spark use? ›

Apache Spark is an open-source, distributed processing system used for big data workloads. It utilizes in-memory caching, and optimized query execution for fast analytic queries against data of any size.

How can I check my performance test? ›

How to Do Performance Testing?
  1. Identify the Test Environment and Tools. Identify the production environment, testing environment, and testing tools at your disposal. ...
  2. Define Acceptable Performance Criteria. ...
  3. Plan and Design Tests. ...
  4. Prepare Test Environment and Tools. ...
  5. Run the Performance Tests. ...
  6. Resolve and Retest.

How many messages can Kafka process per second? ›

How many messages can Apache Kafka® process per second? At Honeycomb, it's easily over one million messages.

How do I know if my Kafka cluster is healthy? ›

a) Check the kafka port listed in the server. properties file with the netstat command, port 9092. The below output shows that there is a java process listening indicating good results. The below output shows that there is a java process listening indicating good results.

What is one of the methods Kafka employs to optimize write speeds? ›

Kafka's speed is tremendously improved by the implementation of the zero-copy principle.

How can I reduce lag in Kafka topic? ›

How about increase the partitions of topic and also increase consumers up to partitions. Consuming concurrency can increase performance. If you store offsets on the zookeeper, it can be bottleneck. Reduce commits of offset and use dedicated zookeeper if possible.

How do I reduce Kafka latency? ›

Apache Kafka® provides very low end-to-end latency for large volumes of data. This means the amount of time it takes for a record that is produced to Kafka to be fetched by the consumer is short. If you're using a dedicated cluster, adding additional CKUs can reduce latency.

What are the advantages of using Apache Kafka? ›

Apache Kafka is massively scalable because it allows data to be distributed across multiple servers, and it's extremely fast because it decouples data streams, which results in low latency. It can also distribute and replicate partitions across many servers, which protects against server failure.

Why Kafka is efficient? ›

One of the strengths of Kafka is its ability to process large volumes of data in real-time. This feature enables the analysis of data and deduction of meaning from large data inflows within the shortest time possible. Platforms relying on Kafka do not have to maintain an extensive database of unprocessed records.

What is the maximum size of data for Kafka? ›

Kafka has a default limit of 1MB per message in the topic.

How do I optimize a Spark for Streaming? ›

Procedure. Select the Set Tuning properties check box to optimize the allocation of the resources to be used to run this Job. These properties are not mandatory for the Job to run successfully, but they are useful when Spark is bottlenecked by any resource issue in the cluster such as CPU, bandwidth or memory.

How can I improve my Spark Streaming speed? ›

Use DataFrame/Dataset over RDD. For Spark jobs, prefer using Dataset/DataFrame over RDD as Dataset and DataFrame's includes several optimization modules to improve the performance of the Spark workloads. In PySpark use, DataFrame over RDD as Dataset's are not supported in PySpark applications.

How does Spark Streaming work with Kafka?

Spark Streaming's Kafka integration provides parallelism between Kafka partitions and Spark partitions, along with mutual access to metadata and offsets. A direct stream can also be created for an input stream to pull messages straight from Kafka, without a receiver.
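As a hedged sketch: in current Spark versions the direct, receiver-less approach is expressed through Structured Streaming's built-in Kafka source, which reads each Kafka partition as a Spark task and tracks offsets in its checkpoint rather than in ZooKeeper. Broker address, topic name, and checkpoint path below are placeholders; running it requires a Spark installation with the spark-sql-kafka package and a reachable Kafka cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-direct-example").getOrCreate()

# Kafka source: one Spark input partition per Kafka partition,
# offsets tracked via the checkpoint location, no receiver process.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
      .option("subscribe", "calls")                       # placeholder topic
      .load())

query = (df.selectExpr("CAST(value AS STRING)")
         .writeStream.format("console")
         .option("checkpointLocation", "/tmp/kafka-example-ckpt")
         .start())
query.awaitTermination()
```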

What is the main disadvantage of Spark Streaming?

Apache Spark Streaming Cons

There are known issues with memory management and latency, and there is no true record-at-a-time real-time processing. It works for use cases that tolerate around five seconds of latency, but not for millisecond-level, IoT-based, or anomaly-detection workloads; a framework such as Apache Flink is often a better fit there.

Is Spark Streaming stateful?

One of the most powerful features of Spark Streaming is the simple API for stateful stream processing and the associated native, fault-tolerant, state management.

Is Spark Streaming exactly once?

At most once: each message may or may not be processed; if it is processed, it is processed only once. At least once: each message is guaranteed to be processed, but may be processed more than once. Exactly once: each message is guaranteed to be processed once and only once.
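The difference between the first two guarantees comes down to when the consumer commits its offset relative to processing. A toy simulation (illustrative only, not Kafka client code): committing before processing gives at-most-once, because a crash between commit and processing loses the in-flight message; committing after gives at-least-once, because a crash between processing and commit replays it.

```python
def consume(messages, commit_before_processing, crash_at):
    """Process messages, crashing once at index `crash_at` between the
    commit and processing steps, then resume from the committed offset."""
    processed = []
    committed = 0          # next offset to read after a restart
    crashed = False
    i = 0
    while i < len(messages):
        if commit_before_processing:
            committed = i + 1              # commit first ...
            if i == crash_at and not crashed:
                crashed = True
                i = committed              # restart: message i was skipped
                continue
            processed.append(messages[i])  # ... then process
        else:
            processed.append(messages[i])  # process first ...
            if i == crash_at and not crashed:
                crashed = True
                i = committed              # restart: message i replays
                continue
            committed = i + 1              # ... then commit
        i += 1
    return processed

# At-most-once: "b" is lost when the crash hits after its commit.
print(consume(["a", "b", "c"], True, 1))   # ['a', 'c']
# At-least-once: "b" is processed twice after the replay.
print(consume(["a", "b", "c"], False, 1))  # ['a', 'b', 'b', 'c']
```

Exactly-once then requires making the replayed duplicates harmless, typically via idempotent or transactional output.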

What is Rate source in Spark Streaming?

The rate source auto-generates data at a fixed rate, which we can then print to a console; it is useful for testing. To shape the micro-batches of the input stream, set properties as needed, such as rowsPerSecond: how many rows should be generated per second.

Why is Spark so slow?

Sometimes Spark runs slowly because too many concurrent tasks are running and contending for resources. High concurrency is normally a beneficial feature, as Spark-native fine-grained sharing maximizes resource utilization while cutting query latencies, but beyond the point of contention it hurts performance.


Why are your Spark apps slow or failing?

Out of memory at the executor level is a very common issue with Spark applications. The most common causes are high concurrency, inefficient queries, and incorrect configuration.


Is Spark Streaming real-time?

Spark Streaming supports the processing of real-time data from various input sources and storing the processed data to various output sinks.

What are the issues faced in Spark?

Different types of issues arise while running Spark in a cluster:
  • Serialization Issues.
  • Out of Memory Exceptions.
  • Optimizing Long Running Jobs.
  • Result Exceeds Driver Memory.
  • Using coalesce() – Creates Uneven Partitions.
  • Broadcasting Large Data.
  • Data Skewness.
  • Too Small and Too Large Partitions.


How do I increase Spark memory?

To enlarge the Spark shuffle service memory size, modify SPARK_DAEMON_MEMORY in the configuration under $SPARK_HOME/conf/ (the default value is 2g), then restart the shuffle service for the change to take effect.
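A hedged sketch of the change itself (in standard Spark distributions the environment file under $SPARK_HOME/conf/ is typically spark-env.sh; confirm against your deployment):

```shell
# In the environment file under $SPARK_HOME/conf/:
export SPARK_DAEMON_MEMORY=4g   # raised from the 2g default
```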

How does Spark handle large data? ›

Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.

How do you deal with a failed Spark job?

Resolution: adjust the Spark submit command-line options:
  1. Set a higher value for the executor memory: --conf spark.executor.memory=<XX>g
  2. Increase the number of shuffle partitions: --conf spark.sql.shuffle.partitions=<number>
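Put together, the two adjustments above look like this on a spark-submit command line (the memory and partition values are illustrative starting points, and the class and jar names are placeholders):

```shell
# Retry a failing job with more executor memory and more shuffle partitions.
spark-submit \
  --conf spark.executor.memory=8g \
  --conf spark.sql.shuffle.partitions=400 \
  --class com.example.StreamingJob \
  my-streaming-job.jar
```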


