The latest version for DAS is WSO2 Data Analytics Server 3.1.0. View documentation for the latest release.
WSO2 Data Analytics Server is succeeded by WSO2 Stream Processor. To view the latest documentation for WSO2 SP, see WSO2 Stream Processor Documentation.

All docs This doc
Skip to end of metadata
Go to start of metadata

This section covers the configurations required to use Apache Spark with WSO2 DAS.

Adding jar files to the Spark classpath

When starting the Spark Driver Application in the DAS server, Spark executors are created within the same node. The following jars are included in the classpath for these executors by default.

  • apache-zookeeper
  • axiom
  • axis2
  • axis2-json
  • cassandra-thrift
  • chill
  • com.datastax.driver.core
  • com.fasterxml.jackson.core.jackson-annotations
  • com.fasterxml.jackson.core.jackson-core
  • com.fasterxml.jackson.core.jackson-databind
  • com.fasterxml.jackson.module.jackson.module.scala
  • com.jayway.jsonpath.json-path
  • com.ning.compress-lzf
  • com.sun.jersey.jersey-core
  • com.sun.jersey.jersey-server
  • commons-codec
  • commons-collections
  • commons-configuration
  • commons-httpclient
  • commons-io
  • commons-lang
  • config
  • h2-database-engine
  • hadoop-client
  • hazelcast
  • hbase-client
  • hector-core
  • htrace-core
  • htrace-core-apache
  • httpclient
  • httpcore
  • io.dropwizard.metrics.core
  • io.dropwizard.metrics.graphite
  • io.dropwizard.metrics.json
  • io.dropwizard.metrics.jvm
  • javax.cache.wso2
  • javax.servlet.jsp-api
  • jaxb
  • jdbc-pool
  • jdom
  • jettison
  • json
  • json-simple
  • json4s-jackson
  • kryo
  • libthrift
  • lucene
  • mesos
  • minlog
  • net.minidev.json-smart
  • netty-all
  • objenesis
  • org.apache.commons.lang3
  • org.apache.commons.math3
  • org.jboss.netty
  • org.roaringbitmap.RoaringBitmap
  • org.scala-lang.scala-library
  • org.scala-lang.scala-reflect
  • org.spark.project.akka.remote
  • org.spark.project.akka.slf4j
  • org.wso2.carbon.base
  • org.wso2.carbon.cluster.mgt.core
  • org.wso2.carbon.core
  • org.wso2.carbon.core.common
  • org.wso2.carbon.databridge.agent
  • org.wso2.carbon.databridge.agent
  • org.wso2.carbon.databridge.commons

In addition any jars available in the <DAS_HOME>/repository/conf/lib directory are also appended to the class path.

If you want to add additional jars, you can add them to the SPARK_CLASSPATH in the <DAS_HOME>/bin/external-spark-classpath.conf file in a UNIX environment.

Each path should have a separate line.

When WSO2 DAS connects with an external Spark cluster, the distribution of DAS is copied to each node in the Spark cluster. This allows each Spark node to access the jars it needs to work with WSO2 DAS.

Carbon related configurations 

Following are the Carbon related configurations that are used for Apache Spark. These configurations are shipped with the product by default in the <DAS_home>/repository/conf/analytics/spark/spark-defaults.conf file.

PropertyDefault ValueDescription

The Spark master has three possible states as follows:

  • local: This starts Spark in the local mode. e.g, carbon.spark.master local or carbon.spark.master local[2]
  • client: This mode results in the DAS acting as a client for an external Spark cluster. e.g., carbon.spark.master spark://<host name>:<port>. For more details on setting up WSO2 DAS and Apache Spark in this mode, see Connecting a DAS Instance to an Existing External Apache Spark Cluster.
  • cluster: This mode results in the DAS creating its own Spark cluster using Carbon Clustering. When Spark runs in this mode, it is required to specify a value for the carbon.spark.master.count property. e.g., carbon.spark.master local AND carbon.spark.master.count  <number of redundant masters>

The maximum number of masters allowed at a given time when DAS creates its own Spark cluster.

This property is applicable only when the Spark master runs in the cluster mode.
This links to your DAS home by default.

The symbolic link for the jar files in the Spark class path.

In a clustered DAS deployment, the directory path for the Spark Class path is different for each node depending on the location of the <DAS_HOME>. The symbolic link redirects the Spark Driver Application to the relevant directory for each node when it creates the Spark class path. The symbolic link should be located in the same path for each <DAS_HOME>.

  • The symbolic link is not specified by default. When it is not specified, the jar files are added in the DAS home.
  • In a multi node DAS cluster that runs in a RedHat Linux environment, you also need to update the <DAS_HOME>/bin/ file with the following entry so that the <DAS_HOME> is exported. This is because the symbolic link may not be resolved correctly in this operating system.

    Export CARBON_HOME=<symbolic link>

Default Spark related configurations

Following are the Apache Spark related configurations that are used in WSO2 DAS. These configurations are shipped with the product by default in the <DAS_home>/repository/conf/analytics/spark/spark-defaults.conf file.

For more information on the below Spark configuration properties, go to Apache Spark Documentation

Application configurations

PropertyDefault Value



Spark UI configurations

PropertyDefault Value




Compression and serialization configurations

PropertyDefault Value




Networking configurations

PropertyDefault Value




Scheduling configurations

PropertyDefault Value



In addition to having FAIR as the value for the spark.scheduler.mode property in the spark-defaults.conf file, it is required to have a pool configuration with the schedulingMode parameter set to FAIR in the <DAS_HOME>/repository/conf/analytics/spark/fairscheduler.xml file as shown in the configuration below. This is because Apache Spark by default uses a scheduler pool which runs scheduler tasks in a First-In-First-Out (FIFO) order.

<?xml version="1.0"?>

    <pool name="carbon-pool">
    <pool name="test">

Standalone cluster configurations

PropertyDefault Value



Master configurations

PropertyDefault ValueDescription



The port on which the Spark master is run. port on which the REST server of the Spark master is run.
spark.master.webui.port8081The port on which the web UI of the Sprk master is run.

Worker configurations

PropertyDefault ValueDescription



The number of cores in your machine allocated for the Spark worker.
spark.worker.memory1gThe amount of memory in your machine that is allocated to the Spark worker.
spark.worker.port11000The port of the Spark worker.
spark.worker.webui.port11500The port in which the web UI of the Spark

Executor configurations

It is recommended to run only one executor per DAS worker. If you observe any memory or Spark executor time issues for this executor, you can increase the amount of memory and the number of CPU cores allocated to it.

PropertyDefault ValueDescription
spark.executor.cores1The number of cores allocated to the Spark executors that are running in the DAS node. All the availble CPU cores of the worker are allocated to the executor(s) by default.
spark.executor.memory1gThe amount of CPU memory allocated to the spark executor(s).

Optional Spark related configurations

The following configurations can be added to the <DAS_home>/repository/conf/analytics/spark/spark-defaults.conf file if you want to limit the space allocated for log files generated and saved in the <DAS_HOME>/work directory.

spark.executor.logs.rolling.strategy  size
spark.executor.logs.rolling.maxSize  10000000
spark.executor.logs.rolling.maxRetainedFiles 10
This indicates the strategy used to control the amount of logs saved in the <DAS_HOME>/work directory. In the above configuration, property value size indicates that the amount of logs that are allowed to be kept is restricted based on the size.
The maximum size (in bytes) allowed for logs saved in the <DAS_HOME>/work directory at any given time. Older log files are deleted when new logs are generated so that the specified maximum size is not exceeded.
The maximum number of log files allowed to be kept in the <DAS_HOME>/work directory at any given time. Older log files are deleted when new logs are generated so that the specified maximum number of files is not exceeded. In the above configuration, this property is overruled by the spark.executor.logs.rolling.maxSize property because the value specified for the spark.executor.logs.rolling.strategy is size. If the maximum size specified for logs is reached, older logs are deleted even if the maximum number of files specified is not yet reached.
  • No labels