This documentation is for WSO2 Data Analytics Server 3.0.0. View documentation for the latest release.
||
Skip to end of metadata
Go to start of metadata

This section covers the configurations required to use Apache Spark with WSO2 DAS.

Adding jar files to the Spark classpath

When starting the Spark Driver Application in the DAS server, Spark executors are created within the same node. The following jars are included in the classpath for these executors by default.

  • apache-zookeeper
  • axiom
  • axis2
  • axis2-json
  • cassandra-thrift
  • chill
  • com.datastax.driver.core
  • com.fasterxml.jackson.core.jackson-annotations
  • com.fasterxml.jackson.core.jackson-core
  • com.fasterxml.jackson.core.jackson-databind
  • com.fasterxml.jackson.module.jackson.module.scala
  • com.google.gson
  • com.google.guava
  • com.google.protobuf
  • com.jayway.jsonpath.json-path
  • com.ning.compress-lzf
  • com.sun.jersey.jersey-core
  • com.sun.jersey.jersey-server
  • commons-codec
  • commons-collections
  • commons-configuration
  • commons-httpclient
  • commons-io
  • commons-lang
  • config
  • h2-database-engine
  • hadoop-client
  • hazelcast
  • hbase-client
  • hector-core
  • htrace-core
  • htrace-core-apache
  • httpclient
  • httpcore
  • io.dropwizard.metrics.core
  • io.dropwizard.metrics.graphite
  • io.dropwizard.metrics.json
  • io.dropwizard.metrics.jvm
  • javax.cache.wso2
  • javax.servlet.jsp-api
  • jaxb
  • jdbc-pool
  • jdom
  • jettison
  • json
  • json-simple
  • json4s-jackson
  • kryo
  • libthrift
  • lucene
  • mesos
  • minlog
  • net.minidev.json-smart
  • netty-all
  • objenesis
  • org.apache.commons.lang3
  • org.apache.commons.math3
  • org.jboss.netty
  • org.roaringbitmap.RoaringBitmap
  • org.scala-lang.scala-library
  • org.scala-lang.scala-reflect
  • org.spark-project.protobuf.java
  • org.spark.project.akka.actor
  • org.spark.project.akka.remote
  • org.spark.project.akka.slf4j
  • org.wso2.carbon.analytics.api
  • org.wso2.carbon.analytics.dataservice.commons
  • org.wso2.carbon.analytics.dataservice.core
  • org.wso2.carbon.analytics.datasource.cassandra
  • org.wso2.carbon.analytics.datasource.commons
  • org.wso2.carbon.analytics.datasource.core
  • org.wso2.carbon.analytics.datasource.hbase
  • org.wso2.carbon.analytics.datasource.rdbms
  • org.wso2.carbon.analytics.eventsink
  • org.wso2.carbon.analytics.eventtable
  • org.wso2.carbon.analytics.io.commons
  • org.wso2.carbon.analytics.spark.core
  • org.wso2.carbon.analytics.stream.persistence
  • org.wso2.carbon.base
  • org.wso2.carbon.cluster.mgt.core
  • org.wso2.carbon.core
  • org.wso2.carbon.core.common
  • org.wso2.carbon.core.services
  • org.wso2.carbon.databridge.agent
  • org.wso2.carbon.databridge.agent
  • org.wso2.carbon.databridge.commons

In addition any jars available in the <DAS_HOME>/repository/conf/lib directory are also appended to the class path.

If you want to add additional jars, you can add them to the SPARK_CLASSPATH in the <DAS_HOME>/bin/external-spark-classpath.conf  file in a UNIX environment.

Each path should have a separate line.

When WSO2 DAS connects with an external Spark cluster, the distribution of DAS is copied to each node in the Spark cluster. This allows each Spark node to access the jars it needs to work with WSO2 DAS.

Carbon related configurations 

Following are the Carbon related configurations that are used for Apache Spark. These configurations are shipped with the product by default in the <DAS_home>/repository/conf/analytics/spark/spark-defaults.conf file.

PropertyDefault ValueDescription
carbon.spark.master local

The Spark master has three possible states as follows:

  • local: This starts Spark in the local mode. e.g, carbon.spark.master local or carbon.spark.master local[2]
  • client : This mode results in the DAS acting as a client for an external Spark cluster. e.g., carbon.spark.master spark://<host name>:<port>. For more details on setting up WSO2 DAS and Apache Spark in this mode, see Connecting a DAS Instance to an Existing External Apache Spark Cluster.
  • cluster: This mode results in the DAS creating its own Spark cluster using Carbon Clustering. When Spark runs in this mode, it is required to specify a value for the carbon.spark.master.count property. e.g., carbon.spark.master local AND carbon.spark.master.count  <number of redundant masters>
carbon.spark.master.count 1

The maximum number of masters allowed at a given time when DAS creates its own Spark cluster.

This property is applicable only when the Spark master runs in the cluster mode.

carbon.das.symbolic.link This links to your DAS home by default.

The symbolic link for the jar files in the Spark class path.

In a clustered DAS deployment, the directory path for the Spark Class path is different for each node depending on the location of the <DAS_HOME>. The symbolic link redirects the Spark Driver Application to the relevant directory for each node when it creates the Spark class path. The symbolic link should be located in the same path for each <DAS_HOME>.

The symbolic link is not specified by default. When it is not specified, the jar files are added in the DAS home.

Spark related configurations

Following are the Apache Spark related configurations that are used in WSO2 DAS. These configurations are shipped with the product by default in the <DAS_home>/repository/conf/analytics/spark/spark-defaults.conf file.

For more information on the below Spark configuration properties, go to Apache Spark Documentation.

 

Application configurations

PropertyDefault Value

spark.app.name

CarbonAnalytics

spark.driver.cores
1
spark.driver.memory 512m
spark.executor.memory 512m

Spark UI configurations

PropertyDefault Value

spark.ui.port

CarbonAnalytics

spark.history.ui.port 18080

Compression and serialization configurations

PropertyDefault Value

spark.serializer

org.apache.spark.serializer.KryoSerializer

spark.kryoserializer.buffer
256k
spark.kryoserializer.buffer.max
256m

Networking configurations

PropertyDefault Value

spark.blockManager.port

12000

spark.broadcast.port12500
spark.driver.port13000
spark.executor.port13500
spark.fileserver.port14000
spark.replClassServer.port14500

Scheduling configurations

PropertyDefault Value

spark.scheduler.mode

FAIR

In addition to having FAIR as the value for the spark.scheduler.mode property in the spark-defaults.conf file, it is required to have a pool configuration with the schedulingMode parameter set to FAIR in the <DAS_HOME>/repository/conf/analytics/spark/fairscheduler.xml file as shown in the configuration below. This is because Apache Spark by default uses a scheduler pool which runs scheduler tasks in a First-In-First-Out (FIFO) order.

<?xml version="1.0"?>

<allocations>
    <pool name="carbon-pool">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1000</weight>
        <minShare>1</minShare>
    </pool>
    <pool name="test">
        <schedulingMode>FIFO</schedulingMode>
        <weight>1</weight>
        <minShare>0</minShare>
    </pool>
</allocations>

 

Standalone cluster configurations

PropertyDefault Value

spark.deploy.recoveryMode

CUSTOM

spark.deploy.recoveryMode.factory
org.wso2.carbon.analytics.spark.core.deploy.AnalyticsRecoveryModeFactory

Master configurations

PropertyDefault Value

spark.master.port

7077

spark.master.rest.port6066
spark.master.webui.port8081

Worker configurations

PropertyDefault Value

spark.worker.cores

1

spark.worker.memory1g
spark.worker.dirwork
spark.worker.port11000
spark.worker.webui.port11500

Optional Spark related configurations

The following configurations can be added to the <DAS_home>/repository/conf/analytics/spark/spark-defaults.conf file if you want to limit the space allocated for log files generated and saved in the <DAS_HOME>/work directory.

spark.executor.logs.rolling.strategy  size
spark.executor.logs.rolling.maxSize  10000000
spark.executor.logs.rolling.maxRetainedFiles 10
PropertyDescription
spark.executor.logs.rolling.strategy
This indicates the strategy used to control the amount of logs saved in the <DAS_HOME>/work directory. In the above configuration, property value size indicates that the amount of logs that are allowed to be kept is restricted based on the size.
spark.executor.logs.rolling.maxSize
The maximum size (in bytes) allowed for logs saved in the <DAS_HOME>/work  directory at any given time. Older log files are deleted when new logs are generated so that the specified maximum size is not exceeded.
spark.executor.logs.rolling.maxRetainedFiles
The maximum number of log files allowed to be kept in the <DAS_HOME>/work  directory at any given time. Older log files are deleted when new logs are generated so that the specified maximum number of files is not exceeded. In the above configuration, this property is overruled by the spark.executor.logs.rolling.maxSize property because the value specified for the spark.executor.logs.rolling.strategy is size. If the maximum size specified for logs is reached, older logs are deleted even if the maximum number of files specified is not yet reached.
  • No labels