This documentation is for WSO2 Machine Learner 1.0.0.


...

With Embedded Spark

WSO2 Machine Learner (WSO2 ML) ships with an embedded Spark server for ease of use.

To create datasets out of the data tables created by WSO2 DAS, and then build models using the data collected in those tables, make sure that the following databases have the same URL in both the <ML_HOME>/repository/conf/datasources/analytics-datasources.xml file and the <DAS_HOME>/repository/conf/datasources/analytics-datasources.xml file.

  • WSO2_ANALYTICS_FS_DB
  • WSO2_ANALYTICS_EVENT_STORE_DB
  • WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB

Once WSO2 ML has started, you can proceed with dataset creation.
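For example, with the default H2 setup, the WSO2_ANALYTICS_FS_DB entry could look like the fragment below in both files. The structure and credentials shown here are illustrative, so compare against your own analytics-datasources.xml files rather than copying this verbatim.

```xml
<datasource>
    <name>WSO2_ANALYTICS_FS_DB</name>
    <definition type="RDBMS">
        <configuration>
            <!-- This URL must be identical in the ML and DAS copies of the file. -->
            <url>jdbc:h2:/tmp/wso2das-1.0.0/repository/database/ANALYTICS_FS_DB;DB_CLOSE_ON_EXIT=FALSE;LOCK_TIMEOUT=60000;AUTO_SERVER=TRUE</url>
            <username>wso2carbon</username>
            <password>wso2carbon</password>
        </configuration>
    </definition>
</datasource>
```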

With External Spark Cluster

WSO2 ML can connect to an external Spark cluster and retrieve data using the Data Access Layer (DAL) of WSO2 DAS.


Pre-requisites

In order to connect to an external Spark cluster, you need to do the following.

  • Set up an external Spark cluster with a master node and at least one worker node.

    Tip

    Use Spark version 1.4.1 binary built on Hadoop 2.6.0.

  • Create an event stream in DAS and publish some events to it. (This is the data we are going to use as our dataset to perform predictive analysis.) 

Configurations in the Spark cluster

Follow the procedure below to do the required Spark cluster related configurations.

  1. Create a directory named analytics in the <SPARK_HOME> directory and copy all of the following DAS-related jars to it. These jars can be found in the <DAS_HOME>/repository/components/plugins directory.
    • axiom_1.2.11.wso2v6.jar
    • axis2_1.6.1.wso2v14.jar
    • h2-database-engine_1.2.140.wso2v3.jar
    • hazelcast_3.5.0.wso2v1.jar
    • jdbc-pool_7.0.34.wso2v2.jar
    • lucene_5.2.1.wso2v1.jar
    • org.wso2.carbon.analytics.api_1.0.3.jar
    • org.wso2.carbon.analytics.dataservice.commons_1.0.3.jar
    • org.wso2.carbon.analytics.dataservice.core_1.0.3.jar
    • org.wso2.carbon.analytics.datasource.cassandra_1.0.3.jar
    • org.wso2.carbon.analytics.datasource.commons_1.0.3.jar
    • org.wso2.carbon.analytics.datasource.core_1.0.3.jar
    • org.wso2.carbon.analytics.datasource.hbase_1.0.3.jar
    • org.wso2.carbon.analytics.datasource.rdbms_1.0.3.jar
    • org.wso2.carbon.analytics.io.commons_1.0.3.jar
    • org.wso2.carbon.analytics.spark.admin_1.0.3.jar
    • org.wso2.carbon.analytics.spark.core_1.0.3.jar
    • org.wso2.carbon.analytics.spark.utils_1.0.3.jar
    • org.wso2.carbon.analytics.tools.backup_1.0.3.jar
    • org.wso2.carbon.analytics.tools.migration_1.0.3.jar
    • org.wso2.carbon.base_4.4.1.jar
    • org.wso2.carbon.core.common_4.4.1.jar
    • org.wso2.carbon.core.services_4.4.1.jar
    • org.wso2.carbon.core_4.4.1.jar
    • org.wso2.carbon.datasource.reader.hadoop_4.3.1.jar
    • org.wso2.carbon.ndatasource.common_4.4.1.jar
    • org.wso2.carbon.ndatasource.core_4.4.1.jar
    • org.wso2.carbon.ndatasource.rdbms_4.4.1.jar
    • org.wso2.carbon.ntask.common_4.4.7.jar
    • org.wso2.carbon.ntask.core_4.4.7.jar
    • org.wso2.carbon.ntask.solutions_4.4.7.jar
    • org.wso2.carbon.registry.admin.api_4.4.8.jar
    • org.wso2.carbon.registry.api_4.4.1.jar
    • org.wso2.carbon.registry.common_4.4.8.jar
    • org.wso2.carbon.registry.core_4.4.1.jar
    • org.wso2.carbon.registry.indexing_4.4.8.jar
    • org.wso2.carbon.registry.properties_4.4.8.jar
    • org.wso2.carbon.registry.resource_4.4.8.jar
    • org.wso2.carbon.registry.search_4.4.8.jar
    • org.wso2.carbon.registry.server_4.4.1.jar
    • org.wso2.carbon.registry.servlet_4.4.8.jar
    • org.wso2.carbon.utils_4.4.1.jar
  2. Create a directory named ml in the <SPARK_HOME> directory and copy the following ML-related jars to it. These jars can be found in the <ML_HOME>/repository/components/plugins directory.
    • org.wso2.carbon.ml.commons_1.0.2.jar
    • org.wso2.carbon.ml.core_1.0.2.jar
    • org.wso2.carbon.ml.database_1.0.2.jar
    • kryo_2.24.0.wso2v1.jar
  3. Create a file named spark-env.sh with the following entries and save it in the <SPARK_HOME>/conf directory.

    Note

    Change SPARK_MASTER_IP and SPARK_CLASSPATH values accordingly.

```shell
SPARK_MASTER_IP=127.0.0.1
SPARK_CLASSPATH=${SPARK_HOME}/ml/org.wso2.carbon.ml.core_1.0.2.jar:\
${SPARK_HOME}/ml/org.wso2.carbon.ml.commons_1.0.2.jar:\
${SPARK_HOME}/ml/org.wso2.carbon.ml.database_1.0.2.jar:\
${SPARK_HOME}/ml/kryo_2.24.0.wso2v1.jar:\
${SPARK_HOME}/analytics/axiom_1.2.11.wso2v6.jar:\
${SPARK_HOME}/analytics/axis2_1.6.1.wso2v14.jar:\
${SPARK_HOME}/analytics/h2-database-engine_1.2.140.wso2v3.jar:\
${SPARK_HOME}/analytics/hazelcast_3.5.0.wso2v1.jar:\
${SPARK_HOME}/analytics/jdbc-pool_7.0.34.wso2v2.jar:\
${SPARK_HOME}/analytics/lucene_5.2.1.wso2v1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.api_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.dataservice.commons_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.dataservice.core_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.datasource.cassandra_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.datasource.commons_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.datasource.core_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.datasource.hbase_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.datasource.rdbms_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.io.commons_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.spark.admin_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.spark.core_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.spark.utils_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.tools.backup_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.analytics.tools.migration_1.0.3.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.base_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.core.common_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.core.services_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.core_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.datasource.reader.hadoop_4.3.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.ndatasource.common_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.ndatasource.core_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.ndatasource.rdbms_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.ntask.common_4.4.7.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.ntask.core_4.4.7.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.ntask.solutions_4.4.7.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.admin.api_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.api_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.common_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.core_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.indexing_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.properties_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.resource_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.search_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.server_4.4.1.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.registry.servlet_4.4.8.jar:\
${SPARK_HOME}/analytics/org.wso2.carbon.utils_4.4.1.jar
```
  4. Create a directory named datasources in the <SPARK_HOME>/conf directory and copy the following files to it from the <DAS_HOME>/repository/conf/datasources directory. Make sure that these files contain URLs pointing to the exact databases used by WSO2 DAS.
    • analytics-datasources.xml
    • master-datasources.xml
    Info

    As noted in the prerequisite section, you need to first publish events/data into an event stream of WSO2 DAS.

    Info

    For the H2 database (which is the default for DAS), you need to append AUTO_SERVER=TRUE to the database connection string as shown below.

    ```xml
    <url>jdbc:h2:/tmp/wso2das-1.0.0/repository/database/ANALYTICS_FS_DB;DB_CLOSE_ON_EXIT=FALSE;LOCK_TIMEOUT=60000;AUTO_SERVER=TRUE</url>
    ```
  5. Create a directory named analytics in the <SPARK_HOME>/conf directory. Copy the following files from <DAS_HOME>/repository/conf/analytics to it.
    • analytics-config.xml

      Info

      Comment out the following section in the analytics-config.xml file once you copy it.

      ```xml
      <!--analytics-data-purging>
            <purging-enable>false</purging-enable>
            <purge-node>true</purge-node>
            <cron-expression>0 0 0 * * ?</cron-expression>
            <purge-include-table-patterns>
               <table>.*</table>
            </purge-include-table-patterns>
            <data-retention-days>365</data-retention-days>
      </analytics-data-purging-->
      ```

    • analytics-data-config.xml
    • rdbms-query-config.xml 

  6. Restart the Spark cluster by running the following commands from the <SPARK_HOME> directory.

    ```shell
    ./sbin/stop-all.sh     # stop the cluster
    ./sbin/start-all.sh    # start the cluster
    ```
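The jar-copying in steps 1 and 2 above can be sketched as a small script. The installation paths below are placeholder assumptions (override DAS_HOME, ML_HOME, and SPARK_HOME for your environment), and the glob patterns are a convenience, so verify the result against the jar lists in steps 1 and 2.

```shell
#!/bin/sh
# Sketch of steps 1 and 2: copy the DAS and ML jars into the Spark cluster.
# The default paths below are assumptions -- override them for your setup.
DAS_HOME="${DAS_HOME:-./wso2das-1.0.0}"
ML_HOME="${ML_HOME:-./wso2ml-1.0.0}"
SPARK_HOME="${SPARK_HOME:-./spark-1.4.1-bin-hadoop2.6}"

mkdir -p "$SPARK_HOME/analytics" "$SPARK_HOME/ml"

# Step 1: DAS-related jars go into <SPARK_HOME>/analytics.
for pattern in axiom_* axis2_* h2-database-engine_* hazelcast_* jdbc-pool_* \
               lucene_* org.wso2.carbon.analytics.* org.wso2.carbon.base_* \
               org.wso2.carbon.core* org.wso2.carbon.datasource.reader.hadoop_* \
               org.wso2.carbon.ndatasource.* org.wso2.carbon.ntask.* \
               org.wso2.carbon.registry.* org.wso2.carbon.utils_*; do
  cp "$DAS_HOME"/repository/components/plugins/$pattern.jar \
     "$SPARK_HOME/analytics/" 2>/dev/null || true
done

# Step 2: ML-related jars (and kryo) go into <SPARK_HOME>/ml.
cp "$ML_HOME"/repository/components/plugins/org.wso2.carbon.ml.*.jar \
   "$SPARK_HOME/ml/" 2>/dev/null || true
cp "$ML_HOME"/repository/components/plugins/kryo_*.jar \
   "$SPARK_HOME/ml/" 2>/dev/null || true

echo "Copied $(ls "$SPARK_HOME/analytics" | wc -l) DAS jars and $(ls "$SPARK_HOME/ml" | wc -l) ML jars."
```

Running the script twice is harmless, as cp simply overwrites the previously copied jars.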

Configurations in WSO2 ML

Follow the procedure below to do the required ML related configurations.

  1. Open the <ML_HOME>/repository/conf/etc/spark.config.xml file and make the following changes.
    • Change the spark.master property as required.
      e.g.,

      ```xml
      <property name="spark.master">spark://127.0.0.1:7077</property>
      ```
    • Add the spark.executor.extraJavaOptions property.
      e.g.,

      ```xml
      <property name="spark.executor.extraJavaOptions">-Dwso2_custom_conf_dir=<SPARK_HOME>/conf</property>
      ```
  2. Open the <ML_HOME>/repository/conf/datasources/analytics-datasources.xml file. Make sure that the URLs in this file for the following databases are the same as those in the <DAS_HOME>/repository/conf/datasources/analytics-datasources.xml file.

    • WSO2_ANALYTICS_FS_DB
    • WSO2_ANALYTICS_EVENT_STORE_DB
    • WSO2_ANALYTICS_PROCESSED_DATA_STORE_DB
    Info

    The H2 database (which is the default) additionally requires AUTO_SERVER=TRUE to be appended to the database connection string, as shown in the example below.

    ```xml
    <url>jdbc:h2:/tmp/wso2das-1.0.0/repository/database/ANALYTICS_FS_DB;DB_CLOSE_ON_EXIT=FALSE;LOCK_TIMEOUT=60000;AUTO_SERVER=TRUE</url>
    ```

After completing the above configurations, start WSO2 ML. You can then create datasets out of the data tables created by WSO2 DAS and build models using the data collected in those tables.
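Before creating datasets, it can help to sanity-check that the Spark master WSO2 ML will connect to is actually reachable. The host and port below are assumptions matching the spark://127.0.0.1:7077 example used earlier; adjust them to your own spark.master value. The probe uses nc (a netcat with -z support), but any TCP connectivity check works.

```shell
#!/bin/sh
# Probe the Spark master port that WSO2 ML's spark.master property points at.
# Host and port are assumptions -- match them to your spark.config.xml.
SPARK_MASTER_HOST="${SPARK_MASTER_HOST:-127.0.0.1}"
SPARK_MASTER_PORT="${SPARK_MASTER_PORT:-7077}"

if nc -z -w 5 "$SPARK_MASTER_HOST" "$SPARK_MASTER_PORT" 2>/dev/null; then
  STATUS="reachable"
else
  STATUS="unreachable"
fi
echo "Spark master at spark://$SPARK_MASTER_HOST:$SPARK_MASTER_PORT is $STATUS"
```

If the master is reported unreachable, re-check that start-all.sh completed successfully and that the port is not blocked before starting WSO2 ML.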

...