Apache Spark allows UDFs (User Defined Functions) to be created if you want want to use a feature that is not available for Spark by default. WSO2 DAS has an abstraction layer for generic Spark UDF (User Defined Functions) which makes it convenient to introduce UDFs to the server.
The following query is an example of a custom UDF.
The steps to create a custom UDF are as follows.
Step 1: Create the POJO class
The following example shows the UDF POJO for the StringConcatonator custom UDF class. The name of the Spark UDF should be the name of the method defined (concat in this example). This will be used when calling the UDF with Spark. e.g.,
concat(“custom”,”UDF”) returns the
String “Custom UDF”.
Apache Spark does not support primitive data type returns. Therefore, all the methods in a POJO class should return the wrapper class of the corresponding primitive data type.
e.g., A method to add two integers should be defined as shown below.
- Method overloading for UDFs is not supported. Different UDFs should have different method names for the expected behaviour.
If the user consumes a data type that is not supported for Apache Spark, the following error appears when you start the DAS server.
For a list of return types supported for Apache Spark, see Spark SQL and DataFrames and Datasets Guide - Data Types.
If you need to use one or more methods that are not UDF methods, and they contain return types that are not supported for Apache Spark, you can use a separate class to define them. This class does not have to be added to the
Step 2: Package the class in a jar
The custom UDF class you created should be bundled as a jar and added to
Step 3: Update Spark UDF configuration file
Add the newly created custom UDF to the
<DAS_HOME>/repository/conf/analytics/spark/spark-udf-config.xml file as shown in the example below.
This configuration is required for Spark to identify and use the newly defined custom UDF.
Spark adds all the methods in the specified UDF class as custom UDFs.