WSO2 BAM uses MapReduce jobs to archive Cassandra data. As a result, you can archive a large amount of data in a cluster of Hadoop nodes. To initiate the data archiving process, log in to the BAM management console and click Archive Data in the Configure menu.
The general configuration parameters of the data archival process are as follows:
Parameter | Description |
---|---|
Stream Name | In the BAM data model, stream name maps to a Cassandra column family. Provide the stream name to archive the data stored under that stream name. |
Version | Version of the stream. Used to specify which version to archive, when there are multiple versions under the same stream name (as recommended). |
Date Range | Select this option to archive the data manually. |
Retention Period | Select this option to archive the data by scheduling it using a cron expression. |
Archival / Deletion | Select Archival. For instructions on deleting data, see Deleting Cassandra Data. |
You can run the archival process through the BAM management console, either manually or by scheduling it using a cron expression as explained below.
Archiving data manually
Follow the steps below to archive data manually:
Click Date Range to manually archive data within a specific date range.
The specific configuration parameters of the manual data archival process are explained below:
Parameter | Description |
---|---|
From | The start date of the date range. For example: 25/01/2013 00:00:00 AM |
To | The end date of the date range. For example: 03/02/2013 00:00:00 AM |
Click Submit.
Scheduling the data archive
Follow the steps below to schedule the data archiving process:
Click Retention Period to schedule an archive process.
The specific configuration parameters of the scheduled data archival process are explained below:
Parameter | Description |
---|---|
No of days | Keeps only the last 'n' days of data in the column family. For example, with No of days set to 90, the system retains data from the last 90 days and archives the older data. |
Cron expression | Schedules the archive process. For example, the following cron expression configures the archive job to run at 12:00 PM (noon) every day: 0 0 12 * * ? For more information on cron expressions, go to Oracle Documentation. |
- Name of the archive column family is: <original column family name> + _arch
- Cassandra streams are generated with underscores (_). Replace the underscores in the stream name with dots (.) when archiving. For example: If the stream name is org_wso2_bam_phone_retail_store_kpi, mention it as org.wso2.bam.phone.retail_store.kpi when archiving. Note that in this example the underscore in retail_store is kept.
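The two naming notes above can be sketched as a pair of small helpers. This is only an illustration in Python, not part of BAM. The function names are hypothetical, and the sketch assumes that the column family name is derived from the stream name by replacing dots with underscores, as the example suggests.

```python
ARCHIVE_SUFFIX = "_arch"  # suffix stated in the naming note above


def column_family_for(stream_name: str) -> str:
    """Assumed mapping: every dot in the stream name becomes an underscore."""
    return stream_name.replace(".", "_")


def archive_column_family_for(stream_name: str) -> str:
    """Archive column family = original column family name + _arch."""
    return column_family_for(stream_name) + ARCHIVE_SUFFIX


# Stream name taken from the example in this section.
stream = "org.wso2.bam.phone.retail_store.kpi"
print(column_family_for(stream))
print(archive_column_family_for(stream))
```

The reverse direction is not sketched on purpose: an underscore in a column family name may stand for a dot or may belong to the original stream name (as in retail_store), so the dotted stream name has to be entered by hand.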
Click Submit. Once you submit a scheduled archive, the system creates a Hive script and executes it.
Click List under Analytics in the Main menu.
This step is not applicable in the manual data archival process, which only executes the Hive query, but doesn't save it.
To change the schedule time of your script, click the Schedule Script option associated with it.
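To make the two scheduling parameters of the scheduled archive more concrete, the following Python sketch labels the fields of the example cron expression and computes the cutoff date implied by a retention period. It is an illustration only, not BAM code. The function names are hypothetical, the field order assumes the Quartz-style layout (seconds first) that the example 0 0 12 * * ? appears to follow, and the cutoff assumes that the retention period is counted back from the time the job runs.

```python
from datetime import datetime, timedelta

# Assumed Quartz-style field order: seconds first, optional year last.
CRON_FIELDS = [
    "seconds", "minutes", "hours",
    "day-of-month", "month", "day-of-week", "year",
]


def label_cron_fields(expression: str) -> dict:
    """Pair each space-separated cron field with its assumed meaning."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError("expected 6 or 7 space-separated fields")
    return dict(zip(CRON_FIELDS, parts))


def retention_cutoff(run_time: datetime, no_of_days: int) -> datetime:
    """Rows older than the returned timestamp would be candidates for archiving
    (assumption: the retention window is counted back from the run time)."""
    return run_time - timedelta(days=no_of_days)


# Cron expression from this section: run at noon every day.
print(label_cron_fields("0 0 12 * * ?"))

# Hypothetical run time with a 90-day retention period.
print(retention_cutoff(datetime(2013, 2, 3, 12, 0, 0), 90))
```

Reading the expression this way shows that the hours field (12) sets the daily run time, which is the field to adjust when changing the schedule.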