RapidMiner Integration

Use the vovrtsad daemon for integration with the RapidMiner Runtime Scoring Agent.

RapidMiner Overview

Altair RapidMiner is a data analytics and artificial intelligence (AI) platform that can help organizations gain a competitive advantage by leveraging their data. It can be used by people of all skill levels to:
  • Create models: Use automated, visual, and code-based approaches to streamline model creation
  • Train and evaluate models: Use the latest data analytics techniques to train, evaluate, explain, and deploy models
  • Build decision trees: Use a wizard-based tool to build explainable decision trees that visualize complex interactions within data
  • Design and prototype AI models: Use a visual drag-and-drop workflow designer to design and prototype AI and machine learning models
  • Prepare data: Use interactive data prep capabilities to import, clean up, and prepare data
  • Deploy models: Deploy models directly on new data

Integrate vovrtsad with RapidMiner Runtime Scoring Agent

  1. Obtain a copy of RapidMiner’s Runtime Scoring Agent.
    For the remainder of this documentation, assume that the RapidMiner Scoring Agent has been installed in the directory /opt/rapidminer-scoring-agent-embedded. Remember where you installed this executable and its related files, because your vovrtsad daemon configuration file must point to this location in order to retrain models. The current Accelerator configuration setup expects a minimum of three RapidMiner AI Studio processes in a deployment: one for duration predictions, one (preferably empty) used for a quick ping to check REST API connectivity, and one for retraining the duration prediction model. If you do not plan to retrain models, you can work with a containerized version of the Scoring Agent directly, but you will lose the ability to update models directly with the Scoring Agent.
  2. Create a new Accelerator project.
    SWD=$(nc cmd vovserverdir -p .)
    ncmgr start -force -q ncTest
    vovproject enable ncTest
    Alternatively, you can add the required config files and daemon to an existing Accelerator project. For the rest of this document, assume the active Accelerator project directory is ~/vov/vnc/ncTest.swd.
    Tip: A new silent option is available under Accelerator’s nc run command for integrating with RapidMiner’s Runtime Scoring Agent via the -predict option flag. For example:
     nc run -predict -- vovmemtime 5 0
  3. Set up a new vovrtsad daemon and start it. Template files have been provided that you can copy to get started.
    cp $VOVDIR/../common/etc/autostart/start_vovrtsad.tcl ~/vov/vnc/ncTest.swd/autostart/start_vovrtsad.tcl
    mkdir -p ~/vov/vnc/ncTest.swd/vovrtsad
    cp $VOVDIR/../common/etc/config/vovrtsad/config.tcl ~/vov/vnc/ncTest.swd/vovrtsad/config.tcl
    cp $VOVDIR/../common/etc/config/vovrtsad/vovrtsad.cfg ~/vov/vnc/ncTest.swd/vovrtsad/vovrtsad.cfg
    vovdaemon start vovrtsad

  4. Configure the vovrtsad daemon's ~/vov/vnc/ncTest.swd/vovrtsad/vovrtsad.cfg file to match the expected data output from the RapidMiner Embedded Runtime Scoring Agent. This is covered in a later step; for now, prepare to work with the other RapidMiner tools, such as RapidMiner AI Studio and RapidMiner AI Hub, to create the initial model and a deployment file that can be used later with the Embedded Scoring Agent to predict duration or other values of interest from Accelerator.
    1. Create a snapshot of existing Accelerator job data using the existing *.swd/data/job log files. To create an import file that is friendly to AI Studio, run vovrtsa_util to generate a data file that is a snapshot of recent Accelerator data/job log files, such as vovrtsa_util -start 13w.
      This creates a new data file in the *.swd/data/job/rtsa directory that is a snapshot of the desired range of daily data/job log files from today back 13 weeks. This same tool will also be used to retrain your model if desired, according to parameters in the vovrtsad daemon configuration file.
RapidMiner AI Studio is used for data processing, such as removing uninteresting features or attributes that have mostly similar values or do not contribute to producing a quality model, handling missing values, and normalizing features. Here is an example image of an AI Studio process for preparing Accelerator job data. Note the Discretize DURATION block below; this is the key operation, where the given job duration values are used to create discrete time-value buckets that are later used as the predicted duration values. Consult the Altair RapidMiner documentation for more details on how to administer and use AI Studio and AI Hub.
Figure 1.


  2. Once the data is prepared and ready for training, select one of the pre-existing model operators in AI Studio and train the model that you will use with Accelerator for predictions. Again, here is an example AI Studio process for illustration, which uses a Decision Tree model operator. The actual process used for predictions will be a series of operations: data preparation like that shown above, followed by loading and applying the model created below.
    Figure 2.


    The prediction process should end up looking something like this:
    Figure 3.


  3. Once you have a trained model and all of your processes (predict, retrain, ping) are ready to your satisfaction, create a snapshot and add it to AI Hub. From that published snapshot in AI Hub, create a deployment file (*.zip) that is used inside Accelerator for predictions. The following images show the key steps for creating endpoints and deployments for a project in AI Hub. Consult the Altair RapidMiner AI Hub documentation for more details.
    Figure 4.


    Figure 5.


  4. Use the newly created deployment *.zip file to integrate Accelerator with the RapidMiner Runtime Scoring Agent. Copy it to the proper location inside the rapidminer-scoring-agent installation from Step 1, which for this example is the folder /opt/rapidminer-scoring-agent-embedded/home/deployments.
You can now finish configuring the vovrtsad daemon's vovrtsad.cfg file. This configuration file is essentially a JSON file with two key sections: a simple setup section required to communicate with the RapidMiner Embedded Runtime Scoring Agent's REST APIs, and a section of JSON arrays for mapping RapidMiner process discretized attribute values, such as job duration, back into values that the vov_rtsa wrapper utility (called by nc run -predict) can use to convert the REST API output into meaningful Accelerator job submission values.
  1. For the REST API portion of the config file, make sure that the URL, PORT, and URN portions of the file match what was configured for your RapidMiner AI Hub server; these are the same values used for the runtime-scoring-agent (RTSA). The following values should follow their respective rules:
    • Project name and deployment: must match the values used in AI Hub.
    • Predict, retrain, and ping: must match the processes created in AI Studio, and published and deployed via AI Hub to your *.zip deployment file.
    • User and host: must be supplied if the ssh value is enabled and non-zero.
    • Refresh: the number of seconds between attempts to update the vovrtsad.cfg config file.
    • Start: specifies, in typical timespec values such as 1y, 1m, 1w, 1d, the interval back from the current time used for selecting retraining data. By default, all files found in *.swd/data/jobs are used if this value is not set.
    • Debug: toggles verbose debugging information.
    • csvPath, csvDir, and csvName: specify external locations outside of the default *.swd/data/jobs/rtsa/*.csv locations.
  2. To map data between the RTSA embedded REST API endpoints and Accelerator job submission data, specify the same discrete values defined in your Altair RapidMiner AI Studio process's Discretize operator in your vovrtsad.cfg list of values, as shown below. This is how Accelerator matches predicted values and replaces expected values with the prediction results. Both the class names and the upper limits should match the values used in the config file.
    Figure 6.


    Figure 7.


    The value used to map to the returned prediction output is prediction(DURATIONML). If you call your prediction attribute something other than DURATIONML, you must rename the corresponding attribute in the vovrtsad.cfg file. The scoring process name is only used for reference. The remaining values are used by the vovrtsad daemon or the nc_rtsa wrapper for various purposes. The confidence_minimum value is the threshold that must be met to apply a prediction value, such as 0.6250 in this example. The quality_minimum value is used for retraining; in this example it is set to 0.4000. The frequency with which the daemon checks for potential retraining is set via check_quality_secs, in seconds, so once an hour in this example. Finally, retrain_job_minimum sets a minimum threshold of completed jobs monitored by the vovrtsad daemon before it will permit model retraining.

  3. The vovrtsa_util tool also offers some options for obscuring the key used for launching the scoring agent in the vovrtsad.cfg config file (-e, -u, -g). Here is an example usage:
    $ vovrtsa_util -g 
    aVV1AGhVO63rDAwGWNU3Vt+9T/FDZzFEa/zvk3YKmek=
    
    $ vovrtsa_util -e "My string" -u aVV1AGhVO63rDAwGWNU3Vt+9T/FDZzFEa/zvk3YKmek=
    33:wiATyTB7ATpsWGYcu5HkARNq7yyVFhpDD7r8m/3RndBo

    As used from within the vovrtsad.cfg config file:

    "key"     = 33:wiATyTB7ATpsWGYcu5HkARNq7yyVFhpDD7r8m/3RndBo",
    "pkey"    = "aVV1AGhVO63rDAwGWNU3Vt+9T/FDZzFEa/zvk3YKmek",