After SDF is installed, the data cleansing tool described on this page no longer works. Use the sdfadmin tool to clear data instead; see the sdfadmin data cleansing tool instructions.

If you are not sure whether SDF is installed in your environment, please consult the on-duty support staff for one-on-one assistance.

1. Overview

The data cleansing tool can be used to either clean the imported behavior event data in the strategy analysis or to de-duplicate the imported behavior event data.

The tool does not provide the following functions:

  1. Deleting the data for a specified property.
  2. Deleting the data of a specific import batch.
  3. Deleting the data imported during a certain period.
  4. Deleting an event definition. However, an administrator can hide events in metadata management.

Data clearing is irreversible. Frequent or large-scale clearing can cause excessive fragmentation, which may affect import progress, so exercise caution when performing this operation.

Except for data imported with the HdfsImporter tool, imported data passes through Kudu first; once certain conditions are met, the event data is moved to HDFS for storage. If you run the cluster version of Oracle Analytics 1.13 or later, note that by default the data cleansing tool deletes only the data in HDFS and does not delete the data in Kudu. If you need to delete recently imported data, do not stop the data import and wait for more than 4 hours to ensure that the recently imported data has been moved to HDFS.

2. Usage

First, SSH to any machine where the Oracle service is deployed. The data cleansing tool must be run under the sa_cluster account. To switch from root to the sa_cluster account:

su - sa_cluster
  • Note the hyphen (-) between su and sa_cluster; it starts a login shell with the sa_cluster environment.
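
For automation scripts, a small guard can refuse to continue when the script is not running under the expected account. The following is a minimal sketch; the function name require_user is hypothetical and is not part of the data cleansing tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the data cleansing tool): succeed only
# when the current account matches the expected one.
require_user() {
    local expected="$1"
    local current
    current="$(id -un)"
    if [ "$current" != "$expected" ]; then
        echo "must run as ${expected}, but the current user is ${current}" >&2
        return 1
    fi
}
```

A script would typically start with `require_user sa_cluster || exit 1` before calling any cleaning command.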

2.1.  Event deletion

This method clears the behavior event data of all events, or of specified events, within a specified time range for a project.

  • The time range refers to when the events occurred, not when the data was imported.
  • This method preserves the definitions of events and event properties.
  • Standalone version: Disk space is not released immediately; it is released gradually in the background. If only specified events are deleted, the release may take a long time.
  • Cluster version: Disk space is not released immediately; it is cleaned up routinely every morning. Note: event deletion in the cluster version generally consumes a lot of time and resources, so avoid specifying too large a time range. The -m parameter can also be used to speed up execution.
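
The advice to keep time ranges small can be followed by splitting a long range into one command per day. The sketch below only prints the commands (a dry run) and uses the --begin, --end and --project flags from the parameter list. The function name print_daily_clean_commands is hypothetical, and the date arithmetic assumes GNU date (the -d option).

```shell
#!/usr/bin/env bash
# Print one clean_event_by_date command per day in [begin, end], both inclusive.
# Dry run only: the commands are printed, not executed.
# Assumes GNU date; BSD/macOS date uses different options.
print_daily_clean_commands() {
    local begin="$1" end="$2" project="$3"
    local day end_s day_s
    day="$begin"
    end_s="$(date -u -d "$end" +%s)" || return 1
    while :; do
        day_s="$(date -u -d "$day" +%s)" || return 1
        # Stop once the current day is past the end date.
        [ "$day_s" -le "$end_s" ] || break
        printf 'sa_clean clean_event_by_date --begin %s --end %s --project %s\n' \
            "$day" "$day" "$project"
        day="$(date -u -d "$day +1 day" +%F)" || return 1
    done
}

daily_commands="$(print_daily_clean_commands 2015-12-20 2015-12-22 production)"
printf '%s\n' "$daily_commands"
```

Reviewing the printed commands before running them keeps each invocation small and makes it easy to stop between days.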

Parameter List:

| Parameter Name | Required | Description | Format | Example | Remarks |
| --- | --- | --- | --- | --- | --- |
| begin | | The start date (including this day) for data deletion | yyyy-MM-dd | 2015-12-21 | |
| end | | The end date (including this day) for data deletion | yyyy-MM-dd | 2015-12-22 | |
| events | | The event whose data is deleted. For the cluster version, multiple events can be specified, separated by commas. | Event name | ButtonClick | |
| project | | The project corresponding to the operation. The default value is "Default project". | Project name | my_project | |
| hours | | The hour or range of hours to which the data to be deleted belongs. Use a comma to separate multiple non-contiguous times. | [0-23] | 0,[3-5] | Only supported in the cluster version |
| max_tasks | | When deleting data for a specific event, sets the concurrency of tasks in order to speed up execution. The default value is 1. | Positive integer | 2 | Only supported in the cluster version |
| libs | | The sources of the events to delete, separated by commas if there are multiple sources. | LIBS | LIBS | Only supported in the cluster version |

  • Clear the behavior event data of the day December 22, 2015 under the default project:
sa_clean clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --project default
  • Clear the behavior event data from December 20 to December 22, 2015 under the production project:
sa_clean clean_event_by_date --begin 2015-12-20 --end 2015-12-22 --project production
  • Clear the behavior event data of the ButtonClick event from December 20 to December 22, 2015 under the my_project project, with no changes to other events:
sa_clean clean_event_by_date --begin 2015-12-20 --end 2015-12-22 --events ButtonClick --project my_project 

Note: If the event name in "events" contains the "$" symbol, the event name after "events" needs to be enclosed in single quotes:

sa_clean clean_event_by_date --begin 2015-12-20 --end 2015-12-22 --events '$AppStart' --project my_project
  • Clear the behavior event data at 0 o'clock, 3 o'clock, 4 o'clock, and 5 o'clock on December 22, 2015 of the ButtonClick event under the my_project project, with no changes to other events:
sa_clean clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --events ButtonClick --hours 0,[3-5] --project my_project
  • Clear the behavior event data at 0 o'clock, 3 o'clock, 4 o'clock, and 5 o'clock on December 22, 2015 of the ButtonClick event under the my_project project, from the sources "scala" and "python", with no changes to other events:
sa_clean clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --events ButtonClick --hours 0,[3-5] --project my_project --libs scala,python 
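
The quoting rule for "$" in event names comes from ordinary shell expansion, and the sketch below makes it visible without calling the tool. It also shows a related point that the source does not mention: an unquoted value such as 0,[3-5] is a filename pattern to the shell, so quoting the --hours value is a precaution suggested here, not a documented requirement. The helper show_args and the scratch file name are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Print each argument in brackets so the effect of shell expansion is visible.
show_args() { printf '[%s]' "$@"; printf '\n'; }

# 1. "$" in an event name: unquoted, the shell expands $AppStart (unset here)
#    to nothing, so the event name never reaches the command.
unset AppStart
unquoted_event="$(show_args --events $AppStart)"
quoted_event="$(show_args --events '$AppStart')"

# 2. Brackets in --hours: if the current directory happens to contain a
#    matching file name, the unquoted pattern is replaced by that name.
scratch="$(mktemp -d)"
cd "$scratch" || exit 1
touch '0,4'
unquoted_hours="$(show_args --hours 0,[3-5])"
quoted_hours="$(show_args --hours '0,[3-5]')"
cd - >/dev/null || exit 1
rm -rf "$scratch"

echo "$unquoted_event"   # [--events]
echo "$quoted_event"     # [--events][$AppStart]
echo "$unquoted_hours"   # [--hours][0,4]
echo "$quoted_hours"     # [--hours][0,[3-5]]
```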

2.2. GDPR support: deleting underlying user data (currently supported only in the standalone and cloud versions)

  • Solution 1: Delete using user ID (currently, data in the events and users tables will be deleted, and user associations will also be deleted)

List of parameters:

| Parameter Name | Required | Explanation | Format | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| filename | | File containing the ID set | File name | user_id.txt | IDs are treated as user_id by default |
| is_distinct_id | | Specifies whether the data in the ID set file is distinct_id | | | Must be specified if the IDs in the file are distinct_id values |
| project | | The project corresponding to the operation. The default is "Default project". | Project name | my_project | |

  1. Create a text file with one user per line, containing either user_id or distinct_id values (USER_ID_SET_FILENAME or DISTINCT_ID_SET_FILENAME).
  2. Use one of the following commands to delete the data.

Command to delete using user_id

sa_clean clean_event_and_profile_by_id_list --project PROJECT_NAME --filename USER_ID_SET_FILENAME

Command to delete using distinct_id

sa_clean clean_event_and_profile_by_id_list --project PROJECT_NAME --filename DISTINCT_ID_SET_FILENAME --is_distinct_id

Note: Deleting by distinct_id locates the user through the events table, so the user's record in the users table can be found and deleted only if the events table contains event data for that user. Otherwise the deletion fails, and the user_id command must be used instead.
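
The ID file is a plain list with one ID per line. The source does not say how the tool treats blank lines, stray whitespace or duplicate IDs, so cleaning the list beforehand is a precaution rather than a requirement. In this sketch, raw_ids.txt and its contents are made-up sample data.

```shell
#!/usr/bin/env bash
# Normalise a raw ID list into a one-ID-per-line file for --filename.
# raw_ids.txt and the IDs in it are made-up sample data.
workdir="$(mktemp -d)"
printf '1001\n\n  1002  \n1001\n1003\r\n' > "$workdir/raw_ids.txt"

# Strip carriage returns and surrounding whitespace, drop empty lines,
# then sort and de-duplicate.
tr -d '\r' < "$workdir/raw_ids.txt" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v '^$' \
    | sort -u > "$workdir/user_id.txt"

cat "$workdir/user_id.txt"
```

The resulting user_id.txt can then be passed as USER_ID_SET_FILENAME or DISTINCT_ID_SET_FILENAME in the commands above.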

  • Solution 2: Use the profile_delete flag for deletion marking

Parameter list:

| Parameter name | Required | Description | Format | Example | Notes |
| --- | --- | --- | --- | --- | --- |
| only_profile | | Whether to delete only the data in the profile table | | | When specified, only profile data is deleted |
| project | | The project corresponding to the operation. The default is "Default project". | Project name | my_project | |

  1. Users can call the profile_delete interface to mark the data as deleted
  2. Use the following command to delete data, which will delete all data in the events and users tables, as well as the user relationships
sa_clean clean_event_and_profile_by_is_deleted --project PROJECT_NAME

To delete only the user data and user relationships in the users table:

sa_clean clean_event_and_profile_by_is_deleted --project PROJECT_NAME --only_profile

2.3. Event deduplication

This method removes data that was imported more than once. Currently, only privately deployed cluster environments support this method.

Parameter list:

| Parameter name | Required | Description | Format | Example | Remarks |
| --- | --- | --- | --- | --- | --- |
| begin | | Start date for data deletion (including this day) | yyyy-MM-dd | 2015-12-21 | |
| end | | End date for data deletion (including this day) | yyyy-MM-dd | 2015-12-22 | |
| project | | The project corresponding to the operation. The default is "Default project". | Project name | my_project | |
| skip_properties | | Properties to ignore when deduplicating, separated by commas | Property name | $ip, $city | Supported in version 1.13 and later; please upgrade to a minor version first |

  • Deduplicate data for the my_project project on January 2nd, 2016, ignoring the $ip and $city attributes
sa_clean distinct_event_by_date --project my_project --begin '2016-01-02' --end '2016-01-02' --skip_properties '$ip','$city'

Note: You can exclude specific event attributes from deduplication, meaning that the values of those attributes are not used to decide whether two events are duplicates. Excluding the "time" attribute is currently not supported.
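
When the list of ignored properties is built in a script, the comma-separated value has to be assembled without letting the shell expand the "$" names. The sketch below joins the names and prints the resulting command as a dry run; the helper join_by_comma is hypothetical. After shell processing, the single quoted value '$ip,$city' is the same argument as '$ip','$city' in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join all arguments with commas.
join_by_comma() {
    local IFS=,
    printf '%s' "$*"
}

# Single quotes keep the shell from expanding $ip and $city.
skip_csv="$(join_by_comma '$ip' '$city')"

# Dry run: print the command instead of executing it.
dedup_command="$(printf "sa_clean distinct_event_by_date --project my_project --begin 2016-01-02 --end 2016-01-02 --skip_properties '%s'" "$skip_csv")"
echo "$dedup_command"
```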

3. Other

Because deletion is irreversible, the tool asks you to type yes and press Enter during execution before it performs the actual deletion. If the operation has been verified as correct beforehand (mainly for automation scripts), add the --yes parameter to skip the confirmation prompt.
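
In an automation script, it can help to make the use of --yes an explicit decision rather than a default. The following sketch only prints the command it would run. The wrapper build_clean_command and the "confirmed" keyword are hypothetical, and placing --yes at the end of the command line is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: append --yes only when the caller passes "confirmed"
# as the first argument. Dry run: the command is printed, not executed.
# Placing --yes at the end of the command line is an assumption.
build_clean_command() {
    local confirm="$1"
    shift
    if [ "$confirm" = "confirmed" ]; then
        echo "sa_clean $* --yes"
    else
        echo "sa_clean $*"
    fi
}

with_yes="$(build_clean_command confirmed clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --project default)"
without_yes="$(build_clean_command interactive clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --project default)"
echo "$with_yes"
echo "$without_yes"
```

Because the arguments are joined for display only, values that need quoting (such as event names containing "$") would have to be quoted again before the printed line is executed.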