sa_clean Data cleansing tool instructions
After SDF is installed, the sa_clean tool no longer works; use the sdfadmin tool to clear data instead. See the sdfadmin data cleansing tool instructions.
If you are not sure whether SDF is installed in your environment, contact the on-duty support staff for one-on-one assistance.
1. Overview
The data cleansing tool can either clean behavior event data imported into Sensors Analytics or de-duplicate it.
The tool does not provide the following functions:
- Deleting the data of a specified property.
- Deleting data imported in a particular batch.
- Deleting data imported during a particular period.
- Deleting an event definition. An administrator can, however, hide events in metadata management.
Data clearing is irreversible. Frequent or large-scale clearing may cause excessive fragmentation, which may slow import progress. Exercise caution when performing this operation.
Except for data imported with the HdfsImporter tool, imported data passes through Kudu first; once certain conditions are met, the event data is moved to HDFS for storage. If you run the cluster edition of Sensors Analytics 1.13 or later, note that by default the data cleansing tool only deletes data in HDFS and does not delete data in Kudu. To delete recently imported data, do not stop the data import, and wait more than 4 hours so that the recently imported data has been moved to HDFS.
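The 4-hour wait can be checked in a script before a deletion is attempted. The following is a minimal sketch, assuming you track the import time of the data yourself as a Unix timestamp; the function name `imported_long_enough_ago` and its arguments are hypothetical and not part of the data cleansing tool.

```shell
# Returns 0 if more than 4 hours have passed since the data was imported.
# $1: Unix timestamp at which the data was imported (assumed to be tracked by you)
# $2: current Unix timestamp (optional, defaults to now)
imported_long_enough_ago() {
    imported_at=$1
    now=${2:-$(date +%s)}
    min_wait=$((4 * 3600))   # 4 hours, following the guidance above
    [ $((now - imported_at)) -gt "$min_wait" ]
}

# Example use in a script (IMPORT_TS is a hypothetical variable):
# if imported_long_enough_ago "$IMPORT_TS"; then echo "wait elapsed"; fi
```

The check only encodes the waiting period; it cannot confirm that the data has actually reached HDFS.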
2. Usage
First, ssh to any machine where the Sensors Analytics service is deployed. The data cleansing tool must be run under the sa_cluster account. To switch from root to the sa_cluster account:
su - sa_cluster
- Note the hyphen (minus sign) between su and sa_cluster.
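A script that wraps the tool can verify the account before doing anything else. This is a small sketch; `require_user` is a hypothetical helper name, and only the sa_cluster account name comes from this document.

```shell
# Succeeds only if the current user matches the expected account.
# $1: expected account name (sa_cluster for this tool)
require_user() {
    expected=$1
    current=$(id -un)
    if [ "$current" != "$expected" ]; then
        echo "run as $expected (current user: $current)" >&2
        return 1
    fi
}

# Example: abort a script early when not running as sa_cluster.
# require_user sa_cluster || exit 1
```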
2.1. Event deletion
This method cleans the behavior event data of all events, or of specified events, within a specified time period for a project.
- The time range refers to when the events occurred, not when the data was imported.
- This method preserves the definitions of events and event properties.
- Standalone edition: Disk space is not released immediately; it is released gradually in the background. If only specified events are deleted, the release may take a long time.
- Cluster edition: Disk space is not released immediately; it is cleaned up routinely every morning. Note: Event deletion in the cluster edition generally consumes a lot of time and resources, so avoid specifying too large a time range where possible. The -m parameter can also be used to speed up execution.
Parameter List:
Parameter Name | Required | Description | Format | Example | Remarks |
---|---|---|---|---|---|
begin | √ | The start date for data deletion (inclusive) | yyyy-MM-dd | 2015-12-21 | |
end | √ | The end date for data deletion (inclusive) | yyyy-MM-dd | 2015-12-22 | |
events | | The events whose data is to be deleted. In the cluster edition, multiple events can be specified, separated by commas. | Event name | ButtonClick | |
project | | The project to operate on. The default is "Default project". | Project name | my_project | |
hours | | The hour or range of hours to which the data to be deleted belongs. Separate multiple non-contiguous hours with commas. | [0-23] | 0,[3-5] | Only supported in the cluster edition |
max_tasks | | When deleting data for specific events, sets the task concurrency to speed up execution. The default is 1. | Positive integer | 2 | Only supported in the cluster edition |
libs | | The sources of the events to delete, separated by commas if there are several. | LIBS | LIBS | Only supported in the cluster edition |
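The hours format mixes single hours with bracketed ranges. To show how a value such as 0,[3-5] reads, the sketch below expands it into explicit hours. It is an illustration of the format only: `expand_hours` is a hypothetical helper, not the parser the tool uses, and it does not check that hours fall within 0 to 23.

```shell
# Expand an hours spec such as "0,[3-5]" into "0 3 4 5".
# The body runs in a subshell so the IFS and globbing changes stay local.
expand_hours() (
    IFS=,
    set -f                          # keep the brackets from being globbed
    out=""
    for part in $1; do
        case $part in
            \[*-*\])
                range=${part#\[}    # drop the leading "["
                range=${range%\]}   # drop the trailing "]"
                h=${range%-*}       # first hour of the range
                last=${range#*-}    # last hour of the range (inclusive)
                while [ "$h" -le "$last" ]; do
                    out="$out $h"
                    h=$((h + 1))
                done
                ;;
            *)
                out="$out $part"
                ;;
        esac
    done
    echo "${out# }"
)

expand_hours '0,[3-5]'   # prints: 0 3 4 5
```

When passing such a value on a command line, quoting it (for example '0,[3-5]') keeps the shell from treating the brackets as a glob pattern.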
- Clear the behavior event data for December 22, 2015 under the default project:
sa_clean clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --project default
- Clear the behavior event data from December 20 to December 22, 2015 under the production project:
sa_clean clean_event_by_date --begin 2015-12-20 --end 2015-12-22 --project production
- Clear the behavior event data of the ButtonClick event from December 20 to December 22, 2015 under the my_project project, with no changes to other events:
sa_clean clean_event_by_date --begin 2015-12-20 --end 2015-12-22 --events ButtonClick --project my_project
Note: If an event name passed to "events" contains the "$" symbol, enclose the event name in single quotes so that the shell does not expand it as a variable:
sa_clean clean_event_by_date --begin 2015-12-20 --end 2015-12-22 --events '$AppStart' --project my_project
- Clear the behavior event data of the ButtonClick event for hours 0, 3, 4, and 5 on December 22, 2015 under the my_project project, with no changes to other events:
sa_clean clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --events ButtonClick --hours 0,[3-5] --project my_project
- Clear the behavior event data of the ButtonClick event for hours 0, 3, 4, and 5 on December 22, 2015 under the my_project project, limited to the sources "scala" and "python", with no changes to other events:
sa_clean clean_event_by_date --begin 2015-12-22 --end 2015-12-22 --events ButtonClick --hours 0,[3-5] --project my_project --libs scala,python
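Because deletion cannot be undone, it can help to validate the arguments and preview the command before running it. The sketch below only prints a command in the form used by the examples above; `is_date` and `preview_clean_by_date` are hypothetical helper names, and the check covers the yyyy-MM-dd shape and date ordering only, not calendar validity.

```shell
# Returns 0 if $1 has the shape yyyy-MM-dd (digits only, no calendar check).
is_date() {
    case $1 in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
        *) return 1 ;;
    esac
}

# Print (do not run) a clean_event_by_date command after basic checks.
# $1: begin date, $2: end date (both inclusive), $3: project name
preview_clean_by_date() {
    begin=$1; end=$2; project=$3
    is_date "$begin" && is_date "$end" || {
        echo "dates must be yyyy-MM-dd" >&2
        return 1
    }
    # Compare as integers, e.g. 20151220 <= 20151222.
    b=$(echo "$begin" | sed 's/-//g')
    e=$(echo "$end" | sed 's/-//g')
    if [ "$b" -gt "$e" ]; then
        echo "begin must not be after end" >&2
        return 1
    fi
    echo "sa_clean clean_event_by_date --begin $begin --end $end --project $project"
}

# Example:
# preview_clean_by_date 2015-12-20 2015-12-22 production
```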
2.2. GDPR support: deleting underlying user data (currently only supported in the standalone and cloud editions)
- Solution 1: Delete by user ID (currently, this deletes the user's data in the events and users tables, as well as the user associations)
Parameter list:
Parameter Name | Required | Explanation | Format | Example | Notes |
---|---|---|---|---|---|
filename | √ | File containing the ID set | filename | user_id.txt | Default is user_id |
is_distinct_id | | Specifies whether the IDs in the ID set file are distinct_id values | | | Must be specified if the IDs in the file are distinct_id values |
project | | The project to operate on. The default is "default project". | Project name | my_project | |
- Create a text file listing the users, one per line, using either user_id or distinct_id values (USER_ID_SET_FILENAME or DISTINCT_ID_SET_FILENAME)
Use one of the following commands to delete:
Command to delete using user_id
sa_clean clean_event_and_profile_by_id_list --project PROJECT_NAME --filename USER_ID_SET_FILENAME
Command to delete using distinct_id
sa_clean clean_event_and_profile_by_id_list --project PROJECT_NAME --filename DISTINCT_ID_SET_FILENAME --is_distinct_id
Note: When deleting by distinct_id, the events table must contain the user's event data, because that data is used to find the user's information in the users table and delete it. Otherwise the deletion will not succeed, and the command to delete using user_id must be used instead.
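ID lists exported from other systems often carry Windows line endings, stray whitespace, blank lines, or repeated IDs. Whether the tool tolerates these is not stated here, so the sketch below cleans the file defensively before it is passed to --filename. `prepare_id_file` and the file names are hypothetical.

```shell
# Normalize an ID list: strip carriage returns and surrounding whitespace,
# drop empty lines, and remove duplicate IDs (output is sorted).
# $1: raw input file, $2: cleaned output file
prepare_id_file() {
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
        | sort -u > "$2"
}

# Example:
# prepare_id_file raw_ids.txt user_id.txt
# wc -l < user_id.txt    # number of distinct IDs that will be submitted
```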
- Solution 2: Mark users for deletion with profile_delete, then clean the marked data
Parameter list:
Parameter name | Required | Description | Format | Example | Notes |
---|---|---|---|---|---|
only_profile | | Whether to delete only the data in the profile table | | | When specified, only profile data is deleted |
project | | The project to operate on. The default is "Default project". | Project name | my_project | |
- Users can call the profile_delete interface to mark the data as deleted
- Use the following command to delete data, which will delete all data in the events and users tables, as well as the user relationships
sa_clean clean_event_and_profile_by_is_deleted --project PROJECT_NAME
To delete only the user data and user relationships in the users table:
sa_clean clean_event_and_profile_by_is_deleted --project PROJECT_NAME --only_profile
2.3. Event deduplication
This method removes duplicate imported data. Currently, it is only supported in privately deployed cluster environments.
Parameter list:
Parameter name | Required | Description | Format | Example | Remarks |
---|---|---|---|---|---|
begin | √ | Start date for data deletion (inclusive) | yyyy-MM-dd | 2015-12-21 | |
end | √ | End date for data deletion (inclusive) | yyyy-MM-dd | 2015-12-22 | |
project | | The project to operate on. The default is "Default Project". | Project name | my_project | |
skip_properties | | Properties to ignore when deduplicating, separated by commas | Property name | $ip, $city | Supported in version 1.13 and later; please upgrade the minor version first |
- Deduplicate data for the my_project project on January 2, 2016, ignoring the $ip and $city properties:
sa_clean distinct_event_by_date --project my_project --begin '2016-01-02' --end '2016-01-02' --skip_properties '$ip','$city'
Note: You can have deduplication ignore specific event properties, meaning that the values of those properties are not used to decide whether two events are duplicates. Ignoring the "time" property during deduplication is currently not supported.
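To make the effect of ignoring properties concrete, the sketch below deduplicates a few sample rows with awk. It is a conceptual illustration only and says nothing about how the tool is implemented; the tab-separated column layout (distinct_id, event, time, $ip, $city) and the function name are assumptions made for the example.

```shell
# Rows count as duplicates when all non-skipped columns (1 to 3) match;
# columns 4 ($ip) and 5 ($city) are ignored. The first occurrence is kept.
dedupe_ignoring_ip_city() {
    awk -F '\t' '!seen[$1 FS $2 FS $3]++'
}

# Sample data: rows 1 and 2 differ only in $ip and $city.
printf '%s\t%s\t%s\t%s\t%s\n' \
    u1 ButtonClick '2016-01-02 10:00:00' 10.0.0.1 Beijing \
    u1 ButtonClick '2016-01-02 10:00:00' 10.0.0.2 Shanghai \
    u2 ButtonClick '2016-01-02 10:00:00' 10.0.0.1 Beijing \
    | dedupe_ignoring_ip_city
```

With the skipped columns left out of the comparison, the second row is treated as a duplicate of the first, so only the first and third rows remain.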
3. Other
Because deletion is irreversible, the tool asks for confirmation during execution: the user must type yes and press Enter before the deletion is actually performed. If the operation has already been verified as correct before execution (mainly for automation scripts), add the --yes parameter to skip the confirmation prompt.
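For automation, the --yes parameter can be wrapped in a small function. This is a sketch under assumptions: the SA_CLEAN variable and the function name are hypothetical conveniences for previewing the command, and it is assumed that --yes is accepted alongside the other options of clean_event_by_date.

```shell
# Command to invoke; override with "echo" to preview instead of executing.
SA_CLEAN=${SA_CLEAN:-sa_clean}

# Run a date-range clean without the interactive confirmation prompt.
# $1: begin date, $2: end date (both yyyy-MM-dd, inclusive), $3: project name
clean_range_unattended() {
    "$SA_CLEAN" clean_event_by_date --begin "$1" --end "$2" --project "$3" --yes
}

# Preview only (nothing is deleted):
# SA_CLEAN=echo
# clean_range_unattended 2015-12-20 2015-12-22 my_project
```

Since --yes removes the last safeguard, previewing the exact command first is a cautious default for scheduled jobs.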
Note: The content of this document is a technical document that provides details on how to use the Sensors product and does not include sales terms; the specific content of enterprise procurement products and technical services shall be subject to the commercial procurement contract.