sdfadmin Data Cleaning Tool User Manual
This tool is available only if SDF is installed. If you have not installed SDF, use the sa_clean tool for data cleaning instead; see the document: sa_clean Data Cleaning Tool User Manual.
If you are unsure whether SDF is installed in your environment, please contact your data consultant for one-on-one assistance.
Overview
The data cleaning tool deletes behavioral event data that has been imported into Sensors Analytics and deduplicates imported behavioral event data.
The following functionality is not provided by this tool:
- Delete the data imported in a specific batch.
- Delete the data imported during a specific time period.
- Delete a specific event definition. (Events can, however, be hidden in metadata management, where administrators can still manage them.)
Data cleaning is an irreversible operation. Frequent or large-scale data cleaning may produce an excessive number of data fragments, which can slow down data import. Please operate with caution. Except for data imported with the HdfsImporter tool, imported data first lands in Kudu; only after certain conditions are met is the event data moved to HDFS for storage.
Usage
Please SSH to any machine on which the Sensors Analytics service is deployed and run the data cleaning tool under the sa_cluster account. Switch from the root user to the sa_cluster account:
su - sa_cluster
- Note the hyphen between su and sa_cluster.
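If you are unsure which account you are currently using, you can verify the switch with a standard Linux command (whoami is a system utility, not part of this tool); it should print sa_cluster before you run any sdfadmin commands:
whoami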
Event Deletion
- This method can be used to delete all event data of a project within a specific time period, or to delete the data of specific events.
- The time period refers to the event time of the behavior, not the time range in which the data was imported.
- This method preserves the definitions of events and event properties.
- Standalone version: the disk space is not released.
- Cluster version: the disk space is not released immediately; it is cleaned up periodically at midnight. Note that event deletion in the cluster version usually consumes substantial time and resources, so avoid specifying an overly large time range.
Submit an event deletion task
Execute command
sdfadmin event_delete --method/-m execute <operation type> \
--project/-p default <project name> \
--begin/-b 2020-12-01 <start date of the data to delete (inclusive)> \
--end/-e 2020-12-01 <end date of the data to delete (inclusive)> \
[--priority] LOW <job priority> \
[--events] 'HomePageView,FistPay' <events to delete> \
[--hours] 3,13,21 <hours or hour range to delete> \
[--libs] python,Android <sources of the events to delete> \
[--hdfs_import_names] HdfsImporter-785,HdfsImporter-786 <HdfsImporter import batches to delete> \
[--reserved_events] FollowClick,SpecificRoomView <event names to keep> \
[--properties] '{"city":["BeiJing","ShangHai"],"name":[],"age":[15,16,17]}' <properties and property values of the events to delete>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | execute / show / history | execute | |
--begin | o | Start date of the data to delete (inclusive) | yyyy-MM-dd | 2020-06-17 | This refers to the event time |
--end | o | End date of the data to delete (inclusive) | yyyy-MM-dd | 2020-06-17 | This refers to the event time |
--project | o | Project the operation applies to | Project name | default | |
--priority | | Job priority | VERY_LOW / LOW / NORMAL | LOW | Defaults to VERY_LOW; specify LOW / NORMAL to raise the execution priority |
--events | | Events whose data should be deleted; separate multiple events with commas | Event name | 'gameListView,newHobby,search' | Because the shell escapes the $ sign, enclose the event names in single quotes, e.g. 'gameListView,newHobby,search' |
--hours | | Hours of the data to delete, separated by commas | 1,2,5,23 | 15,16,17 | |
--libs | | Sources of the events to delete, separated by commas | LIBS | python,java | BatchImporter and HdfsImporter cannot be specified here. To delete a batch imported by HdfsImporter, use --hdfs_import_names below; deleting a batch imported by BatchImporter is currently not supported |
--hdfs_import_names | | Session IDs of the HdfsImporter import batches to delete | | HdfsImporter-785,HdfsImporter-786 | |
--reserved_events | | Event names to keep | Event name | FollowClick,SpecificRoomView | Because the shell escapes the $ sign, enclose the event names to keep in single quotes |
--properties | | Delete data by event properties and property values | A valid JSON string; enclose the whole string in single quotes and do not include extra spaces, otherwise it cannot be parsed | '{"city":["BeiJing","ShangHai"],"name":[],"age":[15,16,17],"height":[10]}' | |
--properties_file_path | | Delete data based on the content of a file; nested single quotes and spaces in strings are supported | Path to a file containing a valid JSON string | --properties_file_path a.txt | --properties and --properties_file_path are mutually exclusive; if both are specified, properties_file_path takes precedence. Version requirements: sdf 2.2.12389+ / 2.3.1.56 |
Note: the --events and --reserved_events parameters cannot be used at the same time.
Usage example
- Delete the data of project httest on November 4, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-04 -e 2020-11-04
- Delete the data of project httest imported via the Python SDK and Java SDK on November 4, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-04 -e 2020-11-04 --libs 'python,java'
- Delete the data of project httest with event names gameListView, newHobby, search on November 1, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --events 'gameListView,newHobby,search'
- Delete the data of project httest with hours 1, 5, 7 on November 1, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --hours 1,5,7
- Delete the data of project httest with event name search on November 1, 2020, where the deleted data must contain the properties city, name, and height; the value of city must equal "BeiJing" or "ShangHai", the value of name is unrestricted, and the value of height must equal 10
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --events 'search' --properties '{"city":["BeiJing","ShangHai"],"name":[],"height":[10]}'
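The following sketches cover two flags not shown in the examples above; they follow the same syntax, and the file a.txt and its JSON content are illustrative assumptions:
- Delete all events of project httest on November 1, 2020 except FollowClick and SpecificRoomView
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --reserved_events 'FollowClick,SpecificRoomView'
- Delete search events of project httest on November 1, 2020, reading the property filter from a file, which (unlike --properties) may contain spaces and nested single quotes
echo '{"city":["Bei Jing","Shang Hai"],"name":[]}' > a.txt
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --events 'search' --properties_file_path a.txt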
View event deletion job information
Execute command
sdfadmin event_delete --method/-m show <operation type> \
--job_id 74562 <job ID returned when the job was submitted with execute>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | execute / show / history | show | |
--job_id | o | Job ID returned when the job was submitted with execute | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
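For example, to view the status of job 74562 (the job ID shown above):
sdfadmin event_delete -m show --job_id 74562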
View the event deletion job history
Execute command
sdfadmin event_delete --method/-m history <operation type> \
--project/-p default <project name> \
--begin/-b 2020-12-01 <earliest job creation time to view (inclusive)> \
--end/-e 2020-12-01 <latest job creation time to view (inclusive)>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | execute / show / history | history | |
--project | o | Project name | Project name | default | |
--begin | o | Earliest job creation time to view (inclusive) | yyyy-MM-dd | 2020-12-01 | |
--end | o | Latest job creation time to view (inclusive) | yyyy-MM-dd | 2020-12-01 | |
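For example, to list the deletion jobs created for project default between December 1 and December 31, 2020 (dates are illustrative):
sdfadmin event_delete -m history -p default -b 2020-12-01 -e 2020-12-31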
Cancel the running deletion task
Execute command
sdfadmin event_delete --method/-m cancel <operation type> \
--job_id 74562 <job ID returned when the job was submitted with execute>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | cancel | cancel | Fixed as cancel |
--job_id | o | Job ID returned when the job was submitted with execute | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
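For example, to cancel running job 74562:
sdfadmin event_delete -m cancel --job_id 74562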
Event Deduplication
- This method removes duplicate imported event data. Currently, it is only supported in private cluster environments.
Submit an event deduplication task
Execute command
sdfadmin event_distinct --method/-m execute <operation type> \
--project/-p default <project name> \
--begin/-b 2020-12-01 <start date of the data to deduplicate (inclusive)> \
--end/-e 2020-12-01 <end date of the data to deduplicate (inclusive)> \
[--priority] LOW <job priority> \
[--events] 'HomePageView,FistPay' <events to deduplicate> \
[--skip_properties] property1,property2 <properties to ignore during deduplication>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | execute / show / history | execute | |
--begin | o | Start date of the data to deduplicate (inclusive) | yyyy-MM-dd | 2020-06-17 | This refers to the event time |
--end | o | End date of the data to deduplicate (inclusive) | yyyy-MM-dd | 2020-06-17 | This refers to the event time |
--project | o | Project the operation applies to | Project name | default | |
--priority | | Job priority | VERY_LOW / LOW / NORMAL | LOW | Defaults to VERY_LOW; specify LOW / NORMAL to raise the execution priority |
--events | | Events to deduplicate; separate multiple events with commas | Event name | gameListView,newHobby,search | |
--skip_properties | | Properties to ignore during deduplication, separated by commas | Property name | property1,property2 | _offset, $kafka_offset, and $receive_time are ignored by default; specify any additional properties to ignore with this parameter, e.g. ignore the $lib property: --skip_properties '$lib'. Note: time is not an event property and cannot be ignored. If the data was imported by HdfsImporter, the $hdfs_import_batch_name property needs to be skipped |
Usage example
- Deduplicate data for project httest on November 20, 2020
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20
- Deduplicate events HomePageView and FistPay for project httest on November 20, 2020
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20 --events 'HomePageView,FistPay'
- Deduplicate events for project httest on November 20, 2020, ignoring the $lib property
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20 --skip_properties '$lib'
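- Deduplicate data for project httest on November 20, 2020 that was imported by HdfsImporter, skipping the batch-name property as noted in the parameter table (a hedged sketch based on the --skip_properties remark above)
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20 --skip_properties '$hdfs_import_batch_name'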
View deduplication job information
Execute command
sdfadmin event_distinct --method/-m show <operation type> \
--job_id 74562 <job ID returned when the job was submitted with execute>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | execute / show / history | show | |
--job_id | o | Job ID returned when the job was submitted with execute | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
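For example, to view the status of deduplication job 74562:
sdfadmin event_distinct -m show --job_id 74562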
View the history of event deduplication jobs
Execute command
sdfadmin event_distinct --method/-m history <operation type> \
--project/-p default <project name> \
--begin/-b 2020-12-01 <earliest job creation time to view (inclusive)> \
--end/-e 2020-12-01 <latest job creation time to view (inclusive)>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | execute / show / history | history | |
--project | o | Project name | Project name | default | |
--begin | o | Earliest job creation time to view (inclusive) | yyyy-MM-dd | 2020-12-01 | |
--end | o | Latest job creation time to view (inclusive) | yyyy-MM-dd | 2020-12-01 | |
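For example, to list the deduplication jobs created for project default between December 1 and December 31, 2020 (dates are illustrative):
sdfadmin event_distinct -m history -p default -b 2020-12-01 -e 2020-12-31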
Cancel the running deduplication task
Execute command
sdfadmin event_distinct --method/-m cancel <operation type> \
--job_id 74562 <job ID returned when the job was submitted with execute>
Parameter list
Parameter name | Required | Description | Format | Example | Remark |
---|---|---|---|---|---|
--method | o | Operation type | cancel | cancel | Fixed as cancel |
--job_id | o | Job ID returned when the job was submitted with execute | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
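For example, to cancel running deduplication job 74562:
sdfadmin event_distinct -m cancel --job_id 74562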
Note: This document is a technical reference describing how to use the Sensors product and does not constitute sales terms; the specific scope of purchased products and technical services is subject to the commercial procurement contract.