This tool can be used only after SDF is installed. If you have not installed SDF, please use the sa_clean tool for data cleaning instead; see the document: sa_clean Data Cleaning Tool User Manual.

If you are unsure whether SDF is installed in your environment, please consult your data consultant for one-on-one assistance.

Overview

The data cleaning tool deletes behavioral event data that has been imported into Sensors Analytics and deduplicates imported behavioral event data.

The following functionality is not provided by this tool:

  1. Delete the data imported in a specific batch.
  2. Delete the data imported during a specific time period.
  3. Delete a specific event definition. However, events can be hidden in metadata management, and administrators can still operate on hidden events.

Data cleaning is an irreversible operation. Frequent or large-scale data cleaning may produce too many file fragments, which can affect import progress, so please operate with caution. Except for data imported with the HdfsImporter tool, imported data first passes through Kudu; only when certain conditions are met is the event data converted to HDFS for storage.

Usage

Please SSH to any machine where the Sensors Analytics service is deployed and run the data cleaning tool under the sa_cluster account. To switch from the root user to the sa_cluster account:

su - sa_cluster
CODE
  • Note the hyphen between su and sa_cluster.

Event Deletion

  • This method deletes all event data of a project within a specific time period, or only the data of the specified events.
  • The time period refers to the event time of the behavior, not the time range in which the data was imported.
  • This method preserves the definitions of events and event properties.
  • For the standalone version: disk space is not released.
  • For the cluster version: disk space is not released immediately; it is cleaned up by a scheduled job at midnight. Note that event deletion in the cluster version usually consumes considerable time and resources, so avoid specifying an overly large time range.

Submit an event deletion task

Execute command

sdfadmin event_delete --method/-m execute <operation type> \
    --project/-p default <project name> \
    --begin/-b 2020-12-01 <start date of the data to delete (inclusive)> \
    --end/-e 2020-12-01 <end date of the data to delete (inclusive)> \
    [--priority] LOW <job priority> \
    [--events] 'HomePageView,FistPay' <events to delete> \
    [--hours] 3,13,21 <hours or hour ranges to delete> \
    [--libs] python,Android <sources of the events to delete> \
    [--hdfs_import_names] HdfsImporter-785,HdfsImporter-786 <HdfsImporter import batches to delete> \
    [--reserved_events] FollowClick,SpecificRoomView <events to keep> \
    [--properties] '{"city":["BeiJing","ShangHai"],"name":[],"age":[15,16,17]}' <properties and property values of the events to delete>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | execute / show / history | execute | |
| --begin | Yes | Start date of the data to delete (inclusive) | yyyy-MM-dd | 2020-06-17 | This date refers to the event time |
| --end | Yes | End date of the data to delete (inclusive) | yyyy-MM-dd | 2020-06-17 | This date refers to the event time |
| --project | Yes | Project on which the operation runs | Project name | default | |
| --priority | No | Job priority | VERY_LOW / LOW / NORMAL | LOW | Defaults to VERY_LOW. To raise the execution priority, specify LOW or NORMAL |
| --events | No | Events whose data should be deleted; separate multiple events with commas | Event names | 'gameListView,newHobby,search' | Because of $-sign escaping, the event list must be enclosed in single quotes, e.g. 'gameListView,newHobby,search' |
| --hours | No | Hours of the data to delete, separated by commas | 1,2,5,23 | 15,16,17 | |
| --libs | No | Sources of the events to delete, separated by commas | LIB | python,java | BatchImporter and HdfsImporter cannot be specified here. To delete a batch imported by HdfsImporter, use --hdfs_import_names below; deleting a batch imported by BatchImporter is currently not supported |
| --hdfs_import_names | No | Session IDs of the HdfsImporter import batches to delete | | HdfsImporter-785,HdfsImporter-786 | |
| --reserved_events | No | Events whose data should be kept | Event names | 'FollowClick,SpecificRoomView' | Because of $-sign escaping, the event list must be enclosed in single quotes |
| --properties | No | Delete data by the properties and property values of events | A valid JSON string, enclosed in single quotes and containing no extra spaces; otherwise it cannot be parsed | '{"city":["BeiJing","ShangHai"],"name":[],"age":[15,16,17],"height":[10]}' | See the notes below |
| --properties_file_path | No | Delete data based on the content of a file | File path | --properties_file_path a.txt | Nested single quotes and spaces inside strings are supported; see the notes below |
Notes on --properties:

  1. Currently only properties of the NUMBER, BOOL, and STRING data types are supported; the DATE and LIST types will be supported later.
  2. If only a property name is given without values, the property value is not considered during deletion (in the example, name is passed the empty array [], so any event that contains the name property is deleted).
  3. If multiple properties are specified, deletion takes the intersection over properties: only data that satisfies all property conditions is deleted. Within one property, the values are a union: data is deleted as long as its value equals any one of the specified values.
  4. When reporting decimal property values, keep 3 decimal places.
  5. Property values must be given as an array. Even when there is no value, or only one value, it must be written in square brackets [] (see "height":[10] and "name":[] in the example).
  6. Deleting data whose property value is NULL is not supported, because a real null value cannot be distinguished from the literal string "NULL" passed by the customer.
  7. Deleting data whose property value is the empty string ("") is supported, e.g. "name":[""].
  8. Version requirements: SDF 2.2.11292+ / 2.3.0.706+.
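As a minimal sketch of notes 4 and 5 combined (the property name price and its value are illustrative, not part of the manual), a decimal value is written with 3 decimal places and wrapped in an array:

sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --properties '{"price":[19.990]}'
CODE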
Notes on --properties_file_path:

  1. Deletes data based on the content of a file; nested single quotes and spaces inside strings are supported.
  2. Usage: --properties_file_path a.txt, where the file a.txt contains a valid JSON string, for example:

cat a.txt
{"$country":["People's Republic of China"],"$province":["Heilongjiang","Jilin","Liaoning","Hebei","Henan","Shandong","Shanxi","Anhui","Jiangxi","Jiangsu","Zhejiang","Fujian","Guangdong","Hunan","Hubei","Hainan","Yunnan","Guizhou","Sichuan","Qinghai","Gansu","Shaanxi","Inner Mongolia","Xinjiang","Guangxi","Ningxia","Tibet","Beijing","Tianjin","Shanghai","Chongqing","China"]}

  3. --properties and --properties_file_path are mutually exclusive; if one is specified, the other should not be. If both are specified anyway, the priority is properties_file_path > properties.
  4. Version requirements: SDF 2.2.12389+ / 2.3.1.56

Note: --events and --reserved_events cannot be used at the same time.

Usage examples

  • Delete the data of project httest on November 4, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-04 -e 2020-11-04
CODE
  • Delete the data of project httest imported via the python SDK and java SDK on November 4, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-04 -e 2020-11-04 --libs 'python,java'
CODE
  • Delete the data of project httest with event names gameListView, newHobby, search on November 1, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --events 'gameListView,newHobby,search'
CODE
  • Delete the data of project httest in hours 1, 5, and 7 on November 1, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --hours 1,5,7
CODE
  • Delete the data of the event search in project httest on November 1, 2020, where the deleted data must contain the properties city, name, and height; the value of city must equal "BeiJing" or "ShangHai", the value of name is unrestricted, and the value of height must equal 10
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --events 'search' --properties '{"city":["BeiJing","ShangHai"],"name":[],"height":[10]}'
CODE
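The remaining optional parameters follow the same pattern. The sketches below reuse project httest and the batch, event, and file names shown earlier in this section; adjust them to your environment:

  • Delete the data imported into project httest by batch HdfsImporter-785 on November 1, 2020
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --hdfs_import_names HdfsImporter-785
CODE
  • Delete all event data of project httest on November 1, 2020 except the events FollowClick and SpecificRoomView
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --reserved_events 'FollowClick,SpecificRoomView'
CODE
  • Delete the data of project httest on November 1, 2020 that matches the JSON filter stored in the file a.txt
sdfadmin event_delete -m execute -p httest -b 2020-11-01 -e 2020-11-01 --properties_file_path a.txt
CODE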


View event deletion job information

Execute command

sdfadmin event_delete --method/-m show <operation type> \
    --job_id 74562 <job ID returned after submitting the execute job>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | execute / show / history | execute | |
| --job_id | Yes | Job ID returned after submitting the execute job | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
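For example, to view the status of the deletion job whose ID 74562 was returned by execute (the ID is illustrative):

sdfadmin event_delete -m show --job_id 74562
CODE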

View the event deletion job history

Execute command

sdfadmin event_delete --method/-m history <operation type> \
    --project/-p default <project name> \
    --begin/-b 2020-12-01 <earliest job creation time to view (inclusive)> \
    --end/-e 2020-12-01 <latest job creation time to view (inclusive)>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | | | |
| --project | Yes | Project name | | | |
| --begin | Yes | Earliest job creation time | yyyy-MM-dd | | |
| --end | Yes | Latest job creation time | yyyy-MM-dd | | |
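For example, to list the deletion jobs of project default created between December 1 and December 31, 2020 (project and dates are illustrative):

sdfadmin event_delete -m history -p default -b 2020-12-01 -e 2020-12-31
CODE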

Cancel the running deletion task

Execute command

sdfadmin event_delete --method/-m cancel <operation type> \
    --job_id 74562 <job ID returned after submitting the execute job>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | cancel | cancel | Specify cancel to cancel the job |
| --job_id | Yes | Job ID returned after submitting the execute job | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
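For example, to cancel the running deletion job with ID 74562 (illustrative):

sdfadmin event_delete -m cancel --job_id 74562
CODE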

Event Deduplication

  • This method deduplicates data that was imported more than once. Currently it is only supported in private cluster environments.

Submit an event deduplication task

Execute command

sdfadmin event_distinct --method/-m execute <operation type> \
    --project/-p default <project name> \
    --begin/-b 2020-12-01 <start date of the data to deduplicate (inclusive)> \
    --end/-e 2020-12-01 <end date of the data to deduplicate (inclusive)> \
    [--priority] LOW <job priority> \
    [--events] 'HomePageView,FistPay' <events to deduplicate> \
    [--skip_properties] property1,property2 <properties to ignore during deduplication>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | execute / show / history | execute | |
| --begin | Yes | Start date of the data to deduplicate (inclusive) | yyyy-MM-dd | 2020-06-17 | This date refers to the event time |
| --end | Yes | End date of the data to deduplicate (inclusive) | yyyy-MM-dd | 2020-06-17 | This date refers to the event time |
| --project | Yes | Project on which the operation runs | Project name | default | |
| --priority | No | Job priority | VERY_LOW / LOW / NORMAL | LOW | Defaults to VERY_LOW. To raise the execution priority, specify LOW or NORMAL |
| --events | No | Events to deduplicate; separate multiple events with commas | Event names | gameListView,newHobby,search | |
| --skip_properties | No | Properties to ignore during deduplication, separated by commas | Property names | property1,property2 | See the notes below |

Notes on --skip_properties:

  1. _offset, $kafka_offset, and $receive_time are ignored by default. To ignore additional properties, specify them with this parameter; for example, to ignore the $lib property during deduplication: --skip_properties '$lib'
  2. time is not an event property and cannot be ignored.
  3. If the data was imported by HdfsImporter, the $hdfs_import_batch_name property needs to be skipped.

Usage examples

  • Deduplicate data for project httest on November 20, 2020
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20
CODE
  • Deduplicate events HomePageView and FistPay for project httest on November 20, 2020
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20 --events 'HomePageView,FistPay'
CODE
  • Deduplicate events for project httest on November 20, 2020, ignoring the $lib property
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20 --skip_properties '$lib'
CODE
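  • Deduplicate data imported by HdfsImporter for project httest on November 20, 2020; per the note above, this sketch skips the $hdfs_import_batch_name property
sdfadmin event_distinct -m execute -p httest -b 2020-11-20 -e 2020-11-20 --skip_properties '$hdfs_import_batch_name'
CODE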

View deduplication job information

Execute command

sdfadmin event_distinct --method/-m show <operation type> \
    --job_id 74562 <job ID returned after submitting the execute job>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | execute / show / history | execute | |
| --job_id | Yes | Job ID returned after submitting the execute job | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
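For example, to view the status of the deduplication job with ID 74562 (illustrative):

sdfadmin event_distinct -m show --job_id 74562
CODE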

View the history of event deduplication jobs

Execute command

sdfadmin event_distinct --method/-m history <operation type> \
    --project/-p default <project name> \
    --begin/-b 2020-12-01 <earliest job creation time to view (inclusive)> \
    --end/-e 2020-12-01 <latest job creation time to view (inclusive)>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | | | |
| --project | Yes | Project name | | | |
| --begin | Yes | Earliest job creation time | yyyy-MM-dd | | |
| --end | Yes | Latest job creation time | yyyy-MM-dd | | |
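For example, to list the deduplication jobs of project default created between December 1 and December 31, 2020 (project and dates are illustrative):

sdfadmin event_distinct -m history -p default -b 2020-12-01 -e 2020-12-31
CODE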

Cancel the running deduplication task

Execute command

sdfadmin event_distinct --method/-m cancel <operation type> \
    --job_id 74562 <job ID returned after submitting the execute job>
CODE

Parameter list

| Parameter name | Required | Description | Format | Example | Remark |
|---|---|---|---|---|---|
| --method | Yes | Operation type | cancel | cancel | Specify cancel to cancel the job |
| --job_id | Yes | Job ID returned after submitting the execute job | | 74562 | This job ID corresponds to the ID in the sdf_data_loader_job table |
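For example, to cancel the running deduplication job with ID 74562 (illustrative):

sdfadmin event_distinct -m cancel --job_id 74562
CODE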