Data Preprocessing Module
This document covers advanced usage of Sensors Analytics and involves many technical details; it is intended for experienced users. If anything in it is unclear, please contact the Sensors Analytics team for one-on-one assistance.
1.1. Overview
Since version 1.6, Sensors Analytics has offered a customizable "Data Preprocessing Module". It provides a simple ETL step for data received through the SDKs and other channels (excluding the batch import tool), making data integration more flexible. This feature is available only in private deployments.
The data sources that can be processed by the Data Preprocessing Module include:
- SDK (data sent directly by the SDKs for the various languages, including visualized event tracking data; data written to files with LoggingConsumer and then loaded with the batch import tool is excluded);
- LogAgent;
- Logstash;
- FormatImporter;
The workflow of the Data Preprocessing Module is as follows:
Data received by the server from the SDK ---(Data Preprocessing Module)---> Data imported into Sensors Analytics
The Data Preprocessing Module can:
- Modify data content: add, delete, modify field values, modify event names, etc.;
- Discard data: the processing result can directly return null to discard a piece of data;
- Add data: derive multiple pieces of data from one piece of data;
Use the Data Preprocessing Module with caution, as it can change the data significantly!
If the processing function throws an exception, the entire piece of data is discarded. Take care to handle cases such as null pointer exceptions.
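One way to reduce the risk of losing data to an unexpected exception is to wrap the transformation defensively. The following is a minimal sketch in Python for illustration only; it is not the interface of the sample projects. The names `safe_process` and `transform` are hypothetical, and passing the input through unchanged on failure is an assumed policy that may not suit every deployment.

```python
import json
from typing import Callable, Optional


def safe_process(record_json: str,
                 transform: Callable[[dict], Optional[dict]]) -> Optional[str]:
    """Apply `transform` to one record without letting an exception escape.

    Hypothetical wrapper: JSON string in, JSON string (or None) out.
    """
    try:
        record = json.loads(record_json)
        result = transform(record)
    except Exception:
        # Assumed policy: keep the original record rather than lose it.
        return record_json
    # A transform that returns None means "discard this record".
    return None if result is None else json.dumps(result)
```

With this wrapper, a transform that dereferences a missing or null field leaves the record untouched instead of causing it to be dropped.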
For an implementation of the "Data Preprocessing Module" for version 1.13 or earlier, please refer to sensorsdata/ext-processor-sample.
For versions 1.14 and later, please refer to sensorsdata/preprocessor-sample.
For an example of filtering old users' devices after embedding the Sensors Analytics SDK in the app, refer to sensorsdata/ext-processor-identify-old-users;
For an example of recording activated devices on the server side to solve the problem of duplicate activation count after reinstallation, refer to sensorsdata/ext-processor-find-new-user;
The features shown in the two examples above have since been built into Sensors Analytics; the code is for reference only.
The rest of this document walks through some example scenarios.
1.2. Modify data content
1.2.1. Parse URL parameters
When some fields are not easy to parse on the client side, they can be parsed on the server side through the "Data Preprocessing Module".
For example, when the SDK sends a piece of data, the format passed to the "Data Preprocessing Module" is as follows:
{ "distinct_id": "2b0a6f51a3cd6775", "time": 1434556935000, "type": "track", "event": "ViewProduct", "project": "sample_project", "ip": "123.123.123.123", "properties": { ... "$referrer":"http://www.kbyte.cn/view?title=abc&act=click", ... } }
Suppose the title and act parameters in $referrer need to be extracted and stored as separate fields.
This can be done by implementing a "Data Preprocessing Module" that transforms the data into the following result and returns it:
{ "distinct_id": "2b0a6f51a3cd6775", "time": 1434556935000, "type": "track", "event": "ViewProduct", "project": "sample_project", "properties": { ... "$referrer":"http://www.kbyte.cn/view?title=abc&act=click", "source_title":"abc", "source_act":"click", ... } }
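The transformation above can be sketched as follows. This is an illustrative Python version using only the standard library, not the interface of the sample projects; the function name `add_referrer_fields` is hypothetical, while the field names `source_title` and `source_act` come from the example output.

```python
from urllib.parse import parse_qs, urlsplit


def add_referrer_fields(record: dict) -> dict:
    """Copy the title/act query parameters of $referrer into their own fields."""
    properties = record.get("properties") or {}  # tolerate a missing or null field
    referrer = properties.get("$referrer")
    if isinstance(referrer, str):
        try:
            query = parse_qs(urlsplit(referrer).query)
        except ValueError:
            query = {}  # malformed URL: leave the record as it is
        for param, target in (("title", "source_title"), ("act", "source_act")):
            values = query.get(param)
            if values:
                properties[target] = values[0]  # keep the first occurrence
    record["properties"] = properties
    return record
```

If a parameter is absent from the URL, the corresponding field is simply not added.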
1.2.2. JOIN external fields
When some of the required data is available only on the backend, the "Data Preprocessing Module" can be used to JOIN it in and complete the data.
For example, the SDK sends a product-view event that contains the field product_id (product ID); the data passed in is as follows:
{ "distinct_id": "2b0a6f51a3cd6775", "time": 1434556935000, "type": "track", "event": "ViewProduct", "project": "sample_project", "ip": "123.123.123.123", "properties": { ... "product_id":"ABCDE-12345", ... } }
Suppose the Chinese name of the product needs to be looked up on the server by product_id (product ID), and the mapping between Chinese and English names is stored in Redis.
In this case, a "Data Preprocessing Module" can be implemented that queries Redis during processing and fills the query result into the data:
{ "distinct_id": "2b0a6f51a3cd6775", "time": 1434556935000, "type": "track", "event": "ViewProduct", "project": "sample_project", "properties": { ... "product_id":"ABCDE-12345", "product_name":"唐诗三百首", ... } }
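A minimal sketch of this enrichment step is shown below, in Python for illustration only. To stay self-contained it takes the lookup as a function argument; in a real deployment that function would presumably wrap a Redis query, but here an in-memory dictionary stands in for Redis. The names `add_product_name`, `lookup`, and `PRODUCT_NAMES` are hypothetical.

```python
from typing import Callable, Optional


def add_product_name(record: dict,
                     lookup: Callable[[str], Optional[str]]) -> dict:
    """Fill in product_name from an external lookup keyed by product_id."""
    properties = record.get("properties") or {}
    product_id = properties.get("product_id")
    if isinstance(product_id, str):
        name = lookup(product_id)
        if name is not None:  # unknown IDs leave the record unchanged
            properties["product_name"] = name
    record["properties"] = properties
    return record


# Stand-in for the external store (assumption: Redis in the scenario above).
PRODUCT_NAMES = {"ABCDE-12345": "唐诗三百首"}

record = {"event": "ViewProduct", "properties": {"product_id": "ABCDE-12345"}}
enriched = add_product_name(record, PRODUCT_NAMES.get)
```

Injecting the lookup keeps the transformation easy to test and lets a failed or missing lookup degrade to an unenriched record rather than an exception.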
1.3. Discard data
When a piece of data meets the conditions for being discarded, the processing function can simply return null to discard it.
For example, you might discard all data sent from a specified IP address.
- For versions after 1.14, you can discard a piece of data by simply not calling the send function for it.
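The IP-based filter mentioned above might look like the following sketch, written in Python for illustration; Python's `None` plays the role of null here. The function name `drop_blocked_ip` and the contents of `BLOCKED_IPS` are hypothetical.

```python
from typing import Optional

# Hypothetical blocklist; the address is the one used in the sample data.
BLOCKED_IPS = frozenset({"123.123.123.123"})


def drop_blocked_ip(record: dict) -> Optional[dict]:
    """Return None (discard) for records sent from a blocked IP address."""
    ip = record.get("ip")
    if isinstance(ip, str) and ip in BLOCKED_IPS:
        return None
    return record  # everything else passes through unchanged
```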
1.4. Add data
The input of the "Data Preprocessing Module" is a single piece of data, and the output can be multiple pieces of data.
When returning multiple pieces of data, the return value must be a JSON array in which each element is a piece of data that conforms to the Sensors Analytics data format.
For example, suppose a user should be tagged as a VIP user whenever they trigger the ViewProduct event, and the input data is:
{ "distinct_id": "2b0a6f51a3cd6775", "time": 1434556935000, "type": "track", "event": "ViewProduct", "project": "sample_project", "ip": "123.123.123.123", "properties": { ... } }
Multiple pieces of data can then be returned:
[ { "distinct_id":"2b0a6f51a3cd6775", "time":1434556935000, "type":"track", "event":"ViewProduct", "project": "sample_project", "properties":{ ... } }, { "distinct_id":"2b0a6f51a3cd6775", "type":"profile_set", "time":1434556935000, "project": "sample_project", "properties":{ "is_vip":true } } ]
- For versions after 1.14, multiple pieces of data can be returned using the method above, or by calling the send function once for each piece of data.
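Deriving the extra profile_set record can be sketched as below. This is an illustrative Python version, not the interface of the sample projects; the function name `tag_vip_on_view` is hypothetical, and the field values mirror the example above.

```python
def tag_vip_on_view(record: dict) -> list:
    """Return the original event, plus a profile_set record for ViewProduct."""
    if record.get("type") != "track" or record.get("event") != "ViewProduct":
        return [record]  # other data passes through as a one-element array
    # Copy the identifying fields that are present on the original event.
    profile = {key: record[key]
               for key in ("distinct_id", "time", "project")
               if key in record}
    profile["type"] = "profile_set"
    profile["properties"] = {"is_vip": True}
    return [record, profile]
```

Serializing the returned list with a JSON encoder yields the array form described above.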
Note: The content of this document is a technical document that provides details on how to use the Sensors product and does not include sales terms; the specific content of enterprise procurement products and technical services shall be subject to the commercial procurement contract.