Regular expression
Sensors Analytics supports regular expressions in all of its functions, so users can filter properties flexibly.
A regular expression is a pattern for matching strings. It combines certain predefined characters into a "rule string" that expresses the filtering logic to apply to a string.
1. Example
The Sensors Analytics official website has a documentation channel, and the page addresses under this channel look like this:
- Introduction page: https://www.sensorsdata.cn/manual/
- Function introduction: https://www.sensorsdata.cn/manual/features.html
- Event analysis: https://www.sensorsdata.cn/manual/event_ana.html
As you can see, the page addresses under the documentation channel follow a pattern: each one starts with https://www.sensorsdata.cn/manual/.
Therefore, when we want to see the overall page views of the documentation channel, we can use the following regular expression for matching:
/manual/.*
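A minimal sketch of this match in Python follows. It assumes the pattern is searched for anywhere in the page address; how Sensors Analytics applies the pattern internally is not stated here. Python's `re` module stands in for the RE2 engine, and the `is_manual_page` helper and the last URL in the list are hypothetical names used only for illustration.

```python
import re

# Pattern from the example above: "/manual/" followed by any characters.
MANUAL_PATTERN = re.compile(r"/manual/.*")

def is_manual_page(url: str) -> bool:
    """Return True if the URL contains the documentation channel path."""
    # Assumption: the pattern may match anywhere in the address.
    return MANUAL_PATTERN.search(url) is not None

urls = [
    "https://www.sensorsdata.cn/manual/",
    "https://www.sensorsdata.cn/manual/features.html",
    "https://www.sensorsdata.cn/manual/event_ana.html",
    "https://www.sensorsdata.cn/product/",  # hypothetical non-documentation page
]

manual_hits = [u for u in urls if is_manual_page(u)]
print(len(manual_hits))  # 3
```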
2. Syntax of regular expressions
The . and * in the above expression are two of the special characters used in regular expressions. The other regular expression characters are defined as follows:
2.1. Wildcard
Character | Meaning | Example |
---|---|---|
. | Matches any single character | sens.ors matches sensoors and sens8ors |
* | Matches zero or more of the preceding item (by default, the preceding character) | sens*ors matches senors and senssors |
+ | Works like the asterisk, except that it must match at least one of the preceding item | sens+ors matches senssors but not senors |
? | Matches zero or one of the preceding item | labou?r matches labor and labour |
\| | Performs an OR match | a\|b matches a or b |
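The wildcard rows above can be checked with a short Python sketch. It uses Python's `re` module as a stand-in for RE2 (these constructs behave the same in both) and assumes whole-string matching; the `matches` helper is an illustrative name.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# "." matches any single character.
print(matches(r"sens.ors", "sens8ors"))   # True
# "*" allows zero or more of the preceding character.
print(matches(r"sens*ors", "senors"))     # True
print(matches(r"sens*ors", "senssors"))   # True
# "+" requires at least one of the preceding character.
print(matches(r"sens+ors", "senssors"))   # True
print(matches(r"sens+ors", "senors"))     # False
# "?" makes the preceding character optional.
print(matches(r"labou?r", "labor"))       # True
print(matches(r"labou?r", "labour"))      # True
# "|" matches either alternative.
print(matches(r"a|b", "b"))               # True
```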
2.2. Anchors
Character | Meaning | Example |
---|---|---|
^ | Requires the match to be at the beginning of the field | ^sensors matches sensors but not mysensors |
$ | Requires the match to be at the end of the field | sensors$ matches sensors but not sensorscan |
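A brief Python sketch of the two anchors, assuming the pattern is searched for within the field (Python's `re` is used as a stand-in for RE2; the `found` helper is an illustrative name):

```python
import re

def found(pattern: str, field: str) -> bool:
    """Return True if the pattern is found somewhere in the field."""
    return re.search(pattern, field) is not None

# "^" pins the match to the start of the field.
print(found(r"^sensors", "sensors"))      # True
print(found(r"^sensors", "mysensors"))    # False
# "$" pins the match to the end of the field.
print(found(r"sensors$", "sensors"))      # True
print(found(r"sensors$", "sensorscan"))   # False
# Without anchors, the same text is found anywhere in the field.
print(found(r"sensors", "mysensors"))     # True
```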
2.3. Group
Character | Meaning | Example |
---|---|---|
() | Groups characters into a single item that replaces the default single-character item | Thank(s\|you) matches Thanks and Thankyou |
[] | Creates a list of characters, any one of which may match | [abc] matches a, b, or c |
- | Used inside square brackets to express a range of characters | [A-Z] matches any uppercase English letter |
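The grouping rows can be sketched the same way in Python, again assuming whole-string matching with `re` as a stand-in for RE2 (the `full` helper is an illustrative name):

```python
import re

def full(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Parentheses group alternatives so "|" applies only inside the group.
print(full(r"Thank(s|you)", "Thanks"))    # True
print(full(r"Thank(s|you)", "Thankyou"))  # True
print(full(r"Thank(s|you)", "Thank"))     # False
# Square brackets match any one character from the list.
print(full(r"[abc]", "b"))                # True
print(full(r"[abc]", "d"))                # False
# A hyphen inside brackets expresses a range.
print(full(r"[A-Z]", "Q"))                # True
print(full(r"[A-Z]", "q"))                # False
```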
2.4. Escape character
Character | Meaning | Example |
---|---|---|
\ | Turns a regular expression character into an ordinary character | In sensorsdata\.cn, the . is no longer a wildcard and matches only a literal period |
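The effect of escaping can be seen by comparing the escaped and unescaped forms. This is a sketch in Python's `re`, used as a stand-in for RE2; the string `sensorsdataXcn` is a made-up test input.

```python
import re

escaped = re.compile(r"sensorsdata\.cn")    # "\." matches only a literal period
unescaped = re.compile(r"sensorsdata.cn")   # "." matches any single character

# Both forms accept the real domain.
print(escaped.fullmatch("sensorsdata.cn") is not None)     # True
print(unescaped.fullmatch("sensorsdata.cn") is not None)   # True
# Only the unescaped form accepts another character in place of the period.
print(escaped.fullmatch("sensorsdataXcn") is not None)     # False
print(unescaped.fullmatch("sensorsdataXcn") is not None)   # True
```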
For more regular expression syntax, please refer to:
Google/re2 Regular expression syntax
3. User ID regular expression
3.1. Function
Version 1.17 and later can restrict the distinct_id values of uploaded data by checking whether each distinct_id value matches a specified rule. This improves data accuracy and addresses, at the root, the problem of incorrect data being written to the database.
distinct_id is the required field that identifies a user. Its value can be a "Device ID" or a "Login ID". For details, see the User identity document.
Note: If the regular expression rule set for distinct_id is incorrect, the distinct_id values of the uploaded data will not match it, and the data cannot be written to the database. Use this function with caution, and confirm your use case with the on-duty Sensors support staff before enabling it.
3.2. Instructions for use
Sensors supports a variety of data import sources, including the client SDKs, the server SDKs, and the import tools provided by Sensors. At present, distinct_id receiving rules can be set only for data marked with a "Device ID" from the client SDKs (the Android SDK, iOS SDK, JS SDK, and the various mini program SDKs) and for data marked with a "Login ID" from all import sources.
Data uploaded by the server SDKs and the import tools provided by Sensors is generally reported with a "Login ID". If no "Device ID" or "Login ID" verification rule is set for a source, its data is stored directly. For example, when a server SDK uploads data with a "Device ID", no "Device ID" verification rule applies to the server SDK, so the data is stored as long as it passes the other data format checks.
3.3. Common "Device ID" and "Login ID" regular expression examples
The following table gives example regular expressions for the common "Device ID" styles of the Sensors client SDKs and for common "Login ID" styles. For other ID styles, write your own regular expressions by following the syntax rules introduced above.
Client type | ID type | ID rule | Example ID | When the ID is used | Regular expression for the ID | Combined regular expression for the platform's Device ID / Login ID |
---|---|---|---|---|---|---|
Android SDK | Android ID | Normally 16 characters (0~9, a~f). Some phone manufacturers may generate Android IDs with 15, 14, or 13 characters, so this example allows 1~16 characters | 774d56d682e549c | Default anonymous ID for the Android SDK | ^([0-9a-z]{1,16})$ | ^([0-9a-z]{1,16})$\|^([0-9a-z]{8})(([/\s-][0-9a-z]{4}){3})([/\s-][0-9a-z]{12})$ |
Android SDK | UUID | 8-4-4-4-12; each character is a hexadecimal digit (0~9, a~f), lowercase letters only | 550e8400-e29b-41d4-a716-446655440000 | Default anonymous ID for the Android SDK when the Android ID is not available or the Android SDK version is earlier than 1.10.5 | ^([0-9a-z]{8})(([/\s-][0-9a-z]{4}){3})([/\s-][0-9a-z]{12})$ | Same as above |
iOS SDK | IDFA | 8-4-4-4-12; each character is a hexadecimal digit (0~9, A~F), uppercase letters only | DA067C52-8D48-49CE-9500-5A01368B8859 | Default anonymous ID for the iOS SDK | ^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$ | ^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$ |
iOS SDK | IDFV | 8-4-4-4-12; each character is a hexadecimal digit (0~9, A~F), uppercase letters only | DA067C52-8D48-49CE-9500-5A01368B8859 | Used by the iOS SDK as the anonymous ID when IDFA is not available | ^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$ | Same as above |
iOS SDK | UUID | 8-4-4-4-12; each character is a hexadecimal digit (0~9, A~F), uppercase letters only | DA067C52-8D48-49CE-9500-5A01368B8859 | Assigned by the iOS SDK as the anonymous ID when neither IDFA nor IDFV is available | ^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$ | Same as above |
JS SDK | cookie id | n-n-n-n-n: 5 segments of hexadecimal digits (0~9, a~f); segment lengths are not fixed (expected to be 5~25 characters) | 16e39c2c8b999e-05ae1754c671f3-38607701-2073600-16e39c2c8ba85c | Default anonymous ID on the web | ^([0-9a-z]{5,})(([/\s-][0-9a-z]{5,}){4})$ | ^([0-9a-z]{5,})(([/\s-][0-9a-z]{5,}){4})$ |
Mini program (various mini program SDKs) | uuid | 13-n-n-n: a 13-digit timestamp followed by 3 variable-length segments of digits and lowercase letters | 1558509239724-9278730-00c1875d5f63f8-41373096 | Default anonymous ID for all mini programs | ^([0-9]{13})(([/\s-][0-9a-z]{1,}){3})$ | ^([0-9]{13})(([/\s-][0-9a-z]{1,}){3})$\|^o[0-9a-zA-Z_-]{27}$\|^o[0-9a-zA-Z_-]{28}$ |
WeChat mini program | openid | Starts with a lowercase o and has 28 characters in total, made up of digits, lowercase letters, uppercase letters, underscores, and hyphens | oB4nYjnoHhuWrPVi2pYLuPjnCaU0 | A WeChat mini program can be set to use the openid as the anonymous ID | ^o[0-9a-zA-Z_-]{27}$ | Same as above |
WeChat mini program | unionid | Starts with a lowercase o and has 29 characters in total, made up of digits, letters, underscores, and hyphens | oJeaRw70h8MKiI3IQuFPJlsZzvTEF | Users can obtain the unionid and use it as the anonymous ID or login ID for a WeChat mini program | ^o[0-9a-zA-Z_-]{28}$ | Same as above |
Login ID (illustrative styles only; follow the login ID rules your business actually uses) | Numeric login ID | Numeric, incremented from 0 (0~6666666) | 12345 | Customer-defined login ID rules | ^\d+$ (digits only, at least one digit); ^[0-9]*$ (digits only, also matches an empty string); ^\d{n}$ (exactly n digits; replace n with a specific value) | ^\d+$ |
Login ID | Alphanumeric login ID | Combination of digits and letters | u123f56 | Customer-defined login ID rules | ^[0-9a-zA-Z]{n,m}$ (a string of n to m digits and letters) | ^[0-9a-zA-Z]{n,m}$ |
Login ID | Purely alphabetic login ID | Letters only | qazwsx | Customer-defined login ID rules | ^[a-zA-Z]{n,m}$ (a string of n to m letters) | ^[a-zA-Z]{n,m}$ |
Login ID | Email used as login ID | Email address without Chinese characters | shence@sensorsdata.cn test123@ss.ss.ss test1-23@s-_s.ss.ss | Email format: (digits, uppercase and lowercase letters, underscore, hyphen, period)@(digits, uppercase and lowercase letters, underscore, hyphen, period), with unlimited length | ^([A-Za-z0-9_\-\.])+\@[a-zA-Z0-9_\-]+([a-zA-Z0-9_\-\.])+$ | ^([A-Za-z0-9_\-\.])+\@[a-zA-Z0-9_\-]+([a-zA-Z0-9_\-\.])+$ |
Explanation of the above regular expressions
Expression | Meaning |
---|---|
[0-9a-zA-Z_-]{27} | Exactly 27 characters, each of which is a digit 0-9, a lowercase letter a-z, an uppercase letter A-Z, an underscore _, or a hyphen - |
$ | The string must end with the pattern that precedes this character |
^ | The string must start with the pattern that follows this character |
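Before setting a rule, it can help to check the example IDs from the table of common ID styles against the combined expression for each platform. The following is a minimal sketch in Python, whose `re` module stands in for the RE2 engine. The `DEVICE_ID_RULES` and `SAMPLE_IDS` dictionaries and the `is_valid` helper are illustrative names, not part of any Sensors SDK.

```python
import re

# Combined Device ID expressions copied from the table of common ID styles.
DEVICE_ID_RULES = {
    "Android SDK": r"^([0-9a-z]{1,16})$|^([0-9a-z]{8})(([/\s-][0-9a-z]{4}){3})([/\s-][0-9a-z]{12})$",
    "iOS SDK": r"^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$",
    "JS SDK": r"^([0-9a-z]{5,})(([/\s-][0-9a-z]{5,}){4})$",
    "Mini program": r"^([0-9]{13})(([/\s-][0-9a-z]{1,}){3})$|^o[0-9a-zA-Z_-]{27}$|^o[0-9a-zA-Z_-]{28}$",
}

# Example IDs copied from the same table.
SAMPLE_IDS = {
    "Android SDK": ["774d56d682e549c", "550e8400-e29b-41d4-a716-446655440000"],
    "iOS SDK": ["DA067C52-8D48-49CE-9500-5A01368B8859"],
    "JS SDK": ["16e39c2c8b999e-05ae1754c671f3-38607701-2073600-16e39c2c8ba85c"],
    "Mini program": [
        "1558509239724-9278730-00c1875d5f63f8-41373096",
        "oB4nYjnoHhuWrPVi2pYLuPjnCaU0",
        "oJeaRw70h8MKiI3IQuFPJlsZzvTEF",
    ],
}

def is_valid(client: str, distinct_id: str) -> bool:
    """Return True if distinct_id satisfies the rule for the given client."""
    return re.search(DEVICE_ID_RULES[client], distinct_id) is not None

for client, ids in SAMPLE_IDS.items():
    for distinct_id in ids:
        print(client, distinct_id, is_valid(client, distinct_id))
```

Every sample ID should print `True`. An uppercase iOS-style ID checked against the Android rule, for example `is_valid("Android SDK", "DA067C52-8D48-49CE-9500-5A01368B8859")`, returns `False`, which is the kind of mismatch that would keep data out of the database.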
Note:
(1) If the "Device ID" value of a client SDK can take more than one form, be sure to combine all of the forms in the regular expression with the "or" relationship.
(2) For the "Login ID", the ID rules vary from product to product, so you need to define the format rule with your own regular expression. You can refer to this document when writing it, and contact Sensors staff to confirm that it is correct. If the current project contains multiple products whose "Login ID" format rules differ, be sure to combine them in the regular expression with the "or" relationship. If only one set of rules is written, the user data of the other products may not be imported correctly.
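One way to build such an "or" expression is to join the individual rules with `|`. The sketch below assumes a project with two products, one using numeric login IDs and one using email login IDs. The `combine_rules` helper is a hypothetical name, and Python's `re` stands in for RE2 (both support the `(?:...)` non-capturing group used here).

```python
import re

def combine_rules(*patterns: str) -> str:
    """Join several anchored patterns into a single "or" expression."""
    # Wrapping each pattern in (?:...) keeps its own ^ and $ attached to it.
    return "|".join(f"(?:{p})" for p in patterns)

# Login ID rules taken from the table of common ID styles.
NUMERIC_LOGIN = r"^\d+$"
EMAIL_LOGIN = r"^([A-Za-z0-9_\-\.])+\@[a-zA-Z0-9_\-]+([a-zA-Z0-9_\-\.])+$"

LOGIN_RULE = re.compile(combine_rules(NUMERIC_LOGIN, EMAIL_LOGIN))

print(LOGIN_RULE.search("12345") is not None)                  # True
print(LOGIN_RULE.search("shence@sensorsdata.cn") is not None)  # True
print(LOGIN_RULE.search("u123f56") is not None)                # False
```

The last ID is rejected because it fits neither rule, which shows why every product's format needs to be included in the combined expression.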
Note: The content of this document is a technical document that provides details on how to use the Sensors product and does not include sales terms; the specific content of enterprise procurement products and technical services shall be subject to the commercial procurement contract.