Sensors Analytics supports regular expressions in all of its functions, so you can filter properties flexibly.

A regular expression is a logical formula for manipulating strings. It uses certain predefined characters and their combinations to form a "rule string" that expresses a filtering logic for strings.

1. Example

The Sensors Analytics official website has a documentation channel. The page addresses under this channel look like this:

  • Introduction page: https://www.sensorsdata.cn/manual/
  • Function introduction: https://www.sensorsdata.cn/manual/features.html
  • Event analysis: https://www.sensorsdata.cn/manual/event_ana.html

As you can see, the page addresses under the documentation channel follow a pattern: they all start with https://www.sensorsdata.cn/manual/ .

Therefore, when we want to see the overall page views of the documentation channel, we can use the following regular expression for matching:

/manual/.*
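
As a quick check of the expression above, here is a minimal Python sketch. It assumes Python's built-in re module as a stand-in for the matching engine inside Sensors Analytics, which may differ in details. The helper name is_doc_page is made up for illustration.

```python
import re

# The expression from the text. It is not anchored, so it may match
# anywhere inside the URL.
DOC_PATTERN = re.compile(r"/manual/.*")

def is_doc_page(url: str) -> bool:
    """Return True if the URL contains "/manual/" followed by anything."""
    return DOC_PATTERN.search(url) is not None

urls = [
    "https://www.sensorsdata.cn/manual/",
    "https://www.sensorsdata.cn/manual/features.html",
    "https://www.sensorsdata.cn/manual/event_ana.html",
]
doc_pages = [u for u in urls if is_doc_page(u)]
```

All three documentation URLs listed earlier pass the check, while a URL without the /manual/ segment does not.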

2. Syntax of regular expressions

The . and * in the above expression are two of the special characters used in regular expressions. The definitions of other regular expression characters are as follows:

2.1. Wildcard

Character  Meaning                                                                       Example
.          Matches any single character                                                  sens.ors matches sensoors and sens8ors
*          Matches 0 or more of the preceding item (by default, the previous character)  sens*ors matches senors and senssors
+          Like *, but must match at least 1 of the preceding item                       sens+ors matches senssors but not senors
?          Matches 0 or 1 of the preceding item                                          labou?r matches labor and labour
|          Performs an OR match                                                          a|b matches a or b
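
The wildcard examples can be verified with a short Python sketch. It assumes Python's re module behaves like the Sensors engine for these basic characters, and the helper full_match is a name introduced here only for illustration.

```python
import re

def full_match(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result), taken from the examples in this section.
checks = [
    (r"sens.ors", "sensoors", True),
    (r"sens.ors", "sens8ors", True),
    (r"sens*ors", "senors", True),
    (r"sens*ors", "senssors", True),
    (r"sens+ors", "senssors", True),
    (r"sens+ors", "senors", False),
    (r"labou?r", "labor", True),
    (r"labou?r", "labour", True),
    (r"a|b", "a", True),
    (r"a|b", "b", True),
    (r"a|b", "c", False),
]

# Collect every case where the engine disagrees with the expected result.
mismatches = [(p, t) for p, t, expected in checks if full_match(p, t) != expected]
```

An empty mismatches list means every example behaves as described.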

2.2. Anchors

Character  Meaning                                                 Example
^          Requires the match to be at the beginning of the field  ^sensors matches sensors but not mysensors
$          Requires the match to be at the end of the field        sensors$ matches sensors but not sensorscan
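
A minimal sketch of how anchoring changes the result, again assuming Python's re module as a stand-in. The helper found is hypothetical; it searches anywhere in the text, so only the ^ and $ characters pin the match to the start or end.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

starts_ok = found(r"^sensors", "sensors")       # match at the beginning
starts_bad = found(r"^sensors", "mysensors")    # "my" comes first, so no match
ends_ok = found(r"sensors$", "sensors")         # match at the end
ends_bad = found(r"sensors$", "sensorscan")     # "can" follows, so no match
unanchored = found(r"sensors", "mysensors")     # without ^, a match inside the text is enough
```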

2.3. Group

Character  Meaning                                                             Example
()         Uses parentheses to create an item that replaces the default item   Thank(s|you) matches Thanks and Thankyou
[]         Uses square brackets to create a list of items to match             [abc] matches a, b, or c
-          Uses a hyphen inside square brackets to extend the list to a range  [A-Z] matches any uppercase English letter
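
The grouping examples can be tried out as follows. This is a sketch that assumes Python's re module; the word lists are test inputs chosen here, not values from the product.

```python
import re

def full_match(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Parentheses make "s|you" a single item that follows "Thank".
thanks = [w for w in ["Thanks", "Thankyou", "Thank", "Thanksyou"]
          if full_match(r"Thank(s|you)", w)]

# Square brackets match exactly one character from the list.
letters = [c for c in "abcd" if full_match(r"[abc]", c)]

# A hyphen inside square brackets expands the list to a range.
upper = [c for c in "aBcD" if full_match(r"[A-Z]", c)]
```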

2.4. Escaping

Character  Meaning                                                                Example
\          Turns a special regular expression character into a literal character  In sensorsdata\.cn, the . is no longer a wildcard
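
The effect of escaping is easiest to see by comparing the escaped and unescaped forms. The sketch below assumes Python's re module; the input string www.sensorsdataXcn is an invented counterexample.

```python
import re

escaped = re.compile(r"sensorsdata\.cn")    # . is a literal dot
unescaped = re.compile(r"sensorsdata.cn")   # . is a wildcard

literal_dot = escaped.search("www.sensorsdata.cn") is not None
rejects_other = escaped.search("www.sensorsdataXcn") is None
wildcard_dot = unescaped.search("www.sensorsdataXcn") is not None

# Python can add the backslashes for you when the text should be literal.
auto_escaped = re.escape("sensorsdata.cn")
```

Without the backslash, the expression also accepts any other character in place of the dot, which is rarely what a domain filter intends.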


For more regular expression syntax, please refer to:

Google/re2 Regular expression syntax


3. User ID regular expressions

3.1. Function

Version 1.17 and later can restrict the distinct_id values of uploaded data by checking whether each distinct_id matches a specified rule. This improves data accuracy and stops incorrectly reported data from entering the database at the source.

distinct_id is the required field that identifies a user. Its value can be a "Device ID" or a "Login ID". For details, see the User identity document.


Note: If the regular expression rule set for distinct_id is incorrect, the actual distinct_id values of uploaded data will not match it, and the data will not be stored in the database. Use this function with caution, and confirm your usage scenario with the on-duty Sensors staff before enabling it.

3.2. Instructions for use

Sensors supports several data import sources: the client SDKs, the server SDKs, and the import tools provided by Sensors. At present, a distinct_id receiving rule can be set for data identified by "Device ID" only when it comes from a client SDK (Android SDK, iOS SDK, JS SDK, and the various mini program SDKs), and for data identified by "Login ID" from all import sources.

Data uploaded by the server SDKs and the Sensors import tools is normally identified by "Login ID". If no "Device ID" or "Login ID" validation rule is set for a source, its data is stored directly. For example, when a server SDK uploads data identified by a "Device ID", no "Device ID" validation rule applies to the server SDK, so the data is stored as long as it passes the other data format checks.
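
The decision described above can be sketched in a few lines of Python. This is purely illustrative and not how Sensors implements the check: the RULES table, the source and ID kind names, and the function passes_rule are all hypothetical, and the two expressions are borrowed from the examples later in this document.

```python
import re
from typing import Optional

# Hypothetical rule table: (source, ID kind) -> regular expression.
# A missing key means that no rule is set for that combination.
RULES = {
    ("client_sdk", "device_id"): r"^([0-9a-z]{1,16})$",
    ("client_sdk", "login_id"): r"^\d+$",
    ("server_sdk", "login_id"): r"^\d+$",
}

def passes_rule(source: str, id_kind: str, distinct_id: str) -> bool:
    """Return True if the record passes the distinct_id check."""
    rule: Optional[str] = RULES.get((source, id_kind))
    if rule is None:
        # No rule for this source and ID kind, so there is nothing to check.
        return True
    return re.search(rule, distinct_id) is not None
```

With this table, a server SDK record identified by a "Device ID" always passes, because no such rule exists for that source, while a client SDK record with a malformed "Device ID" is rejected.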


3.3. Common "Device ID" and "Login ID" regular expression examples

The examples below give regular expressions for the "Device ID" formats commonly used by the Sensors client SDKs and for some common "Login ID" formats. For other ID formats, write your own expressions using the regular expression syntax introduced above.

Each example lists the client type, the ID type, the ID rule, a sample ID, when the ID value is used, the regular expression for that ID, and the regular expression that covers all Device IDs / Login IDs of that platform's SDK.


Android SDK

  • Android ID
      Rule: Normally 16 characters (0~9, a~f). Some phone manufacturers may generate Android IDs with 15, 14, or 13 characters, so this example allows 1~16 characters.
      Example: 774d56d682e549c
      When used: Default anonymous ID for the Android SDK
      Regular expression: ^([0-9a-z]{1,16})$

  • UUID
      Rule: 8-4-4-4-12; each character is a hexadecimal digit (0~9, a~f), lowercase letters only
      Example: 550e8400-e29b-41d4-a716-446655440000
      When used: Default anonymous ID for the Android SDK when the Android ID is not available or the Android SDK version is earlier than 1.10.5
      Regular expression: ^([0-9a-z]{8})(([/\s-][0-9a-z]{4}){3})([/\s-][0-9a-z]{12})$

  Regular expression for all Android SDK Device IDs:
  ^([0-9a-z]{1,16})$|^([0-9a-z]{8})(([/\s-][0-9a-z]{4}){3})([/\s-][0-9a-z]{12})$
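
To see how the two Android formats combine, here is a minimal sketch that joins the Android ID and UUID expressions with | and tests the sample values. It assumes Python's re module as a stand-in for the Sensors engine; the function name is_android_device_id is made up for illustration.

```python
import re

ANDROID_ID = r"^([0-9a-z]{1,16})$"
UUID = r"^([0-9a-z]{8})(([/\s-][0-9a-z]{4}){3})([/\s-][0-9a-z]{12})$"

# Either format is acceptable, so the two rules are joined with "or".
ANDROID_DEVICE_ID = re.compile(ANDROID_ID + "|" + UUID)

def is_android_device_id(value: str) -> bool:
    """Return True if the value looks like an Android ID or a lowercase UUID."""
    return ANDROID_DEVICE_ID.search(value) is not None
```

Both sample IDs are accepted. An uppercase UUID or a 17-character string is rejected, because neither alternative allows it.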



iOS SDK

  • IDFA
      Rule: 8-4-4-4-12; each character is a hexadecimal digit (0~9, A~F), uppercase letters only
      Example: DA067C52-8D48-49CE-9500-5A01368B8859
      When used: Default anonymous ID for the iOS SDK

  • IDFV
      Rule: 8-4-4-4-12; each character is a hexadecimal digit (0~9, A~F), uppercase letters only
      Example: DA067C52-8D48-49CE-9500-5A01368B8859
      When used: Anonymous ID for the iOS SDK when IDFA is not available

  • UUID
      Rule: 8-4-4-4-12; each character is a hexadecimal digit (0~9, A~F), uppercase letters only
      Example: DA067C52-8D48-49CE-9500-5A01368B8859
      When used: Anonymous ID assigned by the iOS SDK when neither IDFA nor IDFV is available

  Regular expression for these IDs:
  ^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$

  Regular expression for all iOS SDK Device IDs:
  ^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$
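
The iOS expression differs from the Android UUID expression only in letter case, and its [/\s-] class accepts a slash or whitespace as well as a hyphen between segments. The sketch below illustrates both points, assuming Python's re module; is_ios_device_id is a hypothetical helper.

```python
import re

IOS_DEVICE_ID = re.compile(
    r"^([0-9A-Z]{8})(([/\s-][0-9A-Z]{4}){3})([/\s-][0-9A-Z]{12})$"
)

def is_ios_device_id(value: str) -> bool:
    """Return True if the value has the uppercase 8-4-4-4-12 layout."""
    return IOS_DEVICE_ID.search(value) is not None

sample = "DA067C52-8D48-49CE-9500-5A01368B8859"
lowercase_sample = sample.lower()          # same ID in lowercase
slash_sample = sample.replace("-", "/")    # same ID with "/" separators
```

The lowercase variant fails this rule, so a project that receives both Android and iOS data needs both expressions.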


JS SDK

  • cookie id
      Rule: n-n-n-n-n; 5 segments in total, each a hexadecimal number (0~9, a~f); the segment length n is not fixed (expected to be 5~25)
      Example: 16e39c2c8b999e-05ae1754c671f3-38607701-2073600-16e39c2c8ba85c
      When used: Default anonymous ID on the web
      Regular expression: ^([0-9a-z]{5,})(([/\s-][0-9a-z]{5,}){4})$

  Regular expression for all JS SDK Device IDs:
  ^([0-9a-z]{5,})(([/\s-][0-9a-z]{5,}){4})$
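
A short check of the cookie id expression, assuming Python's re module. The helper is_cookie_id and the two rejected inputs are invented here for illustration.

```python
import re

COOKIE_ID = re.compile(r"^([0-9a-z]{5,})(([/\s-][0-9a-z]{5,}){4})$")

def is_cookie_id(value: str) -> bool:
    """Return True if the value has 5 segments of at least 5 characters each."""
    return COOKIE_ID.search(value) is not None

sample = "16e39c2c8b999e-05ae1754c671f3-38607701-2073600-16e39c2c8ba85c"
four_segments = "16e39c2c8b999e-05ae1754c671f3-38607701-2073600"
short_segment = "16e39-05ae1-38607-20736-abcd"  # last segment has only 4 characters
```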


Mini program (various Mini Program SDKs)

  • uuid
      Rule: 13-n-n-n; the first segment is a 13-digit timestamp, followed by 3 variable-length segments made up of digits and lowercase letters
      Example: 1558509239724-9278730-00c1875d5f63f8-41373096
      When used: Default anonymous ID for all mini programs
      Regular expression: ^([0-9]{13})(([/\s-][0-9a-z]{1,}){3})$

WeChat mini program

  • openid
      Rule: Starts with a lowercase 'o' and has 28 characters in total, made up of digits, lowercase letters, uppercase letters, underscores, and hyphens
      Examples:
        oB4nYjnoHhuWrPVi2pYLuPjnCaU0
        oB4nYjhJHQVaD0PL7qs0W1kL-_ls
        oB4nYjvY13SVtaWC-AFztM2f3TlU
      When used: A WeChat mini program can be set to use openid as the anonymous ID
      Regular expression: ^o[0-9a-zA-Z_-]{27}$

  • unionid
      Rule: Starts with a lowercase 'o' and has 29 characters in total, made up of digits, letters, underscores, and hyphens
      Example: oJeaRw70h8MKiI3IQuFPJlsZzvTEF
      When used: Users can obtain unionid and use it as the anonymous ID or login ID of a WeChat mini program
      Regular expression: ^o[0-9a-zA-Z_-]{28}$

  Regular expression for all mini program Device IDs:
  ^([0-9]{13})(([/\s-][0-9a-z]{1,}){3})$|^o[0-9a-zA-Z_-]{27}$|^o[0-9a-zA-Z_-]{28}$
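
The three mini program formats can be combined and checked against the sample IDs as in the sketch below. It assumes Python's re module as a stand-in; the variable names are chosen here for illustration.

```python
import re

MINI_UUID = r"^([0-9]{13})(([/\s-][0-9a-z]{1,}){3})$"
OPENID = r"^o[0-9a-zA-Z_-]{27}$"    # 'o' plus 27 characters = 28 in total
UNIONID = r"^o[0-9a-zA-Z_-]{28}$"   # 'o' plus 28 characters = 29 in total

# Any of the three formats is acceptable, so they are joined with "or".
MINI_PROGRAM_ID = re.compile("|".join([MINI_UUID, OPENID, UNIONID]))

samples = [
    "1558509239724-9278730-00c1875d5f63f8-41373096",
    "oB4nYjnoHhuWrPVi2pYLuPjnCaU0",
    "oB4nYjhJHQVaD0PL7qs0W1kL-_ls",
    "oB4nYjvY13SVtaWC-AFztM2f3TlU",
    "oJeaRw70h8MKiI3IQuFPJlsZzvTEF",
]
accepted = [s for s in samples if MINI_PROGRAM_ID.search(s)]
```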



Login ID (the styles below are examples only; follow the login ID rules that your business actually uses)

  • Numeric login ID
      Rule: Numeric, incremented from 0 (0~6666666)
      Example: 12345
      When used: Customer-defined login ID rules
      Regular expression (choose one):
        ^\d+$      (validates a purely numeric string)
        ^[0-9]*$   (validates a purely numeric string; also accepts an empty string)
        ^\d{n}$    (validates an n-digit number; replace n with a specific value)
      Regular expression for the Login ID as a whole: ^\d+$

  • Alphanumeric login ID
      Rule: Combination of digits and letters
      Example: u123f56
      When used: Customer-defined login ID rules
      Regular expression: ^[0-9a-zA-Z]{n,m}$   (a string of n to m digits and letters)
      Regular expression for the Login ID as a whole: ^[0-9a-zA-Z]{n,m}$

  • Purely alphabetic login ID
      Rule: Letters only
      Example: qazwsx
      When used: Customer-defined login ID rules
      Regular expression: ^[a-zA-Z]{n,m}$   (a string of n to m letters)
      Regular expression for the Login ID as a whole: ^[a-zA-Z]{n,m}$

  • Email used as login ID
      Rule: Email address without Chinese characters
      Examples:
        shence@sensorsdata.cn
        test123@ss.ss.ss
        test1-23@s-_s.ss.ss
      When used: The email format is (digits, uppercase and lowercase letters, underscore, hyphen, period)@(digits, uppercase and lowercase letters, underscore, hyphen, period), with unlimited length
      Regular expression: ^([A-Za-z0-9_\-\.])+\@[a-zA-Z0-9_\-]+([a-zA-Z0-9_\-\.])+$
      Regular expression for the Login ID as a whole: ^([A-Za-z0-9_\-\.])+\@[a-zA-Z0-9_\-]+([a-zA-Z0-9_\-\.])+$
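
The Login ID expressions can be tried against the sample values with a sketch like the following. It assumes Python's re module as a stand-in for the Sensors engine. The bounds 4 and 12 are placeholder values substituted for n and m, and the helper ok is hypothetical.

```python
import re

NUMERIC = r"^\d+$"
NUMERIC_OR_EMPTY = r"^[0-9]*$"
ALNUM = r"^[0-9a-zA-Z]{4,12}$"     # n=4, m=12 are placeholders, not product values
LETTERS = r"^[a-zA-Z]{4,12}$"      # n=4, m=12 are placeholders, not product values
EMAIL = r"^([A-Za-z0-9_\-\.])+\@[a-zA-Z0-9_\-]+([a-zA-Z0-9_\-\.])+$"

def ok(pattern: str, value: str) -> bool:
    """Return True if the value satisfies the pattern."""
    return re.search(pattern, value) is not None

emails = ["shence@sensorsdata.cn", "test123@ss.ss.ss", "test1-23@s-_s.ss.ss"]
valid_emails = [e for e in emails if ok(EMAIL, e)]

# * allows zero digits, so the empty string passes; + requires at least one.
empty_passes_star = ok(NUMERIC_OR_EMPTY, "")
empty_passes_plus = ok(NUMERIC, "")
```

The last two lines show why ^\d+$ is usually the safer choice for a numeric Login ID: ^[0-9]*$ also lets an empty value through.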


Explanation of the above regular expressions

Expression         Meaning
[0-9a-zA-Z_-]{27}  27 characters, each of which is a digit 0-9, a lowercase letter a-z, an uppercase letter A-Z, an underscore _, or a hyphen -
$                  Indicates that the value must end with the preceding pattern
^                  Indicates that the value must start with the following pattern


Note:

(1) If the "Device ID" of a client SDK can take more than one format, be sure to join all of the formats in the regular expression with an "or" (|) relationship.

(2) Because "Login ID" rules differ from product to product, you need to define the format rule with a regular expression yourself. You can use this document as a reference when writing it, and contact Sensors staff to confirm that the expression is correct. If the current project contains several products whose "Login ID" format rules differ, be sure to join them in the regular expression with an "or" (|) relationship. If the expression describes only one of the rules, the user data of the other products may not be imported correctly.
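
One way to build such an "or" expression is sketched below, assuming Python's re module. The helper combine is hypothetical, as are the two products: product A with numeric Login IDs and product B with 6-letter Login IDs. Each rule is wrapped in a non-capturing group so that its own ^ and $ anchors stay attached to it.

```python
import re

def combine(*rules: str) -> str:
    """Join several complete rules into one expression with "or"."""
    return "|".join(f"(?:{rule})" for rule in rules)

PRODUCT_A = r"^\d+$"            # hypothetical product with numeric Login IDs
PRODUCT_B = r"^[a-zA-Z]{6}$"    # hypothetical product with 6-letter Login IDs

LOGIN_ID = re.compile(combine(PRODUCT_A, PRODUCT_B))

def is_login_id(value: str) -> bool:
    """Return True if the value satisfies either product's rule."""
    return LOGIN_ID.search(value) is not None

# With only product A's rule configured, product B's IDs would be rejected.
b_rejected_by_a_only = re.search(PRODUCT_A, "qazwsx") is None
```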