1. Generate data

1.1 Generate data files using the Sensors Analytics SDK

The SDK for each back-end language supports writing data to a file, for example the Java SDK's ConcurrentLoggingConsumer, the PHP SDK's FileConsumer, and the Python SDK's LoggingConsumer. The following uses the Java SDK as an example:

// Initialize SensorsAnalytics with ConcurrentLoggingConsumer
// Data is written to files such as service_log.2017-09-25 under /data/sa, one file per day
final SensorsAnalytics sa = new SensorsAnalytics(
        new SensorsAnalytics.ConcurrentLoggingConsumer("/data/sa/service_log"));

// Record user behavior data with Sensors Analytics
sa.track(distinctId, true, "UserLogin");
sa.track(distinctId, true, "ViewProduct");

// Stop all Sensors Analytics SDK services before the program exits
sa.shutdown();
JAVA

The above configuration generates data files in the /data/sa directory, one file per day. The file list looks like:

service_log.20170923
service_log.20170924
service_log.20170925
CODE

Corresponding configuration in LogAgent:

path=/data/sa
pattern=service_log.*
CODE

1.2 Output data files using other methods

If the programming language has a corresponding SDK, it is recommended to use the SDK to output data files directly. If you need to write data files yourself, note the following:

  1. (Very important) Files must only ever be appended to, i.e., opened in append mode;
  2. Each line of the file must contain exactly one record, formatted as full JSON consistent with the data format;
  3. The data file name must include a date, and may also include a more detailed time, e.g. service_log.20170925 or service_log.2017092517. The files in the data directory should look like the following:

    service_log.20170923
    service_log.20170924
    service_log.20170925
    CODE

    Corresponding configuration in LogAgent:

    path=/data/sa
    pattern=service_log.*
    CODE
  4. If multiple processes write to the same file, use file locks to avoid data corruption (see the sketch after this list).
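
For reference, the following is a minimal sketch in Java of these rules: one complete JSON record appended per line to a date-suffixed file, under a file lock. The path and record fields are illustrative placeholders, not the SDK's own implementation:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class AppendOnlyLogWriter {
    public static void main(String[] args) throws IOException {
        // Rule 3: the file name carries the date, e.g. service_log.20170925 (path is a placeholder)
        String suffix = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyyMMdd"));
        String path = "/data/sa/service_log." + suffix;

        // Rule 2: one complete JSON record per line (placeholder fields; see the data format docs)
        String record = "{\"type\":\"track\",\"event\":\"UserLogin\","
                + "\"distinct_id\":\"abc123\",\"time\":" + System.currentTimeMillis()
                + ",\"properties\":{}}\n";

        // Rule 1: open in append mode only -- the `true` argument appends
        try (FileOutputStream out = new FileOutputStream(path, true)) {
            FileChannel channel = out.getChannel();
            // Rule 4: hold an exclusive lock so concurrent processes do not interleave writes
            try (FileLock lock = channel.lock()) {
                out.write(record.getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
JAVA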

2. Configure the IP address of the data receiving service

2.1 Private deployment edition configuration

LogAgent sends data to the server. At a minimum, the configuration file must specify the servers that receive data via the two parameters host and port:

  • host: one or more server IP addresses separated by half-width semicolons (;), e.g. 192.168.50.10;192.168.50.11;192.168.50.12. When LogAgent starts, it selects one of these servers to send data to, balancing load across them.
  • port: the port number of the data receiving service. By default this is 8106 for the cluster version; for the stand-alone version it is 806 for version 1.7 and earlier and 8106 for version 1.8 and later. The cloud version does not use a port number.

If the deployment has both Internet and Intranet IP addresses, the value of host must be the Intranet IP address.
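
For example, a minimal sketch of these two parameters in the LogAgent configuration file (addresses and port are the example values above):

host=192.168.50.10;192.168.50.11;192.168.50.12
port=8106
CODE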

2.2 Cloud configuration

Click "Copy https data receiving address", for example, copy to Yeshttps://example.datasink.sensorsdata.cn/sa?project=production&token=c9239cb3139077ca

In the configuration file, delete or comment out the host, port, and token fields, and enable the service_uri field. Its value is the address obtained above with /sa replaced by /log_agent. For the preceding example, you would configure https://example.datasink.sensorsdata.cn/log_agent?token=c9239cb3139077ca.
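
Putting this together, the relevant part of the configuration file would look like the following sketch (using the example address above):

# host=...
# port=...
# token=...
service_uri=https://example.datasink.sensorsdata.cn/log_agent?token=c9239cb3139077ca
CODE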

  • For details about which project data is sent to, see Section 4.1 of this document.

2.3 Additional configuration notes when using the public network

LogAgent is designed and developed for real-time data import from back-end servers on the Intranet. If LogAgent is used over the public network, pay additional attention to the following:

  • Make sure that every request from a given LogAgent lands on the same machine. If load balancing is used, the load balancing algorithm must be ip_hash (source address). If a domain name resolves directly to multiple machines via DNS, the domain name cannot be used; instead, specify an IP address directly, or join the external IP addresses of multiple servers according to the rules above.
  • If LogAgent is used on the public network and HTTPS data access mode has been configured on the server, perform the following steps to configure HTTPS data sending:
    1. Comment out the host, port, and token fields in the configuration file, so that these three parameters are not used;
    2. Enable the service_uri field in the configuration file, e.g. https://example.sensorsdata.cn:4006/log_agent. Note that the port must be that of the HTTPS data access service, and the URI is /log_agent, not /sa.
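
A sketch of the resulting configuration for this case, assuming the example address above:

# host=...
# port=...
# token=...
service_uri=https://example.sensorsdata.cn:4006/log_agent
CODE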

3. Other configuration notes

3.1 Specify the name of the file being written in real time with real_time_file_name

The real_time_file_name field in the configuration file specifies the name of the file being written in real time; it is generally used with rolling log files. For example, the file list in the data directory on a given day might be:

service_log.20170922
service_log.20170923
service_log
CODE

The data output process only writes to service_log. When a new day begins, a scheduled task renames service_log to a file name with a date, e.g. service_log.20170923, and the data output process starts writing to a new service_log file. In this scenario, the configuration should be:

pattern=service_log.*
real_time_file_name=service_log
CODE

pattern matches all renamed data file names, while real_time_file_name specifies the name of the data file being written in real time.

3.2 Scenarios where real_time_file_name should not be configured

If the file being written already carries a date or hour in its name, e.g. service_log.yyyymmdd or service_log.yyyymmddhh, and a new file is automatically generated every day (or hour), comment out the real_time_file_name parameter in the configuration file and set only the pattern parameter. Otherwise, data may fail to be imported.

Taking the Java SDK's ConcurrentLoggingConsumer as an example: initialized via new SensorsAnalytics.ConcurrentLoggingConsumer("/data/sa/service_log"), it generates one data file per day in the /data/sa directory, each with a date suffix, with a file list such as:

service_log.20170923
service_log.20170924
service_log.20170925
CODE


Corresponding configuration in LogAgent:

path=/data/sa
pattern=service_log.*
# real_time_file_name=access_log
CODE

Note: all Sensors Analytics server-side SDKs that generate log files in ConcurrentLoggingConsumer mode now produce one data file per day with a date suffix, so the corresponding LogAgent configuration does not need the real_time_file_name parameter.

4. Other usage scenarios

4.1 Specify the data import project

There are two ways to specify the project into which data is imported:

  1. Add a project field to each piece of data (see Data format); this way, each piece of data can be imported into its own project;
  2. Set project in the configuration file; every piece of data, whether or not it carries a project field, is imported into the project given by this parameter.
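
For method 1, a single record carrying its own project field might look like the following sketch (all field values are placeholders; see the data format documentation for the full schema):

{"type":"track","event":"UserLogin","distinct_id":"abc123","time":1506310000000,"project":"production","properties":{}}
CODE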

The process for the system to determine the project of a piece of data is as follows:

  1. Take project from the configuration file;
  2. If not obtained in the previous step, take it from the data's project field;
  3. If not obtained in the previous step, take the project parameter in service_uri (if configured);
  4. If not obtained in the previous step, use the name of the default project, i.e. default.
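
For method 2, a sketch of fixing the project in the configuration file (the project name is a placeholder); as step 1 above shows, this setting takes precedence over any per-record project field:

project=production
CODE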

If you want to import data into multiple projects with a single LogAgent, do not set project in the configuration file; instead, add a project field to each piece of data to specify the project it belongs to.

4.2 Use file-list-tool to list file read status

LogAgent provides the file-list-tool utility:

$ bin/file-list-tool
usage: [file-checker] --context_file <arg> [-h] [--new_file_format <arg>] [--old_file_format <arg>] [--reading_file_format <arg>]
    --context_file <arg>          LogAgent context file path, e.g. logagent.pid.context
 -h,--help                        help
    --new_file_format <arg>       output format for each new file not yet read; when specified, each file name is formatted and printed, with {} standing for the file name
    --old_file_format <arg>       output format for each old (already read) file; when specified, each file name is formatted and printed, with {} standing for the file name
    --reading_file_format <arg>   output format for the file currently being read; when specified, the file name is formatted and printed, with {} standing for the file name
BASH

Using file-list-tool to analyze the context file (located in the same directory as the pid file) lists the read status of each file:

$ bin/file-list-tool --context_file logagent.pid.context
18/01/22 17:35:29 INFO logagent.SourceFileReader: SourceFileReader recover progress: [ fileKey: (dev=fc10,ino=118226959), offset: 11260636, fileName: test_data.2018012213, count: 161982, sourceCount: 161982 ]
18/01/22 17:35:29 INFO logagent.FileListTool: file before the progress point, path: /data/test/logagent/data/test_data.2018012212, size: 19191980, key: (dev=fc10,ino=118226958)
18/01/22 17:35:29 INFO logagent.FileListTool: file being read, path: /data/test/logagent/data/test_data.2018012213, size: 11260824, key: (dev=fc10,ino=118226959), tell: 11260636
18/01/22 17:35:29 INFO logagent.FileListTool: file after the progress point, path: /data/test/logagent/data/test_data.2018012214, size: 19191980, key: (dev=fc10,ino=118226960)
BASH

"Pre-schedule files" are those that have already been read and can be safely removed or deleted from the LogAgent read directory.

file-list-tool can also output the file list in a specified format. For example, the following generates a command file mv.sh that moves already-read files to a specified directory, and then executes it:

$ bin/file-list-tool --context_file logagent.pid.context --old_file_format 'mv {} /data/test/logagent/backup' > mv.sh
18/01/22 17:44:29 INFO logagent.SourceFileReader: SourceFileReader recover progress: [ fileKey: (dev=fc10,ino=118226959), offset: 11260636, fileName: test_data.2018012213, count: 161982, sourceCount: 161982 ]
18/01/22 17:44:29 INFO logagent.FileListTool: file before the progress point, path: /data/test/logagent/data/test_data.2018012212, size: 19191980, key: (dev=fc10,ino=118226958)
18/01/22 17:44:29 INFO logagent.FileListTool: file being read, path: /data/test/logagent/data/test_data.2018012213, size: 11260824, key: (dev=fc10,ino=118226959), tell: 11260636
18/01/22 17:44:29 INFO logagent.FileListTool: file after the progress point, path: /data/test/logagent/data/test_data.2018012214, size: 19191980, key: (dev=fc10,ino=118226960)
$ cat mv.sh
mv /data/test/logagent/data/test_data.2018012212 /data/test/logagent/backup
$ bash mv.sh
BASH

4.3 Use the --filename and --offset arguments to specify the start file and offset

If the LogAgent version is later than 20191021, you can manually specify the file and offset from which to start reading:

$ bin/logagent --filename service.log.20191021 --offset 100
BASH

Note:

1. After startup, all log files before the specified start position are ignored.
2. The configuration takes effect only after the progress is obtained from the server.
3. --filename only sets the file name, without the path; the file must be located under the path specified in the configuration file, and its name must match the form prescribed by pattern.
4. --offset must be used in combination with --filename; its default value is 0.
5. If the position specified by --offset splits a complete event record, that record will be discarded.