Pinterest Secor
Secor is a service persisting Kafka logs to Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage and OpenStack Swift.
Key features
- strong consistency: as long as Kafka is not dropping messages (e.g., due to an aggressive cleanup policy) before Secor is able to read them, it is guaranteed that each message will be saved in exactly one S3 file. This property is not compromised by the notorious temporal inconsistency of S3 caused by its eventual consistency model,
- fault tolerance: any component of Secor is allowed to crash at any given point without compromising data integrity,
- load distribution: Secor may be distributed across multiple machines,
- horizontal scalability: scaling the system out to handle more load is as easy as starting extra Secor processes. Reducing the resource footprint can be achieved by killing any of the running Secor processes. Neither ramping up nor down has any impact on data consistency,
- output partitioning: Secor parses incoming messages and puts them under partitioned S3 paths to enable direct import into systems like Hive. Day-, hour- and minute-level partitions are supported,
- configurable upload policies: commit points controlling when data is persisted in S3 are configured through size-based and time-based policies (e.g., upload data when the local buffer reaches a size of 100MB and at least once per hour),
- monitoring: metrics tracking various performance properties are exposed through Ostrich and optionally exported to OpenTSDB / statsD,
- customizability: an external log message parser may be loaded by updating the configuration,
- event transformation: external message-level transformation can be done by using a customized class,
- Qubole interface: Secor connects to Qubole to add finalized output partitions to Hive tables.
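The size-based and time-based upload policy described in the list above can be pictured with a minimal Python sketch. This is only an illustration of the decision rule, not Secor's implementation: the `UploadPolicy` class, the `should_upload` function and their parameter names are hypothetical, and 100MB is taken here as 100 * 1024 * 1024 bytes.

```python
from dataclasses import dataclass


@dataclass
class UploadPolicy:
    """Hypothetical size- and time-based upload thresholds (illustration only)."""
    max_buffer_bytes: int   # upload once the local buffer reaches this size
    max_age_seconds: float  # upload at least this often


def should_upload(policy: UploadPolicy, buffer_bytes: int, age_seconds: float) -> bool:
    """Return True when either the size or the time threshold has been reached."""
    if buffer_bytes <= 0:
        return False  # nothing buffered, so there is nothing to upload
    return (buffer_bytes >= policy.max_buffer_bytes
            or age_seconds >= policy.max_age_seconds)


# Thresholds from the example in the text: 100MB, or at least once per hour.
policy = UploadPolicy(max_buffer_bytes=100 * 1024 * 1024, max_age_seconds=3600)

print(should_upload(policy, 150 * 1024 * 1024, 60))   # size threshold reached
print(should_upload(policy, 10 * 1024 * 1024, 4000))  # time threshold reached
print(should_upload(policy, 10 * 1024 * 1024, 60))    # neither threshold reached
```

Either condition alone triggers an upload, which is why a quiet topic is still flushed once per hour while a busy one is flushed as soon as its buffer fills.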
Setup Guide
Get Secor code
git clone [git-repo-url] secor
cd secor
Customize configuration parameters
Edit the src/main/config/*.properties files to specify parameters describing the environment. Those files contain comments describing the meaning of individual parameters.
Kafka Headers support
Kafka headers are only supported with Kafka 2.0.0 or higher. Compile Secor with -Pkafka-2.0.0 to use the Kafka 2.0.0 libraries.
Create and install jars
# By default this will install the "kafka-0.10.2.0" profile
mvn package
mkdir ${SECOR_INSTALL_DIR} # directory to place Secor binaries in.
tar -zxvf target/secor-0.1-SNAPSHOT-bin.tar.gz -C ${SECOR_INSTALL_DIR}

# To use the Kafka 1.0.0 kafka libraries with scala 2.11
mvn -Pkafka-1.0.0
Run tests (optional)
cd ${SECOR_INSTALL_DIR}
./scripts/run_tests.sh
# OR:
MVN_PROFILE=<profile> ./scripts/run_tests.sh
Run Secor
cd ${SECOR_INSTALL_DIR}
java -ea -Dsecor_group=secor_backup \
  -Dlog4j.configuration=log4j.prod.properties \
  -Dconfig=secor.prod.backup.properties \
  -cp secor-0.1-SNAPSHOT.jar:lib/* \
  com.pinterest.secor.main.ConsumerMain
Please note that Secor requires JRE8 at runtime, as the source code uses JRE8 language features. JRE9 and JRE10 are untested.
Output grouping
One of the convenience features of Secor is the ability to group messages and save them under common file prefixes. The partitioning is controlled by a message parser. Secor comes with the following parsers:
- offset parser: parser that groups messages based on offset ranges. E.g., messages with offsets in range 0 to 999 will end up under s3n://bucket/topic/offset=0/, offsets 1000 to 1999 will go to s3n://bucket/topic/offset=1000/. To use this parser, start Secor with properties file secor.prod.backup.properties.
- Thrift date parser: parser that extracts timestamps from thrift messages and groups the output based on the date (at a day granularity). To keep things simple, this parser assumes that the timestamp is carried in the first field (id 1) of the thrift message schema by default. The field id can be changed by setting message.timestamp.id as long as the field is at the top level of the thrift object (i.e. it is not in a nested structure). The timestamp may be expressed in seconds, milliseconds, or nanoseconds since the epoch. The output goes to date-partitioned paths (e.g., s3n://bucket/topic/dt=2014-05-01, s3n://bucket/topic/dt=2014-05-02). Date partitioning is particularly convenient if the output is to be consumed by ETL tools such as Hive. To use this parser, start Secor with properties file secor.prod.partition.properties. Note that the message.timestamp.name property has no effect on the thrift parsing, which is determined by the field id.
- JSON timestamp parser: parser that extracts UNIX timestamps from JSON messages and groups the output based on the date, similar to the Thrift parser above. To use this parser, start Secor with properties file secor.prod.partition.properties and set secor.message.parser.class=com.pinterest.secor.parser.JsonMessageParser. You may override the field used to extract the timestamp by setting the "message.timestamp.name" property.
- Avro timestamp parser: parser that extracts UNIX timestamps from Avro messages and groups the output based on the date, similar to the Thrift parser above. To use this parser, start Secor with properties file secor.prod.partition.properties and set secor.message.parser.class=com.pinterest.secor.parser.AvroMessageParser. You may override the field used to extract the timestamp by setting the "message.timestamp.name" property.
- JSON ISO 8601 date parser: assumes your timestamp field uses ISO 8601. To use this parser, start Secor with properties file secor.prod.partition.properties and set secor.message.parser.class=com.pinterest.secor.parser.Iso8601MessageParser. You may override the field used to extract the timestamp by setting the "message.timestamp.name" property.
- MessagePack date parser: parser that extracts timestamps from MessagePack messages and groups the output based on the date, similar to the Thrift and JSON parsers. To use this parser, set secor.message.parser.class=com.pinterest.secor.parser.MessagePackParser. As with the Thrift parser, the timestamp may be expressed in seconds, milliseconds, or nanoseconds since the epoch. The parser respects the "message.timestamp.name" property.
- Protocol Buffers date parser: parser that extracts timestamps from protobuf messages and groups the output based on the date, similar to the Thrift, JSON and MessagePack parsers. To use this parser, set secor.message.parser.class=com.pinterest.secor.parser.ProtobufMessageParser. As with the Thrift parser, the timestamp may be expressed in seconds, milliseconds, or nanoseconds since the epoch. The parser respects the "message.timestamp.name" property.
- Output grouping with flexible partitions: the default date, hour and minute partitions carry prefixes for convenient consumption by Hive. If you require different partition naming, with or without a prefix, or a different date, hour or minute format, update the following properties in secor.common.properties:
  partitioner.granularity.date.prefix=dt=
  partitioner.granularity.hour.prefix=hr=
  partitioner.granularity.minute.prefix=min=
  partitioner.granularity.date.format=yyyy-MM-dd
  partitioner.granularity.hour.format=HH
  partitioner.granularity.minute.format=mm
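As a rough illustration of how the parsers above turn a message into a path component, the following Python sketch computes an offset partition and a date partition. It is a sketch under stated assumptions, not Secor's code: the function names are hypothetical, the range size of 1000 is taken from the offset example, the unit of a timestamp is guessed from its magnitude, dates are computed in UTC, and Python `strftime` patterns stand in for the Java-style patterns (`yyyy-MM-dd`) used in the properties.

```python
from datetime import datetime, timezone


def offset_partition(offset: int, range_size: int = 1000) -> str:
    """Map an offset to the start of its range, e.g. 1234 -> 'offset=1000'."""
    return f"offset={offset // range_size * range_size}"


def to_epoch_seconds(timestamp: int) -> int:
    """Guess the unit (s, ms or ns) from the magnitude and convert to seconds.

    Assumption: the thresholds are a heuristic chosen for this sketch.
    """
    if timestamp >= 10**17:       # treated as nanoseconds
        return timestamp // 10**9
    if timestamp >= 10**11:       # treated as milliseconds
        return timestamp // 10**3
    return timestamp              # treated as seconds


def date_partition(timestamp: int, prefix: str = "dt=", fmt: str = "%Y-%m-%d") -> str:
    """Build a date partition such as 'dt=2014-05-01' (UTC assumed)."""
    day = datetime.fromtimestamp(to_epoch_seconds(timestamp), tz=timezone.utc)
    return prefix + day.strftime(fmt)


print(offset_partition(999))                                # first range
print(offset_partition(1000))                               # second range
print(date_partition(1398902400))                           # seconds
print(date_partition(1398902400000))                        # milliseconds
print(date_partition(1398902400, prefix="", fmt="%Y%m%d"))  # custom naming
```

The `prefix` and `fmt` arguments play the role of the partitioner.granularity.date.prefix and partitioner.granularity.date.format properties: changing them changes the directory name without touching the grouping logic.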
If none of the out-of-the-box parsers suits your use case, you can implement a custom parser: extend MessageParser and tell Secor to use your parser by setting secor.message.parser.class in the properties file.
Output File Formats
Currently Secor supports the following output formats:
- Sequence Files: flat file containing binary key-value pairs. To use this format, set the secor.file.reader.writer.factory=com.pinterest.secor.io.impl.SequenceFileReaderWriterFactory option.
- Delimited Text Files: a newline-delimited raw text file. To use this format, set the secor.file.reader.writer.factory=com.pinterest.secor.io.impl.DelimitedTextFileReaderWriterFactory option.
- ORC Files: optimized row columnar format. To use this format, set the secor.file.reader.writer.factory=com.pinterest.secor.io.impl.JsonORCFileReaderWriterFactory option. Additionally, an ORC schema must be specified per topic like this: secor.orc.message.schema.<topic>=<orc schema>. If all Kafka topics receive data in the same format, use secor.orc.message.schema.*=<orc schema> instead. You can supply a custom ORC schema provider by implementing the ORCSchemaProvider interface and specifying the new provider class with the option secor.orc.schema.provider=<orc schema provider class name>. By default this property is DefaultORCSchemaProvider.
- Parquet Files (for Protobuf messages): columnar storage format. To use this output format, set the secor.file.reader.writer.factory=com.pinterest.secor.io.impl.ProtobufParquetFileReaderWriterFactory option. In addition, the Protobuf message class per Kafka topic must be defined using the option secor.protobuf.message.class.<topic>=<protobuf class name>. If all Kafka topics transfer the same protobuf message type, set secor.protobuf.message.class.*=<protobuf class name>.
- Parquet Files (for JSON messages): columnar storage format. In addition to setting all options necessary to write Protobuf to Parquet (see above), the JSON topics must be explicitly defined using the option secor.topic.message.format.<topic>=JSON, or secor.topic.message.format.*=JSON if all Kafka topics use JSON. The protobuf classes defined per topic will be used as intermediaries between the JSON messages and Parquet files.
- Parquet Files (for Thrift messages): columnar storage format. To use this output format, set the secor.file.reader.writer.factory=com.pinterest.secor.io.impl.ThriftParquetFileReaderWriterFactory option. In addition, the thrift message class per Kafka topic must be defined using the option secor.thrift.message.class.<topic>=<thrift class name>. If all Kafka topics transfer the same thrift message type, set secor.thrift.message.class.*=<thrift class name>. It is assumed that all messages use the same thrift protocol, which is set in secor.thrift.protocol.class.
- Parquet Files (for Avro messages): columnar storage format. To use this output format, set the secor.file.reader.writer.factory=com.pinterest.secor.io.impl.AvroParquetFileReaderWriterFactory option. The schema.registry.url option must also be set.
- Gzip upload format: to compress files uploaded to the cloud, set secor.compression.codec in secor.common.properties to a valid compression codec implementing the org.apache.hadoop.io.compress.CompressionCodec interface, such as org.apache.hadoop.io.compress.GzipCodec.
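Several of the options above come in a per-topic form (`...<topic>=`) and a wildcard form (`...*=`). The Python sketch below shows one way to picture that lookup. It is not Secor's resolution logic: `resolve_topic_option` is a hypothetical helper, the class names in the example are made up, and because the text does not say what happens when both forms are set for the same topic, the sketch treats that case as a configuration error rather than guessing a precedence.

```python
def resolve_topic_option(props: dict, prefix: str, topic: str) -> str:
    """Return the value configured for `topic` under `prefix`.

    `prefix` includes the trailing dot, e.g. 'secor.protobuf.message.class.'.
    """
    specific = props.get(prefix + topic)
    wildcard = props.get(prefix + "*")
    if specific is not None and wildcard is not None:
        # Precedence is not documented here, so this sketch refuses to guess.
        raise ValueError(f"both {prefix}{topic} and {prefix}* are set")
    if specific is not None:
        return specific
    if wildcard is not None:
        return wildcard
    raise KeyError(f"no value configured for {prefix}{topic}")


PREFIX = "secor.protobuf.message.class."

# Hypothetical class names, used only for illustration.
per_topic = {PREFIX + "clicks": "com.example.proto.ClickEvent"}
all_topics = {PREFIX + "*": "com.example.proto.GenericEvent"}

print(resolve_topic_option(per_topic, PREFIX, "clicks"))
print(resolve_topic_option(all_topics, PREFIX, "clicks"))
```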
Tools
Secor comes with a number of tools implementing interactions with the environment.
Log file printer
Log file printer displays the content of a log file.
java -ea -Dlog4j.configuration=log4j.prod.properties -Dconfig=secor.prod.backup.properties -cp "secor-0.1-SNAPSHOT.jar:lib/*" com.pinterest.secor.main.LogFilePrinterMain -f s3n://bucket/path
Log file verifier
Log file verifier checks the consistency of log files.
java -ea -Dlog4j.configuration=log4j.prod.properties -Dconfig=secor.prod.backup.properties -cp "secor-0.1-SNAPSHOT.jar:lib/*" com.pinterest.secor.main.LogFileVerifierMain -t topic -q
Partition finalizer
Partition finalizer writes _SUCCESS files to date partitions that very likely won't be receiving any new messages and (optionally) adds the corresponding dates to Hive through the Qubole API.
java -ea -Dlog4j.configuration=log4j.prod.properties -Dconfig=secor.prod.backup.properties -cp "secor-0.1-SNAPSHOT.jar:lib/*" com.pinterest.secor.main.PartitionFinalizerMain
Progress monitor
Progress monitor exports offset consumption lags per topic partition to OpenTSDB / statsD. Lags track how far Secor is behind the producers.
java -ea -Dlog4j.configuration=log4j.prod.properties -Dconfig=secor.prod.backup.properties -cp "secor-0.1-SNAPSHOT.jar:lib/*" com.pinterest.secor.main.ProgressMonitorMain
Set monitoring.interval.seconds to a value larger than 0 to run in a loop, exporting stats every monitoring.interval.seconds seconds.
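The notion of a per-partition lag can be sketched in a few lines of Python. This is a minimal illustration, assuming that a lag is the difference between the latest offset written by producers and the offset Secor has committed. The `compute_lags` function and the sample offsets are hypothetical, and no OpenTSDB or statsD export is shown.

```python
def compute_lags(latest_offsets: dict, committed_offsets: dict) -> dict:
    """Return {(topic, partition): lag} for every partition with a known latest offset.

    A partition without a committed offset is treated as committed at 0
    (an assumption made for this sketch).
    """
    lags = {}
    for key, latest in latest_offsets.items():
        committed = committed_offsets.get(key, 0)
        lags[key] = max(latest - committed, 0)  # never report a negative lag
    return lags


# Sample offsets keyed by (topic, partition); the values are made up.
latest = {("clicks", 0): 1500, ("clicks", 1): 900}
committed = {("clicks", 0): 1200}

for (topic, partition), lag in sorted(compute_lags(latest, committed).items()):
    print(f"{topic}[{partition}] lag={lag}")
```

A lag that keeps growing suggests the consumers are not keeping up with the producers, which is the condition this metric is meant to expose.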
Detailed design
Design details are available in DESIGN.md.
License
Secor is distributed under the Apache License, Version 2.0.
Maintainers
Contributors
- Andy Kramolisch
- Brenden Matthews
- Lucas Zago
- James Green
- Praveen Murugesan
- Zack Dever
- Leo Woessner
- Jerome Gagnon
- Taichi Nakashima
- Lovenish Goyal
- Ahsan Nabi Dar
- Ashish Kumar
- Ashwin Sinha
- Avi Chad-Friedman
Companies who use Secor
- Airbnb
- Strava
- TiVo
- Yelp
- Credit Karma
- VarageSale
- Skyscanner
- Nextperf
- Zalando
- Rakuten
- Appsflyer
- Wego
- GO-JEK
- Branch
- Viacom
- Simplaex
- Zapier
Help
If you have any questions or comments, you can reach us at secor-users@googlegroups.com