Monday, February 20, 2017

Book Notes: I love logs by Jay Kreps


This book is by Jay Kreps who is a Software architect at LinkedIn and primary author of Apache Kafka and Apache Samza open source software.

Following are some of the salient points from the book:
  1. The use of logs in much of the rest of this book will be variations on the two uses in database internals: 
    1. The log is used as a publish/subscribe mechanism to transmit data to other replicas 
    2. The log is used as a consistency mechanism to order the updates that are applied to multiple replicas
  2. So this is actually a very intuitive notion: if you feed two deterministic pieces of code the same input log, they will produce the same output in the same order.
  3. You can describe the state of each replica by a single number: the timestamp for the maximum log entry that it has processed. Two replicas at the same time will be in the same state. Thus, this timestamp combined with the log uniquely capture the entire state of the replica. This gives a discrete, event-driven notion of time that, unlike the machine’s local clocks, is easily comparable between different machines.
  4. In each case, the usefulness of the log comes from the simple function that the log provides: producing a persistent, replayable record of history.
  5. Surprisingly, at the core of the previously mentioned log uses is the ability to have many machines play back history at their own rates in a deterministic manner.
  6. It’s worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult-to-assemble space heater.
  7. Take all of the organization’s data and put it into a central log for real-time subscription.
  8. Each logical data source can be modeled as its own log. A data source could be an application that logs events (such as clicks or page views), or a database table that logs modifications. Each subscribing system reads from this log as quickly as it can, applies each new record to its own store, and advances its position in the log. Subscribers could be any kind of data system: a cache, Hadoop, another database in another site, a search system, and so on.
  9. The log also acts as a buffer that makes data production asynchronous from data consumption.
  10. Of particular importance: the destination system only knows about the log and does not know any details of the system of origin.
  11. use the term “log” here instead of “messaging system” or “pub sub” because it is much more specific about semantics and a much closer description of what you need in a practical implementation to support data replication.
  12. You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics.
  13. we needed to isolate each consumer from the source of the data. The consumer should ideally integrate with just a single data repository that would give her access to everything.
  14. Amazon has offered a service that is very similar to Kafka called Kinesis - it is the piping that connects all their distributed systems — DynamoDB, RedShift, S3 — as well as the basis for distributed stream processing using EC2.
  15. Google has followed with a data stream and processing framework, and Microsoft has started to move in the same direction with their Azure Service Bus offering.
  16. ETL is really two things. First, it is an extraction and data cleanup process, essentially liberating data locked up in a variety of systems in the organization and removing any system-specific nonsense. Secondly, that data is restructured for data warehousing queries (that is, made to fit the type system of a relational database, forced into a star or snowflake schema, perhaps broken up into a high performance column format, and so on). Conflating these two roles is a problem. The clean, integrated repository of data should also be available in real time for low-latency processing, and for indexing in other real-time storage systems.
  17. A better approach is to have a central pipeline, the log, with a well-defined API for adding data. The responsibility of integrating with this pipeline and providing a clean, well-structured data feed lies with the producer of this data feed. This means that as part of their system design and implementation, they must consider the problem of getting data out and into a well-structured form for delivery to the central pipeline.
  18. benefit of this architecture: it enables decoupled, event-driven systems.
  19. In order to allow horizontal scaling, we chop up our log into partitions
    1. Each partition is a totally ordered log, but there is no global ordering between partitions. 
    2. The writer controls the assignment of the messages to a particular partition, with most users choosing to partition by some kind of key (such as a user ID). Partitioning allows log appends to occur without coordination between shards, and allows the throughput of the system to scale linearly with the Kafka cluster size while still maintaining ordering within the sharding key.
    3. Each partition is replicated across a configurable number of replicas, each of which has an identical copy of the partition’s log. At any time, a single partition will act as the leader; if the leader fails, one of the replicas will take over as leader.
    4. each partition is order preserving, and Kafka guarantees that appends to a particular partition from a single sender will be delivered in the order they are sent.
  20. Kafka uses a simple binary format that is maintained between in-memory log, on-disk log, and in-network data transfers.
  21. It turns out that “log” is another word for “stream” and logs are at the heart of stream processing - see stream processing as something much broader: infrastructure for continuous data processing.
  22. Data collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.
  23. many data transfer processes still depend on taking periodic dumps and bulk transfer and integration. The only natural way to process a bulk dump is with a batch process. As these processes are replaced with continuous feeds, we naturally start to move towards continuous processing to smooth out the processing resources needed and reduce latency.
  24. This means that a stream processing system produces output at a user-controlled frequency instead of waiting for the “end” of the data set to be reached.
  25. The final use of the log is arguably the most important, and that is to provide buffering and isolation to the individual processes.
  26. The log acts as a very, very large buffer that allows the process to be restarted or fail without slowing down other parts of the processing graph. This means that a consumer can come down entirely for long periods of time without impacting any of the upstream graph; as long as it is able to catch up when it restarts, everything else is unaffected.
  27. An interesting application of this kind of log-oriented data modeling is the Lambda Architecture. This is an idea introduced by Nathan Marz, who wrote a widely read blog post describing an approach to combining stream processing with offline processing.
  28. For event data, Kafka supports retaining a window of data. The window can be defined in terms of either time (days) or space (GBs), and most people just stick with the one week default retention.
  29. Instead of simply throwing away the old log entirely, we garbage-collect obsolete records from the tail of the log. Any record in the tail of the log that has a more recent update is eligible for this kind of cleanup. By doing this, we still guarantee that the log contains a complete backup of the source system, but now we can no longer recreate all previous states of the source system, only the more recent ones. We call this feature log compaction.
  30. data infrastructure could be unbundled into a collection of services and application-facing system API.
    1. Zookeeper handles much of the system coordination (perhaps with a bit of help from higher-level abstractions like Helix or Curator).
    2. Mesos and YARN process virtualization and resource management.
    3. Embedded libraries like Lucene, RocksDB, and LMDB do indexing.
    4. Netty, Jetty, and higher-level wrappers like Finagle and rest.li handle remote communication.
    5. Avro, Protocol Buffers, Thrift, and umpteen zillion other libraries handle serialization. 
    6. Kafka and BookKeeper provide a backing log.
  31. If you stack these things in a pile and squint a bit, it starts to look like a LEGO version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems.
  32. Here are some things a log can do: 
    1. Handle data consistency (whether eventual or immediate) by sequencing concurrent updates to nodes 
    2. Provide data replication between nodes 
    3. Provide “commit” semantics to the writer (such as acknowledging only when your write is guaranteed not to be lost) 
    4. Provide the external data subscription feed from the system 
    5. Provide the capability to restore failed replicas that lost their data or bootstrap new replicas 
    6. Handle rebalancing of data between nodes

No comments:

Book notes: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann

My notes from the excellent book on how software has evolved to handle data from hierarchical databases to the NoSQL -  https://www.goodrea...