Tuesday, February 28, 2017

Introduction to Microservices Summary

Excerpted from https://www.nginx.com/blog/introduction-to-microservices/

Author of the ebook Chris Richardson, has very nicely summarized each article. I am copy pasting the summary sections from the 7 articles and putting them in one place for easy reference.

  1. Building complex applications is inherently difficult. A Monolithic architecture only makes sense for simple, lightweight applications. You will end up in a world of pain if you use it for complex applications. The Microservices architecture pattern is the better choice for complex, evolving applications despite the drawbacks and implementation challenges.
  2. For most microservices‑based applications, it makes sense to implement an API Gateway, which acts as a single entry point into a system. The API Gateway is responsible for request routing, composition, and protocol translation. It provides each of the application’s clients with a custom API. The API Gateway can also mask failures in the backend services by returning cached or default data.
  3. Microservices must communicate using an inter‑process communication mechanism. When designing how your services will communicate, you need to consider various issues: how services interact, how to specify the API for each service, how to evolve the APIs, and how to handle partial failure. There are two kinds of IPC mechanisms that microservices can use, asynchronous messaging and synchronous request/response.
  4. In a microservices application, the set of running service instances changes dynamically. Instances have dynamically assigned network locations. Consequently, in order for a client to make a request to a service it must use a service‑discovery mechanism.
  5. A key part of service discovery is the service registry. The service registry is a database of available service instances. The service registry provides a management API and a query API. Service instances are registered with and deregistered from the service registry using the management API. The query API is used by system components to discover available service instances.
  6. There are two main service‑discovery patterns: client-side discovery and service-side discovery. In systems that use client‑side service discovery, clients query the service registry, select an available instance, and make a request. In systems that use server‑side discovery, clients make requests via a router, which queries the service registry and forwards the request to an available instance.
  7. There are two main ways that service instances are registered with and deregistered from the service registry. One option is for service instances to register themselves with the service registry, the self‑registration pattern. The other option is for some other system component to handle the registration and deregistration on behalf of the service, the third‑party registration pattern.
  8. In some deployment environments you need to set up your own service‑discovery infrastructure using a service registry such as Netflix Eurekaetcd, or Apache Zookeeper. In other deployment environments, service discovery is built in. For example, Kubernetes and Marathon handle service instance registration and deregistration. They also run a proxy on each cluster host that plays the role of server‑side discovery router.
  9. In a microservices architecture, each microservice has its own private datastore. Different microservices might use different SQL and NoSQL databases. While this database architecture has significant benefits, it creates some distributed data management challenges. The first challenge is how to implement business transactions that maintain consistency across multiple services. The second challenge is how to implement queries that retrieve data from multiple services.
  10. For many applications, the solution is to use an event‑driven architecture. One challenge with implementing an event‑driven architecture is how to atomically update state and how to publish events. There are a few ways to accomplish this, including using the database as a message queue, transaction log mining, and event sourcing.
  11. Deploying a microservices application is challenging. There are tens or even hundreds of services written in a variety of languages and frameworks. Each one is a mini‑application with its own specific deployment, resource, scaling, and monitoring requirements. There are several microservice deployment patterns including Service Instance per Virtual Machine and Service Instance per Container. Another intriguing option for deploying microservices is AWS Lambda, a serverless approach.
  12. The process of migrating an existing application into microservices is a form of application modernization. You should not move to microservices by rewriting your application from scratch. Instead, you should incrementally refactor your application into a set of microservices. There are three strategies you can use: 
    1. implement new functionality as microservices; 
    2. split the presentation components from the business and data access components; and 
    3. convert existing modules in the monolith into services. 
Over time the number of microservices will grow, and the agility and velocity of your development team will increase.

Sunday, February 26, 2017

Slackbot in Java

Using https://github.com/Ullink/simple-slack-api

Following snippet is for a slackbot that looks for <@BOT_ID> prefix in the incoming messages across all channels (or direct messages) that it receives from slack.



Getting Started with Apache Kafka in Docker

To run Apache Kafka in docker container:
  • git clone https://github.com/wurstmeister/kafka-docker.git
  • Edit the docker-compose.yml to provide KAFKA_ADVERTISED_HOST_NAME and KAFKA_ZOOKEEPER_CONNECT host ips. This is the IP of the host on which we are going to run the docker containers for Kafka and Zookeeper.
KAFKA_ADVERTISED_HOST_NAME: 192.168.86.89
KAFKA_ZOOKEEPER_CONNECT: 192.168.86.89:2181
  • Start a cluster in detached mode:
docker-compose up -d

It will start 2 containers:
kafkadocker_kafka_1 - with kafka running at 9092 mapped to 9092 of localhost
kafkadocker_zookeeper_1 - with zookeeper running at 2181 mapped to 2181 of localhost

To start a cluster with 2 brokers:

docker-compose scale kafka=2

 You can use docker-compose ps to show the running instances. 
 If you want to add more Kafka brokers simply increase the value passed to 
 docker-compose scale kafka=n
 
 If you want to customise any Kafka parameters, simply add them as environment variables 
 in docker-compose.yml. For example:
    • to increase the message.max.bytes parameter add KAFKA_MESSAGE_MAX_BYTES: 2000000 to the environment section.
    • to turn off automatic topic creation set KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'false'
  • To interact with Kafka container use the kafka-shell as:
    $ start-kafka-shell.sh  [DOCKER_HOST_IP] [ZOOKEEPER_HOST:ZOOKEEPER_PORT]
Example: $ ./start-kafka-shell.sh 192.168.86.89 192.168.86.89:2181
  • Testing:
    • To test your setup, start a shell, create a topic and start a producer:
$ $KAFKA_HOME/bin/kafka-topics.sh --create --topic "topic" \ --partitions 4 --zookeeper $ZK --replication-factor 2

$ $KAFKA_HOME/bin/kafka-topics.sh --describe --topic "topic" --zookeeper $ZK 

$ $KAFKA_HOME/bin/kafka-console-producer.sh --topic="topic" \
--broker-list=`broker-list.sh
    • Start another shell and start consumer:
$ $KAFKA_HOME/bin/kafka-console-consumer.sh --topic=topic --zookeeper=$ZK
    • Type in producer shell and it should be published to the client's shell.

Friday, February 24, 2017

Running AWS CLI using docker image

For my project i needed to execute AWS commands without having to install awscli or using SDK but do it in a cloud provider agnostic manner. The project in itself requires a separate article but this piece of getting to run awscli built into a docker image (using alpine linux which is very light weight) and executed from Java code seems quite useful by itself outside of the context of my master's project (which by the way is a chatbot for cloud operations management named Amigo).

Docker hub registry entry for the awscli image:
https://hub.docker.com/r/sjsucohort6/docker_awscli/

Docker image with awscli installed. It uses alpine python 2.7 and hence the size of the image is 115MB approx.
This enables one to execute awscli commands without having to install awscli locally.
Usage:
$ docker pull sjsucohort6/docker_awscli
Example command - ecs list-clusters can be executed as:
$ docker run -it --rm -e AWS_DEFAULT_REGION='' -e AWS_ACCESS_KEY_ID='' -e AWS_SECRET_ACCESS_KEY='' --entrypoint aws sjsucohort6/docker_awscli:latest ecs list-clusters
The Java docker-client from Spotify is the one used for the below code that pulls the image from docker hub and executes the command.


Kubernetes Basics

From Building Microservice Systems with Docker and Kubernetes by Ben Straub

  • Kubernetes
    • Runs docker containers
    • Powerful Label matching system for control/grouping/routing traffic
    • Spans across hosts - converts a set of computers into one big one
    • One master (that sends control commands to minions to execute) - multiple minions (that run docker containers)
    • POD - set of docker containers (often just one) always on the same host. For each POD there is one IP address.
    • Replication controller - manages lifecycle of PODs which match the labels associated with the RC.
    • Services - load balance traffic to PODs based on matching label. For eg. Service (name = frontend) will route traffic to PODs with name = frontend. The PODs may be managed by different RCs.
  • Traffic routed by Service named frontend to both old and new version PODs. Once the rollover to new version is completed the traffic continues to get routed to the frontend named PODs which are version 124 and old version PODs and RC are deleted eventually without any downtime.
  • Every service gets a DNS entry same as its name. For example, ServiceA, ServiceB etc. POD looks up a service by its name and communicates with minions under that service via the service.
  • Service can have ingress port configured to receive inbound traffic. Say port 8000 on ServiceA is opened which will map to a port (say 37654) on every minion.
  • Setting up Kubernetes Cluster in AWS:

    • Identity and Access Management (IAM) -
      • Create user and generate creds/download
      • Attach policy
    • Get awscli and install it
    • Download kubernetes - https://github.com/kubernetes/kubernetes/releases/ and unpack it.
    • Open cluster/aws/config-default.sh, edit as needed to change the size of the kubernetes cluster
    • Run: KUBERNETES_PROVIDER=aws cluster/kube-up.sh
      • Created new VPC 172.20.0.*
      • 5 EC2 instances (t2.micro) = 1 master + 4 minions with public Ips
      • ASG for minions
      • SSH Keys for direct access
        • ~/.ssh/kube_aws_rsa
      • Kubectl is configured
        • ~/.kube/config
    • KUBERNETES_PROVIDER=aws cluster/kube-down.sh - to delete all aws resources

Using dnsmasq on Mac OSX

Excerpted from https://passingcuriosity.com/2013/dnsmasq-dev-osx/

dnsmasq makes it easy to redirect development sites to localhost. This is very useful when working on a dev box and running docker containers.

Instead of having to edit the /etc/hosts file everytime we can suffix .dev to any name of our choosing database.dev or frontend.dev and it will be mapped to the localhost address 127.0.0.1.

Following are the steps to get this working on Mac:
# update brew
➜  ~  brew up
➜  ~  brew install dnsmasq
➜  ~ cp $(brew list dnsmasq | grep /dnsmasq.conf.example$) /usr/local/etc/dnsmasq.conf
➜  ~ sudo cp $(brew list dnsmasq | grep /homebrew.mxcl.dnsmasq.plist$) /Library/LaunchDaemons/
Password:
➜  ~ sudo launchctl load /Library/LaunchDaemons/homebrew.mxcl.dnsmasq.plist
# edit dnsmasq.conf and add address=/dev/127.0.0.1 -- this will cause any address ending in dev to be redirected to localhost.

➜  ~ vi /usr/local/etc/dnsmasq.conf
# restart dnsmasq
➜  ~ sudo launchctl stop homebrew.mxcl.dnsmasq
➜  ~ sudo launchctl start homebrew.mxcl.dnsmasq

➜  ~ dig testing.testing.one.two.three.dev @127.0.0.1

# add a new nameresolver to Mac
➜  ~ sudo mkdir -p /etc/resolver
➜  ~ sudo tee /etc/resolver/dev >/dev/null <nameserver 127.0.0.1
EOF
# test
➜  ~ ping -c 1 this.is.a.test.dev
PING this.is.a.test.dev (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.044 ms

--- this.is.a.test.dev ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.044/0.044/0.044/0.000 ms
➜  ~ ping -c 1 iam.the.walrus.dev
PING iam.the.walrus.dev (127.0.0.1): 56 data bytes
64 bytes from 127.0.0.1: icmp_seq=0 ttl=64 time=0.041 ms

--- iam.the.walrus.dev ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.041/0.041/0.041/0.000 ms

# and that we did not break the nameresolver for non-dev sites
➜  ~ ping -c 1 www.google.com
PING www.google.com (216.58.194.196): 56 data bytes
64 bytes from 216.58.194.196: icmp_seq=0 ttl=54 time=54.026 ms

--- www.google.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 54.026/54.026/54.026/0.000 ms

Introduction to Docker

Notes from Introduction to Docker by Andrew Tork Baker

  1. Two main parts of docker - Docker engine and docker hub
  2. Essential Docker commands:
  • docker run [options] image [command] [args…]
    • docker run busybox /bin/echo "hello world"
    • docker run -it ubuntu /bin/bash - gives interactive shell
    • Docker run -p 8000:80 atbaker/nginx-example
    • Docker run -d -p 8000:80 atbaker/nginx-example - runs container in background (detached mode)
    • Docker run -d -p 8000:80 --name webserver atbaker/nginx-example - to name containers
  • Docker images
    • docker images -q - will list all image ids alone
  • Docker ps - shows active/running containers
    • Docker ps -a - shows all containers (even once we exited)
    • docker ps -a -q - shows all container ids.
  • Docker stop
  • Docker start
  • Docker rm
    • Docker rm -f   - to even remove running container
    • Docker rm -f $(docker ps -a -q) - will remove all containers from the system running or not
Where containerId = short form can be used first 4 unique chars from SHA id of container
  • Docker logs webserver - inspect the logs
    • Docker logs -f webserver -- follows the logs on the container
  • Docker attach webserver - attaches to the running container in detached mode but the disadvantage is if you attached and you exit the container shell then container will exit too. So if you need to check the logs alone then just use docker logs instead.
  • Docker port webserver 80 -- will show the mapping for the port 80 on the container to the port on the docker host
  • Docker diff webserver - what has changed on filesystem of container since we started it
  • Docker cp webserver:/usr/local/nginx/html/index.html . - will copy the index.html from container to local directory
  • Docker inspect webserver -- low level info on container (environment variables, hostname of container etc)
  • Docker history atbaker/nginx-example - will show when each layer in the image was applied to the image
  • Docker search postgres - will search docker hub registry for all images for postgres
  • Postgres:
    • Docker pull postgres:latest - will pull down each layer in the image to local host
    • Docker run -p 5432:5432 postgres
    • psql -U postgres -h localhost - will connect to postgres DB container at port 5432 (need to install psql for this)
  • Redis:
    • Docker pull atbaker/redis-example
    • Docker run -it atbaker/redis-example /bin/bash - to run interactive shell
    • Change the file /usr/src/custom-redis.conf - uncomment requirepass line, exit the shell
    • Docker commit -m "message"   - to save the change done to container as a new image
    • Docker commit -m "setting password" redis-passwd
    • Docker run -p 6379:6379 redis-passwd redis-server /usr/src/custom-redis.conf
    • Redis-cli -h localhost -p 6379
    • Auth foobared
    • Set foo bar
    • Get foo
  • Once image is created you can push it to the docker hub account:
    • Docker login - login to docker hub
    • Docker tag redis-passwd sjsucohort6/redis
    • Docker push sjsucohort6/redis -  by default it will tag remote image as latest
    • docker push sjsucohort6/redis:ver1 - will tag remote image as ver1
  • Mongodb:
    • git clone https://github.com/atbaker/mongo-example.git
    • Docker build -t mongodb . - assuming current working dir has Dockerfile
    • Docker run -P mongodb - the -P option maps any port on localhost to exposed port (27017) of mongodb container
    • Docker ps - find what port on localhost is mapped to mongodb port
    • mongo localhost:32768 -- connect to the mongodb assuming here that local port 32768 was mapped to 27017 of mongodb


TODO - cover dockerfiles, docker compose in a later post.

Monday, February 20, 2017

Apache Kafka - An Overview

Excerpted from: https://thenewstack.io/apache-kafka-primer/

  1. Message Oriented Middleware (MOM) such as Apache Qpid, RabbitMQ, Microsoft Message Queue, and IBM MQ Series were used for exchanging messages across various components. While these products are good at implementing the publisher/subscriber pattern (Pub/Sub), they are not specifically designed for dealing with large streams of data originating from thousands of publishers. Most of the MOM software have a broker that exposes Advanced Message Queuing Protocol (AMQP) protocol for asynchronous communication.
  2. Kafka is designed from the ground up to deal with millions of firehose-style events generated in rapid succession. It guarantees low latency, “at-least-once”, delivery of messages to consumers. Kafka also supports retention of data for offline consumers, which means that the data can be processed either in real-time or in offline mode.
  3. Kafka is designed to be a distributed commit log. Much like relational databases, it can provide a durable record of all transactions that can be played back to recover the state of a system. 
  4. Kafka provides redundancy, which ensures high availability of data even when one of the servers faces disruption.
  5. Multiple event sources can concurrently send data to a Kafka cluster, which will reliably gets delivered to multiple destinations.
  6. Key concepts:
  • Message - Each message is a key/value pair. Irrespective of the data type, Kafka always converts messages into byte arrays.
  • Producers - or publisher clients that produce data
  • Consumers - are subscribers or readers that read the data. Unlike subscribers in MOM, Kafka consumers are stateful, which means they are responsible for remembering the cursor position, which is called as an offset. The consumer is also a client of Kafka cluster. Each consumer may belong to a consumer group, which will be introduced in the later sections.
    The fundamental difference between MOM and Kafka is that the clients will never receive messages automatically. They have to explicitly ask for a message when they are ready to handle.
  • Topics - logical collection of messages. Data sent by producers are stored in topics. Consumers subscribe to a specific topic that they are interested in.
  • Partition - Each topic is split into one or more partitions. They are like shards and Kafka may use the message key to automatically group similar messages into a partition. This scheme enables Kafka to dynamically scale the messaging infrastructure. Partitions are redundantly distributed across the Kafka cluster. Messages are written to one partition but copied to at least two more partitions maintained on different brokers with the cluster.
  • Consumer groups - consumers belong to at least one consumer group, which is typically associated with a topic. Each consumer within the group is mapped to one or more partitions of the topic. Kafka will guarantee that a message is only read by a single consumer in the group. Each consumer will read from a partition while tracking the offset. If a consumer that belongs to a specific consumer group goes offline, Kafka can assign the partition to an existing consumer. Similarly, when a new consumer joins the group, it balances the association of partitions with the available consumers.
    It is possible for multiple consumer groups to subscribe to the same topic. For example, in the IoT use case, a consumer group might receive messages for real-time processing through an Apache Storm cluster. A different consumer group may also receive messages from the same topic for storing them in HBase for batch processing.
    The concept of partitions and consumer groups allows horizontal scalability of the system.
  • Broker - Each Kafka instance belonging to a cluster is called a broker. Its primary responsibility is to receive messages from producers, assigning offsets, and finally committing the messages to the disk. Based on the underlying hardware, each broker can easily handle thousands of partitions and millions of messages per second.
    The partitions in a topic may be distributed across multiple brokers. This redundancy ensures the high availability of messages.
  • Cluster - A collection of Kafka broker forms the cluster. One of the brokers in the cluster is designated as a controller, which is responsible for handling the administrative operations as well as assigning the partitions to other brokers. The controller also keeps track of broker failures.
  • Zookeeper - Kafka uses Apache ZooKeeper as the distributed configuration store. It forms the backbone of Kafka cluster that continuously monitors the health of the brokers. When new brokers get added to the cluster, ZooKeeper will start utilizing it by creating topics and partitions on it.
Kafka in docker - https://github.com/spotify/docker-kafka
http://docs.confluent.io/3.0.0/quickstart.html#quickstart 

Book Notes: I love logs by Jay Kreps


This book is by Jay Kreps who is a Software architect at LinkedIn and primary author of Apache Kafka and Apache Samza open source software.

Following are some of the salient points from the book:
  1. The use of logs in much of the rest of this book will be variations on the two uses in database internals: 
    1. The log is used as a publish/subscribe mechanism to transmit data to other replicas 
    2. The log is used as a consistency mechanism to order the updates that are applied to multiple replicas
  2. So this is actually a very intuitive notion: if you feed two deterministic pieces of code the same input log, they will produce the same output in the same order.
  3. You can describe the state of each replica by a single number: the timestamp for the maximum log entry that it has processed. Two replicas at the same time will be in the same state. Thus, this timestamp combined with the log uniquely capture the entire state of the replica. This gives a discrete, event-driven notion of time that, unlike the machine’s local clocks, is easily comparable between different machines.
  4. In each case, the usefulness of the log comes from the simple function that the log provides: producing a persistent, replayable record of history.
  5. Surprisingly, at the core of the previously mentioned log uses is the ability to have many machines play back history at their own rates in a deterministic manner.
  6. It’s worth noting the obvious: without a reliable and complete data flow, a Hadoop cluster is little more than a very expensive and difficult-to-assemble space heater.
  7. Take all of the organization’s data and put it into a central log for real-time subscription.
  8. Each logical data source can be modeled as its own log. A data source could be an application that logs events (such as clicks or page views), or a database table that logs modifications. Each subscribing system reads from this log as quickly as it can, applies each new record to its own store, and advances its position in the log. Subscribers could be any kind of data system: a cache, Hadoop, another database in another site, a search system, and so on.
  9. The log also acts as a buffer that makes data production asynchronous from data consumption.
  10. Of particular importance: the destination system only knows about the log and does not know any details of the system of origin.
  11. use the term “log” here instead of “messaging system” or “pub sub” because it is much more specific about semantics and a much closer description of what you need in a practical implementation to support data replication.
  12. You can think of the log as acting as a kind of messaging system with durability guarantees and strong ordering semantics.
  13. we needed to isolate each consumer from the source of the data. The consumer should ideally integrate with just a single data repository that would give her access to everything.
  14. Amazon has offered a service that is very similar to Kafka called Kinesis - it is the piping that connects all their distributed systems — DynamoDB, RedShift, S3 — as well as the basis for distributed stream processing using EC2.
  15. Google has followed with a data stream and processing framework, and Microsoft has started to move in the same direction with their Azure Service Bus offering.
  16. ETL is really two things. First, it is an extraction and data cleanup process, essentially liberating data locked up in a variety of systems in the organization and removing any system-specific nonsense. Secondly, that data is restructured for data warehousing queries (that is, made to fit the type system of a relational database, forced into a star or snowflake schema, perhaps broken up into a high performance column format, and so on). Conflating these two roles is a problem. The clean, integrated repository of data should also be available in real time for low-latency processing, and for indexing in other real-time storage systems.
  17. A better approach is to have a central pipeline, the log, with a well-defined API for adding data. The responsibility of integrating with this pipeline and providing a clean, well-structured data feed lies with the producer of this data feed. This means that as part of their system design and implementation, they must consider the problem of getting data out and into a well-structured form for delivery to the central pipeline.
  18. benefit of this architecture: it enables decoupled, event-driven systems.
  19. In order to allow horizontal scaling, we chop up our log into partitions
    1. Each partition is a totally ordered log, but there is no global ordering between partitions. 
    2. The writer controls the assignment of the messages to a particular partition, with most users choosing to partition by some kind of key (such as a user ID). Partitioning allows log appends to occur without coordination between shards, and allows the throughput of the system to scale linearly with the Kafka cluster size while still maintaining ordering within the sharding key.
    3. Each partition is replicated across a configurable number of replicas, each of which has an identical copy of the partition’s log. At any time, a single partition will act as the leader; if the leader fails, one of the replicas will take over as leader.
    4. each partition is order preserving, and Kafka guarantees that appends to a particular partition from a single sender will be delivered in the order they are sent.
  20. Kafka uses a simple binary format that is maintained between in-memory log, on-disk log, and in-network data transfers.
  21. It turns out that “log” is another word for “stream” and logs are at the heart of stream processing - see stream processing as something much broader: infrastructure for continuous data processing.
  22. Data collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.
  23. many data transfer processes still depend on taking periodic dumps and bulk transfer and integration. The only natural way to process a bulk dump is with a batch process. As these processes are replaced with continuous feeds, we naturally start to move towards continuous processing to smooth out the processing resources needed and reduce latency.
  24. This means that a stream processing system produces output at a user-controlled frequency instead of waiting for the “end” of the data set to be reached.
  25. The final use of the log is arguably the most important, and that is to provide buffering and isolation to the individual processes.
  26. The log acts as a very, very large buffer that allows the process to be restarted or fail without slowing down other parts of the processing graph. This means that a consumer can come down entirely for long periods of time without impacting any of the upstream graph; as long as it is able to catch up when it restarts, everything else is unaffected.
  27. An interesting application of this kind of log-oriented data modeling is the Lambda Architecture. This is an idea introduced by Nathan Marz, who wrote a widely read blog post describing an approach to combining stream processing with offline processing.
  28. For event data, Kafka supports retaining a window of data. The window can be defined in terms of either time (days) or space (GBs), and most people just stick with the one week default retention.
  29. Instead of simply throwing away the old log entirely, we garbage-collect obsolete records from the tail of the log. Any record in the tail of the log that has a more recent update is eligible for this kind of cleanup. By doing this, we still guarantee that the log contains a complete backup of the source system, but now we can no longer recreate all previous states of the source system, only the more recent ones. We call this feature log compaction.
  30. data infrastructure could be unbundled into a collection of services and application-facing system API.
    1. Zookeeper handles much of the system coordination (perhaps with a bit of help from higher-level abstractions like Helix or Curator).
    2. Mesos and YARN process virtualization and resource management.
    3. Embedded libraries like Lucene, RocksDB, and LMDB do indexing.
    4. Netty, Jetty, and higher-level wrappers like Finagle and rest.li handle remote communication.
    5. Avro, Protocol Buffers, Thrift, and umpteen zillion other libraries handle serialization. 
    6. Kafka and BookKeeper provide a backing log.
  31. If you stack these things in a pile and squint a bit, it starts to look like a LEGO version of distributed data system engineering. You can piece these ingredients together to create a vast array of possible systems.
  32. Here are some things a log can do: 
    1. Handle data consistency (whether eventual or immediate) by sequencing concurrent updates to nodes 
    2. Provide data replication between nodes 
    3. Provide “commit” semantics to the writer (such as acknowledging only when your write is guaranteed not to be lost) 
    4. Provide the external data subscription feed from the system 
    5. Provide the capability to restore failed replicas that lost their data or bootstrap new replicas 
    6. Handle rebalancing of data between nodes

Book notes: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems, by Martin Kleppmann

My notes from the excellent book on how software has evolved to handle data from hierarchical databases to the NoSQL -  https://www.goodrea...