Kafka Streams API

A Step Beyond Hello World

Ivan Ponomarev, Synthesized/MIPT

me
  • Staff Engineer at Synthesized

  • ERP systems & Java background

  • Speaker at JPoint, DevOops, Heisenbug, JUG.MSK, PermDevDay, DevopsForum, Stachka, etc.

  • Heisenbug Program Committee member.

  • Current project: real-time web scraping

Everything I show is on GitHub

octocat

Our plan

kafka

Lecture 1.

  1. Kafka (brief reminder) and Data Streaming

  2. Application configuration. Stateless transformations

  3. Transformations with local state

Lecture 2.

  1. Stream-table dualism and Table joins

  2. Time and window operations

kafka

Kafka is

kafka logo

In Kafka you can

okay
  • Write something to a named log (topic)

  • Read entries from the topic in FIFO order (within a partition)

  • Commit the offset of the processed entries
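
All three operations go through the plain client API. A minimal sketch (topic name is a placeholder; producer/consumer construction and configuration are omitted, as elsewhere in these slides):

// Write something to a named log (topic)
producer.send(new ProducerRecord<>("mytopic", "key", "value"));

// Read entries from the topic in FIFO order (within a partition)...
ConsumerRecords<String, String> records =
        consumer.poll(Duration.ofSeconds(1));
records.forEach(r ->
        System.out.println(r.offset() + ": " + r.value()));

// ...and commit the offset of the processed entries
consumer.commitSync();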

In Kafka you can’t

noway
  • Erase a record

  • Edit a record

  • Access an arbitrary record in the log, other than by its offset

Topics, partitions and messages

topics partitions

Topics, partitions and messages

topics partitions1

Topics, partitions and messages

topics partitions2

Anatomy of a message

message anatomy

Anatomy of a message

message anatomy2
// hash the keyBytes to choose a partition
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

Reading from Kafka

ConsumerG0

Reading from Kafka

ConsumerG

Reading from Kafka

ConsumerG2

Reading from Kafka

ConsumerG3

Offset Commit

offcommit1

Offset Commit

offcommit2

Offset Commit

offcommit3

Offset Commit

offcommit4

Offset Commit

offcommit5

Offset Commit

offcommit6

Offset Commit

offcommit7

Compacted topics

log compaction
Source: Kafka Documentation
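
Compaction is an ordinary topic-level setting (cleanup.policy=compact). A hedged sketch of creating a compacted topic via the Admin API (names are placeholders, exception handling omitted):

try (Admin admin = Admin.create(Map.of(
        AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {
    admin.createTopics(List.of(
            new NewTopic("compacted-topic", 10, (short) 1)
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                TopicConfig.CLEANUP_POLICY_COMPACT))))
         .all().get(); // wait for the broker to confirm
}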

How Retention works

tapeloop

Stream data processing: architecture

streaming arch1

Where are streaming systems needed?

  • Monitoring! Logs!

  • Track user activity

  • Anomaly detection (including fraud detection)

okay
streams ok
noway
streams noway

Existing stream processing frameworks

spark logo
samza logo
storm logo
flink logo
kafka logo

Our plan

kafka

Lecture 1.

  1. Kafka (brief reminder) and Data Streaming

  2. Application configuration. Stateless transformations

  3. Transformations with local state

Lecture 2.

  1. Stream-table dualism and Table joins

  2. Time and window operations

kafka

Kafka Streams API: the general structure of a KStreams application

StreamsConfig config = ...;
// Here we set all sorts of options

Topology topology = new StreamsBuilder()
        // Here we build the topology
        ...
        .build();

Kafka Streams API: the general structure of a KStreams application

A topology is a pipeline of processors:

topology sample

Convert a stream to a stream

blockStream

map

squashedStream

List<Block> blocks = ...;

Stream<Block> blocksStream = blocks.stream();

Stream<SquashedBlock> squashedStream =
  blocksStream.map(Block::squash);

(The author of the animations is Tagir Valeev, see moving pictures here)

Filtering

squashedStream

filter

filteredStream

Stream<SquashedBlock> filteredStream =
  squashedStream.filter(block ->
         block.getColor() != YELLOW);

Printing to the console (terminal operation)

filteredStream

display
filteredStream
  .forEach(System.out::println);

All together in one line

fuse
blocks.stream()
      .map(Block::squash)
      .filter(block ->
         block.getColor() != YELLOW)
      .forEach(System.out::println);

This reminds us of something!

"Concatenate two files, convert their lines to lowercase, sort, display the last three lines in alphabetical order"

cat file1 file2 | tr "[A-Z]" "[a-z]" | sort | tail -3

Kafka Streams API: the general structure of a KStreams application

StreamsConfig config = ...;
// Here we set all sorts of options

Topology topology = new StreamsBuilder()
        // Here we build the topology
        ...
        .build();


// Spring Kafka does this part for us:
KafkaStreams streams = new KafkaStreams(topology, config);
streams.start();
...
streams.close();

In Spring, it is enough to define two things

  • @Bean KafkaStreamsConfiguration

  • @Bean Topology
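
Both beans live in a configuration class annotated with @EnableKafkaStreams, which switches on Spring's Kafka Streams infrastructure. A minimal skeleton (class name matches the slides below):

@Configuration
@EnableKafkaStreams
public class KafkaConfiguration {
    // @Bean KafkaStreamsConfiguration and @Bean Topology go here
}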

Our story

betting
  • There are football matches (the score changes)

  • Bets are placed: H (home win), D (draw), A (away win).

  • Bet stream, key: 'Cyprus-Belgium:A'

  • Bet stream, value:

class Bet {
  String bettor;   // John Doe
  String match;    // Cyprus-Belgium
  Outcome outcome; // A (or H or D)
  long amount;     // 100
  double odds;     // 1.7
  long timestamp;  // 1554215083998
}
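
The tests below call bet.key(); given the 'Cyprus-Belgium:A' key format above, a plausible sketch of that helper (the actual implementation is in the GitHub repo):

String key() {
    return match + ":" + outcome; // e.g. "Cyprus-Belgium:A"
}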

@Bean KafkaStreamsConfiguration

IMPORTANT!
@Bean(name =
    KafkaStreamsDefaultConfiguration
                .DEFAULT_STREAMS_CONFIG_BEAN_NAME)
public KafkaStreamsConfiguration getStreamsConfig() {
    Map<String, Object> props = new HashMap<>();
    // IMPORTANT!
    props.put(StreamsConfig.APPLICATION_ID_CONFIG,
        "stateless-demo-app");
    // IMPORTANT!
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    ...
    KafkaStreamsConfiguration streamsConfig =
            new KafkaStreamsConfiguration(props);
    return streamsConfig;
}

@Bean NewTopic

@Bean
NewTopic getFilteredTopic() {
    Map<String, String> props = new HashMap<>();
    props.put(
      TopicConfig.CLEANUP_POLICY_CONFIG,
      TopicConfig.CLEANUP_POLICY_COMPACT);
    return new NewTopic("mytopic", 10, (short) 1).configs(props);
}

@Bean Topology

yelling topology
@Bean
public Topology createTopology(StreamsBuilder streamsBuilder) {
    KStream<String, Bet> input = streamsBuilder.stream(...);
    KStream<String, Long> gain
            = input.mapValues(v -> Math.round(v.getAmount() * v.getOdds()));
    gain.to(GAIN_TOPIC, Produced.with(Serdes.String(),
                new JsonSerde<>(Long.class)));
    return streamsBuilder.build();
}

Three lines of code, and what’s the big deal?

  • Need to handle more messages per second? Start more instances with the same 'application.id'!

w1

Adding nodes

w2

Limited only by the number of partitions

w4

TopologyTestDriver: creating

KafkaStreamsConfiguration config = new KafkaConfiguration()
                                        .getStreamsConfig();
StreamsBuilder sb = new StreamsBuilder();
Topology topology = new TopologyConfiguration().createTopology(sb);
TopologyTestDriver topologyTestDriver =
        new TopologyTestDriver(topology,
                               config.asProperties());

TestInput/OutputTopic: creating

TestInputTopic<String, Bet> inputTopic =
        topologyTestDriver.createInputTopic(BET_TOPIC,
            Serdes.String().serializer(),
            new JsonSerde<>(Bet.class).serializer());
TestOutputTopic<String, Long> outputTopic =
        topologyTestDriver.createOutputTopic(GAIN_TOPIC,
            Serdes.String().deserializer(),
            new JsonSerde<>(Long.class).deserializer());

TopologyTestDriver: Usage

Bet bet = Bet.builder()
            .bettor("John Doe")
            .match("Germany-Belgium")
            .outcome(Outcome.H)
            .amount(100)
            .odds(1.7).build();

inputTopic.pipeInput(bet.key(), bet);

TopologyTestDriver: Usage

TestRecord<String, Long> record = outputTopic.readRecord();

assertEquals(bet.key(), record.key());
assertEquals(170L, record.value().longValue());

If something goes wrong…​

  • default.deserialization.exception.handler — could not deserialize

  • default.production.exception.handler — the broker rejected the message (for example, it is too large)

failure
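
Both are ordinary configuration keys. A minimal sketch using the handlers that ship with Kafka Streams (LogAndContinueExceptionHandler logs the bad record and skips it instead of failing):

props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
        LogAndContinueExceptionHandler.class);
props.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
        DefaultProductionExceptionHandler.class);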

If everything falls apart completely

streams.setUncaughtExceptionHandler(
    (Thread thread, Throwable throwable) -> {
        ...
    });
uncaughtexception
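
From version 2.8 there is a dedicated overload that also lets the application choose how to react. A short sketch (REPLACE_THREAD restarts the failed stream thread; the other options are SHUTDOWN_CLIENT and SHUTDOWN_APPLICATION):

streams.setUncaughtExceptionHandler(throwable ->
    StreamsUncaughtExceptionHandler
        .StreamThreadExceptionResponse.REPLACE_THREAD);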

In Spring, things are more complicated (see code)

KafkaStreams application states

kstreamsstates

What else do I need to know about stateless transformations?

Easy branching of streams

Java streams can’t do that:

KStream<..> foo = ...
KStream<..> bar = foo.mapValues(...).map(...).to(...)
KStream<..> baz = foo.filter(...).map(...).forEach(...)
simplebranch

Branching streams by condition

From Version 2.8:

gain.split()
    .branch((key, value) -> key.contains("A"),
        Branched.withConsumer(ks -> ks.to("A")))
    .branch((key, value) -> key.contains("B"),
        Branched.withConsumer(ks -> ks.to("B")));
switchbranch

Simple merge

KStream<String, Integer> foo = ...
KStream<String, Integer> bar = ...
KStream<String, Integer> merge = foo.merge(bar);
merge

Our plan

kafka

Lecture 1.

  1. Kafka (brief reminder) and Data Streaming

  2. Application configuration. Stateless transformations

  3. Transformations with local state

Lecture 2.

  1. Stream-table dualism and Table joins

  2. Time and window operations

kafka

Local state

Facebook’s RocksDB — what is it and what is it for?

rocksdb
  • Embedded key/value storage

  • LSM Tree (Log-Structured Merge-Tree)

  • High-performance (data locality)

  • Persistent, optimized for SSD

RocksDB is similar to 'TreeMap<K,V>'

  • Stores keys and values in binary format

  • Lexicographic sorting

  • Iterator (snapshot view)

  • Delete Range (deleteRange)
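
A hedged sketch of the same operations through the rocksdbjni bindings (the library Kafka Streams itself uses; the path is a placeholder, exception handling is reduced to a throws clause):

static void rocksDemo() throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/rocks-demo")) {
        db.put("key".getBytes(), "value".getBytes()); // TreeMap.put
        byte[] value = db.get("key".getBytes());      // TreeMap.get
        try (RocksIterator it = db.newIterator()) {   // snapshot-view iterator
            for (it.seekToFirst(); it.isValid(); it.next()) {
                System.out.println(new String(it.key()));
            }
        }
    }
}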

Writing "Bet Totalling App"

What is the total payout for the bets placed if a given outcome happens?

counting topology

@Bean Topology

KStream<String, Bet> input = streamsBuilder.
    stream(BET_TOPIC, Consumed.with(Serdes.String(),
                      new JsonSerde<>(Bet.class)));

KStream<String, Long> counted =
    new TotallingTransformer()
        .transformStream(streamsBuilder, input);
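
transformStream is a helper from the demo repo; presumably it registers a key-value store and hooks the transformer into the topology. A hedged sketch of that standard wiring (STORE_NAME comes from the repo; a plain Kafka Transformer would fetch the store from its ProcessorContext in init()):

streamsBuilder.addStateStore(
        Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore(STORE_NAME),
                Serdes.String(), Serdes.Long()));

KStream<String, Long> counted = input.transform(
        TotallingTransformer::new, STORE_NAME);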

Bets totalling

@Override
public KeyValue<String, Long> transform(String key, Bet value,
                    KeyValueStore<String, Long> stateStore) {
    long current = Optional
        .ofNullable(stateStore.get(key))
        .orElse(0L);
    current += value.getAmount();
    stateStore.put(key, current);
    return KeyValue.pair(key, current);
}

StateStore is available in tests

@Test
void testTopology() {
    inputTopic.pipeInput(...);
    inputTopic.pipeInput(...);

    KeyValueStore<String, Long> store =
        topologyTestDriver
        .getKeyValueStore(TotallingTransformer.STORE_NAME);

    assertEquals(..., store.get(...));
    assertEquals(..., store.get(...));
}

Demo: Rebalancing / Replication

  • Rebalance/replication of partitions when starting/stopping workers.

Replication of local state into a topic

$ kafka-topics --zookeeper localhost --describe

Topic:bet-totalling-demo-app-totalling-store-changelog
PartitionCount:10
ReplicationFactor:1
Configs:cleanup.policy=compact
counting topology changelog

Partitioning and local state

local partitioning oneworker

Partitioning and local state

local partitioning 1

Partitioning and local state

local partitioning 2

Partitioning and local state

local partitioning 25

Partitioning and local state

local partitioning 3

Partitioning and local state

local partitioning 4

Partitioning and local state

local partitioning 5

Partitioning and local state

local partitioning 6

Repartition

through
  • Explicitly, via
    repartition(Repartitioned<K, V> repartitioned)

  • Implicitly, when a key-changing operation is followed by a stateful operation

Duplicate implicit repartitioning

KStream source = builder.stream("topic1");
KStream mapped = source.map(...);
KTable counts = mapped.groupByKey().aggregate(...);
KStream sink = mapped.leftJoin(counts, ...);
doublethrough

Getting rid of duplicate repartitioning

KStream source = builder.stream("topic1");
KStream shuffled = source.map(...).repartition(...);
KTable counts = shuffled.groupByKey().aggregate(...);
KStream sink = shuffled.leftJoin(counts, ...);
implicitthrough

It is better not to touch the key unless you have to

Key only       Key and Value       Value only

selectKey      map                 mapValues
               flatMap             flatMapValues
               transform           transformValues
               flatTransform       flatTransformValues
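
Why this matters, sketched with the Bet stream from the earlier slides: mapValues keeps the key by construction, so a downstream stateful operation needs no repartition topic, while map marks the stream as 'key may have changed' even when this particular mapper leaves the key alone:

// No repartitioning before count(): the key is untouched by construction
input.mapValues(v -> Math.round(v.getAmount() * v.getOdds()))
     .groupByKey()
     .count();

// Forces an internal repartition topic before count()
input.map((k, v) -> KeyValue.pair(k, Math.round(v.getAmount() * v.getOdds())))
     .groupByKey()
     .count();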