> [with the default enable.auto.commit=true] Kafka consumers may automatically mark offsets as committed, regardless of whether they have actually been processed by the application. This means that a consumer can poll a series of records, mark them as committed, then crash—effectively causing those records to be lost
That's never been my understanding of auto-commit; that would be a crazy default, wouldn't it?
The docs say this:
> when auto-commit is enabled, every time the poll method is called and data is fetched, the consumer is ready to automatically commit the offsets of messages that have been returned by the poll. If the processing of these messages is not completed before the next auto-commit interval, there’s a risk of losing the message’s progress if the consumer crashes or is otherwise restarted. In this case, when the consumer restarts, it will begin consuming from the last committed offset. When this happens, the last committed position can be as old as the auto-commit interval. Any messages that have arrived since the last commit are read again. If you want to reduce the window for duplicates, you can reduce the auto-commit interval
I don't find it amazingly clear, but overall my understanding from this is that offsets are committed _only_ if the processing finishes. Tuning the auto-commit interval helps with duplicate processing, not with lost messages, as you'd expect for at-least-once processing.
It is a little surprising, and I agree, the docs here are not doing a particularly good job of explaining it. It might help to ask: if you don't explicitly commit, how does Kafka know when you've processed the messages it gave you? It doesn't! It assumes any message it hands you is instantaneously processed.
Auto-commit is a bit like handing someone an ice cream cone, then immediately walking away and assuming they ate it. Sometimes people drop their ice cream immediately after you hand it to them, and never get a bite.
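For contrast, here is a minimal sketch (not anyone's production code) of what telling Kafka "I actually ate the ice cream" looks like: auto-commit disabled, and the application committing only after it has finished the batch. The topic name, config values, and the process() helper are placeholders.

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");
        props.setProperty("enable.auto.commit", "false"); // the application decides when to commit
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                process(record);
            }
            // Only now does Kafka learn the batch was handled. Crash before this line
            // and the uncommitted records are re-delivered on restart: duplicates, not loss.
            consumer.commitSync();
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        // hypothetical application logic
    }
}

With the default auto-commit, that final commitSync() effectively happens for you inside a later poll(), whether or not your processing ever ran.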
Information on the internet about this seems unreliable, confusing and contradictory... It's crazy for something so critical, especially when it's enabled by default.
Note: Using automatic offset commits can also give you "at-least-once" delivery, but the requirement is that you must consume all data returned from each call to poll(Duration) before any subsequent calls, or before closing the consumer.
E.g. the following commits roughly every 10s, on each call to `poll`; it doesn't automagically commit every 5s in the background.
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AutoCommitExample {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "test");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "5000");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("my-topic"));
        while (true) {
            // The auto-commit check runs inside poll(): offsets from earlier batches
            // are committed here if auto.commit.interval.ms has elapsed.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value());
            }
            Thread.sleep(10_000); // slow "processing"; the next poll() is what actually commits
        }
    }
}
Just a note: I am not claiming it is working correctly, only that the way the client decides when to commit is clearly documented, and that it works as expected in a simple scenario.
> if you don't explicitly commit, how does Kafka know when you've processed the messages it gave you?
I did expect that auto-commit still involved an explicit commit. I expected that it meant that the consumer side would commit _after_ processing a message/batch _if_ it had been >= autocommit_interval since the last commit. In other words, that it was a functionality baked into the Kafka client library (which does know when a message has been processed by the application). I don't know if it really makes sense, I never really thought hard about it before!
I'm still a bit skeptical... I'm pretty sure (although not positive) that I've seen consumers with auto-commit get stuck because of timeouts much greater than the auto-commit interval, and yet retry the same message in a loop.
Auto commit has always seemed super shady. Manual commit I have assumed is safe though - something something vector clocks - and it’d be really interesting to know if that trust is misplaced.
What is the process and cost for having you do a Jepsen test for something like that?
It is a bit of hair-splitting in some sense, but the key concept here is that just because the message was delivered to the Kafka client successfully does not mean it was processed by the application.
You will have to explicitly ack if you want that guarantee. For a concrete example, let's say all you do with a message is write it to a database. With auto-commit, as soon as that message reaches your client handler, it is effectively ack'ed, but you probably only want that ack to happen after a successful insert into the DB. The most likely scenario to cause unprocessed messages is that the DB is down for whatever reason (maybe a network link is down, or k8s or even a firewall config now prevents you from reaching it), and at some point during the outage your client goes down too, maybe because an engineer restarts it to see if the problem goes away.
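A rough sketch of that commit-after-successful-insert pattern with the plain Java consumer; the saveToDatabase() helper, topic name, and class name are made up for illustration, and it assumes a consumer created with enable.auto.commit=false:

import java.time.Duration;
import java.util.Arrays;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

class CommitAfterDbWrite {
    // Assumes a consumer created with enable.auto.commit=false.
    static void consumeIntoDatabase(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Arrays.asList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                saveToDatabase(record); // hypothetical: throws if the DB is unreachable
                // Ack (commit) this record only after the insert succeeded.
                // offset + 1 because the committed offset is the next record to read.
                consumer.commitSync(Collections.singletonMap(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
            }
        }
    }

    static void saveToDatabase(ConsumerRecord<String, String> record) {
        // hypothetical placeholder for a JDBC insert or similar
    }
}

Committing every record like this is slow; in practice you would usually write the whole batch and commit once per poll, which widens the duplicate window but keeps the at-least-once guarantee.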
It is my understanding that the reason this exists is high-performance situations. You have some other system that can figure out if something failed, but with this feature you can keep moving the high-water mark forward so that you don't have to redo as much. And if you get the timing right and there is a failure, you can assume that when you restart you'll be getting some stuff that you already processed. The problem is when you haven't finished handling a message before the auto-commit fires. In my reading it is meant to commit well after processing, but it does certainly seem like there's a contradiction: it auto-commits, but only covers stuff from so many milliseconds before the auto-commit time?
I can maybe give some justification for why this feature exists. It's designed for synchronous, single-threaded consumers which do something like this:
loop {
    1. Call poll
    2. Durably process the messages
}
I think a point of confusion here is that the auto-commit check happens on the next call to poll—not asynchronously after the timeout. So you should only be able to drop writes if you are storing the messages without durably processing them (which includes any kind of async/defer/queues/etc.) before calling poll again.
(I should say—this is the documented behavior for the Java client library[0]—it's possible that it's not actually the behavior that's implemented today.)
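To make that concrete, here is a sketch (assumed names, not from the docs) of the async hand-off pattern that can drop writes even though auto-commit "only" fires inside poll():

import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

class AsyncHandoffAntiPattern {
    // Assumes a consumer created with the default enable.auto.commit=true.
    static void riskyAsyncConsume(KafkaConsumer<String, String> consumer) {
        ExecutorService workers = Executors.newFixedThreadPool(4);
        consumer.subscribe(Arrays.asList("my-topic"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
            for (ConsumerRecord<String, String> record : records) {
                // Hand-off only: the record sits in the pool's queue, not yet processed.
                workers.submit(() -> handleRecord(record));
            }
            // The loop returns to poll() immediately. Once auto.commit.interval.ms has
            // elapsed, that poll() commits offsets for the records above, even if the
            // workers never get to them. Crash now and this group skips those records.
        }
    }

    static void handleRecord(ConsumerRecord<String, String> record) {
        // hypothetical work (DB write, HTTP call, etc.)
    }
}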
The Kafka protocol is torn between being high-level and low-level, and as a result it does neither particularly well. Auto commit is a high-level feature that aims to make it easier to build simple applications without needing to really understand all of the moving pieces, but obviously can fail if you don't use it as expected.
I'd argue that today end users shouldn't be using the Kafka client directly: use a proper high-level implementation that will get the details right for you (for data use cases this is probably a stream processing engine, for application use cases it's something like a durable execution engine).
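For example, my understanding is that a Kafka Streams app never touches offsets directly; with the default at-least-once guarantee the framework commits only after records have gone through your processing step. A rough sketch (application id and topic are placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;

public class StreamsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("my-topic", Consumed.with(Serdes.String(), Serdes.String()))
               // The framework decides when offsets are committed; by default it does so
               // only after records have flowed through the processing step below.
               .foreach((key, value) -> System.out.println(key + " -> " + value));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}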