1. In MPK, each task is mapped to an individual SM. The amount of work handled by a task is similar to that of a thread block in the traditional kernel-per-operator approach.
2. TL;DR: MPK automatically analyzes inter-task dependencies by tracking the input and output tensors associated with each task. Longer version: MPK uses imap, omap, and fmap (see Section 2 of the Mirage paper) to determine each task’s input and output tensors. A dependency is introduced between task A and task B if A produces any tensor elements that B consumes—that is, if A's outputs overlap with B's inputs.
> Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself the output of a prior matmul (+ nonlinearity), would the dependees be all of the output tile tasks of the operator that produced A corresponding to those M_BLOCK rows?
Exactly. In this case, all output tile tasks that consume those M_BLOCK rows of A will depend on all tasks responsible for producing the corresponding parts of A in the previous operator.
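To make the overlap rule concrete, here is a minimal Python sketch under assumed tilings (not MPK's actual data structures or API; `producer_writes`, `consumer_reads`, and `build_deps` are hypothetical names): each producer task of the previous operator writes one M_BLOCK x N_BLOCK tile of A, each consumer matmul task reads a full M_BLOCK-row band of A, and a dependency is added whenever the written and read element sets intersect.

```python
# Hypothetical sketch of the overlap-based dependency rule, not MPK code.
from itertools import product

def producer_writes(row_blk, col_blk, m_block, n_block):
    # Elements of A written by the producer task for tile (row_blk, col_blk).
    return {(r, c)
            for r in range(row_blk * m_block, (row_blk + 1) * m_block)
            for c in range(col_blk * n_block, (col_blk + 1) * n_block)}

def consumer_reads(row_blk, m_block, a_cols):
    # Elements of A read by a consumer matmul task computing output row band row_blk.
    return {(r, c)
            for r in range(row_blk * m_block, (row_blk + 1) * m_block)
            for c in range(a_cols)}

def build_deps(a_rows, a_cols, m_block, n_block):
    # Consumer row band j depends on producer tile (i, k) iff the elements
    # the producer writes intersect the elements the consumer reads.
    deps = {}
    for cons_row in range(a_rows // m_block):
        reads = consumer_reads(cons_row, m_block, a_cols)
        deps[cons_row] = [
            (pr, pc)
            for pr, pc in product(range(a_rows // m_block), range(a_cols // n_block))
            if producer_writes(pr, pc, m_block, n_block) & reads
        ]
    return deps

# Each consumer row band ends up depending on every producer tile in the same row band,
# matching the matmul example above.
print(build_deps(a_rows=256, a_cols=256, m_block=64, n_block=64))
```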