<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://elijahoyekunle.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://elijahoyekunle.com/" rel="alternate" type="text/html" /><updated>2025-09-05T04:24:34+00:00</updated><id>https://elijahoyekunle.com/feed.xml</id><title type="html">Elijah Oyekunle</title><subtitle>Software Engineer</subtitle><entry><title type="html">Notes on “Multi-Resource Packing for Cluster Schedulers” (Tetris)</title><link href="https://elijahoyekunle.com/blog/2022/04/14/Paper-Summary-Tetris.html" rel="alternate" type="text/html" title="Notes on “Multi-Resource Packing for Cluster Schedulers” (Tetris)" /><published>2022-04-14T00:00:00+00:00</published><updated>2022-04-14T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/04/14/Paper-Summary-Tetris</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/04/14/Paper-Summary-Tetris.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf">https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf</a></p>

<p><strong>Problem</strong></p>

<p>Modern data analysis makes use of large clusters which execute <em>jobs</em> consisting of several <em>tasks</em>. There can be different kinds of jobs running in a cluster at a time, each of them having varying resource requirements along CPU, memory, disk, and network. Tasks can be constrained to just one or multiple resource types at a time.</p>

<p>When scheduling tasks in a cluster, a scheduler must consider each task’s requirements across all resources in order to maximize task throughput and speed up job completion. If a task is given plenty of one resource but too little of another, it is bottlenecked on the scarce resource and takes longer to finish.</p>

<p>The authors show that existing schedulers have a limited ability to pack tasks because they typically consider only CPU or memory, ignoring network and disk requirements. These schedulers also typically allocate resources based only on fairness. The authors show that this can lead to resource <em>over-allocation</em> or <em>fragmentation</em>, which delays job completion and increases makespan.</p>

<p><strong>Approach</strong></p>

<p>The problem of efficiently allocating multiple resources to tasks is similar to the multi-dimensional bin packing problem, but with some important differences such as the need to accommodate varying task requirements, task elasticity (tasks can function with less than peak demand), online arrival of jobs in the cluster, dependencies between tasks, and other cluster activity such as evacuation and ingestion of new data.</p>

<p>The authors present Tetris, a cluster scheduler that aims to solve this problem by packing tasks onto machines based on their requirements along multiple resources. Tetris adaptively learns task requirements, monitors available cluster resources, and uses a packing heuristic to select a task-to-machine allocation that improves makespan. Tetris also uses a multi-resource version of <em>shortest remaining time first</em> (SRTF) to reduce the average job completion time.</p>
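
<p>To make the packing heuristic concrete, here is a minimal sketch of one way to score task-to-machine fit: a dot product between a task’s demand vector and a machine’s free resources, each normalized by machine capacity. The function names, the resource keys, and the normalization are my own assumptions for illustration, not the authors’ code.</p>

```python
from typing import Dict, Optional

Resources = Dict[str, float]  # hypothetical keys, e.g. {"cpu": 4, "mem": 8}


def fits(demand: Resources, free: Resources) -> bool:
    """A task is eligible only if every demand fits in the machine's free capacity."""
    return all(free.get(r, 0.0) >= demand[r] for r in demand)


def alignment(demand: Resources, free: Resources, capacity: Resources) -> float:
    """Dot product of the demand and free-resource vectors, normalized by capacity."""
    return sum((demand[r] / capacity[r]) * (free[r] / capacity[r]) for r in demand)


def pick_task(tasks: Dict[str, Resources], free: Resources,
              capacity: Resources) -> Optional[str]:
    """Return the eligible task with the highest alignment score, or None."""
    eligible = {name: d for name, d in tasks.items() if fits(d, free)}
    if not eligible:
        return None
    return max(eligible, key=lambda name: alignment(eligible[name], free, capacity))
```

<p>On a machine with little free CPU but plenty of free memory, this score favors a memory-heavy task that fits over a small task, which is the intuition behind packing along several resources at once.</p>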

<p>The authors show that prior work on <em>Pareto-efficient</em> and <em>work-conserving</em> fair allocations does not necessarily yield the best completion time and makespan. They demonstrate that with a little unfairness, much better performance can be achieved in resource schedulers, and they expose this trade-off with a knob.</p>

<p>Tetris is incorporated into the YARN framework in Hadoop. It estimates task demands from previous executions of the job and previously completed tasks in the job, and also uses resource trackers to monitor available resources at each node. Performance evaluations show that Tetris improves makespan and job completion time by 30% in deployment and up to 40% in simulations over Facebook traces.</p>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>The paper is well-written, with detailed discussions of the background and problem to be solved which aids understanding of the paper.</p>

<p>Tetris runs a resource tracker process on each node which then reports to a central resource manager that handles scheduling. This leads to a central point of failure and possible performance bottleneck for the entire system. Evaluating how this part of the system scales with increasing cluster size is a possible follow-up idea. Also, while estimating task demands from previous executions can work well for most tasks, it may not work so well for tasks with varying resource demands and usage over the course of their execution.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://www.cs.cmu.edu/~xia/resources/Documents/grandl_sigcomm14.pdf]]></summary></entry><entry><title type="html">Notes on “FairCloud: Sharing the Network in Cloud Computing”</title><link href="https://elijahoyekunle.com/blog/2022/04/13/Paper-Summary-FairCloud.html" rel="alternate" type="text/html" title="Notes on “FairCloud: Sharing the Network in Cloud Computing”" /><published>2022-04-13T00:00:00+00:00</published><updated>2022-04-13T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/04/13/Paper-Summary-FairCloud</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/04/13/Paper-Summary-FairCloud.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://www.mosharaf.com/wp-content/uploads/faircloud-sigcomm12.pdf">https://www.mosharaf.com/wp-content/uploads/faircloud-sigcomm12.pdf</a></p>

<p><strong>Problem</strong></p>

<p>In cloud environments, CPU and memory are two important shared resources provided to tenants, and the amount each tenant receives is proportional to how much it pays. Cloud providers also offer guarantees on these resources (such as performance). The network is also an important shared resource, but cloud providers rarely offer guarantees on network bandwidth, and it is not scaled to payment.</p>

<p>Networks are difficult to share because a VM’s network allocation depends on the other VMs running on the same machine, as well as other VMs that communicate over each link used by the VM. In contrast, CPUs and memory do not have this complication.</p>

<p><strong>Approach</strong></p>

<p>The authors of this paper present a study of the problem of network sharing in cloud environments. They present three key requirements that network sharing should meet:</p>

<ul>
  <li><strong>min-guarantee</strong>: Tenants should have guarantees on the minimum network bandwidth.</li>
  <li><strong>high utilization</strong>: When network demands are low, active applications should be able to scale up bandwidth consumption, thus improving network utilization and boosting the applications’ performance.</li>
  <li><strong>network proportionality</strong>: Network resources should be divided among tenants in proportion to their payments.</li>
</ul>
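
<p>To illustrate how dividing bandwidth in proportion to payment interacts with high utilization, here is a small sketch of weighted max-min sharing on a single link. It is not one of the paper’s policies; the function name and the single-link setting are assumptions of mine, and minimum guarantees are not modeled.</p>

```python
from typing import Dict


def share_link(capacity: float, demand: Dict[str, float],
               payment: Dict[str, float]) -> Dict[str, float]:
    """Weighted max-min (water-filling) split of one link's bandwidth."""
    alloc = {t: 0.0 for t in demand}
    active = {t for t in demand if demand[t] > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_weight = sum(payment[t] for t in active)
        # Offer each active tenant a slice proportional to its payment.
        offers = {t: remaining * payment[t] / total_weight for t in active}
        satisfied = {t for t in active if offers[t] >= demand[t]}
        if not satisfied:
            # Everyone wants more than offered: split strictly by payment.
            for t in active:
                alloc[t] = offers[t]
            break
        # Tenants with low demand take only what they need; the unused
        # bandwidth is redistributed in the next round (work conservation).
        for t in satisfied:
            alloc[t] = demand[t]
            remaining -= demand[t]
        active -= satisfied
    return alloc
```

<p>When one tenant’s demand is low, its unused share is handed to the remaining tenants in proportion to what they pay, so the link stays fully used while the backlogged tenants still see a payment-proportional split.</p>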

<p>Using several examples, the authors show that there is a fundamental tradeoff between min-guarantee and network proportionality, and another tradeoff between network proportionality and high utilization. In order to properly navigate these tradeoffs, they define five useful properties: work conservation, strategy-proofness, utilization incentives, communication-pattern independence, and symmetry.</p>

<p>Building upon these, the authors propose three allocation policies that have different tradeoffs:</p>

<ul>
  <li><strong>Proportional Sharing at Link-level (PS-L)</strong>: Achieves link proportionality and can satisfy the five network properties listed above except strategy-proofness.</li>
  <li><strong>Proportional Sharing at Network-level (PS-N)</strong>: Provides better proportionality at the network level, but it does not fully provide utilization incentives.</li>
  <li><strong>Proportional Sharing on Proximate Links (PS-P)</strong>: Provides minimum bandwidth guarantees in tree-based topologies.</li>
</ul>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>The paper convincingly argues that there cannot be a one-size-fits-all approach to network sharing while trying to balance between the three main requirements - min-guarantee, proportionality, and high utilization. It is well-written, with a detailed explanation of the problem which aids understanding of the paper.</p>

<p>Some examples used to explain the problem seem to assume tenants’ knowledge of the cluster network topology and try to prevent behavior intended to exploit this knowledge (strategy-proofness). For most public cloud deployments, tenants don’t have much control over where their VMs are deployed in the cloud and where they are in relation to other tenants’ VMs. For data centers operated by single organizations, strategy-proofness may not be a priority since all “tenants” (which may be different teams) are part of the same organization.</p>

<p>Overall, I think customer demand will be important to determine whether or not the implementation and deployment effort of these allocation policies in a data center will be worth it. A follow-up idea would be a concrete, deployable system containing the different network properties this paper presents, with options to create new policies by selecting among them.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://www.mosharaf.com/wp-content/uploads/faircloud-sigcomm12.pdf]]></summary></entry><entry><title type="html">Notes on “THEMIS: Fair and Efficient GPU Cluster Scheduling”</title><link href="https://elijahoyekunle.com/blog/2022/04/13/Paper-Summary-THEMIS.html" rel="alternate" type="text/html" title="Notes on “THEMIS: Fair and Efficient GPU Cluster Scheduling”" /><published>2022-04-13T00:00:00+00:00</published><updated>2022-04-13T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/04/13/Paper-Summary-THEMIS</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/04/13/Paper-Summary-THEMIS.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://www.usenix.org/system/files/nsdi20-paper-mahajan.pdf">https://www.usenix.org/system/files/nsdi20-paper-mahajan.pdf</a></p>

<p><strong>Problem</strong></p>

<p>An increasing number of organizations are incorporating machine learning (ML) models into their products in order to unlock new business opportunities. However, ML model training can be time- and resource-intensive, with jobs executing in parallel across several GPUs. Due to the management overhead and cost of running GPU clusters, it is much more efficient for organizations to consolidate all GPU resources onto a single shared cluster. While this can lead to a more efficient setup, it does not necessarily lead to fairness in sharing these GPU resources.</p>

<p>Cluster users want their tasks to finish as fast as possible, and the authors describe this requirement as <em>sharing incentive (SI)</em>. SI implies that if there are N users sharing a cluster C, then each user’s performance should be no worse than using a private cluster of size <em>C/N</em>. If SI is absent, users sacrifice performance and will prefer to deploy their own private clusters. The authors show that existing state-of-the-art schedulers (e.g. DRF, Quincy) violate this key requirement because they use techniques designed for big data workloads and ignore unique characteristics of ML workloads such as their <em>long durations</em> and <em>placement sensitivity</em>.</p>

<p><strong>Approach</strong></p>

<p>In addition to SI, the authors show that ignoring placement sensitivity affects the Pareto Efficiency (PE) and Envy Freeness (EF) properties. They introduce a new metric called Finish Time Fairness, which is the ratio of an application’s shared finish-time to its independent finish-time. Sharing incentive is attained when this ratio is at most 1. An application’s finish-time fairness is a function of the GPU allocation that it receives.</p>

<p>To perform these allocations, the authors propose a multi-round partial allocation algorithm with the strategy proofness (SP) property which also satisfies PE and EF properties.</p>

<p>At the beginning of a round (<em>visibility phase</em>), the <em>arbiter</em> asks apps for their finish-time fairness estimates and then selects a (tunable) fraction of the active apps with the greatest estimates, which are also the apps at risk of not meeting SI. Each app scheduler contains an <em>agent</em> which submits <em>bids</em> to the arbiter that reflect the app’s new finish-time fairness metric from acquiring different GPU subsets.</p>
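
<p>The selection step of the visibility phase can be sketched as follows. I take the metric to be shared finish time divided by independent finish time, so that larger values mean an app is worse off in the shared cluster; the function names and the rounding rule are assumptions for illustration, and the bidding and partial allocation steps are not modeled.</p>

```python
import math
from typing import Dict, List


def finish_time_fairness(shared_finish: float, independent_finish: float) -> float:
    """rho = T_shared / T_independent; above 1 the app is worse off when sharing."""
    return shared_finish / independent_finish


def apps_offered_gpus(rho: Dict[str, float], fraction: float) -> List[str]:
    """Pick the given fraction of active apps with the greatest rho, worst-off first."""
    count = max(1, math.ceil(fraction * len(rho)))
    ranked = sorted(rho, key=lambda app: rho[app], reverse=True)
    return ranked[:count]
```

<p>A smaller fraction concentrates each round’s GPUs on the apps furthest from meeting SI, while a larger fraction gives the arbiter more bids to choose from.</p>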

<p>Then, in the <em>allocation phase</em>, the arbiter picks the winning bids based on the partial allocation algorithm and a leftover allocation scheme and notifies the agents. Each agent in turn notifies its ML app scheduler, which then allocates the GPU resources among its constituent jobs.</p>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>The paper provides a very detailed evaluation of the performance of THEMIS across different kinds of workloads and shows that it increases fairness and efficiency compared to other schedulers, even under high cluster contention. However, the performance and resource usage of the Agent and Arbiter as cluster size grows are not evaluated. Since the arbiter evaluates bids from the different ML app scheduler agents, it will be helpful to see how the system itself scales, especially since the resources consumed by the system are resources that cannot be offered to customers for additional revenue. Evaluating this can also serve as a good follow-up idea.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://www.usenix.org/system/files/nsdi20-paper-mahajan.pdf]]></summary></entry><entry><title type="html">Notes on “Scaling Distributed Machine Learning with In-Network Aggregation” (SwitchML)</title><link href="https://elijahoyekunle.com/blog/2022/04/06/Paper-Summary-SwitchML.html" rel="alternate" type="text/html" title="Notes on “Scaling Distributed Machine Learning with In-Network Aggregation” (SwitchML)" /><published>2022-04-06T00:00:00+00:00</published><updated>2022-04-06T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/04/06/Paper-Summary-SwitchML</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/04/06/Paper-Summary-SwitchML.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://www.usenix.org/system/files/nsdi21-sapio.pdf">https://www.usenix.org/system/files/nsdi21-sapio.pdf</a></p>

<p><strong>Problem</strong></p>

<p>Machine learning solutions these days rely on sophisticated models which train on increasingly large data sets. To cope with the increased training time these models require, ML practitioners now use distributed training which makes use of large clusters consisting of terabytes in storage, hundreds of nodes equipped with hardware accelerators, and superfast networking connecting them.</p>

<p>Training jobs now make use of dozens to hundreds of workers which may even be globally distributed due to edge-based data collection and processing, and training alternates between <em>computation</em> phases where each node runs algorithms on locally-present data, and <em>communication</em> phases where nodes synchronize the models by sending and receiving updates to each other. This alternation typically continues for the life of the ML model.</p>

<p>As computation speed has increased over the years, the communication phase now has an increasing impact on overall training time, bottlenecked by the network performance.</p>

<p><strong>Approach</strong></p>

<p>The authors of this paper try to alleviate the network bottleneck by placing an aggregation primitive in the network to accelerate distributed ML workloads. During synchronization phases, aggregating model updates inside the network reduces the amount of data that needs to be transmitted, which can increase throughput and shorten training time. Programmable switches can perform integer aggregation, but because their computation power is limited, the authors implement a combined switch-host architecture where end-hosts are responsible for managing reliability and performing more complex computations.</p>

<p>The authors decouple a simple aggregation operation into an addition phase (on the switch) and a division phase (on the end-host, for efficiency). Since addition is commutative and associative, the order of packet arrivals at the switch does not matter and correctness is still preserved.</p>

<p>Each arriving packet carries a pool index that identifies a particular aggregator to be used, and a vector of integers to be aggregated. Once all workers have sent vectors for the same pool, the switch sends the result to the workers who are then able to deterministically reuse the pool index for a new set of vectors.</p>
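
<p>The pool-based aggregation described above can be sketched with a toy model. The class and function names are hypothetical, the <code>scale</code> parameter stands in for whatever fixed-point conversion the hosts apply before sending integers, and loss handling is left out entirely.</p>

```python
from typing import List, Optional


class SwitchPool:
    """Toy model of a switch holding a fixed pool of integer aggregators."""

    def __init__(self, num_workers: int, pool_size: int, vector_len: int) -> None:
        self.num_workers = num_workers
        self.sums = [[0] * vector_len for _ in range(pool_size)]
        self.seen = [0] * pool_size

    def on_packet(self, pool_index: int, values: List[int]) -> Optional[List[int]]:
        """Add one worker's vector; return the total once every worker has sent."""
        slot = self.sums[pool_index]
        for i, v in enumerate(values):
            slot[i] += v  # integer addition only, so arrival order is irrelevant
        self.seen[pool_index] += 1
        if self.seen[pool_index] != self.num_workers:
            return None
        result = list(slot)
        # Reset the slot so workers can reuse this pool index for the next chunk.
        self.sums[pool_index] = [0] * len(slot)
        self.seen[pool_index] = 0
        return result


def host_average(total: List[int], num_workers: int, scale: int) -> List[float]:
    """Division phase on the end host: undo the integer scaling and average."""
    return [t / (scale * num_workers) for t in total]
```

<p>Because a slot only releases its result after every worker has contributed, a worker cannot start reusing a pool index until the previous aggregation for that index has completed.</p>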

<p>This mechanism also provides a simple form of flow control: until all workers have received the results from a previous step, the next step cannot finish processing. A packet loss recovery mechanism is also built on this by keeping both the current version and one older version of the results on the switch to facilitate retransmissions.</p>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>This paper attempts to optimize the communication phase of distributed machine learning algorithms in <em>rack-scale architectures</em>. It is well-presented and easy to read, with helpful algorithm pseudocode to aid understanding. The authors suggest scaling beyond a rack by aggregating on multiple levels; the implications of this on model accuracy were not evaluated. Accuracy could degrade after one or two aggregation steps. Evaluating this is a possible follow-up idea.</p>

<p>There’s a lot of research focused on such optimizations, such as Gaia (Hsieh et al, 2017) which focuses on optimizing communication across <em>globally distributed</em> data centers but uses the <em>parameter server</em> architecture. Other approaches include Google’s Federated Learning which brings model training to the network edge and only sends aggregated updates to the cloud. Cartel (Daga et al, 2019) also improves ML model communication in geo-distributed data centers.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://www.usenix.org/system/files/nsdi21-sapio.pdf]]></summary></entry><entry><title type="html">Notes on “Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN” (RMT)</title><link href="https://elijahoyekunle.com/blog/2022/04/05/Paper-Summary-RMT.html" rel="alternate" type="text/html" title="Notes on “Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN” (RMT)" /><published>2022-04-05T00:00:00+00:00</published><updated>2022-04-05T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/04/05/Paper-Summary-RMT</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/04/05/Paper-Summary-RMT.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://people.cs.rutgers.edu/~sn624/552-F20/papers/rmt.pdf">https://people.cs.rutgers.edu/~sn624/552-F20/papers/rmt.pdf</a></p>

<p><strong>Problem</strong></p>

<p>Software-Defined Networking (SDN) physically separates the roles of the control and forwarding planes in a switch and provides an open interface between them. Providing programmatic control to the forwarding plane allows network operators to add new functionality to their network.</p>

<p>OpenFlow is an open interface that allows the control plane to program the forwarding plane, and is based on an approach known as “Match-Action”. A subset of the packet bytes is matched against a table that specifies the corresponding action to be performed on a particular matched entry.</p>
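
<p>As a rough illustration of the match-action idea, the sketch below models a single exact-match table in software. The class, field names, and actions are hypothetical, and real switches also support other match kinds (such as ternary or longest-prefix) that are not shown here.</p>

```python
from typing import Callable, Dict, Optional, Tuple

Packet = Dict[str, object]
Action = Callable[[Packet], Packet]


def forward(port: int) -> Action:
    """Build an action that sends the packet out of the given port."""
    def apply(pkt: Packet) -> Packet:
        return {**pkt, "out_port": port}
    return apply


def drop(pkt: Packet) -> Packet:
    """Action that marks the packet as dropped (no output port)."""
    return {**pkt, "out_port": None}


class MatchActionTable:
    """Exact-match table: selected header fields form the key, the entry names an action."""

    def __init__(self, match_fields: Tuple[str, ...], default: Action) -> None:
        self.match_fields = match_fields
        self.default = default
        self.entries: Dict[Tuple[Optional[object], ...], Action] = {}

    def add_entry(self, key: Tuple[object, ...], action: Action) -> None:
        self.entries[key] = action

    def process(self, pkt: Packet) -> Packet:
        key = tuple(pkt.get(f) for f in self.match_fields)
        return self.entries.get(key, self.default)(pkt)
```

<p>In this picture, the control plane is whatever calls <code>add_entry</code>, and the forwarding plane is the <code>process</code> path that every packet takes.</p>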

<p>Existing match-action hardware is not easily reconfigurable, so supporting new features frequently requires replacing the hardware. If we could have more flexibility in the hardware, then we could support new types of packet processing at run-time. However, there is a natural trade-off between programmability and speed, so more flexibility may come at the cost of speed.</p>

<p><strong>Approach</strong></p>

<p>The simplest approach to match-table processing is a Single Match Table (SMT) model where sets of packet header fields are matched against entries in a single match table. An improvement on this model is the Multiple Match Table (MMT) which allows smaller match tables to be matched by a subset of packet fields.</p>

<p>The authors explore how to achieve programmability without giving up performance by using a Match-Table model known as the Reconfigurable Match-Action Table Model (RMT) which is a refinement of the MMT. The four key improvements that RMT makes over MMT for data planes are:</p>

<ul>
  <li>Field definitions can be altered and new fields added,</li>
  <li>The number, topology, widths, and depths of match tables can be specified,</li>
  <li>New actions can be defined, and</li>
  <li>Arbitrarily modified packets can be placed in specified queues for output at any subset of ports, with a queueing discipline specified for each queue.</li>
</ul>

<p>The hardware architecture is a 640 Gb/s switch chip that has an aggregate throughput of 960M packets/s. The design provides 32 physical match stages at both ingress and egress, 16 parsers, and 1 egress and ingress pipeline.</p>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>The paper is well-structured and goes from high-level ideas down to very low-level details in different places. It combines ideas from networking, hardware design, algorithms, and compilers and applies these ideas to switch programmability. However, this makes some parts of the paper hard to read.</p>

<p>In RMT, each pipeline stage can only access local memory, so RMT must allocate memory for a table in the same match+action stage. This conflates memory allocation with match/action processing, which makes table placement challenging and can result in poor resource utilization. Also, the serial pipelined execution order can lead to the under-utilization of hardware resources for programs where matches and actions are imbalanced.</p>

<p>These two problems are addressed in dRMT (S. Chole et al, 2017) whose key idea is to disaggregate the hardware resources of a programmable switch. dRMT separates table memory from the processing stages and also replaces the sequentially-wired pipeline stages with a set of match-action processors.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://people.cs.rutgers.edu/~sn624/552-F20/papers/rmt.pdf]]></summary></entry><entry><title type="html">Notes on “P4: Programming Protocol-Independent Packet Processors”</title><link href="https://elijahoyekunle.com/blog/2022/04/04/Paper-Summary-P4.html" rel="alternate" type="text/html" title="Notes on “P4: Programming Protocol-Independent Packet Processors”" /><published>2022-04-04T00:00:00+00:00</published><updated>2022-04-04T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/04/04/Paper-Summary-P4</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/04/04/Paper-Summary-P4.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://courses.grainger.illinois.edu/ece598hpn/fa2020/papers/p4.pdf">https://courses.grainger.illinois.edu/ece598hpn/fa2020/papers/p4.pdf</a></p>

<p><strong>Problem</strong></p>

<p>Software-Defined Networking (SDN) provides data center operators with programmatic control over their networks. SDN physically separates routers into a <em>control</em> plane and a <em>forwarding</em> plane, and a single control plane can control multiple forwarding planes. The control plane configures the forwarding plane, while the forwarding plane actually processes the packets.</p>

<p>There are multiple forwarding devices from different hardware and software vendors, so the OpenFlow standard has been developed to provide an open, vendor-agnostic interface for programming these forwarding devices. However, the OpenFlow standard has gotten more complicated over the years as a result of trying to expose more switch capabilities to the controller. Therefore, there are a lot more header fields and multiple stages in the specification.</p>

<p><strong>Approach</strong></p>

<p>Instead of repeatedly extending the OpenFlow specification and making it more complicated, the authors of this paper argue for a more general, extensible approach by providing future switches with flexible mechanisms for parsing packets and matching header fields. Exposing these capabilities to controller applications would be a simpler, elegant, and more future-proof approach than the current OpenFlow standard.</p>

<p>For this, the authors introduced a new programming language for Programming Protocol-independent Packet Processors (P4). The three goals of P4 are:</p>

<ul>
  <li>Reconfigurability - Making it possible for programmers to redefine packet parsing and processing after deployment.</li>
  <li>Protocol independence - Making no assumptions about packet formats, but instead making it possible for programmers to define the formats to support by specifying a packet parser for recognizing and extracting fields from the packet headers, and then processing these fields with a collection of match+action tables.</li>
  <li>Target independence - Helping programmers create a target-independent description of the packet processing functionality without knowing the specifics of the underlying hardware, and then compiling this into target-dependent programs.</li>
</ul>

<p>Whereas OpenFlow assumes serial execution of match+action stages, P4 supports parallel or serial execution. The language allows programmers to express serial dependencies between header fields, which lets the compiler determine which tables can be executed in parallel. To facilitate this dependency analysis, P4 has a two-step compilation process: the P4 program is first compiled to Table Dependency Graphs (TDGs), which are then mapped onto a specific target switch.</p>
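
<p>One way to picture what a table dependency graph buys you is to group tables into stages so that tables within a stage do not depend on each other. The sketch below is only an analogy under that assumption; the function name and the table names in the usage note are made up, and a real compiler also has to respect target resource limits.</p>

```python
from typing import Dict, List, Set


def schedule_stages(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tables into stages; tables in one stage can run in parallel.

    Assumes every dependency also appears as a key in depends_on.
    """
    remaining = {table: set(deps) for table, deps in depends_on.items()}
    stages: List[List[str]] = []
    while remaining:
        # A table is ready once all tables it depends on have been placed.
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cyclic or unknown table dependencies")
        stages.append(ready)
        for t in ready:
            del remaining[t]
        for deps in remaining.values():
            deps.difference_update(ready)
    return stages
```

<p>For example, if a hypothetical <code>route</code> table depends on <code>parse_vlan</code> while <code>acl</code> depends on nothing, then <code>acl</code> and <code>parse_vlan</code> land in the same stage and <code>route</code> follows in the next one.</p>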

<p>A P4 program consists of four key components:</p>

<ul>
  <li>Headers - Describe the sequence and structure of a series of fields, including their widths in bits and constraints on their values.</li>
  <li>Parsers - Specify how to identify headers within packets.</li>
  <li>Tables - Define the fields on which a table should match and the actions it may execute.</li>
  <li>Actions - Construct complex actions from a set of simpler protocol-independent primitives.</li>
</ul>

<p>The authors provide a simple example that demonstrates P4 in action and discuss how each of the four key components listed above makes it work.</p>

<p>Whereas OpenFlow assumes a fixed parser, P4 also supports programmable parsers. For devices with programmable parsers, the compiler translates the parser description into a parsing state machine, while for fixed parsers, the compiler only verifies that the parser description is consistent with the target’s parser.</p>
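
<p>A parsing state machine of the kind mentioned above can be sketched as a table of states, each extracting some fields and then choosing the next state from one of them. This is an illustrative model rather than compiler output: the <code>PARSE_GRAPH</code> layout is my own, and the IPv4 state only reads the first four bytes for brevity.</p>

```python
from typing import Dict

# state -> (fields as (name, width in bytes), selector field or None, transitions)
PARSE_GRAPH = {
    "ethernet": ([("dst", 6), ("src", 6), ("ethertype", 2)],
                 "ethertype", {0x8100: "vlan", 0x0800: "ipv4"}),
    "vlan": ([("tci", 2), ("ethertype", 2)],
             "ethertype", {0x0800: "ipv4"}),
    # Truncated on purpose: only the first four bytes of the IPv4 header.
    "ipv4": ([("ver_ihl", 1), ("tos", 1), ("total_len", 2)], None, {}),
}


def parse(packet: bytes, start: str = "ethernet") -> Dict[str, Dict[str, int]]:
    """Walk the parse graph, extracting each header's fields as integers."""
    headers: Dict[str, Dict[str, int]] = {}
    offset = 0
    state = start
    while state is not None:
        fields, selector, transitions = PARSE_GRAPH[state]
        header = {}
        for name, width in fields:
            header[name] = int.from_bytes(packet[offset:offset + width], "big")
            offset += width
        headers[state] = header
        # Unknown selector values end parsing, as does a state with no selector.
        state = transitions.get(header[selector]) if selector else None
    return headers
```

<p>Supporting a new protocol in this model means adding a state and a transition to the graph, with no change to the loop that walks it, which is the essence of a programmable parser.</p>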

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>The approach presented in this paper can help accelerate the development of SDN by separating target device capability from the required effort by programmers, and also future-proofing the current OpenFlow interface. As described by the authors, this was also intended as a proposed idea for a future OpenFlow standard, and integrating this into the OpenFlow protocol itself will be a good next step.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://courses.grainger.illinois.edu/ece598hpn/fa2020/papers/p4.pdf]]></summary></entry><entry><title type="html">Notes on “Sonata: Query-Driven Streaming Network Telemetry”</title><link href="https://elijahoyekunle.com/blog/2022/03/29/Paper-Summary-Sonata.html" rel="alternate" type="text/html" title="Notes on “Sonata: Query-Driven Streaming Network Telemetry”" /><published>2022-03-29T00:00:00+00:00</published><updated>2022-03-29T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/03/29/Paper-Summary-Sonata</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/03/29/Paper-Summary-Sonata.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://www.cs.princeton.edu/~jrex/papers/sonata.pdf">https://www.cs.princeton.edu/~jrex/papers/sonata.pdf</a></p>

<p><strong>Problem</strong></p>

<p>Existing telemetry systems do not allow operators to express queries to perform complex analytics on network traffic data, or scale to large traffic volumes and rates. Some of them can collect and analyze the traffic data in real-time but have limited query expressiveness, while some incur high processing and storage costs that don’t scale to high traffic rates and queries.</p>

<p>Existing telemetry systems either rely on programmable switches or stream processors, and this usually determines whether they are trading off scalability for expressiveness or vice versa. This is because programmable switches can scale to high data rates but the queries they can support are limited by the capabilities and memory in the data plane. On the other hand, stream processors can express more complex queries but cannot scale to high data rates.</p>

<p><strong>Approach</strong></p>

<p>The authors introduce Sonata (Streaming Network Traffic Analysis) which is a query-driven network telemetry system. The key idea is that programmable switches and stream processors share a common processing model, which is applying an ordered set of transformations over structured data in a pipeline. What Sonata tries to do is to combine the strengths of both technologies into a single system, and thus is able to operate at line rate for high traffic volumes and rates, while still supporting expressive queries.</p>

<p>Sonata provides a declarative interface that can express queries for a wide range of telemetry tasks. To enable scalable and real-time execution, Sonata partitions each query across the programmable switches and stream processors, while trying to run as much of the query as possible on the programmable switches at line rates. This helps to reduce the load on the stream processor.</p>

<p>Since a lot of queries typically try to find “needles in a haystack”, Sonata implements dynamic query refinement based on the query and workload in an attempt to reduce the stream processor workload. Sonata uses historical packet traces to refine the input queries dynamically. To do this, Sonata modifies the input queries to start at a coarser level of granularity than requested, and then subsequently chooses finer granularities that reduce the load on the stream processor. This can introduce additional delay in detecting the traffic, but that is an acceptable tradeoff for the reduced load.</p>
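
<p>The coarse-to-fine idea can be sketched for a simple “destinations receiving more than a threshold of packets” query. Everything here is an assumption for illustration: the function names, the choice of destination IP prefixes as the refinement key, and the fixed prefix lengths. For simplicity the same packet list is replayed at each level, whereas a running system would see a fresh window of traffic each time.</p>

```python
from collections import Counter
from ipaddress import ip_network
from typing import Iterable, Optional, Sequence, Set


def prefix_of(addr: str, length: int) -> str:
    """Return the enclosing prefix of an IPv4 address, e.g. 10.0.0.0/8."""
    return str(ip_network(f"{addr}/{length}", strict=False))


def refine(dst_addrs: Iterable[str], threshold: int,
           levels: Sequence[int] = (8, 16, 32)) -> Set[str]:
    """Count per prefix at each level, keeping only traffic under suspect prefixes."""
    packets = list(dst_addrs)
    suspects: Optional[Set[str]] = None  # None means "look at all traffic"
    previous = 0
    for length in levels:
        counts: Counter = Counter()
        for dst in packets:
            if suspects is None or prefix_of(dst, previous) in suspects:
                counts[prefix_of(dst, length)] += 1
        suspects = {p for p, c in counts.items() if c > threshold}
        previous = length
    return suspects if suspects is not None else set()
```

<p>Each finer level only has to count traffic that falls under a prefix already flagged at the coarser level, which is where the reduction in load comes from; the cost is that the answer only arrives after the last level.</p>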

<p>Evaluation of Sonata reveals that it is able to express different kinds of queries while using fewer lines of code than P4, and about the same as Spark. The performance evaluation also shows that it significantly reduces the workload on the stream processors for single-query and multiple-query scenarios.</p>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>Sonata is able to transparently execute data analytics queries and dynamically adjust execution parameters where possible, without requiring operator awareness of how/where the query is eventually processed. This makes it a very powerful way to execute these queries. Sonata dynamically refines input queries based on historical packet traces, and a further evaluation of how accurately this refinement works for different kinds of queries, and of how it performs, can be a good follow-up idea. It might also be helpful to provide Sonata as libraries that can be integrated into existing programming languages, to improve ease of adoption.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://www.cs.princeton.edu/~jrex/papers/sonata.pdf]]></summary></entry><entry><title type="html">Notes on “Data Center TCP (DCTCP)”</title><link href="https://elijahoyekunle.com/blog/2022/03/14/Paper-Summary-DCTCP.html" rel="alternate" type="text/html" title="Notes on “Data Center TCP (DCTCP)”" /><published>2022-03-14T00:00:00+00:00</published><updated>2022-03-14T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/03/14/Paper-Summary-DCTCP</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/03/14/Paper-Summary-DCTCP.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://people.csail.mit.edu/alizadeh/papers/dctcp-sigcomm10.pdf">https://people.csail.mit.edu/alizadeh/papers/dctcp-sigcomm10.pdf</a></p>

<p><strong>Problem</strong></p>

<p>Three important requirements applications have for the datacenter network are low latency for short flows, high burst tolerance, and high utilization for long flows. The first two requirements are a result of the Partition/Aggregate design pattern in web applications, where requests are broken into pieces and their processing is distributed to workers in lower layers, and then aggregated up the tree to produce a result. Since the end results have tight deadlines, each individual task has a tighter deadline. Tasks that don’t make the deadline get canceled, and this can affect the accuracy of the end result. The last requirement is due to the need to continuously update the internal data structures of these applications.</p>

<p>First, the authors measured a production cluster to understand the nature of DC traffic and to identify the impairments that cause high application latencies. Their analysis split cluster traffic into three classes: query traffic, short message traffic, and continuous background traffic.</p>

<p>Most switches have shallow packet buffers, and they discovered that this causes three specific problems: incast, queue buildup, and buffer pressure.</p>

<p><strong>Approach</strong></p>

<p>In order to solve the above problems, the authors introduced Data Center TCP (DCTCP), with the goal of fulfilling application requirements and using commodity shallow buffered switches. It is designed to work with small queues without loss of throughput.</p>

<p>DCTCP works by reacting to congestion in proportion to the extent of congestion. At each switch, it sets the Congestion Experienced (CE) codepoint of arriving packets if the queue occupancy is greater than a threshold K upon its arrival. When the receiver sees a marked packet, it sets the ECN-Echo flag in the corresponding ACK sent to the sender. It also uses Delayed ACKs by sending one cumulative ACK for every <em>m</em> consecutively received packets that have the same codepoint value.</p>
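
<p>The switch and receiver behaviour described above can be sketched as follows. This is a simplified model, not code from the paper: the function names, the list-based packet representation, and the sample queue lengths are assumptions for illustration.</p>

```python
def mark_packets(queue_lengths, k):
    """Switch side: a packet gets the CE codepoint if the queue it sees exceeds K."""
    return [qlen > k for qlen in queue_lengths]


def receiver_acks(ce_marks, m=2):
    """Receiver side: one cumulative ACK per run of up to m same-codepoint packets.

    Returns a list of (packets_covered, ecn_echo) tuples. A change in the CE
    value closes the current run early so the sender sees where marking changed.
    """
    acks = []
    run_len = 0
    run_ce = None
    for ce in ce_marks:
        if run_len > 0 and ce != run_ce:
            acks.append((run_len, run_ce))  # codepoint changed: ACK the old run
            run_len = 0
        run_ce = ce
        run_len += 1
        if run_len == m:
            acks.append((run_len, run_ce))  # delayed ACK covering m packets
            run_len = 0
    if run_len > 0:
        acks.append((run_len, run_ce))  # flush the last partial run
    return acks


marks = mark_packets([5, 10, 25, 30, 22, 8], k=20)
print(marks)                  # [False, False, True, True, True, False]
print(receiver_acks(marks))   # [(2, False), (2, True), (1, True), (1, False)]
```

<p>Because each ACK covers only packets with the same codepoint, the sender can count exactly how many packets were marked even though ACKs are delayed.</p>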

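<p>The sender keeps a running estimate of the fraction of its packets that were marked, based on the received ACKs. This estimates the probability that the queue size at the switch is greater than K. The sender then reduces its window in proportion to this estimate, which allows senders to gently start reducing their windows when the queue exceeds K.</p>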
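
<p>As a small sketch of the sender side, the DCTCP paper updates the estimate once per window as α ← (1 − g)·α + g·F, where F is the fraction of packets marked in the last window, and cuts the window as cwnd ← cwnd·(1 − α/2). The function names below are illustrative, and g = 1/16 is used only as an example weight.</p>

```python
def update_alpha(alpha, marked, total, g=1 / 16):
    """Exponentially weighted estimate of the fraction of marked packets."""
    fraction = marked / total
    return (1 - g) * alpha + g * fraction


def reduce_window(cwnd, alpha):
    """Cut the congestion window in proportion to the congestion estimate."""
    return cwnd * (1 - alpha / 2)


alpha = 0.0
# Three windows of 10 packets each with 2, 5 and 10 packets marked.
for marked in (2, 5, 10):
    alpha = update_alpha(alpha, marked, total=10)
    print(round(alpha, 4), round(reduce_window(100.0, alpha), 2))
```

<p>When every packet is marked for long enough, α approaches 1 and the window is halved, as in standard TCP. With light marking, α stays small and the window shrinks only slightly.</p>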

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>DCTCP only requires a very minimal change to TCP and only a parameter change to the switches, and this simplicity has helped to improve its adoption in data center environments. Evaluations also show significant performance benefits when compared to TCP.</p>

<p>The authors state that if the traffic sent in the first RTT overflows the buffers, then this can lead to timeouts, since DCTCP can only react after it receives feedback. Another side effect can be a severe oscillation of the actual queue size, because the sender only reacts to congestion after an RTT of delay.</p>

<p>One attempt at resolving these weaknesses was presented by Chen et al (2013). They propose splitting the single threshold K into two thresholds K1 &lt; K &lt; K2, such that the CE flag is set when queue size is K1 before actual congestion is experienced, and stops at K2 before the queue size gets too low. Analysis of this double-threshold DCTCP (DT-DCTCP) shows that it achieves a smaller queue, and has a queue length less sensitive to the growing number of flows. It also achieves a lower tail latency.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://people.csail.mit.edu/alizadeh/papers/dctcp-sigcomm10.pdf]]></summary></entry><entry><title type="html">Notes on “SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks”</title><link href="https://elijahoyekunle.com/blog/2022/03/11/Paper-Summary-SIMON.html" rel="alternate" type="text/html" title="Notes on “SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in Data Center Networks”" /><published>2022-03-11T00:00:00+00:00</published><updated>2022-03-11T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/03/11/Paper-Summary-SIMON</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/03/11/Paper-Summary-SIMON.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://www.usenix.org/system/files/nsdi19-geng.pdf">https://www.usenix.org/system/files/nsdi19-geng.pdf</a></p>

<p><strong>Problem</strong></p>

<p>In order to improve network performance and easily debug problems in distributed applications and networks, it is important to take network measurements and implement monitoring. However, monitoring networks at scale and at near real-time speeds is still very challenging to do.</p>

<p>There are two main approaches to network telemetry - switch-based and end-host-based.</p>

<p>In switch-based telemetry, measurements are captured at each switch and they can be accurate or approximate. Approximate measurements only provide an approximate count of packets/bytes passing through one switch, and they require extra bandwidth to move the measurements to the edge where they undergo a lot of extra processing to generate the network-wide views. Accurate measurements generate a lot of data, which can limit their scalability. They also require each node to be able to perform in-band network telemetry, which isn’t always the case. One general downside of switch-based telemetry is the difficulty in matching network performance to application performance since measurements are taken at the switches.</p>

<p>In edge-based telemetry, events are recorded at the end-hosts. From these, attempts are made to infer the internal network state. One approach to edge-based telemetry is known as <em>network tomography</em>, which attempts to determine internal network state quantities such as queue delays and backlogs based on probes and data packets collected at the network edge. When deployed in wide-area networks, typical approaches to network tomography do not yield much accuracy for two reasons: the underlying network topology is not known, and end-to-end traversal times are much longer than queueing delays.</p>

<p><strong>Approach</strong></p>

<p>Some factors make network tomography viable if we restrict its use to data centers. The network topology is known to the operators, and the typical Clos topology provides multiple paths between nodes. Packet wire times are also negligible in the overall end-to-end traversal time, which is instead dominated by queueing times. Therefore, without requiring any modifications to the existing switching infrastructure, it is possible to have accurate, scalable, and near real-time network measurements which can be easily matched to application performance. This is exactly what the researchers set out to do.</p>

<p>The authors introduce SIMON which uses a network tomography-based approach to reconstruct the full network state variables such as queuing times, link utilization, and queue and link compositions.</p>

<p>First, they introduce a signal processing framework to quantify an appropriate balance between the amount of data collected and accuracy, which they try to keep at 97.5%. Their analysis suggested that optimal reconstruction intervals vary inversely with link speeds.</p>

<p>SIMON uses a mesh of probes to obtain information about the queue sizes and wait times. Probe packets were chosen over data packets because data packets do not yield an accurate network queue reconstruction across multiple queues. To speed up SIMON, they leverage the hierarchical structure of the network topologies, and also use neural networks. Both methods enable SIMON to run in near real-time.</p>
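
<p>To make the tomography idea concrete, here is a minimal sketch that recovers per-link queueing delays from end-to-end probe delays. It is not SIMON’s reconstruction method: it swaps in a plain non-negative least-squares fit solved by projected gradient descent, and the function name, routes, delay values, and step size are all assumptions chosen for this tiny example.</p>

```python
def reconstruct_queue_delays(routes, path_delays, num_links, step=0.2, iters=2000):
    """Estimate per-link queueing delay from end-to-end probe delays.

    routes[i] lists the link indices that probe i traverses, and path_delays[i]
    is its measured delay, assuming wire time is negligible so that the delay
    is the sum of the queueing delays along the path.
    """
    q = [0.0] * num_links
    for _ in range(iters):
        grad = [0.0] * num_links
        for links, measured in zip(routes, path_delays):
            residual = sum(q[link] for link in links) - measured
            for link in links:
                grad[link] += residual
        # Gradient step, then clamp: a queueing delay cannot be negative.
        q = [max(0.0, qi - step * gi) for qi, gi in zip(q, grad)]
    return q


# Three probes over three links; delays in microseconds (made-up numbers).
routes = [[0, 1], [1, 2], [0, 2]]
path_delays = [5.0, 20.0, 25.0]
estimate = reconstruct_queue_delays(routes, path_delays, num_links=3)
print([round(x, 3) for x in estimate])  # [5.0, 0.0, 20.0]
```

<p>The fixed step size is only safe for small examples like this one. A real deployment would have far more links than can be solved naively, which is where the hierarchical decomposition and neural networks mentioned above come in.</p>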

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>The paper presents its arguments and reasoning well. Getting SIMON actually deployed in data centers will still require some development work on the part of the DC operators since the code is not open-sourced. Making the data format compatible with existing visualization and analysis tools will be important to make adoption easier. A follow-up idea is also to explore possible actions that can be taken automatically based on the collected data, e.g. altering packet routes based on real-time network states.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://www.usenix.org/system/files/nsdi19-geng.pdf]]></summary></entry><entry><title type="html">Notes on “PRISM: Rethinking the RDMA Interface for Distributed Systems”</title><link href="https://elijahoyekunle.com/blog/2022/03/10/Paper-Summary-Rethinking-RDMA.html" rel="alternate" type="text/html" title="Notes on “PRISM: Rethinking the RDMA Interface for Distributed Systems”" /><published>2022-03-10T00:00:00+00:00</published><updated>2022-03-10T00:00:00+00:00</updated><id>https://elijahoyekunle.com/blog/2022/03/10/Paper-Summary-Rethinking%20RDMA</id><content type="html" xml:base="https://elijahoyekunle.com/blog/2022/03/10/Paper-Summary-Rethinking-RDMA.html"><![CDATA[<p><strong>Link to Paper:</strong> <a href="https://irenezhang.net/papers/prism-sosp21.pdf">https://irenezhang.net/papers/prism-sosp21.pdf</a></p>

<p><strong>Problem</strong></p>

<p>RDMA is now an important tool to achieve high throughput and reduce latency in modern data centers. However, some applications are unable to efficiently express their distributed protocols with the interfaces provided by RDMA. The efficiency and better performance provided by RDMA quickly disappear when applications need to make extra network round trips or bring in the CPU for some stage of the computation. In such cases, it would simply be better not to bypass the CPU at all.</p>

<p>Examples of distributed systems functionality that need to be supported are: navigating data structures (requires multiple RDMA reads), supporting out-of-place writes (challenging due to multiple concurrent reads), optimistic concurrency control (updates require synchronization), and chaining operations, which involves performing compound operations where subsequent operations depend on previous ones.</p>

<p><strong>Approach</strong></p>

<p>The authors of this paper believe that the best way to support today’s distributed systems functionalities is to extend the basic RDMA interface with additional primitives.</p>

<p>For this, they propose PRISM (Primitives for Remote Interaction with System Memory) which was created to adhere to three core principles: (i) generality: The interfaces should not include any application-specific functionality. This will also make it possible to express a wide domain of applications using a small set of primitives, (ii) minimal interface complexity, and (iii) minimal implementation complexity.</p>

<p>PRISM adds four new features to the RDMA interface:</p>

<ol>
  <li>Indirection: RDMA applications frequently need to traverse remote data structures, which involves following pointers. PRISM allows READ, WRITE, and compare-and-swap (CAS) operations to take indirect arguments which allows an address to be interpreted as the address of a pointer to the actual target.</li>
  <li>Memory Allocation: A server-side process can register a queue of buffers with the NIC. Upon an ALLOCATE request, the NIC pops a buffer from this free list, writes the provided data into it, and returns the address. Used with operation chaining, this can be a quite useful feature.</li>
  <li>Enhanced compare-and-swap: The existing CAS operations provided by the RDMA standard are insufficient to implement performant applications and thus are rarely used in practice. PRISM extends the atomics interface and incorporates indirect addressing as described above, and also supports arithmetic operators in the compare phase.</li>
  <li>Operation chaining: Distributed applications frequently need to perform sequences of data-dependent operations. PRISM enables conditional operations, which delay the execution of an operation until previous operations have completed successfully. It also supports output redirection, which allows the output of an operation to be written to memory instead of returned to the client.</li>
</ol>
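
<p>The primitives above can be illustrated with a toy in-memory model. This is only a sketch of the ideas and not PRISM’s actual API: the class <code>ToyPrismNic</code>, its method names, and the addresses are hypothetical, and a real NIC would execute the chained request in a single network round trip rather than as Python calls.</p>

```python
class ToyPrismNic:
    """Toy model of server memory reached through PRISM-style primitives."""

    def __init__(self, free_buffers):
        self.memory = {}                      # address -> stored value
        self.free_list = list(free_buffers)   # buffers registered by the server

    def write(self, addr, value):
        self.memory[addr] = value

    def read(self, addr, indirect=False):
        """Indirection: optionally treat `addr` as the address of a pointer."""
        if indirect:
            addr = self.memory[addr]
        return self.memory[addr]

    def allocate(self, data):
        """Memory allocation: pop a free buffer, fill it, return its address."""
        addr = self.free_list.pop(0)
        self.memory[addr] = data
        return addr

    def cas(self, addr, expected, new):
        """Compare-and-swap on one location; returns True on success."""
        if self.memory.get(addr) != expected:
            return False
        self.memory[addr] = new
        return True

    def allocate_then_cas(self, slot, expected_ptr, data):
        """Chaining: ALLOCATE, then swing the slot pointer to the new buffer.

        In this toy model a failed CAS simply leaks the allocated buffer.
        """
        new_addr = self.allocate(data)
        return self.cas(slot, expected_ptr, new_addr)


nic = ToyPrismNic(free_buffers=[100, 101])
nic.write(1, 50)        # slot 1 holds a pointer to buffer 50
nic.write(50, "v1")
print(nic.read(1, indirect=True))           # v1
print(nic.allocate_then_cas(1, 50, "v2"))   # True
print(nic.read(1, indirect=True))           # v2
print(nic.allocate_then_cas(1, 50, "v3"))   # False
```

<p>The last call fails because the slot no longer points to buffer 50, which is how an out-of-place write detects that another writer got there first. Readers following the pointer with an indirect read always see either the old or the new value in full.</p>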

<p>In order to demonstrate the wide applicability of the new APIs, the authors enhance three common distributed applications with these new primitives: a key-value store, a replicated block store, and a transactional key-value store. These implementations showed latency and throughput improvements compared to the baseline.</p>

<p><strong>Strengths &amp; Weaknesses and Possible Follow-Up Ideas</strong></p>

<p>This work presents what should be the next step in the evolution of RDMA APIs by trying to make it much easier to take advantage of RDMA’s capabilities. While the software implementation provides a convincing argument, it would be great to see hardware manufacturers implement some of these primitives in their interfaces to improve adoption.</p>]]></content><author><name></name></author><category term="Blog" /><category term="papers" /><summary type="html"><![CDATA[Link to Paper: https://irenezhang.net/papers/prism-sosp21.pdf]]></summary></entry></feed>