Wednesday, April 2, 2025

GitHub’s scalable architecture enables seamless growth with the help of Azure’s flexible infrastructure. By harnessing the power of cloud-based services, GitHub can dynamically allocate resources to meet increasing demands for its popular collaboration platform. This strategic partnership empowers GitHub to rapidly deploy new features, ensure high availability, and provide a best-in-class experience for developers worldwide.

Flex Consumption enables rapid and large-scale deployment options on a serverless framework, facilitating prolonged performance execution times, personalized networking, event sizing choices, and concurrent management capabilities.

GitHub is home to over 100 million developers and 420 million total repositories across its platform, serving as a hub for the world’s software creators. To ensure seamless and secure operation, GitHub aggregates vast amounts of data through a proprietary pipeline comprising multiple components. Notwithstanding its initial design for fault tolerance and scalability, as GitHub’s progress continued, the company felt compelled to reassess the pipeline to ensure it remained aligned with evolving demands. 

The safety and security of our community is of utmost importance to us.

GitHub’s Senior Director of Platform Safety Stephan Miehe —

GitHub worked closely with its parent company, Microsoft, to find a solution. To successfully course-stream an event at scale, the GitHub team developed a performance-optimized application that runs on AWS Lambda, a service recently launched for public preview. Flex Consumption offers rapid and massive scalability options on a serverless model, enabling long-running performance executions, customizable networking, event-driven metrics, and concurrent management.

Recently, GitHub successfully handled 1.6 million transactions per second through a single Flex Consumption application initiated by a network-constrained event hub, showcasing its scalability and reliability in high-pressure scenarios.

The open-source community has always been a bastion of innovation and collaboration; yet, we must acknowledge that the rise of social media has introduced new vulnerabilities to our collective ecosystem. As the Senior Director of Platform Safety at GitHub, I’ve witnessed firsthand the devastating consequences of unchecked online interactions.

chart, histogram

A glance again

One of GitHub’s limitations stemmed from its internal messaging system governing the flow of data between telemetry providers and consumers. The application was first launched with a foundation in Java-based executables and. As the app struggled to manage an unprecedented influx of data, exceeding 460 gigabytes daily, it began to falter under the strain, its performance slowly degrading as a result.

To achieve optimal performance, each user of the legacy system demanded a customized setup and labor-intensive manual calibration process. The Java codebase proved vulnerable to disruption and tedious to diagnose, with increasing compute overhead driving up costs for managing people environments.

Miehe says. They began deliberating on their choices.

Accustomed to improving serverless code, the group focused on leveraging native Azure capabilities and achieved.

—Stephan Mieche, GitHub’s Senior Director of Platform Safety.

A performance app can automatically scale its queue primarily based on the volume of logging visitors. Scaling potential of this system is substantial, considering the flexibility inherent in its modular design. With further development and refinement, it’s likely to accommodate an increasing number of users with ease, without sacrificing performance or reliability. When GitHub began collaborating with the Azure Capabilities team, the Flex Consumption plan was in a private preview phase. Based on a novel foundation, Flex Consumption empowers up to 1,000 partitions, delivering accelerated target-driven scaling capabilities. The product team successfully developed a proof-of-concept that scaled more than twice the size of the legacy platform’s biggest dataset at the time, demonstrating Flex Consumption’s ability to handle the pipeline.

The power of open-source software. It’s a double-edged sword. On one hand, it enables collaboration on a massive scale. On the other hand, it means that vulnerabilities can spread like wildfire. As GitHub’s Senior Director of Platform Safety, I’ve seen firsthand how quickly a single vulnerability can be exploited and wreak havoc across an entire ecosystem.

— Stephan Miehe, GitHub Senior Director of Platform Safety

Setting clear intentions for personal growth and success?

GitHub collaborated closely with the Azure Capabilities product group to explore the full potential of Flex Consumption. A brand-new performance app is developed in Python to consume events from Event Hubs. The system aggregates numerous individual communications into a single comprehensive message, which is then transmitted to clients for further handling.

Determining the optimal quantity for each batch required some trial and error, as each performance run inherently involves a minimal amount of overhead to be factored in. During periods of maximum usage, the platform processes over one million events per second. Seeking to optimize performance, the GitHub team endeavored to identify the sweet spot in execution. There is an excessive quantity present, but insufficient memory capacity exists to process the batch. When processing a batch in small quantities, numerous iterations are required, which hinders productivity and reduces overall efficiency.

The optimal quantity was found to be 5,000 messages per batch. Miehe reviews.

This resolution has built-in flexibility. The group has the flexibility to tailor message batches to diverse scenarios, trusting that the target-based scaling capabilities will adapt effectively across a wide range of situations. Here is the rewritten text:

The Azure Capabilities scaling mechanism monitors the number of unprocessed messages in the event hub and automatically adjusts the instance count based on the batch size and partition count, ensuring seamless scalability. In larger instances, the performance app scales proportionally with each occasion hub partition, potentially resulting in a massive 1,000 deployments for extremely large-scale hubs.

—Stephan Miehe, GitHub Senior Director of Platform Safety

Azure Capabilities enables event-driven architecture by integrating various event sources, including Event Hubs, Azure Queues, and Service Bus topics.

Reaching behind the digital community

As a service mannequin, it liberates builders from the burden of overseeing numerous infrastructure-related responsibilities. Although serverless code may operate independently, it is still bounded by the limitations inherent in the network environments where it executes. Consumers of flexible infrastructure solutions face challenges when leveraging enhanced digital networking capabilities within a virtual network environment. Applications will be securely isolated within a virtual network (VNet), allowing them to seamlessly communicate with other applications located within separate, secure VNets without compromising performance.

As a pioneering user of Flex Consumption, GitHub capitalized on the improvements being continually rolled out to the Azure Capabilities platform in the background. The Flex Consumption offering leverages Legion, a cutting-edge, internally developed platform-as-a-service (PaaS) backbone, designed to enhance community functionality and streamline processes during peak demand scenarios. Legion can seamlessly integrate into an existing virtual network (VNet) in milliseconds. When a performance-intensive application scales up, each newly allocated compute instance boots up and is ready for execution, including outbound VNet connectivity, within 624 milliseconds at the 50th percentile and 1,022 milliseconds at the 90th percentile. By leveraging GitHub’s messaging processing app, developers can seamlessly integrate with Azure’s Event Hubs, securing sensitive data behind a digital perimeter without compromising performance or incurring significant latency. For the past 18 months, the Azure Capabilities platform has experienced a significant growth of around 53%, with improvements seen across all areas, languages, and platforms.

Working by challenges

The project significantly expanded the scope of both the GitHub and Azure capabilities engineering teams’ expertise. With perseverance and determination, the team overcame numerous hurdles to achieve this milestone in production efficiency.

  • During its initial deployment, GitHub encountered a significant backlog of pending messages requiring processing, ultimately triggering an integer overflow in the Azure Capacity scaling logic, which swiftly triggered a scale-out event.
  • During the second iteration, the system’s performance was hindered significantly due to the lack of connection pooling mechanisms in place. The team revised the performance code to efficiently leverage existing connections across successive executions.
  • The system experienced a bottleneck of approximately 800,000 occurrences per second during the community stage, but the underlying cause remained ambiguous. The Azure Capabilities team has identified and investigated a critical flaw in the Azure SDK’s AMQP transport implementation, specifically within the obtain buffer configuration, following an extensive probe. This achievement enabled GitHub to process over one million events per second, a significant milestone.

Establishing throughput milestones requires meticulous planning, precise execution, and strategic communication. Effective milestones are built upon a foundation of well-defined objectives, realistic timelines, and measurable key performance indicators (KPIs). To achieve this, consider the following best practices:

Identify clear goals and KPIs: Establish specific, achievable goals and corresponding KPIs to track progress.

Set realistic timelines: Establish realistic deadlines for milestone completion to avoid undue stress and ensure timely achievement.

Define roles and responsibilities: Clearly assign tasks and accountability to team members to foster collaboration and minimize confusion.

Establish communication protocols: Develop a communication strategy that ensures timely updates, feedback, and issue resolution.

Monitor and adjust: Regularly track progress, identify deviations from plan, and make adjustments as needed to maintain momentum.

Develop contingency plans: Prepare for potential setbacks by creating backup plans and identifying alternative solutions.

As energy levels surged, so did the responsibilities, a reality acknowledged by Miehe, whose team was granted “many dials to twiddle” in the context of Flex Consumption.

Prior to deploying changes, he advises conducting sporadic and early testing, an approach in line with GitHub’s established best practices for pull requests. Effective adoption of best practices enabled GitHub to achieve its targets:

  • Receiving multiple messages at once significantly enhances productivity. Processing thousands of occasion hub messages in a single execute operation significantly boosts the system’s overall performance.
  • Miehe’s team analyzed batches ranging from 100,000 occurrences to 100 instances, ultimately optimizing batch sizes up to 5,000 for swift processing.
  • GitHub leverages Terraform to build the production app and Azure Event Hubs environments. By provisioning all elements collectively, the need for manual intervention in handling the ingestion pipeline is significantly reduced. Miehe’s team could potentially respond with remarkable alacrity to feedback from the product group.

As the GitHub group runs the newly launched platform concurrently with its legacy counterpart, they are carefully monitoring performance and setting a target date for migration. 

Miehe explains.

The group was delighted. As Miehe says,

Discover options with Azure Capabilities

Related Articles

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Latest Articles