Merged
@@ -18,7 +18,7 @@ prerequisites:

author: Zach Lasiuk

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false

@@ -14,9 +14,55 @@ learning_objectives:
prerequisites:
- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from a cloud service provider or an on-premises Arm server.

# START generated_summary_faq
generated_summary_faq:
template_version: summary-faq-v3
generated_at: '2026-06-30T21:36:51Z'
generator: ai
ai_assisted: true
ai_review_required: true
model: gpt-5
prompt_template: summary-faq-v3
source_hash: a4cf1d9161b3a32e29694415762eda419752e1c3144662d5e131b6553f0a58e3
summary_generated_at: '2026-06-30T21:36:51Z'
summary_source_hash: a4cf1d9161b3a32e29694415762eda419752e1c3144662d5e131b6553f0a58e3
faq_generated_at: '2026-06-30T21:36:51Z'
faq_source_hash: a4cf1d9161b3a32e29694415762eda419752e1c3144662d5e131b6553f0a58e3
summary: >-
You'll deploy Hugging Face Sentiment Analysis models
with PyTorch on Arm servers and measure how they perform. Starting from a working Ubuntu
22.04 Arm environment, you'll run three NLP models with the Sentiment Analysis
pipeline, record baseline results, and enable BFloat16 fast math kernels to assess the
impact on inference performance. By
the end, you'll compare before-and-after measurements to confirm the effect of BFloat16
on this workload.
faqs:
- question: Which environment do the instructions assume?
answer: >-
The instructions target an Arm server running Ubuntu 22.04 LTS. They've been tested on
AWS Graviton3 (c7g) instances.
- question: What system resources should I provision before running the steps?
answer: >-
Use an Arm server instance with at least four CPU cores and 8 GB of RAM. This capacity supports
running the three sentiment analysis models and collecting measurements.
- question: Which framework and model source will I use in this Learning Path?
answer: >-
You'll use PyTorch to run NLP Sentiment Analysis models sourced from Hugging Face.
- question: How should I measure the performance uplift from BFloat16 fast math kernels?
answer: >-
First, run the models to collect a baseline using the same Sentiment Analysis pipeline.
Then enable BFloat16 fast math kernels on supported Arm Neoverse-based AWS Graviton3 processors,
rerun the same workload, and compare measurements.
- question: Which models will I evaluate and what should I have at the end?
answer: >-
You'll evaluate three NLP models with the Sentiment Analysis pipeline. By the end, you
should have deployed the models on your Arm server and recorded baseline and BFloat16-enabled
performance results for comparison.
# END generated_summary_faq

author: Pareena Verma

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false

@@ -13,12 +13,59 @@ learning_objectives:
- Measure performance improvements on Graviton4 instances

prerequisites:
- An [Arm based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate
cloud service provider.
- An [Arm-based instance](/learning-paths/servers-and-cloud-computing/csp/) from an appropriate cloud service provider.

# START generated_summary_faq
generated_summary_faq:
template_version: summary-faq-v3
generated_at: '2026-06-30T21:37:19Z'
generator: ai
ai_assisted: true
ai_review_required: true
model: gpt-5
prompt_template: summary-faq-v3
source_hash: 1701b37580fe5d012a5e6fd322307742656a748dfb766fd48914011167386e95
summary_generated_at: '2026-06-30T21:37:19Z'
summary_source_hash: 1701b37580fe5d012a5e6fd322307742656a748dfb766fd48914011167386e95
faq_generated_at: '2026-06-30T21:37:19Z'
faq_source_hash: 1701b37580fe5d012a5e6fd322307742656a748dfb766fd48914011167386e95
summary: >-
You'll implement and benchmark bitmap scanning for database-style
workloads on Arm Neoverse V2–based servers, such as AWS Graviton4. First, you'll build a compact
bit vector in C and add baseline and improved scalar scanning routines. Then, you'll implement Neon
and SVE vectorized versions to process data in wider chunks. You'll use a benchmarking harness
that measures each approach so the relative behavior of scalar, Neon, and SVE implementations can
be compared on an Arm-based Linux instance. By the end, you'll run a single C program
that exercises all variants and produces timing results suitable for side-by-side evaluation.
faqs:
- question: Where should I place the code as I follow the steps?
answer: >-
Use a single source file named `bitvector_scan_benchmark.c`. Add the bit vector type, helper
functions, scalar scan routines, Neon and SVE implementations, and the benchmarking code
into this file as directed.
- question: What must the bitmap data structure contain before I can add the scan functions?
answer: >-
The data structure must include a byte array that holds the bits, the physical size in bytes, and the logical
size in bits. The same file must also contain helpers to generate and analyze test bitmaps.
- question: In what order should I implement and test the scanning approaches?
answer: >-
Start with the per-bit scalar baseline, then the optimized scalar version, followed by the
Neon implementation, and finally SVE. After each addition, run the benchmark to compare
against the previous versions.
- question: What result should I expect from the benchmarking step?
answer: >-
The framework measures elapsed time for each scan function over a chosen number of iterations
and tracks how many set-bit positions were found. Use the same input bitmap and iteration
count when comparing implementations.
- question: How can I exercise different workload characteristics when benchmarking?
answer: >-
Use the provided bitmap generation helpers to create datasets with varying densities. Sparse
and dense bitmaps highlight different behaviors across the scalar, Neon, and SVE implementations.
# END generated_summary_faq

author: Pareena Verma

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false
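
Review note: to help check the generated FAQ answers above against the Learning Path, here is a minimal sketch in C of the structure they describe: a byte array that holds the bits, the physical size in bytes, the logical size in bits, and the first two scan variants (per-bit baseline and an improved scalar version). The names `bitvector_t`, `bitvector_create`, `bitvector_set`, `scan_per_bit`, and `scan_skip_zero_bytes` are hypothetical and are not taken from `bitvector_scan_benchmark.c`. Skipping all-zero bytes is one plausible scalar improvement, assumed here for illustration; the Learning Path's own optimization may differ.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: field and function names are illustrative only. */
typedef struct {
    uint8_t *data;     /* byte array that holds the bits */
    size_t size_bytes; /* physical size in bytes */
    size_t size_bits;  /* logical size in bits */
} bitvector_t;

/* Allocate a zeroed bit vector for num_bits bits; returns NULL on failure. */
bitvector_t *bitvector_create(size_t num_bits) {
    bitvector_t *bv = malloc(sizeof *bv);
    if (bv == NULL) {
        return NULL;
    }
    bv->size_bits = num_bits;
    bv->size_bytes = (num_bits + 7) / 8;
    /* Request at least one byte so a zero-length vector still gets storage. */
    bv->data = calloc(bv->size_bytes ? bv->size_bytes : 1, 1);
    if (bv->data == NULL) {
        free(bv);
        return NULL;
    }
    return bv;
}

void bitvector_free(bitvector_t *bv) {
    if (bv != NULL) {
        free(bv->data);
        free(bv);
    }
}

/* Set the bit at pos; positions past the logical size are ignored. */
void bitvector_set(bitvector_t *bv, size_t pos) {
    if (pos < bv->size_bits) {
        bv->data[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

/* Baseline: test every bit. Writes at most cap positions to out and
 * returns the total number of set bits found. */
size_t scan_per_bit(const bitvector_t *bv, uint32_t *out, size_t cap) {
    size_t found = 0;
    for (size_t pos = 0; pos < bv->size_bits; pos++) {
        if (bv->data[pos / 8] & (1u << (pos % 8))) {
            if (found < cap) {
                out[found] = (uint32_t)pos;
            }
            found++;
        }
    }
    return found;
}

/* Improved scalar (assumed variant): skip bytes that are entirely zero,
 * then test the eight bits of each remaining byte. */
size_t scan_skip_zero_bytes(const bitvector_t *bv, uint32_t *out, size_t cap) {
    size_t found = 0;
    for (size_t byte = 0; byte < bv->size_bytes; byte++) {
        uint8_t b = bv->data[byte];
        if (b == 0) {
            continue;
        }
        for (unsigned bit = 0; bit < 8; bit++) {
            size_t pos = byte * 8 + bit;
            if (pos >= bv->size_bits) {
                break;
            }
            if (b & (1u << bit)) {
                if (found < cap) {
                    out[found] = (uint32_t)pos;
                }
                found++;
            }
        }
    }
    return found;
}
```

Both scan functions return the same count and positions for the same input, which matches the FAQ's advice to compare implementations on an identical bitmap. On a sparse bitmap the byte-skipping version does less work per byte, so it is the kind of input where the two are likely to diverge in timing.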

@@ -20,9 +20,59 @@ prerequisites:
- GCC version 13.3 or later to compile the example program ([GCC](/install-guides/gcc/))
- A system with sufficient hardware performance counters to use the [TopDown](/install-guides/topdown-tool) methodology. This typically requires running on bare metal rather than a virtualized environment.

# START generated_summary_faq
generated_summary_faq:
template_version: summary-faq-v3
generated_at: '2026-06-30T21:37:54Z'
generator: ai
ai_assisted: true
ai_review_required: true
model: gpt-5
prompt_template: summary-faq-v3
source_hash: 8654b656131bf1d529e11d85f874f7a81c01f7207340d2a606b6fd2d80bfad04
summary_generated_at: '2026-06-30T21:37:54Z'
summary_source_hash: 8654b656131bf1d529e11d85f874f7a81c01f7207340d2a606b6fd2d80bfad04
faq_generated_at: '2026-06-30T21:37:54Z'
faq_source_hash: 8654b656131bf1d529e11d85f874f7a81c01f7207340d2a606b6fd2d80bfad04
summary: >-
You'll apply LLVM BOLT post-link optimization to AArch64 binaries
using profile-guided code layout. Starting with a deliberately inefficient BubbleSort workload
to make instruction locality issues visible, you'll install a suitable BOLT
release, set up a working directory, and gather profiles with BRBE, SPE, instrumentation,
or PMU sampling. Using a small set of Arm TopDown indicators, you'll judge
whether a program is front-end bound and a good candidate for BOLT. You'll then run BOLT with
collected profiles to reorganize code layout and evaluate the impact using performance
metrics and profiling data to confirm improvements in instruction delivery and locality.
faqs:
- question: Which BOLT version should I use if my package manager installs an older release?
answer: >-
Use LLVM BOLT 22.1.0 or later. If your distribution provides an older version, install a
prebuilt LLVM release instead (for example, LLVM 22.1.5) to match the required features.
- question: Where do the example’s build and profiling outputs go?
answer: >-
You'll find outputs in three directories: `out` for binaries, `prof` for profile data,
and `heatmap` for visualization artifacts. Keeping these separate makes it easier to rerun
steps and compare results.
- question: How do I know if my program is a good candidate for BOLT?
answer: >-
Check a small set of Arm TopDown indicators related to instruction delivery and code locality.
Programs that appear front-end bound, with inefficient instruction fetch and poor locality,
are strong candidates for code layout optimization with BOLT.
- question: What should I use if my kernel doesn't meet the BRBE or SPE requirements?
answer: >-
If your kernel is older than BRBE requires, use SPE, provided the kernel meets the SPE version
requirement. If neither is available, you can use instrumentation or PMU
event sampling to collect profiles.
- question: What result should I expect after running BOLT with profiles?
answer: >-
You should be able to evaluate changes using performance metrics and profiling data. Look
for improvements in instruction delivery indicators and evidence of better code locality
in the optimized binary.
# END generated_summary_faq

author: Paschalis Mpeis

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false

@@ -16,9 +16,59 @@ learning_objectives:
prerequisites:
- An Arm-based Linux system with [BOLT](/install-guides/bolt/) and [Linux Perf](/install-guides/perf/) installed

# START generated_summary_faq
generated_summary_faq:
template_version: summary-faq-v3
generated_at: '2026-06-30T21:38:23Z'
generator: ai
ai_assisted: true
ai_review_required: true
model: gpt-5
prompt_template: summary-faq-v3
source_hash: 84a8b96fe7df302e0a2a6e4645bbb6170b45a3e0b55e0ea3682ec47663d34819
summary_generated_at: '2026-06-30T21:38:23Z'
summary_source_hash: 84a8b96fe7df302e0a2a6e4645bbb6170b45a3e0b55e0ea3682ec47663d34819
faq_generated_at: '2026-06-30T21:38:23Z'
faq_source_hash: 84a8b96fe7df302e0a2a6e4645bbb6170b45a3e0b55e0ea3682ec47663d34819
summary: >-
You'll use BOLT with Linux Perf profiles to optimize an Arm application
and its shared libraries. First, you'll instrument a MySQL server build to generate workload-specific
profiles, create separate traces for read-heavy and write-heavy runs, and merge them to broaden
code layout guidance. Then, you'll rebuild OpenSSL to make `libssl.so` and
`libcrypto.so` suitable for BOLT, collect profiles, and apply optimizations independently
from the main binary. Finally, you'll compare results across baseline, isolated, and merged
scenarios using a consistent Sysbench configuration to assess the
impact of application and library-level optimizations on throughput and latency.
faqs:
- question: What output should I expect after running an instrumented workload with BOLT?
answer: >-
BOLT produces a profile file in `.fdata` format, such as `profile-writeonly.fdata`. These files
are later used to optimize the binary and can be merged to improve coverage.
- question: Should I reuse the BOLT-instrumented mysqld binary for additional workloads or create
a new one?
answer: >-
Either approach works. The steps allow reusing the previously instrumented binary or generating
a new instrumented variant as long as you produce a new `.fdata` profile for each workload.
- question: Which shared libraries are targeted for optimization, and what if the system copies
are stripped?
answer: >-
The path optimizes `libssl.so` and `libcrypto.so`. If system libraries are stripped, rebuild
OpenSSL from source with relocations enabled so BOLT can instrument and optimize them.
- question: Do I need to rebuild the application to benefit from optimized shared libraries?
answer: >-
The shared libraries are optimized independently of the application binary. The path focuses
on rebuilding OpenSSL for symbol information and then integrating the optimized libraries
with the application.
- question: What test configuration is used to compare baseline and BOLT-optimized results?
answer: >-
Sysbench is run with `--time=0 --events=10000` to complete exactly 10,000 requests per thread.
Use this consistent configuration to compare baseline, application-only, and merged-with-library
optimization scenarios.
# END generated_summary_faq

author: Gayathri Narayana Yegna Narayanan

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false

50 changes: 48 additions & 2 deletions content/learning-paths/servers-and-cloud-computing/bolt/_index.md
@@ -1,5 +1,5 @@
---
title: Learn how to optimize an application with BOLT
title: Optimize an application with BOLT
description: Learn how to build, profile, and optimize Arm executables using BOLT post-link binary optimization to improve application performance through code layout improvements.

minutes_to_complete: 30
@@ -15,9 +15,55 @@ prerequisites:
- An Arm-based system running Linux with [BOLT](/install-guides/bolt/) and [Linux Perf](/install-guides/perf/) installed. The Linux kernel should be version 5.15 or later. Earlier kernel versions can be used, but some Linux Perf features may be limited or not available. For [SPE](./bolt-spe) the version should be 6.14 or later.
- (Optional) A second, more powerful Linux system to build the software executable and run BOLT.

# START generated_summary_faq
generated_summary_faq:
template_version: summary-faq-v3
generated_at: '2026-06-30T21:38:54Z'
generator: ai
ai_assisted: true
ai_review_required: true
model: gpt-5
prompt_template: summary-faq-v3
source_hash: 2e9ac8a3c73b7d3d59fe6ba20fb6d61fc2b7e5e9320aaadc20af0a8bbb3ff959
summary_generated_at: '2026-06-30T21:38:54Z'
summary_source_hash: 2e9ac8a3c73b7d3d59fe6ba20fb6d61fc2b7e5e9320aaadc20af0a8bbb3ff959
faq_generated_at: '2026-06-30T21:38:54Z'
faq_source_hash: 2e9ac8a3c73b7d3d59fe6ba20fb6d61fc2b7e5e9320aaadc20af0a8bbb3ff959
summary: >-
You'll use BOLT to post-link optimize an Arm Linux executable
based on real execution profiles. First, you'll prepare a target system for profiling and optionally
a separate build/BOLT system, then choose a profiling method — Perf samples, ETM, or SPE — to
collect runtime behavior into a `perf.data` file. You'll convert the profile for BOLT and run BOLT
to reorder code layout and emit a new optimized executable. Finally, you'll compare the resulting
binary against the original to observe improvements.
faqs:
- question: How should I choose between Perf samples, ETM, and SPE for profiling?
answer: >-
Use the dedicated sections for each method. Perf samples provide general sampling data,
while ETM and SPE record richer branch information. Follow the method that is available on
your system and gives the profiling detail you need.
- question: Can I profile on one Arm Linux system and run BOLT on another?
answer: >-
Yes. The target system runs the application and collects the profile, and a separate Linux
system can build the application and run BOLT. Transfer the executable and the collected
profile files between systems as needed.
- question: What file should exist after recording with Perf before converting for BOLT?
answer: >-
Expect a `perf.data` file. Perf prints sample counts or data size when recording completes,
which indicates that profiling output was captured and is ready for conversion.
- question: What version of Perf do I need for the SPE workflow?
answer: >-
Use Linux Perf version 6.14 or later for SPE to capture the required branch stack information.
Verify the version before recording so the profile contains all needed fields.
- question: How do I check results after BOLT creates the optimized executable?
answer: >-
Run the same workload with the original and the optimized executables and compare outcomes.
The optimized executable should show improved performance relative to the original after
the steps are completed.
# END generated_summary_faq

author: Jonathan Davies

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false

@@ -19,9 +19,57 @@ prerequisites:
- Familiarity with [Docker](https://docs.docker.com/get-started/) and container concepts
- A [GitHub account](https://github.com/join) to host your application repository

# START generated_summary_faq
generated_summary_faq:
template_version: summary-faq-v3
generated_at: '2026-06-30T21:39:35Z'
generator: ai
ai_assisted: true
ai_review_required: true
model: gpt-5
prompt_template: summary-faq-v3
source_hash: 4bda206717eef380430009f859826d9bcf820442d13492cd3c22a114561e2917
summary_generated_at: '2026-06-30T21:39:35Z'
summary_source_hash: 4bda206717eef380430009f859826d9bcf820442d13492cd3c22a114561e2917
faq_generated_at: '2026-06-30T21:39:35Z'
faq_source_hash: 4bda206717eef380430009f859826d9bcf820442d13492cd3c22a114561e2917
summary: >-
You'll provision an Arm-based Google Cloud C4A virtual machine powered by Google
Axion, install Docker, Docker Buildx, and the Buildkite agent, and connect the agent to a
Buildkite queue. Next, you'll create a small Flask application and Dockerfile in a GitHub repository,
then configure a Buildkite pipeline that uses Buildx to build a multi-architecture container
image, and push it to Docker Hub. You'll use Ubuntu or SUSE on the VM
and validate that the agent is online. By the end, you'll have a published
image and a running Flask service to confirm the build.
faqs:
- question: Which Google Cloud instance type and OS should I use for the VM?
answer: >-
Use a Google Axion C4A Arm VM, specifically `c4a-standard-4` with 4 vCPUs and 16
GB memory. You can select either Ubuntu or SUSE Linux Enterprise Server as the OS.
- question: Where do I create the Buildkite agent token, and when do I use it?
answer: >-
Create an agent token in your Buildkite organization after signing in (GitHub sign-in is
supported). You use this token during the agent installation and configuration on the C4A
VM.
- question: How do I confirm the Buildkite agent is connected and assigned to the right queue?
answer: >-
After configuring the agent and queue, check the Agents page in Buildkite; the agent should
appear online with the expected queue. If it doesn't, check the agent configuration and
queue name, then repeat the verification step.
- question: What files should my GitHub repository contain for the example application?
answer: >-
Add a Dockerfile and a Python file named `app.py`. The provided Dockerfile uses `python:3.12-slim`,
installs Flask, exposes port 5000, and runs the app.
- question: What result should I expect after the pipeline runs successfully?
answer: >-
A multi-architecture Docker image for Arm and x86 is built with Docker Buildx and pushed
to Docker Hub. You then start the containerized Flask application and verify that it runs
as the final validation step.
# END generated_summary_faq

author: Jason Andrews

generate_summary_faq: true
generate_summary_faq: false
rerun_summary: false
rerun_faqs: false
