# Learning Rust #5 — Shipping a Real CLI (Args, Files, HTTP, Concurrency)
Time to ship something concrete. We’ll build a small CLI that:
- reads a CSV of URLs,
- fetches them concurrently with a configurable limit,
- collects status code + content length,
- and writes a pretty JSON report.
We’ll use clap for args, anyhow for errors, tokio + reqwest for async HTTP, and tracing for logs.
Project setup
Cargo.toml
```toml
[package]
name = "url-audit"
version = "0.1.0"
edition = "2021"

[dependencies]
clap = { version = "4", features = ["derive"] }
anyhow = "1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
reqwest = { version = "0.12", features = ["json", "gzip", "brotli"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["fmt", "env-filter"] }
csv = "1"
```

If you’re on Windows behind a proxy or on a slow link, start with a lower concurrency (e.g. `-c 8`).
The CLI
src/main.rs
```rust
use anyhow::{Context, Result};
use clap::Parser;
use serde::{Deserialize, Serialize};
use tokio::time::{timeout, Duration};
use tracing::{info, Level};
use tracing_subscriber::EnvFilter;

#[derive(Parser, Debug)]
#[command(version, about = "Audit a list of URLs from a CSV and output JSON")]
struct Args {
    /// CSV path with a header 'url'
    #[arg(short, long)]
    input: String,

    /// Output JSON path
    #[arg(short, long, default_value = "report.json")]
    output: String,

    /// Max number of concurrent requests
    #[arg(short = 'c', long, default_value_t = 32)]
    concurrency: usize,

    /// Per-request timeout in seconds
    #[arg(short = 't', long, default_value_t = 10u64)]
    timeout: u64,

    /// Optional custom User-Agent header
    #[arg(long, default_value = "url-audit/0.1")]
    user_agent: String,
}

#[derive(Debug, Deserialize)]
struct InRow {
    url: String,
}

#[derive(Debug, Serialize)]
struct OutRow {
    url: String,
    status: Option<u16>,
    len: Option<u64>,
    error: Option<String>,
}

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt()
        .with_env_filter(EnvFilter::from_default_env().add_directive(Level::INFO.into()))
        .init();

    let args = Args::parse();
    info!("reading input = {}", &args.input);

    let client = reqwest::Client::builder()
        .user_agent(args.user_agent.clone())
        .tcp_nodelay(true)
        .build()
        .context("building HTTP client")?;

    // Read CSV eagerly; fine for small/medium lists. For huge lists, stream lines.
    let mut rdr = csv::Reader::from_path(&args.input)
        .with_context(|| format!("opening CSV: {}", &args.input))?;

    let mut urls: Vec<String> = Vec::new();
    for rec in rdr.deserialize::<InRow>() {
        let row = rec.with_context(|| "parsing CSV row")?;
        if !row.url.trim().is_empty() {
            urls.push(row.url);
        }
    }
    info!(count = urls.len(), "loaded URLs");

    // Concurrency gate
    let sem = std::sync::Arc::new(tokio::sync::Semaphore::new(args.concurrency));
    let mut tasks = Vec::with_capacity(urls.len());

    for url in urls {
        let client = client.clone();
        let permit = sem.clone().acquire_owned().await?; // Owned permit drops with task
        let tmo = Duration::from_secs(args.timeout);
        tasks.push(tokio::spawn(async move {
            let _permit = permit; // keep until the end of this task
            fetch_row(&client, url, tmo).await
        }));
    }

    let mut out = Vec::with_capacity(tasks.len());
    for t in tasks {
        match t.await {
            Ok(row) => out.push(row),
            Err(e) => out.push(OutRow {
                url: "<join-error>".into(),
                status: None,
                len: None,
                error: Some(format!("join error: {e}")),
            }),
        }
    }

    std::fs::write(&args.output, serde_json::to_vec_pretty(&out)?)
        .with_context(|| format!("writing {}", &args.output))?;

    info!("wrote {} rows to {}", out.len(), &args.output);
    Ok(())
}

async fn fetch_row(client: &reqwest::Client, url: String, tmo: Duration) -> OutRow {
    // Clone the URL so the timeout arm below can still report it after
    // `fut` has taken ownership of its copy.
    let u = url.clone();
    let fut = async move {
        match client.get(&u).send().await {
            Ok(resp) => {
                let status = resp.status().as_u16();
                let len = resp
                    .headers()
                    .get(reqwest::header::CONTENT_LENGTH)
                    .and_then(|v| v.to_str().ok())
                    .and_then(|s| s.parse::<u64>().ok());
                OutRow { url: u, status: Some(status), len, error: None }
            }
            Err(e) => OutRow { url: u, status: None, len: None, error: Some(e.to_string()) },
        }
    };

    match timeout(tmo, fut).await {
        Ok(row) => row,
        Err(_) => OutRow { url, status: None, len: None, error: Some("timeout".into()) },
    }
}
```

Key details:
- We gate concurrency with a `Semaphore`. Each task holds a permit until it completes.
- We wrap each request in a per‑request timeout so slow hosts don’t stall the batch.
- We don’t read bodies; we only inspect headers for `Content-Length`. (Servers may omit the header; in that case `len` is `null`.)
Sample input and run
urls.csv
```csv
url
https://www.rust-lang.org
https://example.com
https://httpbin.org/status/404
```

Run it:

```shell
RUST_LOG=info cargo run --release -- \
  -i urls.csv -o report.json -c 32 -t 10 --user-agent "url-audit/0.1"
```

report.json (snippet):

```json
[
  { "url": "https://www.rust-lang.org", "status": 200, "len": 12345, "error": null },
  { "url": "https://example.com", "status": 200, "len": 648, "error": null },
  { "url": "https://httpbin.org/status/404", "status": 404, "len": null, "error": null }
]
```

Optional: stream results as they arrive
If the CSV is huge, you can stream results to disk instead of accumulating in memory. Replace the task join loop with a futures::stream::FuturesUnordered and write each row as soon as it resolves. For simplicity, this first version buffers in memory.
Tests (tiny but useful)
Add a small helper in src/lib.rs just to demonstrate unit tests:
```rust
pub fn parse_len(s: &str) -> Option<u64> {
    s.parse().ok()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn parses_len() {
        assert_eq!(parse_len("123"), Some(123));
        assert_eq!(parse_len("x"), None);
    }
}
```

Run:

```shell
cargo test
```

Packaging and release builds
Build a release binary:
```shell
cargo build --release
```

- Linux/macOS: `target/release/url-audit`
- Windows: `target\release\url-audit.exe`
If you need a Linux binary for a specific target, add the target and rebuild:

```shell
rustup target add x86_64-unknown-linux-gnu
cargo build --release --target x86_64-unknown-linux-gnu
```

(Note that glibc builds are still dynamically linked; for fully static MUSL builds and cross‑compilation, explore x86_64-unknown-linux-musl and tools like cross.)
Troubleshooting
- Many timeouts: increase `-t`, lower `-c`, or verify network/DNS.
- Proxy: configure environment variables (`HTTP_PROXY`, `HTTPS_PROXY`).
- Memory spikes: stream CSV and results incrementally instead of buffering.
Exercises (15–30 minutes)
- HEAD first: try a HEAD request and fall back to GET if the server returns `405 Method Not Allowed`.
- Retry policy: add `--retries N` with exponential backoff for transient errors.
- CSV enrichment: add input columns (`label`, `category`) and include them in the output JSON.
- Metrics: print a summary table with counts per status class (2xx/3xx/4xx/5xx) and average content length.
- Stream writer: write one JSON object per line (NDJSON) as tasks complete, so memory usage stays flat.
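For the retry exercise, the delay schedule is the only subtle part. A stdlib-only sketch (the `backoff_delay` helper and its parameters are hypothetical, not part of the tool):

```rust
use std::time::Duration;

// Exponential backoff: attempt 0 waits base_ms, each retry doubles the wait,
// capped at cap_ms so a long outage doesn't produce absurd delays.
fn backoff_delay(attempt: u32, base_ms: u64, cap_ms: u64) -> Duration {
    let factor = 1u64 << attempt.min(16); // clamp the shift to avoid overflow
    Duration::from_millis(base_ms.saturating_mul(factor).min(cap_ms))
}

fn main() {
    for a in 0..5 {
        println!("attempt {a}: wait {:?}", backoff_delay(a, 100, 2_000));
    }
}
```

In the fetch loop you would `tokio::time::sleep(backoff_delay(attempt, ...)).await` between attempts, retrying only on transport errors and 5xx responses; adding a little random jitter on top helps avoid thundering-herd retries.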
What I learned shipping this
- Concurrency wants a gate: a semaphore makes back‑pressure explicit and easy to reason about.
- Timeouts are non‑negotiable for robust network tools.
- `clap` + `anyhow` + `tracing` yield CLIs that are friendly to users and maintainers.
That’s it for this mini‑series! Next, I’ll likely explore either FFI + unsafe (just enough to be safe) or a small Web API service with actix or axum to apply the same error handling/testing patterns to HTTP servers.