CI/CD Pipeline for Pytest, Vitest, Playwright & DeepEval

1. The Problem

I hit this exact problem when my pipeline took 18 minutes per push. My stack required running Vitest for the Next.js frontend, pytest with testcontainers for the FastAPI backend, Playwright for end-to-end testing, and DeepEval for LLM output quality. Initially, I threw them all into a single workflow triggered on every commit.

The result was unacceptable. Testcontainers pulled heavy Docker images repeatedly. Playwright ground through headless browser UI flows. Worst of all, every minor typo fix triggered a full RAGAS evaluation, costing real API credits just to merge a README update. The feedback loop was broken. Developers avoided committing frequently because the CI tax was too high, both in wall-clock time and financial cost. A pipeline that runs everything, everywhere, all at once is not continuous integration—it is a bottleneck. The architecture had to change.

2. The Core Principle — Not All Tests Deserve The Same Trigger

Tests must be staged by execution cost, financial cost, and relevance. A robust pipeline does not treat a unit test and an LLM evaluation equally. Fast, deterministic unit tests via Vitest and pytest should execute on every push to provide immediate feedback. Integration tests requiring databases via testcontainers also run on every push, but strictly in parallel to the unit tests.

End-to-end testing via Playwright is deferred until a pull request targets the main branch. AI evaluation via DeepEval represents the highest cost—both in time and API usage. It runs exclusively when AI-related files are modified, on a nightly schedule, or during a release merge. Applying this staging logic decouples the fast feedback loop from the heavy, expensive validation gates, eliminating redundant compute cycles and wasted API credits.
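To make the staging rule concrete, here is a minimal Python sketch of the decision it describes. The function name `select_suites`, the suite labels, the `"release"` event label, and the AI path prefixes are assumptions for illustration only; in the real pipeline this routing lives in the workflow file, not in Python.

```python
# Hypothetical path prefixes that count as "AI-related"; adjust to your layout.
AI_PREFIXES = ("backend/app/ai/", "backend/prompts/")


def select_suites(event: str, changed_files: list[str]) -> set[str]:
    """Return the test suites that should run for a given trigger."""
    suites: set[str] = set()

    # Cheap, deterministic feedback on every push and pull request.
    if event in ("push", "pull_request"):
        suites |= {"unit", "integration"}

    # Browser-level tests only gate pull requests.
    if event == "pull_request":
        suites.add("e2e")

    # The expensive tier: AI files touched, nightly schedule, or a release.
    ai_changed = any(path.startswith(AI_PREFIXES) for path in changed_files)
    if ai_changed or event in ("schedule", "release"):
        suites.add("ai-eval")

    return suites


if __name__ == "__main__":
    print(sorted(select_suites("push", ["README.md"])))
    print(sorted(select_suites("pull_request", ["backend/prompts/system.txt"])))
```

A README-only push selects just the unit and integration suites, while a pull request that touches a prompt file selects all four.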

3. Job Structuring with `needs` and Path Filters

To orchestrate this staging, the GitHub Actions workflow relies heavily on path filters and job dependencies. We define a single workflow file but use conditional execution to route the logic.

name: CI Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 2 * * *"
  workflow_dispatch:

jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      frontend: ${{ steps.filter.outputs.frontend }}
      backend: ${{ steps.filter.outputs.backend }}
      ai: ${{ steps.filter.outputs.ai }}
    steps:
      - uses: actions/checkout@v4

      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            frontend:
              - 'frontend/**'
            backend:
              - 'backend/**'
            ai:
              - 'backend/app/ai/**'
              - 'backend/prompts/**'

  frontend-unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    defaults:
      run:
        working-directory: ./frontend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
          cache-dependency-path: ./frontend/package-lock.json
      - name: Install dependencies
        run: npm ci
      - name: Run Vitest
        run: npm run test:unit

  backend-unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    defaults:
      run:
        working-directory: ./backend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: ./backend/requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Pytest Unit
        run: pytest tests/unit

  integration-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    defaults:
      run:
        working-directory: ./backend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: ./backend/requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Pytest Integration (Testcontainers)
        run: pytest tests/integration

  e2e-tests:
    needs: [frontend-unit-tests, backend-unit-tests, integration-tests]
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    timeout-minutes: 15
    defaults:
      run:
        working-directory: ./frontend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
          cache-dependency-path: ./frontend/package-lock.json
      - name: Install dependencies
        run: npm ci
      - name: Install Playwright Browsers
        run: npx playwright install --with-deps
      - name: Run Playwright
        run: npx playwright test

  ai-eval:
    needs: changes
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || needs.changes.outputs.ai == 'true'
    defaults:
      run:
        working-directory: ./backend
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: ./backend/requirements.txt
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run DeepEval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: deepeval test run tests/ai

In this dependency graph, `frontend-unit-tests`, `backend-unit-tests`, and `integration-tests` start immediately and run in parallel on separate GitHub runners. The `e2e-tests` job depends on all three via `needs`, ensuring browser-level validation only runs after the foundational application logic has passed. Because the job is guarded by `if: github.event_name == 'pull_request'`, Playwright executes only during pull request validation and is skipped on direct pushes to main and on the nightly schedule.

The `ai-eval` job follows a separate execution path. It depends only on the `changes` job, where dorny/paths-filter checks whether AI-related directories such as backend/app/ai or backend/prompts were modified. The job runs when that filter reports a change or when the workflow was started by the nightly schedule. Otherwise the whole job is skipped, which avoids unnecessary LLM API usage. A skipped job does not fail the run, so the workflow still completes successfully.
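One way to sanity-check a `needs` graph before pushing is to model it in a few lines of Python. The sketch below copies the job names from the workflow above into a dictionary and uses the standard library's `graphlib` to group jobs by the earliest point at which they could start. It is a simplification: GitHub starts each job as soon as its own dependencies finish rather than in strict waves, and the `if:` conditions are ignored here. The name `start_waves` is my own.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job name -> the jobs listed under its `needs` key in the workflow.
NEEDS = {
    "changes": set(),
    "frontend-unit-tests": set(),
    "backend-unit-tests": set(),
    "integration-tests": set(),
    "e2e-tests": {"frontend-unit-tests", "backend-unit-tests", "integration-tests"},
    "ai-eval": {"changes"},
}


def start_waves(needs: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave has all its needs satisfied."""
    sorter = TopologicalSorter(needs)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, jobs in enumerate(start_waves(NEEDS), start=1):
        print(f"wave {number}: {', '.join(jobs)}")
```

For this workflow the first wave contains the three test jobs plus `changes`, and the second wave contains `e2e-tests` and `ai-eval`, which matches the parallelism described above.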

4. Optimizing Testcontainers Performance

Integration tests powered by Testcontainers run against real Dockerized services, so they provide much higher confidence than mocks. The tradeoff is startup overhead: CI may need to pull images, create containers, wait for readiness, and tear them down afterward.

One practical optimization is using smaller, pinned images such as postgres:15-alpine, which can reduce download time. Another is using workflow-level path filtering so backend integration tests only run when backend or integration-test files change. You can also pre-pull the image before pytest starts so Testcontainers can reuse the local Docker cache instead of waiting on the pull during test execution. The pre-pulled tag has to match the image your tests request exactly; otherwise Testcontainers pulls a second image anyway.

name: Backend Integration Tests

on:
  push:
    paths:
      - 'backend/**'
      - 'tests/integration/**'

jobs:
  integration-tests:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: "pip"

      # Pre-pull the pinned image so Testcontainers finds it in the
      # local Docker cache and does not wait on the pull during pytest
      - name: Pre-pull Database Image
        run: docker pull postgres:15-alpine

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run Integration Tests (Testcontainers)
        run: pytest tests/integration

For many teams, pinning a lightweight image and skipping irrelevant runs saves more time than trying to over-optimize Docker layer caching on ephemeral CI runners.

5. Caching Node Modules and pip Dependencies

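Beyond Docker image caching, the two most impactful caches in a heterogeneous pipeline are the npm and pip package caches. Note that `cache: 'npm'` stores npm's download cache, not the node_modules directory itself, so `npm ci` still runs but no longer re-downloads every package. Without caching, every job fetches all dependencies from scratch, adding 60-120 seconds per job depending on the project size.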

# For Node.js
- uses: actions/setup-node@v4
  with:
    node-version: 20
    cache: 'npm'

# For Python
- uses: actions/setup-python@v5
  with:
    python-version: '3.12'
    cache: 'pip'
    cache-dependency-path: requirements.txt

Both setup actions automatically generate cache keys from your dependency manifests: the lockfile for npm and requirements.txt for pip. When dependencies remain unchanged, npm and pip reuse previously downloaded packages from the cache, significantly reducing installation time while still executing the normal install commands. This is the highest ROI caching optimization in most pipelines because it requires almost no configuration beyond adding the cache parameter, plus cache-dependency-path when the manifest does not sit in the repository root.
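The reason the cache survives unrelated commits is that the key is derived from the manifest contents, not from the commit. A rough sketch of that idea follows. The key format here is made up for illustration and is not the exact format either setup action uses, and `cache_key` is a hypothetical helper.

```python
import hashlib


def cache_key(runner_os: str, tool: str, manifest: bytes) -> str:
    """Build an illustrative cache key from the dependency manifest's hash."""
    digest = hashlib.sha256(manifest).hexdigest()
    return f"{runner_os}-{tool}-{digest}"


if __name__ == "__main__":
    before = b"fastapi==0.110.0\npytest==8.1.1\n"
    after = b"fastapi==0.110.0\npytest==8.2.0\n"

    # Same manifest, same key: the cache is restored on an unrelated commit.
    print(cache_key("Linux", "pip", before) == cache_key("Linux", "pip", before))
    # Changed manifest, new key: the cache is rebuilt once, then reused.
    print(cache_key("Linux", "pip", before) == cache_key("Linux", "pip", after))
```

A README edit leaves the manifest untouched, so the key is unchanged and the install step starts from a warm cache. Bumping a single pinned version changes the hash and forces one cold install.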

6. Making DeepEval and RAGAS Cost-Aware

AI evaluation frameworks such as DeepEval and RAGAS differ fundamentally from traditional test suites. A failed unit test costs only compute time. An LLM evaluation, pass or fail, costs both compute time and API credits, because non-deterministic metrics like contextual relevancy or faithfulness are themselves scored by calls to an external language model.

Treating AI evaluations like ordinary tests quickly becomes financially unsustainable. Running a full evaluation suite on every commit means paying for prompt-quality validation even when modifying unrelated code such as UI layout styles, static assets, or documentation.

The optimal strategy separates AI evaluation into a distinct two-tier topology: lightweight validation triggered selectively by code changes, and comprehensive regression evaluation executed on a scheduled cron cadence.

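name: AI Evaluation

on:
  pull_request:
    branches: [main]
    # Paths follow the monorepo layout used in the main workflow
    paths:
      - 'backend/app/ai/**'
      - 'backend/prompts/**'

  schedule:
    - cron: '0 2 * * *'

jobs:
  deep-eval:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./backend

    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Cheaper judge model on pull requests, flagship model on the nightly run.
      # The test code has to read EVAL_MODEL and pass it to its metrics.
      EVAL_MODEL: ${{ github.event_name == 'pull_request' && 'gpt-4o-mini' || 'gpt-4o' }}

    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
          cache: 'pip'
          cache-dependency-path: ./backend/requirements.txt

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Run DeepEval Suite
        run: deepeval test run tests/ai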

This architecture introduces an intentional cost-versus-coverage tradeoff. Pull requests receive rapid, cost-effective feedback using a lower-tier evaluator model (gpt-4o-mini), while the nightly schedule performs a comprehensive regression sweep using the flagship model (gpt-4o). The result is significantly lower day-to-day API spending without sacrificing long-term confidence in prompt quality, retrieval accuracy, or agent behavior.
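One detail worth spelling out: `EVAL_MODEL` is a variable name chosen in this workflow, so the tests have to read it themselves. Below is a minimal sketch of that wiring, assuming the suite uses DeepEval's `AnswerRelevancyMetric`. The helper names `resolve_eval_model` and `build_relevancy_metric`, the 0.7 threshold, and the fallback model are my own placeholders.

```python
import os
from typing import Mapping, Optional

# Assumed fallback for local runs where the workflow variable is not set.
DEFAULT_EVAL_MODEL = "gpt-4o-mini"


def resolve_eval_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the judge model named by EVAL_MODEL, or the cheap default."""
    env = os.environ if env is None else env
    return env.get("EVAL_MODEL") or DEFAULT_EVAL_MODEL


def build_relevancy_metric():
    """Create a DeepEval metric that is scored by the selected judge model."""
    # Imported lazily so this module still imports where deepeval is absent.
    from deepeval.metrics import AnswerRelevancyMetric

    return AnswerRelevancyMetric(threshold=0.7, model=resolve_eval_model())


if __name__ == "__main__":
    print(resolve_eval_model({"EVAL_MODEL": "gpt-4o"}))
    print(resolve_eval_model({}))
```

Test files under tests/ai would then call `build_relevancy_metric()` instead of hard-coding a model name, so the same suite runs against the cheaper model on pull requests and the flagship model at night.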

7. Parallelizing Playwright (and Cypress) E2E

End-to-end tests are inevitably the slowest segment of any pipeline. If you are evaluating E2E frameworks, this is where Playwright and Cypress diverge significantly in pipeline architecture.

Cypress gates its built-in parallelization and load balancing behind the paid Cypress Cloud service (via the --record --parallel flags). Open-source workarounds exist, but they require maintaining third-party plugins. Playwright, by contrast, provides free sharding directly through its CLI.

Rather than executing a massive Playwright suite sequentially, you can leverage this built-in sharding combined with GitHub Actions matrix strategies to split the workload across multiple parallel runners.

jobs:
  e2e-tests:
    runs-on: ubuntu-latest
    strategy:
      # Let the remaining shards finish even if one shard fails
      fail-fast: false
      matrix:
        shard: [1/3, 2/3, 3/3]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - name: Install dependencies
        run: npm ci
      - name: Install Playwright Browsers
        run: npx playwright install --with-deps
      - name: Run Playwright tests
        run: npx playwright test --shard=${{ matrix.shard }}

This matrix configuration provisions three separate virtual machines simultaneously, and each runner executes roughly one-third of the test suite. Playwright splits by test file unless fullyParallel is enabled, so the shards are only as balanced as the files are. If your sequential E2E suite requires twelve minutes, sharding across three runners reduces the wall-clock bottleneck to roughly four minutes plus per-runner setup, without requiring a paid cloud subscription to orchestrate the split.
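The arithmetic is worth checking against your own numbers, because every shard pays the install and browser-download overhead again. Here is a back-of-the-envelope sketch. The function names and the `setup_minutes` and `imbalance` parameters are assumptions for illustration, not figures Playwright or GitHub reports.

```python
def sharded_wall_clock(suite_minutes: float, shards: int,
                       setup_minutes: float = 0.0,
                       imbalance: float = 0.0) -> float:
    """Estimate wall-clock minutes: per-runner setup plus the slowest shard.

    `imbalance` is the fraction by which the slowest shard exceeds an even
    split, e.g. 0.25 means it takes 25% longer than suite_minutes / shards.
    """
    if shards < 1:
        raise ValueError("shards must be >= 1")
    slowest_shard = (suite_minutes / shards) * (1.0 + imbalance)
    return setup_minutes + slowest_shard


def total_runner_minutes(suite_minutes: float, shards: int,
                         setup_minutes: float = 0.0) -> float:
    """Estimate total runner time: test time is unchanged, setup repeats."""
    return suite_minutes + shards * setup_minutes


if __name__ == "__main__":
    print(sharded_wall_clock(12, 3))                     # 4.0
    print(sharded_wall_clock(12, 3, setup_minutes=1.5))  # 5.5
    print(total_runner_minutes(12, 3, setup_minutes=1.5))  # 16.5
```

Sharding trades runner minutes for wall-clock time. In this example the twelve-minute suite finishes in about five and a half minutes but consumes roughly sixteen and a half runner minutes in total.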

8. The Result — Before and After

Prior to restructuring, the pipeline executed sequentially. Every push forced the runner to pull heavy database images, sequentially execute Playwright browser flows, and ping LLM APIs for DeepEval metrics. The total wall-clock time routinely exceeded 18 minutes, and the financial cost scaled linearly with every minor commit.

After implementing dependency staging, the pipeline behaves dynamically. Standard commits receive a fast feedback loop driven by parallel unit and integration tests, finalizing in under five minutes. End-to-end tests only block pull requests, completing in under eight minutes due to matrix sharding. Most importantly, API-intensive AI evaluations run strictly when prompt engineering files are modified, or during the deep nightly regression sweep. This architecture secures the code base without taxing developer velocity or the engineering budget.

Pro Tip: Test Your Pipeline Locally with act

Designing an intricate pipeline usually involves a lot of trial and error. To avoid cluttering your Git history with endless "fix yaml" commits, use act, an open-source CLI tool by nektos that runs GitHub Actions workflows locally using Docker. Run act in your repository root and it executes the workflow in containers that approximate the GitHub-hosted runner environment, which lets you debug path filters and needs dependencies before pushing. It is not an exact replica of the hosted runners, so treat a green local run as a strong hint rather than a guarantee. You can install it with brew install act or from the GitHub releases page.

9. Closing

A CI pipeline is not an instruction to run everything always. It is a deliberate architecture decision regarding what feedback you need, how fast you need it, and at what cost. Whether you use Pytest, Vitest, Playwright, or DeepEval, the frameworks themselves do not dictate pipeline efficiency. The staging logic between them does.
