mirror of https://gitlab.rlp.net/proj-wise2526-video2document/video2document.git synced 2026-06-15 18:01:52 +02:00

MikeHughes-BIN 565caebdd8 complete documentation package

2026-02-22 15:35:19 +01:00

Video2Document (V2D) — Software Documentation Package

Document Version: 1.0.0
Conformance: ISO/IEC/IEEE 26514:2022, Systems and software engineering: Design and development of information for users
Status: Release Candidate
Last Updated: 2026-02-22
Target Audience: End Users · Developers · Administrators
Technical Depth: Intermediate → Expert

Executive Summary
Product Overview
System Architecture
Technology Stack
Installation & Setup Guide
Configuration Guide
Usage Guide — End User Perspective
API Documentation
Data Model Description
Security Considerations
Performance & Scalability Notes
Deployment Guide
Testing Strategy
Maintenance & Support Plan
Limitations & Known Issues
Future Improvements / Roadmap
Glossary

1. Executive Summary

Video2Document (V2D) is a cross-platform desktop application that automates the conversion of video and audio recordings into structured, professional documents. It combines AI-based speech transcription, speaker diarisation, and large language model (LLM) document generation into a single, guided end-to-end pipeline.

The system addresses a concrete operational need: organisations and individuals frequently produce recordings of meetings, lectures, and collaborative sessions that remain inaccessible as raw media files. V2D transforms this latent content into actionable, shareable documents — meeting reports, agendas, sprint planning notes, and custom-format outputs — without requiring users to possess programming or AI expertise.

The application is delivered as an Electron desktop executable and runs on Windows, macOS, and Linux. It integrates with three LLM providers (Google Gemini, OpenAI GPT via SAIA, and Alibaba Qwen3) and two transcription backends (AssemblyAI in the cloud and Whisper.cpp locally). Its modular architecture permits the addition of new AI providers or output formats with minimal modification to the core codebase.

Key capabilities at a glance:

Capability	Detail
Input formats	MP4, MOV, WAV, MP3, FLAC (any FFmpeg-compatible source)
Output formats	PDF, DOCX, HTML, TXT
Transcription	AssemblyAI (cloud) · Whisper.cpp (local/offline)
LLM providers	Google Gemini · OpenAI GPT · Alibaba Qwen3
Document types	Follow-up report · Agenda · Result protocol · Sprint planning note · Custom
Speaker handling	Automatic diarisation with post-hoc name substitution
Deployment	Cross-platform Electron desktop app
UI languages	English · German (extensible)

2. Product Overview

2.1 Problem Statement

Recording meetings and lectures is a widespread practice. However, converting those recordings into usable documents is time-intensive and error-prone. Manually transcribing a one-hour meeting and reformatting it as a professional report can take two to four hours of skilled work.

V2D reduces this effort to a guided six-step workflow completed in minutes, with the document quality governed by the selected LLM and document-type prompt.

2.2 Target Users

User Type	Description	Primary Need
End User (non-technical)	Meeting participants, students, knowledge workers	Generate a document from a recording with no setup
Developer	Engineers extending or integrating V2D	Understand architecture, add modules, contribute
Administrator	IT staff deploying V2D internally	Manage API keys, configure keyserver, maintain deployments

2.3 Functional Requirements

ID	Requirement	Priority
FR-01	The system shall accept video and audio files as input	Must
FR-02	The system shall extract audio from video files	Must
FR-03	The system shall transcribe audio to text with speaker labels	Must
FR-04	The system shall generate structured documents from transcripts using an LLM	Must
FR-05	The system shall allow the user to assign real names to speakers	Must
FR-06	The system shall export documents in PDF, DOCX, HTML, and TXT formats	Must
FR-07	The system shall provide at least three standard document type templates	Must
FR-08	The system shall allow users to create custom document type templates	Should
FR-09	The system shall support at least two independent transcription backends	Should
FR-10	The system shall support at least two independent LLM backends	Should
FR-11	The system shall display processing progress to the user	Should
FR-12	The system shall provide playable speaker audio previews	Could
FR-13	The system shall support UI localisation to at least English and German	Could

2.4 Non-Functional Requirements

ID	Requirement	Category
NFR-01	The application shall run on Windows 10+, macOS 12+, and Ubuntu 20.04+	Portability
NFR-02	API keys shall never be stored in source-controlled files	Security
NFR-03	The UI shall be operable by a non-technical user without documentation	Usability
NFR-04	Individual modules shall be replaceable without modifying the core pipeline	Maintainability
NFR-05	The addition of a new LLM or transcription module shall require changes to exactly one directory	Extensibility
NFR-06	Intermediate artefacts (audio, transcripts, HTML) shall be persisted on disk	Reliability
NFR-07	The application shall report errors to the user via the UI	Usability
NFR-08	All secrets shall be injected at runtime via environment variables or keyserver	Security
NFR-09	The system shall complete processing of a 30-minute recording within 10 minutes (cloud transcription)	Performance
NFR-10	The test suite shall cover all major module types	Testability

2.5 Scope and Constraints

In scope:

Desktop application for video/audio-to-document conversion
Modular plugin system for transcription and LLM providers
Multi-format document export
Speaker diarisation and name mapping

Out of scope (current version):

Real-time (streaming) transcription
Web-based or SaaS deployment
Voice biometric speaker identification
Video editing or preview
Multi-user collaboration or cloud storage

3. System Architecture

3.1 Architectural Overview

V2D follows a layered, modular pipeline architecture that runs within a single Electron process pair (main process + renderer process). The core workflow needs no self-hosted server: all orchestration is local, and only the cloud transcription and LLM modules call external services.

┌─────────────────────────────────────────────────────────────────────┐
│                        Electron Application                         │
│                                                                     │
│  ┌──────────────────────┐        ┌──────────────────────────────┐  │
│  │   Renderer Process   │  IPC   │      Main Process            │  │
│  │  (Chromium / HTML)   │◄──────►│  (Node.js / Orchestrator)    │  │
│  │                      │        │                              │  │
│  │  index.html          │        │  main.js                     │  │
│  │  script.js           │        │  module-handler.js           │  │
│  │  renderer.js         │        │  mapFunctions (global Map)   │  │
│  │  languages.js        │        │                              │  │
│  └──────────────────────┘        └──────────────┬───────────────┘  │
│           │                                     │                   │
│      preload.js                    ┌────────────▼────────────┐     │
│   (contextBridge / IPC)            │   Module Registry        │     │
│                                    │  (mapFunctions Map)       │     │
│                                    └────────────┬────────────┘     │
│                                                 │                   │
└─────────────────────────────────────────────────┼───────────────────┘
                                                  │
              ┌───────────────────┬───────────────┼───────────────┐
              │                   │               │               │
    ┌─────────▼──────┐  ┌─────────▼──────┐  ┌────▼──────┐  ┌────▼──────┐
    │  Extraction    │  │ Transcription  │  │   LLM     │  │ Converter │
    │  ffmpegExtract │  │  assembly.js   │  │ gemini.js │  │ convert.js│
    │                │  │  whisperLocal  │  │ chatgpt.js│  │           │
    └────────┬───────┘  └───────┬────────┘  │ qwen3.js  │  └───────────┘
             │                  │           └────────────┘
             ▼                  ▼
      ┌─────────────────────────────────┐
      │        /storage/ (disk)         │
      │  /audio   /transcripts          │
      │  /transcriptionSummaries        │
      │  /documents   /documentType     │
      └─────────────────────────────────┘

3.2 Process Architecture

Electron provides two isolated JavaScript contexts, bridged by a preload script:

Process	Runtime	Responsibilities
Main Process	Node.js	File system, pipeline execution, module loading, IPC handlers, native dialogs
Renderer Process	Chromium (sandboxed)	UI rendering, user interaction, progress display
Preload Script	Node.js (bridge)	Exposes a controlled API surface to the renderer via `contextBridge`

Node integration in the renderer is explicitly disabled. All privileged operations are routed through the preload bridge.

3.3 Module System

The module system is the central extensibility mechanism. On startup, main.js traverses all subdirectories of services/modules/, dynamically requires each .js file, and registers the exported module object in a global Map<string, Module> keyed by module.name.

services/modules/
├── extraction/            ffmpegExtractor.js
├── transcription-remote/  assembly.js
├── transcription-local/   whisperLocal.ts
├── jsonTools/             transcriptionSummarizer2.js
├── llm-gemini/            gemini.js
├── llm-chat_gpt/          chatgpt.js
├── quen3/                 qwen3.js
├── convert/               convert.js
├── audioSnippets/         extract-speaker-snippets.js
├── replace_speaker/       replaceSpeaker.js
└── utility/               @startup.js · module-handler.js

Each module conforms to the following interface:

module.exports = {
  name:        "unique-module-identifier",   // key in mapFunctions
  type:        "llm",                        // category label, e.g. "llm" or "transcription"
  displayname: "Human-Readable Name",        // shown in UI dropdowns
  description: "What this module does",
  audioformat: "mp3",                        // transcription modules only
  // Primary entry point: an async method whose property name is literally "function"
  async function(parameter) {
    // Implementation
  }
};

Modules are invoked as: mapFunctions.get("module-name").function(params).
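The loading step described in this section could look roughly like the following sketch. It is not the project's main.js: the helper name `loadModules`, the rule that files lacking a `name` or a `function` export are skipped, and the duplicate-name check are assumptions made for illustration.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical loader (not the project's main.js): walks rootDir recursively
// and registers every exported module object under its `name`.
function loadModules(rootDir, registry = new Map()) {
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    const fullPath = path.join(rootDir, entry.name);
    if (entry.isDirectory()) {
      loadModules(fullPath, registry);
      continue;
    }
    if (!entry.isFile() || !entry.name.endsWith(".js")) continue;

    const mod = require(path.resolve(fullPath));
    // Assumption: files without a name and an entry point are helpers, not modules.
    if (typeof mod?.name !== "string" || typeof mod.function !== "function") continue;
    if (registry.has(mod.name)) {
      throw new Error(`Duplicate module name: ${mod.name}`);
    }
    registry.set(mod.name, mod);
  }
  return registry;
}
```

Under these assumptions, `const mapFunctions = loadModules(path.join(__dirname, "services/modules"))` would produce the registry that the invocation pattern above reads from.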

3.4 Pipeline Execution Flow

Behind the six-step user workflow, the main process runs the following internal pipeline stages:

Stage 1 │ VIDEO INPUT          User selects file via native file dialog
        │                      IPC: file_submit → main.js
        ▼
Stage 2 │ AUDIO EXTRACTION     ffmpegExtractor extracts audio stream
        │                      Output: /storage/audio/<file>.<format>
        ▼
Stage 3 │ TRANSCRIPTION        AssemblyAI or Whisper.cpp processes audio
        │                      Output: /storage/transcripts/<id>.json
        ▼
Stage 4 │ SUMMARISATION        Transcript compacted to sentence-level objects
        │                      Output: /storage/transcriptionSummaries/<id>.json
        │                      Speaker audio snippets: /storage/audio/speakerSnippets/
        ▼
Stage 5 │ LLM GENERATION       Prompt + summarised transcript → LLM → HTML
        │                      Output: /storage/documents/<id>.html
        ▼
Stage 6 │ SPEAKER SUBSTITUTION Regex replace speakerA/B/C with real names
        │                      + Format conversion (HTML → PDF/DOCX/TXT)
        ▼
        │ EXPORT               Native OS save dialog → user's chosen location
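The stage order above can be sketched as a sequential orchestrator over the module registry. Only the order of the stages and the progress values 1 to 4 are taken from this document; the function name `runPipeline`, the `summariser` and `converter` config fields, and every parameter and return shape are assumptions.

```javascript
// Hypothetical orchestrator. Only the stage order and the progress values 1-4
// come from the documentation; all parameter and return shapes are assumed.
async function runPipeline(registry, config, onProgress = () => {}) {
  const run = (moduleName, params) => {
    const mod = registry.get(moduleName);
    if (!mod) throw new Error(`Module not registered: ${moduleName}`);
    return mod.function(params);
  };

  // Stage 2: extract audio in the format the transcription module expects.
  const audioformat = registry.get(config.transcription.module)?.audioformat;
  const audioPath = await run(config.video.module, {
    inputVideoPath: config.video.inputVideoPath,
    audioformat,
  });
  onProgress(1);

  // Stages 3 and 4: transcribe, then compact to sentence-level objects.
  const transcript = await run(config.transcription.module, { audioPath });
  const summary = await run(config.summariser, { transcript });
  onProgress(2);

  // Stage 5: the LLM turns template + summary into HTML.
  const html = await run(config.document.module, {
    summary,
    type: config.document.type,
  });
  onProgress(3);

  // Stage 6: convert the HTML to the requested output format.
  const output = await run(config.converter, {
    html,
    outputType: config.document.outputType,
  });
  onProgress(4);
  return output;
}
```

Because each stage awaits the previous one, a rejected promise in any module stops the run, which is the natural place to forward a message on the `error` channel.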

3.5 IPC Communication Map

The following IPC channels are defined between the renderer and main processes:

Channel	Direction	Trigger	Payload
`file_submit`	Renderer → Main	User submits step 1 form	Pipeline configuration object
`file_download`	Renderer → Main	User clicks Download	`{ type, speakerMap }`
`speaker_submit`	Renderer → Main	User confirms speaker names	Speaker name map
`progress`	Main → Renderer	After each pipeline stage	Stage number (1–4)
`speakerAudios`	Main → Renderer	After snippet extraction	Speaker audio path map
`error`	Main → Renderer	Any pipeline exception	Error message string
`get-module-names`	Renderer ↔ Main	UI init	Module list
`get-txt-files`	Renderer ↔ Main	UI init	Document type filenames
`save-txt-file`	Renderer ↔ Main	Custom type editor save	`{ filename, content }`
`read-txt-file`	Renderer ↔ Main	Custom type editor load	Filename
`delete-txt-file`	Renderer ↔ Main	Custom type editor delete	Filename
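As an illustration of how the one-way channels in this table could be wrapped for the renderer, the sketch below builds the bridge object from an injected `ipcRenderer`. The factory name `buildBridge` and the unsubscribe return value are assumptions; the method names follow Section 8.2 and the channel names follow the table above. In an actual preload script the result would presumably be passed to `contextBridge.exposeInMainWorld("electron", buildBridge(ipcRenderer))`.

```javascript
// Hypothetical factory: takes ipcRenderer as an argument so the mapping can be
// exercised outside Electron. The real preload.js may be organised differently.
function buildBridge(ipcRenderer) {
  // Wrap a Main -> Renderer channel; the Electron event object is dropped so
  // the renderer only ever sees the payload.
  const listen = (channel) => (callback) => {
    const handler = (_event, payload) => callback(payload);
    ipcRenderer.on(channel, handler);
    return () => ipcRenderer.removeListener(channel, handler); // unsubscribe
  };

  return {
    submitFile: (payload) => ipcRenderer.send("file_submit", payload),
    downloadFile: (payload) => ipcRenderer.send("file_download", payload),
    onProgress: listen("progress"),
    onSpeakerAudios: listen("speakerAudios"),
    onError: listen("error"),
  };
}
```

Exposing only these named functions, rather than `ipcRenderer` itself, keeps the renderer limited to the channels listed in the table.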

4. Technology Stack

4.1 Core Runtime and Framework

Component	Technology	Version	Purpose
Runtime	Node.js	19 – 24	JavaScript execution environment
Desktop Framework	Electron	39.1.1	Cross-platform desktop application shell
HTTP Framework	Express	5.1.0	Local HTTP server (IPC supplement; API Skeleton)
Primary Language	JavaScript (ES2022)	—	All production modules
Secondary Language	TypeScript	5.9.3	Local transcription module, pipeline job examples
TypeScript Runtime	ts-node	10.9.2	Execute TypeScript without pre-compilation

4.2 AI and LLM Providers

Provider	Module	Environment Variable	Model	Mode
Google Gemini	`llm-gemini`	`GOOGLE_API_KEY`	gemini-2.0-flash (default)	Cloud
OpenAI GPT (via SAIA)	`llm-saia_openai_gpt`	`SAIA_API_KEY`	OSS 120B	Cloud
Alibaba Qwen3	`qwen3-235b-a22b`	`SAIA_API_KEY`	Qwen3-235B-A22B	Cloud

4.3 Transcription Backends

Provider	Module	Environment Variable	Mode	Speaker Labels
AssemblyAI	`assembly`	`ASSEMBLYAI_API_KEY`	Cloud (remote)	Yes
Whisper.cpp	`whisperLocal`	None required	Local (offline)	Limited

4.4 Audio and Video Processing

Library	Version	Purpose
`fluent-ffmpeg`	Latest	Node.js wrapper for FFmpeg CLI
`ffmpeg-static`	Latest	Bundled static FFmpeg binary

Supported input container formats: any FFmpeg-compatible format (MP4, MOV, AVI, MKV, WebM for video; WAV, MP3, FLAC, M4A for audio).

Extraction output formats: WAV (16 kHz, mono), MP3, FLAC. The format is determined by each transcription module's audioformat property.
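To make the format selection concrete, the sketch below derives an output path and a plain FFmpeg argument list from a module's `audioformat`. The application itself calls FFmpeg through fluent-ffmpeg; this example uses raw command-line arguments instead, and the helper name `buildExtractArgs` and the output directory parameter are assumptions.

```javascript
const path = require("node:path");

const SUPPORTED_FORMATS = new Set(["wav", "mp3", "flac"]);

// Hypothetical helper: builds FFmpeg CLI arguments for audio extraction.
function buildExtractArgs(inputPath, audioDir, audioformat) {
  if (!SUPPORTED_FORMATS.has(audioformat)) {
    throw new Error(`Unsupported audio format: ${audioformat}`);
  }
  const base = path.basename(inputPath, path.extname(inputPath));
  const outputPath = path.join(audioDir, `${base}.${audioformat}`);

  // -y overwrites an existing file, -vn drops the video stream.
  const args = ["-y", "-i", inputPath, "-vn"];
  if (audioformat === "wav") {
    // 16 kHz mono 16-bit PCM, matching the WAV profile named above.
    args.push("-ac", "1", "-ar", "16000", "-c:a", "pcm_s16le");
  }
  args.push(outputPath);
  return { outputPath, args };
}
```

The returned `args` could be handed to `child_process.spawn` together with the binary path that the ffmpeg-static package exports.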

4.5 Document Generation and Export

Library	Purpose	Output
`html-to-docx`	Convert HTML to Word document	`.docx`
`puppeteer`	Headless Chromium for PDF rendering	`.pdf`
Native file write	Plain text export	`.txt`
Native file write	HTML export	`.html`

4.6 Networking and Utilities

Library	Purpose
`axios`	HTTP client for external API calls
`dotenv`	Load environment variables from `.env` file

4.7 Development and Test Dependencies

Library	Purpose
`mocha` (v11.7.5)	Unit and integration test runner
`@types/node` (v24.9.2)	TypeScript Node.js type definitions
`@types/fluent-ffmpeg`	TypeScript FFmpeg type definitions
`typescript` (v5.9.3)	TypeScript compiler
`ts-node` (v10.9.2)	Direct TypeScript execution

5. Installation & Setup Guide

5.1 Prerequisites

Before installing V2D, ensure the following are present on the target machine:

Requirement	Minimum Version	Notes
Node.js	19.x	Versions 19 through 24 are supported. Use `node -v` to verify.
npm	9.x	Bundled with Node.js.
Git	Any recent	Required to clone the repository.
Internet connection	—	Required for cloud transcription and LLM services.
FFmpeg	Any recent	Bundled via `ffmpeg-static`; a separate system install is not required.

Note: The ffmpeg-static package provides a platform-appropriate FFmpeg binary, which the application uses through fluent-ffmpeg.

5.2 Obtaining the Source Code

git clone <repository-url>
cd video2document

Replace <repository-url> with the actual repository URL provided by your administrator.

5.3 Installing Dependencies

From the project root directory:

npm install

This installs all dependencies declared in package.json, including Electron, Puppeteer, and all service libraries. The installation may take several minutes on the first run due to Puppeteer downloading a Chromium binary.

5.4 Environment Configuration

V2D requires API keys for external services. These are provided via environment variables.

Option A — Local .env file (development / personal use):

Create a file named .env in the project root:

GOOGLE_API_KEY=your_google_gemini_key
ASSEMBLYAI_API_KEY=your_assemblyai_key
SAIA_API_KEY=your_saia_platform_key

The .env file is Git-ignored, so Git does not track it. Do not add it to the repository manually.

Option B — Keyserver (organisational deployment):

If your organisation operates the V2D keyserver, set the following variables instead:

auth_username=your_keyserver_username
auth_password=your_keyserver_password

On startup, the application contacts keyserver.dommymommy.xyz:443 and retrieves all required API keys automatically. See Section 6.2 for details.

5.5 Starting the Application

Linux / macOS:

./start.sh

Or manually:

npm install   # if not already done
npm start

Windows:

start.bat

Or manually from Command Prompt or PowerShell:

npm install
npm start

The Electron window will open automatically after a brief initialisation period.

5.6 Verifying the Installation

After launch, the V2D main window should appear showing Step 1 (file selection). If the window does not appear:

1. Check the terminal / command prompt for Node.js error output.
2. Confirm the Node.js version with node -v (versions 19 through 24 are supported).
3. Confirm npm install completed without errors.
4. Confirm at least one set of API credentials is present in .env or via keyserver.

6. Configuration Guide

6.1 Environment Variables Reference

Variable	Required By	Description
`GOOGLE_API_KEY`	`llm-gemini` module	Google Cloud API key with Generative Language API enabled
`ASSEMBLYAI_API_KEY`	`assembly` transcription module	AssemblyAI account API key
`SAIA_API_KEY`	`llm-saia_openai_gpt`, `qwen3-235b-a22b` modules	SAIA platform API key for hosted LLM access
`auth_username`	Keyserver auth	Username credential for automatic key retrieval
`auth_password`	Keyserver auth	Password credential for automatic key retrieval

Only the variables corresponding to the modules you intend to use are strictly required. The application will report a clear error if a module is invoked without its required key.
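A pre-flight check of this kind might look like the sketch below. The module-to-variable mapping is copied from the tables in Sections 4.2 and 4.3; the helper name `assertKeyAvailable` and the exact error wording are assumptions, not the application's actual messages.

```javascript
// Mapping taken from the provider tables; modules not listed need no key.
const REQUIRED_KEYS = new Map([
  ["llm-gemini", "GOOGLE_API_KEY"],
  ["assembly", "ASSEMBLYAI_API_KEY"],
  ["llm-saia_openai_gpt", "SAIA_API_KEY"],
  ["qwen3-235b-a22b", "SAIA_API_KEY"],
]);

// Hypothetical helper: returns the variable name that was checked, or null
// when the module (for example whisperLocal) does not need a key.
function assertKeyAvailable(moduleName, env = process.env) {
  const variable = REQUIRED_KEYS.get(moduleName);
  if (variable === undefined) return null;

  const value = env[variable];
  if (typeof value !== "string" || value.trim() === "") {
    throw new Error(`Module "${moduleName}" requires ${variable}, which is not set.`);
  }
  return variable;
}
```

Running the check before a pipeline starts lets a missing key surface as one clear message instead of a failed network call several minutes into processing.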

6.2 Keyserver Authentication

The keyserver is a proprietary key distribution service hosted at keyserver.dommymommy.xyz:443. When credentials are provided, the startup module (@startup.js) performs the following sequence:

1. Issues a GET request to the keyserver endpoint with auth_username and auth_password as HTTP headers.
2. The server responds with a JSON object containing API key–value pairs.
3. Keys are merged into process.env within the main process.
4. Keys are never forwarded to the renderer process.

This mechanism allows teams to rotate keys centrally without redistributing .env files to individual users.
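The retrieval sequence can be sketched as follows. This is an illustration, not the contents of @startup.js: it uses the `fetch` function built into current Node.js releases rather than axios, takes the endpoint URL as a parameter because the exact path is not documented here, and the name `loadKeysFromKeyserver` and the injectable `fetchImpl` option are assumptions.

```javascript
// Hypothetical startup helper: fetches API keys and merges them into env.
// Returns the names of the variables that were set.
async function loadKeysFromKeyserver(url, { env = process.env, fetchImpl = fetch } = {}) {
  const { auth_username, auth_password } = env;
  if (!auth_username || !auth_password) return []; // keyserver not configured

  const response = await fetchImpl(url, {
    method: "GET",
    headers: { auth_username, auth_password },
  });
  if (!response.ok) {
    throw new Error(`Keyserver responded with HTTP ${response.status}`);
  }

  const keys = await response.json();
  const merged = [];
  for (const [name, value] of Object.entries(keys)) {
    // Only non-empty strings are accepted as key values.
    if (typeof value === "string" && value !== "") {
      env[name] = value;
      merged.push(name);
    }
  }
  return merged;
}
```

Because the function only writes to the `env` object of the process it runs in, calling it from the main process keeps the keys out of the renderer, in line with the last step above.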

6.3 Document Type Templates

Document type templates control the instructions given to the LLM when generating output. They are stored as plain text files in /storage/documentType/.

Built-in templates:

Filename	Purpose
`followup_report.txt`	Professional post-meeting follow-up report
`agenda.txt`	Reconstructed meeting agenda from transcript
`result_protocol.txt`	Formal meeting minutes / result protocol
`sprint_planning_note.txt`	Agile sprint planning session documentation
`custom_document.txt`	Placeholder for user-defined format

Creating a custom template:

Templates can be edited through the V2D UI (Custom Type Editor, accessible from the document type step) or by editing the .txt files in /storage/documentType/ directly.

Template authoring guidelines:

Write instructions for the LLM in clear, imperative sentences.
Reference speaker tokens using the exact placeholder strings speakerA, speakerB, speakerC, etc.
Specify the desired output structure (headings, bullet lists, tables).
Include language instructions if multilingual output is required.
Do not include the transcript itself — it is automatically appended by the pipeline.
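The last guideline implies that the pipeline joins template and transcript before calling the LLM. A minimal sketch of that step is shown below; the function name `buildPrompt`, the `{ speaker, text }` sentence shape, and the "Transcript:" separator are assumptions, since the real prompt layout is not documented here.

```javascript
// Hypothetical prompt assembly: template instructions first, transcript appended.
function buildPrompt(templateText, sentences) {
  const transcript = sentences
    .map(({ speaker, text }) => `${speaker}: ${text}`)
    .join("\n");
  return `${templateText.trim()}\n\nTranscript:\n${transcript}`;
}

const prompt = buildPrompt("Write formal minutes. Refer to speakers as speakerA, speakerB.\n", [
  { speaker: "speakerA", text: "Let us start with the budget." },
  { speaker: "speakerB", text: "Agreed." },
]);
console.log(prompt);
```

Keeping the speaker tokens unchanged in the prompt is what allows the later substitution step to find them in the generated document.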

6.4 Module Configuration

Modules are self-describing. To enable or disable a module, add or remove its .js file from its directory under services/modules/. The module will be included or excluded from all UI dropdowns automatically on next application start.
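Since each module carries its own `type` and `displayname`, a dropdown list can be derived from the registry alone. The sketch below shows one way to do this; the helper name `listModulesByType`, the fallback to `name`, and the alphabetical ordering are assumptions rather than the behaviour of the `get-module-names` handler.

```javascript
// Hypothetical helper: builds dropdown entries for one module category.
function listModulesByType(registry, type) {
  return [...registry.values()]
    .filter((mod) => mod.type === type)
    .map(({ name, displayname }) => ({ name, displayname: displayname ?? name }))
    .sort((a, b) => a.displayname.localeCompare(b.displayname));
}
```

With this approach, removing a module file changes the registry on the next start and the corresponding entry disappears without any UI change.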

To change the default Gemini model, edit the model constant in services/modules/llm-gemini/gemini.js.

6.5 Storage Paths

All artefact directories are located under /storage/ at the project root. These paths are currently hard-coded and are not configurable without modifying the relevant module source files. Future versions may expose these as configuration parameters.

7. Usage Guide — End User Perspective

7.1 Workflow Overview

V2D presents users with a linear six-step wizard. Steps must be completed in order. Each step requires a selection or action before the next step becomes available.

Step 1 → Step 2 → Step 3 → Step 4 → [ Processing ] → Step 5 → Step 6
File     Transcrip- Document  Output                  Preview   Download
Select   tion Svc   Type      Format                  & Names

7.2 Step-by-Step Instructions

Step 1 — Select Video or Audio File

1. Click Browse or drag and drop a file onto the designated area.
2. Accepted formats: any FFmpeg-compatible video or audio file (MP4, MOV, WAV, MP3, FLAC, etc.).
3. The selected filename is displayed for confirmation.
4. Click Next to proceed.

Step 2 — Select Transcription Service

1. A dropdown lists all registered transcription modules.
2. Select the desired backend:
   - AssemblyAI: cloud-based, highest accuracy, requires internet and API key.
   - Whisper Local: offline, no API key, lower accuracy for multi-speaker content.
3. Click Next to proceed.

Step 3 — Select Document Type

1. A dropdown lists all document type templates found in /storage/documentType/.
2. Select the template matching your desired output format (e.g., "Follow-up Report").
3. Optionally, click Edit / Create Custom Type to open the template editor.
4. Click Next to proceed.

Step 4 — Select Output Format and LLM

1. Select the export format: PDF, DOCX, HTML, or TXT.
2. Select the LLM provider from the dropdown.
3. Click Generate Document to begin processing.

Processing Phase

A progress bar advances through four stages:

Stage	Description
1 / 4	Extracting audio from video
2 / 4	Transcribing audio (cloud or local)
3 / 4	Generating document with LLM
4 / 4	Converting to selected output format

Processing time depends on recording length and network speed. A 30-minute recording typically completes in 3–8 minutes with cloud services.

Step 5 — Review Document and Assign Speaker Names

1. A preview of the generated document is displayed.
2. For each detected speaker (labelled Speaker A, Speaker B, etc.):
   - A short audio clip of that speaker is playable for identification.
   - A text field allows entry of the speaker's real name.
3. Click Apply Names to replace all speaker placeholders in the document.
4. Review the updated preview.

Step 6 — Download / Export

1. Click Download.
2. A native OS save dialog opens.
3. Choose the destination path and filename.
4. The file is saved in the format selected in Step 4.

7.3 UI Language Selection

The UI language can be changed via the language selector (flag icons). Currently supported languages: English, German. The selection applies immediately to all UI text.

7.4 Help Page

A built-in help page is accessible from the main toolbar. It provides concise per-step guidance and explains the purpose of each configuration option.

7.5 Error Messages

If the pipeline encounters an error, a notification is displayed in the UI with a description of the failure. Common causes include:

Missing or invalid API key for the selected service
Network connectivity loss during cloud transcription
Unsupported or corrupted input file
Insufficient disk space in the /storage/ directory

8. API Documentation

8.1 Status

The REST API component (/API-Skeleton/) is a non-integrated template. It defines the intended endpoint structure for a future server-mode version of V2D. The current production application communicates exclusively via Electron IPC.

This section documents both the existing IPC interface (authoritative) and the planned REST API (informational).

8.2 IPC API (Production)

The IPC interface is the authoritative communication mechanism between the renderer and main processes. It is accessed via the window.electron object injected by the preload script.

`window.electron.submitFile(payload)`

Initiates the full processing pipeline.

Direction: Renderer → Main

Parameters:

interface SubmitFilePayload {
  video: {
    module: string;          // e.g. "extraction-video-to-audio"
    inputVideoPath: string;  // Absolute path to input file
  };
  transcription: {
    module: string;          // e.g. "assembly"
  };
  document: {
    module: string;          // e.g. "llm-gemini"
    type: string;            // Document type template name (without .txt)
    outputType: "pdf" | "docx" | "html" | "txt";
  };
}

Example:

window.electron.submitFile({
  video: {
    module: "extraction-video-to-audio",
    inputVideoPath: "/Users/name/recordings/meeting.mp4"
  },
  transcription: { module: "assembly" },
  document: {
    module: "llm-gemini",
    type: "followup_report",
    outputType: "pdf"
  }
});
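Because the payload crosses the IPC boundary as plain data, the main process may want to check its shape before starting the pipeline. The validator below is a sketch under that assumption; `validateSubmitPayload` is a hypothetical name, and the rules are derived only from the interface shown above.

```javascript
const OUTPUT_TYPES = new Set(["pdf", "docx", "html", "txt"]);

// Hypothetical validator: returns a list of problems, empty when the payload is usable.
function validateSubmitPayload(payload) {
  const isText = (value) => typeof value === "string" && value.trim() !== "";
  const required = {
    "video.module": payload?.video?.module,
    "video.inputVideoPath": payload?.video?.inputVideoPath,
    "transcription.module": payload?.transcription?.module,
    "document.module": payload?.document?.module,
    "document.type": payload?.document?.type,
  };
  const errors = Object.entries(required)
    .filter(([, value]) => !isText(value))
    .map(([field]) => `${field} must be a non-empty string`);

  if (!OUTPUT_TYPES.has(payload?.document?.outputType)) {
    errors.push('document.outputType must be "pdf", "docx", "html", or "txt"');
  }
  // The interface asks for the template name without its file extension.
  if (isText(payload?.document?.type) && payload.document.type.endsWith(".txt")) {
    errors.push("document.type must be given without the .txt extension");
  }
  return errors;
}
```

Returning a list instead of throwing on the first problem lets the UI report every invalid field in one error message.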

`window.electron.downloadFile(payload)`

Triggers the final document export.

Direction: Renderer → Main

Parameters:

interface DownloadPayload {
  type: "pdf" | "docx" | "html" | "txt";
  speakerMap: Record<string, string>;  // e.g. { "speakerA": "Alice", "speakerB": "Bob" }
}
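One way a `speakerMap` of this shape could be applied to the generated HTML is sketched below. The pipeline is described as using a regex replacement, but the concrete pattern is not documented; the single-pass alternation, the word-boundary anchors, and the names `applySpeakerMap` and `escapeRegExp` are assumptions.

```javascript
// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical substitution step: replaces every placeholder that has a
// non-empty name in speakerMap, in one pass over the document.
function applySpeakerMap(html, speakerMap) {
  const placeholders = Object.keys(speakerMap).filter((key) => speakerMap[key]);
  if (placeholders.length === 0) return html;

  // \b keeps "speakerA" from matching inside a longer token such as "speakerAB".
  const pattern = new RegExp(`\\b(?:${placeholders.map(escapeRegExp).join("|")})\\b`, "g");
  // A replacer function inserts names literally, even if they contain "$".
  return html.replace(pattern, (match) => speakerMap[match]);
}
```

Replacing all placeholders in a single pass avoids a chain effect in which a name entered for one speaker happens to equal another placeholder and is substituted a second time.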

`window.electron.onProgress(callback)`

Registers a listener for pipeline progress events.

Direction: Main → Renderer

Callback parameter: (stage: number) => void — stage is an integer 1 through 4.
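A renderer-side callback might translate the stage number into display text and a bar position, as in the sketch below. The stage descriptions are those listed in the Processing Phase table of Section 7.2; the function name `describeStage` and the percentage calculation are assumptions.

```javascript
const STAGE_DESCRIPTIONS = [
  "Extracting audio from video",
  "Transcribing audio (cloud or local)",
  "Generating document with LLM",
  "Converting to selected output format",
];

// Hypothetical helper for an onProgress callback.
function describeStage(stage) {
  const total = STAGE_DESCRIPTIONS.length;
  if (!Number.isInteger(stage) || stage < 1 || stage > total) {
    throw new RangeError(`Stage must be an integer from 1 to ${total}`);
  }
  return {
    step: `${stage} / ${total}`,
    description: STAGE_DESCRIPTIONS[stage - 1],
    percent: (stage / total) * 100,
  };
}
```

A listener registered with `window.electron.onProgress` could pass each received stage to this helper and write the result into the progress bar.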

`window.electron.onSpeakerAudios(callback)`

Receives speaker audio clip paths after diarisation.

Direction: Main → Renderer

Callback parameter:

(speakerMap: Record<string, { src: string; name: string }>) => void

`window.electron.onError(callback)`

Registers a listener for pipeline error events.

Direction: Main → Renderer

Callback parameter: (message: string) => void