56 KiB
Video2Document (V2D) — Software Documentation Package
Document Version: 1.0.0 Conformance: ISO/IEC/IEEE 26514:2022 — Systems and Software Engineering — Requirements for Designers and Developers of User Documentation Status: Release Candidate Last Updated: 2026-02-22 Target Audience: End Users · Developers · Administrators Technical Depth: Intermediate → Expert
Table of Contents
- Executive Summary
- Product Overview
- System Architecture
- Technology Stack
- Installation & Setup Guide
- Configuration Guide
- Usage Guide — End User Perspective
- API Documentation
- Data Model Description
- Security Considerations
- Performance & Scalability Notes
- Deployment Guide
- Testing Strategy
- Maintenance & Support Plan
- Limitations & Known Issues
- Future Improvements / Roadmap
- Glossary
1. Executive Summary
Video2Document (V2D) is a cross-platform desktop application that automates the conversion of video and audio recordings into structured, professional documents. It combines AI-based speech transcription, speaker diarisation, and large language model (LLM) document generation into a single, guided end-to-end pipeline.
The system addresses a concrete operational need: organisations and individuals frequently produce recordings of meetings, lectures, and collaborative sessions that remain inaccessible as raw media files. V2D transforms this latent content into actionable, shareable documents — meeting reports, agendas, sprint planning notes, and custom-format outputs — without requiring users to possess programming or AI expertise.
The application is delivered as an Electron desktop executable, runs on Windows, macOS, and Linux, and integrates with three independent LLM providers (Google Gemini, OpenAI GPT via SAIA, and Alibaba Qwen3) and two transcription backends (AssemblyAI cloud and Whisper local). Its modular architecture permits the addition of new AI providers or output formats with minimal modification to the core codebase.
Key capabilities at a glance:
| Capability | Detail |
|---|---|
| Input formats | MP4, MOV, WAV, MP3, FLAC (any FFmpeg-compatible source) |
| Output formats | PDF, DOCX, HTML, TXT |
| Transcription | AssemblyAI (cloud) · Whisper.cpp (local/offline) |
| LLM providers | Google Gemini · OpenAI GPT · Alibaba Qwen3 |
| Document types | Follow-up report · Agenda · Result protocol · Sprint planning note · Custom |
| Speaker handling | Automatic diarisation with post-hoc name substitution |
| Deployment | Cross-platform Electron desktop app |
| UI languages | English · German (extensible) |
2. Product Overview
2.1 Problem Statement
Recording meetings and lectures is widespread practice. However, converting those recordings into usable documents is time-intensive, error-prone, and requires significant cognitive effort. Manually transcribing a one-hour meeting and reformatting it as a professional report can require two to four hours of skilled effort.
V2D reduces this effort to a guided six-step workflow completed in minutes, with the document quality governed by the selected LLM and document-type prompt.
2.2 Target Users
| User Type | Description | Primary Need |
|---|---|---|
| End User (non-technical) | Meeting participants, students, knowledge workers | Generate a document from a recording with no setup |
| Developer | Engineers extending or integrating V2D | Understand architecture, add modules, contribute |
| Administrator | IT staff deploying V2D internally | Manage API keys, configure keyserver, maintain deployments |
2.3 Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | The system shall accept video and audio files as input | Must |
| FR-02 | The system shall extract audio from video files | Must |
| FR-03 | The system shall transcribe audio to text with speaker labels | Must |
| FR-04 | The system shall generate structured documents from transcripts using an LLM | Must |
| FR-05 | The system shall allow the user to assign real names to speakers | Must |
| FR-06 | The system shall export documents in PDF, DOCX, HTML, and TXT formats | Must |
| FR-07 | The system shall provide at least three standard document type templates | Must |
| FR-08 | The system shall allow users to create custom document type templates | Should |
| FR-09 | The system shall support at least two independent transcription backends | Should |
| FR-10 | The system shall support at least two independent LLM backends | Should |
| FR-11 | The system shall display processing progress to the user | Should |
| FR-12 | The system shall provide playable speaker audio previews | Could |
| FR-13 | The system shall support UI localisation to at least English and German | Could |
2.4 Non-Functional Requirements
| ID | Requirement | Category |
|---|---|---|
| NFR-01 | The application shall run on Windows 10+, macOS 12+, and Ubuntu 20.04+ | Portability |
| NFR-02 | API keys shall never be stored in source-controlled files | Security |
| NFR-03 | The UI shall be operable by a non-technical user without documentation | Usability |
| NFR-04 | Individual modules shall be replaceable without modifying the core pipeline | Maintainability |
| NFR-05 | The addition of a new LLM or transcription module shall require changes to exactly one directory | Extensibility |
| NFR-06 | Intermediate artefacts (audio, transcripts, HTML) shall be persisted on disk | Reliability |
| NFR-07 | The application shall report errors to the user via the UI | Usability |
| NFR-08 | All secrets shall be injected at runtime via environment variables or keyserver | Security |
| NFR-09 | The system shall complete processing of a 30-minute recording within 10 minutes (cloud transcription) | Performance |
| NFR-10 | The test suite shall cover all major module types | Testability |
2.5 Scope and Constraints
In scope:
- Desktop application for video/audio-to-document conversion
- Modular plugin system for transcription and LLM providers
- Multi-format document export
- Speaker diarisation and name mapping
Out of scope (current version):
- Real-time (streaming) transcription
- Web-based or SaaS deployment
- Voice biometric speaker identification
- Video editing or preview
- Multi-user collaboration or cloud storage
3. System Architecture
3.1 Architectural Overview
V2D follows a layered, modular pipeline architecture running within a single Electron process pair (main process + renderer process). There is no external server dependency for the core workflow; all orchestration is local.
┌─────────────────────────────────────────────────────────────────────┐
│ Electron Application │
│ │
│ ┌──────────────────────┐ ┌──────────────────────────────┐ │
│ │ Renderer Process │ IPC │ Main Process │ │
│ │ (Chromium / HTML) │◄──────►│ (Node.js / Orchestrator) │ │
│ │ │ │ │ │
│ │ index.html │ │ main.js │ │
│ │ script.js │ │ module-handler.js │ │
│ │ renderer.js │ │ mapFunctions (global Map) │ │
│ │ languages.js │ │ │ │
│ └──────────────────────┘ └──────────────┬───────────────┘ │
│ │ │ │
│ preload.js ┌────────────▼────────────┐ │
│ (contextBridge / IPC) │ Module Registry │ │
│ │ (mapFunctions Map) │ │
│ └────────────┬────────────┘ │
│ │ │
└─────────────────────────────────────────────────┼───────────────────┘
│
┌───────────────────┬───────────────┼───────────────┐
│ │ │ │
┌─────────▼──────┐ ┌─────────▼──────┐ ┌────▼──────┐ ┌────▼──────┐
│ Extraction │ │ Transcription │ │ LLM │ │ Converter │
│ ffmpegExtract │ │ assembly.js │ │ gemini.js │ │ convert.js│
│ │ │ whisperLocal │ │ chatgpt.js│ │ │
└────────┬───────┘ └───────┬────────┘ │ qwen3.js │ └───────────┘
│ │ └────────────┘
▼ ▼
┌─────────────────────────────────┐
│ /storage/ (disk) │
│ /audio /transcripts │
│ /transcriptionSummaries │
│ /documents /documentType │
└─────────────────────────────────┘
3.2 Process Architecture
Electron provides two isolated JavaScript contexts:
| Process | Runtime | Responsibilities |
|---|---|---|
| Main Process | Node.js | File system, pipeline execution, module loading, IPC handlers, native dialogs |
| Renderer Process | Chromium (sandboxed) | UI rendering, user interaction, progress display |
| Preload Script | Node.js (bridge) | Exposes a controlled API surface to the renderer via contextBridge |
Node integration in the renderer is explicitly disabled. All privileged operations are routed through the preload bridge.
3.3 Module System
The module system is the central extensibility mechanism. On startup, main.js traverses all subdirectories of services/modules/, dynamically requires each .js file, and registers the exported module object in a global Map<string, Module> keyed by module.name.
services/modules/
├── extraction/ ffmpegExtractor.js
├── transcription-remote/ assembly.js
├── transcription-local/ whisperLocal.ts
├── jsonTools/ transcriptionSummarizer2.js
├── llm-gemini/ gemini.js
├── llm-chat_gpt/ chatgpt.js
├── quen3/ qwen3.js
├── convert/ convert.js
├── audioSnippets/ extract-speaker-snippets.js
├── replace_speaker/ replaceSpeaker.js
└── utility/ @startup.js · module-handler.js
Each module conforms to the following interface:
module.exports = {
name: "unique-module-identifier", // key in mapFunctions
type: "llm|transcription|...", // category label
displayname: "Human-Readable Name", // shown in UI dropdowns
description: "What this module does",
audioformat: "mp3", // (transcription modules only)
async function(parameter) { // primary entry point
// Implementation
}
}
Modules are invoked as: mapFunctions.get("module-name").function(params).
3.4 Pipeline Execution Flow
The six-step user workflow maps to the following internal pipeline stages:
Stage 1 │ VIDEO INPUT User selects file via native file dialog
│ IPC: file_submit → main.js
▼
Stage 2 │ AUDIO EXTRACTION ffmpegExtractor extracts audio stream
│ Output: /storage/audio/<file>.<format>
▼
Stage 3 │ TRANSCRIPTION AssemblyAI or Whisper.cpp processes audio
│ Output: /storage/transcripts/<id>.json
▼
Stage 4 │ SUMMARISATION Transcript compacted to sentence-level objects
│ Output: /storage/transcriptionSummaries/<id>.json
│ Speaker audio snippets: /storage/audio/speakerSnippets/
▼
Stage 5 │ LLM GENERATION Prompt + summarised transcript → LLM → HTML
│ Output: /storage/documents/<id>.html
▼
Stage 6 │ SPEAKER SUBSTITUTION Regex replace speakerA/B/C with real names
│ + Format conversion (HTML → PDF/DOCX/TXT)
▼
│ EXPORT Native OS save dialog → user's chosen location
3.5 IPC Communication Map
The following IPC channels are defined between the renderer and main processes:
| Channel | Direction | Trigger | Payload |
|---|---|---|---|
file_submit |
Renderer → Main | User submits step 1 form | Pipeline configuration object |
file_download |
Renderer → Main | User clicks Download | { type, speakerMap } |
speaker_submit |
Renderer → Main | User confirms speaker names | Speaker name map |
progress |
Main → Renderer | After each pipeline stage | Stage number (1–4) |
speakerAudios |
Main → Renderer | After snippet extraction | Speaker audio path map |
error |
Main → Renderer | Any pipeline exception | Error message string |
get-module-names |
Renderer ↔ Main | UI init | Module list |
get-txt-files |
Renderer ↔ Main | UI init | Document type filenames |
save-txt-file |
Renderer ↔ Main | Custom type editor save | { filename, content } |
read-txt-file |
Renderer ↔ Main | Custom type editor load | Filename |
delete-txt-file |
Renderer ↔ Main | Custom type editor delete | Filename |
4. Technology Stack
4.1 Core Runtime and Framework
| Component | Technology | Version | Purpose |
|---|---|---|---|
| Runtime | Node.js | 19 – 24 | JavaScript execution environment |
| Desktop Framework | Electron | 39.1.1 | Cross-platform desktop application shell |
| HTTP Framework | Express | 5.1.0 | Local HTTP server (IPC supplement; API Skeleton) |
| Primary Language | JavaScript (ES2022) | — | All production modules |
| Secondary Language | TypeScript | 5.9.3 | Local transcription module, pipeline job examples |
| TypeScript Runtime | ts-node | 10.9.2 | Execute TypeScript without pre-compilation |
4.2 AI and LLM Providers
| Provider | Module | Environment Variable | Model | Mode |
|---|---|---|---|---|
| Google Gemini | llm-gemini |
GOOGLE_API_KEY |
gemini-2.0-flash (default) | Cloud |
| OpenAI GPT (via SAIA) | llm-saia_openai_gpt |
SAIA_API_KEY |
OSS 120B | Cloud |
| Alibaba Qwen3 | qwen3-235b-a22b |
SAIA_API_KEY |
Qwen3-235B-A22B | Cloud |
4.3 Transcription Backends
| Provider | Module | Environment Variable | Mode | Speaker Labels |
|---|---|---|---|---|
| AssemblyAI | assembly |
ASSEMBLYAI_API_KEY |
Cloud (remote) | Yes |
| Whisper.cpp | whisperLocal |
None required | Local (offline) | Limited |
4.4 Audio and Video Processing
| Library | Version | Purpose |
|---|---|---|
fluent-ffmpeg |
Latest | Node.js wrapper for FFmpeg CLI |
ffmpeg-static |
Latest | Bundled static FFmpeg binary |
Supported input container formats: any FFmpeg-compatible format (MP4, MOV, AVI, MKV, WebM for video; WAV, MP3, FLAC, M4A for audio).
Extraction output formats: WAV (16kHz, mono), MP3, FLAC — determined per transcription module's audioformat property.
4.5 Document Generation and Export
| Library | Purpose | Output |
|---|---|---|
html-to-docx |
Convert HTML to Word document | .docx |
puppeteer |
Headless Chromium for PDF rendering | .pdf |
| Native file write | Plain text export | .txt |
| Native file write | HTML export | .html |
4.6 Networking and Utilities
| Library | Purpose |
|---|---|
axios |
HTTP client for external API calls |
dotenv |
Load environment variables from .env file |
4.7 Development and Test Dependencies
| Library | Purpose |
|---|---|
mocha (v11.7.5) |
Unit and integration test runner |
@types/node (v24.9.2) |
TypeScript Node.js type definitions |
@types/fluent-ffmpeg |
TypeScript FFmpeg type definitions |
typescript (v5.9.3) |
TypeScript compiler |
ts-node (v10.9.2) |
Direct TypeScript execution |
5. Installation & Setup Guide
5.1 Prerequisites
Before installing V2D, ensure the following are present on the target machine:
| Requirement | Minimum Version | Notes |
|---|---|---|
| Node.js | 19.x | Versions 19 through 24 are supported. Use node -v to verify. |
| npm | 9.x | Bundled with Node.js. |
| Git | Any recent | Required to clone the repository. |
| Internet connection | — | Required for cloud transcription and LLM services. |
| FFmpeg | Any recent | Bundled via ffmpeg-static; a separate system install is not required. |
Note: A separate system-level FFmpeg installation is not required. The
ffmpeg-staticpackage provides a platform-appropriate binary that is automatically used byfluent-ffmpeg.
5.2 Obtaining the Source Code
git clone <repository-url>
cd video2document
Replace <repository-url> with the actual repository URL provided by your administrator.
5.3 Installing Dependencies
From the project root directory:
npm install
This installs all dependencies declared in package.json, including Electron, Puppeteer, and all service libraries. The installation may take several minutes on the first run due to Puppeteer downloading a Chromium binary.
5.4 Environment Configuration
V2D requires API keys for external services. These are provided via environment variables.
Option A — Local .env file (development / personal use):
Create a file named .env in the project root:
GOOGLE_API_KEY=your_google_gemini_key
ASSEMBLYAI_API_KEY=your_assemblyai_key
SAIA_API_KEY=your_saia_platform_key
The
.envfile is Git-ignored and will never be committed to the repository.
Option B — Keyserver (organisational deployment):
If your organisation operates the V2D keyserver, set the following variables instead:
auth_username=your_keyserver_username
auth_password=your_keyserver_password
On startup, the application contacts keyserver.dommymommy.xyz:443 and retrieves all required API keys automatically. See Section 6.2 for details.
5.5 Starting the Application
Linux / macOS:
./start.sh
Or manually:
npm install # if not already done
npm start
Windows:
start.bat
Or manually from Command Prompt or PowerShell:
npm install
npm start
The Electron window will open automatically after a brief initialisation period.
5.6 Verifying the Installation
After launch, the V2D main window should appear showing Step 1 (file selection). If the window does not appear:
- Check the terminal / command prompt for Node.js error output.
- Confirm Node.js version with
node -v. - Confirm
npm installcompleted without errors. - Confirm at least one set of API credentials is present in
.envor via keyserver.
6. Configuration Guide
6.1 Environment Variables Reference
| Variable | Required By | Description |
|---|---|---|
GOOGLE_API_KEY |
llm-gemini module |
Google Cloud API key with Generative Language API enabled |
ASSEMBLYAI_API_KEY |
assembly transcription module |
AssemblyAI account API key |
SAIA_API_KEY |
llm-saia_openai_gpt, qwen3 modules |
SAIA platform API key for hosted LLM access |
auth_username |
Keyserver auth | Username credential for automatic key retrieval |
auth_password |
Keyserver auth | Password credential for automatic key retrieval |
Only the variables corresponding to the modules you intend to use are strictly required. The application will report a clear error if a module is invoked without its required key.
6.2 Keyserver Authentication
The keyserver is a proprietary key distribution service hosted at keyserver.dommymommy.xyz:443. When credentials are provided, the startup module (@startup.js) performs the following sequence:
- Issues a
GETrequest to the keyserver endpoint withauth_usernameandauth_passwordas HTTP headers. - The server responds with a JSON object containing API key–value pairs.
- Keys are merged into
process.envwithin the main process. - Keys are never forwarded to the renderer process.
This mechanism allows teams to rotate keys centrally without redistributing .env files to individual users.
6.3 Document Type Templates
Document type templates control the instructions given to the LLM when generating output. They are stored as plain text files in /storage/documentType/.
Built-in templates:
| Filename | Purpose |
|---|---|
followup_report.txt |
Professional post-meeting follow-up report |
agenda.txt |
Reconstructed meeting agenda from transcript |
result_protocol.txt |
Formal meeting minutes / result protocol |
sprint_planning_note.txt |
Agile sprint planning session documentation |
custom_document.txt |
Placeholder for user-defined format |
Creating a custom template:
Templates are editable directly through the V2D UI (Custom Type Editor, accessible from the document type step) or by editing the .txt files directly in /storage/documentType/.
Template authoring guidelines:
- Write instructions for the LLM in clear, imperative sentences.
- Reference speaker tokens using the exact placeholder strings
speakerA,speakerB,speakerC, etc. - Specify the desired output structure (headings, bullet lists, tables).
- Include language instructions if multilingual output is required.
- Do not include the transcript itself — it is automatically appended by the pipeline.
6.4 Module Configuration
Modules are self-describing. To enable or disable a module, add or remove its .js file from its directory under services/modules/. The module will be included or excluded from all UI dropdowns automatically on next application start.
To change the default Gemini model, edit the model constant in services/modules/llm-gemini/gemini.js.
6.5 Storage Paths
All artefact directories are located under /storage/ at the project root. These paths are currently hard-coded and are not configurable without modifying the relevant module source files. Future versions may expose these as configuration parameters.
7. Usage Guide — End User Perspective
7.1 Workflow Overview
V2D presents users with a linear six-step wizard. Steps must be completed in order. Each step requires a selection or action before the next step becomes available.
Step 1 → Step 2 → Step 3 → Step 4 → [ Processing ] → Step 5 → Step 6
File Transcrip- Document Output Preview Download
Select tion Svc Type Format & Names
7.2 Step-by-Step Instructions
Step 1 — Select Video or Audio File
- Click Browse or drag and drop a file onto the designated area.
- Accepted formats: any FFmpeg-compatible video or audio file (MP4, MOV, WAV, MP3, FLAC, etc.).
- The selected filename is displayed for confirmation.
- Click Next to proceed.
Step 2 — Select Transcription Service
- A dropdown lists all registered transcription modules.
- Select the desired backend:
- AssemblyAI — cloud-based, highest accuracy, requires internet and API key.
- Whisper Local — offline, no API key, lower accuracy for multi-speaker content.
- Click Next to proceed.
Step 3 — Select Document Type
- A dropdown lists all document type templates found in
/storage/documentType/. - Select the template matching your desired output format (e.g., "Follow-up Report").
- Optionally, click Edit / Create Custom Type to open the template editor.
- Click Next to proceed.
Step 4 — Select Output Format and LLM
- Select the export format: PDF, DOCX, HTML, or TXT.
- Select the LLM provider from the dropdown.
- Click Generate Document to begin processing.
Processing Phase
A progress bar advances through four stages:
| Stage | Description |
|---|---|
| 1 / 4 | Extracting audio from video |
| 2 / 4 | Transcribing audio (cloud or local) |
| 3 / 4 | Generating document with LLM |
| 4 / 4 | Converting to selected output format |
Processing time depends on recording length and network speed. A 30-minute recording typically completes in 3–8 minutes with cloud services.
Step 5 — Review Document and Assign Speaker Names
- A preview of the generated document is displayed.
- For each detected speaker (labelled Speaker A, Speaker B, etc.):
- A short audio clip of that speaker is playable for identification.
- A text field allows entry of the speaker's real name.
- Click Apply Names to replace all speaker placeholders in the document.
- Review the updated preview.
Step 6 — Download / Export
- Click Download.
- A native OS save dialog opens.
- Choose the destination path and filename.
- The file is saved in the format selected in Step 4.
7.3 UI Language Selection
The UI language can be changed via the language selector (flag icons). Currently supported languages: English, German. The selection applies immediately to all UI text.
7.4 Help Page
A built-in help page is accessible from the main toolbar. It provides concise per-step guidance and explains the purpose of each configuration option.
7.5 Error Messages
If the pipeline encounters an error, a notification is displayed in the UI with a description of the failure. Common causes include:
- Missing or invalid API key for the selected service
- Network connectivity loss during cloud transcription
- Unsupported or corrupted input file
- Insufficient disk space in the
/storage/directory
8. API Documentation
8.1 Status
The REST API component (/API-Skeleton/) is a non-integrated template. It defines the intended endpoint structure for a future server-mode version of V2D. The current production application communicates exclusively via Electron IPC.
This section documents both the existing IPC interface (authoritative) and the planned REST API (informational).
8.2 IPC API (Production)
The IPC interface is the authoritative communication mechanism between the renderer and main processes. It is accessed via the window.electron object injected by the preload script.
window.electron.submitFile(payload)
Initiates the full processing pipeline.
Direction: Renderer → Main
Parameters:
interface SubmitFilePayload {
video: {
module: string; // e.g. "extraction-video-to-audio"
inputVideoPath: string; // Absolute path to input file
};
transcription: {
module: string; // e.g. "assembly"
};
document: {
module: string; // e.g. "llm-gemini"
type: string; // Document type template name (without .txt)
outputType: "pdf" | "docx" | "html" | "txt";
};
}
Example:
window.electron.submitFile({
video: {
module: "extraction-video-to-audio",
inputVideoPath: "/Users/name/recordings/meeting.mp4"
},
transcription: { module: "assembly" },
document: {
module: "llm-gemini",
type: "followup_report",
outputType: "pdf"
}
});
window.electron.downloadFile(payload)
Triggers the final document export.
Direction: Renderer → Main
Parameters:
interface DownloadPayload {
type: "pdf" | "docx" | "html" | "txt";
speakerMap: Record<string, string>; // e.g. { "speakerA": "Alice", "speakerB": "Bob" }
}
window.electron.onProgress(callback)
Registers a listener for pipeline progress events.
Direction: Main → Renderer
Callback parameter: (stage: number) => void — stage is an integer 1 through 4.
window.electron.onSpeakerAudios(callback)
Receives speaker audio clip paths after diarisation.
Direction: Main → Renderer
Callback parameter:
(speakerMap: Record<string, { src: string; name: string }>) => void
window.electron.onError(callback)
Registers a listener for pipeline error events.
Direction: Main → Renderer
Callback parameter: (message: string) => void
window.electron.getModuleNames()
Returns the list of registered module names, grouped by type.
Direction: Renderer ↔ Main (invoke/handle)
window.electron.getTxtFiles()
Returns filenames of all document type templates.
Direction: Renderer ↔ Main (invoke/handle)
window.electron.saveTxtFile(filename, content)
Persists a new or updated document type template.
Direction: Renderer ↔ Main (invoke/handle)
window.electron.readTxtFile(filename)
Returns the content of a named document type template.
Direction: Renderer ↔ Main (invoke/handle)
window.electron.deleteTxtFile(filename)
Deletes a named document type template.
Direction: Renderer ↔ Main (invoke/handle)
8.3 REST API Skeleton (Planned — Not Integrated)
Base URL: http://localhost:3000
Authentication: Not yet defined.
Health Check
GET /api/v1/healthcheck
Response: 200 OK — "running"
Video Endpoints
| Method | Path | Description | Status |
|---|---|---|---|
POST |
/videos |
Upload a video file for processing | Planned |
GET |
/videos |
List all processed videos | Planned |
GET |
/videos/:id |
Retrieve a specific video record | Planned |
DELETE |
/videos/:id |
Delete a video record | Planned |
Document Endpoints
| Method | Path | Description | Status |
|---|---|---|---|
GET |
/documents |
List all generated documents | Planned |
GET |
/documents/:id |
Retrieve a specific document | Planned |
DELETE |
/documents/:id |
Delete a document record | Planned |
9. Data Model Description
9.1 Primary Data Artefacts
V2D does not use a relational database in the current version. All persistent data is stored as files in structured directories under /storage/.
9.2 Raw Transcript (AssemblyAI Output)
Stored in: /storage/transcripts/<transcript_id>.json
interface TranscriptWord {
text: string; // Individual word token
speaker: string; // Speaker label, e.g. "A", "B"
start: number; // Word start time, milliseconds
end: number; // Word end time, milliseconds
confidence?: number;
}
interface RawTranscript {
id: string;
status: "completed" | "error";
text: string; // Full transcript as single string
words: TranscriptWord[];
language_code?: string;
utterances?: TranscriptUtterance[];
}
9.3 Summarised Transcript
Stored in: /storage/transcriptionSummaries/<id>.json
Produced by transcriptionSummarizer2.js. Word-level tokens are grouped into complete sentences per speaker.
interface SummarisedSegment {
speaker: string; // Speaker label, e.g. "A"
sentence: string; // Reconstructed sentence text
start: number; // Segment start time, milliseconds
end: number; // Segment end time, milliseconds
}
type SummarisedTranscript = SummarisedSegment[];
9.4 Speaker Audio Snippet Map
Stored in memory; audio files written to /storage/audio/speakerSnippets/.
interface SpeakerSnippetMap {
[speakerKey: string]: { // e.g. "speakerA"
src: string; // Absolute path to WAV snippet file
name: string; // Same as speakerKey
};
}
9.5 Pipeline Input Payload
Passed via IPC from renderer to main process when the user initiates processing.
interface PipelinePayload {
video: {
module: string;
inputVideoPath: string;
};
transcription: {
module: string;
};
document: {
module: string;
type: string;
outputType: "pdf" | "docx" | "html" | "txt";
};
}
9.6 Module Registry Entry
Each entry in the global mapFunctions Map conforms to:
interface ModuleDefinition {
name: string;
type: string;
displayname: string;
description: string;
audioformat?: string;
function: (parameter: unknown) => Promise<unknown>;
}
9.7 Document Type Template
Stored in: /storage/documentType/<name>.txt
Plain text files containing LLM prompt instructions. No structured schema; content is author-defined. Templates are prepended to the summarised transcript before submission to the LLM.
10. Security Considerations
10.1 Secret Management
Threat: API key exposure in source code or version control.
Mitigations in place:
.envis listed in.gitignoreand is never committed.- API keys are loaded at runtime via
dotenvor the keyserver, not hard-coded. - Keys are stored only in
process.envwithin the main process. - The preload script's
contextBridgeexposes no key values to the renderer process. - The GitLab CI/CD pipeline injects keys via protected CI/CD variables, not plain-text files.
Residual risk: If a user manually adds API keys to a tracked source file, they will be committed. Developer education and pre-commit hook enforcement (e.g., git-secrets) are recommended mitigations.
10.2 Electron Security Model
| Control | Status | Notes |
|---|---|---|
nodeIntegration: false |
Enabled | Renderer has no direct Node.js access |
contextIsolation: true |
Enabled | Renderer and preload run in separate contexts |
sandbox (preload) |
Enabled | Limits preload capabilities |
webSecurity |
Default (enabled) | Standard Chromium web security applies |
| Content Security Policy | Not explicitly set | Recommended for production hardening |
Recommendation: Add an explicit Content Security Policy header to the index.html file to prevent XSS attacks and restrict resource loading origins.
10.3 Data Handling
Local data: Transcripts, audio snippets, and generated documents are stored in the local /storage/ directory. These files contain the full text of the processed meeting or lecture.
Implications:
- Files persist after the session ends and are not automatically purged.
- On shared machines, other users with file system access can read these artefacts.
- Organisations handling sensitive content (e.g., confidential meetings) should restrict operating system file permissions on the V2D installation directory.
- A storage clean-up mechanism (scheduled or on-demand) is recommended for future versions.
Cloud data: When using cloud transcription (AssemblyAI) or cloud LLMs (Gemini, SAIA), audio and transcript content is transmitted to external services. Users should review the data retention and privacy policies of each provider before processing sensitive content.
10.4 Network Security
- All external API calls use HTTPS.
- The keyserver endpoint (
keyserver.dommymommy.xyz:443) uses TLS. - No inbound network ports are opened by the application in normal operation.
- The API Skeleton (Express server) would open port 3000 if started; this is not part of the default startup.
10.5 Input Validation
- File path inputs are passed to FFmpeg via the
fluent-ffmpeglibrary, which provides a layer of sanitisation. - Document type template names are used to construct file paths; input from the UI should be validated to prevent path traversal (e.g., filenames containing
../). This is an identified hardening item.
10.6 Risk Assessment Summary
| Risk | Likelihood | Impact | Mitigation Status |
|---|---|---|---|
| API key committed to version control | Low | High | Mitigated (.gitignore, CI variables) |
| Sensitive meeting content transmitted to cloud | Medium | High | Documented; user decision |
| Local artefact exposure on shared machines | Medium | Medium | Partially mitigated (user responsibility) |
| Path traversal in template file naming | Low | Medium | Not yet mitigated — requires input sanitisation |
| XSS in generated HTML document preview | Low | Low | Partially mitigated by Electron sandbox |
| Keyserver unavailability blocking startup | Medium | Medium | Fallback: local .env |
11. Performance & Scalability Notes
11.1 Performance Characteristics
| Stage | Bottleneck | Typical Duration (30-min recording) |
|---|---|---|
| Audio extraction | Disk I/O + FFmpeg CPU | 10–30 seconds |
| Cloud transcription (AssemblyAI) | Network + provider queue | 2–5 minutes |
| Local transcription (Whisper) | CPU / memory | 5–20 minutes |
| Transcript summarisation | CPU (JavaScript) | < 5 seconds |
| LLM document generation | Network + provider latency | 20–90 seconds |
| Document conversion (PDF) | Puppeteer / Chromium | 5–15 seconds |
| Speaker snippet extraction | Disk I/O + FFmpeg | 10–30 seconds |
Total (cloud, typical): 3–8 minutes for a 30-minute recording.
Total (local Whisper): 8–25 minutes depending on hardware.
11.2 Memory Consumption
- Electron base overhead: ~150–300 MB RAM.
- Puppeteer (PDF generation): additional ~200–400 MB during conversion.
- Long transcripts loaded into memory as JSON: typically < 10 MB.
- Speaker audio snippets: 1–5 MB each; total depends on speaker count.
Peak memory usage for a typical session: 500–800 MB. Ensure the host machine has at least 4 GB RAM available for comfortable operation.
11.3 Disk Space Requirements
| Artefact Type | Typical Size per Session |
|---|---|
| Extracted audio (WAV 16kHz mono, 30 min) | ~50 MB |
| Raw transcript JSON | 0.5–2 MB |
| Summarised transcript JSON | 0.1–0.5 MB |
| Speaker audio snippets (WAV) | 0.5–5 MB total |
| Generated HTML document | 10–100 KB |
| PDF export | 50–500 KB |
| DOCX export | 50–200 KB |
Accumulated storage can grow significantly over time if artefacts are not periodically cleared.
11.4 Scalability Constraints
The current architecture is single-user, single-session. It is not designed for concurrent pipeline executions and has the following inherent constraints:
- Only one pipeline job can run at a time within a single Electron instance.
- The
mapFunctionsglobal Map andstorage/directory are shared across all sessions of the same instance. - The application does not implement job queuing.
For multi-user or server-side deployment, the REST API Skeleton provides a foundation, but significant development is required to add persistence, concurrency control, and job management.
11.5 Operational Impact Analysis
| Scenario | Impact |
|---|---|
| AssemblyAI API outage | Transcription unavailable; local Whisper fallback is available |
| SAIA platform outage | Qwen3 and GPT modules unavailable; Gemini fallback is available |
| Google Gemini API outage | Gemini module unavailable; SAIA-hosted models are available |
| Keyserver outage | Application cannot retrieve keys; local .env fallback required |
| Puppeteer Chromium failure | PDF export unavailable; HTML, DOCX, and TXT exports remain functional |
| FFmpeg binary failure | Audio extraction fails; entire pipeline blocked |
12. Deployment Guide
12.1 Local Development Deployment
Follow Section 5 exactly. This is the standard and only fully supported deployment mode.
12.2 Electron Package Build (Distribution)
V2D can be packaged into a distributable executable using electron-builder or electron-forge. Neither is currently configured in package.json; the following steps describe the integration procedure.
Install a build tool:
npm install --save-dev electron-builder
Add build configuration to package.json:
{
"build": {
"appId": "com.your-org.video2document",
"productName": "Video2Document",
"directories": { "output": "dist" },
"files": ["main.js", "requires.js", "electron/", "services/", "storage/documentType/"],
"mac": { "target": "dmg" },
"win": { "target": "nsis" },
"linux": { "target": "AppImage" }
}
}
Build:
npx electron-builder build --mac --win --linux
Important: Do not bundle
.envfiles or API keys into the distribution package. The keyserver mechanism or user-provided.envfiles should be used in distributed builds.
12.3 GitLab CI/CD Pipeline
The .gitlab-ci.yml file defines a single test stage:
image: node:latest
stages:
- test
job-test:
stage: test
script:
- npm install
- echo "ASSEMBLYAI_API_KEY=$apikey_assembly" > .env
- echo "GOOGLE_API_KEY=$apikey_gemini" >> .env
- echo "SAIA_API_KEY=$apikey_saia" >> .env
- npm test
CI/CD variables required (set in GitLab project settings → CI/CD → Variables):
| Variable | Description |
|---|---|
apikey_assembly |
AssemblyAI API key |
apikey_gemini |
Google Gemini API key |
apikey_saia |
SAIA platform API key |
Mark all variables as Protected and Masked in GitLab settings to prevent exposure in job logs.
12.4 Adding the Application to a New System
For end-user distribution without requiring source code access:
- Build the Electron package per Section 12.2.
- Distribute the installer (
*.dmg,*.exe, or*.AppImage). - Instruct users to create a
.envfile in the application's user data directory, or configure keyserver credentials.
Note: The
storage/directory must be writable at runtime. Electron packages typically store user data in the OS-standard application data directory. Path configuration may need adjustment for packaged builds.
12.5 Server-Side Deployment (Future)
The API Skeleton (/API-Skeleton/) provides an Express server template for eventual server-side deployment. To productionise:
- Implement the stub endpoints in
video_Router.jsanddocument_Router.js. - Add a persistent database (e.g., PostgreSQL with an ORM).
- Add authentication middleware.
- Run behind a reverse proxy (nginx, Caddy) with TLS termination.
- Containerise with Docker for consistent deployments.
13. Testing Strategy
13.1 Overview
V2D employs Mocha as the test runner. Tests are located in /test/unit/test.js (comprehensive module tests) and /test/integration/ (service-specific integration tests).
13.2 Test Suite Structure
| Suite | File | Test Count | Coverage Area |
|---|---|---|---|
| Audio Extraction | test/unit/test.js |
4 | FFmpeg extraction, format variants, error cases |
| Audio Transcription | test/unit/test.js |
2 | AssemblyAI integration, invalid input handling |
| Transcript Summarisation | test/unit/test.js |
4 | Summariser v1 and v2, error cases |
| LLM Document Generation | test/unit/test.js |
6 | ChatGPT, Gemini, Qwen3, missing template error |
| Audio Snippets | test/unit/test.js |
3 | Speaker snippet extraction, error handling |
| AssemblyAI Integration | test/integration/assembly.test.js |
Variable | End-to-end transcription |
| Summariser Integration | test/integration/transcriptionSummarizerTest.js |
Variable | Summariser with real transcript |
| Gemini Integration | test/integration/gemini/test-gemini.js |
Variable | End-to-end Gemini generation |
13.3 Running Tests
Full test suite:
npm test
This executes: mocha ./test/unit/test.js
Individual integration tests:
node test/integration/assembly.test.js
node test/integration/gemini/test-gemini.js
13.4 Test Timeouts
| Category | Timeout |
|---|---|
| Audio extraction | 3,000 ms |
| Transcription (cloud) | 120,000 ms |
| LLM generation | 120,000 ms |
| Document conversion | 30,000 ms |
13.5 Test Environment Requirements
All tests that invoke external APIs require valid credentials in .env or via the keyserver. Tests will fail with authentication errors if credentials are absent.
CI/CD injects credentials as GitLab CI variables (see Section 12.3).
13.6 Coverage Gaps
| Area | Status | Notes |
|---|---|---|
| UI / Renderer process | Not tested | No UI automation tests exist |
| IPC channel handling | Not tested | No mock IPC tests |
| Local Whisper transcription | Incomplete | Module is TypeScript, partially implemented |
| Document conversion | Not tested independently | Covered indirectly by LLM tests |
| Speaker name replacement | Not tested | replaceSpeaker.js lacks unit tests |
| API Skeleton endpoints | Not tested | Skeleton only |
13.7 Recommended Additional Tests
- Unit tests for
replaceSpeaker.jswith edge cases (names with special characters, multiple occurrences). - Unit tests for
transcriptionSummarizer2.jswith synthetic transcript JSON. - Mocked IPC tests for the main process handlers in
main.js. - End-to-end UI tests using a framework such as Playwright for Electron.
- Input validation tests for file path injection and template name sanitisation.
14. Maintenance & Support Plan
14.1 Dependency Management
| Activity | Frequency | Responsibility |
|---|---|---|
Review npm outdated for security advisories |
Monthly | Developer/Admin |
| Update Electron to latest stable version | Per major release | Developer |
| Update Puppeteer Chromium dependency | Per major release | Developer |
| Review external API provider changelog (AssemblyAI, Gemini, SAIA) | Monthly | Developer |
| Rotate API keys | Per organisation policy | Administrator |
Run npm audit regularly and address high-severity advisories promptly.
14.2 Storage Maintenance
The /storage/ directory accumulates artefacts indefinitely. A maintenance schedule is recommended:
| Artefact | Retention Recommendation |
|---|---|
| Raw audio extracts | Delete after session or after 7 days |
| Raw transcripts (JSON) | Retain for 30 days for debugging |
| Summarised transcripts | Retain for 30 days |
| Speaker audio snippets | Delete after session |
| Generated HTML documents | Retain until user exports final document |
A clean-up script or UI option to purge artefacts is recommended for future versions.
14.3 Adding a New Module
To add a new LLM or transcription provider:
- Create a directory under
services/modules/<module-category>/. - Create a
.jsfile following the module interface template. - Export the module object with a unique
namefield. - Add the required API key to
.envand to the keyserver. - Restart the application — the module is auto-discovered and appears in UI dropdowns.
- Write unit tests in
test/unit/test.jsfollowing existing patterns.
No changes to main.js, the module handler, or the UI are required.
14.4 Monitoring and Logging
V2D currently logs to console.log / console.error in the main process. In a production distribution, these logs are accessible via:
- macOS:
~/Library/Logs/<AppName>/main.log - Windows:
%APPDATA%\<AppName>\logs\main.log - Linux:
~/.config/<AppName>/logs/main.log
(Exact paths depend on the app.getPath('userData') value set in the packaged application.)
Structured logging (e.g., using the winston library) and log rotation are recommended for operational deployments.
14.5 Support Channels
| Channel | Use |
|---|---|
| GitLab Issues | Bug reports, feature requests |
| GitLab CI/CD | Automated regression testing |
| README.md | End-user setup guidance |
| documentation.md | Developer architectural reference |
| This document | Comprehensive operational reference |
15. Limitations & Known Issues
15.1 Functional Limitations
| ID | Limitation | Affected Feature | Workaround |
|---|---|---|---|
| LIM-01 | Local Whisper transcription is not fully implemented | Transcription | Use AssemblyAI |
| LIM-02 | REST API is a non-integrated skeleton | API access | Use Electron IPC (desktop app) |
| LIM-03 | No database persistence; artefacts are flat files | Session history | Manual file management |
| LIM-04 | Single concurrent pipeline only | Throughput | Process files sequentially |
| LIM-05 | Speaker identification is label-based (A, B, C), not biometric | Speaker accuracy | Manual name assignment in Step 5 |
| LIM-06 | Custom template names are not validated for path traversal | Security | Avoid special characters in template names |
| LIM-07 | Storage paths are hard-coded in module source files | Portability | Edit source files to relocate storage |
| LIM-08 | No automatic artefact clean-up | Disk usage | Manually clear /storage/ directories |
| LIM-09 | Packaged Electron distribution not yet configured | Distribution | Run from source |
| LIM-10 | No Content Security Policy defined | Security | Manual CSP header addition |
15.2 Known Issues
| ID | Issue | Severity | Status |
|---|---|---|---|
| KI-01 | Whisper local module (whisperLocal.ts) is incomplete; invoking it will fail |
High | Open |
| KI-02 | Long transcripts may exceed LLM context window limits, truncating output | Medium | Open |
| KI-03 | Puppeteer may fail on systems with strict sandbox policies (e.g., Docker without --no-sandbox) |
Medium | Workaround: add --no-sandbox flag |
| KI-04 | Speaker diarisation accuracy decreases with more than 4 concurrent speakers | Medium | Open |
| KI-05 | Mixed JS/TS codebase causes inconsistent linting and type coverage | Low | Open |
| KI-06 | No validation that input file is readable/non-corrupt before pipeline start | Low | Open |
16. Future Improvements / Roadmap
The following improvements are identified based on current limitations, security analysis, and operational requirements. Items are grouped by priority.
16.1 Priority 1 — Critical Gaps
| Item | Description |
|---|---|
| Complete Whisper local transcription | Fully implement whisperLocal.ts for offline, privacy-preserving operation |
| Input path validation | Sanitise all user-provided file paths and template names to prevent traversal attacks |
| Content Security Policy | Add explicit CSP headers to index.html |
| Artefact lifecycle management | Implement automatic or on-demand clean-up of /storage/ artefacts |
16.2 Priority 2 — Important Improvements
| Item | Description |
|---|---|
| Electron distribution packaging | Configure electron-builder for one-click installer distribution |
| Structured logging | Replace console.log with a structured logger (winston) with log rotation |
| Context-length handling | Split long transcripts into chunks if they approach LLM token limits |
| TypeScript migration | Progressively convert all modules to TypeScript for type safety and tooling |
| Coverage expansion | Add UI automation tests (Playwright for Electron), IPC mock tests, and replacement module tests |
16.3 Priority 3 — Feature Enhancements
| Item | Description |
|---|---|
| REST API integration | Integrate the API Skeleton into the production application for headless/server operation |
| Session history | Implement a lightweight SQLite database to persist session metadata and enable document retrieval |
| Multi-language transcription | Explicitly expose language selection to the user for improved transcription accuracy |
| Concurrent pipeline support | Add a job queue to allow multiple files to be processed sequentially without UI blocking |
| Streaming progress | Display detailed sub-step progress (e.g., transcription percentage) rather than four discrete stages |
16.4 Priority 4 — Strategic / Long-Term
| Item | Description |
|---|---|
| Web / SaaS mode | Deploy V2D as a web application for browser access without installation |
| Speaker biometrics | Integrate voice-print identification to automatically name known speakers |
| Custom LLM hosting | Support self-hosted LLM endpoints (Ollama, LM Studio) for full on-premise operation |
| Document templates marketplace | Allow sharing and importing community-created document type templates |
| Enterprise SSO | Integrate with SAML/OIDC for organisational user management |
17. Glossary
| Term | Definition |
|---|---|
| API (Application Programming Interface) | A defined interface that allows software components to communicate with each other. |
| AssemblyAI | A cloud-based AI transcription service that converts audio to text with speaker diarisation support. |
| Artefact | An intermediate file produced and consumed by a pipeline stage (e.g., extracted audio, raw transcript, HTML document). |
| Context Bridge | Electron's mechanism (contextBridge) for safely exposing main-process APIs to the renderer without granting full Node.js access. |
| Diarisation (Speaker Diarisation) | The process of partitioning a transcript by speaker identity, labelling segments as "Speaker A", "Speaker B", etc. |
| Dotenv | A utility that loads environment variables from a .env file into process.env at application startup. |
| Electron | An open-source framework by GitHub/Microsoft for building cross-platform desktop applications using web technologies (HTML, CSS, JavaScript). |
| FFmpeg | A widely used open-source multimedia processing library supporting encoding, decoding, and conversion of audio and video. |
| fluent-ffmpeg | A Node.js library providing a fluent, chainable API for invoking FFmpeg operations. |
| IPC (Inter-Process Communication) | The mechanism Electron uses to pass messages between the main process (Node.js) and the renderer process (Chromium). |
| Keyserver | A custom internal service (keyserver.dommymommy.xyz) that distributes API keys to authenticated V2D instances at runtime. |
| LLM (Large Language Model) | A class of AI model trained on large text corpora, capable of generating coherent text given a prompt. Examples: GPT, Gemini, Qwen3. |
| mapFunctions | The global JavaScript Map object in V2D's main process that stores all registered module objects, keyed by module name. |
| Module | A self-contained unit of functionality in V2D that conforms to the standard module interface and is auto-discovered at startup. |
| Module Handler | The utility (module-handler.js) responsible for dynamically loading module files and populating mapFunctions. |
| Mocha | A JavaScript test framework for Node.js that provides test suite structure, assertions, and reporting. |
| npm | Node Package Manager — the default package registry and dependency management tool for Node.js. |
| Preload Script | A Node.js script (preload.js) that runs in a privileged context in Electron and bridges the renderer and main processes via contextBridge. |
| Prompt Template | A plain-text file containing LLM instructions that define the structure and style of the generated document. |
| Puppeteer | A Node.js library that controls a headless Chromium browser, used in V2D for HTML-to-PDF conversion. |
| Renderer Process | The Chromium browser process in Electron that renders the application UI. It runs in a sandboxed environment without direct Node.js access. |
| SAIA | An academic cloud computing platform that hosts selected LLM models (including Qwen3 and GPT variants) accessible via API. |
| Speaker Placeholder | A temporary token (e.g., speakerA, speakerB) inserted by the LLM generation step and later replaced with the speaker's real name. |
| Transcript | The textual output of the transcription step, containing the spoken words with associated speaker labels and timestamps. |
| V2D | Abbreviation for Video2Document — the name of this software system. |
| Whisper / Whisper.cpp | An open-source speech recognition model by OpenAI. Whisper.cpp is a C++ port that runs locally without network access. |
End of Document
Document Control
Field Value Title Video2Document — Software Documentation Package Version 1.0.0 Standard ISO/IEC/IEEE 26514:2022 Generated 2026-02-22 Project video2document (WISE 2025/2026) Status Release Candidate