From 4ade9575f33fd353522a07abe2f60e2d1be6feb4 Mon Sep 17 00:00:00 2001
From: MikeHughes-BIN
Date: Thu, 20 Nov 2025 16:30:47 +0100
Subject: [PATCH] Changes to the test

---
 .../{documentType.txt => documentType.json}   |   0
 test/integration/gemini/test-gemini.js        |   4 +-
 test/integration/gemini/transcript.json       | 122 ++++++++++++++++++
 test/integration/gemini/transcript.txt        |  53 --------
 4 files changed, 124 insertions(+), 55 deletions(-)
 rename test/integration/gemini/{documentType.txt => documentType.json} (100%)
 create mode 100644 test/integration/gemini/transcript.json
 delete mode 100644 test/integration/gemini/transcript.txt

diff --git a/test/integration/gemini/documentType.txt b/test/integration/gemini/documentType.json
similarity index 100%
rename from test/integration/gemini/documentType.txt
rename to test/integration/gemini/documentType.json
diff --git a/test/integration/gemini/test-gemini.js b/test/integration/gemini/test-gemini.js
index 9df6eb4..7072c43 100644
--- a/test/integration/gemini/test-gemini.js
+++ b/test/integration/gemini/test-gemini.js
@@ -1,7 +1,7 @@
 const gemini = require("../../../services/modules/llm-gemini/gemini.js");
 
 gemini.function({
-  inputTranscriptPath: "./transcript.txt",
-  documentTypePath: "./documentType.txt",
+  inputTranscriptPath: "./transcript.json",
+  documentTypePath: "./documentType.json",
   language: "de"
 });
\ No newline at end of file
diff --git a/test/integration/gemini/transcript.json b/test/integration/gemini/transcript.json
new file mode 100644
index 0000000..ca5b9d5
--- /dev/null
+++ b/test/integration/gemini/transcript.json
@@ -0,0 +1,122 @@
+[
+  {
+    "speaker": "A",
+    "sentence": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts throughout the US Skylines from Maine to Maryland to Minnesota are gray and smoggy. And in some places, the air quality warnings include the warning to stay inside. We wanted to better understand what's happening here and why, so we called Peter DeCarlo, an associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Good morning, Professor.",
+    "start": 240,
+    "end": 26560
+  },
+  {
+    "speaker": "B",
+    "sentence": "Good morning.",
+    "start": 28060,
+    "end": 28620
+  },
+  {
+    "speaker": "A",
+    "sentence": "So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?",
+    "start": 29100,
+    "end": 37100
+  },
+  {
+    "speaker": "B",
+    "sentence": "Well, there's a couple of things. The season has been pretty dry already, and then the fact that we're getting hit in the US is because there's a couple weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the mid Atlantic and the Northeast and kind of just dropping the smoke there.",
+    "start": 39100,
+    "end": 55820
+  },
+  {
+    "speaker": "A",
+    "sentence": "So what is it in this haze that makes it harmful? And I'm assuming it is harmful.",
+    "start": 56590,
+    "end": 60670
+  },
+  {
+    "speaker": "B",
+    "sentence": "It is, it is. The levels outside right now in Baltimore are considered unhealthy. And most of that is due to what's called particulate matter, which are tiny particles, microscopic, smaller than the width of your hair, that can get into your lungs and impact your respiratory system, your cardiovascular system, and even your neurological, your brain.",
+    "start": 62350,
+    "end": 82590
+  },
+  {
+    "speaker": "A",
+    "sentence": "What makes this particularly harmful? Is it the volume of particulate? Is it something in particular? What is it exactly? Can you just drill down on that a little bit more?",
+    "start": 83630,
+    "end": 92190
+  },
+  {
+    "speaker": "B",
+    "sentence": "Yeah. So the concentration of particulate matter, I was looking at some of the monitors that we have was reaching levels of what are, in science speak, 150 micrograms per meter cubed, which is more than 10 times what the annual average should be in about four times higher than what you're supposed to have on a 24 hour average. And so the concentrations of these particles in the air are just much, much, much higher than we typically see. And exposure to those high levels can lead to a host of health problems.",
+    "start": 93550,
+    "end": 123350
+  },
+  {
+    "speaker": "A",
+    "sentence": "And who is most vulnerable? I noticed that in New York City, for example, they're canceling outdoor activities. And so here it is in the early days of summer and they have to keep all the kids inside. So who tends to be vulnerable in a situation like this?",
+    "start": 123430,
+    "end": 135990
+  },
+  {
+    "speaker": "B",
+    "sentence": "It's the youngest. So children, obviously, whose bodies are still developing, the elderly who are, you know, their bodies are more in decline and they're more susceptible to the health impacts of breathing, the poor air quality. And then people who have pre existing health conditions, people with respiratory conditions or heart conditions, can be triggered by high levels of air pollution.",
+    "start": 137610,
+    "end": 156650
+  },
+  {
+    "speaker": "A",
+    "sentence": "Could this get worse?",
+    "start": 157450,
+    "end": 158650
+  },
+  {
+    "speaker": "B",
+    "sentence": "That's a good question. I mean, I think if in some areas it's much worse than others and it just depends on kind of where the smoke is concentrated. I think New York has some of the higher concentrations right now, but that's going to change as that air moves away from the New York area. But over the course of the next few days, we will see different areas being hit at different times with the highest concentrations.",
+    "start": 162170,
+    "end": 183420
+  },
+  {
+    "speaker": "A",
+    "sentence": "I was going to ask you about.",
+    "start": 183740,
+    "end": 184660
+  },
+  {
+    "speaker": "B",
+    "sentence": "More fires start burning. I don't expect the concentrations to go up too much higher.",
+    "start": 184660,
+    "end": 189020
+  },
+  {
+    "speaker": "A",
+    "sentence": "I was going to ask you how and you started to answer this, but how much longer could this last? Forgive me if I'm asking you to speculate, but what do you think?",
+    "start": 189100,
+    "end": 196220
+  },
+  {
+    "speaker": "B",
+    "sentence": "Well, I think the fires are going to burn for a little bit longer. But the key for us in the US Is the weather system changing. Right now it's the weather systems that are pulling that air into our Mid Atlantic and Northeast region. As those weather systems change and shift, we'll see that smoke going elsewhere and not impact us in this region as much. I think that's going to be the defining factor. I think the next couple days we're going to see a shift in that weather pattern and start to push the smoke away from where we are.",
+    "start": 198280,
+    "end": 227480
+  },
+  {
+    "speaker": "A",
+    "sentence": "And finally, with the impacts of climate change, we are seeing more wildfires. Will we be seeing more of these kinds of wide ranging air quality consequences or circumstances?",
+    "start": 227930,
+    "end": 239850
+  },
+  {
+    "speaker": "B",
+    "sentence": "I mean, that is one of the predictions for climate change. Looking into the future, the fire season is starting earlier and lasting longer and we're seeing more frequent fires. So yeah, this is probably something that we'll be seeing more, more frequently. This tends to be much more of an issue in the western U.S. so the eastern U.S. getting hit right now is a little bit new. But yeah, I think with climate change moving forward, this is something that is going to happen more frequently.",
+    "start": 241370,
+    "end": 267570
+  },
+  {
+    "speaker": "A",
+    "sentence": "That's Peter DeCarlo, associate professor in the Department of Environmental Health and Engineering at Johns Hopkins University. Professor DeCarlo, thanks so much for joining us and sharing this expertise with us.",
+    "start": 267970,
+    "end": 278210
+  },
+  {
+    "speaker": "B",
+    "sentence": "Thank you for having me.",
+    "start": 279410,
+    "end": 280530
+  }
+]
\ No newline at end of file
diff --git a/test/integration/gemini/transcript.txt b/test/integration/gemini/transcript.txt
deleted file mode 100644
index cae9726..0000000
--- a/test/integration/gemini/transcript.txt
+++ /dev/null
@@ -1,53 +0,0 @@
-Meeting Transcript - Video2Document Project
-Date: November 18, 2025
-Attendees: Mike Hughes, Stefan Heyne, Alice Smith, Bob Johnson, Clara Nguyen
-
-[09:00 AM] Mike Hughes: Good morning, everyone. Let’s start the weekly project meeting for Video2Document. We have multiple points on the agenda today, including updates from each module, integration challenges, and the next sprint plan.
-
-[09:02 AM] Alice Smith: I’ve been working on the document formatting module. I’ve implemented support for markdown and PDF outputs. Still need to handle custom templates for clients.
-
-[09:05 AM] Bob Johnson: Video preprocessing is progressing. I’ve added support for multiple video codecs and automated audio extraction. I found that some videos require normalization before sending to the LLM.
-
-[09:08 AM] Stefan Heyne: For the LLM integration, I tested a few transcripts. Gemini handles summaries well, but we might need to tune prompts to get consistent headings and formatting.
-
-[09:12 AM] Clara Nguyen: On the storage side, I’ve configured S3 buckets for document storage. Permissions and versioning are set, but we still need to handle large batch uploads efficiently.
-
-[09:15 AM] Mike Hughes: Great updates. Let’s discuss some issues I noticed in the last integration test. First, audio extraction sometimes fails with videos longer than 20 minutes. Bob, any insights?
-
-[09:17 AM] Bob Johnson: Yes, I believe the FFmpeg timeout settings need adjustment. Also, some containerized environments lack the right codec libraries, causing failures.
-
-[09:20 AM] Stefan Heyne: On the LLM side, we noticed that very long transcripts lead to truncated outputs. We may need to split transcripts or chunk content intelligently before sending to Gemini.
-
-[09:22 AM] Alice Smith: For document formatting, longer outputs sometimes exceed the template limits. We need to implement pagination or splitting by sections.
-
-[09:25 AM] Clara Nguyen: For batch uploads, we can implement parallel processing with rate limiting to avoid S3 throttling.
-
-[09:28 AM] Mike Hughes: Action items from today’s discussion:
-1. Bob: Adjust FFmpeg settings for long videos and document required codecs.
-2. Stefan: Implement transcript chunking and test Gemini output for longer documents.
-3. Alice: Add section splitting and pagination to document formatting.
-4. Clara: Optimize batch upload process and test with larger datasets.
-
-[09:32 AM] Bob Johnson: Also, I propose adding logging for all preprocessing steps. This will help debug failed video conversions quickly.
-
-[09:35 AM] Stefan Heyne: Agreed. Logging in the LLM pipeline will also help identify failed content generations or prompt issues.
-
-[09:38 AM] Alice Smith: I can integrate logging hooks into the formatting module. Should include timestamped entries and file references.
-
-[09:40 AM] Clara Nguyen: I’ll add S3 upload logs and alerting for failed uploads.
-
-[09:42 AM] Mike Hughes: Perfect. Next sprint planning: we’ll prioritize long-video handling, chunked LLM summarization, and document formatting robustness. Everything else can follow in the subsequent sprint.
-
-[09:45 AM] Bob Johnson: I’ll provide a small script for testing various video lengths. Can be used to benchmark preprocessing times.
-
-[09:48 AM] Stefan Heyne: I’ll create example transcripts of different sizes to test Gemini LLM’s handling and summarize consistency.
-
-[09:50 AM] Alice Smith: I’ll create template variations for large documents and test rendering performance.
-
-[09:52 AM] Clara Nguyen: I’ll simulate batch uploads and stress-test the S3 storage setup.
-
-[09:55 AM] Mike Hughes: Excellent. Let’s reconvene next Wednesday for progress review. Make sure to push your updates to the repository beforehand.
-
-[09:57 AM] All: Agreed.
-
-Meeting adjourned at 10:00 AM.
\ No newline at end of file