Yes, You Still Need RAG

Just Stating the Obvious

Once again, AI hype trains are full steam ahead as people overload DeepSeek to try out DeepSeek-R1. In fact, DeepSeek-R1 has generated so much hype that Nvidia’s stock shed nearly 17% of its value. No matter how impressive DeepSeek-R1 is, yes, you still need RAG.

No, I’m not going to tell you why you still need RAG. I’m going to show you. What’s a good use case for this experiment? We need a highly complex document that describes a topic that requires substantial human effort to understand and is publicly available.

MIL-HDBK-516C to the rescue! Never thought I’d say that. MIL-HDBK-516C is a guide for satisfying airworthiness requirements for military aircraft. It is highly complex, people (like myself in a past life) spend months if not years working with it, and it is publicly available. Check. ✅ 

Some quick stats on the doc:

  • Pages: 527

  • Tokens: ~550k

  • Complexity: Very High

  • Format: PDF

While there are nearly endless questions we could ask about this document, let’s choose one that has a concrete expected response.

Q: How does speech intelligibility impact airworthiness?

The first issue is that if I ask DeepSeek-R1 the test question as written, the model talks about airworthiness in a broader sense. To help narrow the model’s focus, I appended “per MIL-HDBK-516C” to the question.

DeepSeek-R1 response:

MIL-HDBK-516C, Department of Defense Handbook: Airworthiness Certification Criteria, establishes standards for ensuring military aircraft safety and mission readiness. Speech intelligibility directly impacts airworthiness by affecting crew coordination, emergency response, and mission execution. Below is a breakdown of how speech intelligibility intersects with MIL-HDBK-516C requirements:

1. Communication System Performance (Section 8.4.2.3.3.1) 

MIL-HDBK-516C mandates that communication systems (e.g., intercoms, radios, public address systems) must:  

   - Operate effectively in all expected environments (e.g., high noise, vibration, electromagnetic interference).  

   - Ensure clear voice transmission between crew members, air traffic control (ATC), and ground personnel.  

   - Meet intelligibility thresholds to prevent misunderstandings during critical operations (e.g., takeoff, landing, emergencies).  

Poor speech intelligibility could lead to non-compliance with these criteria, jeopardizing airworthiness certification.

2. Human Factors Engineering (Section 8.4.1.4.1) 

The handbook emphasizes human-system integration:  

   - Crew workload and situational awareness depend on clear communication.  

   - Systems must account for acoustic environments (e.g., cockpit noise levels) to ensure crew members can hear and interpret commands.  

   - Design must align with MIL-STD-1472 (Human Engineering), which specifies noise limits and speech intelligibility requirements (e.g., using the Articulation Index or Speech Transmission Index).  

3. Audio Alerting Systems (Section 8.4.2.3.3.2) 

   - Voice warnings and alerts must not interfere with other communications.  

   - Intelligibility requirements ensure that critical alerts (e.g., stall warnings, system failures) are distinguishable from routine speech.  

4. Environmental Testing (Section 8.4.1.3.2.1) 

Aircraft systems must be tested under worst-case operational conditions, including:  

   - High noise levels (e.g., engine roar, wind, rotor blades).  

   - Stressful scenarios (e.g., combat, emergencies).  

   - Speech intelligibility tests (e.g., using standardized phrases or metrics like STI/AI) are required to validate performance.  

Safety and Emergency Procedures (Section 8.4.1.5.1) 

   - During emergencies (e.g., fires, cabin decompression), evacuation commands must be clearly understood by all personnel.  

   - MIL-HDBK-516C requires redundant communication paths to ensure intelligibility even if primary systems fail.  

Interoperability (Section 8.4.2.3.3.4) 

   - Communication systems must interface seamlessly with ground control, other aircraft, and NATO allies.  

   - Poor intelligibility due to incompatible systems could violate interoperability requirements.  

Impact on Airworthiness Certification 

If speech intelligibility fails to meet MIL-HDBK-516C standards, the aircraft may be deemed non-airworthy due to:  

   - Increased risk of crew errors or delayed emergency responses.  

   - Non-compliance with human factors or communication system criteria.  

   - Inability to safely execute missions in contested or noisy environments.  

Key Testing & Compliance Tools 

   - Articulation Index (AI) or Speech Transmission Index (STI): Quantitative metrics to measure intelligibility.  

   - MIL-STD-1474 (Noise Limits): Defines maximum permissible noise levels for speech zones.  

   - MIL-STD-882E: Risk assessment for communication-related hazards.  

Conclusion 

Per MIL-HDBK-516C, speech intelligibility is a critical safety factor in airworthiness. Compliance requires rigorous design, testing, and validation of communication systems to ensure clarity in all operational scenarios. Failure to meet these standards directly impacts mission safety and certification eligibility.

If you’ve never worked with MIL-HDBK-516C, you might be shocked at the progress of AI. Unfortunately, this response looks very good but actually has major problems. Some of them:

  • The section numbers referenced don’t actually exist.

  • Nowhere does MIL-HDBK-516C require testing under “worst-case operational conditions” including “combat”. Not only is this statement false, it is dangerously misleading. The airworthiness process exists to fully test an aircraft system prior to engaging in worst-case operational conditions.

  • At no point does MIL-HDBK-516C reference “evacuation commands”. 

  • At no point, anywhere, in MIL-HDBK-516C are “NATO allies” referenced.

  • At no point is speech intelligibility referenced as a “critical safety factor”. In fact, speech intelligibility is merely one of hundreds of criteria necessary to establish airworthiness.

  • The most important part of the speech intelligibility criteria, the Modified Rhyme Test (MRT), is totally omitted from the response. The MRT is the primary testing method for demonstrating speech intelligibility compliance.
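
The first problem on that list, section numbers that don’t exist, is the kind of thing you can partly check by machine. Here is a minimal sketch in Python, assuming the handbook has already been extracted from the PDF into a plain string. The function names and the regular expressions are my own illustration and are not part of any tool used in this experiment. It pulls the section numbers a response cites and flags any that never appear as a heading in the source text.

```python
import re

# Matches citations such as "Section 8.4.2.3.3.1" and captures the number.
SECTION_CITATION = re.compile(r"Section\s+(\d+(?:\.\d+)+)")
# Matches a heading line such as "9.2.8 Speech intelligibility."
# (assumption: headings start a line with a dotted number followed by text).
SECTION_HEADING = re.compile(r"^\s*(\d+(?:\.\d+)+)\s+\S", re.MULTILINE)


def cited_sections(response: str) -> list[str]:
    """Return the section numbers a model response cites, in order."""
    return SECTION_CITATION.findall(response)


def known_sections(source_text: str) -> set[str]:
    """Collect the section numbers that appear as headings in the source."""
    return set(SECTION_HEADING.findall(source_text))


def unverified_citations(response: str, source_text: str) -> list[str]:
    """Return cited section numbers that are not headings in the source."""
    known = known_sections(source_text)
    return [s for s in cited_sections(response) if s not in known]


if __name__ == "__main__":
    # Hypothetical stand-in for the full extracted handbook text.
    source_text = "9.2.8 Speech intelligibility.\nCriterion: Verify that ..."
    response = (
        "1. Communication System Performance (Section 8.4.2.3.3.1)\n"
        "2. Human Factors Engineering (Section 8.4.1.4.1)\n"
    )
    print(unverified_citations(response, source_text))
```

A check like this only catches invented section numbers. It says nothing about whether the claims attached to a real section number are true, so it complements a human review rather than replacing one.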

Are we surprised the response is so poor? No, although the degree of hallucination is a bit concerning. What happens when we try to upload the entire document to DeepSeek-R1?

Well, that’s a problem.

Given the large number of tokens, what happens if we dump the entire document on a model like Gemini-2.0-Flash-Exp with its 1M token limit?

Based on the provided text, speech intelligibility is a factor in airworthiness certification, specifically within the context of crew systems. Here's a breakdown of how it impacts airworthiness:

1. Crew Communication:

Clear Communication is Essential: The handbook emphasizes that clear communication is crucial for safe operation of an air system. This includes communication between the pilot, crew members, and potentially ground personnel.

Intelligibility as a Parameter: Speech intelligibility is explicitly mentioned as a parameter that needs to be defined and documented in the context of each unique air system. This means that the quality of voice communication needs to be high enough for the crew to understand each other and any instructions or warnings.

2. Crew Safety and Performance:

Impact on Mission Duties: The text states that any person on the air vehicle who performs mission duties is considered a crewmember. Therefore, clear communication is essential for the performance of these duties.

Avoiding Mishaps: Poor speech intelligibility can lead to misunderstandings, misinterpretations, and ultimately, errors that could cause a mishap. This is why it is a key area of concern for airworthiness.

3. Specific Standards:

ANSI/ASA S3.2 and S3.5: The document references ANSI/ASA S3.2 (Method for Measuring the Intelligibility of Speech over Communication Systems) and ANSI/ASA S3.5 (Methods for Calculation of the Speech Intelligibility Index). These standards are used to measure and evaluate the intelligibility of speech over communication systems. This highlights the importance of using standardized methods to assess speech intelligibility.

4. Testing and Verification:

Measuring Intelligibility: The handbook implies that speech intelligibility should be measured and verified as part of the airworthiness certification process. This could involve using the standards mentioned above, or other accepted methods.

Ensuring Acceptable Levels: The goal is to ensure that the speech intelligibility meets acceptable levels, so that the crew can communicate effectively and safely.

In summary, speech intelligibility is a critical component of airworthiness because it directly impacts the ability of the crew to communicate effectively and safely, which is essential for the safe operation of the air system. The handbook provides specific standards and guidelines for measuring and ensuring adequate speech intelligibility, highlighting its importance in the certification process.

Not surprisingly, this response features far fewer misleading hallucinations, since the full text could be provided to the model. However, the section on “Intelligibility as a Parameter” is at best confusing and somewhat inaccurate. The statement “Speech intelligibility is explicitly mentioned as a parameter” is factually false, since nowhere in the source is speech intelligibility called a parameter. Instead, the source discusses testing methods for verifying that communication systems have speech intelligibility of sufficient quality.

While this response features fewer inaccuracies, it still lacks the necessary conceptual depth to address the question. Just like DeepSeek-R1, Gemini omitted the most important information regarding speech intelligibility: the Modified Rhyme Test (MRT). Any response must discuss the MRT to be complete.

How does Agentic Graph RAG compare? Using TrustGraph configured to use Gemini-2.0-Flash-Exp on an extracted cognitive core of MIL-HDBK-516C, the response is:

Speech intelligibility is a critical factor in airworthiness, as it directly impacts the ability of flight crew and other personnel to communicate effectively, especially during critical situations. Several metrics are used to assess speech intelligibility, including the Modified Rhyme Test (MRT), Speech Transmission Index (STI), Speech Intelligibility Index (SII), and Articulation Index. These metrics are defined in standards like MIL-STD-1472 and ANSI/ASA S3.2 and are used to verify that communication systems meet the necessary airworthiness criteria. The MRT, specifically, is used to evaluate critical audio communication paths and requires a score of 80% (91% for Army) for operational military aircraft. Compliance is demonstrated through verification methods, often using standards like ANSI/ASA S3.2 for conducting the MRT. Therefore, ensuring adequate speech intelligibility through testing and verification is crucial for maintaining airworthiness and flight safety.

The first impression is that this is, by far, the shortest response. It also has, by far, the highest information density. It is the most complete response, capturing all of the testing methodologies, including the most important one, the MRT. In fact, it also catches the scoring criteria for the MRT: “a score of 80% (91% for Army)”.

But don’t take my word for it. Here’s the source text’s excerpt on speech intelligibility:

9.2.8 Speech intelligibility.

Criterion: Verify that all audio communication systems have speech intelligibility of sufficient quality to ensure safe and effective aircraft system operation.

Standard: All critical audio communications paths are to be evaluated using the Modified Rhyme Test (MRT), in accordance with ANSI/ASA S3.2. Critical communication paths may include but are not limited to:

a. Pilot/operator audio communications with ground personnel or mission personnel.

b. Pilot/operator communications with Air Traffic Control.

c. Pilot/operator with another external air vehicle pilot/operator.

d. Pilots to aircrew and passengers.

An 80% (91% for Army) score on the modified rhyme test (MRT) conducted in accordance with ANSI/ASA S3.2 is considered adequate for all operational military aircraft communication paths essential to safety of flight.

In support of initial ground testing and first flight predictive measures of speech intelligibility including the Articulation Index (AI), Speech Transmission Index (STI), and Speech Intelligibility Index (SII) may, at the discretion of the airworthiness authority, be used to predict communication performance for some environments.

Method of Compliance: Speech intelligibility is verified by testing. Verify with talkers and listeners in the worst case noise environments where communications are required. Speech intelligibility scores cannot be combined for different personnel locations and flight conditions. Predictive measures are not used for full airworthiness compliance.

References: 

MIL-STD-1472

MIL-STD-1474

ANSI/ASA S3.2

ANSI/ASA S3.5
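
To make the excerpt concrete, here is a minimal Python sketch of how its scoring rule could be applied to a set of test results. It assumes that “considered adequate” means a minimum passing score, which is my reading of the text and not an official interpretation. The class and function names are hypothetical, and the scores in the usage example are made up for illustration.

```python
from dataclasses import dataclass

# Thresholds quoted in the 9.2.8 excerpt: 80% in general, 91% for Army.
MRT_THRESHOLD_DEFAULT = 80.0
MRT_THRESHOLD_ARMY = 91.0


@dataclass(frozen=True)
class MrtResult:
    """One MRT score for one personnel location in one flight condition."""
    location: str
    condition: str
    score_percent: float


def required_score(army: bool) -> float:
    """Return the minimum MRT score assumed to count as adequate."""
    return MRT_THRESHOLD_ARMY if army else MRT_THRESHOLD_DEFAULT


def failing_results(results: list[MrtResult], army: bool = False) -> list[MrtResult]:
    """Return every result that falls below the threshold.

    Each result is judged on its own. Scores are never averaged, because the
    excerpt says they cannot be combined across locations and conditions.
    """
    threshold = required_score(army)
    return [r for r in results if r.score_percent < threshold]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    results = [
        MrtResult("pilot", "cruise", 85.0),
        MrtResult("aircrew", "worst-case noise", 78.0),
    ]
    for r in failing_results(results):
        print(f"below threshold: {r.location} / {r.condition}: {r.score_percent}%")
```

The point of judging each result separately is visible in the made-up data: the two scores average above 80%, yet the second communication path would still fail on its own.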

Now that we see the source text, it becomes even more obvious how far DeepSeek-R1 diverged from the ground truth and how accurate the Agentic Graph RAG response with TrustGraph is. But should we be surprised that DeepSeek-R1 can’t answer this obscure question? No, it’s absolutely an obscure question, but that’s the point. Obscure topics where people don’t know the answer are exactly the use cases where you would want to use AI. Perhaps Gemini’s response with the full source text is a bit surprising. On the other hand, our testing has, time and time again, pointed to less than 10% of the advertised context window being reliably usable.
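
The context window arithmetic is worth spelling out. The sketch below uses the figures quoted in this post (a ~550k token document and a 1M token window) and treats 10% as an upper bound on the reliably usable fraction, which is a generous reading of our “less than 10%” observation. The function names are my own.

```python
DOCUMENT_TOKENS = 550_000      # approximate size quoted for MIL-HDBK-516C
ADVERTISED_WINDOW = 1_000_000  # Gemini-2.0-Flash-Exp limit quoted in this post
RELIABLE_FRACTION = 0.10       # upper bound taken from "less than 10%"


def reliable_budget(advertised_window: int, reliable_fraction: float) -> int:
    """Tokens assumed to be reliably usable out of the advertised window."""
    return round(advertised_window * reliable_fraction)


def overshoot(document_tokens: int, budget: int) -> float:
    """How many times larger the document is than the token budget."""
    return document_tokens / budget


if __name__ == "__main__":
    budget = reliable_budget(ADVERTISED_WINDOW, RELIABLE_FRACTION)
    print(f"fits advertised window: {DOCUMENT_TOKENS <= ADVERTISED_WINDOW}")
    print(f"reliable budget: {budget} tokens")
    print(f"document is {overshoot(DOCUMENT_TOKENS, budget)}x the reliable budget")
```

Under these assumptions, the document fits inside the advertised window but is several times larger than the portion of it we would trust.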

But this question was only for one document. MIL-HDBK-516C references hundreds of other documents needed to fully understand airworthiness compliance. The full set of documents would likely be tens of millions of tokens, far exceeding any LLM’s capabilities. Yet RAG infrastructures, especially Agentic Graph RAG ones, can manage these complex problems. The ability to save these data sets as Cognitive Cores that can be shared and loaded on demand is fundamental to TrustGraph’s Agentic Graph RAG approach.