6  Generation Evaluation

(ns notebooks.rag-evaluation
  (:require [libpython-clj2.require :refer [require-python]]
            [libpython-clj2.python :refer [py..] :as py]
            [notebooks.preparation :refer [ds]]
            [selmer.parser :as templates]
            [scicloj.kindly.v4.kind :as kind]
            [notebooks.generation :as gen]
            [scicloj.tableplot.v1.plotly :as plotly]
            [notebooks.llm-api :as llm]
            [clojure.edn :as edn]
            [clojure.set]
            [clojure.string :as str]
            [tablecloth.api :as tc]
            [clojure.java.io :as io]
            [notebooks.vdb-evaluation :as vdb])
  (:import
   (dev.langchain4j.data.segment TextSegment)
   (dev.langchain4j.store.embedding CosineSimilarity)
   (dev.langchain4j.model.embedding.onnx.allminilml6v2 AllMiniLmL6V2EmbeddingModel)))

In this section, I rely heavily on the metrics and approach of the continuous-eval Python library as a starting point for thinking about how to evaluate the RAG.

That repository also has some great links to articles explaining some of the concepts in more detail.

As the creators of the project write, there are several kinds of questions you might want to consider when evaluating answer generation:

(source)

In this exercise, I will only look at which LLM might work best with the data that I have and the prompt/retrieval framework we have already set up.

We will focus on three categories of metrics:

Deterministic metrics are similar to how we measured the retrieval performance; they simply measure the token overlap between answers generated by the LLM and some kind of reference/ground-truth answers.

The semantic metric is similar to the method of retrieving information from the vector database; it checks how similar two pieces of text are based on vector embeddings.

LLM-based metrics use another LLM to assign a score to the output. For example, to determine ‘answer-correctness’, we will ask an LLM to assign a score from 1 to 5 to a generated answer, based on reference answers that we provide ourselves.
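
To make the idea of ‘token overlap’ concrete, here is a minimal sketch in plain Clojure. It is not the continuous-eval implementation: the `tokens` and `token-overlap` helpers are hypothetical names, and the tokenisation (lower-cased runs of letters and digits, treated as a set) is an assumption made purely for illustration.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn tokens
  "Splits text into a set of lower-cased alphanumeric tokens (illustrative only)."
  [text]
  (set (re-seq #"[a-z0-9]+" (str/lower-case text))))

(defn token-overlap
  "Precision: share of answer tokens found in the reference.
   Recall: share of reference tokens found in the answer."
  [answer reference]
  (let [a      (tokens answer)
        r      (tokens reference)
        common (count (set/intersection a r))]
    {:precision (if (seq a) (/ common (double (count a))) 0.0)
     :recall    (if (seq r) (/ common (double (count r))) 0.0)}))

(token-overlap "Paris is the capital"
               "Paris is the capital of France")
```

In this toy example every answer token appears in the reference, so precision is 1.0, while the answer only covers four of the six reference tokens, so recall is about 0.67.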

6.1 Evaluation Dataset

Before going into the metrics further, we will first create a testing dataset that contains some questions and ground truth answers. I’ve used 10 fairly random questions based on some of the material in the starting dataset of questions and answers.

Ideally, we would use a much larger and more thoughtfully curated evaluation dataset, perhaps with input from domain experts across different question areas. The goal here, however, is simply to test out some evaluation workflows in Clojure, so a basic evaluation dataset will have to do for now.

Below, we just load that dataset. The ‘questions.edn’ file is set up as a Clojure map, where the questions are the keys and the ground truth answers are the values.

(def evaluation-dataset
  (let [data         (edn/read-string (slurp "data/evaluation_questions/questions.edn"))
        questions    (keys data)
        ground-truth (vals data)]
    (mapv (fn [question truth] (-> {}
                                   (assoc :question question)
                                   (assoc :ground-truth truth)))
          questions
          ground-truth)))
(kind/table evaluation-dataset)
question ground-truth
How many people are eligible for payments under the Monther and Baby Homes scheme?
["It is anticipated that there are 34,000 people eligible for benefits under the Mother and Baby Institutions Payment Scheme"
 "A comprehensive body of work has been underway to develop the substantial administrative framework required to deliver a scheme of this size."]
What is the salary scale for an archaeologist in the local government sector?
["The salary scale for an archaeologist in the local government sector ranges from €55,519 to €77,176."]
How many pupils in total benefit from the School Transport Scheme (STS)? Can you also break down the numbers in terms of students with special educational needs and students who are from Ukraine?
["There are over 161,600 children total that benefit from the School Transport Scheme"
 "135,000 pupils travel on primary and post primary services."
 "19,800 pupils with special educational needs benefit from the scheme."
 "6,800 pupils who have arrived to Ireland from Ukraine avail of the scheme."]
Will the government put in place Level 6 (QQI) courses for healthcare assistants?
["The Office of Nursing and Midwifery Services Director (ONMSD) within its scope of governance designs, develops and delivers continuous professional development and clinical leadership, with programmes of education ranging from level 5 to level 8 on the National Framework of Qualifications."
 "The educational requirement for a Health Care Assistant is the relevant Health Skills Level 5 (QQ1) qualification"
 "Training is provided to Health Care Assistants who support the nursing and midwifery function through the ONMSD Centres of Nursing and Midwifery Education with the achievement of Ireland QQI Level 5 Major award in Health Service Skills."
 "Ensuring an expanded number of third level places in healthcare programmes is a priority for my Department."]
How many households were in receipt of HAP payments in 2023?
["At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support."]
What is the government doing with regard to the National Drugs Strategy?
["The National Drugs Strategy 2017-2025 is a health-led approach to drug and alcohol use in Ireland, and it represents a whole-of-Government response to reduce demand and associated harms."
 "The Department of Revenue has primary responsibility for the prevention, detection, interception and seizure of controlled drugs."]
What is the government doing to help improve GP services?
["Increased annual investment of over €211 million was provided for under the 2019 GP Agreement, providing for increased GP fees, supports and new services."
 "There is an increase in demand for GP services due to demographics and an aging population"
 "The government is conducting a review of services, that includes a focus on GP services in certain rural and urban deprived areas."
 "The government and the HSE have commenced a Strategic Review of General Practice, examining a broad range of issues."]
What is the punishment for someone driving without a driving licence?
["A person driving without a licence may be fined up to €5000, or be imprisoned for up to 6 months, or both."]
Are there plans to further reduce public transport fares?
["There are no plans to further reduce public transport fares at this time"]
How is the government encouraging local authorities to apply for the town and village renewal scheme?
["In July of 2023, the government announced funding of €15 million for this year's Town and Village Renewal Scheme"
 "The government's goal is to promote rural and community development and to support vibrant, inclusive and sustainable communities throughout Ireland."
 "The 2023 main Town and Village Renewal scheme also has a renewed focus on town centre regeneration and enhancing economic and social vibrancy in line with the Town Centre First policy."
 "Capital schemes including the Town and Village Renewal Scheme, LEADER, the Rural Regeneration and Development Fund, the Local Improvement Scheme, and the Community Centre Investment Fund are delivering projects throughout the country."]

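As a small illustration of the file layout described in this section, the sketch below parses a one-entry string in the same shape as ‘questions.edn’ (a question mapped to a vector of ground truth statements). The `example-edn` string and the `edn->rows` helper are hypothetical; the entry itself is copied from the table above.

```clojure
(require '[clojure.edn :as edn])

;; A one-entry excerpt in the same shape as the questions file:
;; question string -> vector of ground truth statements.
(def example-edn
  "{\"Are there plans to further reduce public transport fares?\"
    [\"There are no plans to further reduce public transport fares at this time\"]}")

(defn edn->rows
  "Turns an EDN map of question -> ground truths into a vector of row maps."
  [edn-text]
  (mapv (fn [[question truth]]
          {:question question :ground-truth truth})
        (edn/read-string edn-text)))

(edn->rows example-edn)
```

Destructuring each map entry keeps a question and its answers paired in one step, which is one possible alternative to walking the keys and values separately.
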
6.2 Generate LLM Answers

Next, we will write a helper function that generates responses from a given LLM and saves them to disk. These are the responses that we will evaluate later. As you can see below, I tested 20 different models. Some were small models running locally (around 8B parameters at most), and some were cloud-based models from Google, OpenAI and Anthropic.

(defn ask-llm-save-responses! [model questions]
  (let [responses (reduce (fn [res question]
                            (conj res
                                  (gen/get-rag-answer
                                   (assoc question :model-ref model)
                                   :db-docs)))
                          [] questions)
        f-name (str "data/responses/" model "_responses.edn")]
    (spit f-name responses)))
(comment
  (ask-llm-save-responses! "gemini-2.0-flash-lite" evaluation-dataset)
  (ask-llm-save-responses! "llama3.1" evaluation-dataset)
  (ask-llm-save-responses! "gpt-3.5-turbo" evaluation-dataset)
  (ask-llm-save-responses! "gemma3:1b" evaluation-dataset)
  (ask-llm-save-responses! "gpt-4o-mini" evaluation-dataset)
  (ask-llm-save-responses! "gpt-4o" evaluation-dataset)
  (ask-llm-save-responses! "o4-mini-2025-04-16" evaluation-dataset)
  (ask-llm-save-responses! "o3-mini" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.0-flash" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-7-sonnet-20250219" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-5-haiku-20241022" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-haiku-20240307" evaluation-dataset)
  (ask-llm-save-responses! "llama3.2" evaluation-dataset)
  (ask-llm-save-responses! "mistral" evaluation-dataset)
  (ask-llm-save-responses! "llava" evaluation-dataset)
  (ask-llm-save-responses! "deepseek-r1" evaluation-dataset)
  (ask-llm-save-responses! "gemma3:4b" evaluation-dataset)
  (ask-llm-save-responses! "granite3.2" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.5-pro-preview-03-25" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.5-flash-preview-04-17" evaluation-dataset))
(defonce responses-ds
  (let [responses-dir "data/responses"
        responses (->> responses-dir
                       (io/file)
                       file-seq
                       rest
                       (map (comp edn/read-string slurp))
                       (reduce into))]
    (tc/dataset responses)))
(tc/row-count responses-ds)
200

Each model answered the 10 questions from the evaluation dataset, so that’s 200 responses overall.

6.3 Continuous Eval Metrics Functions

Below, I am creating a wrapper for the continuous-eval deterministic metrics, and rewriting the LLM metrics in Clojure, using the prompt templates that are provided in the continuous-eval repo.

For demonstrating how the metrics work, we will use a couple of the generated responses as samples.

For the question “How many households were in receipt of HAP payments in 2023?”, the data available states that 57,617 households were in receipt of payments at the end of Q3 2023. In other words, the full data for 2023 was not available at that time. Most of the models seemed to be able to pick up that detail, but one of the lower-powered ones, gemma3:1b (a 1-billion-parameter model), didn’t qualify the figure to state that it was only for Q3, and it also reported a different number.

Also, the question “Are there plans to further reduce public transport fares?” should be a simple ‘no’, based on the available data, but the gemma3:1b model also gets this one wrong.

(def sample-gen-responses
  (-> responses-ds
      (tc/select-rows #(and (or (= (:model-ref %) "llama3.1")
                                (= (:model-ref %) "gemma3:1b"))
                            (or (re-find #"receipt of HAP payments" (:question %))
                                (re-find #"transport fares" (:question %)))))))
(-> sample-gen-responses
    (tc/select-columns [:model-ref :question :answer])
    (kind/table))
model-ref question answer
llama3.1 Are there plans to further reduce public transport fares? No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.
llama3.1 How many households were in receipt of HAP payments in 2023? At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.
gemma3:1b Are there plans to further reduce public transport fares? Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.
gemma3:1b How many households were in receipt of HAP payments in 2023? In 2023, there were 114,974 households in receipt of the HAP scheme.

6.3.1 Deterministic Metrics

(require-python '[continuous_eval.metrics.generation.text.deterministic :as det])
:ok
(defn add-deterministic-metrics [{:keys [answer retrieved-context ground-truth] :as rag-data}]
  (let [faithfullness-spec  {:answer            answer
                             :retrieved_context retrieved-context}
        correctness-spec    {:answer               answer
                             :ground_truth_answers (if (seq ground-truth) ground-truth retrieved-context)}
        faithfulness-scores (into {} (py.. (det/DeterministicFaithfulness) (**compute faithfullness-spec)))
        correctness-scores  (into {} (py.. (det/DeterministicAnswerCorrectness) (**compute correctness-spec)))
        reading-scores      (into {} (py.. (det/FleschKincaidReadability) (compute answer)))]
    (->
     (merge
      faithfulness-scores
      correctness-scores
      reading-scores
      rag-data)
     (clojure.set/rename-keys
      {"flesch_reading_ease"         :flesch-reading-ease
       "flesch_kincaid_grade_level"  :flesch-kincaid-grade-level
       "rouge_l_recall"              :rouge-l-recall
       "rouge_faithfulness"          :rouge-faithfulness
       "rouge_l_precision"           :rouge-l-precision
       "rouge_l_f1"                  :rouge-l-f1
       "rouge_p_by_sentence"         :rouge-p-by-sentence
       "bleu_score_by_sentence"      :bleu-score-by-sentence
       "bleu_faithfulness"           :bleu-faithfulness
       "bleu_score"                  :bleu-score
       "token_overlap_p_by_sentence" :token-overlap-p-by-sentence
       "token_overlap_f1"            :token-overlap-f1
       "token_overlap_precision"     :token-overlap-precision
       "token_overlap_recall"        :token-overlap-recall
       "token_overlap_faithfulness"  :token-overlap-faithfulness}))))

Example score for the sample responses:

(-> (mapv add-deterministic-metrics (tc/rows sample-gen-responses :as-maps))
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :rouge-l-f1 :token-overlap-f1 :bleu-score])
    (kind/table))
model-ref question answer rouge-l-f1 token-overlap-f1 bleu-score
llama3.1 Are there plans to further reduce public transport fares? No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. 0.3582089520962353 0.2926829268292683 0.18303581288409837
llama3.1 How many households were in receipt of HAP payments in 2023? At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. 0.5925925884224966 0.5263157894736842 0.3523584062391067
gemma3:1b Are there plans to further reduce public transport fares? Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. 0.25714285411836735 0.20833333333333331 0.14325120993710644
gemma3:1b How many households were in receipt of HAP payments in 2023? In 2023, there were 114,974 households in receipt of the HAP scheme. 0.5714285665306124 0.5 0.5050058400713138

The ‘F1’ scores are the combination of ‘precision’ and ‘recall’ metrics. As we saw in previous sections, precision is how much of the generated answer is reflected in the ground truth (i.e., what % of the generated answer is not ‘superfluous’), and recall is how much of the ground truth is reflected in the generated answer. The F1 score is the harmonic mean of both these scores, with a score closer to 1 being better. The ‘BLEU’ score is also better when it is closer to 1.
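
As a quick check of how the harmonic mean behaves, the sketch below recomputes two F1 scores from precision and recall values that appear in the example output later in this section (the llama3.1 answer to the transport fares question). The `f1-score` helper is a hypothetical name, not part of continuous-eval.

```clojure
(defn f1-score
  "Harmonic mean of precision and recall; 0.0 when both are zero."
  [precision recall]
  (if (zero? (+ precision recall))
    0.0
    (/ (* 2.0 precision recall) (+ precision recall))))

;; token-overlap precision and recall for the llama3.1 fares answer
(f1-score 0.17142857142857143 1.0)

;; rouge-l precision and recall for the same answer
(f1-score 0.2222222222222222 0.9230769230769231)
```

These evaluate to roughly 0.293 and 0.358, in line with the token-overlap-f1 and rouge-l-f1 values reported for that answer. Because the answer is much longer than the ground truth, recall is high but precision is low, and the harmonic mean is pulled towards the lower of the two.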

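In this case, even though these metrics don’t check for semantic meaning or logic, the F1 scores do indicate that the llama3.1 responses were slightly better than the gemma3:1b responses. The BLEU score is less helpful here: for the HAP question, it ranks the incorrect gemma3:1b answer (0.51) above the llama3.1 answer (0.35).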

6.3.2 Semantic Similarity

We’ll also check, very roughly, the semantic similarity (based on cosine similarity) between the generated responses and the ground truth.

(defn calculate-cosine-similarity [text-a vec-text-b]
  (let [embedding-model (AllMiniLmL6V2EmbeddingModel/new)
        embedding-fn (fn [text]
                       (->> (TextSegment/from text)
                            (. embedding-model embed)
                            (.content)))]
    (CosineSimilarity/between (embedding-fn text-a) (embedding-fn (str/join " " vec-text-b)))))
(defn add-semantic-similarity [{:keys [answer ground-truth] :as rag-data}]
  (let [similarity-score (calculate-cosine-similarity answer ground-truth)]
    (-> rag-data
        (assoc :cosine-similarity similarity-score))))
(add-semantic-similarity {:answer "Berlin in the capital of France."
                          :ground-truth ["Paris is the capital of France."]})
{:answer "Berlin in the capital of France.",
 :ground-truth ["Paris is the capital of France."],
 :cosine-similarity 0.5875568139261295}
(add-semantic-similarity {:answer "The capital of France is Paris."
                          :ground-truth ["Paris is the capital of France."]})
{:answer "The capital of France is Paris.",
 :ground-truth ["Paris is the capital of France."],
 :cosine-similarity 0.989356549878067}
(add-semantic-similarity {:answer "Paris is the capital of France."
                          :ground-truth ["The capital of France is Paris."
                                         "The Mona Lisa is in Paris."]})
{:answer "Paris is the capital of France.",
 :ground-truth
 ["The capital of France is Paris." "The Mona Lisa is in Paris."],
 :cosine-similarity 0.7456770568295227}
(-> (mapv add-semantic-similarity (tc/rows sample-gen-responses :as-maps))
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :cosine-similarity])
    (kind/table))
model-ref question answer cosine-similarity
llama3.1 Are there plans to further reduce public transport fares? No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. 0.6855268462688894
llama3.1 How many households were in receipt of HAP payments in 2023? At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. 0.9459118435287973
gemma3:1b Are there plans to further reduce public transport fares? Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. 0.5193210040131704
gemma3:1b How many households were in receipt of HAP payments in 2023? In 2023, there were 114,974 households in receipt of the HAP scheme. 0.7805409346707153

We can see a limitation with this metric here: even though the last gemma3:1b answer is factually incorrect, it is still quite ‘semantically similar’ to the ground truth answer.
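
For reference, cosine similarity itself is just the dot product of two vectors divided by the product of their lengths. The sketch below shows that calculation on plain Clojure vectors; it is an illustration of the formula only, with a hypothetical `cosine-similarity` helper, and the library call used above may differ in its details.

```clojure
(defn cosine-similarity
  "Dot product of a and b divided by the product of their Euclidean norms.
   Returns 0.0 if either vector has zero length."
  [a b]
  (let [dot   (reduce + (map * a b))
        norm  (fn [v] (Math/sqrt (reduce + (map #(* % %) v))))
        denom (* (norm a) (norm b))]
    (if (zero? denom)
      0.0
      (/ dot denom))))

(cosine-similarity [1.0 0.0] [1.0 0.0]) ; same direction
(cosine-similarity [1.0 0.0] [0.0 1.0]) ; orthogonal
```

Vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0. Two sentences that share most of their wording will produce embeddings pointing in a similar direction, even when one key number differs, which is why the metric cannot catch the wrong household figure.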

6.3.3 LLM Metrics

As I mentioned above, I’m using the same approach/prompts for the LLM-based metrics as is used in the continuous-eval project.

For example, the ‘faithfulness’ prompt can be seen below:

(-> "prompts/faithfulness_sys.txt"
    slurp
    (str/replace #"\n" "\n\n")
    (gen/quoted-response)
    kind/md)

You are tasked to evaluate whether the statement is fully supported by the context. With a reasoning following the response, respond with either Yes or No.

If the model has correctly identified that it can’t answer based on the context, then also respond with Yes.

– EXAMPLES –

Example 1:

Context: The Eiffel Tower, a wrought-iron lattice tower on the Champ de Mars in Paris, France, is one of the most famous landmarks in the world. It was designed by Gustave Eiffel and completed in 1889.

Statement: The Eiffel Tower can be found in the center of London, near the Thames River.

Response:

{

“reasoning”: “The statement contradicts with the context, which states that Eiffel Tower is in Paris, as opposed to the center of London.”,

“score”: “No”

}

Example 2:

Context: Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that can later be released to fuel the organisms’ activities. This chemical energy is stored in carbohydrate molecules, such as sugars, which are synthesized from carbon dioxide and water.

Statement: Photosynthesis in plants primarily involves the conversion of light energy into chemical energy stored in forms such as sugar.

Response:

{

“reasoning”: “The statement is supported by the context, which states that photosynthesis converts light energy into chemical energy and that the chemical energy is stored in carbohydrate molecules, such as sugars.”,

“score”: “Yes”

}

(defn add-llm-metric-correctness-score [{:keys [question answer ground-truth] :as rag-data} llm-model]
  (let [system-prompt (slurp "prompts/ans_correctness_sys.txt")
        user-prompt   (-> "prompts/ans_correctness_user.txt"
                          slurp
                          (templates/render {:question     question
                                             :answer       answer
                                             :ground-truth (if (seq ground-truth)
                                                             ground-truth
                                                             (:retrieved-context rag-data))}))
        response      (llm/ask-llm
                       {:model-ref     llm-model
                        :question      user-prompt
                        :system-prompt system-prompt})
        score (first (re-find #"(?<=[Ss]core(.{1,4}))[1-5]" response))
        score (when score (parse-long score))]
    (-> rag-data
        (assoc :metric-llm-correctness-explanation response)
        (assoc :metric-llm-correctness-score score))))
(defn add-llm-metric-faithfulness-score [{:keys [answer retrieved-context] :as rag-data} llm-model]
  (let [system-prompt  (slurp "prompts/faithfulness_sys.txt")
        ret-ctx-joined (str/join "\n" retrieved-context)
        user-prompt    (-> "prompts/faithfulness_user.txt"
                           slurp
                           (templates/render {:answer                   answer
                                              :retrieved-context-joined ret-ctx-joined}))
        response       (llm/ask-llm
                        {:model-ref     llm-model
                         :question      user-prompt
                         :system-prompt system-prompt})
        score (first (re-find #"(?<=[Ss]core(.{1,4}))[yn]" (str/lower-case response)))
        score (when score (if (= score "y") 1 0))]
    (-> rag-data
        (assoc :metric-llm-faithfulness-explanation response)
        (assoc :metric-llm-faithfulness-score score))))
(defn add-llm-metric-relevance-score [{:keys [answer question] :as rag-data} llm-model]
  (let [system-prompt  (slurp "prompts/ans_relevance_sys.txt")
        user-prompt    (-> "prompts/ans_relevance_user.txt"
                           slurp
                           (templates/render {:answer answer
                                              :question question}))
        response       (llm/ask-llm
                        {:model-ref     llm-model
                         :question      user-prompt
                         :system-prompt system-prompt})
        score (first (re-find #"(?<=[Ss]core(.{1,4}))[1-3]" response))
        score (when score (parse-long score))]
    (-> rag-data
        (assoc :metric-llm-relevance-explanation response)
        (assoc :metric-llm-relevance-score score))))
(defn add-llm-metrics [rag-data model]
  (-> rag-data
      (add-llm-metric-correctness-score model)
      (add-llm-metric-faithfulness-score model)
      (add-llm-metric-relevance-score model)
      (assoc :evaluator-model model)))
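
The score extraction in the functions above relies on the evaluator replying with a JSON-like object that contains a "score" field. As a rough sketch of that step in isolation, the hypothetical helpers below pull the value that follows "score" out of a response string. The sample strings are shortened, made-up responses in the shape the prompts ask for, and the regular expression is an assumption about that shape rather than a general JSON parser.

```clojure
(require '[clojure.string :as str])

(defn extract-score
  "Returns the raw value following the \"score\" key as a string, or nil."
  [response]
  (some-> (re-find #"(?i)\"score\"\s*:\s*\"?([A-Za-z0-9]+)" response)
          second))

(defn yes-no->int
  "Maps \"Yes\" to 1 and \"No\" to 0 (case-insensitive); anything else to nil."
  [s]
  (case (some-> s str/lower-case)
    "yes" 1
    "no"  0
    nil))

;; Numeric score, as in the correctness and relevance prompts
(some-> (extract-score "{\n  \"reasoning\": \"Correct and complete.\",\n  \"score\": 5\n}")
        parse-long)

;; Yes/No score, as in the faithfulness prompt
(yes-no->int (extract-score "{\n  \"reasoning\": \"Supported by the context.\",\n  \"score\": \"Yes\"}"))
```

Returning nil when nothing matches keeps a malformed evaluator reply visible as a missing score instead of silently turning it into a low one.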

Finally, let’s wrap all three metric types above (deterministic, semantic, and LLM-based) into a single function.

(defn add-all-generation-evaluation-metrics [responses evaluation-model]
  (mapv (fn [resp]
          (-> resp
              add-deterministic-metrics
              add-semantic-similarity
              (add-llm-metrics evaluation-model)))
        responses))

Now, let’s use these metrics to evaluate the sample responses we generated earlier.

(comment
  (let [eval-model "gpt-4o"
        output-fname "data/evaluation_example/example.edn"
        sample-with-metrics (add-all-generation-evaluation-metrics
                             (tc/rows sample-gen-responses :as-maps)
                             eval-model)]
    (spit output-fname sample-with-metrics)))
(def sample-gen-responses-metrics (edn/read-string (slurp "data/evaluation_example/example.edn")))
(first sample-gen-responses-metrics)
{:metric-llm-faithfulness-explanation
 "{\n  \"reasoning\": \"The statement is fully supported by the context. It confirms that there are no current plans to further reduce public transport fares, stating that the 20% average PSO fare initiative and the 90-minute fare will continue. It also mentions the recently published Fares Determination, which aims to provide more consistent and equitable fares, aligning with the information in the context.\",\n  \"score\": \"Yes\"\n}",
 :rouge-l-recall 0.9230769230769231,
 :flesch-kincaid-grade-level 14.486250000000002,
 :flesch-reading-ease 29.200416666666683,
 :rouge-faithfulness 1.0,
 :retrieved-context
 [" The funding allocated by Government also supports the fare initiatives we’ve introduced since late 2021 such as –  - Reducing average fares on all PSO services by 20%, helping people with cost-of-living pressures; - Introducing, and most recently expanding, the Young Adult Card, reducing fares by a further 50% for young people just as they’re making choices about how they travel independently; and - Introducing a new 90-minute fare to enable easy interchange between different types of public transport. These initiatives have made public transport more affordable for more people and have undoubtably contributed toward the strong passenger growth we’re seeing across the network. There are no plans to amend or extend the Student Leap Card."
  "On rail we’re also supporting improvements across the network on both commuter and inter-city services.  We have also continued the 20% average PSO fare initiative and the 90-minute fare, all of which are available to those persons who have not yet reached the State pension age. However, there are no plans to further reduce public transport fares at this time."
  "€613.463m has been secured for the provision of Public Service Obligation (PSO) and Local Link services. This package includes funding for the continuation of the 20% fare reduction on PSO services, the extension of the Young Adult Card on both PSO and commercial bus services to include 24-and-25-year-olds, and the continuation of the 90-minute fare until the end of 2024. Further, the recently published Fares Determination will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin."
  "The Government is strongly committed to providing all citizens with reliable and realistic sustainable mobility options, and public transport plays a key role in the delivery of this goal.  To support this objective, Government has provided increased levels of funding for public transport services in recent years. Under Budget 2024, a funding package of c."
  "The report also found that any increase in public transport demand due to free fares would largely be achieved by reductions in walking (7%) and cycling (13%), resulting in just a 1% reduction in car travel. My Department’s funding for public transport services is therefore focussed on existing fare initiatives and service improvements such as BusConnects in the five cities, expanding town services in the larger towns, and rolling out Connecting Ireland in rural Ireland. These improvements are delivering real benefits for citizens across the country."],
 :metric-llm-faithfulness-score 1,
 :metric-llm-correctness-score 5,
 :metric-llm-correctness-explanation
 "{\n  \"reasoning\": \"The generated answer is correct and complete. It states that there are no current plans to further reduce public transport fares, which matches the ground truth answer. Additionally, it provides extra context about existing fare initiatives and a recently published Fares Determination, which does not contradict the main point and adds useful information.\",\n  \"score\": 5\n}",
 :metric-llm-relevance-score 3,
 :token-overlap-f1 0.2926829268292683,
 :ground-truth
 ["There are no plans to further reduce public transport fares at this time"],
 :question "Are there plans to further reduce public transport fares?",
 :token-overlap-faithfulness 1.0,
 :metric-llm-relevance-explanation
 "{\n  \"reasoning\": \"The answer directly responds to the question by stating that there are currently no plans to further reduce public transport fares. It provides additional context about the existing fare initiatives and mentions the recently published Fares Determination aiming for consistent and equitable fares. This information is relevant and fully addresses the question.\",\n  \"score\": 3\n}",
 :bleu-score-by-sentence
 [6.682549838299777E-14 1.7938417188868533E-8 6.289332584320069E-8],
 :model-ref "llama3.1",
 :rouge-p-by-sentence
 [0.8666666666666667 0.5833333333333334 0.7727272727272727],
 :bleu-score 0.18303581288409837,
 :token-overlap-recall 1.0,
 :evaluator-model "gpt-4o",
 :rouge-l-precision 0.2222222222222222,
 :token-overlap-precision 0.17142857142857143,
 :system-prompt
 "I want you to act as a responsible and trustworthy senior government official.\nPlease provide an answer to a citizen's question, using only the context provided.\nAnswer as if you are talking directly to the citizen and be neutral and formal as possible.\nIf you can't find a specific detail from the question, please acknowledge this and provide any\nother helpful information that may be related to the question.\nIf you can't find sufficient information in the context to answer the question at all,\nthen reply with \"I am unable to answer this question with the information I have available.\"\n\n CONTEXT:  The funding allocated by Government also supports the fare initiatives we’ve introduced since late 2021 such as –  - Reducing average fares on all PSO services by 20%, helping people with cost-of-living pressures; - Introducing, and most recently expanding, the Young Adult Card, reducing fares by a further 50% for young people just as they’re making choices about how they travel independently; and - Introducing a new 90-minute fare to enable easy interchange between different types of public transport. These initiatives have made public transport more affordable for more people and have undoubtably contributed toward the strong passenger growth we’re seeing across the network. There are no plans to amend or extend the Student Leap Card.\n\nOn rail we’re also supporting improvements across the network on both commuter and inter-city services.  We have also continued the 20% average PSO fare initiative and the 90-minute fare, all of which are available to those persons who have not yet reached the State pension age. However, there are no plans to further reduce public transport fares at this time.\n\n€613.463m has been secured for the provision of Public Service Obligation (PSO) and Local Link services. 
This package includes funding for the continuation of the 20% fare reduction on PSO services, the extension of the Young Adult Card on both PSO and commercial bus services to include 24-and-25-year-olds, and the continuation of the 90-minute fare until the end of 2024. Further, the recently published Fares Determination will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.\n\nThe Government is strongly committed to providing all citizens with reliable and realistic sustainable mobility options, and public transport plays a key role in the delivery of this goal.  To support this objective, Government has provided increased levels of funding for public transport services in recent years. Under Budget 2024, a funding package of c.\n\nThe report also found that any increase in public transport demand due to free fares would largely be achieved by reductions in walking (7%) and cycling (13%), resulting in just a 1% reduction in car travel. My Department’s funding for public transport services is therefore focussed on existing fare initiatives and service improvements such as BusConnects in the five cities, expanding town services in the larger towns, and rolling out Connecting Ireland in rural Ireland. These improvements are delivering real benefits for citizens across the country.",
 :token-overlap-p-by-sentence
 [0.8571428571428571 0.6923076923076923 0.7333333333333333],
 :answer
 "No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.",
 :rouge-l-f1 0.3582089520962353,
 :bleu-faithfulness 2.69439366191892E-8}

Example LLM Faithfulness evaluation (score can be ‘1 - faithful’ or ‘0 - not faithful’):

(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-faithfulness-score :metric-llm-faithfulness-explanation])
    (kind/table))
model-ref question answer metric-llm-faithfulness-score metric-llm-faithfulness-explanation
llama3.1 Are there plans to further reduce public transport fares? No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. 1 { "reasoning": "The statement is fully supported by the context. It confirms that there are no current plans to further reduce public transport fares, stating that the 20% average PSO fare initiative and the 90-minute fare will continue. It also mentions the recently published Fares Determination, which aims to provide more consistent and equitable fares, aligning with the information in the context.", "score": "Yes" }
llama3.1 How many households were in receipt of HAP payments in 2023? At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. 1 { "reasoning": "The statement is supported by the context, which provides the exact data point that at the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. Additionally, the context confirms that the latest available data covers only until the end of Q3 2023, aligning with the statement's note about the data coverage limitation.", "score": "Yes" }
gemma3:1b Are there plans to further reduce public transport fares? Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. 0 { "reasoning": "The statement inaccurately interprets the context by suggesting there are plans to further reduce public transport fares. The context clearly states, 'there are no plans to further reduce public transport fares at this time.' The mention of 'recently published Fares Determination' refers to achieving more consistent and equitable fares, not a further reduction in fares. Additionally, it is correct that there are no plans to amend or extend the Student Leap Card, but this does not contradict the point about fare reduction plans.", "score": "No" }
gemma3:1b How many households were in receipt of HAP payments in 2023? In 2023, there were 114,974 households in receipt of the HAP scheme. 0 { "reasoning": "The statement claims there were 114,974 households in receipt of the HAP scheme in 2023, but the context states that 114,974 HAP tenancies had been set-up since the scheme commenced. However, there were only 57,617 households actively in receipt of HAP support at the end of Q3 2023. Therefore, the statement confuses the total tenancies set up since inception with the actual number of active recipients in 2023.", "score": "No" }

As we can see, the evaluation model correctly identified the errors in the gemma3:1b answers.

Example LLM Correctness evaluation (range between 1 and 5):

(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-correctness-score :metric-llm-correctness-explanation])
    (kind/table))
model-ref question answer metric-llm-correctness-score metric-llm-correctness-explanation
llama3.1 Are there plans to further reduce public transport fares? No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. 5 { "reasoning": "The generated answer is correct and complete. It states that there are no current plans to further reduce public transport fares, which matches the ground truth answer. Additionally, it provides extra context about existing fare initiatives and a recently published Fares Determination, which does not contradict the main point and adds useful information.", "score": 5 }
llama3.1 How many households were in receipt of HAP payments in 2023? At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. 5 { "reasoning": "The generated answer accurately provides the number of households in receipt of HAP payments at the end of Q3 2023 by stating 'nearly 57,617 households.' It also adds a note that this data is only up to the end of Q3 2023 and may not reflect the full year 2023, adding further context to the data. This additional context is a correct and informative expansion upon the ground truth answer.", "score": 5 }
gemma3:1b Are there plans to further reduce public transport fares? Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. 2 { "reasoning": "The generated answer is incorrect because it mistakenly states that there are plans to reduce public transport fares, contrary to the ground truth answer which indicates that there are no plans for further reduction. The generated answer also provides additional information about fare consistency and equity, but these do not address the main question directly regarding reduction plans.", "score": 2 }
gemma3:1b How many households were in receipt of HAP payments in 2023? In 2023, there were 114,974 households in receipt of the HAP scheme. 2 { "reasoning": "The generated answer is incorrect because it states that there were 114,974 households in receipt of the HAP scheme, while the ground truth answers mention that at the end of Q3 2023, there were nearly 57,617 households actively receiving HAP support. The figures are not aligned, indicating that the generated answer contains major errors.", "score": 2 }

Example LLM Relevance evaluation (range between 1 and 3):

(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-relevance-score :metric-llm-relevance-explanation])
    (kind/table))
model-ref question answer metric-llm-relevance-score metric-llm-relevance-explanation
llama3.1 Are there plans to further reduce public transport fares? No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. 3 { "reasoning": "The answer directly responds to the question by stating that there are currently no plans to further reduce public transport fares. It provides additional context about the existing fare initiatives and mentions the recently published Fares Determination aiming for consistent and equitable fares. This information is relevant and fully addresses the question.", "score": 3 }
llama3.1 How many households were in receipt of HAP payments in 2023? At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. 3 { "reasoning": "The answer provides a specific number of households receiving HAP payments at the end of Q3 2023, which directly addresses the question about the number of households in receipt of HAP payments in 2023. It also clarifies the time frame of the data, indicating that it may not cover the entire year of 2023, thereby providing a complete and relevant response to the question.", "score": 3 }
gemma3:1b Are there plans to further reduce public transport fares? Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. 2 { "reasoning": "The answer correctly identifies that there are plans to further reduce public transport fares by mentioning a 'recently published Fares Determination' aimed at making fares more consistent and equitable. However, the reference to the Student Leap Card is irrelevant to the main question about reducing public transport fares. Thus, the answer partially answers the question, as it provides some relevant information but includes unnecessary details.", "score": 2 }
gemma3:1b How many households were in receipt of HAP payments in 2023? In 2023, there were 114,974 households in receipt of the HAP scheme. 3 { "reasoning": "The generated answer provides a specific number of households that were in receipt of HAP payments in 2023. This directly answers the question by providing the exact information requested.", "score": 3 }

Interestingly, even though the gemma3:1b responses were factually incorrect, they still received fairly high ‘relevance’ scores (2 and 3 out of 3) from the evaluator model. In other words, the evaluator recognises that the answers were still attempting to address the question in a ‘relevant’ manner, even though they got the facts wrong.
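To surface such cases systematically, one could filter for answers that score the maximum on relevance but low on correctness. The following is a minimal sketch on plain maps, using scores copied from the example tables above; the helper name `relevant-but-incorrect` and the cut-off values are illustrative assumptions rather than part of the evaluation code.

```clojure
;; Scores copied from the example tables above.
(def sample-scores
  [{:model-ref "llama3.1"
    :question "How many households were in receipt of HAP payments in 2023?"
    :metric-llm-correctness-score 5
    :metric-llm-relevance-score 3}
   {:model-ref "gemma3:1b"
    :question "How many households were in receipt of HAP payments in 2023?"
    :metric-llm-correctness-score 2
    :metric-llm-relevance-score 3}
   {:model-ref "gemma3:1b"
    :question "Are there plans to further reduce public transport fares?"
    :metric-llm-correctness-score 2
    :metric-llm-relevance-score 2}])

;; Hypothetical helper: keep rows judged fully relevant (3/3)
;; but scored 2 or lower (out of 5) on correctness.
(defn relevant-but-incorrect [rows]
  (filterv (fn [{:keys [metric-llm-correctness-score
                        metric-llm-relevance-score]}]
             (and (= 3 metric-llm-relevance-score)
                  (<= metric-llm-correctness-score 2)))
           rows))

(mapv :model-ref (relevant-but-incorrect sample-scores))
;; => ["gemma3:1b"]
```

Only the gemma3:1b answer to the HAP question matches, which is the case where a confident but wrong figure was rated as fully relevant.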

6.3.4 Running/Saving evaluations

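(defn run-and-save-evaluation-metrics!
  "Adds all evaluation metrics to one model's responses and writes them to an EDN file."
  [responses model]
  (let [model-ref (:model-ref (first responses))
        f-name (str "data/responses_evaluation/" model-ref "_evaluation.edn")
        resp (add-all-generation-evaluation-metrics responses model)]
    ;; pr-str produces readable EDN even if resp is a lazy sequence
    (spit f-name (pr-str resp))))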
(defn run-and-save-all-evals! [responses-dir model]
  (let [responses (->> (io/file responses-dir)
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp)))]
    (mapv #(run-and-save-evaluation-metrics! % model) responses)))
(comment
  ;; 43:55 (very roughly) to run around 15 models
  ;; cost - around 1.44 USD for 18 models * 10 questions each - 180 evaluations
  (run-and-save-all-evals! "data/responses" "o4-mini-2025-04-16"))

6.4 Exploring Performance

In this part we’ll try to compare the 20 models based on their performance across the metrics.

We’ll start by defining a few helper functions.

(defn average-coll [coll]
  (float
   (/ (apply + (remove nil? coll))
      (count (remove nil? coll)))))
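As a quick check of how `average-coll` treats missing scores, the sketch below repeats the definition so that it runs on its own. The `average-coll-safe` variant is an assumed addition, not part of the original notebook; it only illustrates one way to guard against a collection with no usable values.

```clojure
;; Same definition as above, repeated so this snippet is self-contained.
(defn average-coll [coll]
  (float
   (/ (apply + (remove nil? coll))
      (count (remove nil? coll)))))

;; nil values are dropped from both the sum and the count,
;; so this averages 4 values rather than 5.
(average-coll [1 nil 0 1 1])
;; => 0.75

;; Assumed variant: returns nil instead of throwing a divide-by-zero
;; error when every value is nil (or the collection is empty).
(defn average-coll-safe [coll]
  (let [xs (remove nil? coll)]
    (when (seq xs)
      (float (/ (apply + xs) (count xs))))))

(average-coll-safe [nil nil])
;; => nil
```

Because nil values are removed before averaging, an average can be computed over fewer rows than there are questions for a model.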
(defn average-all-cols [numerical-ds]
  (let [cols (tc/column-names numerical-ds)]
    (tc/dataset
     (reduce (fn [res col]
               (assoc res col (average-coll (numerical-ds col))))
             {} cols))))
(defn summarise-model-performance-avgs [rag-datas]
  (let [model-ref (:model-ref (first rag-datas))]
    (-> rag-datas
        (tc/dataset)
        (tc/drop-columns #(re-find #"by-sentence" (name %)))
        (tc/select-columns :type/numerical)
        average-all-cols
        (tc/add-column :model-ref model-ref))))
(defn build-responses-eval-ds-avgs [responses-eval-dir]
  (let [responses (->> responses-eval-dir
                       io/file
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp))
                       (mapv summarise-model-performance-avgs))]
    (apply tc/concat responses)))
(def ds-performance-averages (build-responses-eval-ds-avgs "data/responses_evaluation"))
(kind/table ds-performance-averages)
rouge-l-recall flesch-kincaid-grade-level flesch-reading-ease rouge-faithfulness metric-llm-faithfulness-score metric-llm-correctness-score metric-llm-relevance-score token-overlap-f1 token-overlap-faithfulness bleu-score token-overlap-recall rouge-l-precision token-overlap-precision cosine-similarity rouge-l-f1 bleu-faithfulness model-ref
0.8239417672157288 14.509193420410156 30.045215606689453 0.8083333373069763 1.0 4.5 2.700000047683716 0.3923736810684204 0.824999988079071 0.2834573984146118 0.9157578349113464 0.30401644110679626 0.2837636470794678 0.8581802845001221 0.40897229313850403 3.135146835120395E-5 gemini-2.5-flash-preview-04-17
0.7446075677871704 15.492807388305664 27.18585777282715 0.670634925365448 0.699999988079071 4.400000095367432 2.700000047683716 0.25699713826179504 0.6595237851142883 0.20455192029476166 0.809962272644043 0.18923130631446838 0.19086696207523346 0.7771462798118591 0.2610572874546051 1.724415342323482E-4 llava
0.6585989594459534 12.719018936157227 38.68239212036133 0.8999999761581421 0.800000011920929 4.0 2.5 0.48910555243492126 0.800000011920929 0.4617713987827301 0.7250267267227173 0.4208480715751648 0.43597525358200073 0.7402637004852295 0.46610692143440247 5.130228237248957E-5 gemini-2.0-flash-lite
0.8267883062362671 15.08983325958252 24.900646209716797 0.5108433365821838 0.8999999761581421 4.599999904632568 2.700000047683716 0.19189156591892242 0.49193307757377625 0.11331550776958466 0.873235285282135 0.1329944133758545 0.11036045849323273 0.8017071485519409 0.2255573570728302 4.652861025533639E-5 mistral
0.7401713728904724 13.667346954345703 30.989845275878906 0.8883333206176758 1.0 4.099999904632568 2.5999999046325684 0.43316417932510376 0.8666666746139526 0.35321152210235596 0.829186201095581 0.3343445956707001 0.32255521416664124 0.844434380531311 0.4301661252975464 2.77885210380191E-5 gemma3:4b
0.7641585469245911 16.72433853149414 18.24077796936035 0.8014285564422607 0.7777777910232544 4.5 2.799999952316284 0.27142131328582764 0.770714282989502 0.18540184199810028 0.832671582698822 0.1887160837650299 0.1654357612133026 0.8286173939704895 0.29786062240600586 3.24876127706375E-5 gpt-3.5-turbo
0.8675227165222168 13.08331298828125 35.59113693237305 0.45357003808021545 0.800000011920929 4.0 2.700000047683716 0.09177528321743011 0.4481297433376312 0.054576508700847626 0.8706628084182739 0.08819595724344254 0.049087464809417725 0.7681989669799805 0.1584366261959076 4.776632704306394E-5 deepseek-r1
0.8518903255462646 15.363012313842773 29.911664962768555 0.8178571462631226 0.699999988079071 4.400000095367432 2.799999952316284 0.40320008993148804 0.8285714387893677 0.2871659994125366 0.930367648601532 0.2962702214717865 0.28042489290237427 0.834974467754364 0.4168442189693451 8.801817602943629E-5 llama3.1
0.8944202661514282 13.306819915771484 37.606502532958984 0.901190459728241 1.0 4.400000095367432 2.5 0.5528241395950317 0.9279761910438538 0.4853930175304413 0.9163230061531067 0.4865003228187561 0.4802079498767853 0.8847209811210632 0.5746822357177734 2.626385321491398E-5 gemini-2.0-flash
0.8093876838684082 17.22306251525879 22.876859664916992 0.7422980070114136 1.0 4.699999809265137 2.9000000953674316 0.29043659567832947 0.7595959305763245 0.1903877556324005 0.9099509716033936 0.23230594396591187 0.19369524717330933 0.839836835861206 0.3286050856113434 0.0020474481862038374 claude-3-haiku-20240307
0.6825458407402039 14.20469856262207 29.108247756958008 0.8657143115997314 0.8999999761581421 4.400000095367432 2.5999999046325684 0.41160085797309875 0.841269850730896 0.36301279067993164 0.7764495611190796 0.3486059010028839 0.34439048171043396 0.7830838561058044 0.4076654016971588 1.0013163773692213E-5 gpt-4o
0.7528297305107117 15.68631649017334 27.47216796875 0.6123015880584717 0.800000011920929 4.300000190734863 2.5 0.34956371784210205 0.6194444298744202 0.26341864466667175 0.876151978969574 0.26649415493011475 0.25289803743362427 0.8231343030929565 0.3607568144798279 8.60626096255146E-5 llama3.2
0.8491318821907043 13.80518627166748 31.911664962768555 0.8536802530288696 1.0 4.699999809265137 2.700000047683716 0.38320112228393555 0.9280391931533813 0.2799951434135437 0.9323809742927551 0.29193824529647827 0.2698879837989807 0.8561334013938904 0.4001650810241699 3.265933628426865E-5 gemini-2.5-pro-preview-03-25
0.6980282664299011 15.064505577087402 24.628639221191406 0.628333330154419 0.8999999761581421 4.699999809265137 2.9000000953674316 0.29340195655822754 0.5958333611488342 0.21232956647872925 0.8216220736503601 0.19381453096866608 0.1887860894203186 0.8139668107032776 0.294689804315567 6.333462806651369E-5 o3-mini
0.7963286638259888 14.412919998168945 28.09100341796875 0.8030555844306946 0.8999999761581421 4.400000095367432 2.799999952316284 0.37689483165740967 0.7830555438995361 0.26848381757736206 0.9036029577255249 0.2749701142311096 0.25850623846054077 0.8277978897094727 0.38727399706840515 2.2545595129486173E-5 granite3.2
0.7367693185806274 15.385704040527344 23.50239372253418 0.7857142686843872 0.8999999761581421 4.5 2.700000047683716 0.40638965368270874 0.7749999761581421 0.30492275953292847 0.8624509572982788 0.2630685865879059 0.2982247471809387 0.8380616903305054 0.3672952651977539 2.1593075871351175E-5 gpt-4o-mini
0.7743207216262817 18.492828369140625 16.849618911743164 0.8702597618103027 0.8999999761581421 4.599999904632568 2.799999952316284 0.2994838356971741 0.885064959526062 0.22269850969314575 0.8431162238121033 0.2348792999982834 0.2037002444267273 0.8248969316482544 0.3347106873989105 0.0013325664913281798 claude-3-5-haiku-20241022
0.8427241444587708 17.876224517822266 21.390300750732422 0.8999999761581421 0.8999999761581421 4.300000190734863 2.799999952316284 0.3259376883506775 0.8999999761581421 0.2285032570362091 0.919766902923584 0.24280297756195068 0.21922659873962402 0.8467821478843689 0.3485448658466339 0.0023539727553725243 claude-3-7-sonnet-20250219
0.6483379006385803 14.382723808288574 31.3478946685791 0.49647414684295654 0.20000000298023224 2.4000000953674316 2.200000047683716 0.21258415281772614 0.5069719552993774 0.17131498456001282 0.6721078157424927 0.18695475161075592 0.16818435490131378 0.733691394329071 0.24562521278858185 2.699718461371958E-4 gemma3:1b
0.6876193284988403 15.509061813354492 28.586748123168945 0.7070617079734802 0.8999999761581421 4.800000190734863 2.9000000953674316 0.36989516019821167 0.6689935326576233 0.32387083768844604 0.7959415316581726 0.3038523495197296 0.3066492974758148 0.8207412958145142 0.35781922936439514 8.054783684201539E-4 o4-mini-2025-04-16
(defn concat-responses-eval-data [responses-eval-dir]
  (let [responses (->> responses-eval-dir
                       io/file
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp)))]
    (reduce into responses)))
(defn add-model-platform [ds]
  (-> ds
      (tc/map-columns :platform [:model-ref]
                      (fn [m]
                        (->
                         (filter #(= (:model-ref %) m) llm/llm-models)
                         first
                         :platform)))))
(defn concat-responses-eval-ds-narrowed [responses-eval-dir]
  (let [ds (tc/dataset (concat-responses-eval-data responses-eval-dir))]
    (-> ds
        (tc/select-columns
         (concat
          (tc/column-names ds :type/numerical)
          [:model-ref :question])))))
(def responses-eval-data (concat-responses-eval-data "data/responses_evaluation"))
(def ds-responses-eval-narrowed (concat-responses-eval-ds-narrowed "data/responses_evaluation"))
(defn make-boxplot [metric]
  (->
   ds-responses-eval-narrowed
   add-model-platform
   (tc/order-by :model-ref)
   (plotly/base
    {:=width 800
     :=color :platform})
   (plotly/layer-boxplot
    {:=x :model-ref
     :=y metric})))

6.4.1 Deterministic Metrics (non-llm)

6.4.1.1 Reading Ease

The flesch-kincaid-grade-level and flesch-reading-ease metrics indicate how readable a response is. A lower grade level and a higher reading-ease score both indicate more readable text.
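For reference, both scores are derived from the average sentence length and the average number of syllables per word. The sketch below applies the standard Flesch formulas to pre-computed counts; the function names and the input map shape are assumptions for illustration, and the library used for the actual metrics may count sentences and syllables differently.

```clojure
;; Standard Flesch reading-ease formula, applied to pre-computed counts.
;; Higher scores mean easier text.
(defn flesch-reading-ease
  [{:keys [words sentences syllables]}]
  (- 206.835
     (* 1.015 (/ words (double sentences)))
     (* 84.6 (/ syllables (double words)))))

;; Standard Flesch-Kincaid grade-level formula.
;; Lower scores mean easier text.
(defn flesch-kincaid-grade-level
  [{:keys [words sentences syllables]}]
  (- (+ (* 0.39 (/ words (double sentences)))
        (* 11.8 (/ syllables (double words))))
     15.59))

;; Illustrative counts (not taken from the evaluation data):
;; 100 words in 5 sentences, with 150 syllables in total.
(def example-counts {:words 100 :sentences 5 :syllables 150})

(flesch-reading-ease example-counts)        ; approximately 59.6
(flesch-kincaid-grade-level example-counts) ; approximately 9.9
```

Since both terms in the reading-ease formula are subtracted, long sentences made of polysyllabic words can push the score below zero.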

(make-boxplot :flesch-reading-ease)
(make-boxplot :flesch-kincaid-grade-level)

Examples of the answers with the lowest and highest reading ease:

(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/order-by :flesch-reading-ease)
    (tc/select-rows (range 1)))

_unnamed [1 2]:

:flesch-reading-ease :answer
-9.58022523 Based on the information provided, I cannot give a specific answer about whether the government will put in place Level 6 (QQI) courses for healthcare assistants.
The context indicates that currently:
- The educational requirement for a Health Care Assistant is a Level 5 (QQI) qualification
- The Office of Nursing and Midwifery Services Director (ONMSD) designs and delivers professional development programs ranging from level 5 to level 8
- Healthcare assistants are supported through ONMSD Centers to achieve the Level 5 Major award in Health Service Skills
While there is information about support for jobseekers pursuing higher education up to Level 8 and the ongoing revalidation of craft programs, there is no specific mention of plans to develop Level 6 courses specifically for healthcare assistants.
If you require more information about progression pathways for healthcare assistants beyond Level 5, I would recommend contacting the ONMSD or the relevant department directly.
(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/order-by :flesch-reading-ease :desc)
    (tc/select-rows (range 1)))

_unnamed [1 2]:

:flesch-reading-ease :answer
69.99384615 In 2023, there were 114,974 households in receipt of the HAP scheme.

Let’s try a high reading-ease answer with more than 100 words…

(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    ;; Approximate word count: splitting on runs of word characters
    ;; returns the text between words, so the count can be off by one.
    (tc/map-columns :wc [:answer] (fn [ans]
                                    (-> (str/split ans #"\w+")
                                        (count))))
    (tc/select-rows #(> (:wc %) 100))
    (tc/order-by :flesch-reading-ease :desc)
    (tc/select-rows (range 1)))

_unnamed [1 3]:

:flesch-reading-ease :answer :wc
47.57846154 According to the provided context, a total of 161,600 children benefit from the School Transport Scheme. Here is the breakdown: 118
1. Over 135,000 pupils are traveling on primary and post-primary services.
2. 19,800 pupils have special educational needs.
3. 6,800 pupils who have arrived in Ireland from Ukraine are included in this number as well.
It’s important to note that these numbers might overlap for students with special educational needs who also travel on primary or post-primary services. The context does not provide a separate total for students with special educational needs and those from Ukraine, but it does give the individual numbers within the overall figure of 161,600.
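The word count used above splits on `\w+`, which returns the text between words rather than the words themselves, so the count is only approximate. A more direct count could be sketched as follows; the helper names `word-count` and `word-count-whitespace` are hypothetical and not part of the notebook.

```clojure
(require '[clojure.string :as str])

;; Counts runs of word characters. Note that a number such as
;; "114,974" is counted as two words with this pattern.
(defn word-count [s]
  (count (re-seq #"\w+" s)))

;; Alternative: counts whitespace-separated tokens instead.
(defn word-count-whitespace [s]
  (count (str/split (str/trim s) #"\s+")))

(def example-answer
  "In 2023, there were 114,974 households in receipt of the HAP scheme.")

(word-count example-answer)
;; => 13
(word-count-whitespace example-answer)
;; => 12
```

For a filter threshold of 100 words the difference between these approaches is unlikely to matter, but it can matter when the counts themselves are reported.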

6.4.1.2 Precision

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-precision})
    (plotly/layer-bar
     {:=y :rouge-l-precision}))

6.4.1.3 Recall

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-recall})
    (plotly/layer-bar
     {:=y :rouge-l-recall}))

6.4.1.4 Precision/Recall (F1)

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :rouge-l-f1}))
(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-f1}))

6.4.2 LLM Generated Metrics

6.4.2.1 Faithfulness

(defn make-bar-avgs [metric]
  (->
   ds-performance-averages
   add-model-platform
   (tc/order-by metric)
   (plotly/base
    {:=width 800
     :=color :platform})
   (plotly/layer-bar
    {:=x :model-ref
     :=y metric})))
(make-bar-avgs :metric-llm-faithfulness-score)

6.4.2.2 Correctness

(make-bar-avgs :metric-llm-correctness-score)

6.4.2.3 Relevance

(make-bar-avgs :metric-llm-relevance-score)

6.4.3 Individual Performances

Let’s make a simple ‘dashboard’ type view to try to get a sense of each model’s performance at a glance.

We’ll introduce an ‘indicator’ marker to show if the model is performing okay for a metric. Perhaps in an actual evaluation system this could be some kind of target threshold that the model should meet. In this case, we’ll just use the averages of all the model performances, so that the indicator will simply indicate if the metric is above/below average.

(defn average-score [ds metrics]
  (->>
   (mapv #(ds %) metrics)
   (reduce into)
   (average-coll)))
(defn eval-averages [ds]
  (-> ds
      (tc/aggregate {:faithfulness #(average-score % [:metric-llm-faithfulness-score])
                     :correctness #(average-score % [:metric-llm-correctness-score])
                     :relevance #(average-score % [:metric-llm-relevance-score])
                     :semantic-similarity #(average-score % [:cosine-similarity])
                     :recall #(average-score % [:token-overlap-recall
                                                :rouge-l-recall])
                     :precision #(average-score % [:rouge-l-precision
                                                   :token-overlap-precision])
                     :f1 #(average-score % [:rouge-l-f1
                                            :token-overlap-f1])})
      (tc/rows :as-maps)
      first))
(def eval-averages-all (eval-averages ds-responses-eval-narrowed))
(defn indicator-symbol [colour]
  [:span {:style (str "color: " colour ";")} "&#11044;"])
(def indicator-bad (indicator-symbol "red"))
(def indicator-medium (indicator-symbol "yellow"))
(def indicator-good (indicator-symbol "green"))

The indicator is green if the score is above the target, amber if it is at the target or no more than 10% below it, and red if it is more than 10% below the target.

(defn make-indicator-symbol [value target-value]
  (if (> value target-value) indicator-good
      (let [diff (abs (- target-value value))
            diff-percent (float (/ diff target-value))]
        (if (<= diff-percent 0.1)
          indicator-medium
          indicator-bad))))
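To see the thresholds in action, here is a small worked example. It is a sketch that mirrors the rule above but returns a keyword instead of a hiccup span so that it can run on its own; the name `indicator-level` is hypothetical, and the target is the faithfulness reference average shown in the summaries below.

```clojure
;; Same rule as make-indicator-symbol, but returning a keyword.
(defn indicator-level [value target-value]
  (if (> value target-value)
    :green
    ;; Relative shortfall below the target, as a fraction of the target.
    (let [shortfall (/ (- target-value value) (double target-value))]
      (if (<= shortfall 0.1)
        :amber
        :red))))

;; Faithfulness reference average across all models.
(def faithfulness-target 0.8492462)

;; Models with 10, 8 and 7 faithful answers out of 10.
(mapv #(indicator-level (/ % 10) faithfulness-target) [10 8 7])
;; => [:green :amber :red]
```

A score of 8/10 falls about 6% short of the target and is shown as amber, while 7/10 falls about 18% short and is shown as red.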
(defn model-performance-summary [ds model-ref]
  (let [model-per       (filter #(= (:model-ref %) model-ref) ds)
        faithfulness    (count (filter #(= (:metric-llm-faithfulness-score %) 1) model-per))
        total-questions (count model-per)
        {:keys [correctness
                relevance
                semantic-similarity
                recall
                precision
                f1]} (-> model-per tc/dataset eval-averages)]
    [:div
     [:h1 (name model-ref)]
     [:p (str "Scores based on " total-questions " evaluation questions.")]
     [:table {:style "width: 70%;"}
      [:tr
       [:th "Metric"]
       [:th "Score"]
       [:th "Reference Average"]
       [:th "Status"]]
      [:tr
       [:td "Faithfulness"]
       [:td (str faithfulness "/" total-questions)]
       [:td (:faithfulness eval-averages-all)]
       [:td (make-indicator-symbol (/ faithfulness total-questions) (:faithfulness eval-averages-all))]]
      [:tr
       [:td "Correctness"]
       [:td correctness]
       [:td (:correctness eval-averages-all)]
       [:td (make-indicator-symbol correctness
                                   (:correctness eval-averages-all))]]
      [:tr
       [:td "Relevance"]
       [:td relevance]
       [:td (:relevance eval-averages-all)]
       [:td (make-indicator-symbol relevance
                                   (:relevance eval-averages-all))]]
      [:tr
       [:td "Semantic Similarity"]
       [:td semantic-similarity]
       [:td (:semantic-similarity eval-averages-all)]
       [:td (make-indicator-symbol semantic-similarity
                                   (:semantic-similarity eval-averages-all))]]
      [:tr
       [:td "Recall"]
       [:td recall]
       [:td (:recall eval-averages-all)]
       [:td (make-indicator-symbol recall
                                   (:recall eval-averages-all))]]
      [:tr
       [:td "Precision"]
       [:td precision]
       [:td (:precision eval-averages-all)]
       [:td (make-indicator-symbol precision
                                   (:precision eval-averages-all))]]
      [:tr
       [:td "F1"]
       [:td f1]
       [:td (:f1 eval-averages-all)]
       [:td (make-indicator-symbol f1
                                   (:f1 eval-averages-all))]]]]))
(mapv #(kind/hiccup (model-performance-summary responses-eval-data %))
     (distinct (map :model-ref responses-eval-data)))

[

gemini-2.5-flash-preview-04-17

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 10/10 0.8492462
Correctness 4.5 4.335
Relevance 2.7 2.69
Semantic Similarity 0.8581803 0.8173185
Recall 0.8698498 0.8116715
Precision 0.29389003 0.25759077
F1 0.40067297 0.34687445

llava

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 7/10 0.8492462
Correctness 4.4 4.335
Relevance 2.7 2.69
Semantic Similarity 0.7771463 0.8173185
Recall 0.7772849 0.8116715
Precision 0.19004913 0.25759077
F1 0.2590272 0.34687445

gemini-2.0-flash-lite

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 8/10 0.8492462
Correctness 4.0 4.335
Relevance 2.5 2.69
Semantic Similarity 0.7402637 0.8173185
Recall 0.6918128 0.8116715
Precision 0.42841166 0.25759077
F1 0.47760624 0.34687445

mistral

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.6 4.335
Relevance 2.7 2.69
Semantic Similarity 0.80170715 0.8173185
Recall 0.85001177 0.8116715
Precision 0.121677436 0.25759077
F1 0.20872447 0.34687445

gemma3:4b

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 10/10 0.8492462
Correctness 4.1 4.335
Relevance 2.6 2.69
Semantic Similarity 0.8444344 0.8173185
Recall 0.78467876 0.8116715
Precision 0.3284499 0.25759077
F1 0.43166515 0.34687445

gpt-3.5-turbo

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 7/10 0.8492462
Correctness 4.5 4.335
Relevance 2.8 2.69
Semantic Similarity 0.8286174 0.8173185
Recall 0.79841506 0.8116715
Precision 0.17707592 0.25759077
F1 0.28464097 0.34687445

deepseek-r1

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 8/10 0.8492462
Correctness 4.0 4.335
Relevance 2.7 2.69
Semantic Similarity 0.76819897 0.8173185
Recall 0.86909276 0.8116715
Precision 0.06864171 0.25759077
F1 0.12510595 0.34687445

llama3.1

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 7/10 0.8492462
Correctness 4.4 4.335
Relevance 2.8 2.69
Semantic Similarity 0.83497447 0.8173185
Recall 0.89112896 0.8116715
Precision 0.28834757 0.25759077
F1 0.41002214 0.34687445

gemini-2.0-flash

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 10/10 0.8492462
Correctness 4.4 4.335
Relevance 2.5 2.69
Semantic Similarity 0.884721 0.8173185
Recall 0.90537167 0.8116715
Precision 0.48335415 0.25759077
F1 0.5637532 0.34687445

claude-3-haiku-20240307

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 10/10 0.8492462
Correctness 4.7 4.335
Relevance 2.9 2.69
Semantic Similarity 0.83983684 0.8173185
Recall 0.8596693 0.8116715
Precision 0.2130006 0.25759077
F1 0.30952084 0.34687445

gpt-4o

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.4 4.335
Relevance 2.6 2.69
Semantic Similarity 0.78308386 0.8173185
Recall 0.72949773 0.8116715
Precision 0.3464982 0.25759077
F1 0.40963313 0.34687445

llama3.2

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 8/10 0.8492462
Correctness 4.3 4.335
Relevance 2.5 2.69
Semantic Similarity 0.8231343 0.8173185
Recall 0.81449085 0.8116715
Precision 0.2596961 0.25759077
F1 0.35516027 0.34687445

gemini-2.5-pro-preview-03-25

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 10/10 0.8492462
Correctness 4.7 4.335
Relevance 2.7 2.69
Semantic Similarity 0.8561334 0.8173185
Recall 0.8907564 0.8116715
Precision 0.2809131 0.25759077
F1 0.3916831 0.34687445

o3-mini

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.7 4.335
Relevance 2.9 2.69
Semantic Similarity 0.8139668 0.8173185
Recall 0.75982517 0.8116715
Precision 0.1913003 0.25759077
F1 0.29404587 0.34687445

granite3.2

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.4 4.335
Relevance 2.8 2.69
Semantic Similarity 0.8277979 0.8173185
Recall 0.8499658 0.8116715
Precision 0.26673818 0.25759077
F1 0.3820844 0.34687445

gpt-4o-mini

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.5 4.335
Relevance 2.7 2.69
Semantic Similarity 0.8380617 0.8173185
Recall 0.79961014 0.8116715
Precision 0.28064665 0.25759077
F1 0.38684246 0.34687445

claude-3-5-haiku-20241022

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.6 4.335
Relevance 2.8 2.69
Semantic Similarity 0.82489693 0.8173185
Recall 0.8087185 0.8116715
Precision 0.21928978 0.25759077
F1 0.31709728 0.34687445

claude-3-7-sonnet-20250219

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.3 4.335
Relevance 2.8 2.69
Semantic Similarity 0.84678215 0.8173185
Recall 0.88124555 0.8116715
Precision 0.23101479 0.25759077
F1 0.3372413 0.34687445

gemma3:1b

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 2/10 0.8492462
Correctness 2.4 4.335
Relevance 2.2 2.69
Semantic Similarity 0.7336914 0.8173185
Recall 0.6602229 0.8116715
Precision 0.17756955 0.25759077
F1 0.22910468 0.34687445

o4-mini-2025-04-16

Scores based on 10 evaluation questions.

Metric Score Reference Average Status
Faithfulness 9/10 0.8492462
Correctness 4.8 4.335
Relevance 2.9 2.69
Semantic Similarity 0.8207413 0.8173185
Recall 0.74178046 0.8116715
Precision 0.30525082 0.25759077
F1 0.36385718 0.34687445

]

6.4.4 Evaluating the Evaluation Dataset

As a last step, let’s have a quick look to see if the metrics can tell us anything about our evaluation dataset itself.

For example, which question has the most wrong (non-faithful) answers?

(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:question :metric-llm-faithfulness-score])
    (tc/drop-missing :metric-llm-faithfulness-score) ;; There is actually one missing here...
    (tc/group-by [:question])
    (tc/aggregate {:total-correct #(apply + (% :metric-llm-faithfulness-score))})
    (tc/order-by :total-correct))

_unnamed [10 2]:

:question :total-correct
Will the government put in place Level 6 (QQI) courses for healthcare assistants? 11
What is the salary scale for an archaeologist in the local government sector? 16
How many pupils in total benefit from the School Transport Scheme (STS)? Can you also break down the numbers in terms of students with special educational needs and students who are from Ukraine? 17
How many households were in receipt of HAP payments in 2023? 17
What is the government doing with regard to the National Drugs Strategy? 17
What is the government doing to help improve GP services? 17
How is the government encouraging local authorities to apply for the town and village renewal scheme? 17
What is the punishment for someone driving without a driving licence? 18
Are there plans to further reduce public transport fares? 19
How many people are eligible for payments under the Monther and Baby Homes scheme? 20

The question about healthcare assistants had only 11 faithful answers out of 20. This is unsurprising in retrospect, as even I had trouble understanding the original question and answer.
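Since at least one faithfulness score is missing, raw totals are not strictly comparable across questions. A share of faithful answers per question avoids that. The following is only a minimal sketch in plain Clojure, assuming each row is a map with `:question` and an integer `:metric-llm-faithfulness-score` (0 or 1, or `nil` when missing); `faithful-share-by-question` and `example-rows` are hypothetical names used for illustration, not part of the notebook.

```clojure
;; Hypothetical helper: share of faithful answers per question.
;; Assumes scores are 0 or 1, with nil meaning "no score recorded".
(defn faithful-share-by-question
  "Returns a vector of maps, one per question, sorted from the lowest
  to the highest share of faithful answers. Rows with a nil score are ignored."
  [rows]
  (->> rows
       (remove #(nil? (:metric-llm-faithfulness-score %)))
       (group-by :question)
       (map (fn [[question question-rows]]
              (let [scores   (map :metric-llm-faithfulness-score question-rows)
                    answered (count scores)
                    faithful (reduce + scores)]
                {:question question
                 :answered answered
                 :faithful faithful
                 :share    (double (/ faithful answered))})))
       (sort-by :share)
       vec))

;; Made-up example data, only to show the shape of the input.
(def example-rows
  [{:question "Q1" :metric-llm-faithfulness-score 1}
   {:question "Q1" :metric-llm-faithfulness-score 0}
   {:question "Q2" :metric-llm-faithfulness-score 1}
   {:question "Q2" :metric-llm-faithfulness-score nil}])

(faithful-share-by-question example-rows)
```

With real data, `(tc/rows (tc/dataset responses-eval-data) :as-maps)` should give rows in this shape, although I have not checked that against the notebook.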

Let’s look at the answers and the evaluator’s reasoning for the lowest-scoring question:

(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-faithfulness-score :metric-llm-faithfulness-explanation])
    (tc/drop-missing :metric-llm-faithfulness-score)
    (tc/select-rows #(and (= (:question %) "Will the government put in place Level 6 (QQI) courses for healthcare assistants?")
                          (= (:metric-llm-faithfulness-score %) 0)))
    (tc/select-columns [:model-ref :answer :metric-llm-faithfulness-explanation]))

_unnamed [9 3]:

:model-ref llava
:answer As a senior government official, I am responsible and trustworthy in providing accurate and reliable information to citizens. Regarding your question on whether the government will put in place QQI level 6 courses for healthcare assistants, it is important to note that decisions regarding the introduction of new educational programmes are made by relevant stakeholders, such as Higher Education Institutions (HEIs) and education authorities. The Department does not place a cap on the number of places in HEIs, which means the decision to introduce new courses rests with the institutions themselves based on their capacity and academic criteria. If you have any further questions or concerns about educational requirements for healthcare assistants, please do not hesitate to contact me.
:metric-llm-faithfulness-explanation {
“reasoning”: “The statement makes claims not found in the context—such as the speaker’s trustworthiness as a senior official and that decisions to introduce new QQI level 6 courses rest solely with HEIs and education authorities. The context only specifies existing Level 5 requirements, QQI’s role in revalidation, and that HEIs set their own intake numbers; it does not address who decides the introduction of new programmes or affirm the speaker’s personal credibility.”,
“score”: “No”
}

:model-ref gemini-2.0-flash-lite
:answer I am unable to answer this question with the information I have available.
:metric-llm-faithfulness-explanation {
“reasoning”: “The statement is a meta-comment about inability to answer, which is not a factual claim supported by the provided context. The context actually provides substantial information and does not indicate any inability to answer questions.”,
“score”: “No”
}

:model-ref gpt-3.5-turbo
:answer At this time, the current educational requirement for a Health Care Assistant as per HSE eligibility criteria is the relevant Health Skills Level 5 (QQI) qualification. The purpose of this qualification is to provide learners with the skills and knowledge necessary to work under supervision in healthcare settings.
While there are currently no Level 6 courses specifically designated for healthcare assistants, it is important to note that programme content is periodically updated to incorporate technical or legislative changes. The Government is committed to expanding healthcare education opportunities, and any future developments in the qualification requirements for healthcare assistants would be communicated through official channels and in alignment with educational needs in the sector.
If you are interested in furthering your healthcare education beyond Level 5, exploring higher education courses or professional development opportunities may be beneficial. Thank you for your inquiry.
:metric-llm-faithfulness-explanation {
“reasoning”: “While the statement correctly references the Level 5 (QQI) requirement and its purpose (working under supervision) and aligns with the fact that programme content is periodically updated, it introduces unsupported claims—specifically that there are no Level 6 courses for healthcare assistants and that future changes will be communicated through official channels in a particular way. These points are not stated or implied in the context.”,
“score”: “No”
}

:model-ref deepseek-r1
:answer Okay, I’m trying to figure out if the government will put in place Level 6 (QQI) courses for healthcare assistants based on the provided context. Let me read through the context again.
The context mentions that the HSE eligibility criteria require a Health Skills Level 5 (QQI) qualification. It talks about the Major Health Service Skills Award (HSSA), which is Level 5, and the ONMSD’s role in providing continuous professional development and clinical leadership through various educational programs up to Level 8.
Looking at the training for Health Care Assistants, it says they can support the nursing and midwifery function through the ONMSD Centres of Nursing and Midwifery Education with the achievement of a QQI Level 5 award. The total credit value is 120 ECTS across 8 modules.
For higher education, jobseekers can go up to an honours degree (QQI level 8) but not beyond that for masters or postgrad levels. They mention the Professional Masters in Education as Level 9, but it doesn’t support master’s level studies.
The context also talks about expanding third-level places in healthcare but notes that HEIs decide their own capacity without government capping them.
So, where does Level 6 come into play? The QQI system goes up to Level 8 for awards. There’s no mention of Level 6 specifically being introduced or planned. The highest level mentioned is Level 5 (HSSA) and higher degrees up to Level 9 in professional master’s but not beyond that.
The context doesn’t indicate any recent announcements or plans about introducing Level 6 courses for healthcare assistants. It mainly focuses on existing qualifications like Level 5, honours degrees, and postgraduate programs. There’s no information suggesting the government is moving towards a Level 6 qualification specifically for this role.
Therefore, based on what’s provided, there’s no evidence that the government has put in place or plans to introduce Level 6 courses for healthcare assistants.
Based on the context provided, the government does not currently have any Level 6 (QQI) courses designated for healthcare assistants. The information available refers to existing qualifications at Level 5 and higher degrees up to Level 8, as well as postgraduate programs like the Professional Master’s in Education at Level 9. There is no mention of Level 6 courses being introduced or planned specifically for this role.
Answer: I am unable to answer this question with the information I have available.
:metric-llm-faithfulness-explanation {
“reasoning”: “The context clearly states existing qualifications (Level 5, up to Level 8, and postgraduate Level 9) and makes no mention of any Level 6 courses, so one can conclude that no Level 6 courses are designated for healthcare assistants. Thus it is possible to answer based on the information provided, making the claim of being unable to answer unsupported.”,
“score”: “No”
}

:model-ref gpt-4o
:answer I am unable to answer this question with the information I have available.
:metric-llm-faithfulness-explanation {
“reasoning”: “The statement is a meta-comment about the assistant’s ability and is not supported or addressed by any factual information in the provided context.”,
“score”: “No”
}

:model-ref llama3.2
:answer The context provided doesn’t explicitly mention the possibility of introducing Level 6 (QQI) courses specifically for healthcare assistants. However, it does highlight the Office of Nursing and Midwifery Services Director’s focus on continuous professional development and clinical leadership at various levels, including level 5.
That being said, I can tell you that there have been discussions and proposals regarding expanding higher education opportunities in the healthcare sector, including the creation of Level 6 (QQI) courses. However, I couldn’t find any specific information or confirmation from the government on whether such courses are currently planned or being considered for introduction.
It’s also worth noting that the Department is prioritizing ensuring an expanded number of third-level places in healthcare programmes, which could potentially lead to more opportunities for healthcare assistants to pursue further education and training. However, the specifics of any proposed Level 6 (QQI) courses, if they were to be introduced, are not mentioned in this context.
I recommend keeping an eye on official government announcements or updates from relevant authorities, such as the Office of Nursing and Midwifery Services Director, for any developments on this matter.
:metric-llm-faithfulness-explanation {
“reasoning”: “The statement introduces information about discussions and proposals for Level 6 (QQI) courses that are not mentioned or supported by the provided context. While the context does note ONMSD programmes spanning levels 5 to 8 and the priority on expanding third‑level places, it does not reference any plans or discussions to introduce Level 6 courses specifically for healthcare assistants.”,
“score”: “No”
}

:model-ref granite3.2
:answer The context provided does not explicitly mention the introduction of Level 6 (QQI) courses specifically for Health Care Assistants. However, it does highlight that the Department prioritizes expanding third level places in healthcare programmes. These could potentially include Level 6 courses. For the most accurate information, I would recommend contacting the relevant Higher Education Institutions or the Department directly. They can provide updates on specific course offerings and any plans for introducing Level 6 (QQI) courses for Health Care Assistants.
:metric-llm-faithfulness-explanation {
“reasoning”: “The context does not mention the introduction of Level 6 (QQI) courses for Health Care Assistants and only states that the Department prioritizes expanding third‑level places. The suggestion that Level 6 courses could be included and the recommendation to contact institutions for updates go beyond what the context explicitly supports.”,
“score”: “No”
}

:model-ref gpt-4o-mini
:answer I am unable to answer this question with the information I have available. However, I can inform you that the current educational requirement for Health Care Assistants is the Level 5 Major Health Service Skills Award (QQI), and there is an ongoing focus on providing professional development and pathways for further education in healthcare. If you have further inquiries or require specific updates regarding potential Level 6 courses, I recommend reaching out to the relevant educational institutions or governing bodies for the most accurate information.
:metric-llm-faithfulness-explanation {
“reasoning”: “The statement begins by claiming an inability to answer, but the context clearly provides the educational requirement. While the requirement (Level 5 Major Health Service Skills Award) and focus on professional development and further education are supported, the speaker’s claim of being unable to answer is not. Therefore, the statement is not fully supported by the context.”,
“score”: “No”
}

:model-ref gemma3:1b
:answer That’s a very important and complex question, and it’s one we’re actively considering. As of today, November 2, 2023, the government has stated that Level 6 (QQI) courses for Healthcare Assistants are being considered for implementation.
However, it’s crucial to understand the timeline and the scope of this consideration. The government has outlined a phased approach:
* Phase 1 (2024-2025): Initial assessment and planning for Level 6 courses. This will involve a thorough review of existing training programs and identifying potential gaps.
* Phase 2 (2026-2027): Development of the first Level 6 courses, focusing on key skills and competencies required for a wider range of healthcare roles.
* Phase 3 (2028 onwards): Expansion and refinement of the Level 6 framework based on feedback and evolving industry needs.
Currently, there’s no definitive date for when these courses will be rolled out. The government is prioritizing a careful and considered approach to ensure these new qualifications are aligned with the needs of the healthcare sector.
I can provide you with more detailed information if you’d like, such as:
* Links to the official government website regarding the Level 6 initiative.
* A summary of the key priorities and goals for this phase of the program.
* Information on the specific skills and competencies that will be covered in these new courses.
:metric-llm-faithfulness-explanation {
“reasoning”: “The context makes no mention of the government considering or planning QQI Level 6 courses for Healthcare Assistants, nor does it outline any phased implementation timeline. Therefore, the statement is not supported by the provided context.”,
“score”: “No”
}

We can see a major error here in my evaluation prompt. In some cases the model answers “I am unable to answer this question with the information I have available”, which should be an acceptable answer in this context, since the generation prompt instructs the model to give this default response when it can’t answer from the context. The evaluator nevertheless marked these answers as unfaithful.

I went back and added an extra instruction to the prompt to try to account for these cases. Still, it’s an important lesson in thinking through how the prompts interact with each other.
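An alternative to changing the prompt would be to catch the default response before it ever reaches the evaluator. This is not what I did in the notebook; it is only a rough sketch of the idea, assuming the default wording is exactly the sentence seen in the answers above and that a pure refusal should count as faithful. The names `default-refusal`, `default-refusal?` and `evaluate-faithfulness`, and the `llm-judge` function argument, are all hypothetical.

```clojure
(require '[clojure.string :as str])

;; Assumed default response, copied from the model answers above.
(def default-refusal
  "I am unable to answer this question with the information I have available.")

(defn- normalise
  "Trims, lower-cases and collapses whitespace so minor formatting
  differences do not break the comparison."
  [s]
  (-> (str s)
      str/trim
      str/lower-case
      (str/replace #"\s+" " ")))

(defn default-refusal?
  "True when the answer consists of nothing but the default refusal."
  [answer]
  (= (normalise answer) (normalise default-refusal)))

(defn evaluate-faithfulness
  "Scores a pure refusal as faithful without calling the evaluator.
  Otherwise delegates to `llm-judge`, a hypothetical function of
  [answer context] that returns a map with :score and :explanation."
  [llm-judge answer context]
  (if (default-refusal? answer)
    {:score 1
     :explanation "Default refusal; accepted without calling the evaluator."}
    (llm-judge answer context)))
```

The exact match is deliberately strict. Answers that merely contain the refusal, such as the deepseek-r1 and gpt-4o-mini responses above, still go to the evaluator, because they make further claims that need checking.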

source: src/notebooks/rag_evaluation.clj