6 Generation Evaluation

(ns notebooks.rag-evaluation
  (:require [libpython-clj2.require :refer [require-python]]
            [libpython-clj2.python :refer [py..] :as py]
            [notebooks.preparation :refer [ds]]
            [selmer.parser :as templates]
            [scicloj.kindly.v4.kind :as kind]
            [notebooks.generation :as gen]
            [scicloj.tableplot.v1.plotly :as plotly]
            [notebooks.llm-api :as llm]
            [clojure.edn :as edn]
            [clojure.set]
            [clojure.string :as str]
            [tablecloth.api :as tc]
            [clojure.java.io :as io]
            [notebooks.vdb-evaluation :as vdb])
  (:import
   (dev.langchain4j.data.segment TextSegment)
   (dev.langchain4j.store.embedding CosineSimilarity)
   (dev.langchain4j.model.embedding.onnx.allminilml6v2 AllMiniLmL6V2EmbeddingModel)))

This section relies heavily on the metrics and approach of the continuous-eval (Python) project as a starting point for thinking about how to evaluate the RAG.

That repository also has some great links to articles explaining some of the concepts in more detail.

As the creators of the project write, there are several kinds of questions you might want to consider when evaluating answer generation:

Do I have to use GPT-4 or would a smaller model work too?
Should I fine-tune an LLM for my RAG application?
Which prompts minimize hallucination the most?
How sensitive are answers to different prompts?
Is the LLM already good enough if I provide the right contexts, and should I focus on improving Retrieval instead?

(source)

In this exercise, I will only look at one of these questions: which LLM works best with the data that I have and the prompt/retrieval framework we have already set up.

We will focus on three categories of metrics:

Deterministic
Semantic
LLM-based

Deterministic metrics are similar to how we measured the retrieval performance; they simply measure the token overlap between answers generated by the LLM and some kind of reference/ground-truth answers.

The semantic metric is similar to the method of retrieving information from the vector database; it checks how similar two pieces of text are based on vector embeddings.

LLM-based metrics utilise another LLM to assign a score to the output. For example, to determine ‘answer-correctness’, we will ask an LLM to assign a score from 1 to 5 to a generated answer, based on reference answers that we provide ourselves.

6.1 Evaluation Dataset

Before going into the metrics further, we will first create a testing dataset that contains some questions and ground truth answers. I’ve used 10 fairly random questions based on some of the material in the starting dataset of questions and answers.

Ideally, we would use a much larger and more thoughtfully curated evaluation dataset, perhaps with input from domain experts across different question areas. The goal here, however, is simply to test out some evaluation workflows in Clojure, so a basic evaluation dataset will have to do for now.

Below, we just load that dataset. The ‘questions.edn’ file is set up as a Clojure map, where the questions are the keys and the ground truth answers are the values.

(def evaluation-dataset
  (let [data         (edn/read-string (slurp "data/evaluation_questions/questions.edn"))
        questions    (keys data)
        ground-truth (vals data)]
    (mapv (fn [question truth] (-> {}
                                   (assoc :question question)
                                   (assoc :ground-truth truth)))
          questions
          ground-truth)))
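For reference, here is a minimal sketch of the file layout this code appears to expect, assuming ‘questions.edn’ holds a single map from question strings to vectors of ground-truth sentences. The one entry below is copied from the evaluation table. `sample-edn` and `sample-dataset` are illustrative names, not part of the notebook. Destructuring each map entry is an alternative to pairing `keys` and `vals` by hand:

```clojure
(require '[clojure.edn :as edn])

;; Assumed shape of questions.edn: {question-string [ground-truth-sentence ...]}
(def sample-edn
  "{\"Are there plans to further reduce public transport fares?\"
    [\"There are no plans to further reduce public transport fares at this time\"]}")

;; Each map entry destructures into a [question ground-truth] pair.
(def sample-dataset
  (mapv (fn [[question truth]]
          {:question question :ground-truth truth})
        (edn/read-string sample-edn)))
```

Each element of `sample-dataset` has the same `:question` / `:ground-truth` shape as the rows of `evaluation-dataset`.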

(kind/table evaluation-dataset)

question	ground-truth
How many people are eligible for payments under the Monther and Baby Homes scheme?	`["It is anticipated that there are 34,000 people eligible for benefits under the Mother and Baby Institutions Payment Scheme" "A comprehensive body of work has been underway to develop the substantial administrative framework required to deliver a scheme of this size."]`
What is the salary scale for an archaeologist in the local government sector?	`["The salary scale for an archaeologist in the local government sector ranges from €55,519 to €77,176."]`
How many pupils in total benefit from the School Transport Scheme (STS)? Can you also break down the numbers in terms of students with special educational needs and students who are from Ukraine?	`["There are over 161,600 children total that benefit from the School Transport Scheme" "135,000 pupils travel on primary and post primary services." "19,800 pupils with special educational needs benefit from the scheme." "6,800 pupils who have arrived to Ireland from Ukraine avail of the scheme."]`
Will the government put in place Level 6 (QQI) courses for healthcare assistants?	["The Office of Nursing and Midwifery Services Director (ONMSD) within its scope of governance designs, develops and delivers continuous professional development and clinical leadership, with programmes of education ranging from level 5 to level 8 on the National Framework of Qualifications." "The educational requirement for a Health Care Assistant is the relevant Health Skills Level 5 (QQ1) qualification" "Training is provided to Health Care Assistants who support the nursing and midwifery function through the ONMSD Centres of Nursing and Midwifery Education with the achievement of Ireland QQI Level 5 Major award in Health Service Skills." "Ensuring an expanded number of third level places in healthcare programmes is a priority for my Department."]
How many households were in receipt of HAP payments in 2023?	`["At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support."]`
What is the government doing with regard to the National Drugs Strategy?	`["The National Drugs Strategy 2017-2025 is a health-led approach to drug and alcohol use in Ireland, and it represents a whole-of-Government response to reduce demand and associated harms." "The Department of Revenue has primary responsibility for the prevention, detection, interception and seizure of controlled drugs."]`
What is the government doing to help improve GP services?	["Increased annual investment of over €211 million was provided for under the 2019 GP Agreement, providing for increased GP fees, supports and new services." "There is an increase in demand for GP services due to demographics and an aging population" "The government is conducting a review of services, that includes a focus on GP services in certain rural and urban deprived areas." "The government and the HSE have commenced a Strategic Review of General Practice, examining a broad range of issues."]
What is the punishment for someone driving without a driving licence?	`["A person driving without a licence may be fined up to €5000, or be imprisoned for up to 6 months, or both."]`
Are there plans to further reduce public transport fares?	`["There are no plans to further reduce public transport fares at this time"]`
How is the government encouraging local authorities to apply for the town and village renewal scheme?	["In July of 2023, the government announced funding of €15 million for this year's Town and Village Renewal Scheme" "The government's goal is to promote rural and community development and to support vibrant, inclusive and sustainable communities throughout Ireland." "The 2023 main Town and Village Renewal scheme also has a renewed focus on town centre regeneration and enhancing economic and social vibrancy in line with the Town Centre First policy." "Capital schemes including the Town and Village Renewal Scheme, LEADER, the Rural Regeneration and Development Fund, the Local Improvement Scheme, and the Community Centre Investment Fund are delivering projects throughout the country."]

6.2 Generate LLM Answers

Next, we will write a helper function that generates responses with a given LLM and saves them to disk. These are the responses that we will evaluate later. As you can see below, I tested 20 different models. Some were small models running locally (around 8B parameters max), and some were cloud-based models from Google, OpenAI and Anthropic.

(defn ask-llm-save-responses!
  "Generates a RAG answer for each question with `model` and saves them as EDN."
  [model questions]
  (let [responses (mapv (fn [question]
                          (gen/get-rag-answer
                           (assoc question :model-ref model)
                           :db-docs))
                        questions)
        f-name    (str "data/responses/" model "_responses.edn")]
    ;; pr-str keeps the saved text readable by edn/read-string
    (spit f-name (pr-str responses))))

(comment
  (ask-llm-save-responses! "gemini-2.0-flash-lite" evaluation-dataset)
  (ask-llm-save-responses! "llama3.1" evaluation-dataset)
  (ask-llm-save-responses! "gpt-3.5-turbo" evaluation-dataset)
  (ask-llm-save-responses! "gemma3:1b" evaluation-dataset)
  (ask-llm-save-responses! "gpt-4o-mini" evaluation-dataset)
  (ask-llm-save-responses! "gpt-4o" evaluation-dataset)
  (ask-llm-save-responses! "o4-mini-2025-04-16" evaluation-dataset)
  (ask-llm-save-responses! "o3-mini" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.0-flash" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-7-sonnet-20250219" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-5-haiku-20241022" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-haiku-20240307" evaluation-dataset)
  (ask-llm-save-responses! "llama3.2" evaluation-dataset)
  (ask-llm-save-responses! "mistral" evaluation-dataset)
  (ask-llm-save-responses! "llava" evaluation-dataset)
  (ask-llm-save-responses! "deepseek-r1" evaluation-dataset)
  (ask-llm-save-responses! "gemma3:4b" evaluation-dataset)
  (ask-llm-save-responses! "granite3.2" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.5-pro-preview-03-25" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.5-flash-preview-04-17" evaluation-dataset))

(defonce responses-ds
  (let [responses-dir "data/responses"
        responses (->> responses-dir
                       (io/file)
                       file-seq
                       rest
                       (map (comp edn/read-string slurp))
                       (reduce into))]
    (tc/dataset responses)))

(tc/row-count responses-ds)

Each model answered the 10 questions from the evaluation dataset, so that’s 200 responses overall.
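A quick sanity check on that total could look like the sketch below, which counts responses per model and reports any model that did not answer the expected number of questions. `responses-per-model` and `incomplete-models` are hypothetical helpers, not part of the notebook, and they assume each response is a map with a `:model-ref` key:

```clojure
(defn responses-per-model
  "Returns a map of model-ref -> number of responses."
  [rows]
  (frequencies (map :model-ref rows)))

(defn incomplete-models
  "Returns only the models whose response count differs from `expected`."
  [rows expected]
  (->> (responses-per-model rows)
       (remove (fn [[_ n]] (= n expected)))
       (into {})))

;; Tiny illustrative input (not real data).
(def example-rows
  [{:model-ref "model-a"} {:model-ref "model-a"} {:model-ref "model-b"}])
```

On the real data this would be called as `(incomplete-models (tc/rows responses-ds :as-maps) 10)`, where an empty map means every model answered all 10 questions.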

6.3 Continuous Eval Metrics Functions

Below, I am just creating a wrapper for the continuous-eval deterministic metrics, and re-writing the LLM metrics in Clojure, using the prompt templates that are provided in the continuous-eval repo.

For demonstrating how the metrics work, we will use a couple of the generated responses as samples.

For the question “How many households were in receipt of HAP payments in 2023?”, the data available states that 57,617 households were in receipt of payments at the end of Q3 2023. In other words, the full data for 2023 was not available at that time. Most of the models picked up that detail, but one of the lower-powered ones, gemma3:1b (a 1-billion-parameter model), did not: the response shown below presents a figure for 2023 as a whole, and that figure (114,974) does not match the ground truth either.

Also, the question “Are there plans to further reduce public transport fares?” should be a simple ‘no’, based on the available data, but the gemma3:1b model also gets this one wrong.

(def sample-gen-responses
  (-> responses-ds
      (tc/select-rows #(and (or (= (:model-ref %) "llama3.1")
                                (= (:model-ref %) "gemma3:1b"))
                            (or (re-find #"receipt of HAP payments" (:question %))
                                (re-find #"transport fares" (:question %)))))))

(-> sample-gen-responses
    (tc/select-columns [:model-ref :question :answer])
    (kind/table))

model-ref	question	answer
llama3.1	Are there plans to further reduce public transport fares?	No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.
llama3.1	How many households were in receipt of HAP payments in 2023?	At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.
gemma3:1b	Are there plans to further reduce public transport fares?	Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.
gemma3:1b	How many households were in receipt of HAP payments in 2023?	In 2023, there were 114,974 households in receipt of the HAP scheme.

6.3.1 Deterministic Metrics

(require-python '[continuous_eval.metrics.generation.text.deterministic :as det])

:ok

(defn add-deterministic-metrics [{:keys [answer retrieved-context ground-truth] :as rag-data}]
  (let [faithfulness-spec   {:answer            answer
                             :retrieved_context retrieved-context}
        correctness-spec    {:answer               answer
                             :ground_truth_answers (if (seq ground-truth) ground-truth retrieved-context)}
        faithfulness-scores (into {} (py.. (det/DeterministicFaithfulness) (**compute faithfulness-spec)))
        correctness-scores  (into {} (py.. (det/DeterministicAnswerCorrectness) (**compute correctness-spec)))
        reading-scores      (into {} (py.. (det/FleschKincaidReadability) (compute answer)))]
    (->
     (merge
      faithfulness-scores
      correctness-scores
      reading-scores
      rag-data)
     (clojure.set/rename-keys
      {"flesch_reading_ease"         :flesch-reading-ease
       "flesch_kincaid_grade_level"  :flesch-kincaid-grade-level
       "rouge_l_recall"              :rouge-l-recall
       "rouge_faithfulness"          :rouge-faithfulness
       "rouge_l_precision"           :rouge-l-precision
       "rouge_l_f1"                  :rouge-l-f1
       "rouge_p_by_sentence"         :rouge-p-by-sentence
       "bleu_score_by_sentence"      :bleu-score-by-sentence
       "bleu_faithfulness"           :bleu-faithfulness
       "bleu_score"                  :bleu-score
       "token_overlap_p_by_sentence" :token-overlap-p-by-sentence
       "token_overlap_f1"            :token-overlap-f1
       "token_overlap_precision"     :token-overlap-precision
       "token_overlap_recall"        :token-overlap-recall
       "token_overlap_faithfulness"  :token-overlap-faithfulness}))))

Example scores for the sample responses:

(-> (mapv add-deterministic-metrics (tc/rows sample-gen-responses :as-maps))
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :rouge-l-f1 :token-overlap-f1 :bleu-score])
    (kind/table))

model-ref	question	answer	rouge-l-f1	token-overlap-f1	bleu-score
llama3.1	Are there plans to further reduce public transport fares?	No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.	0.3582089520962353	0.2926829268292683	0.18303581288409837
llama3.1	How many households were in receipt of HAP payments in 2023?	At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.	0.5925925884224966	0.5263157894736842	0.3523584062391067
gemma3:1b	Are there plans to further reduce public transport fares?	Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.	0.25714285411836735	0.20833333333333331	0.14325120993710644
gemma3:1b	How many households were in receipt of HAP payments in 2023?	In 2023, there were 114,974 households in receipt of the HAP scheme.	0.5714285665306124	0.5	0.5050058400713138

The ‘F1’ scores are the combination of ‘precision’ and ‘recall’ metrics. As we saw in previous sections, precision is how much of the generated answer is reflected in the ground truth (i.e., what % of the generated answer is not ‘superfluous’), and recall is how much of the ground truth is reflected in the generated answer. The F1 score is the harmonic mean of both these scores, with a score closer to 1 being better. The ‘BLEU’ score is also better when it is closer to 1.
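To make the arithmetic concrete, here is a minimal sketch of token-overlap precision, recall and F1 in plain Clojure. It assumes a naive lower-case, word-character tokenizer, which is not necessarily what continuous-eval uses, so the numbers it produces will not match the library exactly. `tokens`, `harmonic-mean` and `token-overlap-scores` are illustrative names:

```clojure
(require '[clojure.string :as str])

(defn tokens
  "Naive tokenizer (an assumption): lower-case, then split on non-word characters."
  [text]
  (re-seq #"\w+" (str/lower-case text)))

(defn harmonic-mean
  "Harmonic mean of two non-negative numbers; 0.0 when both are zero."
  [p r]
  (if (zero? (+ p r))
    0.0
    (/ (* 2.0 p r) (+ p r))))

(defn token-overlap-scores
  "Precision: share of answer tokens found in the reference.
   Recall: share of reference tokens found in the answer."
  [answer reference]
  (let [a      (tokens answer)
        r      (tokens reference)
        ref-fq (frequencies r)
        ;; Count shared tokens, respecting how often each one occurs.
        common (reduce + (map (fn [[tok n]] (min n (get ref-fq tok 0)))
                              (frequencies a)))
        p      (if (seq a) (/ common (double (count a))) 0.0)
        rc     (if (seq r) (/ common (double (count r))) 0.0)]
    {:precision p :recall rc :f1 (harmonic-mean p rc)}))

(token-overlap-scores "There are no plans to cut fares" "There are no plans")
```

A long answer that contains the whole reference gets perfect recall but low precision, which drags F1 down. The same formula reproduces the scores recorded for the llama3.1 fares answer: `(harmonic-mean 0.17142857142857143 1.0)` gives approximately 0.2927, matching its token-overlap F1.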

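In this case, even though these metrics don’t check for semantic meaning or logic, the ROUGE-L and token-overlap F1 scores do rank the llama3.1 responses slightly above the gemma3:1b responses. BLEU is less reliable here: the factually wrong gemma3:1b answer to the HAP question gets a higher BLEU score (0.51) than the correct llama3.1 answer (0.35).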

6.3.2 Semantic Similarity

We’ll also check, very roughly, the semantic similarity (based on cosine similarity) between the generated responses and the ground truth.

(defn calculate-cosine-similarity [text-a vec-text-b]
  (let [embedding-model (AllMiniLmL6V2EmbeddingModel/new)
        embedding-fn (fn [text]
                       (->> (TextSegment/from text)
                            (. embedding-model embed)
                            (.content)))]
    (CosineSimilarity/between (embedding-fn text-a) (embedding-fn (str/join " " vec-text-b)))))

(defn add-semantic-similarity [{:keys [answer ground-truth] :as rag-data}]
  (let [similarity-score (calculate-cosine-similarity answer ground-truth)]
    (-> rag-data
        (assoc :cosine-similarity similarity-score))))

(add-semantic-similarity {:answer "Berlin in the capital of France."
                          :ground-truth ["Paris is the capital of France."]})

{:answer "Berlin in the capital of France.",
 :ground-truth ["Paris is the capital of France."],
 :cosine-similarity 0.5875568139261295}

(add-semantic-similarity {:answer "The capital of France is Paris."
                          :ground-truth ["Paris is the capital of France."]})

{:answer "The capital of France is Paris.",
 :ground-truth ["Paris is the capital of France."],
 :cosine-similarity 0.989356549878067}

(add-semantic-similarity {:answer "Paris is the capital of France."
                          :ground-truth ["The capital of France is Paris."
                                         "The Mona Lisa is in Paris."]})

{:answer "Paris is the capital of France.",
 :ground-truth
 ["The capital of France is Paris." "The Mona Lisa is in Paris."],
 :cosine-similarity 0.7456770568295227}

(-> (mapv add-semantic-similarity (tc/rows sample-gen-responses :as-maps))
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :cosine-similarity])
    (kind/table))

model-ref	question	answer	cosine-similarity
llama3.1	Are there plans to further reduce public transport fares?	No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.	0.6855268462688894
llama3.1	How many households were in receipt of HAP payments in 2023?	At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.	0.9459118435287973
gemma3:1b	Are there plans to further reduce public transport fares?	Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.	0.5193210040131704
gemma3:1b	How many households were in receipt of HAP payments in 2023?	In 2023, there were 114,974 households in receipt of the HAP scheme.	0.7805409346707153

We can see a limitation with this metric here: even though the last gemma3:1b answer is factually incorrect, it is still quite ‘semantically similar’ to the ground truth answer.
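For readers who want to see the calculation behind the score, the sketch below implements the textbook cosine similarity formula (dot product divided by the product of the vector lengths) on plain Clojure vectors. It is an illustration only and makes no claim about how langchain4j's `CosineSimilarity` is implemented internally. `dot`, `norm` and `cosine-similarity` are illustrative names:

```clojure
(defn dot
  "Dot product of two equal-length numeric sequences."
  [a b]
  (reduce + (map * a b)))

(defn norm
  "Euclidean length of a vector."
  [a]
  (Math/sqrt (dot a a)))

(defn cosine-similarity
  "Cosine of the angle between a and b; 0.0 if either vector has zero length."
  [a b]
  (let [denom (* (norm a) (norm b))]
    (if (zero? denom)
      0.0
      (/ (dot a b) denom))))

(cosine-similarity [1.0 1.0] [1.0 0.0])
```

The score depends only on the direction of the two embedding vectors, so two sentences about the same topic with nearly the same wording can score highly even when one of them states the wrong number.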

6.3.3 LLM Metrics

As I mentioned above, I’m using the same approach/prompts for the LLM-based metrics as is used in the continuous-eval project.

For example, the ‘faithfulness’ prompt can be seen below:

(-> "prompts/faithfulness_sys.txt"
    slurp
    (str/replace #"\n" "\n\n")
    (gen/quoted-response)
    kind/md)

You are tasked to evaluate whether the statement is fully supported by the context. With a reasoning following the response, respond with either Yes or No.

If the model has correctly identified that it can’t answer based on the context, then also respond with Yes.

– EXAMPLES –

Example 1:

Context: The Eiffel Tower, a wrought-iron lattice tower on the Champ de Mars in Paris, France, is one of the most famous landmarks in the world. It was designed by Gustave Eiffel and completed in 1889.

Statement: The Eiffel Tower can be found in the center of London, near the Thames River.

Response:

{

“reasoning”: “The statement contradicts with the context, which states that Eiffel Tower is in Paris, as opposed to the center of London.”,

“score”: “No”

}

Example 2:

Context: Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that can later be released to fuel the organisms’ activities. This chemical energy is stored in carbohydrate molecules, such as sugars, which are synthesized from carbon dioxide and water.

Statement: Photosynthesis in plants primarily involves the conversion of light energy into chemical energy stored in forms such as sugar.

Response:

{

“reasoning”: “The statement is supported by the context, which states that photosynthesis converts light energy into chemical energy and that the chemical energy is stored in carbohydrate molecules, such as sugars.”,

“score”: “Yes”

}

(defn add-llm-metric-correctness-score [{:keys [question answer ground-truth] :as rag-data} llm-model]
  (let [system-prompt (slurp "prompts/ans_correctness_sys.txt")
        user-prompt   (-> "prompts/ans_correctness_user.txt"
                          slurp
                          (templates/render {:question     question
                                             :answer       answer
                                             :ground-truth (if (seq ground-truth)
                                                             ground-truth
                                                             (:retrieved-context rag-data))}))
        response      (llm/ask-llm
                       {:model-ref     llm-model
                        :question      user-prompt
                        :system-prompt system-prompt})
        ;; Take the first digit from 1 to 5 that follows "score" within four characters.
        score (re-find #"(?<=[Ss]core.{1,4})[1-5]" response)
        score (when score (parse-long score))]
    (-> rag-data
        (assoc :metric-llm-correctness-explanation response)
        (assoc :metric-llm-correctness-score score))))

(defn add-llm-metric-faithfulness-score [{:keys [answer retrieved-context] :as rag-data} llm-model]
  (let [system-prompt  (slurp "prompts/faithfulness_sys.txt")
        ret-ctx-joined (str/join "\n" retrieved-context)
        user-prompt    (-> "prompts/faithfulness_user.txt"
                           slurp
                           (templates/render {:answer                   answer
                                              :retrieved-context-joined ret-ctx-joined}))
        response       (llm/ask-llm
                        {:model-ref     llm-model
                         :question      user-prompt
                         :system-prompt system-prompt})
        ;; Match the whole verdict word that follows "score" within four characters.
        score (re-find #"(?<=score.{1,4})(?:yes|no)" (str/lower-case response))
        score (when score (if (= score "yes") 1 0))]
    (-> rag-data
        (assoc :metric-llm-faithfulness-explanation response)
        (assoc :metric-llm-faithfulness-score score))))

(defn add-llm-metric-relevance-score [{:keys [answer question] :as rag-data} llm-model]
  (let [system-prompt  (slurp "prompts/ans_relevance_sys.txt")
        user-prompt    (-> "prompts/ans_relevance_user.txt"
                           slurp
                           (templates/render {:answer answer
                                              :question question}))
        response       (llm/ask-llm
                        {:model-ref     llm-model
                         :question      user-prompt
                         :system-prompt system-prompt})
        ;; Take the first digit from 1 to 3 that follows "score" within four characters.
        score (re-find #"(?<=[Ss]core.{1,4})[1-3]" response)
        score (when score (parse-long score))]
    (-> rag-data
        (assoc :metric-llm-relevance-explanation response)
        (assoc :metric-llm-relevance-score score))))

(defn add-llm-metrics [rag-data model]
  (-> rag-data
      (add-llm-metric-correctness-score model)
      (add-llm-metric-faithfulness-score model)
      (add-llm-metric-relevance-score model)
      (assoc :evaluator-model model)))
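The evaluator responses come back as JSON-like text with a "score" key, so score extraction can also be anchored on that key. The sketch below is one hypothetical way to do it, not the notebook's method. `extract-score`, `numeric-score` and `yes-no-score` are illustrative names, and the approach assumes the evaluator follows the response format given in the prompt examples:

```clojure
(require '[clojure.string :as str])

(defn extract-score
  "Returns the lower-cased value following a \"score\" key, or nil if none is found."
  [response]
  (some-> (re-find #"(?i)\"score\"\s*:\s*\"?(\w+)" response)
          second
          str/lower-case))

(defn numeric-score
  "Parses a numeric score such as 1-5; nil when the value is not a number."
  [response]
  (some-> (extract-score response) parse-long))

(defn yes-no-score
  "Maps a Yes/No verdict to 1/0; nil for anything else."
  [response]
  (case (extract-score response)
    "yes" 1
    "no"  0
    nil))

(numeric-score "{\n  \"reasoning\": \"...\",\n  \"score\": 5\n}")
(yes-no-score "{\n  \"reasoning\": \"...\",\n  \"score\": \"Yes\"\n}")
```

Returning nil for an unparseable response, rather than a default score, makes it easy to spot and re-run evaluations where the evaluator did not follow the expected format.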

Finally, let’s wrap all three metric types (deterministic, semantic, and LLM-based) into a single function.

(defn add-all-generation-evaluation-metrics [responses evaluation-model]
  (mapv (fn [resp]
          (-> resp
              add-deterministic-metrics
              add-semantic-similarity
              (add-llm-metrics evaluation-model)))
        responses))

Now, let’s use these metrics to evaluate the sample questions and answers we generated earlier.

(comment
  (let [eval-model "gpt-4o"
        output-fname "data/evaluation_example/example.edn"
        sample-with-metrics (add-all-generation-evaluation-metrics
                             (tc/rows sample-gen-responses :as-maps)
                             eval-model)]
    (spit output-fname sample-with-metrics)))

(def sample-gen-responses-metrics (edn/read-string (slurp "data/evaluation_example/example.edn")))

(first sample-gen-responses-metrics)

{:metric-llm-faithfulness-explanation
 "{\n  \"reasoning\": \"The statement is fully supported by the context. It confirms that there are no current plans to further reduce public transport fares, stating that the 20% average PSO fare initiative and the 90-minute fare will continue. It also mentions the recently published Fares Determination, which aims to provide more consistent and equitable fares, aligning with the information in the context.\",\n  \"score\": \"Yes\"\n}",
 :rouge-l-recall 0.9230769230769231,
 :flesch-kincaid-grade-level 14.486250000000002,
 :flesch-reading-ease 29.200416666666683,
 :rouge-faithfulness 1.0,
 :retrieved-context
 [" The funding allocated by Government also supports the fare initiatives we’ve introduced since late 2021 such as –  - Reducing average fares on all PSO services by 20%, helping people with cost-of-living pressures; - Introducing, and most recently expanding, the Young Adult Card, reducing fares by a further 50% for young people just as they’re making choices about how they travel independently; and - Introducing a new 90-minute fare to enable easy interchange between different types of public transport. These initiatives have made public transport more affordable for more people and have undoubtably contributed toward the strong passenger growth we’re seeing across the network. There are no plans to amend or extend the Student Leap Card."
  "On rail we’re also supporting improvements across the network on both commuter and inter-city services.  We have also continued the 20% average PSO fare initiative and the 90-minute fare, all of which are available to those persons who have not yet reached the State pension age. However, there are no plans to further reduce public transport fares at this time."
  "€613.463m has been secured for the provision of Public Service Obligation (PSO) and Local Link services. This package includes funding for the continuation of the 20% fare reduction on PSO services, the extension of the Young Adult Card on both PSO and commercial bus services to include 24-and-25-year-olds, and the continuation of the 90-minute fare until the end of 2024. Further, the recently published Fares Determination will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin."
  "The Government is strongly committed to providing all citizens with reliable and realistic sustainable mobility options, and public transport plays a key role in the delivery of this goal.  To support this objective, Government has provided increased levels of funding for public transport services in recent years. Under Budget 2024, a funding package of c."
  "The report also found that any increase in public transport demand due to free fares would largely be achieved by reductions in walking (7%) and cycling (13%), resulting in just a 1% reduction in car travel. My Department’s funding for public transport services is therefore focussed on existing fare initiatives and service improvements such as BusConnects in the five cities, expanding town services in the larger towns, and rolling out Connecting Ireland in rural Ireland. These improvements are delivering real benefits for citizens across the country."],
 :metric-llm-faithfulness-score 1,
 :metric-llm-correctness-score 5,
 :metric-llm-correctness-explanation
 "{\n  \"reasoning\": \"The generated answer is correct and complete. It states that there are no current plans to further reduce public transport fares, which matches the ground truth answer. Additionally, it provides extra context about existing fare initiatives and a recently published Fares Determination, which does not contradict the main point and adds useful information.\",\n  \"score\": 5\n}",
 :metric-llm-relevance-score 3,
 :token-overlap-f1 0.2926829268292683,
 :ground-truth
 ["There are no plans to further reduce public transport fares at this time"],
 :question "Are there plans to further reduce public transport fares?",
 :token-overlap-faithfulness 1.0,
 :metric-llm-relevance-explanation
 "{\n  \"reasoning\": \"The answer directly responds to the question by stating that there are currently no plans to further reduce public transport fares. It provides additional context about the existing fare initiatives and mentions the recently published Fares Determination aiming for consistent and equitable fares. This information is relevant and fully addresses the question.\",\n  \"score\": 3\n}",
 :bleu-score-by-sentence
 [6.682549838299777E-14 1.7938417188868533E-8 6.289332584320069E-8],
 :model-ref "llama3.1",
 :rouge-p-by-sentence
 [0.8666666666666667 0.5833333333333334 0.7727272727272727],
 :bleu-score 0.18303581288409837,
 :token-overlap-recall 1.0,
 :evaluator-model "gpt-4o",
 :rouge-l-precision 0.2222222222222222,
 :token-overlap-precision 0.17142857142857143,
 :system-prompt
 "I want you to act as a responsible and trustworthy senior government official.\nPlease provide an answer to a citizen's question, using only the context provided.\nAnswer as if you are talking directly to the citizen and be neutral and formal as possible.\nIf you can't find a specific detail from the question, please acknowledge this and provide any\nother helpful information that may be related to the question.\nIf you can't find sufficient information in the context to answer the question at all,\nthen reply with \"I am unable to answer this question with the information I have available.\"\n\n CONTEXT:  The funding allocated by Government also supports the fare initiatives we’ve introduced since late 2021 such as –  - Reducing average fares on all PSO services by 20%, helping people with cost-of-living pressures; - Introducing, and most recently expanding, the Young Adult Card, reducing fares by a further 50% for young people just as they’re making choices about how they travel independently; and - Introducing a new 90-minute fare to enable easy interchange between different types of public transport. These initiatives have made public transport more affordable for more people and have undoubtably contributed toward the strong passenger growth we’re seeing across the network. There are no plans to amend or extend the Student Leap Card.\n\nOn rail we’re also supporting improvements across the network on both commuter and inter-city services.  We have also continued the 20% average PSO fare initiative and the 90-minute fare, all of which are available to those persons who have not yet reached the State pension age. However, there are no plans to further reduce public transport fares at this time.\n\n€613.463m has been secured for the provision of Public Service Obligation (PSO) and Local Link services. 
This package includes funding for the continuation of the 20% fare reduction on PSO services, the extension of the Young Adult Card on both PSO and commercial bus services to include 24-and-25-year-olds, and the continuation of the 90-minute fare until the end of 2024. Further, the recently published Fares Determination will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.\n\nThe Government is strongly committed to providing all citizens with reliable and realistic sustainable mobility options, and public transport plays a key role in the delivery of this goal.  To support this objective, Government has provided increased levels of funding for public transport services in recent years. Under Budget 2024, a funding package of c.\n\nThe report also found that any increase in public transport demand due to free fares would largely be achieved by reductions in walking (7%) and cycling (13%), resulting in just a 1% reduction in car travel. My Department’s funding for public transport services is therefore focussed on existing fare initiatives and service improvements such as BusConnects in the five cities, expanding town services in the larger towns, and rolling out Connecting Ireland in rural Ireland. These improvements are delivering real benefits for citizens across the country.",
 :token-overlap-p-by-sentence
 [0.8571428571428571 0.6923076923076923 0.7333333333333333],
 :answer
 "No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.",
 :rouge-l-f1 0.3582089520962353,
 :bleu-faithfulness 2.69439366191892E-8}

Example LLM Faithfulness evaluation (score can be ‘1 - faithful’ or ‘0 - not faithful’):

(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-faithfulness-score :metric-llm-faithfulness-explanation])
    (kind/table))

model-ref	question	answer	metric-llm-faithfulness-score	metric-llm-faithfulness-explanation
llama3.1	Are there plans to further reduce public transport fares?	No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.	1	{ "reasoning": "The statement is fully supported by the context. It confirms that there are no current plans to further reduce public transport fares, stating that the 20% average PSO fare initiative and the 90-minute fare will continue. It also mentions the recently published Fares Determination, which aims to provide more consistent and equitable fares, aligning with the information in the context.", "score": "Yes" }
llama3.1	How many households were in receipt of HAP payments in 2023?	At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.	1	{ "reasoning": "The statement is supported by the context, which provides the exact data point that at the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. Additionally, the context confirms that the latest available data covers only until the end of Q3 2023, aligning with the statement's note about the data coverage limitation.", "score": "Yes" }
gemma3:1b	Are there plans to further reduce public transport fares?	Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.	0	{ "reasoning": "The statement inaccurately interprets the context by suggesting there are plans to further reduce public transport fares. The context clearly states, 'there are no plans to further reduce public transport fares at this time.' The mention of 'recently published Fares Determination' refers to achieving more consistent and equitable fares, not a further reduction in fares. Additionally, it is correct that there are no plans to amend or extend the Student Leap Card, but this does not contradict the point about fare reduction plans.", "score": "No" }
gemma3:1b	How many households were in receipt of HAP payments in 2023?	In 2023, there were 114,974 households in receipt of the HAP scheme.	0	{ "reasoning": "The statement claims there were 114,974 households in receipt of the HAP scheme in 2023, but the context states that 114,974 HAP tenancies had been set-up since the scheme commenced. However, there were only 57,617 households actively in receipt of HAP support at the end of Q3 2023. Therefore, the statement confuses the total tenancies set up since inception with the actual number of active recipients in 2023.", "score": "No" }

As we can see, the evaluation model correctly identified the errors in the gemma3:1b answers.

Example LLM Correctness evaluation (range between 1 and 5):

(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-correctness-score :metric-llm-correctness-explanation])
    (kind/table))

model-ref	question	answer	metric-llm-correctness-score	metric-llm-correctness-explanation
llama3.1	Are there plans to further reduce public transport fares?	No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.	5	{ "reasoning": "The generated answer is correct and complete. It states that there are no current plans to further reduce public transport fares, which matches the ground truth answer. Additionally, it provides extra context about existing fare initiatives and a recently published Fares Determination, which does not contradict the main point and adds useful information.", "score": 5 }
llama3.1	How many households were in receipt of HAP payments in 2023?	At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.	5	{ "reasoning": "The generated answer accurately provides the number of households in receipt of HAP payments at the end of Q3 2023 by stating 'nearly 57,617 households.' It also adds a note that this data is only up to the end of Q3 2023 and may not reflect the full year 2023, adding further context to the data. This additional context is a correct and informative expansion upon the ground truth answer.", "score": 5 }
gemma3:1b	Are there plans to further reduce public transport fares?	Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.	2	{ "reasoning": "The generated answer is incorrect because it mistakenly states that there are plans to reduce public transport fares, contrary to the ground truth answer which indicates that there are no plans for further reduction. The generated answer also provides additional information about fare consistency and equity, but these do not address the main question directly regarding reduction plans.", "score": 2 }
gemma3:1b	How many households were in receipt of HAP payments in 2023?	In 2023, there were 114,974 households in receipt of the HAP scheme.	2	{ "reasoning": "The generated answer is incorrect because it states that there were 114,974 households in receipt of the HAP scheme, while the ground truth answers mention that at the end of Q3 2023, there were nearly 57,617 households actively receiving HAP support. The figures are not aligned, indicating that the generated answer contains major errors.", "score": 2 }

Example LLM Relevance evaluation (range between 1 and 3):

(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-relevance-score :metric-llm-relevance-explanation])
    (kind/table))

model-ref	question	answer	metric-llm-relevance-score	metric-llm-relevance-explanation
llama3.1	Are there plans to further reduce public transport fares?	No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.	3	{ "reasoning": "The answer directly responds to the question by stating that there are currently no plans to further reduce public transport fares. It provides additional context about the existing fare initiatives and mentions the recently published Fares Determination aiming for consistent and equitable fares. This information is relevant and fully addresses the question.", "score": 3 }
llama3.1	How many households were in receipt of HAP payments in 2023?	At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023.	3	{ "reasoning": "The answer provides a specific number of households receiving HAP payments at the end of Q3 2023, which directly addresses the question about the number of households in receipt of HAP payments in 2023. It also clarifies the time frame of the data, indicating that it may not cover the entire year of 2023, thereby providing a complete and relevant response to the question.", "score": 3 }
gemma3:1b	Are there plans to further reduce public transport fares?	Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.	2	{ "reasoning": "The answer correctly identifies that there are plans to further reduce public transport fares by mentioning a 'recently published Fares Determination' aimed at making fares more consistent and equitable. However, the reference to the Student Leap Card is irrelevant to the main question about reducing public transport fares. Thus, the answer partially answers the question, as it provides some relevant information but includes unnecessary details.", "score": 2 }
gemma3:1b	How many households were in receipt of HAP payments in 2023?	In 2023, there were 114,974 households in receipt of the HAP scheme.	3	{ "reasoning": "The generated answer provides a specific number of households that were in receipt of HAP payments in 2023. This directly answers the question by providing the exact information requested.", "score": 3 }

Interestingly, even though the gemma3:1b responses were factually incorrect, they still received relevance scores of 2 and 3 (out of 3) from the evaluator model. In other words, the evaluator recognises that the answers were still attempting to address the question in a ‘relevant’ manner, even though they got the facts wrong.
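One way to surface such cases is to filter for answers that score well on relevance but fail the faithfulness check. The following is a minimal sketch over scores copied by hand from the example tables above; the function name `relevant-but-unfaithful`, the shortened question labels, and the relevance threshold are illustrative assumptions, not part of the evaluation code.

```clojure
(def sample-scores
  ;; Scores copied from the faithfulness and relevance tables above.
  ;; Question text is shortened to a label for readability.
  [{:model-ref "llama3.1"  :question "fares" :metric-llm-faithfulness-score 1 :metric-llm-relevance-score 3}
   {:model-ref "llama3.1"  :question "HAP"   :metric-llm-faithfulness-score 1 :metric-llm-relevance-score 3}
   {:model-ref "gemma3:1b" :question "fares" :metric-llm-faithfulness-score 0 :metric-llm-relevance-score 2}
   {:model-ref "gemma3:1b" :question "HAP"   :metric-llm-faithfulness-score 0 :metric-llm-relevance-score 3}])

(defn relevant-but-unfaithful
  "Rows judged unfaithful (score 0) whose relevance is at least min-relevance."
  [rows min-relevance]
  (filterv (fn [{:keys [metric-llm-faithfulness-score metric-llm-relevance-score]}]
             (and (= 0 metric-llm-faithfulness-score)
                  (>= metric-llm-relevance-score min-relevance)))
           rows))

;; Only the gemma3:1b HAP answer has full relevance despite being unfaithful.
(relevant-but-unfaithful sample-scores 3)
```

Rows returned by a filter like this are good candidates for manual review, since relevance alone would not have flagged them.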

6.3.4 Running/Saving evaluations

(defn run-and-save-evaluation-metrics! [responses model]
  (let [model-ref (:model-ref (first responses))
        f-name (str "data/responses_evaluation/" model-ref "_evaluation.edn")
        resp (add-all-generation-evaluation-metrics responses model)]
    (spit f-name resp)))

(defn run-and-save-all-evals! [responses-dir model]
  (let [responses (->> (io/file responses-dir)
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp)))]
    (mapv #(run-and-save-evaluation-metrics! % model) responses)))

(comment
  ;; 43:55 (very roughly) to run around 15 models
  ;; cost - around 1.44 USD for 18 models * 10 questions each - 180 evaluations
  (run-and-save-all-evals! "data/responses" "o4-mini-2025-04-16"))
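Note that `file-seq` returns the directory itself first, which is why `rest` is used above, but it also walks into subdirectories and includes any stray files (hidden files, for example). A more defensive listing might look like the sketch below; `edn-files` is a hypothetical helper, not something defined elsewhere in this notebook.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn edn-files
  "Regular files under dir whose names end in .edn (hypothetical helper)."
  [dir]
  (->> (io/file dir)
       file-seq
       ;; drop the directory entries that file-seq also returns
       (filter #(.isFile ^java.io.File %))
       ;; keep only EDN files, skipping anything else in the folder
       (filter #(str/ends-with? (.getName ^java.io.File %) ".edn"))))
```

Swapping this in for the `file-seq`/`rest` pair would mean that only files ending in `.edn` are passed to `slurp` and `edn/read-string`.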

6.4 Exploring Performance

In this part we’ll try to compare the 20 models based on their performance across the metrics.

We’ll start by defining a few helper functions.

(defn average-coll [coll]
  (float
   (/ (apply + (remove nil? coll))
      (count (remove nil? coll)))))
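`average-coll` skips nil values, so a model with a missing score is averaged over fewer answers. It will, however, throw a divide-by-zero error if every value is nil or the collection is empty. A minimal sketch of a guarded variant follows; the name `safe-average` is an assumption for illustration only.

```clojure
(defn safe-average
  "Mean of the non-nil values in coll as a double, or nil when none remain."
  [coll]
  (let [xs (remove nil? coll)]
    (when (seq xs)
      (double (/ (reduce + xs) (count xs))))))

;; nil entries are ignored: this is the mean of 1, 0 and 1
(safe-average [1 0 nil 1])

;; no usable values: returns nil rather than throwing
(safe-average [nil nil])
```

Returning a double rather than a float would probably also avoid single-precision artefacts such as 2.700000047683716 in an averages table, though that depends on how the dataset library stores the column.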

(defn average-all-cols [numerical-ds]
  (let [cols (tc/column-names numerical-ds)]
    (tc/dataset
     (reduce (fn [res col]
               (assoc res col (average-coll (numerical-ds col))))
             {} cols))))

(defn summarise-model-performance-avgs [rag-datas]
  (let [model-ref (:model-ref (first rag-datas))]
    (-> rag-datas
        (tc/dataset)
        (tc/drop-columns #(re-find #"by-sentence" (name %)))
        (tc/select-columns :type/numerical)
        average-all-cols
        (tc/add-column :model-ref model-ref))))

(defn build-responses-eval-ds-avgs [responses-eval-dir]
  (let [responses (->> responses-eval-dir
                       io/file
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp))
                       (mapv summarise-model-performance-avgs))]
    (apply tc/concat responses)))

(def ds-performance-averages (build-responses-eval-ds-avgs "data/responses_evaluation"))

(kind/table ds-performance-averages)

rouge-l-recall	flesch-kincaid-grade-level	flesch-reading-ease	rouge-faithfulness	metric-llm-faithfulness-score	metric-llm-correctness-score	metric-llm-relevance-score	token-overlap-f1	token-overlap-faithfulness	bleu-score	token-overlap-recall	rouge-l-precision	token-overlap-precision	cosine-similarity	rouge-l-f1	bleu-faithfulness	model-ref
0.8239417672157288	14.509193420410156	30.045215606689453	0.8083333373069763	1.0	4.5	2.700000047683716	0.3923736810684204	0.824999988079071	0.2834573984146118	0.9157578349113464	0.30401644110679626	0.2837636470794678	0.8581802845001221	0.40897229313850403	3.135146835120395E-5	gemini-2.5-flash-preview-04-17
0.7446075677871704	15.492807388305664	27.18585777282715	0.670634925365448	0.699999988079071	4.400000095367432	2.700000047683716	0.25699713826179504	0.6595237851142883	0.20455192029476166	0.809962272644043	0.18923130631446838	0.19086696207523346	0.7771462798118591	0.2610572874546051	1.724415342323482E-4	llava
0.6585989594459534	12.719018936157227	38.68239212036133	0.8999999761581421	0.800000011920929	4.0	2.5	0.48910555243492126	0.800000011920929	0.4617713987827301	0.7250267267227173	0.4208480715751648	0.43597525358200073	0.7402637004852295	0.46610692143440247	5.130228237248957E-5	gemini-2.0-flash-lite
0.8267883062362671	15.08983325958252	24.900646209716797	0.5108433365821838	0.8999999761581421	4.599999904632568	2.700000047683716	0.19189156591892242	0.49193307757377625	0.11331550776958466	0.873235285282135	0.1329944133758545	0.11036045849323273	0.8017071485519409	0.2255573570728302	4.652861025533639E-5	mistral
0.7401713728904724	13.667346954345703	30.989845275878906	0.8883333206176758	1.0	4.099999904632568	2.5999999046325684	0.43316417932510376	0.8666666746139526	0.35321152210235596	0.829186201095581	0.3343445956707001	0.32255521416664124	0.844434380531311	0.4301661252975464	2.77885210380191E-5	gemma3:4b
0.7641585469245911	16.72433853149414	18.24077796936035	0.8014285564422607	0.7777777910232544	4.5	2.799999952316284	0.27142131328582764	0.770714282989502	0.18540184199810028	0.832671582698822	0.1887160837650299	0.1654357612133026	0.8286173939704895	0.29786062240600586	3.24876127706375E-5	gpt-3.5-turbo
0.8675227165222168	13.08331298828125	35.59113693237305	0.45357003808021545	0.800000011920929	4.0	2.700000047683716	0.09177528321743011	0.4481297433376312	0.054576508700847626	0.8706628084182739	0.08819595724344254	0.049087464809417725	0.7681989669799805	0.1584366261959076	4.776632704306394E-5	deepseek-r1
0.8518903255462646	15.363012313842773	29.911664962768555	0.8178571462631226	0.699999988079071	4.400000095367432	2.799999952316284	0.40320008993148804	0.8285714387893677	0.2871659994125366	0.930367648601532	0.2962702214717865	0.28042489290237427	0.834974467754364	0.4168442189693451	8.801817602943629E-5	llama3.1
0.8944202661514282	13.306819915771484	37.606502532958984	0.901190459728241	1.0	4.400000095367432	2.5	0.5528241395950317	0.9279761910438538	0.4853930175304413	0.9163230061531067	0.4865003228187561	0.4802079498767853	0.8847209811210632	0.5746822357177734	2.626385321491398E-5	gemini-2.0-flash
0.8093876838684082	17.22306251525879	22.876859664916992	0.7422980070114136	1.0	4.699999809265137	2.9000000953674316	0.29043659567832947	0.7595959305763245	0.1903877556324005	0.9099509716033936	0.23230594396591187	0.19369524717330933	0.839836835861206	0.3286050856113434	0.0020474481862038374	claude-3-haiku-20240307
0.6825458407402039	14.20469856262207	29.108247756958008	0.8657143115997314	0.8999999761581421	4.400000095367432	2.5999999046325684	0.41160085797309875	0.841269850730896	0.36301279067993164	0.7764495611190796	0.3486059010028839	0.34439048171043396	0.7830838561058044	0.4076654016971588	1.0013163773692213E-5	gpt-4o
0.7528297305107117	15.68631649017334	27.47216796875	0.6123015880584717	0.800000011920929	4.300000190734863	2.5	0.34956371784210205	0.6194444298744202	0.26341864466667175	0.876151978969574	0.26649415493011475	0.25289803743362427	0.8231343030929565	0.3607568144798279	8.60626096255146E-5	llama3.2
0.8491318821907043	13.80518627166748	31.911664962768555	0.8536802530288696	1.0	4.699999809265137	2.700000047683716	0.38320112228393555	0.9280391931533813	0.2799951434135437	0.9323809742927551	0.29193824529647827	0.2698879837989807	0.8561334013938904	0.4001650810241699	3.265933628426865E-5	gemini-2.5-pro-preview-03-25
0.6980282664299011	15.064505577087402	24.628639221191406	0.628333330154419	0.8999999761581421	4.699999809265137	2.9000000953674316	0.29340195655822754	0.5958333611488342	0.21232956647872925	0.8216220736503601	0.19381453096866608	0.1887860894203186	0.8139668107032776	0.294689804315567	6.333462806651369E-5	o3-mini
0.7963286638259888	14.412919998168945	28.09100341796875	0.8030555844306946	0.8999999761581421	4.400000095367432	2.799999952316284	0.37689483165740967	0.7830555438995361	0.26848381757736206	0.9036029577255249	0.2749701142311096	0.25850623846054077	0.8277978897094727	0.38727399706840515	2.2545595129486173E-5	granite3.2
0.7367693185806274	15.385704040527344	23.50239372253418	0.7857142686843872	0.8999999761581421	4.5	2.700000047683716	0.40638965368270874	0.7749999761581421	0.30492275953292847	0.8624509572982788	0.2630685865879059	0.2982247471809387	0.8380616903305054	0.3672952651977539	2.1593075871351175E-5	gpt-4o-mini
0.7743207216262817	18.492828369140625	16.849618911743164	0.8702597618103027	0.8999999761581421	4.599999904632568	2.799999952316284	0.2994838356971741	0.885064959526062	0.22269850969314575	0.8431162238121033	0.2348792999982834	0.2037002444267273	0.8248969316482544	0.3347106873989105	0.0013325664913281798	claude-3-5-haiku-20241022
0.8427241444587708	17.876224517822266	21.390300750732422	0.8999999761581421	0.8999999761581421	4.300000190734863	2.799999952316284	0.3259376883506775	0.8999999761581421	0.2285032570362091	0.919766902923584	0.24280297756195068	0.21922659873962402	0.8467821478843689	0.3485448658466339	0.0023539727553725243	claude-3-7-sonnet-20250219
0.6483379006385803	14.382723808288574	31.3478946685791	0.49647414684295654	0.20000000298023224	2.4000000953674316	2.200000047683716	0.21258415281772614	0.5069719552993774	0.17131498456001282	0.6721078157424927	0.18695475161075592	0.16818435490131378	0.733691394329071	0.24562521278858185	2.699718461371958E-4	gemma3:1b
0.6876193284988403	15.509061813354492	28.586748123168945	0.7070617079734802	0.8999999761581421	4.800000190734863	2.9000000953674316	0.36989516019821167	0.6689935326576233	0.32387083768844604	0.7959415316581726	0.3038523495197296	0.3066492974758148	0.8207412958145142	0.35781922936439514	8.054783684201539E-4	o4-mini-2025-04-16

(defn concat-responses-eval-data [responses-eval-dir]
  (let [responses (->> responses-eval-dir
                       io/file
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp)))]
    (reduce into responses)))

(defn add-model-platform [ds]
  (-> ds
      (tc/map-columns :platform [:model-ref]
                      (fn [m]
                        (->
                         (filter #(= (:model-ref %) m) llm/llm-models)
                         first
                         :platform)))))
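The lookup above scans `llm/llm-models` once per row. With a couple of hundred rows that is perfectly fine, but the same idea can be expressed with a prebuilt map from model reference to platform. A sketch, assuming only that each model entry is a map with `:model-ref` and `:platform` keys (as the code above implies); the sample data and names are hypothetical.

```clojure
(def example-models
  ;; Hypothetical stand-in for llm/llm-models; real entries may carry more keys.
  [{:model-ref "model-a" :platform "platform-x"}
   {:model-ref "model-b" :platform "platform-y"}])

(defn platform-index
  "Builds a map of model-ref -> platform from a seq of model maps."
  [models]
  (into {} (map (juxt :model-ref :platform)) models))

(def model->platform (platform-index example-models))

;; A map is a function of its keys, so it can be used directly as a lookup.
(model->platform "model-b")
```

A map built this way could be passed straight to `tc/map-columns` in place of the anonymous function, since it takes a single model reference and returns the platform (or nil when the model is unknown).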

(defn concat-responses-eval-ds-narrowed [responses-eval-dir]
  (let [ds (tc/dataset (concat-responses-eval-data responses-eval-dir))]
    (-> ds
        (tc/select-columns
         (concat
          (tc/column-names ds :type/numerical)
          [:model-ref :question])))))

(def responses-eval-data (concat-responses-eval-data "data/responses_evaluation"))

(def ds-responses-eval-narrowed (concat-responses-eval-ds-narrowed "data/responses_evaluation"))

(defn make-boxplot [metric]
  (->
   ds-responses-eval-narrowed
   add-model-platform
   (tc/order-by :model-ref)
   (plotly/base
    {:=width 800
     :=color :platform})
   (plotly/layer-boxplot
    {:=x :model-ref
     :=y metric})))

6.4.1 Deterministic Metrics (non-llm)

6.4.1.1 Reading Ease

The flesch-kincaid-grade-level and flesch-reading-ease metrics help show how readable a response is. A lower grade level and a higher reading-ease score both indicate more readable text.
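For reference, both scores are computed from the average sentence length and the average number of syllables per word. The sketch below uses the standard published Flesch formulas; it assumes the metrics library follows them, and the word, sentence and syllable counts in the example are made up for illustration.

```clojure
(defn flesch-reading-ease
  "Standard Flesch reading-ease formula; higher means easier to read."
  [words sentences syllables]
  (- 206.835
     (* 1.015 (/ words (double sentences)))
     (* 84.6 (/ syllables (double words)))))

(defn flesch-kincaid-grade-level
  "Standard Flesch-Kincaid formula; approximates a US school grade."
  [words sentences syllables]
  (- (+ (* 0.39 (/ words (double sentences)))
        (* 11.8 (/ syllables (double words))))
     15.59))

;; Hypothetical text: 100 words, 5 sentences, 150 syllables.
(flesch-reading-ease 100 5 150)        ; roughly 59.6
(flesch-kincaid-grade-level 100 5 150) ; roughly 9.9
```

Because both penalties grow without bound, long sentences packed with multi-syllable words can push the reading-ease score below zero, which is how a negative value such as the one in the examples that follow can arise.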

(make-boxplot :flesch-reading-ease)

(make-boxplot :flesch-kincaid-grade-level)

Examples of the lowest and highest reading-ease answers:

(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/order-by :flesch-reading-ease)
    (tc/select-rows (range 1)))

_unnamed [1 2]:

:flesch-reading-ease	:answer
-9.58022523	Based on the information provided, I cannot give a specific answer about whether the government will put in place Level 6 (QQI) courses for healthcare assistants.

	The context indicates that currently:
	- The educational requirement for a Health Care Assistant is a Level 5 (QQI) qualification
	- The Office of Nursing and Midwifery Services Director (ONMSD) designs and delivers professional development programs ranging from level 5 to level 8
	- Healthcare assistants are supported through ONMSD Centers to achieve the Level 5 Major award in Health Service Skills

	While there is information about support for jobseekers pursuing higher education up to Level 8 and the ongoing revalidation of craft programs, there is no specific mention of plans to develop Level 6 courses specifically for healthcare assistants.

	If you require more information about progression pathways for healthcare assistants beyond Level 5, I would recommend contacting the ONMSD or the relevant department directly.

(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/order-by :flesch-reading-ease :desc)
    (tc/select-rows (range 1)))

_unnamed [1 2]:

:flesch-reading-ease	:answer
69.99384615	In 2023, there were 114,974 households in receipt of the HAP scheme.

Let’s try a high reading-ease answer with more than 100 words…

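(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    ;; Approximate word count: splitting on #"\w+" returns the gaps between
    ;; runs of word characters, which is close to the number of words.
    (tc/map-columns :wc [:answer] (fn [ans]
                                    (-> (str/split ans #"\w+")
                                        (count))))
    (tc/select-rows #(> (:wc %) 100))
    (tc/order-by :flesch-reading-ease :desc)
    (tc/select-rows (range 1)))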

_unnamed [1 3]:

:flesch-reading-ease	:answer	:wc
47.57846154	According to the provided context, a total of 161,600 children benefit from the School Transport Scheme. Here is the breakdown:	118

	1. Over 135,000 pupils are traveling on primary and post-primary services.
	2. 19,800 pupils have special educational needs.
	3. 6,800 pupils who have arrived in Ireland from Ukraine are included in this number as well.

	It’s important to note that these numbers might overlap for students with special educational needs who also travel on primary or post-primary services. The context does not provide a separate total for students with special educational needs and those from Ukraine, but it does give the individual numbers within the overall figure of 161,600.

6.4.1.2 Precision

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-precision})
    (plotly/layer-bar
     {:=y :rouge-l-precision}))

6.4.1.3 Recall

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-recall})
    (plotly/layer-bar
     {:=y :rouge-l-recall}))

6.4.1.4 Precision/Recall (F1)

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :rouge-l-f1}))

(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-f1}))
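The F1 values plotted here are the harmonic mean of precision and recall, so a very low precision drags the score down even when recall is perfect. A small sketch, using the token-overlap precision and recall from the llama3.1 sample response shown earlier; the function name `f1` is just for illustration.

```clojure
(defn f1
  "Harmonic mean of precision and recall; 0.0 when both are zero."
  [precision recall]
  (if (zero? (+ precision recall))
    0.0
    (/ (* 2.0 precision recall) (+ precision recall))))

;; token-overlap precision and recall from the llama3.1 fares example
(f1 0.17142857142857143 1.0) ; approximately 0.2927
```

This reproduces the :token-overlap-f1 value reported for that sample, and shows why long, context-rich answers tend to score low on F1 against short ground-truth answers.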

6.4.2 LLM Generated Metrics

6.4.2.1 Faithfulness

(defn make-bar-avgs [metric]
  (->
   ds-performance-averages
   add-model-platform
   (tc/order-by metric)
   (plotly/base
    {:=width 800
     :=color :platform})
   (plotly/layer-bar
    {:=x :model-ref
     :=y metric})))

(make-bar-avgs :metric-llm-faithfulness-score)

6.4.2.2 Correctness

(make-bar-avgs :metric-llm-correctness-score)

6.4.2.3 Relevance

(make-bar-avgs :metric-llm-relevance-score)

6.4.3 Individual Performances

Let’s make a simple ‘dashboard’ type view to try to get a sense of each model’s performance at a glance.

We’ll introduce an ‘indicator’ marker to show if the model is performing okay for a metric. Perhaps in an actual evaluation system this could be some kind of target threshold that the model should meet. In this case, we’ll just use the averages of all the model performances, so that the indicator will simply indicate if the metric is above/below average.

(defn average-score [ds metrics]
  (->>
   (mapv #(ds %) metrics)
   (reduce into)
   (average-coll)))

(defn eval-averages [ds]
  (-> ds
      (tc/aggregate {:faithfulness #(average-score % [:metric-llm-faithfulness-score])
                     :correctness #(average-score % [:metric-llm-correctness-score])
                     :relevance #(average-score % [:metric-llm-relevance-score])
                     :semantic-similarity #(average-score % [:cosine-similarity])
                     :recall #(average-score % [:token-overlap-recall
                                                :rouge-l-recall])
                     :precision #(average-score % [:rouge-l-precision
                                                   :token-overlap-precision])
                     :f1 #(average-score % [:rouge-l-f1
                                            :token-overlap-f1])})
      (tc/rows :as-maps)
      first))

(def eval-averages-all (eval-averages ds-responses-eval-narrowed))

(defn indicator-symbol [colour]
  [:span {:style (str "color: " colour ";")} "&#11044;"])

(def indicator-bad (indicator-symbol "red"))

(def indicator-medium (indicator-symbol "yellow"))

(def indicator-good (indicator-symbol "green"))

Above target: green
At or below target, but within 10% of it: amber (rendered in yellow)
More than 10% below target: red

(defn make-indicator-symbol [value target-value]
  (if (> value target-value) indicator-good
      (let [diff (abs (- target-value value))
            diff-percent (float (/ diff target-value))]
        (if (<= diff-percent 0.1)
          indicator-medium
          indicator-bad))))
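The same thresholds can be checked in isolation by returning a keyword instead of a hiccup element. This is a sketch only: `indicator-level` and the target of 4.0 are hypothetical, chosen to make the three bands easy to see.

```clojure
(defn indicator-level
  "Classifies value against target using the same bands as the indicator."
  [value target]
  (cond
    (> value target) :green
    ;; at or below target: amber if the shortfall is at most 10% of the target
    (<= (/ (- target value) (double target)) 0.1) :amber
    :else :red))

(indicator-level 4.5 4.0) ; above target, :green
(indicator-level 3.8 4.0) ; 5% below target, :amber
(indicator-level 2.4 4.0) ; 40% below target, :red
(indicator-level 4.0 4.0) ; exactly on target, :amber
```

Note the last case: a value exactly equal to the target is not strictly above it, so it falls into the amber band.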

(defn model-performance-summary [ds model-ref]
  (let [model-per       (filter #(= (:model-ref %) model-ref) ds)
        faithfulness    (count (filter #(= (:metric-llm-faithfulness-score %) 1) model-per))
        total-questions (count model-per)
        {:keys [correctness
                relevance
                semantic-similarity
                recall
                precision
                f1]} (-> model-per tc/dataset eval-averages)]
    [:div
     [:h1 (name model-ref)]
     [:p (str "Scores based on " total-questions " evaluation questions.")]
     [:table {:style "width: 70%;"}
      [:tr
       [:th "Metric"]
       [:th "Score"]
       [:th "Reference Average"]
       [:th "Status"]]
      [:tr
       [:td "Faithfulness"]
       [:td (str faithfulness "/" total-questions)]
       [:td (:faithfulness eval-averages-all)]
       [:td (make-indicator-symbol (/ faithfulness total-questions) (:faithfulness eval-averages-all))]]
      [:tr
       [:td "Correctness"]
       [:td correctness]
       [:td (:correctness eval-averages-all)]
       [:td (make-indicator-symbol correctness
                                   (:correctness eval-averages-all))]]
      [:tr
       [:td "Relevance"]
       [:td relevance]
       [:td (:relevance eval-averages-all)]
       [:td (make-indicator-symbol relevance
                                   (:relevance eval-averages-all))]]
      [:tr
       [:td "Semantic Similarity"]
       [:td semantic-similarity]
       [:td (:semantic-similarity eval-averages-all)]
       [:td (make-indicator-symbol semantic-similarity
                                   (:semantic-similarity eval-averages-all))]]
      [:tr
       [:td "Recall"]
       [:td recall]
       [:td (:recall eval-averages-all)]
       [:td (make-indicator-symbol recall
                                   (:recall eval-averages-all))]]
      [:tr
       [:td "Precision"]
       [:td precision]
       [:td (:precision eval-averages-all)]
       [:td (make-indicator-symbol precision
                                   (:precision eval-averages-all))]]
      [:tr
       [:td "F1"]
       [:td f1]
       [:td (:f1 eval-averages-all)]
       [:td (make-indicator-symbol f1
                                   (:f1 eval-averages-all))]]]]))

(mapv #(kind/hiccup (model-performance-summary responses-eval-data %))
     (distinct (map :model-ref responses-eval-data)))

[

gemini-2.5-flash-preview-04-17