6 Generation Evaluation
(ns notebooks.rag-evaluation
  (:require [libpython-clj2.require :refer [require-python]]
            [libpython-clj2.python :refer [py..] :as py]
            [notebooks.preparation :refer [ds]]
            [selmer.parser :as templates]
            [scicloj.kindly.v4.kind :as kind]
            [notebooks.generation :as gen]
            [scicloj.tableplot.v1.plotly :as plotly]
            [notebooks.llm-api :as llm]
            [clojure.edn :as edn]
            [clojure.set]
            [clojure.string :as str]
            [tablecloth.api :as tc]
            [clojure.java.io :as io]
            [notebooks.vdb-evaluation :as vdb])
  (:import
   (dev.langchain4j.data.segment TextSegment)
   (dev.langchain4j.store.embedding CosineSimilarity)
   (dev.langchain4j.model.embedding.onnx.allminilml6v2 AllMiniLmL6V2EmbeddingModel)))
For this section, I rely heavily on the metrics and approach of continuous-eval (a Python library) as a starting point for thinking about how to evaluate the RAG.
That repository also has some great links to articles explaining some of the concepts in more detail.
As the creators of the project write, there are several kinds of questions you might want to consider when evaluating answer generation:
Do I have to use GPT-4 or would a smaller model work too?
Should I fine-tune an LLM for my RAG application?
Which prompts minimize hallucination the most?
How sensitive are answers to different prompts?
Is the LLM already good enough if I provide the right contexts, and should I focus on improving Retrieval instead?
(source)
In this exercise, I will only really look at the question of which LLM might work best with the data that I have and the prompt/retrieval framework we have already set up.
We will focus on three categories of metrics:
Deterministic
Semantic
LLM-based
Deterministic metrics are similar to how we measured the retrieval performance; they simply measure the token overlap between answers generated by the LLM and some kind of reference/ground-truth answers.
The semantic metric is similar to the method of retrieving information from the vector database; it checks how similar two pieces of text are based on vector embeddings.
LLM-based metrics utilise another LLM to assign a score to the output. For example, to determine ‘answer-correctness’, we will ask an LLM to assign a score from 1 to 5 to a generated answer, based on reference answers that we provide ourselves.
6.1 Evaluation Dataset
Before going into the metrics further, we will first create a testing dataset that contains some questions and ground truth answers. I’ve used 10 fairly random questions based on some of the material in the starting dataset of questions and answers.
Ideally, we would use a much larger and more thoughtfully curated evaluation dataset, perhaps with input from domain experts across different question areas. The goal here, however, is simply to test out some evaluation workflows in Clojure, so a basic evaluation dataset will have to do for now.
Below, we just load that dataset. The ‘questions.edn’ file is set up as a Clojure map, where the questions are keys and the ground truth answers are values.
(def evaluation-dataset
  (let [data (edn/read-string (slurp "data/evaluation_questions/questions.edn"))
        questions (keys data)
        ground-truth (vals data)]
    ;; pair each question with its ground-truth answers
    (mapv (fn [question truth] (-> {}
                                   (assoc :question question)
                                   (assoc :ground-truth truth)))
          questions ground-truth)))
(kind/table evaluation-dataset)
question | ground-truth |
---|---|
How many people are eligible for payments under the Monther and Baby Homes scheme? | |
What is the salary scale for an archaeologist in the local government sector? | |
How many pupils in total benefit from the School Transport Scheme (STS)? Can you also break down the numbers in terms of students with special educational needs and students who are from Ukraine? | |
Will the government put in place Level 6 (QQI) courses for healthcare assistants? | |
How many households were in receipt of HAP payments in 2023? | |
What is the government doing with regard to the National Drugs Strategy? | |
What is the government doing to help improve GP services? | |
What is the punishment for someone driving without a driving licence? | |
Are there plans to further reduce public transport fares? | |
How is the government encouraging local authorities to apply for the town and village renewal scheme? | |
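To make the file layout described above concrete, here is a minimal sketch of the map-to-rows conversion. The two entries are invented placeholders, not the contents of the real questions.edn, and I am assuming that each ground-truth value is a vector of strings (as the later examples suggest). The name `edn->eval-rows` is hypothetical.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical stand-in for the contents of questions.edn:
;; a map of question -> vector of acceptable ground-truth answers.
(def example-edn
  "{\"What colour is the sky?\" [\"The sky is blue.\"]
    \"How many days are in a week?\" [\"Seven.\" \"There are 7 days in a week.\"]}")

(defn edn->eval-rows
  "Turns a question->answers map (given as an EDN string) into a vector of row maps."
  [edn-string]
  (mapv (fn [[question truth]]
          {:question question
           :ground-truth truth})
        (edn/read-string edn-string)))

(edn->eval-rows example-edn)
```

Destructuring each map entry keeps a question and its answers together in one step, which is an alternative to walking `keys` and `vals` in parallel.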
6.2 Generate LLM Answers
Next, we will write a helper function that generates responses from different LLMs and saves them to disk. These are the responses that we will evaluate later. As you can see below, I tested 20 different models. Some were small models running locally (around 8B parameters at most), and some were cloud-based models from Google, OpenAI and Anthropic.
(defn ask-llm-save-responses! [model questions]
  (let [responses (reduce (fn [res question]
                            (conj res
                                  (gen/get-rag-answer
                                   (assoc question :model-ref model)
                                   :db-docs)))
                          [] questions)
        f-name (str "data/responses/" model "_responses.edn")]
    (spit f-name responses)))
(comment
  (ask-llm-save-responses! "gemini-2.0-flash-lite" evaluation-dataset)
  (ask-llm-save-responses! "llama3.1" evaluation-dataset)
  (ask-llm-save-responses! "gpt-3.5-turbo" evaluation-dataset)
  (ask-llm-save-responses! "gemma3:1b" evaluation-dataset)
  (ask-llm-save-responses! "gpt-4o-mini" evaluation-dataset)
  (ask-llm-save-responses! "gpt-4o" evaluation-dataset)
  (ask-llm-save-responses! "o4-mini-2025-04-16" evaluation-dataset)
  (ask-llm-save-responses! "o3-mini" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.0-flash" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-7-sonnet-20250219" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-5-haiku-20241022" evaluation-dataset)
  (ask-llm-save-responses! "claude-3-haiku-20240307" evaluation-dataset)
  (ask-llm-save-responses! "llama3.2" evaluation-dataset)
  (ask-llm-save-responses! "mistral" evaluation-dataset)
  (ask-llm-save-responses! "llava" evaluation-dataset)
  (ask-llm-save-responses! "deepseek-r1" evaluation-dataset)
  (ask-llm-save-responses! "gemma3:4b" evaluation-dataset)
  (ask-llm-save-responses! "granite3.2" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.5-pro-preview-03-25" evaluation-dataset)
  (ask-llm-save-responses! "gemini-2.5-flash-preview-04-17" evaluation-dataset))
(defonce responses-ds
  (let [responses-dir "data/responses"
        responses (->> responses-dir
                       (io/file)
                       file-seq
                       rest ; drop the directory entry itself
                       (map (comp edn/read-string slurp))
                       (reduce into))]
    (tc/dataset responses)))
(tc/row-count responses-ds)
200
Each model answered the 10 questions from the evaluation dataset, so that’s 200 responses overall.
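The loading step above boils down to reading each EDN file into a vector of maps and concatenating the vectors. Here is a minimal sketch of that idea, with in-memory strings standing in for the files. The response maps are invented and far smaller than the real ones, and `fake-files` and `all-responses` are hypothetical names.

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical file contents: one EDN vector of response maps per model.
(def fake-files
  ["[{:model-ref \"model-a\" :question \"Q1\" :answer \"A1\"}
     {:model-ref \"model-a\" :question \"Q2\" :answer \"A2\"}]"
   "[{:model-ref \"model-b\" :question \"Q1\" :answer \"B1\"}
     {:model-ref \"model-b\" :question \"Q2\" :answer \"B2\"}]"])

(def all-responses
  (->> fake-files
       (map edn/read-string)   ; one vector of maps per "file"
       (reduce into [])))      ; concatenate into a single vector

(count all-responses)
;; => 4

(frequencies (map :model-ref all-responses))
;; => {"model-a" 2, "model-b" 2}
```

Note that `file-seq` returns the directory itself as its first element, which is why the loading code drops it with `rest` before reading the files.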
6.3 Continuous Eval Metrics Functions
Below, I create a wrapper for the continuous-eval deterministic metrics and rewrite the LLM-based metrics in Clojure, using the prompt templates provided in the continuous-eval repo.
For demonstrating how the metrics work, we will use a couple of the generated responses as samples.
For the question “How many households were in receipt of HAP payments in 2023?”, the available data states that 57,617 households were in receipt of payments at the end of Q3 2023. In other words, the full data for 2023 was not available at that time. Most of the models seemed to pick up that detail, but one of the lower-powered ones, gemma3:1b (a 1-billion-parameter model), instead reported 114,974 households for 2023, with no mention of Q3.
Also, the question “Are there plans to further reduce public transport fares?” should be a simple ‘no’, based on the available data, but the gemma3:1b model also gets this one wrong.
(def sample-gen-responses
  (-> responses-ds
      (tc/select-rows #(and (or (= (:model-ref %) "llama3.1")
                                (= (:model-ref %) "gemma3:1b"))
                            (or (re-find #"receipt of HAP payments" (:question %))
                                (re-find #"transport fares" (:question %)))))))

(-> sample-gen-responses
    (tc/select-columns [:model-ref :question :answer])
    (kind/table))
model-ref | question | answer |
---|---|---|
llama3.1 | Are there plans to further reduce public transport fares? | No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. |
llama3.1 | How many households were in receipt of HAP payments in 2023? | At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. |
gemma3:1b | Are there plans to further reduce public transport fares? | Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. |
gemma3:1b | How many households were in receipt of HAP payments in 2023? | In 2023, there were 114,974 households in receipt of the HAP scheme. |
6.3.1 Deterministic Metrics
(require-python '[continuous_eval.metrics.generation.text.deterministic :as det])
:ok
(defn add-deterministic-metrics [{:keys [answer retrieved-context ground-truth] :as rag-data}]
  (let [faithfulness-spec {:answer answer
                           :retrieved_context retrieved-context}
        ;; fall back to the retrieved context when no ground truth is available
        correctness-spec {:answer answer
                          :ground_truth_answers (if (seq ground-truth) ground-truth retrieved-context)}
        faithfulness-scores (into {} (py.. (det/DeterministicFaithfulness) (**compute faithfulness-spec)))
        correctness-scores (into {} (py.. (det/DeterministicAnswerCorrectness) (**compute correctness-spec)))
        reading-scores (into {} (py.. (det/FleschKincaidReadability) (compute answer)))]
    (->
     (merge
      faithfulness-scores
      correctness-scores
      reading-scores
      rag-data)
     ;; convert the Python string keys to Clojure keywords
     (clojure.set/rename-keys
      {"flesch_reading_ease" :flesch-reading-ease
       "flesch_kincaid_grade_level" :flesch-kincaid-grade-level
       "rouge_l_recall" :rouge-l-recall
       "rouge_faithfulness" :rouge-faithfulness
       "rouge_l_precision" :rouge-l-precision
       "rouge_l_f1" :rouge-l-f1
       "rouge_p_by_sentence" :rouge-p-by-sentence
       "bleu_score_by_sentence" :bleu-score-by-sentence
       "bleu_faithfulness" :bleu-faithfulness
       "bleu_score" :bleu-score
       "token_overlap_p_by_sentence" :token-overlap-p-by-sentence
       "token_overlap_f1" :token-overlap-f1
       "token_overlap_precision" :token-overlap-precision
       "token_overlap_recall" :token-overlap-recall
       "token_overlap_faithfulness" :token-overlap-faithfulness}))))
Example score for the sample responses:
(-> (mapv add-deterministic-metrics (tc/rows sample-gen-responses :as-maps))
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :rouge-l-f1 :token-overlap-f1 :bleu-score])
    (kind/table))
model-ref | question | answer | rouge-l-f1 | token-overlap-f1 | bleu-score |
---|---|---|---|---|---|
llama3.1 | Are there plans to further reduce public transport fares? | No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. | 0.3582089520962353 | 0.2926829268292683 | 0.18303581288409837 |
llama3.1 | How many households were in receipt of HAP payments in 2023? | At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. | 0.5925925884224966 | 0.5263157894736842 | 0.3523584062391067 |
gemma3:1b | Are there plans to further reduce public transport fares? | Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. | 0.25714285411836735 | 0.20833333333333331 | 0.14325120993710644 |
gemma3:1b | How many households were in receipt of HAP payments in 2023? | In 2023, there were 114,974 households in receipt of the HAP scheme. | 0.5714285665306124 | 0.5 | 0.5050058400713138 |
The ‘F1’ scores are the combination of ‘precision’ and ‘recall’ metrics. As we saw in previous sections, precision is how much of the generated answer is reflected in the ground truth (i.e., what % of the generated answer is not ‘superfluous’), and recall is how much of the ground truth is reflected in the generated answer. The F1 score is the harmonic mean of both these scores, with a score closer to 1 being better. The ‘BLEU’ score is also better when it is closer to 1.
In this case, even though these metrics don’t check for semantic meaning or logic, the metrics do indicate that the llama3.1 responses were slightly better than the gemma3 responses.
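To make precision, recall and F1 concrete, here is a small hand-rolled sketch of a token-overlap score. It is an illustration only: the tokenizer (lower-casing and splitting on anything that is not a letter or digit) and the use of token sets are my own simplifications, so the numbers will not match continuous-eval's output. The names `tokens` and `token-overlap` are hypothetical.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

(defn tokens
  "Lower-cases text and returns its set of word tokens (simplified tokenizer)."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove str/blank?)
       set))

(defn token-overlap
  "Set-based precision, recall and F1 of an answer against a reference."
  [answer reference]
  (let [a         (tokens answer)
        r         (tokens reference)
        overlap   (count (set/intersection a r))
        ;; exact ratio arithmetic; converted to doubles at the end
        precision (if (seq a) (/ overlap (count a)) 0)
        recall    (if (seq r) (/ overlap (count r)) 0)
        f1        (if (pos? (+ precision recall))
                    (/ (* 2 precision recall) (+ precision recall))
                    0)]
    {:precision (double precision)
     :recall    (double recall)
     :f1        (double f1)}))

(token-overlap "the cat sat on the mat" "the cat sat")
;; => {:precision 0.6, :recall 1.0, :f1 0.75}
```

In this toy case every reference token appears in the answer (recall 1.0), but only three of the answer's five distinct tokens are in the reference (precision 0.6), and the harmonic mean of the two gives an F1 of 0.75.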
6.3.2 Semantic Similarity
We’ll also check, very roughly, the semantic similarity (based on cosine similarity) between the generated responses and the ground truth.
(defn calculate-cosine-similarity [text-a vec-text-b]
  (let [embedding-model (AllMiniLmL6V2EmbeddingModel/new)
        embedding-fn (fn [text]
                       (->> (TextSegment/from text)
                            (. embedding-model embed)
                            (.content)))]
    ;; the ground-truth answers are joined into a single text before embedding
    (CosineSimilarity/between (embedding-fn text-a) (embedding-fn (str/join " " vec-text-b)))))

(defn add-semantic-similarity [{:keys [answer ground-truth] :as rag-data}]
  (let [similarity-score (calculate-cosine-similarity answer ground-truth)]
    (-> rag-data
        (assoc :cosine-similarity similarity-score))))
(add-semantic-similarity {:answer "Berlin in the capital of France."
                          :ground-truth ["Paris is the capital of France."]})
{:answer "Berlin in the capital of France.",
 :ground-truth ["Paris is the capital of France."],
 :cosine-similarity 0.5875568139261295}

(add-semantic-similarity {:answer "The capital of France is Paris."
                          :ground-truth ["Paris is the capital of France."]})
{:answer "The capital of France is Paris.",
 :ground-truth ["Paris is the capital of France."],
 :cosine-similarity 0.989356549878067}

(add-semantic-similarity {:answer "Paris is the capital of France."
                          :ground-truth ["The capital of France is Paris."
                                         "The Mona Lisa is in Paris."]})
{:answer "Paris is the capital of France.",
 :ground-truth
 ["The capital of France is Paris." "The Mona Lisa is in Paris."],
 :cosine-similarity 0.7456770568295227}
(-> (mapv add-semantic-similarity (tc/rows sample-gen-responses :as-maps))
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :cosine-similarity])
    (kind/table))
model-ref | question | answer | cosine-similarity |
---|---|---|---|
llama3.1 | Are there plans to further reduce public transport fares? | No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. | 0.6855268462688894 |
llama3.1 | How many households were in receipt of HAP payments in 2023? | At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. | 0.9459118435287973 |
gemma3:1b | Are there plans to further reduce public transport fares? | Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. | 0.5193210040131704 |
gemma3:1b | How many households were in receipt of HAP payments in 2023? | In 2023, there were 114,974 households in receipt of the HAP scheme. | 0.7805409346707153 |
We can see a limitation of this metric here: even though the last gemma3:1b answer is factually incorrect, it is still quite ‘semantically similar’ to the ground truth answer.
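For intuition about the number being reported, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses tiny hand-made vectors rather than real embeddings, and the helper names `dot` and `cosine-similarity` are my own. As far as I understand, langchain4j's `CosineSimilarity/between` computes the same quantity on the embedding vectors.

```clojure
(defn dot
  "Dot product of two equal-length numeric vectors."
  [a b]
  (reduce + (map * a b)))

(defn cosine-similarity
  "Cosine of the angle between two equal-length, non-zero numeric vectors."
  [a b]
  (/ (dot a b)
     (* (Math/sqrt (dot a a))
        (Math/sqrt (dot b b)))))

(cosine-similarity [1.0 0.0] [1.0 0.0]) ;; => 1.0  (same direction)
(cosine-similarity [1.0 0.0] [0.0 1.0]) ;; => 0.0  (orthogonal)
(cosine-similarity [3.0 4.0] [4.0 3.0]) ;; => 0.96 (close, but not identical)
```

The score only reflects how closely two vectors point in the same direction. Two sentences about the same topic with similar wording can therefore score highly even when one of them states the wrong figure.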
6.3.3 LLM Metrics
As I mentioned above, I’m using the same approach/prompts for the LLM-based metrics as is used in the continuous-eval project.
For example, the ‘faithfulness’ prompt can be seen below:
(-> "prompts/faithfulness_sys.txt"
    slurp
    (str/replace #"\n" "\n\n")
    (gen/quoted-response)
    kind/md)
You are tasked to evaluate whether the statement is fully supported by the context. With a reasoning following the response, respond with either Yes or No.
If the model has correctly identified that it can’t answer based on the context, then also respond with Yes.
– EXAMPLES –
Example 1:
Context: The Eiffel Tower, a wrought-iron lattice tower on the Champ de Mars in Paris, France, is one of the most famous landmarks in the world. It was designed by Gustave Eiffel and completed in 1889.
Statement: The Eiffel Tower can be found in the center of London, near the Thames River.
Response:
{
“reasoning”: “The statement contradicts with the context, which states that Eiffel Tower is in Paris, as opposed to the center of London.”,
“score”: “No”
}
Example 2:
Context: Photosynthesis is a process used by plants and other organisms to convert light energy into chemical energy that can later be released to fuel the organisms’ activities. This chemical energy is stored in carbohydrate molecules, such as sugars, which are synthesized from carbon dioxide and water.
Statement: Photosynthesis in plants primarily involves the conversion of light energy into chemical energy stored in forms such as sugar.
Response:
{
“reasoning”: “The statement is supported by the context, which states that photosynthesis converts light energy into chemical energy and that the chemical energy is stored in carbohydrate molecules, such as sugars.”,
“score”: “Yes”
}
(defn add-llm-metric-correctness-score [{:keys [question answer ground-truth] :as rag-data} llm-model]
  (let [system-prompt (slurp "prompts/ans_correctness_sys.txt")
        user-prompt (-> "prompts/ans_correctness_user.txt"
                        slurp
                        (templates/render {:question question
                                           :answer answer
                                           :ground-truth (if (seq ground-truth)
                                                           ground-truth
                                                           (:retrieved-context rag-data))}))
        response (llm/ask-llm
                  {:model-ref llm-model
                   :question user-prompt
                   :system-prompt system-prompt})
        ;; first digit from 1 to 5 found shortly after the word "score"
        score (first (re-find #"(?<=[Ss]core(.{1,4}))[1-5]" response))
        score (when score (parse-long score))]
    (-> rag-data
        (assoc :metric-llm-correctness-explanation response)
        (assoc :metric-llm-correctness-score score))))
(defn add-llm-metric-faithfulness-score [{:keys [answer retrieved-context] :as rag-data} llm-model]
  (let [system-prompt (slurp "prompts/faithfulness_sys.txt")
        ret-ctx-joined (str/join "\n" retrieved-context)
        user-prompt (-> "prompts/faithfulness_user.txt"
                        slurp
                        (templates/render {:answer answer
                                           :retrieved-context-joined ret-ctx-joined}))
        response (llm/ask-llm
                  {:model-ref llm-model
                   :question user-prompt
                   :system-prompt system-prompt})
        ;; "yes" or "no" found shortly after the word "score"
        score (first (re-find #"(?<=score(.{1,4}))(yes|no)" (str/lower-case response)))
        score (when score (if (= score "yes") 1 0))]
    (-> rag-data
        (assoc :metric-llm-faithfulness-explanation response)
        (assoc :metric-llm-faithfulness-score score))))
(defn add-llm-metric-relevance-score [{:keys [answer question] :as rag-data} llm-model]
  (let [system-prompt (slurp "prompts/ans_relevance_sys.txt")
        user-prompt (-> "prompts/ans_relevance_user.txt"
                        slurp
                        (templates/render {:answer answer
                                           :question question}))
        response (llm/ask-llm
                  {:model-ref llm-model
                   :question user-prompt
                   :system-prompt system-prompt})
        ;; first digit from 1 to 3 found shortly after the word "score"
        score (first (re-find #"(?<=[Ss]core(.{1,4}))[1-3]" response))
        score (when score (parse-long score))]
    (-> rag-data
        (assoc :metric-llm-relevance-explanation response)
        (assoc :metric-llm-relevance-score score))))
(defn add-llm-metrics [rag-data model]
  (-> rag-data
      (add-llm-metric-correctness-score model)
      (add-llm-metric-faithfulness-score model)
      (add-llm-metric-relevance-score model)
      (assoc :evaluator-model model)))
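The score extraction used in the three functions above can be tried in isolation, without calling an LLM. The sketch below is a simplified variant rather than the exact code: it pulls the first value that follows the word "score" out of a response string. The helper names and the sample responses are made up for illustration.

```clojure
(require '[clojure.string :as str])

(defn extract-numeric-score
  "Returns the first digit in [1, max-score] that appears within a few
  characters after the word \"score\", or nil when none is found."
  [response max-score]
  (let [pattern (re-pattern (str "(?i)score.{1,4}?([1-" max-score "])"))]
    (some-> (re-find pattern response) second parse-long)))

(defn extract-yes-no-score
  "Maps a Yes/No verdict after the word \"score\" to 1/0, or nil when absent."
  [response]
  (when-let [[_ verdict] (re-find #"(?i)score.{1,4}?(yes|no)" response)]
    (if (= "yes" (str/lower-case verdict)) 1 0)))

(extract-numeric-score "{\n \"reasoning\": \"...\",\n \"score\": 5\n}" 5)
;; => 5
(extract-yes-no-score "{\n \"reasoning\": \"...\",\n \"score\": \"No\"\n}")
;; => 0
(extract-numeric-score "I cannot judge this." 5)
;; => nil
```

Returning nil when nothing matches keeps a malformed evaluator response from being silently counted as a low score. Since the prompts ask for a JSON object, parsing the response as JSON would probably be a sturdier alternative; the regex is kept here to mirror the approach above.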
Finally, let’s wrap all of the above three metric types (deterministic, semantic, and llm-based) into a single function.
(defn add-all-generation-evaluation-metrics [responses evaluation-model]
  (mapv (fn [resp]
          (-> resp
              add-deterministic-metrics
              add-semantic-similarity
              (add-llm-metrics evaluation-model)))
        responses))
Now, let’s use these metrics to evaluate the sample responses we generated earlier.
(comment
  (let [eval-model "gpt-4o"
        output-fname "data/evaluation_example/example.edn"
        sample-with-metrics (add-all-generation-evaluation-metrics
                             (tc/rows sample-gen-responses :as-maps)
                             eval-model)]
    (spit output-fname sample-with-metrics)))
(def sample-gen-responses-metrics (edn/read-string (slurp "data/evaluation_example/example.edn")))

(first sample-gen-responses-metrics)
{:metric-llm-faithfulness-explanation
"{\n \"reasoning\": \"The statement is fully supported by the context. It confirms that there are no current plans to further reduce public transport fares, stating that the 20% average PSO fare initiative and the 90-minute fare will continue. It also mentions the recently published Fares Determination, which aims to provide more consistent and equitable fares, aligning with the information in the context.\",\n \"score\": \"Yes\"\n}",
:rouge-l-recall 0.9230769230769231,
:flesch-kincaid-grade-level 14.486250000000002,
:flesch-reading-ease 29.200416666666683,
:rouge-faithfulness 1.0,
:retrieved-context
[" The funding allocated by Government also supports the fare initiatives we’ve introduced since late 2021 such as – - Reducing average fares on all PSO services by 20%, helping people with cost-of-living pressures; - Introducing, and most recently expanding, the Young Adult Card, reducing fares by a further 50% for young people just as they’re making choices about how they travel independently; and - Introducing a new 90-minute fare to enable easy interchange between different types of public transport. These initiatives have made public transport more affordable for more people and have undoubtably contributed toward the strong passenger growth we’re seeing across the network. There are no plans to amend or extend the Student Leap Card."
"On rail we’re also supporting improvements across the network on both commuter and inter-city services. We have also continued the 20% average PSO fare initiative and the 90-minute fare, all of which are available to those persons who have not yet reached the State pension age. However, there are no plans to further reduce public transport fares at this time."
"€613.463m has been secured for the provision of Public Service Obligation (PSO) and Local Link services. This package includes funding for the continuation of the 20% fare reduction on PSO services, the extension of the Young Adult Card on both PSO and commercial bus services to include 24-and-25-year-olds, and the continuation of the 90-minute fare until the end of 2024. Further, the recently published Fares Determination will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin."
"The Government is strongly committed to providing all citizens with reliable and realistic sustainable mobility options, and public transport plays a key role in the delivery of this goal. To support this objective, Government has provided increased levels of funding for public transport services in recent years. Under Budget 2024, a funding package of c."
"The report also found that any increase in public transport demand due to free fares would largely be achieved by reductions in walking (7%) and cycling (13%), resulting in just a 1% reduction in car travel. My Department’s funding for public transport services is therefore focussed on existing fare initiatives and service improvements such as BusConnects in the five cities, expanding town services in the larger towns, and rolling out Connecting Ireland in rural Ireland. These improvements are delivering real benefits for citizens across the country."],
:metric-llm-faithfulness-score 1,
:metric-llm-correctness-score 5,
:metric-llm-correctness-explanation
"{\n \"reasoning\": \"The generated answer is correct and complete. It states that there are no current plans to further reduce public transport fares, which matches the ground truth answer. Additionally, it provides extra context about existing fare initiatives and a recently published Fares Determination, which does not contradict the main point and adds useful information.\",\n \"score\": 5\n}",
:metric-llm-relevance-score 3,
:token-overlap-f1 0.2926829268292683,
:ground-truth
["There are no plans to further reduce public transport fares at this time"],
:question "Are there plans to further reduce public transport fares?",
:token-overlap-faithfulness 1.0,
:metric-llm-relevance-explanation
"{\n \"reasoning\": \"The answer directly responds to the question by stating that there are currently no plans to further reduce public transport fares. It provides additional context about the existing fare initiatives and mentions the recently published Fares Determination aiming for consistent and equitable fares. This information is relevant and fully addresses the question.\",\n \"score\": 3\n}",
:bleu-score-by-sentence
[6.682549838299777E-14 1.7938417188868533E-8 6.289332584320069E-8],
:model-ref "llama3.1",
:rouge-p-by-sentence
[0.8666666666666667 0.5833333333333334 0.7727272727272727],
:bleu-score 0.18303581288409837,
:token-overlap-recall 1.0,
:evaluator-model "gpt-4o",
:rouge-l-precision 0.2222222222222222,
:token-overlap-precision 0.17142857142857143,
:system-prompt
"I want you to act as a responsible and trustworthy senior government official.\nPlease provide an answer to a citizen's question, using only the context provided.\nAnswer as if you are talking directly to the citizen and be neutral and formal as possible.\nIf you can't find a specific detail from the question, please acknowledge this and provide any\nother helpful information that may be related to the question.\nIf you can't find sufficient information in the context to answer the question at all,\nthen reply with \"I am unable to answer this question with the information I have available.\"\n\n CONTEXT: The funding allocated by Government also supports the fare initiatives we’ve introduced since late 2021 such as – - Reducing average fares on all PSO services by 20%, helping people with cost-of-living pressures; - Introducing, and most recently expanding, the Young Adult Card, reducing fares by a further 50% for young people just as they’re making choices about how they travel independently; and - Introducing a new 90-minute fare to enable easy interchange between different types of public transport. These initiatives have made public transport more affordable for more people and have undoubtably contributed toward the strong passenger growth we’re seeing across the network. There are no plans to amend or extend the Student Leap Card.\n\nOn rail we’re also supporting improvements across the network on both commuter and inter-city services. We have also continued the 20% average PSO fare initiative and the 90-minute fare, all of which are available to those persons who have not yet reached the State pension age. However, there are no plans to further reduce public transport fares at this time.\n\n€613.463m has been secured for the provision of Public Service Obligation (PSO) and Local Link services. 
This package includes funding for the continuation of the 20% fare reduction on PSO services, the extension of the Young Adult Card on both PSO and commercial bus services to include 24-and-25-year-olds, and the continuation of the 90-minute fare until the end of 2024. Further, the recently published Fares Determination will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin.\n\nThe Government is strongly committed to providing all citizens with reliable and realistic sustainable mobility options, and public transport plays a key role in the delivery of this goal. To support this objective, Government has provided increased levels of funding for public transport services in recent years. Under Budget 2024, a funding package of c.\n\nThe report also found that any increase in public transport demand due to free fares would largely be achieved by reductions in walking (7%) and cycling (13%), resulting in just a 1% reduction in car travel. My Department’s funding for public transport services is therefore focussed on existing fare initiatives and service improvements such as BusConnects in the five cities, expanding town services in the larger towns, and rolling out Connecting Ireland in rural Ireland. These improvements are delivering real benefits for citizens across the country.",
:token-overlap-p-by-sentence
[0.8571428571428571 0.6923076923076923 0.7333333333333333],
:answer
"No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users.",
:rouge-l-f1 0.3582089520962353,
:bleu-faithfulness 2.69439366191892E-8}
Example LLM faithfulness evaluation (the score is either 1, ‘faithful’, or 0, ‘not faithful’):
(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-faithfulness-score :metric-llm-faithfulness-explanation])
    (kind/table))
model-ref | question | answer | metric-llm-faithfulness-score | metric-llm-faithfulness-explanation |
---|---|---|---|---|
llama3.1 | Are there plans to further reduce public transport fares? | No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. | 1 | { "reasoning": "The statement is fully supported by the context. It confirms that there are no current plans to further reduce public transport fares, stating that the 20% average PSO fare initiative and the 90-minute fare will continue. It also mentions the recently published Fares Determination, which aims to provide more consistent and equitable fares, aligning with the information in the context.", "score": "Yes" } |
llama3.1 | How many households were in receipt of HAP payments in 2023? | At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. | 1 | { "reasoning": "The statement is supported by the context, which provides the exact data point that at the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. Additionally, the context confirms that the latest available data covers only until the end of Q3 2023, aligning with the statement's note about the data coverage limitation.", "score": "Yes" } |
gemma3:1b | Are there plans to further reduce public transport fares? | Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. | 0 | { "reasoning": "The statement inaccurately interprets the context by suggesting there are plans to further reduce public transport fares. The context clearly states, 'there are no plans to further reduce public transport fares at this time.' The mention of 'recently published Fares Determination' refers to achieving more consistent and equitable fares, not a further reduction in fares. Additionally, it is correct that there are no plans to amend or extend the Student Leap Card, but this does not contradict the point about fare reduction plans.", "score": "No" } |
gemma3:1b | How many households were in receipt of HAP payments in 2023? | In 2023, there were 114,974 households in receipt of the HAP scheme. | 0 | { "reasoning": "The statement claims there were 114,974 households in receipt of the HAP scheme in 2023, but the context states that 114,974 HAP tenancies had been set-up since the scheme commenced. However, there were only 57,617 households actively in receipt of HAP support at the end of Q3 2023. Therefore, the statement confuses the total tenancies set up since inception with the actual number of active recipients in 2023.", "score": "No" } |
As we can see, the evaluation model correctly identified the errors in the gemma3:1b answers.
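In the table above, the explanation column holds the evaluator's raw JSON, where the verdict appears as "Yes" or "No", while the score column holds 1 or 0. A minimal sketch of how such a verdict could be mapped to a number follows. `verdict->score` is a hypothetical helper, not the notebook's own code, and it assumes the JSON always carries a `"score"` field with one of those two values.

```clojure
;; Hypothetical helper: not part of the notebook or of continuous-eval.
(defn verdict->score
  "Returns 1 for a \"Yes\" verdict, 0 for \"No\", and nil when no verdict is found."
  [explanation]
  (when-let [[_ verdict] (re-find #"\"score\"\s*:\s*\"(Yes|No)\"" explanation)]
    (if (= verdict "Yes") 1 0)))

(verdict->score "{ \"reasoning\": \"Supported by the context.\", \"score\": \"Yes\" }")
;; => 1
```

Returning `nil` for an unparseable reply, rather than guessing a score, keeps a malformed evaluator response from being counted as either faithful or unfaithful.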
Example LLM Correctness evaluation (range between 1 and 5):
(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-correctness-score :metric-llm-correctness-explanation])
    (kind/table))
model-ref | question | answer | metric-llm-correctness-score | metric-llm-correctness-explanation |
---|---|---|---|---|
llama3.1 | Are there plans to further reduce public transport fares? | No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. | 5 | { "reasoning": "The generated answer is correct and complete. It states that there are no current plans to further reduce public transport fares, which matches the ground truth answer. Additionally, it provides extra context about existing fare initiatives and a recently published Fares Determination, which does not contradict the main point and adds useful information.", "score": 5 } |
llama3.1 | How many households were in receipt of HAP payments in 2023? | At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. | 5 | { "reasoning": "The generated answer accurately provides the number of households in receipt of HAP payments at the end of Q3 2023 by stating 'nearly 57,617 households.' It also adds a note that this data is only up to the end of Q3 2023 and may not reflect the full year 2023, adding further context to the data. This additional context is a correct and informative expansion upon the ground truth answer.", "score": 5 } |
gemma3:1b | Are there plans to further reduce public transport fares? | Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. | 2 | { "reasoning": "The generated answer is incorrect because it mistakenly states that there are plans to reduce public transport fares, contrary to the ground truth answer which indicates that there are no plans for further reduction. The generated answer also provides additional information about fare consistency and equity, but these do not address the main question directly regarding reduction plans.", "score": 2 } |
gemma3:1b | How many households were in receipt of HAP payments in 2023? | In 2023, there were 114,974 households in receipt of the HAP scheme. | 2 | { "reasoning": "The generated answer is incorrect because it states that there were 114,974 households in receipt of the HAP scheme, while the ground truth answers mention that at the end of Q3 2023, there were nearly 57,617 households actively receiving HAP support. The figures are not aligned, indicating that the generated answer contains major errors.", "score": 2 } |
Example LLM Relevance evaluation (range between 1 and 3):
(-> sample-gen-responses-metrics
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-relevance-score :metric-llm-relevance-explanation])
    (kind/table))
model-ref | question | answer | metric-llm-relevance-score | metric-llm-relevance-explanation |
---|---|---|---|---|
llama3.1 | Are there plans to further reduce public transport fares? | No, there are no current plans to further reduce public transport fares at this time. The 20% average PSO fare initiative and the 90-minute fare will continue as is, but no new reductions in fares have been announced or planned. However, it's worth noting that a recently published Fares Determination aims to deliver more consistent and equitable fares for public transport users. | 3 | { "reasoning": "The answer directly responds to the question by stating that there are currently no plans to further reduce public transport fares. It provides additional context about the existing fare initiatives and mentions the recently published Fares Determination aiming for consistent and equitable fares. This information is relevant and fully addresses the question.", "score": 3 } |
llama3.1 | How many households were in receipt of HAP payments in 2023? | At the end of Q3 2023, there were nearly 57,617 households actively in receipt of HAP support. It's worth noting that this data only covers up to the end of Q3 2023 and may not reflect the exact number of households receiving HAP payments for the full year 2023. | 3 | { "reasoning": "The answer provides a specific number of households receiving HAP payments at the end of Q3 2023, which directly addresses the question about the number of households in receipt of HAP payments in 2023. It also clarifies the time frame of the data, indicating that it may not cover the entire year of 2023, thereby providing a complete and relevant response to the question.", "score": 3 } |
gemma3:1b | Are there plans to further reduce public transport fares? | Yes, according to the provided text, there are plans to further reduce public transport fares. The report states that “There are no plans to amend or extend the Student Leap Card.” It also mentions a “recently published Fares Determination” that will deliver more consistent and equitable fares for public transport users, and reduce the disparities which currently exist for some passengers, particularly in the outer commuter towns surrounding Dublin. | 2 | { "reasoning": "The answer correctly identifies that there are plans to further reduce public transport fares by mentioning a 'recently published Fares Determination' aimed at making fares more consistent and equitable. However, the reference to the Student Leap Card is irrelevant to the main question about reducing public transport fares. Thus, the answer partially answers the question, as it provides some relevant information but includes unnecessary details.", "score": 2 } |
gemma3:1b | How many households were in receipt of HAP payments in 2023? | In 2023, there were 114,974 households in receipt of the HAP scheme. | 3 | { "reasoning": "The generated answer provides a specific number of households that were in receipt of HAP payments in 2023. This directly answers the question by providing the exact information requested.", "score": 3 } |
Interestingly, even though the gemma3:1b responses were factually incorrect, they still received fairly high ‘relevance’ scores (2 and 3 out of 3) from the evaluator model. In other words, the evaluator recognises that each answer was still attempting to address the question in a ‘relevant’ manner, even though it got the facts wrong.
6.3.4 Running/Saving evaluations
(defn run-and-save-evaluation-metrics! [responses model]
  (let [model-ref (:model-ref (first responses))
        f-name (str "data/responses_evaluation/" model-ref "_evaluation.edn")
        resp (add-all-generation-evaluation-metrics responses model)]
    (spit f-name resp)))
(defn run-and-save-all-evals! [responses-dir model]
  (let [responses (->> (io/file responses-dir)
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp)))]
    (mapv #(run-and-save-evaluation-metrics! % model) responses)))
(comment
  ;; 43:55 (very roughly) to run around 15 models
  ;; cost - around 1.44 USD for 18 models * 10 questions each - 180 evaluations
  (run-and-save-all-evals! "data/responses" "o4-mini-2025-04-16"))
6.4 Exploring Performance
In this part we’ll try to compare the 20 models based on their performance across the metrics.
We’ll start by defining a few helper functions.
(defn average-coll [coll]
  (float
   (/ (apply + (remove nil? coll))
      (count (remove nil? coll)))))
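Because `average-coll` removes `nil` values from both the sum and the count, a model with a missing score is averaged over fewer answers. The sketch below repeats the function in an equivalent form so that it runs on its own; the input vectors are made-up examples.

```clojure
(defn average-coll [coll]
  (let [xs (remove nil? coll)]
    (float (/ (apply + xs) (count xs)))))

;; Ten faithfulness scores, seven of them 1.
(average-coll [1 1 1 1 1 1 1 0 0 0])   ;; => 0.7
;; The same seven 1s with one score missing: averaged over nine values.
(average-coll [1 1 1 1 1 1 1 0 0 nil]) ;; => roughly 0.778
```

This would be consistent with gpt-3.5-turbo averaging roughly 0.778 on faithfulness in the averages table below while its dashboard entry reads 7/10, although the underlying data would need checking to confirm it. Note that an empty or all-`nil` collection would cause a division by zero.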
(defn average-all-cols [numerical-ds]
  (let [cols (tc/column-names numerical-ds)]
    (tc/dataset
     (reduce (fn [res col]
               (assoc res col (average-coll (numerical-ds col))))
             {} cols))))
(defn summarise-model-performance-avgs [rag-datas]
  (let [model-ref (:model-ref (first rag-datas))]
    (-> rag-datas
        (tc/dataset)
        (tc/drop-columns #(re-find #"by-sentence" (name %)))
        (tc/select-columns :type/numerical)
        average-all-cols
        (tc/add-column :model-ref model-ref))))
(defn build-responses-eval-ds-avgs [responses-eval-dir]
  (let [responses (->> responses-eval-dir
                       io/file
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp))
                       (mapv summarise-model-performance-avgs))]
    (apply tc/concat responses)))
(def ds-performance-averages (build-responses-eval-ds-avgs "data/responses_evaluation"))
(kind/table ds-performance-averages)
rouge-l-recall | flesch-kincaid-grade-level | flesch-reading-ease | rouge-faithfulness | metric-llm-faithfulness-score | metric-llm-correctness-score | metric-llm-relevance-score | token-overlap-f1 | token-overlap-faithfulness | bleu-score | token-overlap-recall | rouge-l-precision | token-overlap-precision | cosine-similarity | rouge-l-f1 | bleu-faithfulness | model-ref |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.8239417672157288 | 14.509193420410156 | 30.045215606689453 | 0.8083333373069763 | 1.0 | 4.5 | 2.700000047683716 | 0.3923736810684204 | 0.824999988079071 | 0.2834573984146118 | 0.9157578349113464 | 0.30401644110679626 | 0.2837636470794678 | 0.8581802845001221 | 0.40897229313850403 | 3.135146835120395E-5 | gemini-2.5-flash-preview-04-17 |
0.7446075677871704 | 15.492807388305664 | 27.18585777282715 | 0.670634925365448 | 0.699999988079071 | 4.400000095367432 | 2.700000047683716 | 0.25699713826179504 | 0.6595237851142883 | 0.20455192029476166 | 0.809962272644043 | 0.18923130631446838 | 0.19086696207523346 | 0.7771462798118591 | 0.2610572874546051 | 1.724415342323482E-4 | llava |
0.6585989594459534 | 12.719018936157227 | 38.68239212036133 | 0.8999999761581421 | 0.800000011920929 | 4.0 | 2.5 | 0.48910555243492126 | 0.800000011920929 | 0.4617713987827301 | 0.7250267267227173 | 0.4208480715751648 | 0.43597525358200073 | 0.7402637004852295 | 0.46610692143440247 | 5.130228237248957E-5 | gemini-2.0-flash-lite |
0.8267883062362671 | 15.08983325958252 | 24.900646209716797 | 0.5108433365821838 | 0.8999999761581421 | 4.599999904632568 | 2.700000047683716 | 0.19189156591892242 | 0.49193307757377625 | 0.11331550776958466 | 0.873235285282135 | 0.1329944133758545 | 0.11036045849323273 | 0.8017071485519409 | 0.2255573570728302 | 4.652861025533639E-5 | mistral |
0.7401713728904724 | 13.667346954345703 | 30.989845275878906 | 0.8883333206176758 | 1.0 | 4.099999904632568 | 2.5999999046325684 | 0.43316417932510376 | 0.8666666746139526 | 0.35321152210235596 | 0.829186201095581 | 0.3343445956707001 | 0.32255521416664124 | 0.844434380531311 | 0.4301661252975464 | 2.77885210380191E-5 | gemma3:4b |
0.7641585469245911 | 16.72433853149414 | 18.24077796936035 | 0.8014285564422607 | 0.7777777910232544 | 4.5 | 2.799999952316284 | 0.27142131328582764 | 0.770714282989502 | 0.18540184199810028 | 0.832671582698822 | 0.1887160837650299 | 0.1654357612133026 | 0.8286173939704895 | 0.29786062240600586 | 3.24876127706375E-5 | gpt-3.5-turbo |
0.8675227165222168 | 13.08331298828125 | 35.59113693237305 | 0.45357003808021545 | 0.800000011920929 | 4.0 | 2.700000047683716 | 0.09177528321743011 | 0.4481297433376312 | 0.054576508700847626 | 0.8706628084182739 | 0.08819595724344254 | 0.049087464809417725 | 0.7681989669799805 | 0.1584366261959076 | 4.776632704306394E-5 | deepseek-r1 |
0.8518903255462646 | 15.363012313842773 | 29.911664962768555 | 0.8178571462631226 | 0.699999988079071 | 4.400000095367432 | 2.799999952316284 | 0.40320008993148804 | 0.8285714387893677 | 0.2871659994125366 | 0.930367648601532 | 0.2962702214717865 | 0.28042489290237427 | 0.834974467754364 | 0.4168442189693451 | 8.801817602943629E-5 | llama3.1 |
0.8944202661514282 | 13.306819915771484 | 37.606502532958984 | 0.901190459728241 | 1.0 | 4.400000095367432 | 2.5 | 0.5528241395950317 | 0.9279761910438538 | 0.4853930175304413 | 0.9163230061531067 | 0.4865003228187561 | 0.4802079498767853 | 0.8847209811210632 | 0.5746822357177734 | 2.626385321491398E-5 | gemini-2.0-flash |
0.8093876838684082 | 17.22306251525879 | 22.876859664916992 | 0.7422980070114136 | 1.0 | 4.699999809265137 | 2.9000000953674316 | 0.29043659567832947 | 0.7595959305763245 | 0.1903877556324005 | 0.9099509716033936 | 0.23230594396591187 | 0.19369524717330933 | 0.839836835861206 | 0.3286050856113434 | 0.0020474481862038374 | claude-3-haiku-20240307 |
0.6825458407402039 | 14.20469856262207 | 29.108247756958008 | 0.8657143115997314 | 0.8999999761581421 | 4.400000095367432 | 2.5999999046325684 | 0.41160085797309875 | 0.841269850730896 | 0.36301279067993164 | 0.7764495611190796 | 0.3486059010028839 | 0.34439048171043396 | 0.7830838561058044 | 0.4076654016971588 | 1.0013163773692213E-5 | gpt-4o |
0.7528297305107117 | 15.68631649017334 | 27.47216796875 | 0.6123015880584717 | 0.800000011920929 | 4.300000190734863 | 2.5 | 0.34956371784210205 | 0.6194444298744202 | 0.26341864466667175 | 0.876151978969574 | 0.26649415493011475 | 0.25289803743362427 | 0.8231343030929565 | 0.3607568144798279 | 8.60626096255146E-5 | llama3.2 |
0.8491318821907043 | 13.80518627166748 | 31.911664962768555 | 0.8536802530288696 | 1.0 | 4.699999809265137 | 2.700000047683716 | 0.38320112228393555 | 0.9280391931533813 | 0.2799951434135437 | 0.9323809742927551 | 0.29193824529647827 | 0.2698879837989807 | 0.8561334013938904 | 0.4001650810241699 | 3.265933628426865E-5 | gemini-2.5-pro-preview-03-25 |
0.6980282664299011 | 15.064505577087402 | 24.628639221191406 | 0.628333330154419 | 0.8999999761581421 | 4.699999809265137 | 2.9000000953674316 | 0.29340195655822754 | 0.5958333611488342 | 0.21232956647872925 | 0.8216220736503601 | 0.19381453096866608 | 0.1887860894203186 | 0.8139668107032776 | 0.294689804315567 | 6.333462806651369E-5 | o3-mini |
0.7963286638259888 | 14.412919998168945 | 28.09100341796875 | 0.8030555844306946 | 0.8999999761581421 | 4.400000095367432 | 2.799999952316284 | 0.37689483165740967 | 0.7830555438995361 | 0.26848381757736206 | 0.9036029577255249 | 0.2749701142311096 | 0.25850623846054077 | 0.8277978897094727 | 0.38727399706840515 | 2.2545595129486173E-5 | granite3.2 |
0.7367693185806274 | 15.385704040527344 | 23.50239372253418 | 0.7857142686843872 | 0.8999999761581421 | 4.5 | 2.700000047683716 | 0.40638965368270874 | 0.7749999761581421 | 0.30492275953292847 | 0.8624509572982788 | 0.2630685865879059 | 0.2982247471809387 | 0.8380616903305054 | 0.3672952651977539 | 2.1593075871351175E-5 | gpt-4o-mini |
0.7743207216262817 | 18.492828369140625 | 16.849618911743164 | 0.8702597618103027 | 0.8999999761581421 | 4.599999904632568 | 2.799999952316284 | 0.2994838356971741 | 0.885064959526062 | 0.22269850969314575 | 0.8431162238121033 | 0.2348792999982834 | 0.2037002444267273 | 0.8248969316482544 | 0.3347106873989105 | 0.0013325664913281798 | claude-3-5-haiku-20241022 |
0.8427241444587708 | 17.876224517822266 | 21.390300750732422 | 0.8999999761581421 | 0.8999999761581421 | 4.300000190734863 | 2.799999952316284 | 0.3259376883506775 | 0.8999999761581421 | 0.2285032570362091 | 0.919766902923584 | 0.24280297756195068 | 0.21922659873962402 | 0.8467821478843689 | 0.3485448658466339 | 0.0023539727553725243 | claude-3-7-sonnet-20250219 |
0.6483379006385803 | 14.382723808288574 | 31.3478946685791 | 0.49647414684295654 | 0.20000000298023224 | 2.4000000953674316 | 2.200000047683716 | 0.21258415281772614 | 0.5069719552993774 | 0.17131498456001282 | 0.6721078157424927 | 0.18695475161075592 | 0.16818435490131378 | 0.733691394329071 | 0.24562521278858185 | 2.699718461371958E-4 | gemma3:1b |
0.6876193284988403 | 15.509061813354492 | 28.586748123168945 | 0.7070617079734802 | 0.8999999761581421 | 4.800000190734863 | 2.9000000953674316 | 0.36989516019821167 | 0.6689935326576233 | 0.32387083768844604 | 0.7959415316581726 | 0.3038523495197296 | 0.3066492974758148 | 0.8207412958145142 | 0.35781922936439514 | 8.054783684201539E-4 | o4-mini-2025-04-16 |
(defn concat-responses-eval-data [responses-eval-dir]
  (let [responses (->> responses-eval-dir
                       io/file
                       file-seq
                       rest
                       (mapv (comp edn/read-string slurp)))]
    (reduce into responses)))
(defn add-model-platform [ds]
  (-> ds
      (tc/map-columns :platform [:model-ref]
                      (fn [m]
                        (->
                         (filter #(= (:model-ref %) m) llm/llm-models)
                         first
                         :platform)))))
(defn concat-responses-eval-ds-narrowed [responses-eval-dir]
  (let [ds (tc/dataset (concat-responses-eval-data responses-eval-dir))]
    (-> ds
        (tc/select-columns
         (concat
          (tc/column-names ds :type/numerical)
          [:model-ref :question])))))
(def responses-eval-data (concat-responses-eval-data "data/responses_evaluation"))
(def ds-responses-eval-narrowed (concat-responses-eval-ds-narrowed "data/responses_evaluation"))
(defn make-boxplot [metric]
  (->
   ds-responses-eval-narrowed
   add-model-platform
   (tc/order-by :model-ref)
   (plotly/base
    {:=width 800
     :=color :platform})
   (plotly/layer-boxplot
    {:=x :model-ref
     :=y metric})))
6.4.1 Deterministic Metrics (non-llm)
6.4.1.1 Reading Ease
The flesch-kincaid-grade-level and flesch-reading-ease metrics indicate how readable a response is. A lower grade level and a higher reading-ease score both mean the text is easier to read.
(make-boxplot :flesch-reading-ease)
(make-boxplot :flesch-kincaid-grade-level)
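For reference, both readability scores are simple functions of the average sentence length and the average number of syllables per word. The sketch below uses the standard published Flesch formulas and takes the word, sentence and syllable counts as arguments, because counting them (syllables especially) differs between implementations; it makes no claim about how continuous-eval tokenises text. The counts in the example calls are assumed values for one short sentence.

```clojure
(defn flesch-reading-ease [words sentences syllables]
  (- 206.835
     (* 1.015 (/ words (double sentences)))
     (* 84.6 (/ syllables (double words)))))

(defn flesch-kincaid-grade-level [words sentences syllables]
  (+ (* 0.39 (/ words (double sentences)))
     (* 11.8 (/ syllables (double words)))
     -15.59))

;; Assumed counts for one short sentence: 13 words, 19 syllables.
(flesch-reading-ease 13 1 19)        ;; => roughly 69.99
(flesch-kincaid-grade-level 13 1 19) ;; => roughly 6.73
```

Longer sentences and longer words lower the reading-ease score and raise the grade level, which is why the two box plots tend to mirror each other. Reading ease can also drop below zero for dense text, as the lowest-scoring answer in the examples that follow shows.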
Examples of the lowest and highest reading-ease answers
(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/order-by :flesch-reading-ease)
    (tc/select-rows (range 1)))
_unnamed [1 2]:
:flesch-reading-ease | :answer |
---|---|
-9.58022523 | Based on the information provided, I cannot give a specific answer about whether the government will put in place Level 6 (QQI) courses for healthcare assistants. |
The context indicates that currently: | |
- The educational requirement for a Health Care Assistant is a Level 5 (QQI) qualification | |
- The Office of Nursing and Midwifery Services Director (ONMSD) designs and delivers professional development programs ranging from level 5 to level 8 | |
- Healthcare assistants are supported through ONMSD Centers to achieve the Level 5 Major award in Health Service Skills | |
While there is information about support for jobseekers pursuing higher education up to Level 8 and the ongoing revalidation of craft programs, there is no specific mention of plans to develop Level 6 courses specifically for healthcare assistants. | |
If you require more information about progression pathways for healthcare assistants beyond Level 5, I would recommend contacting the ONMSD or the relevant department directly. |
(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/order-by :flesch-reading-ease :desc)
    (tc/select-rows (range 1)))
_unnamed [1 2]:
:flesch-reading-ease | :answer |
---|---|
69.99384615 | In 2023, there were 114,974 households in receipt of the HAP scheme. |
Let’s try a high reading-ease answer with more than 100 words…
(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:flesch-reading-ease :answer])
    (tc/map-columns :wc [:answer] (fn [ans]
                                    (-> (str/split ans #"\w+")
                                        (count))))
    (tc/select-rows #(> (:wc %) 100))
    (tc/order-by :flesch-reading-ease :desc)
    (tc/select-rows (range 1)))
_unnamed [1 3]:
:flesch-reading-ease | :answer | :wc |
---|---|---|
47.57846154 | According to the provided context, a total of 161,600 children benefit from the School Transport Scheme. Here is the breakdown: | 118 |
1. Over 135,000 pupils are traveling on primary and post-primary services. | ||
2. 19,800 pupils have special educational needs. | ||
3. 6,800 pupils who have arrived in Ireland from Ukraine are included in this number as well. | ||
It’s important to note that these numbers might overlap for students with special educational needs who also travel on primary or post-primary services. The context does not provide a separate total for students with special educational needs and those from Ukraine, but it does give the individual numbers within the overall figure of 161,600. |
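The `:wc` column in the snippet above comes from splitting each answer on `#"\w+"`, that is, on runs of word characters. A small sketch of what that split returns, compared with splitting on whitespace, is given here; `sample` is a made-up sentence.

```clojure
(require '[clojure.string :as str])

(def sample "Over 135,000 pupils travel daily.")

;; Splitting on runs of word characters returns the separators between them
;; (plus a leading empty string), so the count is only loosely a word count.
(count (str/split sample #"\w+"))  ;; => 7

;; Splitting on whitespace counts whitespace-delimited words.
(count (str/split sample #"\s+"))  ;; => 5
```

The `:wc` value of 118 is therefore best read as an approximate length. That is enough for filtering out very short answers, which is all it is used for here.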
6.4.1.2 Precision
(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-precision})
    (plotly/layer-bar
     {:=y :rouge-l-precision}))
6.4.1.3 Recall
(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-recall})
    (plotly/layer-bar
     {:=y :rouge-l-recall}))
6.4.1.4 Precision/Recall (F1)
(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :rouge-l-f1}))
(-> ds-performance-averages
    add-model-platform
    (plotly/base
     {:=width 800
      :=color :platform
      :=x :model-ref})
    (plotly/layer-bar
     {:=y :token-overlap-f1}))
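The charts in the three preceding subsections plot precision, recall and F1 for the token-overlap and ROUGE-L metrics. As a rough illustration of how the token-overlap variants relate to each other, the sketch below computes them for a pair of strings. It is a simplified version written for this explanation (lower-casing and splitting on non-word characters are assumptions), not the continuous-eval implementation, and the two example strings are made up.

```clojure
(require '[clojure.string :as str])

(defn tokens [s]
  (->> (str/split (str/lower-case s) #"\W+")
       (remove str/blank?)))

(defn token-overlap
  "Precision, recall and F1 of the tokens shared by a generated answer and a reference."
  [generated reference]
  (let [gen       (frequencies (tokens generated))
        ref       (frequencies (tokens reference))
        ;; each shared token counts at most as often as it appears in the reference
        overlap   (reduce + (map (fn [[t n]] (min n (get ref t 0))) gen))
        gen-n     (reduce + (vals gen))
        ref-n     (reduce + (vals ref))
        precision (if (pos? gen-n) (/ overlap gen-n) 0)
        recall    (if (pos? ref-n) (/ overlap ref-n) 0)
        f1        (if (pos? (+ precision recall))
                    (/ (* 2 precision recall) (+ precision recall))
                    0)]
    {:precision (double precision)
     :recall    (double recall)
     :f1        (double f1)}))

(token-overlap "There are no plans to reduce fares at this time"
               "There are no plans to reduce fares")
;; => precision 0.7, recall 1.0, F1 roughly 0.82
```

A long answer that contains everything in the reference keeps recall high while precision falls. That is one plausible reading of the pattern for some models in the averages table, for example deepseek-r1, with token-overlap recall of about 0.87 and precision of about 0.05.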
6.4.2 LLM Generated Metrics
6.4.2.1 Faithfulness
(defn make-bar-avgs [metric]
  (->
   ds-performance-averages
   add-model-platform
   (tc/order-by metric)
   (plotly/base
    {:=width 800
     :=color :platform})
   (plotly/layer-bar
    {:=x :model-ref
     :=y metric})))
(make-bar-avgs :metric-llm-faithfulness-score)
6.4.2.2 Correctness
(make-bar-avgs :metric-llm-correctness-score)
6.4.2.3 Relevance
(make-bar-avgs :metric-llm-relevance-score)
6.4.3 Individual Performances
Let’s make a simple ‘dashboard’ type view to try to get a sense of each model’s performance at a glance.
We’ll introduce an ‘indicator’ marker to show if the model is performing okay for a metric. Perhaps in an actual evaluation system this could be some kind of target threshold that the model should meet. In this case, we’ll just use the averages of all the model performances, so that the indicator will simply indicate if the metric is above/below average.
(defn average-score [ds metrics]
  (->>
   (mapv #(ds %) metrics)
   (reduce into)
   (average-coll)))
(defn eval-averages [ds]
  (-> ds
      (tc/aggregate {:faithfulness #(average-score % [:metric-llm-faithfulness-score])
                     :correctness #(average-score % [:metric-llm-correctness-score])
                     :relevance #(average-score % [:metric-llm-relevance-score])
                     :semantic-similarity #(average-score % [:cosine-similarity])
                     :recall #(average-score % [:token-overlap-recall
                                                :rouge-l-recall])
                     :precision #(average-score % [:rouge-l-precision
                                                   :token-overlap-precision])
                     :f1 #(average-score % [:rouge-l-f1
                                            :token-overlap-f1])})
      (tc/rows :as-maps)
      first))
(def eval-averages-all (eval-averages ds-responses-eval-narrowed))
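`average-score` pools the values of several metric columns into one collection before averaging, so the Recall, Precision and F1 figures each combine the ROUGE-L and token-overlap variants. The sketch below shows that pooling on plain vectors instead of dataset columns, next to a mean of per-column means. `pooled-average` and `mean-of-means` are illustrative names, and the numbers are made up.

```clojure
(defn pooled-average
  "Averages every non-nil value across all columns as one collection."
  [columns]
  (let [xs (remove nil? (apply concat columns))]
    (/ (reduce + xs) (double (count xs)))))

(defn mean-of-means
  "Averages each column separately, then averages those results."
  [columns]
  (let [means (map #(pooled-average [%]) columns)]
    (/ (reduce + means) (count means))))

(pooled-average [[0.8 0.9] [0.6 nil]]) ;; => roughly 0.767
(mean-of-means  [[0.8 0.9] [0.6 nil]]) ;; => roughly 0.725
```

The two agree when every column has the same number of non-nil values, and drift apart when one metric has missing scores.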
(defn indicator-symbol [colour]
  [:span {:style (str "color: " colour ";")} "⬤"])
(def indicator-bad (indicator-symbol "red"))
(def indicator-medium (indicator-symbol "yellow"))
(def indicator-good (indicator-symbol "green"))
The indicator colours are assigned as follows:
Above target: green
At or below target, but within 10% of it: amber
More than 10% below target: red
(defn make-indicator-symbol [value target-value]
  (if (> value target-value) indicator-good
      (let [diff (abs (- target-value value))
            diff-percent (float (/ diff target-value))]
        (if (<= diff-percent 0.1)
          indicator-medium indicator-bad))))
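To make the thresholds concrete, here is a small stand-alone sketch of the same rule that returns a keyword instead of a hiccup element. `indicator-colour` is an illustrative name, and the target used in the calls is the faithfulness reference average reported in the dashboard output.

```clojure
(defn indicator-colour [value target]
  (cond
    (> value target) :green
    ;; not above target: compare the shortfall with 10% of the target
    (<= (/ (- target value) (double target)) 0.1) :amber
    :else :red))

(indicator-colour 1.0 0.8492462) ;; => :green
(indicator-colour 0.8 0.8492462) ;; => :amber (about 5.8% below target)
(indicator-colour 0.7 0.8492462) ;; => :red   (about 17.6% below target)
```

A value exactly equal to the target is not strictly greater than it, so it is marked amber rather than green.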
(defn model-performance-summary [ds model-ref]
  (let [model-per (filter #(= (:model-ref %) model-ref) ds)
        faithfulness (count (filter #(= (:metric-llm-faithfulness-score %) 1) model-per))
        total-questions (count model-per)
        {:keys [correctness
                relevance
                semantic-similarity
                recall
                precision
                f1]} (-> model-per tc/dataset eval-averages)]
    [:div
     [:h1 (name model-ref)]
     [:p (str "Scores based on " total-questions " evaluation questions.")]
     [:table {:style "width: 70%;"}
      [:tr
       [:th "Metric"]
       [:th "Score"]
       [:th "Reference Average"]
       [:th "Status"]]
      [:tr
       [:td "Faithfulness"]
       [:td (str faithfulness "/" total-questions)]
       [:td (:faithfulness eval-averages-all)]
       [:td (make-indicator-symbol (/ faithfulness total-questions) (:faithfulness eval-averages-all))]]
      [:tr
       [:td "Correctness"]
       [:td correctness]
       [:td (:correctness eval-averages-all)]
       [:td (make-indicator-symbol correctness
                                   (:correctness eval-averages-all))]]
      [:tr
       [:td "Relevance"]
       [:td relevance]
       [:td (:relevance eval-averages-all)]
       [:td (make-indicator-symbol relevance
                                   (:relevance eval-averages-all))]]
      [:tr
       [:td "Semantic Similarity"]
       [:td semantic-similarity]
       [:td (:semantic-similarity eval-averages-all)]
       [:td (make-indicator-symbol semantic-similarity
                                   (:semantic-similarity eval-averages-all))]]
      [:tr
       [:td "Recall"]
       [:td recall]
       [:td (:recall eval-averages-all)]
       [:td (make-indicator-symbol recall
                                   (:recall eval-averages-all))]]
      [:tr
       [:td "Precision"]
       [:td precision]
       [:td (:precision eval-averages-all)]
       [:td (make-indicator-symbol precision
                                   (:precision eval-averages-all))]]
      [:tr
       [:td "F1"]
       [:td f1]
       [:td (:f1 eval-averages-all)]
       [:td (make-indicator-symbol f1
                                   (:f1 eval-averages-all))]]]]))
(mapv #(kind/hiccup (model-performance-summary responses-eval-data %))
      (distinct (map :model-ref responses-eval-data)))
[
gemini-2.5-flash-preview-04-17
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 10/10 | 0.8492462 | ⬤ |
Correctness | 4.5 | 4.335 | ⬤ |
Relevance | 2.7 | 2.69 | ⬤ |
Semantic Similarity | 0.8581803 | 0.8173185 | ⬤ |
Recall | 0.8698498 | 0.8116715 | ⬤ |
Precision | 0.29389003 | 0.25759077 | ⬤ |
F1 | 0.40067297 | 0.34687445 | ⬤ |
llava
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 7/10 | 0.8492462 | ⬤ |
Correctness | 4.4 | 4.335 | ⬤ |
Relevance | 2.7 | 2.69 | ⬤ |
Semantic Similarity | 0.7771463 | 0.8173185 | ⬤ |
Recall | 0.7772849 | 0.8116715 | ⬤ |
Precision | 0.19004913 | 0.25759077 | ⬤ |
F1 | 0.2590272 | 0.34687445 | ⬤ |
gemini-2.0-flash-lite
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 8/10 | 0.8492462 | ⬤ |
Correctness | 4.0 | 4.335 | ⬤ |
Relevance | 2.5 | 2.69 | ⬤ |
Semantic Similarity | 0.7402637 | 0.8173185 | ⬤ |
Recall | 0.6918128 | 0.8116715 | ⬤ |
Precision | 0.42841166 | 0.25759077 | ⬤ |
F1 | 0.47760624 | 0.34687445 | ⬤ |
mistral
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.6 | 4.335 | ⬤ |
Relevance | 2.7 | 2.69 | ⬤ |
Semantic Similarity | 0.80170715 | 0.8173185 | ⬤ |
Recall | 0.85001177 | 0.8116715 | ⬤ |
Precision | 0.121677436 | 0.25759077 | ⬤ |
F1 | 0.20872447 | 0.34687445 | ⬤ |
gemma3:4b
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 10/10 | 0.8492462 | ⬤ |
Correctness | 4.1 | 4.335 | ⬤ |
Relevance | 2.6 | 2.69 | ⬤ |
Semantic Similarity | 0.8444344 | 0.8173185 | ⬤ |
Recall | 0.78467876 | 0.8116715 | ⬤ |
Precision | 0.3284499 | 0.25759077 | ⬤ |
F1 | 0.43166515 | 0.34687445 | ⬤ |
gpt-3.5-turbo
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 7/10 | 0.8492462 | ⬤ |
Correctness | 4.5 | 4.335 | ⬤ |
Relevance | 2.8 | 2.69 | ⬤ |
Semantic Similarity | 0.8286174 | 0.8173185 | ⬤ |
Recall | 0.79841506 | 0.8116715 | ⬤ |
Precision | 0.17707592 | 0.25759077 | ⬤ |
F1 | 0.28464097 | 0.34687445 | ⬤ |
deepseek-r1
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 8/10 | 0.8492462 | ⬤ |
Correctness | 4.0 | 4.335 | ⬤ |
Relevance | 2.7 | 2.69 | ⬤ |
Semantic Similarity | 0.76819897 | 0.8173185 | ⬤ |
Recall | 0.86909276 | 0.8116715 | ⬤ |
Precision | 0.06864171 | 0.25759077 | ⬤ |
F1 | 0.12510595 | 0.34687445 | ⬤ |
llama3.1
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 7/10 | 0.8492462 | ⬤ |
Correctness | 4.4 | 4.335 | ⬤ |
Relevance | 2.8 | 2.69 | ⬤ |
Semantic Similarity | 0.83497447 | 0.8173185 | ⬤ |
Recall | 0.89112896 | 0.8116715 | ⬤ |
Precision | 0.28834757 | 0.25759077 | ⬤ |
F1 | 0.41002214 | 0.34687445 | ⬤ |
gemini-2.0-flash
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 10/10 | 0.8492462 | ⬤ |
Correctness | 4.4 | 4.335 | ⬤ |
Relevance | 2.5 | 2.69 | ⬤ |
Semantic Similarity | 0.884721 | 0.8173185 | ⬤ |
Recall | 0.90537167 | 0.8116715 | ⬤ |
Precision | 0.48335415 | 0.25759077 | ⬤ |
F1 | 0.5637532 | 0.34687445 | ⬤ |
claude-3-haiku-20240307
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 10/10 | 0.8492462 | ⬤ |
Correctness | 4.7 | 4.335 | ⬤ |
Relevance | 2.9 | 2.69 | ⬤ |
Semantic Similarity | 0.83983684 | 0.8173185 | ⬤ |
Recall | 0.8596693 | 0.8116715 | ⬤ |
Precision | 0.2130006 | 0.25759077 | ⬤ |
F1 | 0.30952084 | 0.34687445 | ⬤ |
gpt-4o
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.4 | 4.335 | ⬤ |
Relevance | 2.6 | 2.69 | ⬤ |
Semantic Similarity | 0.78308386 | 0.8173185 | ⬤ |
Recall | 0.72949773 | 0.8116715 | ⬤ |
Precision | 0.3464982 | 0.25759077 | ⬤ |
F1 | 0.40963313 | 0.34687445 | ⬤ |
llama3.2
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 8/10 | 0.8492462 | ⬤ |
Correctness | 4.3 | 4.335 | ⬤ |
Relevance | 2.5 | 2.69 | ⬤ |
Semantic Similarity | 0.8231343 | 0.8173185 | ⬤ |
Recall | 0.81449085 | 0.8116715 | ⬤ |
Precision | 0.2596961 | 0.25759077 | ⬤ |
F1 | 0.35516027 | 0.34687445 | ⬤ |
gemini-2.5-pro-preview-03-25
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 10/10 | 0.8492462 | ⬤ |
Correctness | 4.7 | 4.335 | ⬤ |
Relevance | 2.7 | 2.69 | ⬤ |
Semantic Similarity | 0.8561334 | 0.8173185 | ⬤ |
Recall | 0.8907564 | 0.8116715 | ⬤ |
Precision | 0.2809131 | 0.25759077 | ⬤ |
F1 | 0.3916831 | 0.34687445 | ⬤ |
o3-mini
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.7 | 4.335 | ⬤ |
Relevance | 2.9 | 2.69 | ⬤ |
Semantic Similarity | 0.8139668 | 0.8173185 | ⬤ |
Recall | 0.75982517 | 0.8116715 | ⬤ |
Precision | 0.1913003 | 0.25759077 | ⬤ |
F1 | 0.29404587 | 0.34687445 | ⬤ |
granite3.2
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.4 | 4.335 | ⬤ |
Relevance | 2.8 | 2.69 | ⬤ |
Semantic Similarity | 0.8277979 | 0.8173185 | ⬤ |
Recall | 0.8499658 | 0.8116715 | ⬤ |
Precision | 0.26673818 | 0.25759077 | ⬤ |
F1 | 0.3820844 | 0.34687445 | ⬤ |
gpt-4o-mini
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.5 | 4.335 | ⬤ |
Relevance | 2.7 | 2.69 | ⬤ |
Semantic Similarity | 0.8380617 | 0.8173185 | ⬤ |
Recall | 0.79961014 | 0.8116715 | ⬤ |
Precision | 0.28064665 | 0.25759077 | ⬤ |
F1 | 0.38684246 | 0.34687445 | ⬤ |
claude-3-5-haiku-20241022
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.6 | 4.335 | ⬤ |
Relevance | 2.8 | 2.69 | ⬤ |
Semantic Similarity | 0.82489693 | 0.8173185 | ⬤ |
Recall | 0.8087185 | 0.8116715 | ⬤ |
Precision | 0.21928978 | 0.25759077 | ⬤ |
F1 | 0.31709728 | 0.34687445 | ⬤ |
claude-3-7-sonnet-20250219
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.3 | 4.335 | ⬤ |
Relevance | 2.8 | 2.69 | ⬤ |
Semantic Similarity | 0.84678215 | 0.8173185 | ⬤ |
Recall | 0.88124555 | 0.8116715 | ⬤ |
Precision | 0.23101479 | 0.25759077 | ⬤ |
F1 | 0.3372413 | 0.34687445 | ⬤ |
gemma3:1b
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 2/10 | 0.8492462 | ⬤ |
Correctness | 2.4 | 4.335 | ⬤ |
Relevance | 2.2 | 2.69 | ⬤ |
Semantic Similarity | 0.7336914 | 0.8173185 | ⬤ |
Recall | 0.6602229 | 0.8116715 | ⬤ |
Precision | 0.17756955 | 0.25759077 | ⬤ |
F1 | 0.22910468 | 0.34687445 | ⬤ |
o4-mini-2025-04-16
Scores based on 10 evaluation questions.
Metric | Score | Reference Average | Status |
Faithfulness | 9/10 | 0.8492462 | ⬤ |
Correctness | 4.8 | 4.335 | ⬤ |
Relevance | 2.9 | 2.69 | ⬤ |
Semantic Similarity | 0.8207413 | 0.8173185 | ⬤ |
Recall | 0.74178046 | 0.8116715 | ⬤ |
Precision | 0.30525082 | 0.25759077 | ⬤ |
F1 | 0.36385718 | 0.34687445 | ⬤ |
]
6.4.4 Evaluating the Evaluation Dataset
As a last step, let’s have a quick look to see if the metrics can tell us anything about our evaluation dataset itself.
For example, which question has the most wrong (non-faithful) answers?
(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:question :metric-llm-faithfulness-score])
    (tc/drop-missing :metric-llm-faithfulness-score) ;; There is actually one missing here...
    (tc/group-by [:question])
    (tc/aggregate {:total-correct #(apply + (% :metric-llm-faithfulness-score))})
    (tc/order-by :total-correct))
_unnamed [10 2]:
:question | :total-correct |
---|---|
Will the government put in place Level 6 (QQI) courses for healthcare assistants? | 11 |
What is the salary scale for an archaeologist in the local government sector? | 16 |
How many pupils in total benefit from the School Transport Scheme (STS)? Can you also break down the numbers in terms of students with special educational needs and students who are from Ukraine? | 17 |
How many households were in receipt of HAP payments in 2023? | 17 |
What is the government doing with regard to the National Drugs Strategy? | 17 |
What is the government doing to help improve GP services? | 17 |
How is the government encouraging local authorities to apply for the town and village renewal scheme? | 17 |
What is the punishment for someone driving without a driving licence? | 18 |
Are there plans to further reduce public transport fares? | 19 |
How many people are eligible for payments under the Monther and Baby Homes scheme? | 20 |
The question about healthcare assistants only had 11/20 correct answers. This is unsurprising in retrospect, as even I had trouble understanding this original question/answer.
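Since one faithfulness score is missing, the totals above may not all be out of the same number of scored answers. A minimal sketch of the same aggregation in plain Clojure, reporting a rate alongside the raw count, is shown here. The rows in `toy-responses` and the function name `faithfulness-by-question` are hypothetical stand-ins, not part of the notebook; only the column names are taken from the code above.

```clojure
;; Toy rows standing in for responses-eval-data (hypothetical values).
(def toy-responses
  [{:question "Q1" :metric-llm-faithfulness-score 1}
   {:question "Q1" :metric-llm-faithfulness-score 0}
   {:question "Q1" :metric-llm-faithfulness-score nil} ; missing score
   {:question "Q2" :metric-llm-faithfulness-score 1}
   {:question "Q2" :metric-llm-faithfulness-score 1}])

(defn faithfulness-by-question
  "Returns one map per question with the number of faithful answers,
  the number of scored answers, and their ratio, lowest ratio first."
  [rows]
  (->> rows
       ;; Drop rows without a score, like tc/drop-missing above.
       (filter #(some? (:metric-llm-faithfulness-score %)))
       (group-by :question)
       (map (fn [[question answers]]
              (let [total (reduce + (map :metric-llm-faithfulness-score answers))
                    n     (count answers)]
                {:question      question
                 :total-correct total
                 :scored        n
                 :rate          (/ total n)})))
       (sort-by :rate)))

(faithfulness-by-question toy-responses)
```

With the toy data, "Q1" is sorted first because only one of its two scored answers is faithful, and the row with the missing score is not counted against it.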
Let’s look at a few examples, along with the evaluator’s reasoning, for the lowest-scoring question:
(-> responses-eval-data
    (tc/dataset)
    (tc/select-columns [:model-ref :question :answer :metric-llm-faithfulness-score :metric-llm-faithfulness-explanation])
    (tc/drop-missing :metric-llm-faithfulness-score)
    (tc/select-rows #(and (= (:question %) "Will the government put in place Level 6 (QQI) courses for healthcare assistants?")
                          (= (:metric-llm-faithfulness-score %) 0)))
    (tc/select-columns [:model-ref :answer :metric-llm-faithfulness-explanation]))
_unnamed [9 3]:
:model-ref | :answer | :metric-llm-faithfulness-explanation |
---|---|---|
llava | As a senior government official, I am responsible and trustworthy in providing accurate and reliable information to citizens. Regarding your question on whether the government will put in place QQI level 6 courses for healthcare assistants, it is important to note that decisions regarding the introduction of new educational programmes are made by relevant stakeholders, such as Higher Education Institutions (HEIs) and education authorities. The Department does not place a cap on the number of places in HEIs, which means the decision to introduce new courses rests with the institutions themselves based on their capacity and academic criteria. If you have any further questions or concerns about educational requirements for healthcare assistants, please do not hesitate to contact me. | { “reasoning”: “The statement makes claims not found in the context—such as the speaker’s trustworthiness as a senior official and that decisions to introduce new QQI level 6 courses rest solely with HEIs and education authorities. The context only specifies existing Level 5 requirements, QQI’s role in revalidation, and that HEIs set their own intake numbers; it does not address who decides the introduction of new programmes or affirm the speaker’s personal credibility.”, “score”: “No” } |
gemini-2.0-flash-lite | I am unable to answer this question with the information I have available. | { “reasoning”: “The statement is a meta-comment about inability to answer, which is not a factual claim supported by the provided context. The context actually provides substantial information and does not indicate any inability to answer questions.”, “score”: “No” } |
gpt-3.5-turbo | At this time, the current educational requirement for a Health Care Assistant as per HSE eligibility criteria is the relevant Health Skills Level 5 (QQI) qualification. The purpose of this qualification is to provide learners with the skills and knowledge necessary to work under supervision in healthcare settings. While there are currently no Level 6 courses specifically designated for healthcare assistants, it is important to note that programme content is periodically updated to incorporate technical or legislative changes. The Government is committed to expanding healthcare education opportunities, and any future developments in the qualification requirements for healthcare assistants would be communicated through official channels and in alignment with educational needs in the sector. If you are interested in furthering your healthcare education beyond Level 5, exploring higher education courses or professional development opportunities may be beneficial. Thank you for your inquiry. | { “reasoning”: “While the statement correctly references the Level 5 (QQI) requirement and its purpose (working under supervision) and aligns with the fact that programme content is periodically updated, it introduces unsupported claims—specifically that there are no Level 6 courses for healthcare assistants and that future changes will be communicated through official channels in a particular way. These points are not stated or implied in the context.”, “score”: “No” } |
deepseek-r1 | Okay, I’m trying to figure out if the government will put in place Level 6 (QQI) courses for healthcare assistants based on the provided context. Let me read through the context again. The context mentions that the HSE eligibility criteria require a Health Skills Level 5 (QQI) qualification. It talks about the Major Health Service Skills Award (HSSA), which is Level 5, and the ONMSD’s role in providing continuous professional development and clinical leadership through various educational programs up to Level 8. Looking at the training for Health Care Assistants, it says they can support the nursing and midwifery function through the ONMSD Centres of Nursing and Midwifery Education with the achievement of a QQI Level 5 award. The total credit value is 120 ECTS across 8 modules. For higher education, jobseekers can go up to an honours degree (QQI level 8) but not beyond that for masters or postgrad levels. They mention the Professional Masters in Education as Level 9, but it doesn’t support master’s level studies. The context also talks about expanding third-level places in healthcare but notes that HEIs decide their own capacity without government capping them. So, where does Level 6 come into play? The QQI system goes up to Level 8 for awards. There’s no mention of Level 6 specifically being introduced or planned. The highest level mentioned is Level 5 (HSSA) and higher degrees up to Level 9 in professional master’s but not beyond that. The context doesn’t indicate any recent announcements or plans about introducing Level 6 courses for healthcare assistants. It mainly focuses on existing qualifications like Level 5, honours degrees, and postgraduate programs. There’s no information suggesting the government is moving towards a Level 6 qualification specifically for this role. Therefore, based on what’s provided, there’s no evidence that the government has put in place or plans to introduce Level 6 courses for healthcare assistants. Based on the context provided, the government does not currently have any Level 6 (QQI) courses designated for healthcare assistants. The information available refers to existing qualifications at Level 5 and higher degrees up to Level 8, as well as postgraduate programs like the Professional Master’s in Education at Level 9. There is no mention of Level 6 courses being introduced or planned specifically for this role. Answer: I am unable to answer this question with the information I have available. | { “reasoning”: “The context clearly states existing qualifications (Level 5, up to Level 8, and postgraduate Level 9) and makes no mention of any Level 6 courses, so one can conclude that no Level 6 courses are designated for healthcare assistants. Thus it is possible to answer based on the information provided, making the claim of being unable to answer unsupported.”, “score”: “No” } |
gpt-4o | I am unable to answer this question with the information I have available. | { “reasoning”: “The statement is a meta-comment about the assistant’s ability and is not supported or addressed by any factual information in the provided context.”, “score”: “No” } |
llama3.2 | The context provided doesn’t explicitly mention the possibility of introducing Level 6 (QQI) courses specifically for healthcare assistants. However, it does highlight the Office of Nursing and Midwifery Services Director’s focus on continuous professional development and clinical leadership at various levels, including level 5. That being said, I can tell you that there have been discussions and proposals regarding expanding higher education opportunities in the healthcare sector, including the creation of Level 6 (QQI) courses. However, I couldn’t find any specific information or confirmation from the government on whether such courses are currently planned or being considered for introduction. It’s also worth noting that the Department is prioritizing ensuring an expanded number of third-level places in healthcare programmes, which could potentially lead to more opportunities for healthcare assistants to pursue further education and training. However, the specifics of any proposed Level 6 (QQI) courses, if they were to be introduced, are not mentioned in this context. I recommend keeping an eye on official government announcements or updates from relevant authorities, such as the Office of Nursing and Midwifery Services Director, for any developments on this matter. | { “reasoning”: “The statement introduces information about discussions and proposals for Level 6 (QQI) courses that are not mentioned or supported by the provided context. While the context does note ONMSD programmes spanning levels 5 to 8 and the priority on expanding third‑level places, it does not reference any plans or discussions to introduce Level 6 courses specifically for healthcare assistants.”, “score”: “No” } |
granite3.2 | The context provided does not explicitly mention the introduction of Level 6 (QQI) courses specifically for Health Care Assistants. However, it does highlight that the Department prioritizes expanding third level places in healthcare programmes. These could potentially include Level 6 courses. For the most accurate information, I would recommend contacting the relevant Higher Education Institutions or the Department directly. They can provide updates on specific course offerings and any plans for introducing Level 6 (QQI) courses for Health Care Assistants. | { “reasoning”: “The context does not mention the introduction of Level 6 (QQI) courses for Health Care Assistants and only states that the Department prioritizes expanding third‑level places. The suggestion that Level 6 courses could be included and the recommendation to contact institutions for updates go beyond what the context explicitly supports.”, “score”: “No” } |
gpt-4o-mini | I am unable to answer this question with the information I have available. However, I can inform you that the current educational requirement for Health Care Assistants is the Level 5 Major Health Service Skills Award (QQI), and there is an ongoing focus on providing professional development and pathways for further education in healthcare. If you have further inquiries or require specific updates regarding potential Level 6 courses, I recommend reaching out to the relevant educational institutions or governing bodies for the most accurate information. | { “reasoning”: “The statement begins by claiming an inability to answer, but the context clearly provides the educational requirement. While the requirement (Level 5 Major Health Service Skills Award) and focus on professional development and further education are supported, the speaker’s claim of being unable to answer is not. Therefore, the statement is not fully supported by the context.”, “score”: “No” } |
gemma3:1b | That’s a very important and complex question, and it’s one we’re actively considering. As of today, November 2, 2023, the government has stated that Level 6 (QQI) courses for Healthcare Assistants are being considered for implementation. However, it’s crucial to understand the timeline and the scope of this consideration. The government has outlined a phased approach: * Phase 1 (2024-2025): Initial assessment and planning for Level 6 courses. This will involve a thorough review of existing training programs and identifying potential gaps. * Phase 2 (2026-2027): Development of the first Level 6 courses, focusing on key skills and competencies required for a wider range of healthcare roles. * Phase 3 (2028 onwards): Expansion and refinement of the Level 6 framework based on feedback and evolving industry needs. Currently, there’s no definitive date for when these courses will be rolled out. The government is prioritizing a careful and considered approach to ensure these new qualifications are aligned with the needs of the healthcare sector. I can provide you with more detailed information if you’d like, such as: * Links to the official government website regarding the Level 6 initiative. * A summary of the key priorities and goals for this phase of the program. * Information on the specific skills and competencies that will be covered in these new courses. | { “reasoning”: “The context makes no mention of the government considering or planning QQI Level 6 courses for Healthcare Assistants, nor does it outline any phased implementation timeline. Therefore, the statement is not supported by the provided context.”, “score”: “No” } |
We can see a major error here in my evaluation prompt. In some cases the model answers “I am unable to answer this question with the information I have available”, which should be an acceptable answer in this context, since the generation prompt instructs it to give this default when it can’t answer. The evaluator nevertheless marks these answers as unfaithful.
I went back and added an extra instruction to the prompt to try to account for these cases. Still, it’s an important lesson in thinking logically about the material in the prompts.
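A complementary option, which is not what I did above, would be to catch the default answer with a deterministic check before the answer ever reaches the LLM evaluator. The following is a rough sketch under that assumption. The names `default-refusal`, `pure-refusal?` and `faithfulness-route` are hypothetical, and the refusal text is copied from the responses in the table above.

```clojure
(require '[clojure.string :as str])

;; The default answer as it appears in the responses above.
(def default-refusal
  "I am unable to answer this question with the information I have available.")

(defn pure-refusal?
  "True when the answer is only the default refusal,
  ignoring case and surrounding whitespace."
  [answer]
  (= (str/lower-case (str/trim (or answer "")))
     (str/lower-case default-refusal)))

(defn faithfulness-route
  "Returns :skip-judge for a pure refusal, otherwise :send-to-judge."
  [answer]
  (if (pure-refusal? answer) :skip-judge :send-to-judge))

(faithfulness-route "I am unable to answer this question with the information I have available.")
```

Answers that start with the refusal and then carry on, like the gpt-4o-mini response above, would still go to the evaluator, because the extra claims in them do need checking against the context.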
source: src/notebooks/rag_evaluation.clj