2 Dataset Preparation
(ns notebooks.preparation
  (:require [tablecloth.api :as tc]
            [scicloj.kindly.v4.kind :as kind]
            [java-time.api :as jt]
            [clojure.string :as str]))
2.1 Cleaning/preparation steps
The data is taken from the Oireachtas website. It contains key fields such as ‘question’, ‘answer’, ‘date’ and ‘topic’. There are around 10K questions/answers in the initial dataset, but many of these will be removed by the cleaning steps below.
The questions/answers are written submissions by members of parliament on a wide variety of topics. The written answers are provided by Ministers, who are heads of various departments.
(def datasource "data/20250302_PQs_10K_2024_answers.csv")
2.1.1 Text Cleaning
2.1.1.1 Question Formatting
The questions are prefixed with a question number, and end with an id tag. The functions below aim to remove these.
“1. An example question? [1234/25]” ->> “An example question?”
;; The dot is escaped so that it matches a literal period after the number.
(def re-question-number "^\\d+\\. ")
(def re-question-id "\\[\\d+/\\d+\\]")
(def re-question-num-or-id (re-pattern (str re-question-id "|" re-question-number)))
(defn clean-question [q] (str/replace q re-question-num-or-id ""))
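As a quick sanity check, the sketch below applies the same alternation to the example string. The names `question-marker-pattern` and `strip-question-markers` are hypothetical, and the final `str/trim` is an assumption added here: removing the id tag on its own leaves behind the space that preceded it.

```clojure
(require '[clojure.string :as str])

;; Same alternation as above: a "[digits/digits]" id tag or a leading "N. " prefix.
(def question-marker-pattern
  (re-pattern (str "\\[\\d+/\\d+\\]" "|" "^\\d+\\. ")))

;; Hypothetical helper: strip both markers, then trim leftover whitespace.
(defn strip-question-markers [q]
  (str/trim (str/replace q question-marker-pattern "")))

(strip-question-markers "1. An example question? [1234/25]")
;; => "An example question?"
```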
2.1.1.2 Topic Labels
Some topic labels end with a trailing period, which we also remove.
(defn clean-topic-label [label]
  (when label
    (if (re-find #"\.$" label)
      (subs label 0 (dec (count label)))
      label)))
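The same behaviour can be sketched more compactly with `str/replace`. The helper name `drop-trailing-period` is hypothetical, and this is only one possible formulation: it removes a single trailing period and passes `nil` through unchanged.

```clojure
(require '[clojure.string :as str])

;; Hypothetical alternative: remove one trailing period, leaving nil as nil.
(defn drop-trailing-period [label]
  (some-> label (str/replace #"\.$" "")))

(drop-trailing-period "Health Services.") ;; => "Health Services"
(drop-trailing-period "Health Services")  ;; => "Health Services"
(drop-trailing-period nil)                ;; => nil
```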
2.1.1.3 Department Names
At various times, especially following general elections, department functions can change. This typically also involves a change in the department’s name.
Because of this, it is hard to track most departments consistently beyond the last five years or so. Some departments, such as ‘Health’ or ‘Justice’, remain largely the same.
In addition, older questions give the full department title, while more recent questions only give the first part of the title. For example, “Department for the Environment, Climate and Communications” becomes “Environment”.
To consolidate some of the department names, we will also transform the older labels into short department names (a single word in most cases).
(defn normalise-department-name [label]
  (cond
    (re-find #"^Minister for Expenditure" label) "Public Expenditure"
    (re-find #"^Public$" label) "Public Expenditure"
    (re-find #"^Minister for the" label) (first (re-find #"(?<=^Minister for the )(\w+)(?=,| |$)" label)) ;; To match "Minister for the Environment..."
    (re-find #"^Minister for" label) (first (re-find #"(?<=^Minister for )(\w+)(?=,| |$)" label))
    :else label))
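To see how the branches behave, the snippet below repeats the function so that it runs on its own and applies it to a few labels. The sample labels are illustrative assumptions about what the raw column contains, not values taken from the dataset.

```clojure
;; Repeated from above so that the snippet is self-contained.
(defn normalise-department-name [label]
  (cond
    (re-find #"^Minister for Expenditure" label) "Public Expenditure"
    (re-find #"^Public$" label) "Public Expenditure"
    (re-find #"^Minister for the" label) (first (re-find #"(?<=^Minister for the )(\w+)(?=,| |$)" label))
    (re-find #"^Minister for" label) (first (re-find #"(?<=^Minister for )(\w+)(?=,| |$)" label))
    :else label))

;; Illustrative inputs (assumed, not taken from the data).
(normalise-department-name "Minister for the Environment, Climate and Communications")
;; => "Environment"
(normalise-department-name "Minister for Health") ;; => "Health"
(normalise-department-name "Public")              ;; => "Public Expenditure"
(normalise-department-name "Justice")             ;; => "Justice"
```

Note that the function expects a string: a `nil` label would make `re-find` throw, so missing departments would need to be handled before this step.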
2.1.1.4 Answer Cleaning
The data for the question ‘answers’ was in XML format, and occasionally included things like table elements. While parsing these I omitted them and left the string ‘{{OMITTED …}}’ in their place. So, I will also add a step here to remove those parts of the string.
(defn clean-incomplete-answers [answer]
  (str/replace answer #"\{\{OMITTED.*element\}\}" ""))
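One thing worth checking with a pattern like this is how the greedy `.*` behaves when an answer contains more than one placeholder. The sketch below uses a made-up answer string (the exact placeholder wording in the data may differ) and compares the greedy pattern with a reluctant `.*?` variant, which is shown only as a possible alternative.

```clojure
(require '[clojure.string :as str])

;; Hypothetical answer text with two placeholders; real placeholders may be worded differently.
(def sample-answer
  "See {{OMITTED table element}} and {{OMITTED img element}} here.")

;; Greedy .* spans from the first "{{OMITTED" to the last "element}}",
;; so the text between the two placeholders is removed as well.
(str/replace sample-answer #"\{\{OMITTED.*element\}\}" "")
;; => "See  here."

;; Reluctant .*? stops at the nearest "element}}", so " and " survives.
(str/replace sample-answer #"\{\{OMITTED.*?element\}\}" "")
;; => "See  and  here."
```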
Some answers also contain the ‘non-breaking space’ character (Unicode U+00A0, character code 160), so we replace these with ordinary spaces.
(defn clean-nbs-answers [answer]
  ;; \u00A0 is the non-breaking space, written as an escape so it stays visible in the source.
  (str/replace answer #"\u00A0" " "))
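A small check makes the difference between the two characters concrete. The sample string is hypothetical; the point is only that a non-breaking space looks like a space but does not compare equal to one until it has been replaced.

```clojure
(require '[clojure.string :as str])

;; "\u00A0" is the non-breaking space; it renders like a space but is a different character.
(def sample-with-nbsp "Dublin\u00A0City")

(= sample-with-nbsp "Dublin City")
;; => false

(= (str/replace sample-with-nbsp #"\u00A0" " ") "Dublin City")
;; => true
```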
2.1.2 Duplicate questions
There are some questions that are duplicates. For example:
(kind/table
 (->> (tc/map-columns (tc/dataset datasource {:key-fn keyword}) :question [:question] clean-question)
      :question
      frequencies
      (sort-by second)
      reverse
      (take 2)))
Deputy Michael Healy-Rae asked the Minister for Health the status of a hospital appointment for a person (details supplied); and if he will make a statement on the matter. | 6 |
Deputy Michael Fitzmaurice asked the Minister for Agriculture, Food and the Marine to provide a copy of the satellite images used by his Department in making the decision that a person (details supplied) was not eligible to apply for the Shannon callows compensation scheme on the parcels listed; and if he will make a statement on the matter. | 6 |
As these examples show, the duplicates arise because each question refers to separately supplied details that are not available here.
For the purposes of this exercise, it is better to drop the repeated rows so that each question text appears only once, and we will do so below using tablecloth’s unique-by function.
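The idea of de-duplicating on a single column can be sketched in plain Clojure. This is not tablecloth's implementation: the rows and the helper `keep-first-by` are hypothetical, and keeping the first occurrence is an assumption about the strategy used.

```clojure
;; Hypothetical rows standing in for the dataset.
(def sample-rows
  [{:question "Status of a hospital appointment?" :answer "Answer A"}
   {:question "Status of a hospital appointment?" :answer "Answer B"}
   {:question "Update on the legislative programme?" :answer "Answer C"}])

;; Keep the first row seen for each distinct value of key k, preserving order.
(defn keep-first-by [k rows]
  (second
   (reduce (fn [[seen kept] row]
             (let [v (get row k)]
               (if (contains? seen v)
                 [seen kept]
                 [(conj seen v) (conj kept row)])))
           [#{} []]
           rows)))

(map :answer (keep-first-by :question sample-rows))
;; => ("Answer A" "Answer C")
```

Under this strategy the second answer to the repeated question is discarded, which is acceptable here because the distinguishing details are not in the data anyway.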
2.1.3 Adding Question URLs
In case we want to reference the original source, we’ll also add the question urls to the dataset.
(defn extract-question-num [q] (re-find #"^\d+" q))
(defn extract-question-id [q] (re-find #"(?<=\[).*(?=\])" q))
(defn make-url [date q-num]
  (let [url-base "https://www.oireachtas.ie/en/debates/question/"
        url-default "https://www.oireachtas.ie/"]
    (if (jt/< (jt/local-date date) (jt/local-date "2012-07-01"))
      (str url-default)
      (str url-base (str date) "/" q-num "/"))))
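The snippet below shows what the two extraction patterns return for the earlier example string, and sketches the URL logic using plain `java.time` interop so that it runs without extra dependencies. `make-url-sketch` is a hypothetical stand-in for `make-url`, and accepting either an ISO date string or a `LocalDate` is an assumption about the shape of the `:date` column.

```clojure
(import '(java.time LocalDate))

(def url-base "https://www.oireachtas.ie/en/debates/question/")
(def url-default "https://www.oireachtas.ie/")

;; Hypothetical interop variant of make-url: accepts an ISO date string or a LocalDate.
(defn make-url-sketch [date q-num]
  (let [^LocalDate d (if (string? date) (LocalDate/parse date) date)]
    (if (.isBefore d (LocalDate/parse "2012-07-01"))
      url-default
      (str url-base d "/" q-num "/"))))

(re-find #"^\d+" "1. An example question? [1234/25]")            ;; => "1"
(re-find #"(?<=\[).*(?=\])" "1. An example question? [1234/25]") ;; => "1234/25"

(make-url-sketch "2024-01-31" "5")
;; => "https://www.oireachtas.ie/en/debates/question/2024-01-31/5/"
(make-url-sketch "2011-05-10" "5")
;; => "https://www.oireachtas.ie/"
```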
2.2 Build Prepared Dataset
(def ds
  (-> datasource
      (tc/dataset {:key-fn keyword})
      (tc/drop-missing :answer)
      (tc/map-columns :q-num [:question] extract-question-num)
      (tc/map-columns :q-id [:question] extract-question-id)
      (tc/map-columns :url [:date :q-num] #(make-url %1 %2))
      (tc/map-columns :question [:question] clean-question)
      (tc/map-columns :topic [:topic] clean-topic-label)
      (tc/map-columns :department [:department] normalise-department-name)
      (tc/map-columns :answer [:answer] (comp clean-incomplete-answers clean-nbs-answers))
      (tc/drop-missing :answer)
      (tc/unique-by :question)
      (tc/select-columns [:date :question :answer :department :topic :url])))
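The order of the steps matters: the question number and id have to be extracted before the question text is cleaned, because cleaning removes exactly the markers that the extraction relies on. A minimal illustration, using a hypothetical `strip-markers` helper in place of `clean-question`:

```clojure
(require '[clojure.string :as str])

(def raw-question "12. An example question? [1234/25]")

;; Hypothetical stand-in for clean-question: strip the id tag and the number prefix.
(defn strip-markers [q]
  (str/replace q #"\[\d+/\d+\]|^\d+\. " ""))

;; Extracting before cleaning finds the question number.
(re-find #"^\d+" raw-question)
;; => "12"

;; Extracting after cleaning finds nothing, because the prefix is gone.
(re-find #"^\d+" (strip-markers raw-question))
;; => nil
```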
2.3 General Stats
- Dates range from January 17 2024 to March 21 2024
- 9,823 total questions asked.
- The five most common question topics are: Departmental Data, Special Educational Needs, Health Services, Schools Building Projects, International Protection
- The five most commonly asked departments are: Health, Education, Housing, Transport, Justice
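The notebook does not show how these figures were produced. One way a "most common values" list could be computed is sketched below in plain Clojure; the helper `top-n` and the sample values are hypothetical.

```clojure
;; Hypothetical topic values standing in for the :topic column.
(def sample-topics
  ["Health Services" "Departmental Data" "Health Services"
   "Departmental Data" "Departmental Data" "Middle East"])

;; Count each value, sort by descending count, and keep the n most common.
(defn top-n [n values]
  (->> values
       frequencies
       (sort-by val >)
       (take n)
       (map key)))

(top-n 2 sample-topics)
;; => ("Departmental Data" "Health Services")
```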
2.4 A quick look at the dataset
(tc/column-names ds)
(:date :question :answer :department :topic :url)
(tc/row-count ds)
9823
(tc/head ds)
data/20250302_PQs_10K_2024_answers.csv [5 6]:
:date | :question | :answer | :department | :topic | :url |
---|---|---|---|---|---|
2024-01-31 | Deputy Rose Conway-Walsh asked the Taoiseach if he will provide an update on the legislative programme. | I propose to take Questions Nos. 1 to 4, inclusive, together. The Government Legislation Programme, which was published on 16th January 2024, sets out Government legislative priorities for the current parliamentary session. The current programme includes 46 bills for priority publication and drafting across a number of areas including healthcare, access to housing, stronger safer communities and road safety, amongst many others during the Spring Dáil session. Since this Government came to office in June 2020, 185 bills have been published of which 173 have been enacted to date. During this current Dáil session we will build on this work through the priority publication of 22 pieces of legislation and the drafting of a further 24 bills. There are currently 28 bills at various stages across both Houses of the Oireachtas. These include legislation that will amend and improve our planning system; provide for the regulation of gambling in Ireland and protect the rights and safety of children, their parents and all those involved in a surrogacy arrangement. I will continue to work with all members to progress legislation through both Houses of the Oireachtas. The current Legislation Programme does not include any bills in preparation in the Department of the Taoiseach as there are no legislative matters in the Department remit that require to be prioritised at this time. The Department of the Taoiseach will continue to play a central role in supporting effective coordination and prioritisation of policy and legislative developments across Government through Government meetings, the Cabinet Committees structures and the Government Legislation Committee. | Taoiseach | Legislative Programme | https://www.oireachtas.ie/en/debates/question/2024-01-31/1/ |
2024-01-31 | Deputy Richard Boyd Barrett asked the Taoiseach if he will provide an update on the legislative programme. | I propose to take Questions Nos. 1 to 4, inclusive, together. The Government Legislation Programme, which was published on 16th January 2024, sets out Government legislative priorities for the current parliamentary session. The current programme includes 46 bills for priority publication and drafting across a number of areas including healthcare, access to housing, stronger safer communities and road safety, amongst many others during the Spring Dáil session. Since this Government came to office in June 2020, 185 bills have been published of which 173 have been enacted to date. During this current Dáil session we will build on this work through the priority publication of 22 pieces of legislation and the drafting of a further 24 bills. There are currently 28 bills at various stages across both Houses of the Oireachtas. These include legislation that will amend and improve our planning system; provide for the regulation of gambling in Ireland and protect the rights and safety of children, their parents and all those involved in a surrogacy arrangement. I will continue to work with all members to progress legislation through both Houses of the Oireachtas. The current Legislation Programme does not include any bills in preparation in the Department of the Taoiseach as there are no legislative matters in the Department remit that require to be prioritised at this time. The Department of the Taoiseach will continue to play a central role in supporting effective coordination and prioritisation of policy and legislative developments across Government through Government meetings, the Cabinet Committees structures and the Government Legislation Committee. | Taoiseach | Legislative Programme | https://www.oireachtas.ie/en/debates/question/2024-01-31/2/ |
2024-01-31 | Deputy Paul Murphy asked the Taoiseach if he will provide an update on the legislative programme. | I propose to take Questions Nos. 1 to 4, inclusive, together. The Government Legislation Programme, which was published on 16th January 2024, sets out Government legislative priorities for the current parliamentary session. The current programme includes 46 bills for priority publication and drafting across a number of areas including healthcare, access to housing, stronger safer communities and road safety, amongst many others during the Spring Dáil session. Since this Government came to office in June 2020, 185 bills have been published of which 173 have been enacted to date. During this current Dáil session we will build on this work through the priority publication of 22 pieces of legislation and the drafting of a further 24 bills. There are currently 28 bills at various stages across both Houses of the Oireachtas. These include legislation that will amend and improve our planning system; provide for the regulation of gambling in Ireland and protect the rights and safety of children, their parents and all those involved in a surrogacy arrangement. I will continue to work with all members to progress legislation through both Houses of the Oireachtas. The current Legislation Programme does not include any bills in preparation in the Department of the Taoiseach as there are no legislative matters in the Department remit that require to be prioritised at this time. The Department of the Taoiseach will continue to play a central role in supporting effective coordination and prioritisation of policy and legislative developments across Government through Government meetings, the Cabinet Committees structures and the Government Legislation Committee. | Taoiseach | Legislative Programme | https://www.oireachtas.ie/en/debates/question/2024-01-31/3/ |
2024-01-31 | Deputy Bríd Smith asked the Taoiseach if he will provide an update on the legislative programme. | I propose to take Questions Nos. 1 to 4, inclusive, together. The Government Legislation Programme, which was published on 16th January 2024, sets out Government legislative priorities for the current parliamentary session. The current programme includes 46 bills for priority publication and drafting across a number of areas including healthcare, access to housing, stronger safer communities and road safety, amongst many others during the Spring Dáil session. Since this Government came to office in June 2020, 185 bills have been published of which 173 have been enacted to date. During this current Dáil session we will build on this work through the priority publication of 22 pieces of legislation and the drafting of a further 24 bills. There are currently 28 bills at various stages across both Houses of the Oireachtas. These include legislation that will amend and improve our planning system; provide for the regulation of gambling in Ireland and protect the rights and safety of children, their parents and all those involved in a surrogacy arrangement. I will continue to work with all members to progress legislation through both Houses of the Oireachtas. The current Legislation Programme does not include any bills in preparation in the Department of the Taoiseach as there are no legislative matters in the Department remit that require to be prioritised at this time. The Department of the Taoiseach will continue to play a central role in supporting effective coordination and prioritisation of policy and legislative developments across Government through Government meetings, the Cabinet Committees structures and the Government Legislation Committee. | Taoiseach | Legislative Programme | https://www.oireachtas.ie/en/debates/question/2024-01-31/4/ |
2024-01-31 | Deputy Richard Boyd Barrett asked the Tánaiste and Minister for Foreign Affairs whether he is aware of the activities of the sole remaining animal shelter in Gaza, Sulala Animal Rescue, and its collaboration with Animals Australia; whether he has raised the issue of animal feed reaching the shelter through the Kerem Shalom checkpoint with the Israeli Government; and if he will make a statement on the matter. | I am deeply concerned by the humanitarian situation in Gaza. 100% of people in Gaza are estimated to be acutely food insecure, while a quarter of its population faces catastrophic hunger and starvation. I have no doubt that the shocking situation on the ground also adversely affects the welfare of animals. I have consistently underlined that we urgently need a massive and sustained scale-up of humanitarian assistance. I further note that, on 26 January, the International Court of Justice ordered provisional measures in the South Africa v Israel case, which, inter alia, ordered Israel take immediate and effective measures to enable the provision of urgently needed basic services and humanitarian assistance in Gaza. The Court’s order if legally binding and final. Israel must urgently implement all measures. While I am not aware of the situation of animal shelters in Gaza, it is clear that increased movement of goods through the Kerem Shalom checkpoint will be a key enabler in scaling up the level of humanitarian assistance entering Gaza and in ensuring the provision of basic services as ordered by the International Court of Justice. My focus will remain on securing humanitarian aid for the people of Gaza. | Foreign | Middle East | https://www.oireachtas.ie/en/debates/question/2024-01-31/5/ |
source: src/notebooks/preparation.clj