Week 8 - ISC Grants
ns tidy-tuesdays.2024.week08.grants
(:require
(:as str]
[clojure.string :as kind]
[scicloj.kindly.v4.kind :as tc]
[tablecloth.api :as hanami]
[scicloj.noj.v1.vis.hanami :as ht])) [aerial.hanami.templates
Introduction
This week’s dataset contains information about the R Consortium Infrastructure Steering Committee (ISC) Grant Program.
Grants have been provided since 2016
def DS (tc/dataset "data/2024/week08/isc_grants.csv" {:key-fn keyword})) (
(kind/table-> DS (tc/info :columns))) (
:name | :datatype | :n-elems | :categorical? |
---|---|---|---|
:year | :int16 | 85 | |
:group | :int16 | 85 | |
:title | :string | 85 | true |
:funded | :int32 | 85 | |
:proposed_by | :string | 85 | true |
:summary | :string | 85 | true |
:website | :string | 85 | true |
The ‘group’ category here refers to whether the grant was awarded in the spring cycle (1) or the fall cycle (2).
Keywords
This week’s entry contains the prompt “Are there any keywords that stand out in the titles or summaries of awarded grants? Have the funded amounts changed over time?”
In that spirit, let’s try look at the titles and summaries for keywords.
First, the titles.
def stopwords (str/split-lines (slurp "data/language/NLTK's list of english stopwords"))) (
defn split-words [string]
(->> (str/split string #" ")
(map str/lower-case)
(map #(str/replace % #"[():“”-]" ""))
(map #(str/replace % #"[\,\.\n]" " "))
(map str/trim)
(remove (into #{} stopwords))
(remove #{""}))) (
Top 10 Keywords in Titles
(kind/md"\n"
(str/join for [kw
(->>
(apply str (:title DS))
(
split-wordsfrequencies
sort-by val)
(reverse
take 10))
(:let [[word count] kw]]
str "* " word " (" count ")")))) (
- r (34)
- data (10)
- package (7)
- spatial (5)
- api (4)
- infrastructure (3)
- consortium (3)
- testing (3)
- test (2)
- training (2)
Unsurprisingly, ‘R’ is the top keyword. Let’s look at the titles containing the word spatial:
(kind/table-> DS
(re-find #"spatial" (str/lower-case (% :title))))
(tc/select-rows #(:title :year :funded]))) (tc/select-columns [
:title | :year | :funded |
---|---|---|
Building the “Spatial Data Science With R” Educational Materials and Pedagogical Infrastructure | 2022 | 25000 |
Database interoperability for spatial objects in R | 2020 | 6000 |
Tidy spatial networks in R | 2019 | 9000 |
R-global: analysing spatial data globally | 2019 | 10000 |
Strengthening of R in support of spatial data infrastructures management : geometa and ows4R R packages | 2018 | 20000 |
Most expensive ‘keywords’
Let’s try assign ‘values’ to the words using the funding amounts.
def word-costs
(let [all-costs
(-> DS
(:title :funded])
(tc/select-columns [:word-costs [:title :funded]
(tc/map-columns fn [title funded]
(let [words (split-words title)]
(zipmap words (repeat [funded])))))
(:word-costs)]
reduce (fn [result entry]
(merge-with into result entry))
(
{} all-costs)))
Now we have something like this:
take 10 word-costs) (
"maptools" [17000]]
(["ability" [5000]]
["organisational" [4000]]
["spatial" [25000 6000 9000 10000 20000]]
["interactive" [8000 9688 4000 9100]]
["accompanying" [13912]]
["abstract" [12000]]
["preserving" [772]]
["extensions" [15000]]
["products" [11200]]) [
Next, I’m not sure whether to ‘average’ the costs, or to sum them. I’ll try both.
def summed-costs
(reduce (fn [new-m [k v]]
(assoc new-m k
(reduce + v)))
(
{} word-costs))
def averaged-costs
(reduce (fn [new-m [k v]]
(assoc new-m k
(/ (reduce + v)
(count v))))
(
{} word-costs))
Highest cost words (summed):
take 10 (reverse (sort-by val summed-costs))) (
"r" 709361]
(["data" 167340]
["development" 135400]
["package" 114440]
["windows" 109200]
["dbi" 106500]
["infrastructure" 105800]
["macos" 74400]
["journal" 72000]
["spatial" 70000]) [
Lowest cost words (summed):
take 10 (sort-by val summed-costs)) (
"preserving" 772]
(["algorithmic" 772]
["historicalg" 772]
["knowledge" 772]
["transfering" 772]
["hub" 2000]
["dengue" 2000]
["automated" 3000]
["applications" 3000]
["autotest" 3000]) [
Highest cost words (averaged):
take 10 (reverse (sort-by val averaged-costs))) (
"ongoing" 62400]
(["infrastructural" 62400]
["windows" 54600]
["assistance" 50000]
["editorial" 50000]
["maintenance" 46800]
["build" 46800]
["level" 46800]
["project" 46800]
["top" 46800]) [
Lowest cost words (averaged):
take 10 (sort-by val averaged-costs)) (
"preserving" 772]
(["algorithmic" 772]
["historicalg" 772]
["knowledge" 772]
["transfering" 772]
["hub" 2000]
["dengue" 2000]
["automated" 3000]
["applications" 3000]
["autotest" 3000]) [
As expected, when taking the sum of the funding amounts for projects with these words in the titles, the top words are similar to those which occur most often.
What is perhaps more interesting is when the averaged costs are taken.
Out of interest, let’s take a look at the projects with ‘ongoing’ in the title:
(kind/table-> DS
(re-find #"ongoing" (str/lower-case (% :title))))
(tc/select-rows #(:title :year :funded]))) (tc/select-columns [
:title | :year | :funded |
---|---|---|
Ongoing infrastructural development for R on Windows and MacOS | 2018 | 62400 |
Just one entry! And an expensive project (containing the two most ‘expensive’ words)
Keywords in summaries
Let’s adopt the same methodology as above, but with the summary texts.
(kind/md"\n"
(str/join for [kw
(->>
(apply str (:summary DS))
(
split-wordsfrequencies
sort-by val)
(reverse
take 20))
(:let [[word count] kw]]
str "* " word " (" count ")")))) (
- r (216)
- data (127)
- project (93)
- package (75)
- packages (51)
- users (32)
- use (31)
- aims (31)
- new (28)
- development (26)
- community (24)
- provide (24)
- used (23)
- existing (23)
- tools (23)
- code (22)
- support (22)
- cran (22)
- model (21)
- also (20)
Let’s try make a word cloud with these words using Vega:
(kind/vega"https://vega.github.io/schema/vega/v5.json"
{:$schema :width 800
:height 400
:padding 0
:data [{:name "table"
:values [(apply str (:summary DS))]
:transform
:type "countpattern"
[{:field "data"
:case "upper"
:pattern "[\\w']{3,}"
:stopwords (str "(" (str/join "|" stopwords) ")")}
:type "formula" :as "angle"
{:expr "[-45, 0, 45][~~(random() * 3)]"}]}]
:scales [{:name "color"
:type "ordinal"
:domain {:data "table" :field "text"}
:range ["#d5a928", "#652c90", "#939597"]}]
:marks [{:type "text"
:from {:data "table"}
:encode {:enter
:text {:field "text"}
{:align {:value "center"}
:baseline {:value "alphabetic"}
:fill {:scale "color" :field "text"}}
:update {:fillOpacity {:value 1}}
:hover {:fillOpacity {:value 0.5}}}
:transform
:type "wordcloud"
[{:size [800 400]
:text {:field :text}
:rotate {:field "datum.angle"}
:font "Helvetica Neue, Arial"
:fontSize {:field "datum.count"}
:fontWeight 600
:fontSizeRange [12, 56]
:padding 2}]}]})
Finally, let’s look at the words in close proximity to ‘r’
def r-splits
(->> (apply str (:summary DS))
(
split-wordsfn [word] (= word "r")))
(partition-by (remove (fn [coll] (= (first coll) "r"))))) (
def words-before-r (map last (drop-last r-splits))) (
def words-after-r (map first (rest r-splits))) (
Words in Close Proximity to ‘R’ in summaries
(kind/vega"https://vega.github.io/schema/vega/v5.json"
{:$schema :width 800
:height 400
:padding 0
:data [{:name "table"
:values [(str/join #" " (concat words-after-r words-before-r))]
:transform
:type "countpattern"
[{:field "data"
:case "upper"
:pattern "[\\w']{3,}"
:stopwords ""}
:type "formula" :as "angle"
{:expr "[-45, 0, 45][~~(random() * 3)]"}]}]
:scales [{:name "color"
:type "ordinal"
:domain {:data "table" :field "text"}
:range ["#d5a928", "#652c90", "#939597"]}]
:marks [{:type "text"
:from {:data "table"}
:encode {:enter
:text {:field "text"}
{:align {:value "center"}
:baseline {:value "alphabetic"}
:fill {:scale "color" :field "text"}}
:update {:fillOpacity {:value 1}}
:hover {:fillOpacity {:value 0.5}}}
:transform
:type "wordcloud"
[{:size [800 400]
:text {:field :text}
:rotate {:field "datum.angle"}
:font "Helvetica Neue, Arial"
:fontSize {:field "datum.count"}
:fontWeight 600
:fontSizeRange [12, 56]
:padding 2}]}]})
Funding Levels by year
defn period-num->name [num]
(case num
(1 "Spring Cycle"
2 "Fall Cycle"))
defn aggregate-by-year [agg-fn]
(remove nil?
(reduce (fn [result data]
(let [year (first (:year data))
(-> data
agg (:group)
(tc/group-by :amount agg-fn})
(tc/aggregate {:as-maps))]
(tc/rows conj result
(:year year
{:period (period-num->name (:$group-name (first agg)))
:amount (:amount (first agg))}
when (= 2 (count agg))
(:year year
{:period (period-num->name (:$group-name (second agg)))
:amount (:amount (second agg))}))))
[]-> DS
(:year)
(tc/group-by :data))))
(kind/vega"https://vega.github.io/schema/vega-lite/v5.json"
{:$schema :data {:values (aggregate-by-year #(reduce + (% :funded)))}
:mark :bar
:width 500
:height 400
:encoding {:x {:field :year :type :ordinal
:axis {:labelAngle -45}
:title "Year"}
:y {:field :amount :type :quantitative
:title "Total Funding Amount"}
:color {:field :period}}
:title {:text "R ISC Grants"
:subtitle "Funding Levels by Year"}})
Number of projects funded by year
(kind/vega"https://vega.github.io/schema/vega-lite/v5.json"
{:$schema :data {:values (aggregate-by-year #(count (% :title)))}
:mark :bar
:width 500
:height 400
:encoding {:x {:field :year :type :ordinal
:axis {:labelAngle -45}
:title "Year"}
:y {:field :amount :type :quantitative
:title "Total Projects Funded"}
:color {:field :period}}
:title {:text "R ISC Grants"
:subtitle "Projects Funded by Year"}})
While the levels of funding can vary quite a bit, the number of projects funded per year seems more constant.
Let’s look at the ‘best value’ years by dividing the total funding by projects.
In this case, I won’t look at whether they were spring/fall projects, and will also ignore 2023 (only a half-year of data).
-> DS
(= 2023 (% :year)))
(tc/drop-rows #(:year)
(tc/group-by :funding-per-project #(int (/ (reduce + (% :funded))
(tc/aggregate {count (% :title))))})
(:year})
(tc/rename-columns {:$group-name
(hanami/plot ht/line-chart:X :year :XTYPE :temporal
{:Y :funding-per-project
:SIZE 4
:TITLE "Average Cost per Project"}))
2021 had the best ‘value’ projects…
-> DS
(= 2023 (% :year)))
(tc/drop-rows #(:year)
(tc/group-by :funding #(reduce + (% :funded))
(tc/aggregate {:projects #(count (% :title))})
:year})
(tc/rename-columns {:$group-name
(hanami/plot ht/point-chart:X :funding :XTYPE :quantitative
{:Y :projects :YTPYE :quantitative
:COLOR {:field :year}
:SIZE 150}))
Looks like 2021 was the best year in terms of the most number of projects at the lowest cost.
Successful Developers
Finally, let’s have a quick look at the most successful applicants. The third column ‘percentage total’ is the percentage of the total grant funding provided throughout the lifetime of this program.
(kind/tablelet [total-funding-provided (reduce + (:funded DS))]
(-> DS
(:proposed_by)
(tc/group-by :funding-secured #(reduce + (% :funded))
(tc/aggregate {:projects #(tc/row-count %)})
:percentage-total [:funding-secured]
(tc/map-columns int (* 100 (/ % total-funding-provided))))
#(:Proposed-By})
(tc/rename-columns {:$group-name :percentage-total :funding-secured] [:desc :desc])
(tc/order-by [range 10))))) (tc/select-rows (
:Proposed-By | :funding-secured | :projects | :percentage-total |
---|---|---|---|
Jeroen Ooms | 121200 | 3 | 10 |
Kirill Müller | 117500 | 6 | 10 |
Dianne Cook | 75000 | 2 | 6 |
Maëlle Salmon | 74050 | 4 | 6 |
Edzer Pebesma | 52000 | 5 | 4 |
Olga Vitek | 35000 | 1 | 2 |
Toby Hocking | 34000 | 1 | 2 |
Heather Turner | 32000 | 1 | 2 |
Michael Lawrence | 25000 | 2 | 2 |
Orhun Aydin | 25000 | 1 | 2 |