Statistics for Data Analysis with Kotlin

Statistics is the grammar of science. - Karl Pearson

Data is the new oil. Just type in "Data Science" inside the Google Trends search box, and you would believe it too. The interest over time for data science has taken a boost in the recent past, for reasons as obvious as the theory of evolution and the survival of the fittest! One blog even read that the job of a data scientist is the sexiest job of the 21st century!

What comes to our minds whenever we talk about which programming language to pick for data science? Let us be honest, Kotlin has never been the one. Python tops my list, or at least used to, followed by Java & Scala. According to this article, Kotlin even failed to break in the top 10!

I was introduced to Kotlin in 2017. Since then I have been in awe of the language. Interoperability with Java, elimination of the NullPointerException, Kotlin Multiplatform Mobile, you name it. And just when I thought Kotlin could not surprise me anymore, Kotlin-Statistics came into the picture.

Why try Kotlin when we already have Python?

Change is the only constant.

Why such an inclination towards Kotlin, you ask? There are multiple reasons.

Google I/O announced Kotlin as the official Android development language in 2017, and since then many developers have shifted to Kotlin.
The introduction of Kotlin Multiplatform has extended the use of the language to other platforms such as iOS and backend development.

According to Thomas Neild, the man behind Kotlin-Statistics,

It started out as an experiment to express meaningful statistical and data analysis with functional and object-oriented programming, while making the code legible and intuitive.

This blog will cover a basic overview of the various extension functions & operators that the Kotlin-Statistics library supports for collections.

Integration
Kotlin-Statistics can be integrated using Gradle or Maven, as follows:

// Gradle
dependencies {
    compile 'org.nield:kotlin-statistics:1.2.1'
}

// Maven
<dependency>
    <groupId>org.nield</groupId>
    <artifactId>kotlin-statistics</artifactId>
    <version>1.2.1</version>
</dependency>

It can also be integrated using JitPack to directly build a snapshot as a dependency:

// Gradle
repositories {		
    maven { url 'https://jitpack.io' }
}
dependencies {
    compile 'com.github.thomasnield:kotlin-statistics:-SNAPSHOT'
}

// Maven
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
<dependency>
    <groupId>com.github.thomasnield</groupId>
    <artifactId>kotlin-statistics</artifactId>
    <version>-SNAPSHOT</version>
</dependency>

Basic Operators

Kotlin-Statistics houses several operators that support Int, Long, Double, Float, BigDecimal, and Short numeric types. So let's dive directly into the code & understand each of the above with the help of examples:

var listOfDoubles = sequenceOf(1.0, 3.0, 5.0, 3.0, 7.0)

sum()

print(listOfDoubles.sum())

Output:
19.0

average()

print(listOfDoubles.average())

Output:
3.8

min()

print(listOfDoubles.min())

Output:
1.0

max()

print(listOfDoubles.max())

Output:
7.0

mode()

print(listOfDoubles.mode())

Output:
3.0

median()

print(listOfDoubles.median())

Output:
3.0

range()

print(listOfDoubles.range())

Output:
1.0..7.0

percentile()

print(listOfDoubles.percentile(3.0))

Output:
1.0

variance()

print(listOfDoubles.variance())

Output:
5.2

standardDeviation()

print(listOfDoubles.standardDeviation())

Output:
2.280350850198276

geometricMean()

print(listOfDoubles.geometricMean())

Output:
3.159818305749271

sumOfSquares()

print(listOfDoubles.sumOfSquares())

Output:
93.0

normalize()

listOfDoubles.normalize().forEach {
        print(it)
    }

Output:
-1.2278812270298407
-0.35082320772281156
0.5262348115842176
-0.35082320772281156
1.4032928308912467

simpleRegression()

print(sequenceOf(
            1.0 to 2.0,
            2.0 to 6.0,
            3.0 to 9.0
    ).simpleRegression().slope)

Output:
3.5

kurtosis

print(listOfDoubles.kurtosis)

Output:
-0.17751479289940963

skewness

print(listOfDoubles.skewness)

Output:
0.404796008910937

Slicing Operators

Enough of the basic stuff. Let's level up! Slicing operators can be used to modify or delete the items of mutable sequences such as lists and perform operations on them.
To understand the slicing operators, let's take a real-life scenario. We are studying the data of 10 football players, which belong to one of these three teams: Invincibles, Speedsters & Rockstars. We have created a data class, Player, which takes in 3 parameters – the player's name, the team to which he belongs, and the total number of goals he has scored for his team.

data class Player(val name: String, val teamName: String, val goalsScored: Int)

// List of players
val playersList = listOf(
        Player("Bruce", "Invincibles", 46),
        Player("Peter", "Speedsters", 67),
        Player("Jack", "Speedsters", 34),
        Player("Watson", "Invincibles", 71),
        Player("Daniel", "Rockstars", 64),
        Player("Robert", "Speedsters", 60),
        Player("Kent", "Rockstars", 41),
        Player("Jason", "Invincibles", 39),
        Player("Rusty", "Speedsters", 46),
        Player("James", "Rockstars", 61)
    )

Conventionally, Kotlin does not provide direct methods to calculate stats such as the total number or the average number of goals scored by each team, the minimum/maximum number of goals scored per player, so on and so forth. If not for Kotlin-Statistics, it would've needed us to first group the collections based on the team name, operate on it to find the total number of goals scored by each team, and then calculate the sum or the average number of goals. Kotlin-Statistics reduces this additional effort, and makes it efficient to perform these statistical calculations. Let's analyse these operators in detail with the help of the following examples:

sumBy()

val totalNumberOfGoalsScoredByEachTeam = playersList.sumBy(keySelector = { it.teamName }, 
									intSelector = { it.totalGoalsScored })
print(totalNumberOfGoalsScoredByEachTeam)

Output:
{Invincibles=156, Speedsters=207, Rockstars=166}

countBy()

val numberOfPlayersPerTeam = playersList.countBy(keySelector = { it.teamName })
print(numberOfPlayersPerTeam)

Output:
{Invincibles=3, Speedsters=4, Rockstars=3}

averageBy()

val averageNumberOfGoalsScoredByEachPlayerForEachTeam =
            playersList.averageBy(
                keySelector = { it.teamName },
                intSelector = { it.totalGoalsScored })
print(averageNumberOfGoalsScoredByEachPlayerForEachTeam)

Output:
{Invincibles=52.0, Speedsters=51.75, Rockstars=55.333333333333336}

geometricMeanBy()

val geometricMeanOfGoalsScoredByEachTeam =
            playersList.geometricMeanBy(keySelector = { it.teamName },
                valueSelector = { it.totalGoalsScored })
print(geometricMeanOfGoalsScoredByEachTeam)

Output:
{Invincibles=16.822603841260722, Speedsters=14.705441169852742, Rockstars=12.503332889007368}

minBy()

val minimumNumberOfGoalsScoredByAPlayerFromEachTeam =
            playersList.minBy(keySelector = { it.teamName },
                valueSelector = { it.totalGoalsScored })
print(minimumNumberOfGoalsScoredByAPlayerFromEachTeam)

Output:
{Invincibles=39, Speedsters=34, Rockstars=41}

maxBy()

val maximumNumberOfGoalsScoredByAPlayerFromEachTeam =
            playersList.maxBy(keySelector = { it.teamName },
                valueSelector = { it.totalGoalsScored })
print(maximumNumberOfGoalsScoredByAPlayerFromEachTeam)
    
Output:
{Invincibles=71, Speedsters=67, Rockstars=64}

rangeBy()

val rangeOfGoalsScoredByEachTeam = playersList.rangeBy(keySelector = { it.teamName },
            valueSelector = { it.totalGoalsScored })
print(rangeOfGoalsScoredByEachTeam)
    
Output:
{Invincibles=39..71, Speedsters=34..67, Rockstars=41..64}

varianceBy()

val varianceOfGoalsScoredByEachTeam = playersList.varianceBy(keySelector = { it.teamName },
            valueSelector = { it.totalGoalsScored })
print(varianceOfGoalsScoredByEachTeam)
    
Output:
{Invincibles=283.0, Speedsters=216.25, Rockstars=156.33333333333334}

standardDeviationBy()

val standardDeviationOfGoalsScoredByEachTeam =
            playersList.standardDeviationBy(keySelector = { it.teamName s },
                valueSelector = { it.totalGoalsScored })
print(standardDeviationOfGoalsScoredByEachTeam)
    
Output:
{Invincibles=16.822603841260722, Speedsters=14.705441169852742, Rockstars=12.503332889007368}

Is that all that Kotlin-Statistics has to provide? The answer is no. Binning Operations, Naive Bayes Classifier, Clustering – Kotlin-Statistics has it all covered.

Maybe Kotlin-Statistics is just another brick in the wall of data science, maybe it's the entire wall, it's a bit early to say. But one thing is for sure that the future is definitely bright for Kotlin.

References:

Statistics for Data Analysis with Kotlin

Basic Operators

Slicing Operators

Active Record Encryption in Rails 7

Understanding applications of useRef hook in React with examples