
Statistics homework 8

Theory:

  1. Distributions of the order statistics: look on the web for the simplest (but still rigorous) and clearest derivations of the distributions, explaining in your own words the methods used.

  2. Do some research about the general correlation coefficient for ranks and the most common indices that can be derived from it. Do one example of computation of these correlation coefficients for ranks.

Practice:

  1. Given a random variable, extract m samples of size n and plot the empirical distributions (histograms) of its mean and of the first and last order statistics. Comment on what you see.

  2. Discover a new important stochastic process by yourself! Consider the general scheme we have used so far to simulate stochastic processes (such as the relative frequency of success in a sequence of trials, the sample mean and the random walk) and add this new process to our process simulator. Use the same scheme as the previous program (the random walk), changing only the way the values of the paths are computed at each time. Starting from value 0 at time 0, for each of m paths, at each new time compute N(i) = N(i-1) + Random step(i), for i = 1, …, n, where Random step(i) is now a Bernoulli random variable with success probability λ * (1/n) (where λ is a user parameter, e.g. 50, 100, …). At time n (the last time) and at one (or more) other chosen inner time \(1 < j < n\) (j is a program parameter), create and represent with a histogram the distribution of N at that time. Represent also the distributions of the following quantities (and any other quantity that you think is of interest):

    • Distance (time elapsed) of individual jumps from the origin
    • Distance (time elapsed) between consecutive jumps (the so-called “holding times”)

Practice theory: Find out on the web what you have just generated in the previous application. Can you find out about all the well-known distributions that “naturally arise” in this process?



Theory 1

The order statistics are the items of a random sample arranged in increasing order. The focus here is to present the distribution functions and probability density functions of the order statistics, which are important tools in non-parametric statistical inference.

We only consider random samples obtained from a continuous distribution (i.e. the distribution function is a continuous function). Let \(X_{1}, X_{2}, \dots, X_{n}\) be a random sample of size \(n\) from a continuous distribution with distribution function \(F(x)\). We order the sample in increasing order and obtain \(Y_{1}, Y_{2}, \dots, Y_{n}\), where \(Y_{1} = \min\{X_{1}, X_{2}, \dots, X_{n}\}\), \(Y_{n} = \max\{X_{1}, X_{2}, \dots, X_{n}\}\) and \(Y_{1} \leq Y_{2} \leq \dots \leq Y_{n}\).

The distribution function of the order statistics

The distribution function of \(Y_{i}\) is an upper tail of a binomial distribution. The event \(Y_{i} \leq y\) occurs exactly when at least \(i\) of the \(X_{j}\) in the sample are less than or equal to \(y\). Consider the event \(X \leq y\) as a success, with \(F(y) = P[X \leq y]\) as the probability of success; then the drawing of each sample item is a Bernoulli trial, and the number of sample items that fall at or below \(y\) is binomial with parameters \(n\) and \(F(y)\). Thus, the following is the distribution function of \(Y_{i}\):

\[F_{Y_{i}}(y) = P[Y_{i} \leq y] = \sum_{k=i}^{n} \binom{n}{k}F(y)^{k}(1-F(y))^{n-k}\]
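
As a quick sanity check (not part of the original homework code), the following Python sketch compares this formula with a Monte Carlo estimate for a standard uniform parent distribution; the values of n, i and y are arbitrary illustrative choices.

    import numpy as np
    from scipy.stats import binom, uniform

    # Illustrative parameters (not prescribed by the text): sample size, order index, evaluation point.
    n, i, y = 10, 3, 0.4
    F_y = uniform.cdf(y)                    # F(y) for X ~ Uniform(0, 1)

    # Formula: P[Y_i <= y] = sum_{k=i}^{n} C(n, k) F(y)^k (1 - F(y))^(n-k)
    cdf_formula = binom.sf(i - 1, n, F_y)   # upper tail of Binomial(n, F(y))

    # Monte Carlo check: draw many samples, take the i-th order statistic of each.
    rng = np.random.default_rng(0)
    samples = rng.uniform(size=(100_000, n))
    ith_order_stat = np.sort(samples, axis=1)[:, i - 1]
    cdf_montecarlo = np.mean(ith_order_stat <= y)

    print(f"formula: {cdf_formula:.4f}  simulation: {cdf_montecarlo:.4f}")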

The probability density function of the order statistics

The probability density function of \(Y_{i}\) can be obtained by differentiating the distribution function above (most of the terms of the sum cancel), or heuristically: \(i-1\) of the observations must fall below \(y\), \(n-i\) must fall above \(y\), one must fall at \(y\), and the multinomial coefficient counts the ways of assigning these roles. Either way, the PDF of \(Y_{i}\) is given by:

\[f_{Y_{i}}(y) = \frac{n!}{(i-1)!(n-i)!}F(y)^{i-1}(1-F(y))^{n-i}f_{X}(y)\]
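
Again as an illustrative check: for a standard uniform parent distribution (\(F(y) = y\), \(f_{X}(y) = 1\)) this density reduces to the well-known Beta\((i, n-i+1)\) density of the \(i\)-th uniform order statistic, which the sketch below verifies numerically for arbitrary n and i.

    import numpy as np
    from math import factorial
    from scipy.stats import beta

    # Illustrative parameters: n and i are arbitrary choices.
    n, i = 10, 3
    y = np.linspace(0.05, 0.95, 5)

    # Formula with F(y) = y and f_X(y) = 1 for X ~ Uniform(0, 1).
    coeff = factorial(n) / (factorial(i - 1) * factorial(n - i))
    pdf_formula = coeff * y**(i - 1) * (1 - y)**(n - i)

    # Known result: the i-th uniform order statistic is Beta(i, n - i + 1).
    pdf_beta = beta.pdf(y, i, n - i + 1)

    print(np.allclose(pdf_formula, pdf_beta))   # True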

[1] https://probabilityandstats.wordpress.com/2010/02/20/the-distributions-of-the-order-statistics/


Theory 2

In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, if the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively.

To understand this concept better, let’s assume that we have a set of values: 25, 24, 23, 27, 29, 28. The ordered set will be: 23, 24, 25, 27, 28, 29. Now we can assign a rank to all the values in the original set based on the position in the ordered set, therefore 25 has the rank 3, 24 has the rank 2, 23 has the rank 1, 27 has the rank 4, 29 has the rank 6 and 28 has the rank 5.
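
A small Python sketch (purely illustrative, not from the original project) that computes these ranks; scipy.stats.rankdata gives the same result and also handles ties.

    import numpy as np

    values = np.array([25, 24, 23, 27, 29, 28])
    # Rank = 1 + position of each value in the sorted order (no ties here).
    ranks = np.argsort(np.argsort(values)) + 1
    print(ranks)   # [3 2 1 4 6 5]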

Ranks can be used to measure correlation. As stated in [2], in statistics a rank correlation is any of several statistics that measure an ordinal association, that is, the relationship between rankings of different ordinal variables or different rankings of the same variable, where a ranking is the assignment of the ordering labels first, second, third, etc. to different observations of a particular variable. A rank correlation coefficient measures the degree of similarity between two rankings, and can be used to assess the significance of the relation between them.

There are several measures of rank correlation; the two coefficients discussed below are the Pearson correlation coefficient, which is the basic building block, and the Spearman rank correlation coefficient, which is derived from it by working on ranks.

Pearson correlation coefficient (PCC)

The Pearson correlation coefficient measures the degree of linear relationship between two random variables. Its value ranges between -1 and +1. It is the covariance of the two variables normalized by the product of their standard deviations:

\[PCC(X, Y) = \frac{COV(X, Y)}{SD_{X} \cdot SD_{Y}}\]
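
A short Python sketch (with made-up illustrative data) that computes the PCC directly from this formula and checks it against numpy's built-in correlation:

    import numpy as np

    # Illustrative data (any two equal-length samples would do).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 2.9, 3.2, 4.8, 5.1])

    # PCC = Cov(X, Y) / (SD_X * SD_Y)
    pcc = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
    print(pcc, np.corrcoef(x, y)[0, 1])   # the two values coincide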

Spearman rank correlation coefficient (SRCC)

The SRCC overcomes some limitations of the PCC: it does not make any assumption about the distribution of the data. The SRCC measures the degree of association between two variables by assigning ranks to the values of each random variable and computing the PCC of those ranks.

Given two random variables X and Y, compute the rank of each observation, so that the smallest value gets rank 1, and then apply the Pearson correlation coefficient to Rank(X) and Rank(Y) to obtain the SRCC:

\[SRCC(X, Y) = PCC(rank(X), rank(Y))\]
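
The sketch below (again with made-up illustrative data) works through one such computation, as requested in Theory 2, and compares the result with scipy's spearmanr:

    import numpy as np
    from scipy.stats import rankdata, spearmanr

    # Illustrative paired observations.
    x = np.array([3.4, 5.1, 2.6, 7.3, 4.4])
    y = np.array([1.2, 3.3, 0.9, 6.0, 5.5])

    # SRCC(X, Y) = PCC(rank(X), rank(Y))
    rx, ry = rankdata(x), rankdata(y)
    srcc = np.corrcoef(rx, ry)[0, 1]

    print(srcc, spearmanr(x, y).correlation)   # the two values coincide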

[1] https://en.wikipedia.org/wiki/Ranking#Ranking_in_statistics

[2] https://en.wikipedia.org/wiki/Ranking#Ranking_in_statistics

[3] https://towardsdatascience.com/pearson-and-spearman-rank-correlation-coefficient-explained-60811e61185a



Practice 1

How it works
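
The original simulator is not reproduced in this page; the following is a minimal Python sketch of the procedure described in Practice 1, where the parent distribution (an Exponential(1)) and the values of m and n are arbitrary illustrative choices. It draws m samples of size n and plots histograms of the sample mean, of the first order statistic (minimum) and of the last order statistic (maximum).

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative choices: m samples of size n from an Exponential(1) parent.
    m, n = 10_000, 50
    rng = np.random.default_rng(1)
    samples = rng.exponential(scale=1.0, size=(m, n))

    means = samples.mean(axis=1)    # sample mean of each sample
    firsts = samples.min(axis=1)    # first order statistic Y_1
    lasts = samples.max(axis=1)     # last order statistic Y_n

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    titles = ("sample mean", "first order statistic", "last order statistic")
    for ax, data, title in zip(axes, (means, firsts, lasts), titles):
        ax.hist(data, bins=50, density=True)
        ax.set_title(title)
    plt.tight_layout()
    plt.show()

With these choices one expects the sample means to cluster around the parent mean and look roughly normal (central limit theorem), the minima to pile up near zero (the minimum of n exponentials is again exponential, with rate n), and the maxima to be right-skewed.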

Practice 2

How it works
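
As above, the original simulator is not attached here; the sketch below follows the scheme described in Practice 2, with arbitrary illustrative values of m, n, λ and the inner time j. Each of the m paths starts at 0 and, at every time step, increases by 1 with probability λ/n; the program then plots histograms of N(j) and N(n), of the jump times measured from the origin, and of the holding times between consecutive jumps.

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative parameters: m paths, n time steps, rate lambda, inner time j.
    m, n, lam, j = 5_000, 1_000, 50, 300
    rng = np.random.default_rng(2)

    # Random step(i) ~ Bernoulli(lambda / n); N(i) = N(i-1) + Random step(i).
    steps = rng.random((m, n)) < lam / n
    paths = np.cumsum(steps, axis=1)
    N_j, N_n = paths[:, j - 1], paths[:, n - 1]

    # Jump times (measured from the origin) and holding times, pooled over all paths.
    jump_times, holding_times = [], []
    for row in steps:
        times = np.flatnonzero(row) + 1        # times at which this path jumps
        jump_times.extend(times)               # distance (time elapsed) from the origin
        holding_times.extend(np.diff(times))   # distance between consecutive jumps
    jump_times, holding_times = np.asarray(jump_times), np.asarray(holding_times)

    fig, axes = plt.subplots(2, 2, figsize=(10, 6))
    data_and_titles = ((N_j, f"N({j})"), (N_n, f"N({n})"),
                       (jump_times, "jump times from the origin"),
                       (holding_times, "holding times between jumps"))
    for ax, (data, title) in zip(axes.flat, data_and_titles):
        ax.hist(data, bins=50, density=True)
        ax.set_title(title)
    plt.tight_layout()
    plt.show()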

One of the histograms produced by the program shows the distribution of the times of the individual jumps, measured from the origin (time 0).

The other shows the distribution of the times elapsed between consecutive jumps (the so-called “holding times”).




Practice Theory

The process we generated in the application above is very close to a Poisson process.

A Poisson process is a model for a series of discrete events where the average time between events is known, but the exact timing of the events is random. The arrival of an event is independent of the events before it (the waiting times between events are memoryless).

These kinds of processes meet the following criteria:

  1. Events are independent of each other.
  2. The average rate (events per time period) is constant.
  3. Two events cannot occur at the same time.

The last point — events are not simultaneous — means we can think of each sub-interval of a Poisson process as a Bernoulli Trial, that is, either a success or a failure.

Poisson distribution

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

A discrete random variable \(X\) is said to have a Poisson distribution with parameter \(\lambda > 0\) if it has a probability mass function given by:

\[f(k, \lambda) = P[X=k] = \frac{\lambda^{k}e^{-\lambda}}{k!}\]
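
This is exactly what shows up in the simulator of Practice 2: the count at the last time is the number of successes in n Bernoulli(λ/n) trials, i.e. a Binomial(n, λ/n) random variable, which converges to a Poisson(λ) as n grows (and N(j) is approximately Poisson(λ · j/n)), while the geometric holding times converge to exponential ones. A quick numerical check of the binomial-to-Poisson approximation, with arbitrary λ and n:

    import numpy as np
    from scipy.stats import binom, poisson

    # Arbitrary illustrative values of the rate and of the number of sub-intervals.
    lam, n = 50, 1_000
    k = np.arange(0, 100)

    # Number of successes in n Bernoulli(lam/n) trials vs. the Poisson(lam) PMF.
    pmf_binomial = binom.pmf(k, n, lam / n)
    pmf_poisson = poisson.pmf(k, lam)

    # Maximum pointwise difference between the two PMFs (small for large n).
    print(np.max(np.abs(pmf_binomial - pmf_poisson)))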

The Poisson distribution may be useful to model events such as:

  • The number of meteorites greater than 1 meter diameter that strike Earth in a year
  • The number of patients arriving in an emergency room between 10 and 11 pm
  • The number of laser photons hitting a detector in a particular time interval

[1] https://towardsdatascience.com/the-poisson-distribution-and-poisson-process-explained-4e2cb17d459

[2] https://en.wikipedia.org/wiki/Poisson_distribution