Bug 201582 - ministat does not calculate proper median value (patch included)
Status: Closed FIXED
Alias: None
Product: Base System
Classification: Unclassified
Component: bin
Version: CURRENT
Hardware: Any Any
Importance: --- Affects Many People
Assignee: Marcelo Araujo
Reported: 2015-07-15 07:24 UTC by Marcus Reid
Modified: 2015-11-24 02:32 UTC

Attachments
Make ministat return proper value for median (470 bytes, patch)
2015-07-15 07:24 UTC, Marcus Reid

Description Marcus Reid 2015-07-15 07:24:48 UTC
Created attachment 158793
Make ministat return proper value for median

From http://www.mathgoodies.com/lessons/vol8/median.html :

"The median of a set of data is the middlemost number in the set. The median is also the number that is halfway into the set. To find the median, the data should be arranged in order from least to greatest. If there is an even number of items in the data set, then the median is found by taking the mean (average) of the two middlemost numbers."

Ministat currently returns the second of the two middle numbers if there is an even number of entries.  This patch takes the middle two and returns their average.

*** usr.bin/ministat/ministat.c.orig	2015-07-14 23:49:11.246171000 -0700
--- usr.bin/ministat/ministat.c	2015-07-15 00:16:20.895494000 -0700
***************
*** 193,199 ****
  Median(struct dataset *ds)
  {
  
! 	return (ds->points[ds->n / 2]);
  }
  
  static double
--- 193,200 ----
  Median(struct dataset *ds)
  {
  
! 	if(!(ds->n % 2)) return ((ds->points[ds->n / 2]) + (ds->points[(ds->n / 2)-1]))/2;
! 	else return (ds->points[ds->n / 2]);
  }
  
  static double
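
The difference between the two versions can be sketched in Python. This is an illustration only, not code from ministat: `median_old` and `median_patched` are hypothetical names that mirror the indexing in the diff above, and both assume a non-empty list that is already sorted.

```python
def median_old(points):
    # Mirrors the original line: ds->points[ds->n / 2]
    return points[len(points) // 2]


def median_patched(points):
    # Mirrors the patched lines: average the two middle values when n is even.
    n = len(points)
    if n % 2 == 0:
        return (points[n // 2] + points[n // 2 - 1]) / 2
    return points[n // 2]


even = [1.0, 2.0, 3.0, 4.0]
odd = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

print(median_old(even), median_patched(even))  # 3.0 2.5
print(median_old(odd), median_patched(odd))    # 4.0 4.0
```

For an odd number of entries both functions pick the same element, so only even-sized data sets change.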
Comment 1 Marcelo Araujo freebsd_committer freebsd_triage 2015-11-21 03:54:38 UTC
I will take it.
Comment 2 Marcelo Araujo freebsd_committer freebsd_triage 2015-11-23 04:30:32 UTC
Hi,

ministat(1) is actually doing it the right way.
I ran a couple of tests in other languages and they return the same value as ministat.

As a simple example, here are two different ways in Python:
>>> import statistics
>>> items = [1,2,3,4,5,6,7]
>>> statistics.median(items)
4

Also I made my own:
>>> def middle(L):
...     L = sorted(L)
...     n = len(L)
...     m = n -1
...     return (L[n // 2] + L[m // 2]) / 2.0
... 
>>> print(middle(items))
4.0

Ministat result:
    N           Min           Max        Median           Avg        Stddev
x   7             1             7             4             4     2.1602469
+   7             1             7             4             4     2.1602469

So, I can't see where the problem is.


Best,
Comment 3 Marcus Reid 2015-11-23 06:27:59 UTC
You're not testing for the problem case there.  Here, let me demonstrate:

>>> import statistics
>>> items = [1,2,3,4]
>>> statistics.median(items)
2.5


[mreid@sol /usr/home/mreid]$ ministat
1
2
3
4
x <stdin>
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|x                                                                          x                                                                           x                                                                          x|
|                |________________________________________________________________________________________________A_____________________________________M__________________________________________________________|                |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4             1             4             3           2.5     1.2909944


As you can see, ministat reports a median of 3, whereas the real median is 2.5, as seen in the Python example.

It's the case where there is no middle number (an even number of items in the dataset) that is broken.
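
A quick way to see that only even-sized data sets are affected is to compare the "element at n / 2" rule against `statistics.median` for a range of sizes. This is a rough sketch: `upper_middle` is a hypothetical stand-in for the pre-patch behaviour, not ministat code.

```python
import statistics


def upper_middle(points):
    # Hypothetical stand-in for the pre-patch rule: element n // 2 of the sorted data.
    ordered = sorted(points)
    return ordered[len(ordered) // 2]


# Collect the data set sizes for which the two results disagree.
mismatches = []
for n in range(1, 9):
    data = list(range(1, n + 1))
    if upper_middle(data) != statistics.median(data):
        mismatches.append(n)

print(mismatches)  # [2, 4, 6, 8]
```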

Thanks!
Comment 4 Marcelo Araujo freebsd_committer freebsd_triage 2015-11-23 06:36:57 UTC
Thanks for the feedback.
Yes, you are right with this case!
Comment 5 commit-hook freebsd_committer freebsd_triage 2015-11-24 02:32:02 UTC
A commit references this bug:

Author: araujo
Date: Tue Nov 24 02:30:59 UTC 2015
New revision: 291231
URL: https://svnweb.freebsd.org/changeset/base/291231

Log:
  Compute the median of the data set as the midpoint between the two middle
  values when the data set has an even number of elements.

  PR:		201582
  Submitted by:	Marcus Reid <marcus@blazingdot.com>
  Reviewed by:	imp
  Approved by:	bapt (mentor)

Changes:
  head/usr.bin/ministat/ministat.c
Comment 6 Marcelo Araujo freebsd_committer freebsd_triage 2015-11-24 02:32:47 UTC
Committed, thanks!