Basic Stats in J
My son’s homework the other day involved the basic statistical functions of mean, median, mode and range. One of the most classic array language functions is, of course, calculating the mean, which is famously three functions: mean =. +/ % #
in J.
The range is also pretty simple; you just want the difference between the largest and smallest values in the input, so you get range =. >./ - <./
in J.
The next two are not as clean. I’ll show you what I came up with and then we can compare to what’s going on in the stats/base library.
Median
Let’s start with the median. The idea is you take the middle value from the sorted list. “Middle value” makes I don’t see an obvious trick for finding the middle value in J. I see basically two ways to do it: 1) calculate the midpoint and index it, or 2) use a recursive grossness to tear bits of the start and end. The latter feels more like a functional programming solution with linked lists, so I’m guessing the more array-ish way of doing it is to calculate. But there’s a second problem, which maybe the dear reader has already seen but I had to actually get halfway there before I realized, which is that even length arrays do not have a value at the midpoint. This invites some overthinking.
My son’s book suggests that in this situation you average the two midmost values. On the one hand, it feels to me like the whole point of the median is that it is a value from the set. On the other hand, only having a median 50% of the time feels unsatisfactory. Having two results for median 50% of the time also seems bad. I checked Wikipedia, and they give a definition as a partial function:
if is odd, if is even,
This suggests a direct translation that would involve using agenda @.
on the divisibility of the length. Another approach would be to always get two values and then if they happen to be the same, average them anyway, which feels like less work to me. If the list is odd-length, we should obtain a midpoint with a half index in it, as in a list of length 7 producing a midpoint of 3.5. We can subtract 0.5 from that to get the actual index of 3. However, if the list has a length of 8, we will get a midpoint of 4. Subtracting 0.5 from that will produce 3.5. If we take the floor and ceiling of this, we will get 3 and 4, which are the indices of the values around the middle. So it feels like the right approach here is to find the midpoint minus half, take the floor and ceiling of that, take the values at those indexes in the sorted list and average them. Which is a lot of words to say “find the median” but seems like the only way, going down this road. This yields the slightly
unwieldy verb:
median =. {{ (+/ % #) q {~ (<.,>.) 0.5-~ -: # q =. /:~ y }}
So the idea here is take the argument, sort it, call it q
because we’re going to need it again later, get the length of it, half it, take 0.5 from it, get the floor and ceiling of that number, take those from q
, then average them. I’m not in love with this. Let’s see what we have in the stats/basic file:
min=: <./
max=: >./
midpt=: -:@<:@#
median=: -:@(+/)@((<. , >.)@midpt { /:~)
This is an interesting piece of code. We can see some similarity, though defining max/min and not using them is a choice. For midpoint, we have half of count minus one, which saves a few constants from my method. Same trick for getting the floor and ceiling in one: (<. , >.)
. Nice compositional style here, no need for the actual variables. Also no need for a real average, we just add up the two values and cut in half. All in all, this solution is shorter and makes much better use of the primitives than mine, which isn’t too surprising, and manages a tacit style nicely. I’m not sure how much value is obtained by moving midpoint out though.
Tentative conclusion: pay attention to constants like 1/2. If you know you have two items, you can use halve -:
.
Mode
Now we have a real bummer: the mode. The mode is defined as the most frequent value in the set. We can safely assume that this is going to return a list of results because there can be many items with the most frequent value. I want to utilize key /..
as in a post from a few days ago to get the histogram, basically, of the input, and then we’ll need to utilize the two parts separately: find the largest count, and then use that to lookup the values with the largest count. For some reason I’m picturing this happening inside a verb that gets called dyadically, with the counts on the left and the groups of identical values on the right. It will be very interesting to see how this compares to the method in the stats/basics package.
So first, let’s get the counts with the values:
(#,{.)/.~ q =. 9 1 8 16 42 9 1 2 8 7 7 5 4
2 9
2 1
2 8
1 16
1 42
1 2
2 7
1 5
1 4
Now this is slightly inconvenient, because I would really like to have the counts on one side and the values on the other so I can use insert /
to put a function between them. We wind up needing to transpose this intermediate so we get the counts in the first entry and the values in the second:
|: (#,{.)/.~ q
2 2 2 1 1 1 2 1 1
9 1 8 16 42 2 7 5 4
Now we need a function to find the greatest, which is simply >./
. Now we have a selection problem, which demands filtering the input, which is a job for copy #
. So we test for where the count on the left is greatest, find occurrences of that value in the left, and filter with copy against the right, and the overall solution is this:
mode =. {{ (([ = >./@[) # ])/|: (#,{.)/.~ y }}
mode q
9 1 8 7
This could probably be simplified; we could invert copy and remove some parens, for instance. I’m not entirely sure that we need to transpose the array.
Surprisingly, stats/base does not contain a definition for mode, but on the wiki I found this one: mode=: ~. {~ [: (i. >./) #/.~
which only returns the first value. But let’s look at how it works. First, we calculate the count over the self-groups, which gets us the frequencies in the histogram (but not their values). Then we have a capped fork. I don’t find these especially readable yet. Here it is now clear that we are finding the index of the largest value within the frequencies. Then we are inverting take and passing the nub of the original input to the right of take and the index of the largest value to the left. However, the use of i.
here guarantees we get one result, no matter how many modes our set has. Thus, we get a pretty incomplete answer on our sample input:
mode =. ~. {~ [: (i. >./) #/.~
mode q
9
However, we know the answer we want will have #~
in the position where {~
is, and we know that we should have = where we have i.
, so let’s try that:
mode =. ~. #~ [: (= >./) #/.~
mode q
9 1 8 7
So that appears to have worked, which surprises me a little.
The cap here has actually done us a lot of good because not only has it saved us from writing @:
, it has also saved us from quite a few parentheses, as this demonstrates:
(~. #~ [: (= >./) #/.~) q
9 1 8 7
(~. #~ (= >./) @: #/.~) q
9 1 8 16 42 2 7 5 4
(~. #~ (= >./) @: (#/.~)) q
9 1 8 7
Tentative conclusion: /..
key dyad is very cute, but it’s easier to work with one piece of data at a time. Get better at capped forks, since they can be clearer than at @:
when @:
would require more parentheses. Forks are helpful.