# pytorch_runstats

`torch_runstats` implements memory-efficient online reductions on tensors. Notable features:

- Arbitrary sample shapes beyond single scalars
- Reduction over arbitrary dimensions of each sample
- “Batched”/“binned” reduction into multiple running tallies using a per-sample bin index. This can be useful, for example, in accumulating statistics over samples by some kind of “type” index or for accumulating statistics per-graph in a `pytorch_geometric`-like batching scheme. (This feature is similar to `torch_scatter`.)
- Option to ignore NaN values with correct sample counting

Note

The implementation currently relies heavily on in-place operations for performance and memory efficiency. This probably doesn’t play nicely with the autograd engine, so this is currently likely the wrong library for accumulating running statistics you want to backward through. (See TorchMetrics for a possible alternative.)
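As a rough illustration of the trade-off described in the note, here is a minimal plain-PyTorch sketch. It is not the library's code: `running_sum` and `count` are hypothetical names, and the use of `torch.no_grad()` is an assumption about how one might keep tallies outside autograd. The in-place update avoids allocating new tensors, but the result carries no graph to backpropagate through.

```python
import torch

# Hypothetical running tally, updated in place (illustration only).
running_sum = torch.zeros(1)
count = 0

x = torch.ones(4, 1, requires_grad=True)

with torch.no_grad():
    # In-place update: no new tensor is allocated and no graph is recorded.
    running_sum.add_(x.sum(dim=0))
    count += x.shape[0]

running_mean = running_sum / count
print(running_mean)                # tensor([1.])
print(running_mean.requires_grad)  # False: nothing to backpropagate through
```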

## Examples

### Basic

```
import torch
from torch_runstats import Reduction, RunningStats

# Interspersed ones and zeros with a ratio of 2:1 ones to zeros
data = torch.cat([torch.ones(5), torch.zeros(3), torch.ones(5), torch.zeros(2)])
data.unsqueeze_(-1)

rs = RunningStats(
    dim=(1,),
    reduction=Reduction.MEAN,
)

# Accumulate the statistics over the data in batches
# Note that each call to accumulate_batch also returns the statistic for the current batch:
print(rs.accumulate_batch(data[:5]))  # => tensor([[1.]])
rs.accumulate_batch(data[5:7])
rs.accumulate_batch(data[7:13])
rs.accumulate_batch(data[13:])
print(rs.current_result())  # => tensor([[0.6667]])

# Accumulated data can be cleared
rs.reset()
# An empty object returns the identity for the reduction:
print(rs.current_result())  # => tensor([[0.]])
```
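To see why batch-wise accumulation can give the same answer as a single pass, here is a sketch of one common way to merge a running mean with a new batch mean. It uses plain PyTorch and is only an assumption about how such an update can be written, not a description of what `RunningStats` does internally; `mean` and `n` are illustrative names.

```python
import torch

data = torch.cat([torch.ones(5), torch.zeros(3), torch.ones(5), torch.zeros(2)])

mean = torch.tensor(0.0)  # identity used for an empty tally
n = 0                     # number of samples seen so far

# Same batch boundaries as the example above
for batch in (data[:5], data[5:7], data[7:13], data[13:]):
    n_batch = batch.numel()
    # Weighted merge of the old mean and the new batch mean
    mean = mean + (batch.mean() - mean) * n_batch / (n + n_batch)
    n += n_batch

print(mean)  # tensor(0.6667)
```

Only the current mean and the sample count are stored between batches, which is what makes the reduction memory-efficient.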

### “Binned”

A main feature of `torch_runstats` is accumulating different samples in a batch into different “bins” (that is, different running statistics) based on a provided index:

```
import torch
from torch_runstats import Reduction, RunningStats
data = torch.cat([torch.ones(5), torch.zeros(3), torch.ones(5), torch.zeros(2)])
data.unsqueeze_(-1)
sample_type = torch.cat([torch.zeros(8, dtype=torch.long), torch.ones(7, dtype=torch.long)])
rs = RunningStats(
    dim=(1,),
    reduction=Reduction.MEAN,
)
rs.accumulate_batch(data, accumulate_by=sample_type)
# The first entry is for "bin" (sample_type) 0, the second for 1:
print(rs.current_result()) # => tensor([[0.6250], [0.7143]])
# These values are what we expect:
print(5/8, 5/7)  # => 0.625 0.7142857142857143
```
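For intuition, the same per-bin means can be reproduced in plain PyTorch with `Tensor.index_add_`, which scatters values into slots selected by an index. This is a sketch of the general technique only; whether the library uses this call internally is not stated in this document, and `sums` and `counts` are illustrative names.

```python
import torch

data = torch.cat([torch.ones(5), torch.zeros(3), torch.ones(5), torch.zeros(2)])
sample_type = torch.cat([torch.zeros(8, dtype=torch.long), torch.ones(7, dtype=torch.long)])

n_bins = int(sample_type.max()) + 1
sums = torch.zeros(n_bins)
counts = torch.zeros(n_bins)

# Add each sample (and a count of 1) into the slot given by its bin index
sums.index_add_(0, sample_type, data)
counts.index_add_(0, sample_type, torch.ones_like(data))

bin_means = sums / counts
print(bin_means)  # tensor([0.6250, 0.7143])
```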

### Reduce over arbitrary dimensions

A reduction can also be taken over a sample dimension:

```
import torch
from torch_runstats import Reduction, RunningStats
data = torch.cat([torch.ones(5, 3, 2), torch.zeros(3, 3, 2)], dim=0)
rs = RunningStats(
    dim=(3, 2),
    reduction=Reduction.MEAN,
    reduce_dims=0,  # reduce the sample dimension of size 3
)
rs.accumulate_batch(data)
# Note that the reduction has a bin index (len 1),
# and the sample dimension of shape 2,
# but that the dimension of size 3 has been reduced out:
print(rs.current_result()) # => tensor([[0.6250, 0.6250]])
```
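As a cross-check in plain PyTorch, reducing sample dimension 0 corresponds to reducing batch axis 1, since axis 0 of the batch is the sample index. The sketch below makes that shift explicit; `batch_axes` is an illustrative name, and the result lacks the leading bin dimension that `current_result()` includes.

```python
import torch

data = torch.cat([torch.ones(5, 3, 2), torch.zeros(3, 3, 2)], dim=0)

reduce_dims = (0,)  # indexes within a single sample of shape (3, 2)
# Shift by one to get axes of the batch, and always include the batch axis itself
batch_axes = (0,) + tuple(d + 1 for d in reduce_dims)

result = data.mean(dim=batch_axes)
print(result.shape)  # torch.Size([2])
print(result)        # tensor([0.6250, 0.6250])
```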

### Ignore NaNs

When the `ignore_nan` option is enabled, `RunningStats` will only count and reduce over non-NaN elements:

```
import torch
from torch_runstats import Reduction, RunningStats
NaN = float("nan")
data = torch.tensor([
    [1.0, NaN, NaN],
    [NaN, NaN, NaN],
    [1.0, NaN, 1.0],
    [1.0, 3.0, 1.0],
    [1.0, NaN, NaN],
])
# One bin index per row (sample); an integer list gives an int64 tensor
accumulate_by = torch.tensor([0, 0, 1, 1, 1])
rs = RunningStats(
    dim=(3,),
    reduction=Reduction.MEAN,
    reduce_dims=0,  # reduce the sample dimension of size 3
    ignore_nan=True,
)
rs.accumulate_batch(data, accumulate_by=accumulate_by)
# In the first bin, the mean was taken over only one element,
# the single non-NaN value, giving 1.0
#
# In the second bin, we get the mean of the six non-NaN
# elements: (1 * 5 + 3) / 6 = 1.33333...
print(rs.current_result()) # => tensor([1.0000, 1.3333])
```
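The counting rule can be checked by hand in plain PyTorch: mask out the NaNs, sum what remains, and divide by the number of valid elements per bin rather than by the number of samples. This is a sketch under that assumption, not the library's implementation; `valid`, `filled`, `sums`, and `counts` are illustrative names.

```python
import torch

NaN = float("nan")
data = torch.tensor([
    [1.0, NaN, NaN],
    [NaN, NaN, NaN],
    [1.0, NaN, 1.0],
    [1.0, 3.0, 1.0],
    [1.0, NaN, NaN],
])
accumulate_by = torch.tensor([0, 0, 1, 1, 1])

valid = ~torch.isnan(data)                                 # True where a value is present
filled = torch.where(valid, data, torch.zeros_like(data))  # NaNs contribute 0 to the sum

n_bins = int(accumulate_by.max()) + 1
# Per-bin sum of valid values and per-bin count of valid elements
sums = torch.zeros(n_bins).index_add_(0, accumulate_by, filled.sum(dim=1))
counts = torch.zeros(n_bins).index_add_(0, accumulate_by, valid.sum(dim=1).to(torch.float32))

print(sums / counts)  # tensor([1.0000, 1.3333])
```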

## Class Reference

Currently supported `Reduction`s are:

class `torch_runstats.Reduction(value)`

Enum indicating a reduction over \(N\) values \(x_i\).

Currently supported reductions:

- `Reduction.MEAN`: \(\frac{1}{N}\sum_{i=1}^{N}{x_i}\)
- `Reduction.RMS`: \(\sqrt{\frac{1}{N}\sum_{i=1}^{N}{x_i^2}}\)

Support for bincounting integers and for combined one-pass mean/standard deviation is planned.
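The RMS formula also lends itself to online accumulation, because the sum of squares and the count can each be updated batch by batch and combined at the end. The following is a plain-PyTorch sketch of that idea with illustrative names (`sum_sq`, `n`); it makes no claim about how the library stores its state.

```python
import math
import torch

x = torch.tensor([3.0, 4.0, 0.0, 5.0])

sum_sq = 0.0  # running sum of squares
n = 0         # running sample count
for batch in (x[:2], x[2:]):
    sum_sq += float((batch ** 2).sum())
    n += batch.numel()

running_rms = math.sqrt(sum_sq / n)          # sqrt(50 / 4)
direct_rms = float(x.pow(2).mean().sqrt())   # one-pass reference value
print(running_rms, direct_rms)
```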

The core of the library is the `RunningStats` class:

class `torch_runstats.RunningStats(dim: Union[int, Tuple[int, ...]] = 1, reduction: Reduction = Reduction.MEAN, reduce_dims: Union[int, Sequence[int]] = (), ignore_nan: bool = False)`

Compute running statistics over batches of samples.

- Parameters:
  - `dim`: the shape of a single sample. If an integer, interpreted as `(dim,)`.
  - `reduction`: the statistic to compute.
  - `reduce_dims`: extra dimensions within each sample to reduce over. If an integer, interpreted as `(reduce_dims,)`.

    This is a tuple of dimension indexes that are interpreted as dimension indexes within each sample: `reduce_dims=(1,)` implies that in a batch of size `(N, A, B, C)` with `dim = (A, B, C)` the `N` and `B` dimensions will be reduced over. (To reduce over `A` instead, you would use `reduce_dims=(0,)` to reduce over the first non-batch dimension.)

    By default an empty tuple, i.e., reduce only over the batch dimension.
  - `ignore_nan`: if True, NaNs in the data will be ignored, both in the accumulation and the sample count. If False (default), NaNs will propagate as normal.
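To make the relationship between `dim`, `reduce_dims`, and the per-bin output shape concrete, here is a small pure-Python sketch. `expected_output_dim` is a hypothetical helper written for this illustration, not part of `torch_runstats`; it assumes the output shape is simply the sample shape with the reduced dimensions removed, which matches the shapes shown in the examples above.

```python
def expected_output_dim(dim, reduce_dims=()):
    """Hypothetical helper: sample shape left after removing the reduced dims."""
    if isinstance(dim, int):
        dim = (dim,)
    if isinstance(reduce_dims, int):
        reduce_dims = (reduce_dims,)
    reduced = set(reduce_dims)
    return tuple(size for i, size in enumerate(dim) if i not in reduced)


# dim = (A, B, C) = (2, 3, 5); reduce_dims=(1,) removes B
print(expected_output_dim((2, 3, 5), reduce_dims=(1,)))  # (2, 5)
# Integer arguments are treated as one-element tuples
print(expected_output_dim((3, 2), reduce_dims=0))        # (2,)
print(expected_output_dim(1))                            # (1,)
```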

`accumulate_batch(batch: torch.Tensor, accumulate_by: Optional[torch.Tensor] = None) -> torch.Tensor`

Accumulate a batch of samples into the running statistics.

- Parameters:
  - `batch`: tensor of shape `(N_samples,) + self.dim`. The batch of samples to process.
  - `accumulate_by`: tensor of indexes of shape `(N_samples,)`. If provided, the nth sample will be accumulated into the `accumulate_by[n]`th bin. If `None` (the default), all samples will be accumulated into the first (0th) bin. The indexes should be non-negative integers.
- Returns: tensor of shape `(N_bins,) + self.output_dim` giving the aggregated statistics *for this input batch*. Accumulated statistics up to this point can be retrieved with `current_result()`.

  `N_bins` is `accumulate_by.max() + 1` (the number of bins in the batch) and not the overall number of bins `self.n_bins`.
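The definition of `N_bins` for a batch can be illustrated in plain PyTorch. Note that a bin index that does not occur in the batch still counts toward `N_bins` if a larger index does occur. The variable names here are illustrative only.

```python
import torch

# Bin 1 never appears in this batch, but bin 2 does
accumulate_by = torch.tensor([0, 2, 2, 0])

n_bins_in_batch = int(accumulate_by.max()) + 1
print(n_bins_in_batch)  # 3

# Number of samples per bin in this batch
print(torch.bincount(accumulate_by, minlength=n_bins_in_batch))  # tensor([2, 0, 2])
```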

`batch_result(batch: torch.Tensor, accumulate_by: Optional[torch.Tensor] = None) -> torch.Tensor`

Accumulate a batch of samples into the running statistics.

- Parameters:
  - `batch`: tensor of shape `(N_samples,) + self.dim`. The batch of samples to process.
  - `accumulate_by`: tensor of indexes of shape `(N_samples,)`. If provided, the nth sample will be accumulated into the `accumulate_by[n]`th bin. If `None` (the default), all samples will be accumulated into the first (0th) bin. The indexes should be non-negative integers.
- Returns: tensor of shape `(N_bins,) + self.output_dim` giving the aggregated statistics *for this input batch*. Accumulated statistics up to this point can be retrieved with `current_result()`.

  `N_bins` is `accumulate_by.max() + 1` (the number of bins in the batch) and not the overall number of bins `self.n_bins`.

`current_result() -> torch.Tensor`

Get the current value of the running statistic.

- Returns: A tensor of shape `(self.n_bins,) + self.output_dim`. The nth bin contains the accumulated statistics for all processed samples whose `accumulate_by` was n.

property `dim: Tuple[int, ...]`

The shape of a single input sample for this `RunningStats`.

property `n: torch.Tensor`

The number of samples processed so far in each bin.

- Returns: A `LongTensor` of shape `(self.n_bins,)`.

property `n_bins: int`

The number of `accumulate_by` bins currently maintained by this object.

property `output_dim: Tuple[int, ...]`

The shape of the output statistic for a single bin.

property `reduce_dims: Tuple[int, ...]`

Indexes of dimensions in each sample that will be reduced.

`reset(reset_n_bins: bool = False) -> None`

Forget all previously accumulated state.

This method does *not* forget `self.n_bins` unless `reset_n_bins` is True.

- Parameters:
  - `reset_n_bins`: whether to reset this object to one accumulation bin. This defaults to False on the assumption that a reset object will likely be used to process data with a similar or equal number of bins.

`to(device=None, dtype=None) -> None`

Move this `RunningStats` to a new dtype and/or device.

- Parameters:
  - `dtype`: like `torch.Tensor.to`
  - `device`: like `torch.Tensor.to`