I'm wondering about best practices or approaches for data where, as in the image below, the low-value bins are by far the most common but you are interested in the whole distribution.

Here is a visual of the raw data:

[Image: histogram of the raw data]

When we take the log of the y-axis, we can see potentially more interesting behavior in the higher values. But the log-scaled axis now makes interpretation more difficult.

[Image: the same histogram with a log-scaled y-axis]

It's quite a broad question, but

  • how can we model such distributions?
  • can we fit multiple distributions to the log histogram?

The original dataset can be found here: https://www.kaggle.com/datasets/wilomentena/uk-government-petitions. The plot is for non-rejected petitions.

CC BY-SA 4.0

1 Answer

I'm queasy about logging histogram frequencies, if only because (though not only because) a frequency of zero has no logarithm: where would it show?

You'd also have to worry about whether the bin width is chosen in a good way.

I doubt that thinking about modelling the logged distribution is a good direction to try, for those and other reasons.

On the other hand, there are plenty of good precedents for showing roots of histogram frequencies, as convenient if not natural. A good start is that the root of 0 is 0. John Tukey talked about rootograms in the 1960s, and before him square root scales were used in physics in the early 20th century.
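As a rough illustration of the root scale, here is a minimal Python sketch. It uses synthetic right-skewed data, not the petitions dataset; the helper `binned_counts`, the bin edges, and the lognormal parameters are all made up for the example. It puts raw, square-root, and log10 frequencies side by side, so that empty bins can be seen to keep a well-defined root (0) but no log.

```python
import math
import random


def binned_counts(values, edges):
    """Count values falling in each half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Synthetic, heavily right-skewed stand-in data (an assumption, NOT the real data).
rng = random.Random(0)
values = [rng.lognormvariate(3.0, 2.0) for _ in range(5000)]

# Hypothetical equal-width bins; values beyond the last edge are dropped in this sketch.
edges = [i * 500.0 for i in range(21)]
counts = binned_counts(values, edges)

# Root scale: defined for every bin, including empty ones.
root_counts = [math.sqrt(c) for c in counts]
# Log scale: undefined for empty bins, marked here as None.
log_counts = [math.log10(c) if c > 0 else None for c in counts]

print(f"{'bin start':>10} {'count':>6} {'sqrt':>7} {'log10':>7}")
for left, c, r, lg in zip(edges, counts, root_counts, log_counts):
    lg_text = f"{lg:7.2f}" if lg is not None else "    n/a"
    print(f"{left:10.0f} {c:6d} {r:7.2f} {lg_text}")
```

Plotting `root_counts` against the bin positions would give the kind of display described above. The choice of bin width still matters and is not addressed by this sketch.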

Another direction is to ask about transforming the outcome. Is the number of signatures ever zero? If not, logging the count might be helpful. If so, whether zeros imply a different population is an issue. (For example, tobacco consumption shows many zeros for non-smokers and a skewed distribution for smokers.)
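To make the idea of transforming the outcome concrete, here is a minimal sketch on synthetic data. The variable `signatures`, the lognormal parameters, and the bin width of 0.5 on the log10 scale are assumptions for illustration only. It checks for zeros first and then bins the logged counts rather than logging the frequencies.

```python
import math
import random

# Synthetic positive counts standing in for signatures per petition
# (an assumption for illustration, NOT the real dataset).
rng = random.Random(1)
signatures = [max(1, round(rng.lognormvariate(3.0, 2.0))) for _ in range(5000)]

# Check for zeros before logging: log10(0) is undefined, and zeros may
# point to a different population that needs separate treatment.
n_zero = sum(1 for s in signatures if s == 0)
print(f"zeros in the data: {n_zero}")

# Log the outcome itself; any zeros are left out here and would need their own analysis.
log_sig = [math.log10(s) for s in signatures if s > 0]

# Equal-width bins on the log10 scale (the width is an arbitrary choice).
width = 0.5
n_bins = int(max(log_sig) // width) + 1
counts = [0] * n_bins
for x in log_sig:
    counts[int(x // width)] += 1

for i, c in enumerate(counts):
    print(f"10^{i * width:.1f} to 10^{(i + 1) * width:.1f}: {c}")
```

Frequencies stay on their original scale here, so empty bins are simply shown as 0.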

Whether what looks like possible multimodality is genuine is a good question. It's hard for us to think about without knowing more about data generation: for example, is the sample or the population in some sense heterogeneous?

  • No zeros in this particular dataset; I'll have a look into rootograms. The data should be homogeneous (?), but upon reflection I think the 100,000 bump can be explained by the way the signatures are counted (i.e. a known behavior).
    – WBM
    Commented Aug 1 at 10:56
  • If you follow @NickCox's advice about taking the logs of the number of signatures, you'll also find a "bump" at 10,000.
    – JimB
    Commented Aug 4 at 6:51
