Plotting/modelling a histogram with large bins of small values

Question

I'm wondering about best practices or approaches for data where, as in the image below, the low-value bins are by far the most common, but you are interested in the whole distribution.

Here is a visual of the raw data:

When we put the y-axis on a log scale, we can see potentially more interesting behaviour in the higher values. But now the axes make interpretation more difficult.

It's quite a broad question, but

how can we model such distributions?
can we fit multiple distributions to the log histogram?

The original dataset can be found here: https://www.kaggle.com/datasets/wilomentena/uk-government-petitions
The plot is for non-rejected petitions.

What does number of signatures mean? Can you provide more context on the data? How was data generated? — forecaster, Commented Aug 1 at 10:59
Added the dataset to the post (it's signatures on petitions to the UK government) — WBM, Commented Aug 1 at 12:05

Nick Cox · Accepted Answer · 2024-08-01 11:01:56Z

I'm queasy about logging histogram frequencies, if only because (and not only because) it is unclear where a frequency of zero would show: the log of 0 is undefined.

You'd also have to worry about whether the bin width is well chosen.

I doubt that thinking about modelling the logged distribution is a good direction to try, for those and other reasons.

On the other hand, there are plenty of good precedents for showing roots of histogram frequencies, as convenient if not natural. A good start is that the root of 0 is 0. John Tukey talked about rootograms in the 1960s, and before him square root scales were used in physics in the early 20th century.
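A minimal sketch of the root-frequency idea, using only the Python standard library. The signature counts and bin edges are made up for illustration and are not taken from the petitions dataset; the helper names are hypothetical. The point is only that empty bins get height 0 on the root scale, whereas they would have no position on a log scale.

```python
import math

def bin_counts(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the binned range
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

def root_heights(counts):
    """Bar heights for a root-scale histogram: sqrt(0) is 0, so empty bins stay visible."""
    return [math.sqrt(c) for c in counts]

# Hypothetical signature counts (illustration only)
signatures = [3, 5, 8, 12, 12, 20, 40, 45, 900, 950]
edges = [0, 250, 500, 750, 1000]

counts = bin_counts(signatures, edges)
heights = root_heights(counts)
print(counts)   # [8, 0, 0, 2]
print(heights)  # the two empty bins have height 0.0
```

The heights can then be passed to any bar-plotting routine in place of the raw counts. The choice of bin edges still matters just as much as for an ordinary histogram.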

Another direction is to ask about transforming the outcome. Is the number of signatures ever zero? If not, logging the count might be helpful. If so, whether zeros imply a different population is an issue. (For example, tobacco consumption is many zeros for non-smokers and a skewed distribution for smokers.)
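Transforming the outcome rather than the frequencies could be sketched as follows, again with the standard library only. The data are hypothetical, `log10_bin_counts` is an illustrative helper rather than anything from the thread, and the function deliberately refuses zeros, since those would need separate treatment as discussed above.

```python
import math
from collections import Counter

def log10_bin_counts(values, bin_width=0.5):
    """Bin log10(value) into bins of the given width.

    Returns {left edge of bin on the log10 scale: count}, listing non-empty bins only.
    """
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    bins = Counter(math.floor(math.log10(v) / bin_width) for v in values)
    return {round(k * bin_width, 10): n for k, n in sorted(bins.items())}

# Hypothetical signature counts (illustration only)
signatures = [3, 5, 8, 12, 12, 20, 40, 45, 900, 950, 10500, 101000]

log_counts = log10_bin_counts(signatures, bin_width=1.0)
print(log_counts)  # {0.0: 3, 1.0: 5, 2.0: 2, 4.0: 1, 5.0: 1}
```

Each key is a power of ten, so the bin starting at 4.0 holds counts from 10,000 up to just under 100,000. On this scale the long right tail is spread across several bins instead of being squeezed against the axis.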

Whether what looks like possible multimodality is genuine is a good question. It's hard for us to think about without knowing more about data generation: for example, is the sample or the population in some sense heterogeneous?

No zeros in this particular dataset, I'll have a look into rootograms. The data should be homogeneous (?), but upon reflection I think the 100,000 bump can be explained by the way the signatures are counted (i.e. a known behavior) — WBM, Commented Aug 1 at 10:56
If you follow @NickCox's advice about taking the logs of the number of signatures, you'll also find a "bump" at 10,000. — JimB, Commented Aug 4 at 6:51
