Use orjson instead of json, when available #17955
Conversation
Force-pushed from f6f7cb9 to 82969b8
For `mypy -c 'import torch'`, the cache load time goes from 0.44s to 0.25s as measured by the manager's data_json_load_time. If I time dump times specifically, I see a saving of 0.65s to 0.07s. Overall, a pretty reasonable perf win -- should we make it a required dependency? I don't know if the sqlite cache path is used at all (what's the status?), but let me know if I need a cleverer migration than renaming the table. See also #3456
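For reference, here is a minimal sketch of the "use orjson when available, fall back to stdlib json" pattern this PR is about. It is not mypy's actual code (the real `json_dumps` is quoted later in the thread); names and the fallback details here are illustrative only.

```python
import json
from typing import Any

try:
    import orjson  # optional dependency; absence is fine
except ImportError:
    orjson = None

def json_dumps(obj: object) -> bytes:
    # orjson returns bytes directly and is substantially faster than stdlib json.
    if orjson is not None:
        return orjson.dumps(obj, option=orjson.OPT_SORT_KEYS)
    # Stdlib fallback: sort keys and drop whitespace for compact, deterministic output.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

def json_loads(data: bytes) -> Any:
    if orjson is not None:
        return orjson.loads(data)
    return json.loads(data)
```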
Force-pushed from b0db034 to 4103892
Sounds very promising! I can also perform some measurements.
I wonder how well maintained orjson is, and whether it ships binary wheels for all the platforms we care about. We might be adding ARM Linux wheels at some point, and it would be nice if all our dependencies shipped binary wheels (though it's perhaps not essential for Linux, as long as there are x86-64 wheels).
Sqlite caching is very much used, and I'm thinking of enabling it by default in the future. In certain use cases it's significantly faster than a file-per-module cache, and we use it at work.
            return orjson.dumps(obj, option=orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS)  # type: ignore[no-any-return]
        else:
            # TODO: If we don't sort keys here, testIncrementalInternalScramble fails
            # We should document exactly what is going on there
Lmk if you know off the top of your head why sorting keys is important!
It might be just so that tests produce the keys in a predictable order on older Python 3 versions where dict didn't preserve insertion order.
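For what it's worth, a tiny stdlib-only illustration (my own, not from the PR) of why sorted keys matter for tests that compare serialized caches: with sorting, the output is independent of dict insertion order, which is the same guarantee `OPT_SORT_KEYS` gives on the orjson side.

```python
import json

d1 = {"b": 1, "a": 2}
d2 = {"a": 2, "b": 1}  # same mapping, different insertion order

# Without sorting, the serialized form depends on insertion order...
assert json.dumps(d1) != json.dumps(d2)
# ...with sorting, both serialize identically, so cache comparisons are deterministic.
assert json.dumps(d1, sort_keys=True) == json.dumps(d2, sort_keys=True)
```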
I think orjson does ship wheels for all platforms we care about. It would be nice if Python packaging had a concept of a "default" extra for this kind of thing, though.
I'm seeing a 10-15% improvement to the performance of …
Okay, I think this PR should be good to go.
Questions to resolve now:
- Is just using `files2` in sqlite a sufficient migration? (see the sketch below)
Open questions that we can resolve later:
- Documenting why sort_keys is important
- Adding an extra that includes orjson (or relying on it by default)
- Adding test coverage for the optional feature
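To make the `files2` question concrete, here is a rough sketch of why bumping the table name can stand in for a migration. The schema and helper names are hypothetical, not mypy's actual metastore code: the point is that an old cache database simply has no `files2` table, so the new code sees an empty cache and rebuilds, matching the usual behaviour when cache formats change between mypy versions.

```python
import sqlite3
from typing import Optional

TABLE = "files2"  # bumped from "files" because the stored payload format changed

def connect(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Creates the new table if missing; an old database keeps its "files" table,
    # which is simply never consulted again.
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {TABLE} (path TEXT PRIMARY KEY, mtime REAL, data BLOB)"
    )
    return conn

def read(conn: sqlite3.Connection, path: str) -> Optional[bytes]:
    row = conn.execute(f"SELECT data FROM {TABLE} WHERE path = ?", (path,)).fetchone()
    # Old caches yield no rows here, i.e. a cache miss rather than a hard failure.
    return row[0] if row else None
```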
This seems fine. It's an internal implementation detail, and caches aren't compatible between mypy versions. Can you check if …
Thanks, there was a missing spot where I'd forgotten to change to files2. I'm a little confused about how cache-convert is meant to work; I think it might be a little broken on master? Looking...
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
Okay, fixed the cache convert problem on master in #17974.
def json_dumps(obj: object, debug: bool = False) -> bytes:
    if orjson is not None:
        if debug:
            return orjson.dumps(obj, option=orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS)  # type: ignore[no-any-return]
@hauntsaninja @JukkaL I think that Mypy will not use too much memory, because the memory will be released somewhere sooner or later (and, obviously, the cache is not large enough to eat a lot of memory), but keep in mind that
python -c 'import orjson; [orjson.dumps(i) for i in range(30000000)]'
python -c 'import orjson; [orjson.dumps(i).decode() for i in range(30000000)]'
are very different things in terms of how the memory is used and when it is released.
The first command keeps all dumped objects in memory, plus a crazy memory overhead: it needs 20+ GiB of memory to run, while the second command eats only ~2 GiB.
As I said, Mypy should not be affected because the memory will be freed (I hope, ha-ha), so this PR should be great. Actually, this problem can only happen if we `dumps` everything in a single function. Just a friendly heads-up to check the memory usage in other applications if you ever need to `dumps` lots of objects in a `for` loop in a single function :)
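If someone wants to reproduce this locally, a rough measurement harness might look like the following. This is my addition, not from the thread; it is Unix-only because of the `resource` module, `ru_maxrss` is kilobytes on Linux but bytes on macOS, and the exact numbers will vary by platform and orjson version.

```python
import resource
import sys

import orjson  # assumes orjson is installed

decode = "--decode" in sys.argv  # pass --decode to measure the str variant
results = [
    orjson.dumps(i).decode() if decode else orjson.dumps(i)
    for i in range(10_000_000)  # smaller than the 30M above, same qualitative effect
]
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"kept {len(results)} objects; peak RSS: {peak} (platform-dependent units)")
```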
Interesting, that's probably from ijl/orjson#483 (comment). Looks like that's not resolved; someone should open a PR against orjson fixing the thing godlygeek points out.