bpo-19217: Speed up assertEqual on long sequences #27434
base: main
Conversation
I'd recommend changing the PR title to something affirmative, e.g. "Speed up assertEqual on long sequences"

I will! That's definitely more descriptive, thank you.
To sum up comments:
- Please add the new variant to assertMultiLineEqual as well;
- Please adapt the test cases to be pathological (you can use my suggestion) with the original code for both strings and lists;
- Please change the test to simply execute the diff, don't measure time at all. If it becomes too slow, we'll know just by the fact it became super slow.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the requested phrase. And if you don't make the requested changes, you will be poked with soft cushions!
@ambv before I go any further, I went ahead and made the change from https://github.com/jdevries3133/cpython/tree/bpo-19217-ideas (see https://bugs.python.org/issue19217). By the way, there is one other problematic use, at Lines 1134 to 1136 in 48a6255.
Reproducing the issue is similar to the others:

```python
import unittest

class TestDictDiffSlow(unittest.TestCase):
    def test_slow_assert_equal_dict(self):
        d1 = {i: i * 2 for i in range(10000)}
        d2 = {i: i for i in range(10000)}
        self.assertDictEqual(d1, d2)

if __name__ == '__main__':
    unittest.main()
```

I'm happy to fix all the failing tests, and there are quite a few.
Other Options

I don't know if these are any good, but maybe there are other options to fix this bug:
- Check size of input: check the size of the input value before passing it to the diff;
- Make …
Good thinking about having a threshold above which we can switch to unified_diff. I guess what I'm saying is that it isn't at all clear how our threshold should be calculated. How about this:

As part of testing here you should deliberately try to come up with examples where the "25 diff lines" limit is still too big. If you can't find any after a good round of trying to break it, then maybe you can double the limit. And try breaking it then. I mean, you can spend as much or as little time on this as you wish. But I agree with you that we can tweak this into a solution that is both fast and preserves nice diffs for most use cases.
* Now, we switch between difflib.ndiff and difflib.unified_diff based on input size.
* Threshold is sort of arbitrary, but seems to be working in the limited test cases I've written.
* The full test suite is passing, with only very minor tweaks to existing tests!
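The switching idea described in these commit notes can be sketched roughly as follows; the threshold value and the `make_diff` helper name are invented here for illustration, not the PR's actual code.

```python
import difflib

# Hypothetical threshold; the PR's actual cutoff differs and is the
# "sort of arbitrary" value mentioned above.
DIFF_THRESHOLD = 1000

def make_diff(a_lines, b_lines):
    """Pick a diff style by input size (a sketch of the switching idea).

    ndiff gives richer, annotated output but can be quadratic or worse,
    so fall back to the usually-cheaper unified_diff for large inputs.
    """
    if max(len(a_lines), len(b_lines)) > DIFF_THRESHOLD:
        return list(difflib.unified_diff(a_lines, b_lines))
    return list(difflib.ndiff(a_lines, b_lines))

small = make_diff(["a\n", "b\n"], ["a\n", "c\n"])   # takes the ndiff path
big = make_diff(["x\n"] * 2000, ["y\n"] * 2000)     # takes the unified_diff path
```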
@ambv I converted this to a draft, but I do have an implementation of what we discussed now. You'll notice a few TODOs; this is why I've marked it as a draft.
I like where this is going. Looking forward to the next iteration.
I haven't changed my mind since the last time I commented on one of the related issue reports (which was some years ago): if unittest is determined to compare strings, regardless of the types of the arguments passed to assertEqual(), then it should define the way it constructs such strings, and use its own dirt-simple, worst-case linear-time string diff algorithm. Just, e.g., look for the first mismatching character.
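The "dirt-simple, worst-case linear-time" idea might look something like this sketch; the function name and message format are invented here, not anything from unittest:

```python
def first_mismatch_message(a: str, b: str, context: int = 10) -> str:
    """Report only the first differing index; worst-case linear time."""
    if a == b:
        return "strings are equal"
    n = min(len(a), len(b))
    # First index where the strings disagree (or n if one is a prefix
    # of the other).
    i = next((k for k in range(n) if a[k] != b[k]), n)
    return (f"strings differ at index {i}: "
            f"{a[i:i + context]!r} != {b[i:i + context]!r}")

msg = first_mismatch_message("spam and eggs", "spam and ham")
```

One scan, no backtracking: the cost is bounded by the shorter input's length, no matter how pathological the data.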
You can try to out-think difflib, but at best you'll make "the problem" less likely. If that's good enough for you, fine by me 😉. The approach here, to my eyes, should work pretty well for most cases.
Note that people here are generally comparing the very best case for unified diff to the very worst case for ndiff: no strings in common. Unified diff does only one layer of differencing, and if no elements are in common it has next to nothing to do. But ndiff does two layers: if there are no strings in common, it goes on to compare every string in one argument to every string in the other, to find a pair "closest to" matching.
But unified diff can be quadratic time too, when there are lots of matching subsequences.
It's convoluted to construct such cases, because the autojunk=True default works against it (but also damages the quality of the diffs). Here's a simple demonstration, forcing autojunk=False and using so-called "Fibonacci strings", which are a worst case for many string-matching algorithms:
```python
from difflib import SequenceMatcher
from time import perf_counter as now

fs = ['a', 'b']
while True:
    a = fs[-2]
    b = fs[-1]
    start = now()
    r = SequenceMatcher(None, a, b, autojunk=False).ratio()
    finish = now()
    print(len(b), finish - start)
    fs.append(a + b)
```
The much-worse-than-linear behavior then quickly becomes obvious at non-trivial, but still smallish, string lengths:
```
...
144 0.0011279999998805579
233 0.0033622999999352032
377 0.008270400000583322
610 0.027310899999974936
987 0.05379209999955492
1597 0.1655982000002041
2584 0.4110897999998997
4181 1.1430293000003076
6765 2.670060499999636
10946 7.549530700000105
17711 19.679138700000294
28657 51.00185919999967
46368 137.25895500000024
75025 383.6125170999994
...
```
That was done on a far-from-quiet machine, but the point should be obvious enough.
Actually, using:

```python
from difflib import SequenceMatcher
from time import perf_counter as now

xs = 'x'
while True:
    start = now()
    sm = SequenceMatcher(None, xs, xs, autojunk=False)
    r = sm.ratio()
    finish = now()
    print(len(xs), finish - start)
    xs *= 2
```
We've been calling difflib from assertEqual since probably ~2009. This issue & PR is just trying to improve upon corner cases where things go wrong for some people. Most people don't run into this, or at least frustratingly work around it when they first do; this should ideally just reduce the need for that going forward. An ultimate failsafe, if the diff computation is taking too long, would be to bail out of the computation after "a short while" and revert to a simple linear "first difference is here" message. Doing that requires https://bugs.python.org/issue24904, unless diffing were to be sent to another process (bad idea) that could be killed.

Unified diff is not "fast", just "less likely slow" or "slow in different circumstances". So this doesn't ensure the problem never happens, it just reduces its chance.
I tweaked the wording in the NEWS entry a little.
approximately O((diff)^2), where `diff` is the product of the number of
differing lines and the total length of differing lines. On the other
hand, unified_diff's cost is the same as the cost of producing `diff`
by itself: O(a + b).
Note that this isn't true. Both have worst-case quadratic time, but on different kinds of input, and it's less likely to get provoked in the simpler kind of differencing unified_diff tries.

Well, actually 😉, the possible worst case of ndiff is worse than that, at least cubic time.
And another demonstration:

```python
from difflib import ndiff
from time import perf_counter as now
import sys

sys.setrecursionlimit(2000)  # so the 512 case completes

count = 16
while True:
    ax = [f"a{i:010d}" for i in range(count)]
    bx = [s.replace("a", "b") for s in ax]
    start = now()
    ignore = list(ndiff(ax, bx))
    finish = now()
    print(count, finish - start)
    count *= 2
```
It sounds like we're basically trading one regressive performance situation for another.
Someone can always wind up unhappy. Without the ability to bail out if computation is taking too long, we should take Tim's suggestion of going to a linear "first differing element" highlight approach when inputs are "too large", followed by our generic a != b printing code.
I blindly suggest: max(len(a), len(b)) > 500 prevents calling unified_diff, and max(len(a), len(b)) > 200 prevents calling ndiff, just based on the worst-case timings.

Those default values used for tuning could be TestCase attributes that people could override themselves if they want (assumed to be rare), much like maxDiff is today.
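Those overridable limits could look roughly like this; the attribute and method names below are invented for illustration (of the attributes shown, only the analogy target, maxDiff, exists in unittest today):

```python
import difflib
import unittest

class DiffLimitedTestCase(unittest.TestCase):
    # Hypothetical attributes mirroring the suggested limits, overridable
    # per subclass or per instance, much like the real maxDiff attribute.
    maxNdiffSize = 200
    maxUnifiedDiffSize = 500

    def _sequence_diff(self, a, b):
        size = max(len(a), len(b))
        if size <= self.maxNdiffSize:
            return "\n".join(difflib.ndiff(a, b))
        if size <= self.maxUnifiedDiffSize:
            return "\n".join(difflib.unified_diff(a, b, lineterm=""))
        # Too big for any real diff: report only the first differing index.
        n = min(len(a), len(b))
        i = next((k for k in range(n) if a[k] != b[k]), n)
        return f"first difference at element {i}"

case = DiffLimitedTestCase()
small = case._sequence_diff(["a"], ["b"])            # ndiff
medium = case._sequence_diff(["x"] * 300, ["y"] * 300)  # unified_diff
huge = case._sequence_diff(["x"] * 1000, ["y"] * 1000)  # linear fallback
```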
You can get sharper estimates by cheating a bit: build your own matcher,

```python
sm = difflib.SequenceMatcher(None, a, b)
```

That much is worst-case linear time. Then

```python
b2j = sm.b2j
sum(len(b2j[line]) for line in a if line in b2j)
```

is the total number of times a unified diff will go around its innermost loop. It's cheating because b2j is an internal attribute.

If that's "cheap enough", you can go on to take a guess at worst-case ndiff time (which is never cheaper than unified diff time). Go through the opcodes; for each

```python
("replace", i1, i2, j1, j2)
```

opcode,

```python
len1 = i2 - i1
len2 = j2 - j1
ouch = len1 * len2 * min(len1, len2)
```

sets `ouch` to a worst-case bound on the work for that block. If that's "cheap enough", ndiff can be called without worry.
I should add that my sketch of ndiff's upper bound ignores some of its costs. As a practical matter, though, under the current implementation I think that can be ignored.
Wow, thank you Tim and Gregory – it's a joy to learn from your discussion on this issue; I'm having fun! So, here's basically what we have on the table. To graph it out:
So, first things first, @tim-one, do you have any interest in having Option #2 be part of the difflib API? If so, there's nothing more to do here – the path forward would be to add that feature to difflib and consume it from unittest. Otherwise, I feel like this is the path forward I'd take with this PR, should continuing appear to be the best option:

Note that I'm thinking not to implement Tim's O(n) cost calculation and "first differing element" style diff for expensive inputs. I wouldn't rule it out; I just feel like this heuristic in combination with the autojunk heuristic doesn't leave edge cases behind, but that assumption might turn out to be incorrect – it'd need to be tested more thoroughly. Tim and Gregory, what do you think?
No, I don't see folding any of this into difflib. As far as I'm concerned, the user wrote a low-quality test if the things it's asserting are equal are so large that it's not dead obvious what to show them if the assertion fails. I'm fine with dumping a million lines of text with no indication of where a difference may be - it's what they asked for, after all 😉. As a thoroughly practical matter, why not just use unified diff, period? It's hard to provoke into pathological behavior, provided autojunk is left at its default. It seems bizarre on the face of it, too, to expect a user to make sense of there being multiple diff formats produced, depending on things they have no understanding of.
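For concreteness, here is the same small change rendered by both generators, to show how different the two formats look to a user:

```python
from difflib import ndiff, unified_diff

a = ["one\n", "two\n", "three\n"]
b = ["one\n", "deux\n", "three\n"]

# ndiff marks every input line with "  ", "- ", or "+ " (and sometimes
# adds "? " intraline annotation lines).
ndiff_out = "".join(ndiff(a, b))

# unified_diff produces the familiar patch-style hunks with @@ headers
# and only a few lines of surrounding context.
unified_out = "".join(unified_diff(a, b))
```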
BTW, if you want to be more extreme, I was wrong: the …
The only issue is (if I understand correctly) that currently, millions of lines of text are not dumped as they're generated. Rather, the program hangs - because the big diff is not streamed to stdout as it's being generated; it's just being loaded into memory. Is there a way to just blast the diff into stdout as it's being generated instead of holding it in memory? This would be a better user experience imo, because at least the user can see what is happening. To your point, users are obviously being a bit abusive of the testing library, but it should also be assumed that assertEqual will do something other than just lock up, ideally in linear time. Maybe we can dispatch diff generation to another thread, and quit generating the diff after a timeout? I agree that it's more of an issue if the test suite is gross and convoluted, but there are a lot of gross and convoluted tests out there. I see a lot of generators in difflib, but I'm not sure if something like this is possible? I'm sorry to backpedal to entirely new solutions, and I know this isn't really the place to discuss it, but your criticism is 100% valid and I'd like to forge a path forward... moreover, I want to slay the 9-year-old bug :(
I have tried this; it causes a lot of breakage of unittest's own test suite, which just makes it annoying, but it could be done. Hopefully asserting against failed test output is not a thing outside of unittest's own test suite…
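On the "generators" observation: difflib's diff functions are indeed generators, so a consumer can print lines as they are yielded rather than materializing the whole diff; whether useful output actually appears early depends on the algorithm's internal order of work. A minimal sketch (the `stream_diff` helper is invented here):

```python
import io
import sys
from difflib import unified_diff

def stream_diff(a, b, out=sys.stdout):
    """Emit diff lines as the generator yields them, without buffering."""
    for line in unified_diff(a, b, lineterm=""):
        print(line, file=out)

# Capture the streamed output for demonstration.
buf = io.StringIO()
stream_diff(["a", "b"], ["a", "c"], out=buf)
captured = buf.getvalue()
```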
Non-linear diff algorithms don't work that way. It isn't a sequential algorithm; it's trying to find a best/minimal-ish fit of differences. Otherwise the diff between "1 3 5 7 9" and "1 3 4 5 7 9" would be "-5 -7 -9 +4 +5 +7 +9" if done via a linear O(n) algorithm. Sure, someone would see it when running tests locally in their terminal, but so much of testing is done on CI test automation systems these days where it's all or nothing and you don't bother to get a result or see progress - you just wait for the entire suite to give a pass or fail, and if it fails you look at its processed output and ultimately logs if needed. "Just use unified_diff" is one solution, as Tim notes. Nothing wrong with that. It'd work (yay autojunk logic?)! One other thing to consider is that we intentionally limit the size of printed artifacts for the base case of assertEqual via the helpers in https://github.com/python/cpython/blob/main/Lib/unittest/util.py. 100% agreed that people get away with writing test cases that go overboard today. But this is easy to do when writing tests that pass, as it's only on failures winding up in a poor-behavior corner case of the library that you'd even think about it being an issue. Until then everything is extremely convenient to all involved.
Ironically, difflib used to - long ago - pump out differences "left to right" as it went along. The algorithm is based on finding the longest contiguous matching substring, then recursively doing the same on the pieces "to the left" and "to the right" of that match. So there's nothing in theory to prevent delivering the deltas on the left part before the right part starts. But, as the comments in the code note, the recursive implementation was replaced. It's still the case that ndiff's "fancy replace" algorithm can blow the recursion stack, though. In any case, @gpshead, as to spinning off a thread to do the diff, that's why this issue report will never be closed - everyone who looks at it eventually refuses to settle for a "good enough!" partial solution 😉.
This has merge conflicts now.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers, that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the requested phrase.
@tim-one I have not stopped thinking about this problem.
OK, so it seems pretty well established that any heuristic will only shrink the set of possible inputs that would cause a big slowdown. My question to Tim and the other core developers in this thread is: do you see any possible solution that could cause this ticket to be closed? In this category (ticket closers), I think that spawning, babysitting, and possibly aborting a worker thread is the most bulletproof idea. The only thing I worry about is whether Python's execution model will cause the parent thread to be blocked while the diffing library dead-spins, causing a race condition where the child can't be aborted. I think this wouldn't be the case, but I'm sure that someone here can advise as to whether that might be a problem. On the broad topic of heuristics, it seems clear to me that any heuristic is effectively just a band-aid that won't block the full set of problematic inputs without being extremely aggressive, like the heuristic that @gpshead suggested:
At the same time, I consider an extremely aggressive heuristic, or simply abandoning the rich diff output, to be close to a breaking change in its own right. To wrap up the heuristic discussion, it seems like the foregone conclusion is that heuristics are just icing on the cake - they don't address the first-principles issues, and a heuristic too aggressive might constitute a breaking change. Tim & other core devs in the thread, my ultimate question is: do you sponsor a thread-based solution as a complete solution to the root problem we're trying to address here? Is there any other solution you'd sponsor as a complete enough solution to close the ticket? If not, are there any solutions that you'd sponsor enough to want merged, since they improve the ergonomics here, even if they're not a complete solution?
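For what it's worth, the thread-based idea being asked about might be sketched like this; the names are invented, and note the caveat raised earlier in the thread: Python threads cannot be killed, so on timeout the daemon worker keeps burning CPU in the background. This is a sketch of the idea, not a real fix.

```python
import threading
from difflib import ndiff

def diff_with_timeout(a, b, timeout=1.0):
    """Compute ndiff in a worker thread; give up waiting if it takes too long.

    On timeout we only stop *waiting*: the daemon thread keeps running,
    which is exactly the weakness discussed in this thread.
    """
    box = {}

    def work():
        box["diff"] = list(ndiff(a, b))

    worker = threading.Thread(target=work, daemon=True)
    worker.start()
    worker.join(timeout)
    if "diff" not in box:
        return [f"(diff suppressed: took longer than {timeout}s)"]
    return box["diff"]

quick = diff_with_timeout(["a"], ["b"], timeout=5.0)
```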
No threads. Stdout from tests is not an API; anyone depending on that has made a mere change detector, and that's their problem. Regardless, we wouldn't significantly change the output in a bugfix backport. A heuristic that avoids potentially regressive performance behaviors by switching to something simpler is the best that can be done.
This incorporates changes started by @eamanu:
Additionally, I added two commits to make this ready to merge.
https://bugs.python.org/issue19217