gh-146385: Switch back to re to detect shlex.quote slow path#146408
gh-146385: Switch back to re to detect shlex.quote slow path#146408bonzini wants to merge 1 commit intopython:mainfrom
re to detect shlex.quote slow path#146408Conversation
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
@AA-Turner Can you check if the results are consistent on Windows as well please? |
|
Here is what I suggest:
I will also try to check what happens on older systems (openSUSE + older GCC/clang). |
Commit 06a26fd ("pythongh-118761: Optimise import time for ``shlex`` (python#132036)") when the input has to be quoted. This is because the regular expression search was able to short-circuit at the first unsafe character. Go back to the same algorithm as 3.13, but make the "import re" and compilation of the regular expression lazy. Testing s.isascii() makes shlex.quote() twice as fast in the non-ASCII case, but costs up to 25% of the full run time (because it necessitates an earlier isinstance check) if the string *is* ASCII. The latter is probably the common case, so drop the check.
|
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
|
Yeah I didn't use |
|
For now, let's check whether the regression is consistent across platforms. |
It is, and also it's really a complexity change. |
|
What I want to know is why Adam reported numbers that went in the other direction. Ideally I also want to know the stdev of the numbers using |
|
I suggested in #132036 (comment) to replace the
|
vstinner
left a comment
There was a problem hiding this comment.
I would prefer a more regular global variable for _find_unsafe regex:
_find_unsafe = None
def quote(s):
"""Return a shell-escaped version of the string *s*."""
global _find_unsafe
if not s:
return "''"
if _find_unsafe is None:
# deferred import and compilation to optimize import time
import re
_find_unsafe = re.compile(r'[^\w@%+=:,./-]', re.ASCII).search
if _find_unsafe(s) is None:
return s
# use single quotes, and put single quotes into double quotes
# the string $'b is then quoted as '$'"'"'b'
return "'" + s.replace("'", "'\"'\"'") + "'"|
Can you provide a benchmark testing the no-quote and quote cases?
Sadly, Adam didn't provide the benchmark code. |
It's in the linked issue #146385, though it's quite basic (using I used this idiom because it decays to exactly the code in 3.13, but I can certainly change it if there's consensus. |
Yes, but I wanted the pyperf benchmark to see the numbers in terms of stdev and mean. |
|
I converted the timeit benchmark to a pyperf benchmark: import pyperf
import shlex
runner = pyperf.Runner()
runner.bench_func('quote: with spaces', shlex.quote, 'the quick brown fox jumps over the lazy dog')
runner.bench_func('no quote: without spaces', shlex.quote, 'thequickbrownfoxjumpsoverthelazydog')
runner.bench_func('quote: non-ASCII', shlex.quote, 'mötley')
runner.bench_func('no quote: short', shlex.quote, 'a')I ran the benchmark on Python 3.13 and 3.15 provided by Fedora 43 (optimized by LTO+PGO). I was lazy and didn't use CPU isolation for this benchmark. py313:
py315:
Comparison table using Python 3.13 as reference:
Results in a REPL: >>> print(shlex.quote('the quick brown fox jumps over the lazy dog')) # quote
'the quick brown fox jumps over the lazy dog'
>>> print(shlex.quote('thequickbrownfoxjumpsoverthelazydog')) # no quote
thequickbrownfoxjumpsoverthelazydog
>>> print(shlex.quote('mötley')) # quote
'mötley'
>>> print(shlex.quote('a')) # no quote
a |
|
I ran a benchmark on this PR using Python built with ref:
change:
Comparison table:
|
|
Thanks, the results make sense - as mentioned in the description of the PR, I am ignoring the non-ASCII case on purpose and optimizing for the other three, given the typical uses of |
Commit 06a26fd ("gh-118761: Optimise import time for
shlex(#132036)") when the input has to be quoted. This is because the regular expression search was able to short-circuit at the first unsafe character.Go back to the same algorithm as 3.13, but make the "import re" and compilation of the regular expression lazy.
Testing
s.isascii()makesshlex.quote()twice as fast in the non-ASCII case, but costs up to 25% of the full run time (because it necessitates an earlierisinstancecheck) if the string is ASCII. The latter is probably the common case, so drop the check.shlex.quotefrom 3.13 to 3.14 #146385