Skip to content

Different behavior between repo.remote().url and repo.remote().urls when dealing with UNC paths #1019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
HudsonMC16 opened this issue Jun 8, 2020 · 12 comments

Comments

@HudsonMC16
Copy link

Not sure if this is specifically realted to UNC paths or not, but here is what I'm observing.

Clone a repo from a networked server to your local machine. Then:

>>> from Pathlib import Path
>>> path_str = local_repo.remote('origin').url
>>> print(path_str)
'\\\\\\\\network\\\\path\\\\to\\\\remote\\\\repo'
>>> Path(path_str).exists()
False
>>> Path(path_str).resolve()
WindowsPath('C:/network/path/to/remote/repo')

whereas:

>>> from pathlib import Path
>>> path_str = next(local_repo.remote('origin').urls)
>>> print(path_str)
'\\\\network\\path\\to\\remote\\repo'
>>> Path(path_str).exists()
True
>>> Path(path_str).resolve()
WindowsPath('//network/path/to/remote/repo')

Thanks!

@Byron
Copy link
Member

Byron commented Jun 9, 2020

Thanks a lot for posting this here!

I had a quick look and believe the reason is the following: When asking for urls git will be invoked to ask for its view of the world.

When asking for url it will use the potentially manipulated path that was used to instantiate the Repo instance.

The issue seems to be stemming from GitPython incorrectly transforming the UNC path, potentially treating it like some sort of Linux path.

@HudsonMC16
Copy link
Author

So seems it would make sense for each function to access the url in the same way. Do we have a preference? Should the .url code be fixed to quit mangling the paths, or be rewritten to invoke git, similar to .urls?

@Byron
Copy link
Member

Byron commented Jun 10, 2020

I think it does make sense that their output is consistent, with urls being correct in this case.

Since calling implementing url like next(urls) is not the same both in terms of performance nor (probably) the result itself, I would clearly prefer url to be fixed to handle UNC paths correctly on windows. This might have the positive side-effect of fixing additional bugs that that stem from incorrect path handling.

@HudsonMC16
Copy link
Author

Maybe I'm being dense, and I would love to try and contribute, but I am not seeing where Remote inherits its url property. The code you linked above was from the Submodule class?

@Byron
Copy link
Member

Byron commented Jun 12, 2020

Sorry for the url link, indeed it should rather have been a link to this location, which is the fallback in case the attribute does not exist. In that case, it seems the URL is read directly from the configuration, _config_reader.get(attr). That's interesting, as I would be surprised if GitPython would do any processing there. Can it be that the URL attribute is indeed an incorrect path?

@HudsonMC16
Copy link
Author

Entry from the actual git config file looks like this:

[remote "origin"]
	url = \\\\network\\path\\to\\repo
	fetch = +refs/heads/*:refs/remotes/origin/*

It looks like gitpython implements its own configparser inherited from configparser's RawConfigParser, including its own _read method which loops over the config file line by line, but opens it using the 'rb' mode.

This appears to be where the path is getting messed up. Opening the file using the 'rt' mode does not mangle the path, but I don't know enough about the format of git config files or any backwards compatibility issues gitpython might be dealing with to know if swapping 'rb' for 'rt' is a good idea. I expect not? Comments here indicate binary reading is assumed, but I'm not sure why.

@Byron
Copy link
Member

Byron commented Jun 17, 2020

Thanks for the research! This means that if we would use "rt" instead, the issue would somewhat magically disappear.

The reason the file is read as binary is that otherwise, the operating system (or Python, maybe) makes assumptions about the encoding of the text file, even though git itself doesn't know or care. Therefore reads can fail even though otherwise they won't.

GitPython obviously tries to decode the file as UTF-8 effectively on a line-by-line basis, which may fail as well so it isn't necessarily better.

Something I don't understand is how "RT" can clean up a path like "\\path\dir" into "\path\dir", which seems way more than a decoder would ever do.

In a way, I don't even understand why such a path would be written in the first place, unless backslash is a reserved character in git config files.

@HudsonMC16
Copy link
Author

HudsonMC16 commented Jun 17, 2020

So, escape characters on different platforms and encodings make my head spin, but I expect this is a windows-only problem due to windows use of backslashes in paths.

Even on a local repo, with no UNC complications, it looks like (under a utf-8 lens) git adds extra escape backslashes. So the entry in the config file looks like this in binary:
\turl = C:\\\\path\\\\to\\\\repo\n
and this in utf-8:
url = C:\\path\\to\\repo

So, I'm not sure that the gitpython configparser is decoding as utf-8. I'm still trying to follow the config lines through, but that's my best guess at this point.

@HudsonMC16
Copy link
Author

I forgot to include, I think the reason the 'rt' mode works, is because it automatically correctly escapes the backslashes, similar to decoding using the python 'unicode-escape' encoding. So, for example: b'\\\\'.decode('unicode-escape') evaluates to '\\'

@Byron
Copy link
Member

Byron commented Jun 18, 2020

Awesome! rt is implemented by Python, not by the OS, it seems. And unicode-escaping needs backslashes to insert codepoints like \uffff, among others. All that seems to be done so that the configuration file itself does not actually have to be UTF-8 encoded, but works with ascii instead.

What would that mean for a fix? Doesn't that mean that one could simply decode with 'unicode-escape' instead?

I couldn't really find details on what's expected of git configuration files, but would have wanted to post this here to verify the solution above would be acceptable.

@HudsonMC16
Copy link
Author

Yeah, I know what you mean about the git configuration files. I went looking for something also, but couldn't find anything.

We could potentially decode with 'unicode-escape' like you suggest. I guess we could also catch the 8 backslash pattern and replace it explicitly. That seems a little hacky, but less likely to accidentally break other stuff, I think. Another sort of hacky, but maybe more complete solution is to explicitly replace every four sequential binary backslashes with a single forward slash. Builtins and stdlib on windows (pathlib.Path, open, etc.) appear to work with strings structured this way.

@Byron
Copy link
Member

Byron commented Jun 22, 2020

Sometimes one has to dig deeper :D, so I found this portion of the git source code which seems to mean that it's simply decoding these with \\ ➡ \ .

Thus for us it might indeed be safest to do an additional processing step, as you suggest. As git itself is also doing it, it might not even be 'hacky' after all.

If you would like to give it a shot, I would very much welcome a PR to start hammering home a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants