Recent comments posted to this site:
Thanks for this great tool! I was wondering what the differences are between using type=directory, type=rsync, or a bare git repo for directories?
I guess I can't use just a regular repo because my USB drive is formatted as vfat -- which threw me for a loop the first time I heard about git-annex about a year ago: I followed the walkthrough, it didn't work as expected, and I gave up (now I know it was just a case of PEBKAC). It might be worth adding a note about vfat to the "Adding a remote" section of the walkthrough, since the unstated assumption there is that the USB drive is formatted with a filesystem that supports symlinks.
Thanks again, my scientific data management just got a lot more sane!
I have the same use case as Asheesh but I want to be able to see which filenames point to the same objects and then decide which of the duplicates to drop myself. I think
git annex drop --by-contents
would be the wrong approach because how does git-annex know which ones to drop? There's too much potential for error.
Instead it would be great to have something like
git annex finddups
While it's easy enough to knock up a bit of shell or Perl to achieve this, that relies on knowledge of the annex symlink structure, so I think really it belongs inside git-annex.
If this command gave output similar to the excellent fastdup utility:
Scanning for files... 672 files in 10.439 seconds
Comparing 2 sets of files...
2 files (70.71 MB/ea)
/home/adam/media/flat/tour/flat-tour.3gp
/home/adam/videos/tour.3gp
Found 1 duplicate of 1 file (70.71 MB wasted)
Scanned 672 files (1.96 GB) in 11.415 seconds
then you could do stuff like
git annex finddups | grep /home/adam/media/flat | xargs rm
My main concern with putting this in git-annex is that finding duplicates necessarily involves storing a list of every key and file in the repository, and git-annex is very carefully built to avoid things that require non-constant memory use, so that it can scale to very big repositories. (The only exception is the unused command, and reducing its memory usage is a continuing goal.)
So I would rather come at this from a different angle.. like providing a way to output a list of files and their associated keys, which the user can then use in their own shell pipelines to find duplicate keys:
git annex find --include '*' --format='${file} ${key}\n' | sort --key 2 | uniq --all-repeated --skip-fields=1
Which is implemented now!
(Making that pipeline properly handle filenames with spaces is left as an exercise for the reader..)
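One sketch of that exercise, assuming GNU sort/awk, a format string that supports a tab escape, and that no filename contains a tab or newline (the keys below are made up for illustration):

```shell
# finddup_keys: read "key<TAB>file" lines on stdin and print every file
# whose key has already been seen, i.e. the duplicates.
finddup_keys() {
  sort | awk -F'\t' '$1 == prev { print $2 } { prev = $1 }'
}

# In a real repo you would pipe in:
#   git annex find --include '*' --format='${key}\t${file}\n' | finddup_keys
# Here the same stream is simulated with printf:
printf '%s\t%s\n' \
  'SHA256-s74154461--aaa111' '/home/adam/media/flat/tour/flat-tour.3gp' \
  'SHA256-s74154461--aaa111' '/home/adam/videos/tour.3gp' \
  'SHA256-s12345--bbb222' '/home/adam/videos/something else.3gp' |
  finddup_keys
```

Using a NUL separator with sort -z would be more robust still, at the cost of a hairier awk invocation.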
Hi,
I guess the problem is with git-annex-shell. I tried to do 'git annex get file --from name_ssh_repo', and I got the following:
bash: git-annex-shell: command not found; failed; exit code 127
The same thing happens if I try to do 'git annex whereis'
git-annex-shell is indeed installed. How can I make my shell recognize this command?
Thanks a lot!
git annex fsck complained that I had only one copy per file even though I had already created my clone. Once I git pulled from the second repo, not getting any changes for obvious reasons, git annex fsck was happy. So I am not sure how my addition was incorrect. -- RichiH
Well, I spent a few hours playing this evening in the 'reorg' branch in git. It seems to be shaping up pretty well; type-based refactoring in Haskell makes this kind of big systematic change a matter of editing until it compiles. And it compiles and the test suite passes. But so far I've only covered 1., 3., and 4. on the list, and have yet to deal with upgrades.
I'd recommend you not wait to start using git-annex. I am committed to providing upgradability between annexes created with all versions of git-annex, going forward. This is important because we can have offline archival drives that sit unused for years. Git-annex will upgrade a repository to the current standard the first time it sees it, and I hope the upgrade will be pretty smooth. It was not bad for the annex.version 0 to 1 upgrade earlier. The only annoyance with upgrades is that they will result in some big commits to git, as every symlink in the repo gets changed and log files get moved to new names.
(The metadata being stored with keys is data that a particular backend can use, and is static to a given key, so there are no merge issues (and it won't be used to preserve mtimes, etc).)
When git annex get does nothing, it's because it doesn't know a place to get the file from.
This can happen if the git-annex branch has not propagated from the place where the file was added.
For example, if on the laptop you had run git pull ssh master, that would only pull the master branch, not the git-annex branch.
An easy way to ensure the git-annex branch is kept in sync is to run git annex sync.
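For example, assuming the remote is named ssh as above, you can either fetch the branches yourself or let git-annex do it (git-annex merges a fetched git-annex branch automatically the next time it runs):

```
git fetch ssh          # brings in ssh/git-annex as well as ssh/master
git merge ssh/master   # update the local master branch

git annex sync         # or: do the pulling, merging, and pushing in one step
```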
What about Cygwin? It emulates POSIX fairly well under Windows (including signals, forking, the filesystem (even things like /dev/null and /proc), and unix file permissions), and has all the standard GNU utilities. It also emulates symlinks, but they are unfortunately incompatible with the NTFS symlinks introduced in Vista, due to some stupid restrictions on Windows.
If git-annex could be modified to not require symlinks to work, then it would be a pretty neat solution (and you get a real shell, not some command.com on drugs (aka cmd.exe)).
The directory and rsync special remotes intentionally use the same layout. So the same directory could be set up as both types of special remotes.
The main reason to use this rather than a bare git repo is that it supports encryption.
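For example, the same directory could be set up both ways (the remote names and paths here are invented for illustration):

```
git annex initremote usbdir type=directory directory=/media/usb/annex encryption=shared
git annex initremote usbrsync type=rsync rsyncurl=example.com:/srv/annex encryption=shared
```

With encryption=shared the content in the directory is encrypted, which a bare git repo there would not give you; use encryption=none to store it in the clear.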