angelwolfgeek (angelwolfgeek) wrote 2011-02-05 11:44 am
random thoughts - building a better Drive Extender/Greyhole equivalent
Drive Extender and Greyhole are ways to combine a bunch of drives into a single 'unified' drive. Drive Extender used to be part of Windows Home Server, and Greyhole is for Linux - the problem with Greyhole is that it uses Samba, which seems a very odd way to do it.
As I see it, rather than using Samba or some other complicated method, there might be a fairly simple, elegant way to do it. Firstly, let's assume our system is reasonably partition agnostic, and that we're working with directories as opposed to partitions.
I'd assume we'd have two types of data stores in our drive unification system - the 'public' directory seen by the user, which contains symlinks to the data, and a series of private directories which act as the data repositories. We'd also need a daemon that would monitor the public directory for changes and transfer files to the data repositories. This daemon would also attempt to spread the data out in a desired manner.
The daemon does two things. First, it checks the 'public' directory for new files that are not symlinks, creates and stores a checksum for each one, copies the file to a suitable data repository, and then replaces the original with a symlink to the copy. It also does the reverse - checking for deleted or added files in the data repositories and acting as needed, either moving them over to a temporary directory or deleting them - since the user is not expected to add files to the repositories, or change them. It also keeps a manifest of files and checksums to aid with deduplication and to keep track of the physical location of files. inotify may be a suitable mechanism for the monitoring.
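To make the first half of that concrete, here is a minimal sketch of one ingest pass, assuming a single repository directory and an in-memory dict as the manifest. The names `find_new_files`, `ingest` and `sha256_of` are made up for illustration, SHA-256 is just one possible checksum, and a real daemon would react to inotify events rather than rescan the whole tree.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_new_files(public_dir: Path) -> list[Path]:
    """List regular files under public_dir that are not symlinks yet."""
    found = []
    for root, _dirs, names in os.walk(public_dir):
        for name in names:
            candidate = Path(root) / name
            if candidate.is_file() and not candidate.is_symlink():
                found.append(candidate)
    return sorted(found)


def ingest(public_dir: Path, repo_dir: Path, manifest: dict) -> None:
    """Copy each new file into repo_dir and leave a symlink in its place."""
    for source in find_new_files(public_dir):
        relative = source.relative_to(public_dir)
        checksum = sha256_of(source)
        target = repo_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
        # Verify the copy before the original is replaced.
        if sha256_of(target) != checksum:
            target.unlink()
            raise OSError(f"checksum mismatch while copying {source}")
        # Build the symlink under a temporary name, then rename it over the
        # original so the public path never disappears.
        tmp_link = source.with_name(source.name + ".tmp-link")
        tmp_link.symlink_to(target.resolve())
        os.replace(tmp_link, source)
        manifest[str(relative)] = {
            "checksum": checksum,
            "stored_at": str(target.resolve()),
        }
```

Running `ingest(Path("public"), Path("repo1"), manifest)` a second time should be a no-op, since everything it already handled is now a symlink.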
Where the data repository is remote, some method of locking may be necessary - a file that is open in another process should not be synced.
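One possible way to approach the locking is an advisory lock that the daemon tries to take before it syncs a file, skipping the file if something else holds it. This is only a sketch: `try_exclusive_lock` is a hypothetical helper, `flock` locks are advisory (they only help if the other programs take them too), and how they behave on a network file system would need checking.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def try_exclusive_lock(path):
    """Yield True if an exclusive advisory lock was obtained, else False."""
    with open(path, "rb") as handle:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

The daemon would wrap each sync in `with try_exclusive_lock(path) as locked:` and leave the file for the next pass whenever `locked` is false.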
The preferred mode of file sync may be rsync, to avoid excessive data transfer where files are merely edited.
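As a rough idea of how the daemon might call rsync, the sketch below builds the command line and hands it to `subprocess`. The function names are made up, and the choice of flags is an assumption rather than a recommendation. One thing worth knowing is that rsync copies whole files by default when both paths are local, so the delta transfer only happens if it is asked for or if the repository is on another machine.

```python
import subprocess


def rsync_command(source, destination, delta=True):
    """Build an rsync command line for syncing one file or directory."""
    command = ["rsync", "--archive", "--partial"]
    if delta:
        # Ask for the delta algorithm even when both paths are local.
        command.append("--no-whole-file")
    command += [str(source), str(destination)]
    return command


def sync(source, destination):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    subprocess.run(rsync_command(source, destination), check=True)
```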
Configuration files will have the following information - the 'public' location of the folder the user sees his data in, and the associated data repositories. Each repository will have a location on the file system, as well as a maximum size - the system will use either the amount of free space or this size, whichever is less, to determine the maximum amount of space a repository may occupy.
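The configuration described above could be modelled roughly like this. The class and field names are invented for the sketch, and reading 'free space' as 'what the repository already holds plus what the file system still has free' is my interpretation of the rule, not something settled.

```python
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Repository:
    path: Path       # where the repository lives on the file system
    max_bytes: int   # configured maximum size


@dataclass
class PoolConfig:
    public_dir: Path                 # the folder the user sees
    repositories: list[Repository]   # the private data repositories


def used_bytes(repo: Repository) -> int:
    """Total size of the regular files currently stored in the repository."""
    return sum(
        p.stat().st_size
        for p in repo.path.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def capacity_bytes(repo: Repository, free_bytes=None) -> int:
    """Space the repository may occupy: the configured maximum or what the
    file system can actually hold, whichever is less."""
    if free_bytes is None:
        free_bytes = shutil.disk_usage(repo.path).free
    return min(repo.max_bytes, used_bytes(repo) + free_bytes)
```

The `free_bytes` argument is only there so the calculation can be tried out without depending on the real disk.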
In order to ensure repositories are evenly used, I'd propose a round robin system, with preference given to the repository with the biggest difference between the maximum space available to it and its used space. There may be more efficient ways of doing it, or manual weighting, but those would not be early release features.
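The selection rule could look something like the sketch below, assuming each repository is summarised by a capacity and a used figure. `RepoState` and `choose_repository` are hypothetical names, and breaking ties by list order is my own choice.

```python
from dataclasses import dataclass


@dataclass
class RepoState:
    name: str
    capacity_bytes: int   # maximum space the repository may occupy
    used_bytes: int       # space it occupies right now

    @property
    def headroom(self) -> int:
        return self.capacity_bytes - self.used_bytes


def choose_repository(repos, file_size):
    """Pick the repository with the most headroom that can fit the file.

    Returns None when no repository has enough room. Ties go to the
    repository listed first.
    """
    candidates = [repo for repo in repos if repo.headroom >= file_size]
    if not candidates:
        return None
    return max(candidates, key=lambda repo: repo.headroom)
```

With equally sized repositories and similar files, the choice simply alternates between them as each one fills up, which gives the round robin behaviour without keeping a separate counter.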
None of this would need anything not available on a typical Linux system, and since it does not involve file systems in any way, it could even be used by non-privileged users.
Potential issues we'd need to look at are users messing with the file repositories (monitoring the repositories in the same way as the public directory might help), duplication of files, and possibly an option for keeping more than one copy of a file.
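For the first two issues, a periodic check of a repository against the manifest might be enough to start with. The sketch below assumes the manifest is a plain dict mapping each relative path to its SHA-256 checksum. `audit_repository` and `find_duplicates` are made-up names, and what to do with the results (quarantine, delete, warn) is left open.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path: Path) -> str:
    """SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_repository(repo_dir: Path, manifest: dict) -> dict:
    """Compare the files on disk with the manifest and report differences."""
    on_disk = {
        p.relative_to(repo_dir).as_posix(): p
        for p in repo_dir.rglob("*")
        if p.is_file()
    }
    known = set(manifest)
    present = set(on_disk)
    return {
        # In the manifest, but no longer in the repository.
        "missing": sorted(known - present),
        # In the repository, but never recorded by the daemon.
        "unexpected": sorted(present - known),
        # Recorded, but the contents no longer match the checksum.
        "modified": sorted(
            name for name in known & present
            if file_checksum(on_disk[name]) != manifest[name]
        ),
    }


def find_duplicates(manifest: dict) -> list:
    """Group manifest paths that share a checksum (likely duplicate files)."""
    by_checksum = defaultdict(list)
    for name, checksum in manifest.items():
        by_checksum[checksum].append(name)
    return sorted(sorted(names) for names in by_checksum.values() if len(names) > 1)
```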
Of course, I can't code, but once again, I'm keeping the idea documented, in case circumstances change ;p
no subject
http://superuser.com/questions/46441/rsync-as-a-background-process