Librarians have been classifying the world's knowledge since forever, and they developed faceted classification systems (Ranganathan, 1933) to deal with these issues.
A notable issue is that hierarchies embed the point of view of the classifier, e.g. the Dewey Decimal system's ridiculous classification of religions (codes 200–299): it makes minute distinctions like 285 (Presbyterian, Reformed, Congregational) and 286 (Baptist, Disciples of Christ, Adventist), then puts all non-Christian religions under a handful of afterthought headings in 292–299: 294 for Hinduism, Buddhism, Sikhism and other religions of Indian origin, 295 for Zoroastrianism and its descendants, 296 for Judaism, 297 for Islam and the Bahá'í Faith lumped together, and 299 for New Age.
Unfortunately, most of us do not have access to the services of a librarian to develop a taxonomy that corresponds to our own point of view and then classify our files accordingly, which is why simple hierarchical taxonomies have endured while faceted ones have seldom spread beyond specialized applications like Digital Asset Management.
The Dewey Decimal System has only one (often-misunderstood) task: organizing bookshelf space. Classifying the world's information is not its job.
The reason the fairly narrow "Presbyterian, Reformed, Congregational" topic gets one DDS code while the extremely wide "Hinduism, Buddhism, Sikhism and other religions of Indian origin" topic also gets one DDS code is simple: an average library has a similar width of books on the shelf for both topics.
> Dewey Decimal System has only one (often-misunderstood) task: organizing bookshelf space. It does not have a task of classifying world information.
How is this distinct from organizing world information? If the goal is to shelve books in a way that adds value to the information-seeker, you have to arrange them according to some definition of similarity. That decision about what makes one thing similar to another encodes a view of the world.
> How is this distinct from organizing world information?
Any practical classification system starts with the corpus at hand, and tries to organize that. Any idealistic classification system that aspires to be "complete" (at least in theory) starts with the theoretically possible corpus, and tries to organize that.
For example, if you want to classify the pictures in your personal photo library, it would make sense to have, e.g., a folder named "2020" for the pictures you took that year. But it would not make sense to have a folder named "1900", as it's unlikely you took any photos then. This is a practical classification.
But if you want to classify all of the world's photos, you may need to start with a folder named "1826" to hold the View from the Window at Le Gras[1]. Or maybe earlier, for the paintings. That's a theoretical classification.
DDS is a practical classification, and as such, it started with the corpus at hand — whatever public libraries in the USA had at Dewey's time. This is how it's "distinct from organizing world information".
—————
[1]: It's not clear if the photo was taken in 1826 or 1827, so our theoretical classification would need to account for that, or for photos where the year is unknown, or estimated, etc.
DDS codes map subjects to shelf locations. For a given library, materials selection is largely driven by its own interests, and in many cases location and local culture have a strong influence over that.
(As I've discussed in an earlier comment, DDS is not the only shelving system used, though it's a commonly encountered one in the US, and serves as a basis for numerous others. There are also non-subject-based shelving systems, though those are typically not publicly accessible.)
> The physical shelves in a library only serve a single, small geographic area, not the whole world
What part of that sentence is not contradicted by inter-library loans? The books come off the shelves, and must be found there. They serve a wide geographic area, sometimes even the whole world (my mother was a librarian at several very specialized libraries, and they both received and gave loans involving other libraries around the world). It is true that the physical shelves in a library in (say) Germany will not be organized using DDS, but it remains true that physical shelves can serve a wide geographic area.
And this is not even to comment on those specialized libraries that people will travel from around the world to visit. Same physical shelves, world-wide service area. These can include libraries focused on individuals (famous historical figures, for example), or academic libraries with particularly rich holdings in certain areas, or libraries that just happen to have the only instance of a set of books/documents.
The Dewey Decimal Classification (DDC) is not a special-holdings cataloguing system. It is not even an academic-holdings cataloguing system (in the US the Library of Congress Classification, LCC, is overwhelmingly used for this). DDC is used, in the majority of cases, by public libraries serving local communities in the US.
The LCC, though influenced by DDC, is distinct. It still reflects a US-centrism, though with different emphasis. History of the Americas occupies two of the twenty-one alphabetic major classifications, and each of political science, law, education, agriculture, technology, military science, and (separately) naval science has its own major classification, reflecting the interests of a government ... and that government's library shelving concerns.
Specialised libraries rarely use DDC in my experience. Medical libraries often use their own specific classification, and academic libraries in the US, as noted above, typically use LCC, though some retain DDC or their own idiosyncratic classifications (both are generally being phased out in favour of LCC).
For the University of California, an academic library with a strong inter-campus ILL programme, local circulation exceeds ILL within the multicampus system by over an order of magnitude (1.6 million vs. 135k, https://libraries.universityofcalifornia.edu/about/facts-and...). For a local public library with few exceptional holdings, ILL is likely a far smaller fraction of circulation. (Some have regional lending arrangements with peer libraries, independent of ILL. This still remains a small fraction of total circulation.) Much use of materials is within the library itself, captured only (if at all) in reshelving statistics, though those are hard to find.
(I used the UC library system only because it's among the few that have available statistics.)
The point remains that the principal focus of a cataloguing system is local use and management of a bibliographic collection. ILL happens to be an incidental and compatible use. It is not of itself a major factor in classification system development.
If you have any substantive argument to the contrary, I'd be happy to hear it. You've not yet made one.
If 90%+ of your use case (for a major academic library) is local, and <10% is remote (though within the same general geographical region and topical interests), do you solve for the <10% solution or the >90% solution?
If you're a regional nonspecialist nonacademic library where that split is far more likely 99%/1%, which do you solve for?
I am telling you flatly that your assumption and premise is false. Libraries, in the overwhelming majority of institutions, serve local communities, in the overwhelming majority of transactions. I've provided data to back my argument. You ... continue to hand-wave.
Perhaps only in that books are sometimes lost for years after being put back in the wrong place! ‘Defragmenting’ the shelf space in a public library is a long and painful task, both when there is a lack of room and when things get out of order!
Has anyone considered adding a series of color bands to the labels for coarse/medium/fine-grained placement information? For example, "coarse" is nonfiction, "medium" is travel, and "fine" is Bermuda. If you see a book with a red label in the blue section, you know it's misplaced.
If the labels are placed at a similar location on each book, then any book that is out of its proper place would be significantly more obvious, without having to resort to reading the label of every single book.
Of course the techie solution would be to barcode every book and then have a robot on each shelf that scans the barcodes after hours...
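As a toy sketch of the color-band idea (the colors and the three-digit codes are illustrative, not a real scheme): derive one band per digit of the classification code, then flag any book whose bands differ from its section's.

```python
# Hypothetical color-band scheme: one band per digit of a three-digit
# classification code, coarse to fine. A book is visibly misplaced when
# its bands don't match the section it sits in.

BAND_COLORS = ["red", "orange", "yellow", "green", "blue",
               "violet", "brown", "black", "grey", "white"]

def bands(code: str) -> tuple[str, ...]:
    """Map the first three digits of a code to coarse/medium/fine colors."""
    return tuple(BAND_COLORS[int(c)] for c in code[:3])

def misplaced(book_code: str, section_code: str) -> bool:
    """A book is misplaced if any of its bands differ from the section's."""
    return bands(book_code) != bands(section_code)
```

Even a single mismatched coarse band would be enough for a quick visual sweep; the finer bands only matter once the coarse ones agree.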
We can doubt that the goal is to add value; perhaps it is simply to give some top-down structure that makes sense to the people using it. I admit that when I went into the stacks to find the book I was looking for, browsing around on nearby shelves often led to additional good finds. These were topically near. I think that would be the case for most good grouping schemes, though the choices may change.
Times have changed, and the things that make sense have changed in various ways. At least it still organizes shelves.
There's no inherent need to arrange books by similarity, and there are numerous library systems that don't. (Most have "closed stacks", where patrons submit requests for materials.) The only requirement is that books be retrievable on request and can be returned to their assigned location when re-shelved.
A book can occupy only one physical space at one time. If it has an assigned shelf location for storage and retrieval, it can have only one such location.
The Dewey Decimal Classification assigns physical location by assigned subject, for better or worse. There's no essential reason to do this, and there are libraries which assign storage location by arbitrary identifiers or by size (this also tends to happen with DDS-based libraries, which have specific "oversized" shelving sections), as well as separate storage for specific media types: maps, photographs, other graphic media (typically large-format), audio, video, software, and data.
For information storage, your storage subsystem (spinning rust, SSD, tape, cloud, CDROM / DVD archive, etched crystal, whatever) handles the physical storage location element. A filesystem, if it exists, is already a layer of abstraction over that, and is already freed of many of the limitations of physical storage, though largely as a matter of convention and convenience we tend to act as if those limitations still exist, e.g., that a file exists in one and only one directory and that directories are hierarchically organised. Both are typical but not inherently necessary.
Tag-based classification adds yet another level of abstraction on top of the filesystem.
Problems with tags emerge in part from their very flexibility. It's possible to apply any given tag to any given work. Informal tagging systems, or "folksonomies", tend to be highly idiosyncratic, inconsistent, redundant, and repetitious, and frequently develop pain points over time.
Looking at this question myself, I see benefit in:
- Reasonably structured metadata. If you ever want to start a riot amongst librarians, declare your metadata schema as "reasonably structured". That said, author/creator, title, creation date(s), publisher, and some attempt at topic or subject classification will likely be useful. Checksums, size, and fingerprints (say, specified n-gram structures) might also be useful. See "Dublin Core" for an example which has both adherents and critics. For any possible set of metadata, you will all but certainly be able to find exceptions or inapplicability.
- "Search is identity". That is, any given set of tags or metadata might be considered a search, and any search will have one of three possible results: 0 matches (failure), 1 match (an identity search), or >1 match (a result set). How many more than one bears on how useful that result set is. In the same sense that 33 bits will identify any individual person on Earth, you'd need about 27 bits to identify any of the roughly 150 million published works. If your universe is the larger set of recorded but not published works, you'll need to expand your bits accordingly. (A recent estimate I've seen is that about 1.7 MB of data per person on Earth is being recorded every second at present.)
- Names themselves are largely conventions. This might apply to any of the various names associated with a book: its title, its author, its publisher, the publication country or city, the date (and calendar system) associated with publication, and the traditions, disciplines, educational institutions, references, etc., associated with it. All of these can and do change. (Quick: what other names are Plato, Avicenna, George Sand, Mark Twain, St. Petersburg, New York, Mumbai, and the Wilson School of Government known by?) Good names are useful in that they are useful conventions and are commonly understood when received by others.
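The bit counts in the "search is identity" point can be checked directly: singling out one item from a universe of N possibilities takes ceil(log2 N) bits, which comes out to 33 for ~8 billion people and just under 28 for ~150 million published works. A one-liner to verify:

```python
import math

def bits_to_identify(universe_size: int) -> int:
    """Smallest number of whole bits whose 2^n combinations
    cover a universe of the given size."""
    return math.ceil(math.log2(universe_size))

# ~8 billion people need 33 bits; ~150 million works need just
# over 27 (2^27 ≈ 134 million, so 28 whole bits to be safe).
```

The same arithmetic tells you how discriminating a tag set must be before a search converges on a single item rather than a large result set.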
I have to admit that randomly browsing the stacks, flipping through the pages, has led to many serendipitous discoveries that otherwise might not have occurred had I had to go through the trouble of requesting, waiting, and receiving books, subject to lending limits, etc. I miss doing that, actually. But at this point in my life, most public library stacks disappoint me, and the university library is an hour's drive away, though I do check out books from it occasionally.
Shelf-browsing is a true joy, and I miss it as well. It's best supported at a quality academic library (though a local liberal arts or community college may afford its own rewards).
That said, I've learned the art of bibliographic search, based on references, citations, and less formal mentions, and find that quite fruitful.
There are also recommendations that turn up through some platforms (I'm partial to Z-Library's), which can be useful, though come with cautions. (Popularity is a poor proxy for either truth or value, though it lets you know what others may have read.)
Learning how to fruitfully use cataloguing systems is also tremendously useful. I make heavy use of Worldcat (https://www.worldcat.org/, DDG bang search !worldcat), Google Scholar, Microsoft's academic research tool when I remember it exists, Wikipedia (I read it for the references ;-), and keyword searches on Open Library, Project Gutenberg, and LibGen.
I'm finding myself relying far, far less on general web search than, say, 10--20 years ago. These aren't completely useless, but the trend is quite pronounced, over all major search engines (DDG is my default, though I'll use others on occasion).
Of course you're right, and I meant American public libraries; I imagine the Chabad-Lubavitch Library would probably lead to a different system. And the time I was referring to was the time of Melvil Dewey, although I imagine not much would have changed by Thomas Dewey's time :)
Once I saw this guy on a bridge about to jump. I said, "Don't do it!" He said, "Nobody loves me." I said, "God loves you. Do you believe in God?"
He said, "Yes."
I said, "Are you a Christian or a Jew?"
He said, "A Christian."
I said, "Me, too! Protestant or Catholic?"
He said, "Protestant."
I said, "Me, too! What franchise?"
He said, "Baptist."
I said, "Me, too! Northern Baptist or Southern Baptist?"
He said, "Northern Baptist."
I said, "Me, too! Northern Conservative Baptist or Northern Liberal Baptist?"
He said, "Northern Conservative Baptist."
I said, "Me, too! Northern Conservative Baptist Great Lakes Region, or Northern Conservative Baptist Eastern Region?"
He said, "Northern Conservative Baptist Great Lakes Region."
I said, "Me, too! Northern Conservative Baptist Great Lakes Region Council of 1879, or Northern Conservative Baptist Great Lakes Region Council of 1912?"
He said, "Northern Conservative Baptist Great Lakes Region Council of 1912."
This is funny but for most Christians wildly incorrect (I'll add another funny one that used to be somewhat correct in my case towards the end.)
As a Christian I'll more or less want to save anyone who is in danger, no questions asked, as long as it doesn't put anyone else in danger. The reason is simple: besides being the obvious right thing to do as a human it also means we'll either have a good person living here for a while longer or a bad person will get another chance to change their ways (and have a very powerful reminder to do it).
I believe this is true for most of us.
As promised, an actual funny joke that used to hit home with me :
As someone enters heaven they go past multiple groups, and the tour guide says "here are the righteous from this group" and "here are the righteous from that group", until suddenly he tiptoes past a door, saying: "here's <x> group, and they think they are alone here".
(That said, I'm still afraid many will be in for a rude awakening, whatever group they feel they belong to, if they don't actually walk the walk.)
Captain Obvious suggests that Emo Philips was not joking about throwing anyone off the cliff, but about the very fact that political subdivisions in religions pretend to be about which moral values are more correct, instead of being dirty power games of self-proclaimed authorities covered up by thick layers of ideology. Our imaginary gods do not care either way and there is no evidence that we should care either.
skinkestek suggests that, based on observation, not only in the case of religion but also in a number of other cases, Captain Obvious is not visible to everyone, not even on HN.
You might think this is obvious but I think I have seen worse ideas taken "hook, line and sinker".
After all, common sense seems to be a rather scarce resource.
One example of a data structure implementing faceted classification would be the multitree [0]. Unfortunately multitrees seem to receive far less support than the two other data structures they intermediate: trees and DAGs.
k-d trees [1] are close but use cases seem to predominantly target data with inherently ordinal (rather than nominal) dimensions.
Further abstraction could lead to the knowledge graph [2] or graph databases.
In all cases, the availability of "low-code" tools (in the domain of single-user personal information management, at least) seems sparse. I have been looking for some time, but the search continues.
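The key property these structures share, as against a strict tree, is that an item may sit under several facet nodes at once. A toy faceted classifier (all item and facet names here are illustrative) is only a few lines:

```python
# Minimal sketch of faceted classification: each item carries a set of
# parent facets (a DAG-style multiple-parent membership), so it can be
# reached from any of them, unlike a file in a single folder.

from collections import defaultdict

facets: dict[str, set[str]] = defaultdict(set)  # item -> its facets

def classify(item: str, *categories: str) -> None:
    """Attach an item to one or more facet nodes."""
    facets[item].update(categories)

def items_under(category: str) -> set[str]:
    """All items reachable from a given facet node."""
    return {item for item, cats in facets.items() if category in cats}

classify("tax-return-2021.pdf", "finance", "2021")
classify("holiday-photos-2021", "photos", "2021")
```

A real multitree additionally guarantees that the descendants of any node still form a tree; this sketch only shows the multiple-parent part that trees forbid.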
Graph-based PKMS have recently exploded in popularity and sophistication. Athens Research (the OSS counterpart to Roam), and especially ObsidianMD (with its plugin ecosystem) are a couple examples of systems that might suit your purposes well.
rea.ch is a new graph-based PKMS that addresses some of the same issues, covering everything from file tagging to note association as part of a second-brain/zettelkasten scheme.
It's no substitute for a librarian, but I am interested in eventually trying out the Johnny.Decimal system [0] to build my own taxonomy based on what I use most often. (Basically choose 10 categories, and then subdivide those ten more times, by usage. These get numeric prefixes, to simplify navigation.) It's not for everyone, but the author says [1] he's used it for years.
With enough discipline Johnny.Decimal can work wonders, especially if you organize your work in the same way across different apps (say email, files, or your productivity app).
Once you get into the habit of using it, the most frequently used codes stick in your mind, so it's effortless. And in case you don't remember a code, there is a clear, fixed path to find the required resources.
Still, coming up with a taxonomy is hard, and essential. Should I keep my home bills under home/invoices or invoices/home? The answer does not matter, but you need to stick to it across all your areas of interest.
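To make the scheme concrete, here is a rough sketch of Johnny.Decimal-style IDs (the helper functions are mine, not from the system's docs): areas are tens bands like 10-19, categories are the numbers inside an area, and individual items get a two-digit suffix, e.g. 12.04.

```python
# Illustrative Johnny.Decimal-style IDs: area 10-19 might be "Finance",
# category 12 "Invoices", item 12.04 a specific invoice. The point is
# that the same code works in your filesystem, email labels, and notes.

def jd_id(category: int, item: int) -> str:
    """Format a category (10-99) and item (1-99) as an ID like '12.04'."""
    assert 10 <= category <= 99 and 1 <= item <= 99
    return f"{category}.{item:02d}"

def jd_area(category: int) -> str:
    """The tens band an ID belongs to, e.g. category 12 -> '10-19'."""
    lo = category // 10 * 10
    return f"{lo}-{lo + 9}"
```

Because the code is fixed, the home/invoices vs. invoices/home question dissolves: 12.04 is 12.04 everywhere.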
> Unfortunately, most of us do not have access to the services of a librarian to develop a taxonomy that corresponds to our own point of view then classify our files accordingly
And no doubt if he'd been in India designing a system for an Indian library, there'd be more distinction between Hindu, Buddhist, and Sikh sects (never mind them not being lumped together in the first place), and Christianity, especially denominations not usually found in India, would get the 'handful of afterthought headings': the ones with few books, with the books themselves doing more lumping together and having less focus and depth on any one niche.
It seems that for any DAG, a hierarchy could be derived that would minimize either the number of soft links, or a weighted score of how many contained files are in a soft-linked path.
And, similar to how a Sugiyama graph layout automatically redraws when an edge is added or removed, a filesystem hierarchy could be automatically restructured to minimize the above scores when files are added to or removed from the various folders.
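A rough sketch of the unweighted score (names illustrative): if each folder in the DAG is materialised under exactly one primary parent, every additional parent must be represented by a soft link, so the score is just the surplus in-degree. A weighted version would multiply each extra parent by the number of files under the node.

```python
# Sketch: count the soft links needed to flatten a folder DAG into a
# hierarchy. `parents` maps each node to the folders that want to
# contain it; one becomes the real parent, the rest get soft links.

def soft_link_count(parents: dict[str, list[str]]) -> int:
    """Surplus in-degree summed over all nodes with at least one parent."""
    return sum(len(ps) - 1 for ps in parents.values() if ps)
```

The interesting optimisation is then the weighted variant: which candidate parent to make primary so that the fewest files end up behind a soft-linked path.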
> The downside of folders is trying to figure out where things belong in the hierarchy, or trying to update that hierarchy to a new standard.
That's why I stick to Documents/{folder1..folder∞}, and folders don't have hierarchical subfolders, just contextual folders. E.g.: Documents/taxes 2021/{invoices, stuff}, Documents/Cthulhu Roleplaying/{pdf files of characters}, Documents/Covid vaccination certificates.
Yes, it's messy, but I don't have the mental burden of holding a tree in my head, or a tagging system.
This is one way to think about the hybrid model of folders + tags that is currently available:
1. Folders tell you where the file is stored.
2. Tags tell you what the file is.
So basically use the tags to add more data (metadata) about your files, so that if you forget where a file is, you can still search for it by what it is. This also slightly eases the burden of figuring out where to store a file (e.g. "Do I put a home video in my 'Videos' folder or my 'Personal' folder?"): if you tag it properly, you can put it in either, and use the tags to find the video later.
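A minimal sketch of that hybrid model (paths, tags, and filenames all invented for illustration): the path answers "where", the tag set answers "what", and a search intersects on tags regardless of location.

```python
# Hybrid folders + tags: each file lives at exactly one path, but
# carries an unordered tag set, so it can be found by what it is
# no matter which folder it ended up in.

files: dict[str, set[str]] = {
    "/Videos/beach.mp4":   {"home-video", "2021", "family"},
    "/Personal/notes.txt": {"notes"},
}

def find_by_tags(wanted: set[str]) -> list[str]:
    """Paths of files carrying all of the wanted tags."""
    return [path for path, tags in files.items() if wanted <= tags]
```

So whether beach.mp4 sits under Videos or Personal, `find_by_tags({"home-video"})` still locates it.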
> I've been hearing this for years, but I've never had the mental model for it. Maybe I just have to try it out.
Gmail uses tags rather than folders, nice easy way to have a play if you've already got a google account.
Conceptually, I think of it as basically being able to have lots of different hierarchical folder structures applied to a set of objects at the same time. Or to flip it round, a normal folder structure is like using tags but where you are limited to one tag (the folder that holds the object) per object.
This is only if you use it hierarchically. No one prevents you from having a tag 'projects' and another tag 'newproject'. Your emails just have both tags assigned and you are good to go.
Sure, but GP's point (and it's something that annoyed me in Fastmail too) is that if you do have a 'hierarchical tag', which could be in addition to unrelated tags, you don't get the '<sub-tag> is a <super-tag>' behaviour that you might expect.
The tags themselves end up in a hierarchy without it affecting the tagged contents.
I think what you're seeing is the opposite: tags aren't hierarchical. Tags can be organized hierarchically, for organizational purposes, and you can use rules to ensure that everything tagged "projects-FOO" is also tagged "projects", but the tags as they apply to tagged objects aren't hierarchical.
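The rule described above is easy to sketch, assuming a simple `parent-child` tag-naming convention (the convention is illustrative; it's not something Gmail or Fastmail actually enforce):

```python
# Tags are flat; "hierarchy" is only a rule applied at tagging time.
# Here, any tag containing a dash also implies its prefix tag, so
# "projects-foo" always co-occurs with "projects" on tagged objects.

def with_implied(tags: set[str]) -> set[str]:
    """Expand a tag set so every dashed tag implies its parent prefix."""
    implied = {t.rsplit("-", 1)[0] for t in tags if "-" in t}
    return tags | implied
```

Run at tagging time, this gives the '<sub-tag> is a <super-tag>' behaviour without the tags themselves being hierarchical.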
10 years ago I was prototyping something very similar for Windows.
If I recall, CBFS and Dokan were the closest things you had to FUSE on Windows back then. Alternatively considered emulating a network drive. Like you pointed out, it had to be transparent to your existing software's Open / Save dialogs (although there was some effort to hook the standard ones to give users a place to apply tags when saving).
We've been stuck in the same old directory paradigm for a long time. There are some use cases where the traditional hierarchical approach is desirable (e.g. when you need to "visit" a set of files exactly once, like to browse through a folder to clean it up, enumerate for backup, calculate sizes, etc). But it's a constraint when a file belongs in more than one place.
My favourite file organization hack is to have a "scratch" folder containing everything that I can lose with no problem. Stuff only comes out of the scratch folder if I know I can't redownload it and might need it.
This way I can just wipe the whole scratch folder whenever I want, and there are far fewer files I actually need to organize.
At its simplest, a filesystem can be thought of as a database mapping a small string key to a large blob value.
Directories are just a method of breaking the collection into usefully small / specific subsets for humans (mostly) and computers (a secondary but important performance optimization).
There's no technical reason some standard couldn't be added to, or layered on top of, a filesystem to provide tags as an alternate view or secondary index against the data.
The holdup is that there isn't any single standard for adding or searching by tags. The current directory structure is the lowest common denominator and has existed for longer than many of us have been alive. Change is _hard_, and those file open/save dialogs and interface methods still have to work. They can work, at least for limited numbers of tags, or maybe via some special path syntax for adding and removing tags... but then it's stupidly hard for the humans, i.e. hard for exactly the use case the tags are supposed to simplify.
Maybe this change can be bundled with what everyone really needs: a set of GUI calls that works on all the platforms, and that isn't writing a damned web page and using a DLL-hell-bloat bundled web browser to do the cross-platform lifting. Something that can work across all the desktops (as a target) and maybe all the mobiles (as another target) and maybe tablets too (a slightly different target, somewhere between).
I think the api that allows for more flexibility of implementation underneath is a hierarchical multi tag system with one root of the tag reserved for the existing filesystem, eg. "fs". Paths on /fs/dir/dir/file could serve as definitive storage reference and a way into the existing filesystem apis, while other tag systems can be imposed by the user in parallel trees of tags - all of which refer to the underlying fs tree under their implementations.
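That proposed API is easy to sketch in miniature (all names here are illustrative): one reserved root, "fs", holds the definitive storage paths, while user-defined tag trees live in parallel and their entries simply point back into fs.

```python
# Sketch of parallel tag trees over one definitive "fs" root.
# The fs root mirrors real storage; other roots are user-imposed
# views whose leaves reference underlying fs paths.

trees: dict[str, dict[str, str]] = {
    "fs": {},           # definitive storage, mirrors the real filesystem
    "projects": {},     # a user-defined parallel tag tree
}

def link(root: str, tag_path: str, fs_path: str) -> None:
    """Attach a tag-tree path to an underlying filesystem path."""
    trees[root][tag_path] = fs_path

def resolve(root: str, tag_path: str) -> str:
    """Follow a tag-tree path down to its definitive fs path."""
    return trees[root][tag_path]

link("projects", "alpha/report", "/home/me/docs/report.pdf")
```

Because only the fs root owns storage, any number of tag trees can coexist, and deleting a tag path never touches the file itself.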
While I don't remember that specific project's name, I do remember that a lot of that was possible because files in a filesystem aren't necessarily bound to a hierarchical folder structure. There is often a tree-like structure to be able to find a single file quickly amongst many files, but there was no technical reason why it wouldn't also support off-tree indexing based on tags.
I imagine one of the problems you'd run into is mass updates, since deleting a directory with many files below a single tree node is very much limited to that part of the filesystem's tree, but doing it with tags causes updates all over the place.
As well as the thing Apple does in Finder where it has a tag database and dynamic 'tag' directories where it shows you everything that was tagged with that tag. Then the search function would allow selecting files based on tags so you can do many-tag searches and only find the files that match all of them. I think that one is based on an in-filesystem metadata stream.
I think the article was about making it easier for users. It may not be easy anymore if they have to use directories AND tags. Or it may work, if the only files they normally tag are their documents.
Tags have a problem with names. How do you stop users from messing with files that belong to the OS or some application?
Should we require that they all have the tags "OS" or "app" on them? Should binaries have the tag "binary"?
What happens when the user tries to delete one app but chooses "tag 'app' and name startswith 'word'"?
Should all application files named "word"-something from different applications be removed?
Should the user have known to use a meta-query with "tag 'app' and tagname:startswith 'word'"?
I do see a lot of problems but of course some things would be easier.
Users should never touch system files, nor have to look at them. I never want to see any asfawe.sdwae file ever again. It's like keeping power tools in the cupboard, or having screwdrivers in your utensils drawer. It's like keeping the manual in the washing machine. You open the door and instantly know it's a ridiculous idea.
The user should be able to do what they want; anything should go. There should not be any bad moves besides deleting their own stuff, and they should be able to simply delete everything. Make it a partition with a warning prompt.
This might work for documents, but a difficulty I see is how you would handle system files. With documents it's fine to always return a collection of files, but for a system file, say /etc/passwd, you want to make sure you get exactly one file (the status quo guarantees at most one, which is almost but not quite as good).
Sounds like a really bad idea for tools like mv and rm: suddenly you have moved or deleted much more than you thought. Unless you also support GUIDs or similar, but end users are normally really bad at reading and writing those. It is very easy to mistake one GUID for another.
Usefulness of this is highly content-specific. It may work for mp3s, or videos/photos you made yourself that contain some metadata. I can't imagine tag-based organization of all the random 2 million files I have that don't fit into these neat categories.
I don't need access to most of these files unless I'm working on something relevant to them. When I work on X, I go to directory X and everything I need sits below X in some hierarchy; I'm never interested in it otherwise. I don't want stuff from X to pollute some global namespace just because some mp3s or PDFs are present under X. That's true for hundreds of personal projects and tools I've made.
Directory hierarchy is a pretty neat abstraction to me and all the tools I use already support it well.
Does it mean that you will store multiple copies of a file, if it ends up being useful for multiple projects?
About the random 2 mil. files you have lying around: the tags you set should describe not the files themselves, but the reason why you chose to keep those files.
The reason why I'm choosing to store the files is the project they're part of. I don't need them otherwise. If the project is inactive or dead, I'll compress the folder into some archive and delete the folder.
If something is meant to be useful for multiple projects it either gets symlinked, or put into its own project.
Anyway, I'm not saying tagging as a primary categorization interface is not useful at all. It's just that it has limited applications where it would be definitely more useful than traditional filesystem hierarchy.
I have a similar system to GP, and the answer is yes: I will duplicate files that are relevant to multiple projects, with the exception of research papers, which all live together in Zotero.
I think the author is just a little bit behind the times in terms of organizing large corpora of inter-related data. Tagging as a general idea is too fast-and-loose without a system/structure to organize the tags. Semantic Web stacks are one way to improve on this using taxonomies and ontologies, query languages and data specs; the article almost mentions it in the alternative systems ("tuples") but doesn't dig into just how hard it is to manage data using unstructured references (or even structured ones!).
Simple hierarchies are... simple. Tags are simple too, but they quickly devolve into new complexities as people try to figure out how to apply them, find them, and organize them. Hierarchies aren't typically as difficult to manage because they box you into re-creating the same mental model for organization, just with different classifiers for each level of the hierarchy. They're less flexible, but they're easier to grok, maintain, and use.
My major concern is that there isn't really a need to "fix" hierarchies; it's just a nagging problem that someone doesn't want to deal with, so their solution is to make something more complicated... and more complicated might not make it better. It should also be feasible to design applications that organize the files without having to rewrite filesystems.
I couldn't claim to know what is "correct" and what isn't! I've just worked on projects to organize large collections of interrelated datasets (for example, to update correlations between concepts, to make search engines more effective, to identify related or dependent item relationships, etc) and for our project we used a Semantic Web stack. Browsing GitHub for "knowledge graph" or "knowledge management" seems to pop up some cool looking projects, but I think everyone is still trying to figure out what works for a particular use case rather than generally.
I hope a real data scientist can reply with whatever the latest and greatest solutions are. Semantic Web tech for knowledge graphs is continuing to evolve, but it's also kind of old, and it's still mostly used for research projects. Part of that is probably because the terminology is unusual, and implementation leads you down a long rabbit hole of new and confusing concepts. So that's why I'm thinking that just sticking to a boring inefficient hierarchy might not be so bad...
Hierarchies vs tags has come up for me in my recent effort to scan and organize thousands of old family photos (100+ years old).
My main goal is to make the data as future-proof as possible; in my mind, that means going with standard filesystem hierarchies but supplementing them with photo metadata using exiftool. I don't trust any tagging solution that isn't based on the EXIF/XMP standards to be usable in 5+ years, and that only really works for photos (and not all formats). While you can force-write XMP-standard metadata onto unsupported file types--there is a push to standardize XMP across all files--metadata support is far from standard across operating systems or tools.
For my purposes I have folders named with Ahnentafel numbers so they can be quickly found: https://en.wikipedia.org/wiki/Ahnentafel
I do my best to organize photos into the family/generation that is most represented by the photo and then into "events" if possible.
I then use Lightroom to tag the photos with as much information as I can, especially the names of the people in the photos.
This way, I figure, even the least technical person can find photos from a specific family, but with slightly more effort they can also search for people across the folder hierarchy.
I've been meaning to try out PhotoStructure[1] for a while now as it looks like it might just be a good place to finally put some effort into organizing my collection. Nothing is guaranteed to last forever but it looks like the author is planning ahead in this case such that even if he goes out of business, all the work I put into organizing would not be meaningless (data is stored in a sqlite db and code will be open sourced in event of business closure).
Any change you make to your assets in the PhotoStructure UI is stored both in your library database as well as either within the file itself using standard EXIF/IPTC headers, or as a standard sidecar format (like .XMP or .MIE), to ensure your work isn't wasted if you migrate to another tool in the future. More details are here: https://photostructure.com/faq/system-of-record/
I'm happy to field any questions about PhotoStructure or metadata management in general, either on the forum or on Discord (links on the website).
As far as future proof goes, Danbooru has been on the web for 15 years and on github for 10 and is still actively developed. There is a web interface and also a webapi. But the (lack of) access control may be a problem. Maybe with appropriate nginx auth config you can ensure only the family can access it.
I don't like the idea of file hashes for storing files. That really only works well for a limited set of files completely ignoring things like databases, note files, etc. If you want to generate a name, UUIDs would make way more sense. Additionally, even for files which never change, what happens when a bit gets flipped inside of the file? That would presumably change the hash without knowledge of the file system until you try and read it. For drives intended for consumers, you really don't want to have more data loss than is absolutely necessary in case of minor corruption.
On a different note, I use Tiddlywiki with tags a lot for my personal notes and I think a hybrid approach with tags and a hierarchy works best. Hierarchies are useful for generating meaningful unique names where it is nice to have the notes be meaningful. Another thing that would be useful with a hybrid system is to have rules like /Skyrim/Guilds/ThievesGuild/NPC/Mercer automatically assign tags to Mercer by virtue of his placement like {"Skyrim", "NPC", "ThievesGuild"}. You could do this automatically via the path or by some sort of rules engine.
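A rules engine like this could be sketched in a few lines of Python. This is a minimal sketch; the `TAG_COMPONENTS` rule set and the Skyrim path are taken from the example above, and `tags_from_path` is a hypothetical helper, not part of any real wiki or filesystem API:

```python
from pathlib import PurePosixPath

# Hypothetical rule set: these path components double as tags.
TAG_COMPONENTS = {"Skyrim", "NPC", "ThievesGuild"}

def tags_from_path(path: str) -> set[str]:
    """Derive tags from a note's position in the hierarchy."""
    parts = PurePosixPath(path).parts
    # Keep only the components the rules engine has marked as tag-worthy.
    return {p for p in parts if p in TAG_COMPONENTS}

tags = tags_from_path("/Skyrim/Guilds/ThievesGuild/NPC/Mercer")
# tags == {"Skyrim", "ThievesGuild", "NPC"}
```

The same idea generalizes to a rules engine that matches path patterns rather than a fixed set of component names.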
Finally, I would really like it for a more complex information storage system to fully support 1, 2, and 3-tuple metadata. A 1-tuple is a tag, a 2-tuple is a key-value pair associated with a file, and a 3-tuple would be a relation between files with an optional key/value (some relations would be merely a tag while others might want a key/value). 2-tuples are obviously useful since they exist in limited form currently via file attributes, though they unfortunately are not exposed very well to users even though some file systems support arbitrary key/value pairs as attributes.
3-tuples are a little more niche, but I think it would be useful for keeping track of stuff like what file imports another. Humans probably wouldn't generate the metadata in the previous example, but it would help tools play nicely together since you could have a small static analysis tool which simply updates this metadata which could then be used by other tools or by users directly. One of the greatest things about file systems is that they allow for separate tools to interact with the same data in a structured way.
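The three tuple arities can be sketched with a plain in-memory record. This is only an illustration of the data model being proposed; the field names and example values are mine, not from any real filesystem:

```python
from dataclasses import dataclass, field

@dataclass
class Metadata:
    tags: set = field(default_factory=set)        # 1-tuples: bare tags
    attrs: dict = field(default_factory=dict)     # 2-tuples: key -> value
    # 3-tuples: (relation, other_file) -> optional value
    relations: dict = field(default_factory=dict)

meta = Metadata()
meta.tags.add("draft")                            # 1-tuple
meta.attrs["author"] = "alice"                    # 2-tuple
meta.relations[("imports", "util.py")] = None     # 3-tuple, tag-like
meta.relations[("derived-from", "raw.csv")] = "2024-01-01"  # 3-tuple with value
```

A static-analysis tool could populate the `("imports", ...)` relations automatically, which is the interoperability point made above.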
Or, a better idea: use directed graphs. Directed graphs are strictly a superset of both tree and tag (set) functionality.
Tags (sets) can be thought of as a special case of directed graph. You make the tag a node in the graph, you make files nodes as well, and you add an edge pointing from each file to the tag node.
You can then do graph queries to find files tagged in a certain way.
But graphs offer so much more.
Because they are a superset of both trees and sets, you can use them to represent both, at the same time.
It is not very useful to have a lot of things tagged the same way because you end up with just a long list of things. Whereas in a graph you could say that you want to find objects from which you can reach certain tag node and all these objects can still have their own structure and even be part of multiple structures.
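That reachability query is easy to sketch with a plain adjacency map and breadth-first search. The file and tag names below are invented for illustration:

```python
from collections import deque

# Edges point from an object toward a tag node (or from tag to tag,
# so tags themselves can be organised).
edges = {
    "report.pdf": ["project-x"],
    "notes.md":   ["project-x", "ideas"],
    "project-x":  ["work"],
}

def reaches(node, target):
    """True if `target` is reachable from `node` by following edges."""
    seen, queue = set(), deque([node])
    while queue:
        n = queue.popleft()
        if n == target:
            return True
        if n in seen:
            continue
        seen.add(n)
        queue.extend(edges.get(n, []))
    return False

# "Tagged work" in the transitive sense: report.pdf -> project-x -> work
files = [f for f in ("report.pdf", "notes.md") if reaches(f, "work")]
```

Here both files are found under "work" even though neither is tagged with it directly, which is the advantage over a flat tag list.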
For many years I had this idea to build my PIM where I could make arbitrary nodes being anything that could let me connect anything to anything.
A node could be an email, a file, a link to external external website, a task, a contact, a reminder, etc.
And you could connect anything to anything and have, for example, a project that has important emails attached to it; the email could have attached a reminder to respond, a file that you want to include in the response, and a note.
You could browse this graph as a tree because locally it can be interpreted as a tree: you expand the tree one level at a time by finding all elements that point to the current node.
It is not knowledge classification. It is association (like this PDF file is associated with this project).
But association can be used for classification (this PDF file is associated also with THAT topic). Which is exactly my point.
Hierarchical filesystems can be used for classification (create folders for classes, put things in the right folders). But the issue is that the user has to perform classification to even be able to store the file, and you can only put the file in one class, so if you do that poorly you will have a hard time finding it later (a condition which I call disorder).
Association lets you overcome that problem by allowing you to associate the item with multiple classes.
If your aim is to be able to find the item later, it is better to put it in too many classes than too few. A filesystem that only allows you to put the item in one class requires you to make a good decision immediately.
Additionally, there is low cost of making a mistake when adding another association but there is comparatively larger cost of changing the classification by putting the file in another folder.
You're using the case of some directed relationships to argue for exclusively directed relationships. That is a false generalisation. Using a directed graph (exclusively) requires that classification associations be strictly directed.
That's simply not the case. It can appear that way in certain instances --- an author writes a book, a book doesn't write an author. But books influence authors, and multiple authors (as contemporaries) can influence one another. Topical classifications may descend through one of several directions: a history of a technology developed in a place by a specific person might have points of entry by date, location, biography, technology, application(s), or consequences. There's no single "home" for that concept, but a web (a non-directed, potentially cyclic) graph with multiple relationships, many bidirectional.
That leaves you with a few options:
- You can abandon the directed graph and utilise an indexing and search schema which more accurately describes the relationships.
- You can abandon strict accuracy and settle on a useful directed graph which imposes an arbitrary (and incorrect) hierarchy over the subject. Where physical storage is based on topicality, the requirement of a single locality imposes this requirement.
- You can find an alternate basis for defining location, and conceptually map that by other means. Here a key issue is (as I've described in several earlier comments) that position alone is a guide neither to the content (adjacent documents may be utterly unrelated), nor to the researcher (there is no straightforward exploration path to a specific record or concept), nor to the curator (topics must be specifically assigned and aren't inherent or evident by position).
In declaring that "it is not about knowledge classification", you've attempted to change the scope of the discussion. Even in the example you give, the case falls apart. A project may be associated with multiple PDFs, and a PDF may be associated with multiple projects. How do you describe that set of relations in a directed graph?
No. I made a small PoC and one of the things I tried to learn was whether it is possible to basically disallow creating cycles (because they are a headache). I found that preventing cycles completely was irritating to me as a user, as it would occasionally prevent me from making a change and require me to fix the structure somewhere else before I could do whatever I wanted to accomplish in the first place.
Ultra Recall is an app for managing personal information that uses SQLite internally to store a tree of objects.
Look at Data Explorer. It has various types of nodes, you can define new types of nodes, and there is no restriction on how these nodes can be nested other than they form a tree.
I used Ultra Recall for a couple of years for my personal workflow.
In a normal hierarchical filesystem you have this restriction that files are placed in folders and that's about it. You can have multiple types of files but folders are just bags for files with not much additional data (other than folder name) and you can't have files nested below other files.
In Ultra Recall every object regardless of type accepts child objects.
I found this to be very powerful, for example, I could have an email but then write a note being a child object of the email. Or I could create an object that was an actual working reminder. Then I could move that email to be nested under some task object and Ultra Recall had facility called "saved search" which allowed you to define a search for example for all projects that don't have next actions under them or all tasks that have deadline within next week.
Some other types of nodes are actual files -- you can put a webpage or a document and these become just part of the tree.
One problem with this is that when you add an object you need to figure out a location. As Ultra Recall is a tree (there is always one parent), when you are adding items you need to decide the location (i.e., what the parent is) even when this is not strictly meaningful or when there are multiple potential candidates.
For example, sometimes you have something you have vague idea you want to preserve but you just don't have time now to decide. Or an item could potentially be useful for multiple projects.
But sometimes you can be sloppy with this. For example, Ultra Recall offers full text search so you are free to just dump some of the stuff in a folder and forget. When you need it you just run a search for any part of the document or its properties (tags) and it gets brought up.
So I created a couple of "dump" folders where I could just throw stuff in (like "read later").
I started writing keywords at the top of documents, and I could then more or less ignore where the document was placed if I could not find a good location for it. Deciding where to put stuff is a chore when you create a lot of documents and rely on classification to find them later.
But this only solves the case when you have no immediate parent candidate. What if there are multiple parents that you would like to use?
For example, the parents are various projects and the child is an email that is related to those multiple projects. I would like to be able to have multiple parents (projects) to which the item (email) is attached.
So I was toying with the idea of making a little better Ultra Recall by relaxing the requirement that an item has only a single parent node. In effect, this creates a graph where any item can have multiple incoming relations (children) and multiple outgoing relations (parents).
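That relaxation can be sketched as a plain directed graph where `parents` maps each item to a set of parent items (the item and project names are hypothetical). A visited-set check also keeps ancestor queries terminating even if a cycle slips in, which matters given how irritating strict cycle prevention turned out to be:

```python
# Minimal sketch: items with multiple parents, i.e. a general directed graph.
parents = {}

def attach(child, parent):
    """Attach `child` under `parent`; an item may have many parents."""
    parents.setdefault(child, set()).add(parent)

def ancestors(item):
    """All items reachable by following parent edges; safe on cycles."""
    out, stack = set(), [item]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in out:     # visited check prevents infinite loops
                out.add(p)
                stack.append(p)
    return out

attach("email-42", "project-a")
attach("email-42", "project-b")   # the same email under two projects
assert ancestors("email-42") == {"project-a", "project-b"}
```

Allowing cycles and handling them at query time, rather than forbidding them at edit time, matches the PoC experience described above.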
Just commenting to note that I'm finding this interesting and provocative. Not necessarily convincing, though I'm thinking my way through questions and concerns.
I'll throw one thought out just for grins: why do we bother with documents and records at all, and what is the value in going through old records?
(I'm asking far less because I doubt any value, somewhat more because I've had this question put to me (by someone whose sanity and motives I very much doubt, and for whom no response seems sufficient), and because I think it nudges at some very-frequently unquestioned assumptions and motives about this whole endeavour.)
That said, I'll try to return to and respond to your comments more directly.
> why do we bother with documents and records at all
We bother because, from a developer's perspective, if you are interested in interoperability with other software, it is convenient to store information in files that can be passed by name to another piece of software.
Nowadays we have a lot of webapps that do not store information in files (the databases behind them use files but this is not important for the user). If you notice, interoperability is pretty bad and has to be developed separately, every single time for every single use case. You can't take your "facebook file" and grep it for the contacts you are interested in or pass it to some other program. You can't make a backup by creating the copy of the file, and so on.
But I digress.
The issue isn't about files being a problem (I don't think they are). The problem is about organizing them. There simply isn't any good organization idea implemented in practice. People say files are too hard for normal users but what they really mean is that users are lost in a normal filesystem or they are going to be lost after they have created enough files without being super tidy.
I would be glad if files themselves vanished completely from the users' view and only stayed there as an intermediate storage layer.
Thanks for that, though I suspect developers aren't the only user-community for documents ;-)
If I can paraphrase your statement about files: they solve a technical problem in computer space, but don't reflect how people want to and/or need to access information.
I'm reminded somewhat of mainframe DASD storage, which has a few characteristics:
- A largely flat directory structure. There can be directory equivalents, but to a limited depth (1 or 3 maximum IIRC, possibly depending on OS release).
- Structured files. That is, the concept of a "flat file" or a "binary file" doesn't exist (or is rarely used), rather files are specifically structured, into records (~= lines or rows) and fields (~= individual values), with a record type indicator as the first few bytes of each record. A file could contain (and most often did) multiple different record types, as a hierarchical data file.
- Specific programmes to read each file type.
(At one point I worked in writing code to read data files, including IBM mainframe files. That was many moons ago and much of the work was not on IBM mainframes themselves, though some was. I've suppressed much of the trauma....)
These days we tend to use an RDBMS structured as rectangular tables with columns and rows (also invented at IBM, and available on mainframes, though not often part of my previous life).
Funny thing: SQL was originally pitched as a query language that could be taught to secretaries and administrative staff to directly run queries against databases. Either admin staff were much more brilliant then, or that was an ambitious vision....
Or ... we have various structured data formats that aren't rectangular: XML, JSON, HTML itself, amongst others. These have a structure and can be picked apart, though they're still fairly complex.
What I've seen over a career of 30+ years is that the vision of some single universal data format keeps getting pitched ... and keeps not materialising. Because Reasons: a specific application has its own needs, and/or the "standard" is so complex that it is never implemented the same way twice. Data interchange always needs to be specifically structured and engineered.
For mostly-textual information, which is where I think of tagged systems being most appropriate, the internal structure of documents is ... fairly loose. Metadata + bag of bits, mostly ASCII / Unicode. But that still leaves us with the problem of sorting out what's where and how to access it.
I'd argue that most present filesystems are inadequate to the task in and of themselves, though they might be extended. An interesting and fairly simple example is the maildir format for storing email. It lives on a standard filesystem, and its naming convention is opaque (email message IDs). The file structure itself, based off RFC 822/2822, includes a set of defined and structured fields (headers) providing metadata. A special access program, a mail reader, provides access to the mailbox(es), though other tools can provide more programmatic access. Principal organisation is by date, though other metadata aspects can be used.
As a rough proxy of what a document store might look like ... this isn't a completely bad start, and it might offer guidance to how a more generalised document-oriented filesystem might be structured.
Backing up a bit myself, I've been trying to establish two related, but distinct points:
- There is not a single unambiguous or consistent hierarchical structure to topical knowledge overall. This is a trap numerous organisational attempts, dating to Aristotle, have fallen into. (The history and attempts themselves are fascinating, for those into that sort of thing.) That is, the hierarchy as a whole is not directed.
- An attempt at such a mapping is frequently made where a work or works (frequently: encyclopedias, in the former case, libraries in the latter) must be arranged such that there is a single canonical ordinal or spatial organisation. There are other alternatives, such as alphabetic ordering used in many dictionaries and encyclopedias. This works where the alphabet involved is reasonably compact, has a definitive single collation order, and entries themselves are at least reasonably distributed across the indexing scheme (e.g., there's not a single character with an overwhelming number of topics). It's effectively a hash using the first character under the writing system.
- Even in the far more relaxed case of a defined relationship (RDF triples are frequently used in bibliographic classification), there are cases of relations which are not themselves directed. E.g., authorship is a relationship which exists between one or more persons and a work. A work has an author, an author has works. Similarly, two people are "in" a relationship. Remove authorship or relationship, and all the relations are removed. (Though there may be unrequited love, unrequited authorship is less often observed.) Though we might impute a relationship based on some other: if work has author, then the relationship author -> work can be imputed. This despite the fact that it is the author who volitionally created the work in the first place.... It's ... confusing.
Tagging is a form of RDF triple. Entity has tag. That gives us subject and object, but not much by way of predicate, other than "has".
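As a sketch, a tag assignment is a triple whose predicate is always "has", which is exactly what makes it weak compared with a richer predicate vocabulary. The subjects, predicates, and objects below are invented for illustration:

```python
# RDF-style triples: (subject, predicate, object).
triples = {
    # Plain tagging: the predicate carries no information.
    ("photo-001.jpg", "has", "wedding"),
    ("photo-001.jpg", "has", "1923"),
    # A richer vocabulary restores the lost predicate.
    ("photo-001.jpg", "depicts", "Anna Kowalski"),
    ("photo-001.jpg", "taken-in", "Warsaw"),
}

def objects(subject, predicate):
    """All objects related to `subject` by `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

With only "has", the wedding and the year are indistinguishable kinds of fact; with real predicates they become queryable relations.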
Your Ultra Recall example doesn't do much for me as I'm not familiar with it and can find little information online other than a CNET download link. That said, I'm vaguely familiar with knowledge management systems. UR seems vaguely similar to Hypercard or a Wiki, though with the significant distinction in your description that UR apparently has an explicit parent-child relationship between nodes, whereas in a Wiki (and AFAIU Hypercard), relationships are merely as peers. As you note, this ... can lead to awkwardness and ambiguity, as well as imposing a cognitive load when creating a new node.
OTOH, there are entities which have a specified order. The collection of characters within this post would have much less utility as an unordered (or differently-ordered) set. Pages and chapters in a book, volumes or episodes in a series, scenes in a film, slides in a presentation ... without a specific order, much value is lost.
The alternative I favour is for some metadata to be inherent or intrinsic, whilst other metadata is specifically assigned, created, or evolves.
Intrinsic metadata might include: contents, creation (or acquisition/curation) date, author(s) (if known, which is generally the case for works created within a given management system). In general, time may well be the most universal initial placement mechanism; that is, thinking of the system as a filesystem, date-based access is a principal access method.
Pretty much all other metadata is either explicitly assigned or occurs through interactions. A work might be linked from another. It could have a specific title and subject(s) assigned. Translations, summaries, and reviews might be created from it. It might be cited, or cite other works. That's all metadata, but it isn't endogenously intrinsic to the work itself. (Citations of other works approach this, but still need to be distinguished from a fully endogenous characteristic, as a citation references something external.)
An area I've been exploring but haven't really settled on or found a satisfactory standard is development of word-ngrams or tuples from within the document itself. Whitespace-delimited word clusters, ranging from 2--5 typically, can be highly useful in identifying common, distinctive, and unique elements within works. There are numerous issues with flattening the space (capitalisation, stemming, spelling variants, transliterations, homophones, homoglyphs, character substitutions), but these are a powerful tool. Ngrams are compute and space intensive, but seem quite useful. (For a time Amazon offered a "statistically distinctive phrases" or similar feature for books, showing terms that occurred with far greater than typical frequency within a given work. That's been retired for years.)
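A basic word-ngram extractor along these lines might look like the following. This is a rough sketch only; a real pipeline would add the stemming, spelling, and transliteration flattening mentioned above, and the sample text is made up:

```python
import re
from collections import Counter

def word_ngrams(text, n):
    """Count whitespace-delimited word n-grams, crudely case-folded."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

sample = "the quick brown fox jumps over the quick brown dog"
bigrams = word_ngrams(sample, 2)
# ('the', 'quick') and ('quick', 'brown') each occur twice
```

Repeated n-grams like these are the "common" elements; n-grams with count 1 that appear in no other document are the distinctive ones.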
Another space I've been exploring has been various topical classification schemes, including both the Library of Congress Classification (the A-Z subject headings and subdivisions) and the Library of Congress Subject Headings (an independent set of descriptions frequently found in the LoC classification section of a book's publication and copyright page(s)). Trying to find a single, or a sufficient-but-useful-and-not-excessive, set of high-level summaries for works is its own interesting task. My sense is that much of that might be offloaded to an AI based on a known corpus (the LoC have catalogued some 40--50 million works; with additional samples through published academic articles approaching 100m works, this might prove viable, given full-text access).
And, depending on the types of works you're interested in classifying, there's the question of how to approach images, audio, musical composition, choreography, video, data, and software. Perhaps also plastic arts (sculpture or physical objects), etc., etc. There's also the question of how to distinguish between levels or degrees of a specific work --- in bibliographic literature, the Work / Manifestation / Instance distinction. War and Peace by Leo Tolstoy is a work, the English translation is a manifestation, and the 1972 BBC teleplay on videocassette would be an instance, for example.
Casting a wider net, there are transactions (financial and others), and the plethora of data presently being captured by various surveillance apparatuses. The question of whether improving cataloguing of this would even be of positive value occurs ... there's nothing quite so dangerous as a highly-structured data trove. I've been trying desperately to get into Paul Otlet's work and subsequent efforts to look at approaches tried, and their successes or failures.
I'm leaning toward a system where metadata is acquired at a set of levels:
- General metadata: title, author, publisher, translator.
- Curated metadata: topics, citations, references, projects and workflow.
Back to the top of this thread though: I don't see these as specifically hierarchical or directed, and from what I've seen, things get muddled where such requirements are imposed where they do not have to be. Again, your Ultra Recall example seems to illustrate this.
No, the intention is not to map it 1:1 to a hierarchical filesystem. My intention is to select a starting point, squint your eyes a little bit, and then pretend it can be traversed from that point, the same way you do when you search for a page on Wikipedia and then try to move around by clicking on links on pages.
On Wikipedia you can choose any starting article and probably traverse entire Wikipedia. Your traversal tree will be different depending on what you select as starting point but the end result will be the same -- you will have traversed entire Wikipedia. When you see an article you already saw -- you just skip it for the purpose of traversal.
Another way to imagine this: Windows Explorer
Select your starting article as root.
Then articles reachable directly from the root become first-level nodes, articles reachable from first-level nodes become second-level nodes, etc.
This is an infinite-depth tree because any cycle will cause an infinitely long branch.
But it is fine for a manual traversal (as a human you will notice you are repeating).
And if you are traversing automatically there are easy ways to detect that you have already seen a node (just check whether the id of the node you are visiting has already been registered somewhere else in the tree -- maybe put all ids in a hash set?)
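That is ordinary breadth-first search with a visited set, which is what turns the cyclic link graph into a finite traversal tree. The article names below are placeholders:

```python
from collections import deque

# A tiny cyclic link graph standing in for Wikipedia.
links = {
    "Philosophy":  ["Logic", "Science"],
    "Logic":       ["Philosophy", "Mathematics"],  # cycle back to the root
    "Science":     ["Mathematics"],
    "Mathematics": [],
}

def traverse(root):
    """Breadth-first traversal from `root`; the visited set (`seen`)
    makes each page appear exactly once despite cycles."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:   # skip articles already in the tree
                seen.add(nxt)
                queue.append(nxt)
    return order

traverse("Philosophy")  # visits every page exactly once despite the cycle
```

Picking a different root yields a different tree, but the set of visited pages is the same whole graph, as described above.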
---
I just figured out while trying to answer your question that the better description of what I would like to achieve would be "hyperlinked filesystem" except that hyperlinks also carry additional information about the link itself. In a filesystem there is only one type of link (parent/child folder/file relationship). In a hyperlinked filesystem you would have possibility of many types of relationships.
On a normal filesystem Emacs could create a file with a tilde appended to its name to express it is backup of a file without tilde.
On a hyperlinked filesystem Emacs could create an explicit link between the files with a suitable role.
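One way to sketch such typed links is as (source, role, target) records; the roles and filenames here are hypothetical, with the Emacs backup example made explicit:

```python
# Typed links: each link carries a role, not just parent/child.
links = [
    ("notes.txt~", "backup-of",     "notes.txt"),
    ("notes.txt",  "part-of",       "project-x"),
    ("draft.pdf",  "rendered-from", "notes.txt"),
]

def related(target, role):
    """All sources linked to `target` with the given role."""
    return [src for src, r, dst in links if r == role and dst == target]

related("notes.txt", "backup-of")  # -> ['notes.txt~']
```

Instead of encoding the backup relationship in a tilde naming convention, any tool could query or create these links directly.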
---
A filesystem like that would probably need multiple other ways to locate the files, but it does not have to be a tree-based structure. A tree-based structure would require the user to find a place for the file, which misses the point of the filesystem (being free from having to find a location for the file). For example, in addition to graph-based searches you should be able to search by properties, including full text search on properties like names, descriptions, etc.
I like how this article is laid out, first defining the existing systems used today (which, while they're all I know, I haven't spent the time defining myself) and then describing a number of inspirational examples of how it could be different.
Desktop file systems seem impossibly hard to change at this point, but cloud storage and mobile file systems are still so new and not amazing in my opinion - there’s still hope for a better experience.
I don't think it is just a preference for system files and application files. For documents, sure.
Think about an application that tries to find the settings file. Which tags should it use in the search? How does the user know that no other file on the computer will suddenly be found because they used the same set of tags?
Another key difference is that a hierarchical (or positional) attribute is acquired inherently by a document's position within that hierarchy or location.
In both cases, the hierarchy is inherently positional, affords locality, and guides both the reader and the curator by asserting that locality inherently.
If you're in a place, then that place gives nearness or farness from objects.
If you're an object, you're in some location inherently.
Tags, in this sense, are placeless, and don't provide attributes of or guidance to nearness. At least not without some additional mechanism.
Also known as "facets". Sometimes you might want to browse "year/album/artist" too (think Various Artists / compilation albums), instead of the more common "artist/year/album". Or even "artist/year/album" for the more prolific artists out there.
Progressively "drilling down" is super low-hanging fruit with regards to tag-based organization, it's seriously neglected.
I also agree that the lack of drill-down on tag-based organisation ... kills about 99.9996% of the value. It's sufficiently valuable that you'd expect this to be part of basic implementation libraries by now.
You don't need to define relationships between tags. In my audio player, for example, the songs have certain tags which don't have anything to do with each other, and yet the player can create tree-like structures from them. For example, when you have songs with tags for %genre%, %artist%, %album% and %title%, the player can automatically generate various tree structures with different levels, like:
Note that in the case of artist -> album -> title, there is in fact an actual explicit hierarchy that's being reconstructed. In the case of conceptual tags ... any such structure is far less evidently defined.
Now when the user wants a tree with the first level being genre and the second being title, the player just goes through all the known genres to build the first level with two items (genre_1 and genre_2), then for the genre_1 sub-tree it generates a list of all songs with genre_1 and displays their titles, ... So you end up with:
The point to my question wasn't that there is magic involved, but to ask what the specific method was. Maybe I should have made that clearer, though I was hoping it would be obvious.
In the artist -> album -> song case, to repeat myself, the recognised hierarchy reflects an actual one that's present in the original artefacts. (This needn't always be the case: multiple artists might play the same song, the same song might be on multiple albums, song title itself does not necessarily imply either artist or album. Though in the simple case it does.)
In the genre/title example, how would it be clear that title is a subclass of genre rather than the other way 'round?
And what of fusion or crossover works --- is DNA's remix of "Tom's Diner" folk-rock, a capella, or trip-hop?
> In the genre/title example, how would it be clear that title is a subclass of genre rather than the other way 'round?
It's not clear and it can be the other way around. You can also ask the player to list the genre as a subclass of titles. It'll just be a less useful representation, because the first level usually holds almost as many entries as there are songs, since titles are often unique.
> And what of fusion or crossover works --- is DNA's remix of "Tom's Diner" folk-rock, a capella, or trip-hop?
It can be all of those, like I said you can have multi-value tags. Then this song would show up three times in a tree with a genre level.
> Digging into method, then: more-frequently-occurring tags are presumed to be more general than less-frequently-appearing tags?
No, all tags are handled in the same way by the player. It's completely up to the user to specify the hierarchy in which tags can be browsed. The player then builds the tree dynamically.
> If the latter, it's not clear to me how the tags are directed, other than in the sense of an RDF triple's subject, predicate, and object.
I'm kind of lost here, since I'm failing to see where the confusion comes from. Also, I'm not a native English speaker, so I might be missing something or my explanations just aren't good enough. So let's try this again with some simple pseudo code. Tags here are nothing but key:value pairs associated with audio files. So if the user asks for a tree structure like key_3/key_4/key_1, the player can simply do the following:
keys = {key_3, key_4, key_1};
tree = new Tree;
for song in songs {
    subtree = tree
    for key in keys {
        // key here might be something like "Artist"
        // value here might be something like "Iron Maiden"
        value = song.get_key_value(key)
        if !subtree.has_node(value) {
            // if the subtree doesn't already have the "Iron Maiden"
            // node at the current level, add it
            subtree.insert_node_sorted(value)
        }
        // use the "Iron Maiden" node as the new subtree
        subtree = subtree.get_node(value)
    }
    subtree.insert_song_sorted(song)
}
The number of tags should be more or less similar to the number of non-leaf nodes in a hierarchy, or else you aren't capturing the same information. Any tag that applies to more than half of the files is probably useless. On my blog, the tags "blog" and "technology" are definitely useless. That's fewer than 250 entries and already it has cruft.
Were people consistent when they added tags? Does your system suggest tags automatically? Is this actually a full-text search minus stop words? Is there a librarian who cleans up after you and merges tags that have the same meanings? Would the full-text search be more useful than tags?
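The half-coverage heuristic above is easy to automate. A minimal sketch, where the `tagged_files` dict shape and the sample data are made up for illustration:

```python
from collections import Counter

def audit_tags(tagged_files):
    """Return the tags applied to more than half of all files.

    tagged_files: dict mapping filename -> set of tags
    (a made-up shape for this sketch).
    """
    total = len(tagged_files)
    counts = Counter(tag for tags in tagged_files.values() for tag in tags)
    return {tag for tag, n in counts.items() if n > total / 2}

# A tag like "blog" that covers every entry carries no information:
posts = {
    "a.md": {"blog", "technology", "python"},
    "b.md": {"blog", "travel"},
    "c.md": {"blog", "technology"},
}
print(sorted(audit_tags(posts)))  # ['blog', 'technology']
```

Running an audit like this periodically is the cheap substitute for the librarian who cleans up after you.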
(I couldn't tell you without downloading my entire archive and doing some magick to count them, as Pocket won't give me a count itself.)
Pocket lags by well over a minute in even starting to populate my tags list after I enter some text into the field. (Chrome/Android, using an external keyboard.) It similarly takes several minutes to scroll through the tag list using either the Web client or the Android app.
You could argue that this is a piss-poor UI/UX issue, and I'd agree. It's a piss-poor UI/UX issue that Pocket have failed to address since I first reported it over five years ago.
Hierarchies and tags are not an "xor" proposition.
I always use both at the same time.
Hierarchy gives you a way to organize things relative to each other. I want hierarchy because, if I deal with thing A, I may then need to look specifically at child B or C.
It helps with context and granularity.
While tags allow you to attach multiple categories to things and filter according to that.
You can use a hierarchy as a poor man's tag, but really, you should use both.
There just needs to be a default view that shows you the most important tags to start from - either by most used files, or largest number of files, or most active recent changes, or as managed by someone. The nice thing about tags is all of those could be top level tabs and you are sailing.
I've recently built a tagging system based on SKOS [1]. This supports hierarchical as well as associative relationships between tags (while not strictly requiring either), as well as ad hoc groups of tags.
While SKOS was intended for more formal vocabularies, I've found its use as a basis for a tagging system makes exploration and navigation of a topic area reasonably organic, as it allows users to specify relationships only as they see as fit and intuitive.
Wildland (from the architect of QubesOS; blockchain bits are optional) is tackling this problem: bottom-up ontology graphs that can be meshed into a global address space, where items can appear at multiple places in the hierarchy (similar to Bear).
Finally, a topic I can speak with some authority on.
I'm not sure that tags make that much sense as an organizing schema for a filesystem. It's just too painful for items to not have canonical names. In my original design, a bookmark had a bunch of tags but was keyed on its canonical URL.
Maybe it makes more sense when we're only talking about a user's personal objects rather than the entire filesystem. Having a giant, flat namespace also seems wrong - Wikipedia seems especially strange in this regard, having Thingy_(Star_Wars) where Thingy is both a real thing and also present in some specific context; I frequently think it should be Star_Wars/Thingy (since it has a specific context it makes sense in and does not make sense outside of it.)
Tags were mostly a way for a user to mark found things for that user's future self to potentially retrieve. They end up encoding some combination of the user's internal state and the material of the item being tagged. Different users therefore have different needs - expert users tag things differently than beginners (i.e. "java" vs "programming" - to a beginner there's little distinction between subtypes of programming things). Users also tend to tag much more wildly when they are not offered a reference of previously-used tags.
In my experience, an item that had been tagged a large number of times would have the most popular tag used 50% of the time, the second 33% of the time, and so on. This held strongly enough that much deviation from this was a sure sign of spamming.
This is also where tagging falls down - tagging things for retrieval by OTHER people is almost always mis-incentivized and ends up getting spammed to hell.
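The 1/2, 1/3, ... pattern described above suggests a cheap spam signal: measure how far an item's observed tag distribution deviates from that shape. A sketch, with a made-up input format (tag -> number of taggers who used it):

```python
def expected_share(rank):
    # rank 1 -> 1/2, rank 2 -> 1/3, ... (the empirical shape described above)
    return 1.0 / (rank + 1)

def spam_score(tag_counts, num_taggers):
    """Mean absolute deviation of an item's tag distribution from the
    expected 1/2, 1/3, 1/4, ... shape. Higher means more suspicious.

    tag_counts: dict mapping tag -> number of taggers who used it
    (a made-up input format for this sketch).
    """
    counts = sorted(tag_counts.values(), reverse=True)
    deviations = [abs(c / num_taggers - expected_share(rank))
                  for rank, c in enumerate(counts, start=1)]
    return sum(deviations) / len(deviations)

organic = {"java": 50, "programming": 33, "sun": 25}
spammy = {"buy-pills-now": 95, "java": 5}
print(spam_score(organic, 100) < spam_score(spammy, 100))  # True
```

The threshold for "much deviation" would have to be tuned against real data; this only shows the shape of the check.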
Reminds me of an old project I did where I built a FUSE filesystem for this. Files still lived on a normal ext3 fs, but the overlay presented them as tag paths, so you could access, for instance, a movie like /tagfs/movies/year/1999/matrix.avi or /tagfs/movies/genre/scifi/matrix.avi
There was a special path /tagfs/untagged/ which listed any files that didn't have at least one tag.
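The core of such an overlay is resolving a virtual tag path to the set of matching real files. A non-FUSE sketch of just that resolution step, where the index contents and the key/value path layout are invented for illustration:

```python
# Hypothetical in-memory tag index: real path -> {tag key: tag value}.
INDEX = {
    "/data/matrix.avi":  {"type": "movies", "year": "1999", "genre": "scifi"},
    "/data/alien.avi":   {"type": "movies", "year": "1979", "genre": "scifi"},
    "/data/holiday.jpg": {},  # no tags at all
}

def resolve(tagfs_path):
    """Map a virtual path like 'movies/year/1999' to matching real files,
    mirroring paths such as /tagfs/movies/year/1999/matrix.avi above.
    The first segment filters on the 'type' tag; later segments come
    in key/value pairs."""
    parts = [p for p in tagfs_path.strip("/").split("/") if p]
    if parts == ["untagged"]:       # the special /tagfs/untagged/ path
        return [f for f, tags in INDEX.items() if not tags]
    wanted = {}
    if parts:
        wanted["type"] = parts[0]
        wanted.update(zip(parts[1::2], parts[2::2]))
    return [f for f, tags in INDEX.items()
            if all(tags.get(k) == v for k, v in wanted.items())]

print(resolve("movies/year/1999"))  # ['/data/matrix.avi']
print(resolve("untagged"))          # ['/data/holiday.jpg']
```

A real FUSE implementation would serve directory listings from the same query logic instead of returning lists.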
macOS has had this since the Leopard days. First it was implemented via OpenMeta ([1],[2]) and later, by Apple, via Mavericks tags. Both use the filesystem's xattrs to store the metadata.
I haven't used macOS for a long time now, but when OpenMeta was still active, there were tons of applications that supported it. I loved it and miss it badly. The only issue I had with it was adding all those many tags to your files. Something I'd like to solve with AI, these days.
For me, the best system is a hierarchical file system with tags in xattrs, which can also be used to build hierarchies if needed.
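On Linux, Python can read and write such xattrs directly via `os.getxattr`/`os.setxattr` (macOS would need the `xattr` tool or module instead). A sketch, where the `user.tags` attribute name and the comma-separated encoding are this sketch's own inventions, not OpenMeta's or Apple's actual formats:

```python
import os

# "user.tags" is a made-up attribute name for this sketch; OpenMeta and
# Apple each used their own reverse-DNS keys.
TAG_ATTR = "user.tags"

def encode_tags(tags):
    """Serialize a set of tags to the bytes stored in the xattr."""
    return ",".join(sorted(tags)).encode("utf-8")

def decode_tags(raw):
    return set(raw.decode("utf-8").split(",")) if raw else set()

def add_tag(path, tag):
    """Read-modify-write the tag list on a file (Linux-only calls)."""
    try:
        tags = decode_tags(os.getxattr(path, TAG_ATTR))
    except OSError:       # attribute not set yet
        tags = set()
    tags.add(tag)
    os.setxattr(path, TAG_ATTR, encode_tags(tags))
```

Note that not every filesystem supports user xattrs, and many copy tools silently drop them, which is one reason tag metadata stored this way needs a backup strategy.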
Wasn't this the intent behind Windows Vista - a tag-based DB-as-filesystem with hierarchical paths just one "lens" through which to view the DB? I use Google Drive this way, largely through search rather than directory-based organization, though I do also employ that for often-used collections.
I like to organize files in a similar way but use spaces in filenames. I find it a lot easier to read filenames that way, and in the shell, <tab> will add all the necessary quotes / backslashes.
> Soft links have a nearly opposite set of problems as hard links – soft links can span different file systems, but they generally don’t track the target files getting moved or renamed (except that Windows provides such a system service, but may not be reliable); while hard links are all indistinguishable, some application software behaves differently on soft links than on real hard-linked files. In spite of these problems, hard and soft links still require an exponential amount of effort to classify a set of files in multiple ways, and require the user to manually remember all the possible paths that a file can be reached from (important when editing and removing files, not important when browsing/retrieving files). They are non-scalable kludges compared to true tagging.
As a fan of hierarchies for file organization, and as someone with quite literally thousands of soft links, when I read things like this, I don't know if the person arguing against links has tried this approach. It works totally fine for me.
Yes, with soft links, moving the original file breaks the links. I wrote a fairly simple bash script to automatically fix these for my reference PDF files. It works because each PDF file I save has a unique file name. So figuring out where the links need to point is pretty simple. That makes me a "power user", I guess, but the author is at least at the same level and I think could figure it out.
With respect to the "exponential amount of effort to classify a set of files in multiple ways", I guess the author is referring to navigating the hierarchy to link a file in multiple places? I use tagging at my work, and I personally find scrolling through my list of about 200 tags to be comparable in terms of time to navigating through a hierarchy. The bottleneck is the human.
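The link-fixing script described above is straightforward to sketch in Python too, assuming (as the parent does) that every saved file has a unique basename:

```python
import os
from pathlib import Path

def fix_broken_links(link_root, file_root):
    """Repoint broken symlinks under link_root to the file with the same
    basename found anywhere under file_root. A sketch of the idea above;
    it assumes every saved file has a unique name."""
    by_name = {p.name: p for p in Path(file_root).rglob("*") if p.is_file()}
    for link in Path(link_root).rglob("*"):
        if link.is_symlink() and not link.exists():   # target is gone
            replacement = by_name.get(Path(os.readlink(link)).name)
            if replacement is not None:
                link.unlink()
                link.symlink_to(replacement.resolve())
```

A production version would also want to detect basename collisions and log every repointed link.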
I use hard and soft links (mostly the former for files and the latter for directories) and I think it works very well in practice despite the theoretical problems brought up.
While we're at it, go beyond overwriting a single content stream per name. VMS had ;version numbered suffixes for versions of each file. Git follows a model where each unique content item is a blob and there are references to it from different names and versions. Lots of 'filesystems' didn't have directories, or did have them but implemented as a flat structure: mainframe OSes and LAN filesystems like Novell, which I discovered one time when an entire drive got all its files deleted by some bug. It was possible to undelete them from the LAN 'trash', where all the dir/file names were in a very long flat list.
Just cancelled my Google Drive subscriptions last week due to this, and moved to SFTP on rented metal. If you can’t 100% predict what I’m looking for then don’t even bother. Just give me files and folders.
The article fails to mention the existence of graph databases or graph theory, except for tuple spaces, which lead in that direction. "Property graph model" is the key phrase to search for if you are interested in graph databases' suitability for data organisation: http://graphdatamodeling.com/Graph%20Data%20Modeling/GraphDa...
With graph databases you can easily and efficiently model any kind of network, including ones that are hierarchical or almost hierarchical by allowing a node to refer to multiple parent nodes instead of one.
My original idea with my bookmark extension Spellbook was this latter kind of graph, and I implemented a prototype called Grimoire using Ruby on Rails and Neo4j graph database that worked very well.
Spellbook currently only allows adding new bookmarks into the hierarchical structure imposed by browser APIs, but offers an easy-to-use search to find the right category. Spellbook is available for Chrome and Firefox, but the Firefox version seems broken again by their API changes: https://github.com/peterhil/spellbook
I got the idea for Spellbook, because I was learning programming, but also having some projects that involved quite a bit of research on subjects like audio physics, statistics, data visualisation etc., and I wished I could keep my bookmarks in a generic hierarchy, but also attach some categories (bookmark folders) to categories for my different projects at the same time.
So, I also see the value in having subcategories in addition to freely formed networks or associations.
I also had the idea to apply these kinds of ideas about non-hierarchical file organisation to music or photo library organisation, and I really wish there was file-system-level support.
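The multiple-parents idea from the graph-database comment above can be illustrated without any database: give each category a list of parents and enumerate every root-to-item path, each of which is one "folder path" the item would appear under in a hierarchical view. All names here are invented for illustration:

```python
# Each node lists its parents; an item may hang off several categories.
PARENTS = {
    "toms-diner-remix": ["folk rock", "a capella", "trip hop"],
    "folk rock": ["music"],
    "a capella": ["music"],
    "trip hop": ["music"],
    "music": [],
}

def all_paths(node):
    """Every root-to-node path; each is one 'folder path' under which
    the item would appear in a hierarchical view."""
    parents = PARENTS.get(node, [])
    if not parents:
        return [[node]]
    return [path + [node] for p in parents for path in all_paths(p)]

for path in all_paths("toms-diner-remix"):
    print("/".join(path))
# music/folk rock/toms-diner-remix
# music/a capella/toms-diner-remix
# music/trip hop/toms-diner-remix
```

This is exactly the "almost hierarchical" case: the structure stays tree-shaped for most nodes, and only crossover items fan out into several paths.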
Didn't the BeOS filesystem (BFS and later OpenBFS) make this possible? I remember being impressed with the way extended attributes made the file system and files into a database for any programs that wanted to interact with them. And, I don't know exactly how it worked, but the system supported live queries that allowed you to drill down based on the attributes. MacOS and Windows both created things intended to appear like live queries, but they always seemed slower and less elegant.
One of the best uses of tags is to let a file effectively exist in multiple “folders”
For example I have folders of screenshots named after various shows and games, and I use tags to further organize the images based on their suitability for different “reactions” on online forums :)
So naturally I have hundreds of tags but macOS doesn’t seem to keep up with that many and after a certain point it feels like Apple have forgotten about tags and improving their integration into the system.
For tags I use metadata, which is embedded as mapped text at the end of many media formats, for example ID3 data in MP3 files.
I have found this incredibly helpful for MP3s because I have thousands of them and there are many similar names. Windows Explorer provides columns for this data in its detailed file system view (not by default) which is incredibly helpful and trivial to customize.
For everything else folders are enough. I have hundreds of movies on a hard disk and yet folders are enough. When I do need more, the data I want is generated by the file system: last modified, file size, and so forth.
What improves file usability the most for me is network access by metadata. Windows Explorer and OSX Finder are nice, but I would rather have the exact same interface on my local machine for a bunch of remote machines, regardless of their file system or operating system. Copying to a different machine would then just be drag and drop from one window onto another, in an application that looks like the local OS; that windowing interface needs to allow sorting, filtering, and search by metadata, just like Windows Explorer. Having an application that does this for me has been great.
If you're going to use a non-standard tagging system, it's best to have a backup. For example, Gmail uses tags instead of folders, and to most users they seem to be folders. It's possible to not even be aware of this difference.
Sometimes Gmail loses all your tags. Could you imagine if all your files ended up in the root? Also consider how you would back up such a schema in a traditional file system.
The same is happening in webshops. Products can be assigned to hierarchies (categories) but in shops like Amazon it is obvious it is more like tagging.
And Amazon, thanks to that, is opaque: I never look around for stuff, I just search and either find it or not. They could have made something better, where you can discover other stuff.
- A controlled vocabulary. This addresses the problem of numerous terms referring to the same concept, the same terms applying to different concepts, disagreements on spelling or charactersets, and standardisation or cross-references between multiple terms.
- Externally applied, subject to authority independent of an author or publisher. This addresses the problem of keyword-stuffing.
- Capable of referencing information not within a document itself. Its place of publication, earlier or later versions, cultural context, citations, amongst others.
- Can be applied to nontextual media: images, audio, video, data.
Tags could be a controlled vocabulary, but in almost all cases they are free-for-all.
Archive of Our Own has an interesting scheme, where authors tag their pieces freely, and volunteers behind the scenes enrich those pieces with tags that are indeed from a controlled vocabulary.
(They also do other mind-boggling stuff: in those tags they encode all kinds of information, so that you can search for a Kirk-Spock love story where violence is involved and Spock is dominant, and so on…)
I could have been clearer, but tags, even in a highly informal process, *are* supplied by the reader(s) or curator(s) rather than the author/publisher, at least in the context of organising one's own archive. In that sense, tags are, even if not highly structured, a controlled vocabulary.
(I'm not referring to tagging that's provided by a publishing site itself, though yes, that's a fairly common practice. Tagging by a skilled third-party curator or librarian can of course be excellent.)
Yeah, tagging really seems like the job of a file manager (and indexer) rather than a file system. That seems like an easy way to get everything we want without rewriting billions of lines of code that deal with hierarchies.
Good ideas, but there are a few things wrong with this.
First, we forget that filesystems are not hierarchies, they are graphs, whether DAGs or not. [1]
Second, and this follows from the first, both tags and hierarchy are possible with filesystems as they currently are.
Here's how you do it:
1. Organize your files in the hierarchy you want them in.
2. Create a directory in a well-known place called `tags/` or whatever you want.
3. For every tag `<name>`, create a directory `tags/<name>/`
4. Hard-link all files you want to tag under each tag directory that applies.
5. For extra credit, create a soft link pointing to the same file, but with a well-known name.
This allows you to use the standard filesystem tools to get all files under a specific tag. For example,
find tags/<name> -type f
(The find on my machine does not follow symbolic links and does not print them if you use the above command.)
If you want to find where the file is actually under the hierarchy, use
find -L tags/ -xtype l
Having both hard and soft links means that 1) you cannot lose the actual file if it's moved in the hierarchy (the hard link will always refer to it), and 2) you can either find the file in the hierarchy from the tag or you know that the file has been moved in the hierarchy.
Of course, I'm no filesystem expert, so I probably got a few things wrong. I welcome smarter people to tell me how I am wrong.
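Steps 2-5 of the scheme above are easy to wrap in a small helper. A sketch in Python (the `.where` suffix for the companion symlink is this sketch's own convention, not a standard one):

```python
import os
from pathlib import Path

def tag_file(file_path, tag_name, tags_root="tags"):
    """Hard-link the file under tags/<tag_name>/ and add a companion
    symlink recording where it lives in the hierarchy, as in the
    steps described above."""
    file_path = Path(file_path)
    tag_dir = Path(tags_root) / tag_name
    tag_dir.mkdir(parents=True, exist_ok=True)
    # The hard link keeps the content reachable even if the original moves.
    os.link(file_path, tag_dir / file_path.name)
    # The soft link tells you where the file sits in the hierarchy.
    (tag_dir / (file_path.name + ".where")).symlink_to(file_path.resolve())
```

After tagging, the `find` commands shown above work unchanged; hard links do require that the tags directory live on the same filesystem as the files.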
#include "someLib/someLib2/someThing.h" would look like in a "tagging" world?
Sorry, probably a stupid example, but hierarchical has been there from the beginning (I still remember how limiting Apple ][ DOS was without folders, and how cool ProDOS was when it introduced them).
> #include "someLib/someLib2/someThing.h" would look like in a "tagging" world
#include "someLib/someLib2/someThing.h"
Or whatever you choose as a delimiter for a hierarchy of tags.
The biggest advantage is that you can use an (well ;) infinite number of tags to get to the same 'file' (without the hassle of needing to use symlinks), so that could also be
#include "currentlyUsedLibs/someThing.h"
and
#include "foo/bar/baz/what/why/hmm/someThing.h"
The problem isn't the hierarchy, but that there only exist a single (modulo symlinks) hierarchy to every 'file'. And that every sibling must have a distinct 'name' (the cat images in the article).
There is also the peculiarity in the C++ preprocessor that it'll probe candidates among the -I<folders> and pick the first one that succeeds. Now you would need to support this for "tagging" too. Not sure even how. It does not translate well, I think.
(Obviously I'm just complaining here. On the side of artists, I think it was much easier for them (and for me actually, when it comes to my own photos) to find things by tags - be it automatic, date, time, whether it's a document, location, etc.) - but then having to rely on a consistent ("stable") name is going to be hard.
OK, maybe this is the right thread for this question: does anyone have a good structure for their own MP3s? I'd imagine tagging all files and using those tags for playlists would be a good way to do it. Does anyone do something like this or similar and can give pointers?
The MusicBrainz database is probably the best there is, and it does a good job of automatically tagging your music. Once tagged, every file will have a unique musicbrainz_trackid (UUID) in its metadata, which can be used to recover/update the metadata associated with the track automatically from the database, which is constantly updated (and to which you can contribute if it is missing metadata for your tracks).
You can configure Picard to arrange files and rename them however you want. It has some simple scripting functionality so you can name things conditionally based on the presence or absence of metadata, etc. [https://picard-docs.musicbrainz.org/en/tutorials/naming_scri...]
If you are concerned about privacy, you can run your own musicbrainz instance in a VM and download a copy of the entire database.
Picard is extensible with Python. There's some existing plugins for generating playlist files.
A complementary alternative I'd suggest is beets[1], a front-end agnostic CLI tagging utility that also matches your files against the MusicBrainz database and can both correct the ID3 tags and maintain a directory hierarchy based on those tags.
The biggest shortcoming of the MusicBrainz database I've found so far, however, is genre tags. Most releases seem to have only one or a handful of genres listed, with no consistent genre hierarchy convention, but I've been experimenting with an extension that pulls genre tags from Discogs.
I guess I'll need to blog about it because I can't find any information online, but still nothing has beat Sony's SonicStage, which is by most accounts very annoying proprietary software for working with minidisc players that want ATRAC encoding, but also included an excellent tag navigator that worked like so:
While playing any song in your library, you could display a graph view which showed the song center screen, and radially arranged spokes enumerating what the song was tagged with, "rock", "instrumental", "upbeat" etc, and when you clicked that tag, it would become center-screen and all the songs with that tag would be radially arranged around that tag. So you could navigate your library by kind of surfing the tag-graph and hit play/add to playlist as you go.
Last I checked there were .exe's compatible with Windows 10 available so I'll have to download it again and try it out.
I had this idea long ago, but at the same time I worry about things being too fluid, both for performance and for information efficiency. A tree can be seen as a preemptive, good-enough tag ordering. Some obvious dimensions like category and time will always be of use.
I've been thinking about this a lot in the past, and I also think that tags would make the most sense, at least for a new type of "desktop UI" created around the idea of quickly listing and finding files by typing tag names instead of an actual "desktop metaphor".
I guess the idea didn't get much traction because coders are often using the cmdline (where the hierarchical organization works reasonably well), while creative professionals often use the file organization methods offered in specific applications and probably don't spend all that much time in Finder/Explorer (except maybe for bulk-copying files around).
Specific to photos, and most commonly scanned photos is the problem of "approximate" or "uncertain" dates.
The various blogs (exiftool forum &c) all discuss different ways of coercing "probably 1952" or "before 1940 and after 1920" into the EXIF structure.
I'd love to hear of ways to leverage tags to do this. Perhaps assigning a specific date (the blogs often suggest 1952-01-01:01:02:03 as a datetime, to signal "uncertainty") could be improved on with an "uncertain" tag.
There's also the syntax of tag separation inside EXIF. The various systems can't entirely agree what notation to use.
> 1952-01-01:01:02:03 as a datetime, to signal "uncertainty"
I instantly thought that least significant fields are used to indicate uncertainty range (10 days around 1952-01-01, or 10 years around 1952?). After all, the older the photo, the less sure we can be about its timestamp (usually), so we can use even/odd seconds as a flag to indicate exact time or uncertain period length (log scale). It would be quite an evil hack of a common fixed timestamp format.
That's very clever. I thought about using alternate date/time fields to signal the degree of uncertainty as a magnitude around the declared datetime, which then becomes the centre. Or you can use some positive/negative signals to show not-before or not-after.
ExifTool handles zeroes in datetime strings (I had to add that support to PhotoStructure a couple months ago!), but I suspect most other applications will croak on those values.
I'd also really like a way to identify very-rough dating. A (custom?) tag could encode "precision", maybe with a time units, so a +/- 10 year "precision" would handle your "between 1920 and 1940" case.
If you've got a suggestion, I'm interested: PhotoStructure has a ton of users with this same issue.
I prefer the custom tag approach, remembering that precision may go to day or month or year independently: I may know it's Easter but not which year the wedding is, if you follow my reasoning, or know it's 1932 but not which day of which month.
That said, in month digits you have 13-99 and in day digits you have 32-99 as well as 00 in each field.
So if you find yourself wanting to overload a field, there is one special value in each of month and day to flag one thing, and then 6 bits in each field other than the year to signal specific things. It could be 1 bit for +/- and then 32 values of imprecision, which means you can express uncertainty within a year completely, relative to any date.
The standards guy in me says "please don't overload fields"
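Decoding the zeroed-fields convention mentioned above (ExifTool-style `YYYY:MM:DD HH:MM:SS` strings where a zero means "unknown") might look like the sketch below. This is one possible convention, not a standard:

```python
def parse_fuzzy_date(exif_dt):
    """Interpret zeroed fields in an EXIF-style 'YYYY:MM:DD HH:MM:SS'
    string as 'unknown at this precision'. Returns (year, month, day)
    with None for unknown parts."""
    year, month, day = (int(x) for x in exif_dt.split(" ")[0].split(":"))
    if year == 0:
        return (None, None, None)
    if month == 0:
        return (year, None, None)     # year known, nothing finer
    if day == 0:
        return (year, month, None)    # year and month known
    return (year, month, day)

print(parse_fuzzy_date("1952:00:00 00:00:00"))  # (1952, None, None)
print(parse_fuzzy_date("1952:06:14 10:30:00"))  # (1952, 6, 14)
```

A "before 1940 and after 1920" range would still need a separate custom tag, since a single zero-padded datetime can only express one truncated point, not an interval.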
I always see tags as a patch for when search doesn’t work effectively. If I have a hold of some content that I will want to find later and it doesn’t have the metadata needed to show up in a future search, I have to add that metadata myself.
My solution is that all content goes into a new folder. Projects get meaningful folder names so that at least a search for “School” will find “School Trip 1997”, even if the documents in that folder “PASSPOR~.PDF” don’t themselves show up in the search results.
In my experience it’s been rare where tags have been used to put data in more than one place.
Tagging assumes the user will always have the ability to type or otherwise input the tag metadata. A hierarchy allows finding files by just clicking up or down the tree and double-clicking the file(s).
I've been doing a hybrid approach for a hypertext system I built.
All documents have a canonical position in a filesystem, but are arbitrarily "taggable". Tags themselves are also files, but the property of "having a tag" is encoded as one file linking to another. The only difference between a tag and say an article is that the tags are in a different directory (not that they need to be, but it makes it easier to parse as a human).
All documents automatically list internal backreferences, which means the tag files become useful for navigation.
I find that using Everything search (https://www.voidtools.com/) makes me use the filesystem more like a tagging system. I still name files and directories meaningfully, but I don't worry about the hierarchy at all. Then when I want to get something, I just search (parts of) the terms I want and see the matching paths instantly.
Using the filesystem hierarchically now feels painfully slow and awkward.
I do this with fd and ripgrep, which also lets me do full-text search through the files themselves. My filesystem has become much flatter and coarser-grained; just a few very large categories.
I love Bear with its 1+ hierarchical tags per note. It's the perfect combination of both worlds for me. Would encourage the author to check it out (if it's any different).
It's easier for me to put a folder "summer photos 2012" under photos than to tag everything. In some cases tagging would just bring garbage, e.g. a "script" tag would surface the tsunami of script.js files that got downloaded along with HTML pages. I'm saying I can't think of a scenario where it'd be mostly helpful, but I don't know whether there are scenarios where tags are better.
If only the HFS and the OS using it would consistently yield the documents I recently used, across all file system dialogues, that alone would be a huge benefit for me.
Apart from that I'd love to see a tag based file system. Would use.
Hmmm, I wonder why they don't talk about the Notion / wiki style of data organization model with links in this article. I guess because it's more dependent on the text document format.
And then we should talk about categories (and subcategories, relations or associations) instead of tags, which to many people seem to represent a flat namespace of tags.
, plus some of the discussion threaded from those posts. (Sorry, my stuff needs rewriting and updating but I'm not in the position to do it at present. If there's anything you would like to ask about please do. https://news.ycombinator.com/item?id=9809041 and https://news.ycombinator.com/item?id=10548477 touch on things that are a bit further down the line, but related—in particular, to the handling of "internal metadata" and files with a compound internal structure.)