Assignment for week 7: Ontology of collaborative work
The most important task of a systems analyst is to decide which entities the system is going to represent. This is a strange task, one that falls across the boundary between engineering and metaphysics. The basic question is this: what does this particular system think that the world is made of? For the past quarter century, the task of answering this question has been called "data modeling". Although methodologies for data modeling have proliferated, their basic method is to perform a rational reconstruction of the concepts of the people -- the customers, that is, and not necessarily the "end users" -- who want the system built. The customers' nouns become data entities and their verbs become processes. I make data modeling sound easy, but of course it's not. It has the advantage of encouraging the analyst to listen and learn about the world where the system will be used, but it has the disadvantage that the customers' concepts, for all their authenticity, might not add up to a sufficiently solid metaphysics. In this assignment, therefore, we will skip the listening part and focus on our own concepts.
Your task this week is to prepare a data model for an information service that might support interdisciplinary team-based work. You'll work back and forth between guessing what entities the service should represent and guessing what it will do. Come up with a few scenarios, and present a data model that would provide the groundwork for them. I do not assume that the meaning of this assignment is obvious, and so I will explain it at some length.
Let us begin with an example. Suppose that one were to design a service for reserving meeting rooms. The meeting room calendar would be visible to everyone who works in a given office, presumably by pulling up a Web page, and anyone would be able to claim a block of time. What entities would such a service need to represent? To find out, one simply pulls out the words from the informal specification I've given. I'll start by repeating the relevant hunk of text:
... for reserving meeting rooms. The meeting room calendar would be visible to everyone who works in a given office, presumably by pulling up a Web page, and anyone would be able to claim a block of time.Here are the nouns:
meeting roomhere is another word that looks meaningful:
calendar
someone/anyone
office
block of time
Web page
visiblehere are the verbs:
pull up (the Web page)We have most of what we need, but we need to clean it up a little:
claim (block of time)
works (in the office)
* The entity "meeting room" seems straightforward. In practice there will be gray areas: how about classrooms, lunch rooms, and so on? Perhaps those decisions will be left to the people who configure the service for a particular office; they will decide what precisely the phrase "meeting room" will mean for them.
* How about the entity "calendar"? We only need to represent it as a distinct entity if there is more than one calendar. One calendar can include many meeting rooms, so do we have any reason to support multiple calendars? I can imagine a reason, for example if the same system is going to support many offices. We should ask the customer.
* The words "someone" and "anyone" are obviously meant as informal glosses. In logical terms they are quantifiers, and they leave implicit what they are quantifying over. The phrase "everyone who works in a given office" sounds like a relationship, so we can say that an entity of "person" can be related by "works in" to an office. Once we start filling in the entity-relationship diagram, we will be forced to ask some useful questions: can the same person work in more than one office? We might also wonder about the connection between this entity of "person" and an entity such as "employee" that another data model might have. Questions will soon arise. Can a visitor reserve a meeting room? How about a temporary employee? An intern? The big boss who works in headquarters?
* The entity "office" is clear enough. We'll have to decide whether every office has exactly one calendar and vice versa; if so then that constraint is easy enough to express in an entity-relationship diagram. Or perhaps we don't want the extra generality of multiple offices and multiple calendars, in which case neither "office" nor "calendar" will be entities. Or perhaps we have some good reason to loosen the mapping between offices and calendars.
* The phrase "block of time" could be an entity. It sounds like it has some internal structure, like a date, a starting hour, and an ending hour. That's fine. In an entity-relationship diagram, this internal structure would be coded as more entities and relationships. So the "block of time" entity could have both a "starts at" relation and an "ends at" relation to the "time" entity, as well as a "occurs on" relation to the "date" entity. This kind of thinking will be alien to those who are accustomed to object-oriented design, but it adds up to the same thing in the end.
* How about "Web page"? Is that an entity? Perhaps not -- it might be the way that calendars are implemented. We have to keep our service layers straight, and "Web page" might be an entity on a different service layer -- the Web, to be precise -- than the one we are designing.
* How about "visible"? At first it sounds like a relationship between the "calendar" entity and the "person" entity. But in an office with 100 employees, that would require us to record 100 relationships between people and calendars. If we really are saying that every calendar is visible to everyone who works in a given office, then maybe it's enough to relate a calendar to an office (and vice versa), and then relate an office to a bunch of people (and vice versa). The "visible" relationship might be redundant. If someone wants to query whether calendar X is visible to person Y, that query could simply be expanded into a two-part conjunction: is there an office O such that X works in O and the calendar for booking meeting rooms in O is Y?
* Let's talk about the verbs. The verbs are probably not part of the data model. They represent processes that we can implement once the data model is defined and the necessary databases are implemented. But it is useful to look at the verbs and make sure that they stand in a coherent relation to the data model. We can ask "who can do it?" and "what can they do it to?". A person can pull up a calendar if they stand in a certain relationship to it, and so on. All looks well. Of course, we may discover extra complexities and ambiguities later on, but hopefully by clarifying our concepts now we have minimized the potential for major surprises.
That's a simple example. Here are some other examples that would be a lot more complicated:
* A system for recording design rationales. As the team talks its way through the design process, someone takes notes on every proposal offered, every issue raised, every constraint discerned, every trade-off defined, every choice made, every reason given, and so on. That way, once the design is completed and the finished system is implemented, the original designers' wisdom doesn't disappear when they do. A later generation of designers who need to modify the system can go back to the design rationale and learn why the choices were made. That way, they can avoid making mistakes that the original designers had thoughtfully avoided. In fact, because the rationale explicitly represents every component of the designed system, it should be possible to hyperlink from a diagram of the system to a complete rationale for the design of each part. The problem with design rationale capture is that the design rationales are wildly complicated. The necessary data model is huge and baroquely metaphysical. This is a case where it's important to be judicious about structure, only imposing structure on those elements of the rationale that someone truly needs to be able to search on, compute with, or otherwise turn into standardized data. After all, if nobody really needs the rationale to become an array of database entries then you can much more easily just roll a videotape during the design meetings, or else hold structured interviews with the designers and capture those on tape. The resulting unstructured data, with perhaps some metadata on top of it, will be much easier to create than structured data, and will present many fewer controversies.
* A system for recording designs, including the partial designs from halfway through the design process. Here you would need a vocabulary in which all of the designs could be described. You might want entities such as "component", "input", "output", "process", and even "entity" -- after all, you would want to represent the data model for a design. Such a system would be terribly useful: you could have a design editor with a point-and-click interface. The interface could automatically check your design for obvious errors, or it could prompt you to fill in missing information. It could also display the design in many different diagram conventions, because it would have some comprehension of the internal structure of the design.
* A system for recording skill sets. Imagine an organization with 10,000 employees, all of whom work in an endless shuffle of interdisciplinary teams. A program manager might enter the skill set that a team would need taken all together, and the system could propose different groups of people who would cover all those skills. You would probably want to represent availability, geographic constraints, and things like that. Perhaps the skills can be represented with a nice big hierarchical vocabulary, or perhaps skills are more complicated. The question, however, is not whether "skills" as such are complicated in general, as a matter of objective truth. Rather, the question is how the customer talks about skills, and how the customer might want to talk about skills in the future. Get the talking started, and then parse it.
Entity-relationship data models are not terribly complicated. They force clear thinking about some very basic questions, and as such they are a huge advance over the haphazard design methods of yore. But they are limited in many ways. Perhaps most importantly, they are not hierarchical. That is, they do not provide a natural way to represent relationships between entities that generalize or specialize one another. Consider, for example, a data model with entities such as "person" and "employee". How to notate the obvious relationship between these two entities? One could define a relationship like "is-a", as in "an employee is a person". But that would be wildly ambiguous. Why not say "Joe is an employee"? Because those are two differnet meanings of "is-a", only the former of which belongs in an entity-relationship diagram. (Joe is not an entity; he is an entity instance. In other words, Joe is not a concept but a particular example of a concept. He would have his own row in a database.) Also, the entity-relationship scheme provides no natural way to represent what mathematicians call the "transitivity" of "is-a": if a vice-president is an employee and an employee is a person, it follows that a vice-president is a person. AI people who work in the area of "knowledge representation" have invested tremendous effort sorting out (what they call) the "epistemological status" of all of the various links one might wish to draw in an entity-relationship diagram: links representing causal connections, logical connections, part-whole relationships, and so on. Those who are interested in such things can see the citations in my chapter of "Technology and Privacy".
We need not get into those things for the assignment, but I do think it is useful to work through an example of (what the object-oriented framework calls) a class hierarchy. Having worked through your data model, therefore, you might give some thought to the natural abstractions that your model presents. To continue the previous example, if your data model includes entities called "person" and "employee", then obviously "employee" is a subclass of "person" (although police dogs are supposedly considered police department employees). Or to take a more abstract example, "decision" might be a subclass of "action", which is a subclass of "event". This is where the metaphysics comes in. One might try to enumerate every subclass of "action" which occurs in the course of team-based work -- it sounds crazy, and it's definitely impossible, but it's a useful exercise anyway. Don't try to represent these class/subclass relationships in the entity-relationship model (it can be done, but you don't want to know how). Just start from your entities, call them each a class, fill in the missing subclasses and superclasses to complete the class hierarchy, and draw a simple diagram in which the classes are arrayed in a hierarchical tree-structure, or an outline.
As a reminder, don't try to do everything. Your data model should not be an encyclopedia or a dictionary. The point is to make design studies that focus in a sustained way on a limited set of issues in hopes of learning larger lessons. If your data model has fifteen or twenty entities then you're in good shape, and you can probably do a thoughtful design exercise with fewer. "Sustained" here means working the example over several times, hitting walls or noticing patterns each time, and then starting over. Next week we'll move onward from the data model to the uses that could be made of it. But it'll be useful to think ahead this week, making scenarios that might be spelled out in more detail next week. In fact, next week you may well have to backtrack, reworking part or all of your data model as the practicalities of data plumbing become clearer.