Data storage

Data typology

Like any other Content Management System, Shaman manipulates two kinds of data:

Content Data. Can be XML, images, sound files, or any kind of binary content. Information is "atomic": a content resource is not supposed to be changed internally (a change leads to a new version, or an overwrite). There can be dependencies between resources, but those dependencies are supposed to be handled at import time or at processing time.

Repository Data. This data can be about content data, user information, or site structure. Information is subject to fine-grained changes. Data is structured, with notions of referential integrity.

Possible solutions

Here is a comparison of different solutions for storing both content and its metadata:


Solution	Pros	Cons
Everything in an RDBMS	Referential integrity. Transactions (possibly).	Most RDBMSs handle Binary Large OBject (BLOB) fields poorly. Opening a simple HTML page with a dozen images could lead to a serious performance drop. Programming database access is always a pain, due to the "impedance mismatch" between objects and tables. RDBMSs don't handle graph structures gracefully.
Everything in the filesystem	Fast, efficient access to content data	Requires implementing a referential integrity layer, plus random access to repository files. In effect, this would mean redeveloping an RDBMS.
RDBMS for repository data, plus filesystem for content data	Each is used for what it does best	Need to keep both in sync

Oracle 8i and later versions claim to support very big content files the way we need. This sounds good. Has anybody heard of an Open Source, free version of that stuff ?-)

It is clear that a hybrid solution (RDBMS + filesystem + synchronization) is the best compromise in the long term. Synchronization is not so difficult to write, provided concurrent accesses to the repository are correctly managed.
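
As a rough illustration of the hybrid approach, the sketch below keeps content bytes on the filesystem and metadata in a repository, and checks that the two stay in sync. Everything in it is an assumption made for illustration: the HybridStore class and its methods are hypothetical, a HashMap stands in for the RDBMS table, and a single lock is the simplest possible way to manage concurrent accesses.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical hybrid store: content bytes on disk, metadata in a "repository". */
class HybridStore {
    private final Path contentDir;
    // Stand-in for an RDBMS table (assumption): resource id -> content type.
    private final Map<String, String> repository = new HashMap<>();

    HybridStore(Path contentDir) {
        this.contentDir = contentDir;
    }

    /** Convenience factory that stores content in a fresh temporary directory. */
    static HybridStore inTempDirectory() {
        try {
            return new HybridStore(Files.createTempDirectory("content"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    Path contentDir() {
        return contentDir;
    }

    /** Writes the file first, then records its metadata, all under one lock. */
    synchronized void store(String id, String contentType, byte[] content) {
        try {
            Files.write(contentDir.resolve(id), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        repository.put(id, contentType);
    }

    /** Names of files on disk that no repository entry refers to. */
    synchronized Set<String> orphanFiles() {
        Set<String> orphans = new TreeSet<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(contentDir)) {
            for (Path file : files) {
                String id = file.getFileName().toString();
                if (!repository.containsKey(id)) {
                    orphans.add(id);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return orphans;
    }

    /** Ids of repository entries whose content file is missing from disk. */
    synchronized Set<String> missingFiles() {
        Set<String> missing = new TreeSet<>();
        for (String id : repository.keySet()) {
            if (!Files.exists(contentDir.resolve(id))) {
                missing.add(id);
            }
        }
        return missing;
    }
}
```

A periodic job could call orphanFiles() and missingFiles() and repair or report whatever they return. A real implementation would replace the map with JDBC calls inside a transaction.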

Intermediate solution: Prevayler

At this early development stage, we want to quickly validate a Model describing the Repository Data (getting its semantics right would make Shaman a development platform of choice).

Everybody knows that "dismounting" objects before putting them in a database is painful and time-consuming ("remounting" is no better). On the other hand, the Open Source object persistence solutions we looked at relied on tricky mappings and stub generation, while putting severe restrictions on object semantics. So, things were looking bad.

...Until we found the Prevayler project. Prevayler is a clever persistence engine that does nothing more than keep all Java objects alive in memory, while knowing how to restore them when needed.

Prevayler lets you program plain old Java objects: just make your Model Serializable and modify it through Command objects (which must be Serializable, too).
And that's all.
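
The idea can be sketched in plain Java. The code below is not Prevayler's API: it is a toy engine written for illustration, and the names Repository, RepositoryCommand, AddTopic and ToyPrevalenceEngine are all hypothetical. It assumes that journaling serialized commands and replaying them is enough to restore the model, and it leaves out snapshots and any real disk I/O (an in-memory list stands in for the log file).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical Repository Data model: a plain, Serializable Java object. */
class Repository implements Serializable {
    private static final long serialVersionUID = 1L;
    private final List<String> topics = new ArrayList<>();

    void addTopic(String name) { topics.add(name); }
    List<String> topics() { return new ArrayList<>(topics); }
}

/** Every change to the model goes through a Serializable command. */
interface RepositoryCommand extends Serializable {
    void executeOn(Repository repository);
}

class AddTopic implements RepositoryCommand {
    private static final long serialVersionUID = 1L;
    private final String name;

    AddTopic(String name) { this.name = name; }

    @Override
    public void executeOn(Repository repository) { repository.addTopic(name); }
}

/** Toy engine: journals each command, then applies it to the in-memory model. */
class ToyPrevalenceEngine {
    private final Repository repository = new Repository();
    private final List<byte[]> journal = new ArrayList<>(); // stand-in for a log file

    ToyPrevalenceEngine() {}

    /** Rebuilds the model by replaying a previously recorded journal. */
    ToyPrevalenceEngine(List<byte[]> recordedJournal) {
        for (byte[] entry : recordedJournal) {
            journal.add(entry);
            deserialize(entry).executeOn(repository);
        }
    }

    /** Synchronized so that commands are journaled and applied one at a time. */
    synchronized void execute(RepositoryCommand command) {
        journal.add(serialize(command)); // record the command first...
        command.executeOn(repository);   // ...then change the objects in memory
    }

    synchronized List<byte[]> journal() { return new ArrayList<>(journal); }
    synchronized List<String> topics() { return repository.topics(); }

    private static byte[] serialize(RepositoryCommand command) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(command);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static RepositoryCommand deserialize(byte[] entry) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(entry))) {
            return (RepositoryCommand) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After a few calls such as engine.execute(new AddTopic("Physics")), building a second engine with new ToyPrevalenceEngine(engine.journal()) should yield the same topics, which is the "restore when needed" part of the idea.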

Too good to be true? Just have a look at the Prevayler demos.

We consider the use of Command objects a Good Thing, since it should ease concurrent access management. In fact, we planned to use them before discovering Prevayler.

Data volumes

A quick-and-dirty evaluation of the first Shaman production use (2 years) gave us the following results:

20 topics

10 publications per topic

120 files per publication

700 Students

15 Questionnaires per topic (Students answer all of them)

20 forum entries per Student and per year

This gives a total of 286,420 Business Objects. Taking an average of 4 implementation objects per Business Object, we stay around 1.2 million Java objects in memory. Tests with Prevayler show that a 32-bit JVM can handle this, with a careful garbage collection strategy.

Migration from Prevayler to JDBC

Did we add the level of indirection needed for a transparent switch from Prevayler data storage to JDBC data storage?

No. We didn't.

In fact, this level of indirection is a burden in the early stages of development, when Operation interfaces may change often. It would require creating one interface or base class for each Operation, doubling the amount of work for any update.

Once Operation interfaces are stable, it will be possible to split each Operation into its parameters and its implementation(s), without breaking client code and tests.
The client will instantiate the "parameter" part, and the server will instantiate the "implementation" part when it receives the Operation object. Reflection (or some faster solution based on bytecode generation) will be used to merge parameters into Operation objects. After the Operation (implementation part) has been performed, a new Operation (parameters part) is created and sent back to the client with the results.
This will leave client code untouched, while allowing the implementation to be switched depending on the Server context.
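
A minimal sketch of that split could look like the following. It is only one possible reading of the plan: RenameTopicParams, RenameTopicOnServer, OperationMerger and ToyServer are hypothetical names, and having the implementation class extend the parameter class is an assumption made here to keep the reflection code short.

```java
import java.io.Serializable;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical "parameter" part: what the client builds and gets back. */
class RenameTopicParams implements Serializable {
    private static final long serialVersionUID = 1L;
    String oldName;
    String newName;
    boolean renamed; // result, filled in on the server
}

/** Hypothetical "implementation" part, instantiated on the server only. */
class RenameTopicOnServer extends RenameTopicParams {
    private static final long serialVersionUID = 1L;

    void perform(List<String> topics) {
        int index = topics.indexOf(oldName);
        renamed = index >= 0;
        if (renamed) {
            topics.set(index, newName);
        }
    }
}

final class OperationMerger {
    private OperationMerger() {}

    /** Copies every instance field declared by paramsType from source to target. */
    static <P> void copyFields(Class<P> paramsType, P source, P target) {
        for (Field field : paramsType.getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers())) {
                continue; // skip serialVersionUID and other static fields
            }
            field.setAccessible(true);
            try {
                field.set(target, field.get(source));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}

/** Toy server: picks the implementation, merges parameters in, sends results back. */
class ToyServer {
    private final List<String> topics = new ArrayList<>(Arrays.asList("Maths"));

    RenameTopicParams handle(RenameTopicParams received) {
        // The implementation class could differ per Server context (Prevayler, JDBC...).
        RenameTopicOnServer operation = new RenameTopicOnServer();
        OperationMerger.copyFields(RenameTopicParams.class, received, operation);
        operation.perform(topics);

        // Send back a fresh parameter object, so no server-side class reaches the client.
        RenameTopicParams reply = new RenameTopicParams();
        OperationMerger.copyFields(RenameTopicParams.class, operation, reply);
        return reply;
    }

    List<String> topics() {
        return new ArrayList<>(topics);
    }
}
```

The client only ever sees RenameTopicParams, so swapping RenameTopicOnServer for another implementation would not require touching client code or tests.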

Object-Relational mapping

Our latest investigations led us to consider OJB, which seems to be the best free, Open Source object-relational mapping tool available at this time. We intend to keep Object-Oriented semantics for code clarity and maintainability, and to avoid hacking into tables using JDBC directly.

Conclusion

So we believe we can use Prevayler for the first year of production use. This will give us the time to evolve towards a JDBC-based implementation once the Repository Data requirements are well understood.