
Avoiding Anemic Domain Models with Hibernate

One of Hibernate's most under-appreciated features is its ability to persist private fields. This feature is useful for avoiding what Martin Fowler calls the Anemic Domain Model anti-pattern, where domain objects (entities) are reduced to "dumb" record structures with no business logic. In an Anemic Domain Model, you lose all the benefits of OOP: polymorphism, data hiding, encapsulation, etc.

The Anemic Domain Model may have originally evolved from EJB CMP, which requires any persistent field to be accessible directly with a public getter/setter. Developers using POJO frameworks like Hibernate often duplicate the same pattern, though, simply replacing the entity beans with POJOs.

This is not just an academic discussion; this has real consequences for the quality of a codebase. (Academically, this is part of the OOP-RDBMS "impedance mismatch"--in particular, that there is no distinction between a setter/constructor call that actually mutates/constructs an object and one that is merely incidental to materializing an existing object's state from persistent storage.) Let's say you're developing a system for issue tracking with a business rule like "anyone can create a ticket or change its status, but only managers can raise it to 'critical.'" A fragment of an Issue object might look like this (some detail omitted to focus on encapsulation/data hiding issues):
public class Issue {

    private String m_status;

    public String getStatus() {
        return m_status;
    }

    public void setStatus(String newStatus) {
        // equals() rather than ==, so the check does not depend on string identity
        if (STATUS_CRITICAL.equals(newStatus) && !getCurrentUser().isManager()) {
            throw new SecurityException("critical.requires.manager");
        }
        m_status = newStatus;
    }
}
This looks great until you realize that setStatus(STATUS_CRITICAL) is also going to be called from the persistence layer in materializing an existing Issue that is already critical, not just when making an explicit change through the UI workflow. Since anyone can view any issue, SecurityException will be thrown when a non-manager tries to view an issue that is already critical. We immediately recognize that the persistence layer needs a way to get "privileged" access to set the underlying field directly, bypassing business logic.

The typical workaround is to give up encapsulation and move the business logic into the corresponding service layer object (e.g., stateless session bean) for issue transactions:
public class IssueManager {

    public Issue findIssueById(Long id) { /* ... */ }

    public Issue newIssue(... fields ...) {
        // begin TX
        // ... setup new issue
        if (STATUS_CRITICAL.equals(status) && !getCurrentUser().isManager()) {
            throw new SecurityException("critical.requires.manager");
        }
        // ...
        // commit TX
    }

    public void changeStatus(Long id, String status) {
        // begin TX, load issue
        if (STATUS_CRITICAL.equals(status) && !getCurrentUser().isManager()) {
            throw new SecurityException("critical.requires.manager");
        }
        // commit TX
    }
}
Now, two real consequences are apparent. First, giving up encapsulation leads to cut-and-paste programming, violating the "don't repeat yourself" principle; this increases the risk that the business rule is not pasted in again somewhere it's needed. Second, you lose polymorphism; it is now very difficult to have a subclass of Issue with slightly different business rules. (For example, maybe the main Issue has no restriction on setting status, but a specific type of issue has the critical-requires-manager rule.)

It's true that you could have two separate sets of getters/setters in the Issue itself, one that applies business logic and one that allows direct access and is only used by persistence. This would address the polymorphism issue. But if the direct accessors are also public (as EJB CMP requires) then you still lose data hiding; nothing prevents your service layer/transaction scripts from calling these methods directly.

If you're using Hibernate, though, there is a very elegant solution. Hibernate is effectively "privileged": it reaches private fields and private accessor methods through reflection, so they never have to be made public for its sake. Hibernate gives you two options in the above scenario:
  • You can have two separate bean-style properties linked to the same underlying field, one with private getters/setters and the other with public. The private methods access the underlying field directly, and the public ones apply business rules. This is the preferred approach, but has the downside of verbosity, plus you have to use different property names in HQL (private) and everywhere else (public).
  • Hibernate can also persist fields directly by using the "access" attribute on mapping elements such as <property>. The upside is that this is more concise, with only a single public bean-style property, but using access="field" requires the mapped property name to exactly match the private instance variable name; this doesn't work well if you have some kind of Hungarian naming convention like "m_foo". You can do something like access="MyFieldAccessor", where MyFieldAccessor is a custom class implementing Hibernate's PropertyAccessor interface and encoding your naming convention (mapping bean property names to member variable names), but that requires extra effort.
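Here's a rough sketch of the first option. Everything beyond the original Issue fragment is made up for illustration: the property name rawStatus, the User and SecurityContext classes, and the ReflectiveLoader helper. ReflectiveLoader is not Hibernate; it uses plain java.lang.reflect to play the part of a persistence layer that is allowed to call a private setter.

```java
import java.lang.reflect.Method;

/** Hypothetical stand-in for whatever user object the application has. */
final class User {
    private final boolean manager;
    User(boolean manager) { this.manager = manager; }
    boolean isManager() { return manager; }
}

/** Hypothetical holder for the current user (the post's getCurrentUser()). */
final class SecurityContext {
    static User currentUser = new User(false);
}

class Issue {
    static final String STATUS_OPEN = "open";
    static final String STATUS_CRITICAL = "critical";

    private String m_status = STATUS_OPEN;

    // Persistence-only property "rawStatus": private, no business rules.
    private String getRawStatus() { return m_status; }
    private void setRawStatus(String status) { m_status = status; }

    // Public property "status": what the rest of the application uses.
    public String getStatus() { return m_status; }

    public void setStatus(String newStatus) {
        if (STATUS_CRITICAL.equals(newStatus) && !SecurityContext.currentUser.isManager()) {
            throw new SecurityException("critical.requires.manager");
        }
        m_status = newStatus;
    }
}

/** Plays the role of the persistence layer; an assumption, not Hibernate code. */
final class ReflectiveLoader {
    static void callPrivateSetter(Object target, String setterName, String value) {
        try {
            Method setter = target.getClass().getDeclaredMethod(setterName, String.class);
            setter.setAccessible(true); // lift the "private" restriction for this call
            setter.invoke(target, value);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("could not call " + setterName, e);
        }
    }
}
```

With a non-manager as the current user, ReflectiveLoader.callPrivateSetter(issue, "setRawStatus", Issue.STATUS_CRITICAL) materializes a critical issue without complaint, while issue.setStatus(Issue.STATUS_CRITICAL) still throws SecurityException. The rule lives in exactly one place, and a subclass could override setStatus without touching how the object is loaded.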
There are other uses for this feature in Hibernate:
  • Primary keys are generally supposed to be immutable by normal business logic, set only within the persistence layer. So, "setId" methods can almost always be private or protected.
  • Collection getters and setters can also be kept private, to preserve data hiding (prevent rep exposure). Otherwise, when business logic can manipulate a collection directly, it's difficult to enforce business rules on the collection elements, or even to ensure the elements are of the correct type. (The latter may partially be addressed by generics in Java 5 and/or Hibernate 3.)
I believe JDO gets similar privileged access to persistent fields by instrumenting (bytecode-enhancing) the persistent classes.
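The collection case from the list above might look something like this. It's a sketch under assumptions: the class name IssueWithComments, the commentList property name, and the "comments can't be blank" rule are all invented for the example. The private accessors are the ones the persistence layer would be mapped to.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class IssueWithComments {

    private List<String> m_comments = new ArrayList<>();

    // Persistence-only accessors: private, and they hand over the raw collection.
    private List<String> getCommentList() { return m_comments; }
    private void setCommentList(List<String> comments) { m_comments = comments; }

    /** Read-only view for business code; callers cannot add or remove through it. */
    public List<String> getComments() {
        return Collections.unmodifiableList(m_comments);
    }

    /** The only way for business code to add an element, so the rule lives in one place. */
    public void addComment(String text) {
        if (text == null || text.trim().isEmpty()) {
            throw new IllegalArgumentException("comment.required"); // hypothetical rule
        }
        m_comments.add(text.trim());
    }
}
```

Calling getComments().add(...) fails with UnsupportedOperationException, so the only way in for business code is addComment, which applies the rule.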

more PostgreSQL performance junk

Someone at work was running a big delete (100k rows) and it was taking forever, as if it were hung. We couldn't figure out what was going on, and there was clearly a non-linear effect: smaller deletes on 10k rows were completing very fast, less than 5 sec, while 100k rows was still running after 20 mins.

We think the big delete was taking so long because PostgreSQL may try to keep some kind of rollback buffer for the whole delete operation in memory, causing thrashing. That's only a guess, though. I haven't tried this on Oracle, but I'm guessing that it and other databases may be smarter about managing their physical storage directly (RAM vs disk) rather than relying on the underlying OS.

But then again, if you're touching 100k rows at once, it's probably not a bad idea to commit every so often anyway, to avoid a long-running transaction that could potentially hose other users.
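For what it's worth, "commit every so often" could be sketched with plain JDBC roughly like this. The table big_table, its id key, and the boolean expired column are hypothetical, and so is the choice to wrap SQLException in an unchecked exception; I haven't measured whether this helps in the case above.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.function.IntSupplier;

final class BatchedDelete {

    /** Runs one bounded batch at a time until a batch deletes nothing; returns the total. */
    static long runUntilEmpty(IntSupplier deleteOneBatch) {
        long total = 0;
        int deleted;
        while ((deleted = deleteOneBatch.getAsInt()) > 0) {
            total += deleted;
        }
        return total;
    }

    /** Deletes matching rows in chunks of batchSize, committing after each chunk. */
    static long deleteExpired(Connection conn, int batchSize) throws SQLException {
        conn.setAutoCommit(false);
        // Hypothetical schema; the subselect bounds how many rows one statement touches.
        String sql = "delete from big_table where id in "
                + "(select id from big_table where expired limit ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, batchSize);
            return runUntilEmpty(() -> {
                try {
                    int deleted = ps.executeUpdate();
                    conn.commit(); // end the transaction after every batch
                    return deleted;
                } catch (SQLException e) {
                    throw new IllegalStateException("batch delete failed", e);
                }
            });
        }
    }
}
```

The trade-off is that the delete is no longer atomic: if it fails halfway, the batches already committed stay deleted.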

PostgreSQL on cygwin: "Bad system call"

I was getting this "Bad system call" message from PostgreSQL 7.4.x on cygwin. It was working earlier, and I thought I had done everything right--cygserver was up and running, so I didn't know what was up. ipcs gave the same error.

Turns out I had forgotten the magic word: "CYGWIN=server". The first time I installed PostgreSQL (and read the docs), I had just set the var on the command line (CYGWIN=server pg_ctl start ...) and never put it in my profile. Easy enough to fix.

PostgreSQL performance of "where exists"

Today I was looking into the performance of a PostgreSQL query with a "... where exists (select 1 from ...)" subquery:
select foo_id from foo
where exists (select 1 from bar where bar.foo_id = foo.foo_id)
I was surprised to find out that this query actually ran faster when I restructured it with a SELECT DISTINCT and a JOIN:
select distinct foo.foo_id from bar
join foo on bar.foo_id = foo.foo_id
Some references on the web I've found suggest that EXISTS is the preferred way to write the above query in general. Because it's a boolean condition, in theory the database needs to scan fewer rows because it can stop as soon as the first match is found; and the DISTINCT can be expensive if the results from the join version would not have been unique.

An ancient PostgreSQL mailing list post indicates that rewriting the query as a JOIN may be faster than EXISTS in PostgreSQL, because the join can take advantage of indexes while EXISTS does a nested loop. But, then again, I'm still using PostgreSQL 7.3.x, and EXISTS handling may well have been improved in 7.4.
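To make the difference between the two forms concrete, here is a small in-memory sketch in Java. It says nothing about how PostgreSQL actually plans either query; it only shows why the join version needs the DISTINCT and why EXISTS can stop early. The class and method names are invented, and it assumes foo_id is unique in foo.

```java
import java.util.List;
import java.util.stream.Collectors;

final class SemiJoinDemo {

    /** "where exists": keep a foo id if at least one bar row references it. */
    static List<Integer> viaExists(List<Integer> fooIds, List<Integer> barFooIds) {
        return fooIds.stream()
                .filter(f -> barFooIds.stream().anyMatch(b -> b.equals(f))) // stops at first match
                .collect(Collectors.toList());
    }

    /** Plain join: one output row per matching bar row, duplicates included. */
    static List<Integer> viaJoin(List<Integer> fooIds, List<Integer> barFooIds) {
        return barFooIds.stream()
                .filter(fooIds::contains)
                .collect(Collectors.toList());
    }

    /** Join plus DISTINCT: collapses the duplicates the join produced. */
    static List<Integer> viaDistinctJoin(List<Integer> fooIds, List<Integer> barFooIds) {
        return viaJoin(fooIds, barFooIds).stream()
                .distinct()
                .collect(Collectors.toList());
    }
}
```

With foo ids 1, 2, 3 and bar rows referencing 1, 1, 3, 3, 3, the plain join produces five rows, while the EXISTS form and the DISTINCT join both come out as 1 and 3.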

"Client CVS Branch" anti-pattern

Scenario: Your team developed a custom application for Client A. The application is generally useful, so it gets re-sold to Client B. Client B wants some customizations, which are at first superficial (CSS, images, etc.), and the client expects a quick turnaround. So, you need a way to store Client B’s new version of the app in source control somehow, and you take the first approach that comes to mind: you create a branch of the original, and make the customizations for Client B on the new branch.

(Without loss of generality, I’m assuming you’re using CVS; Subversion and Perforce have different branch/merge models but they basically boil down to the same thing. But if you’re using PVCS or VSS, or worse, not using source control at all, you have bigger problems which I’ll save for some other time.)

At first, this works pretty well. But over time, you sell the same app to more clients, and they are each asking for more substantial new feature development. The strategy starts to break down, causing a whole series of problems:

- Bug fixes have to be explicitly merged into each client’s branch individually.

- Since the code in each client branch diverges over time as different things are added or changed in one or the other, merging bug fixes results in more manual conflict resolution. (The same thing goes for new features.) This also makes for more re-testing of the same thing.

- There is a potential for wheel reinvention. If you develop a new feature for Client C, and Client D asks for a new feature that is “close but not quite” the same as Client C’s, it may be developed independently twice rather than building it once in a way that accommodates both sets of requirements.

- You fail to realize potential economies of scale in support or maintenance that you should get from having a single solution, since each client effectively has their own one-off version.

The problem stems from the lack of a well-defined “trunk” in source control that provides the common baseline functionality. Instead, each client’s version was branching off of another client’s branch (Client A’s) rather than from a common trunk. So there was no way to nail down which part of the code stays constant for all clients.

Here are a few ways to solve this problem:

- Have each client’s version be a branch from a common trunk, and have the discipline to make as much functionality in the trunk configurable at deployment/runtime as possible (the later the binding, the better). That way, you increase the percentage of code that all clients have in common, and establish a common baseline version that multiple clients share. Also, there will then be a well-defined process for upgrading a client’s branch to a new version of the baseline “core” code, and many fewer post-merge conflicts to resolve manually.

- Most teams won’t actually have the discipline to consistently put in the extra effort to make new features configurable. So you can take this a step further: give each client its own separate repository that contains only the customizations for that client, then apply those customizations as a patch against the common baseline version. This will force you to think about whether any given code should be common and shared across all clients or whether it is a customization.

- There is also a management aspect to solving this problem: make sure someone is accountable for the entire solution as deployed for all clients, not just each individual client’s project. When you’re only accountable for your own client, you’ll inevitably take the path of least resistance to keep your own client happy and not see the bigger picture of delivering for all clients more effectively (this is actually rational behavior according to game theory).
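As one small illustration of the "configurable at deployment/runtime" idea from the first option, here is a sketch that layers a client's settings over shared baseline defaults using java.util.Properties. The keys (theme, export.enabled) and the class name ClientConfig are hypothetical; in practice each set of properties would probably be loaded from a file rather than built in code.

```java
import java.util.Properties;

final class ClientConfig {

    /** Baseline defaults shared by every client (hypothetical keys). */
    static Properties baseline() {
        Properties p = new Properties();
        p.setProperty("theme", "default");
        p.setProperty("export.enabled", "false");
        return p;
    }

    /** Layers one client's overrides on top of the baseline without copying it. */
    static Properties forClient(Properties baseline, Properties overrides) {
        Properties merged = new Properties(baseline); // baseline acts as the fallback
        merged.putAll(overrides);
        return merged;
    }
}
```

A client that only overrides theme still picks up export.enabled from the baseline through getProperty, so what lives per client is just the set of differences, not a full copy.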
