XML stands for eXtensible Markup Language. It belongs to a family of markup languages that includes HTML, which is used in web pages. In the accompanying screenshot, you can see the code that is used to draw the Google homepage. Most browsers allow you to view this code.


The important things to notice are:

THE CODE HAS THE LOOK OF ENGLISH (but not as we know it Captain!)

This has the advantage of being easy to read and easy to learn. Mind you, it can become complicated, but not as much as raw computer code, which is a whole different language!


The use of the indent is significant. Fundamentally, it says that the indented knowledge relates to the parent item in the same way that an adjective tells you something extra about a noun. When you think about 'red car' and 'green car', you logically understand that there are two cars but that they are not identical. If you were told there were a green car and a red car in the street, and asked "How many cars are there?", you would answer "Two". If you were asked "How many orange cars are there?", you would answer "None".
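The car example can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. The element names (`street`, `car`, `colour`) and the helper `count_cars` are made up for illustration; the point is only that the indented `colour` belongs to its parent `car`, so both questions can be answered by a program.

```python
import xml.etree.ElementTree as ET

# Hypothetical street: each <car> is a parent, and the indented <colour>
# child tells us something extra about it, like an adjective.
STREET_XML = """
<street>
    <car>
        <colour>green</colour>
    </car>
    <car>
        <colour>red</colour>
    </car>
</street>
"""


def count_cars(xml_text, colour=None):
    """Count <car> elements, optionally only those with the given colour."""
    root = ET.fromstring(xml_text)
    cars = root.findall("car")
    if colour is None:
        return len(cars)
    return sum(1 for car in cars if car.findtext("colour") == colour)


print(count_cars(STREET_XML))            # "How many cars are there?" -> 2
print(count_cars(STREET_XML, "orange"))  # "How many orange cars are there?" -> 0
```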


As you have just read, the indent has the status of an adjective, providing more specific information about the parent. Of course, you might have spotted a problem in this. Depending on what you are trying to describe, the schema will look very different. If you have a green apple, green car, red apple, red car and blue aeroplane, it depends on whether you are an engineer or an artist as to whether you classify object>colour (this is what I would do!) or colour>object. Neither is right or wrong; which one fits depends on your expert domain. The other missing thing is that when you make these hierarchies, there is often a stated or implied relationship which also arises from the expert domain. Naming that relationship is also a problem! We doctors intuitively understand that things are connected (patient:disease, GTV:CTV, disease:recurrence), but if we had to put a name to all of these connections, it probably wouldn't get done and the disagreements would be legion. You may have heard the joke: what is a camel? [1]
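To make the object>colour versus colour>object point concrete, here is a small Python sketch. Both layouts, their tag names and the helper `pairs` are invented for illustration. It flattens each hierarchy into (object, colour) pairs and checks that the two schemas hold exactly the same facts, only arranged differently.

```python
import xml.etree.ElementTree as ET

# Engineer's view: object > colour
OBJECT_FIRST = """
<things>
    <apple><green/><red/></apple>
    <car><green/><red/></car>
    <aeroplane><blue/></aeroplane>
</things>
"""

# Artist's view: colour > object
COLOUR_FIRST = """
<things>
    <green><apple/><car/></green>
    <red><apple/><car/></red>
    <blue><aeroplane/></blue>
</things>
"""


def pairs(xml_text, parent_is_object):
    """Flatten a two-level hierarchy into a set of (object, colour) pairs."""
    root = ET.fromstring(xml_text)
    found = set()
    for parent in root:
        for child in parent:
            if parent_is_object:
                found.add((parent.tag, child.tag))
            else:
                found.add((child.tag, parent.tag))
    return found


# Neither schema is right or wrong: they carry the same five facts.
same_facts = pairs(OBJECT_FIRST, True) == pairs(COLOUR_FIRST, False)
print(same_facts)
```

What neither layout records is the name of the relationship between parent and child; that still has to come from the expert domain.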


* it looks like <xxx>XXXX</xxx>
* xxx = the concept, or 'semantic entity'. In radiation oncology this will include things like 'FieldSize' [2] or 'T_Stage' [3]. The semantic entity is a discrete concept understood by the domain expert (in radiation oncology, that is you!). Basically, it is a well-recognised or standard term which is imbued with professional meaning. A good example is FIELD_SIZE [4]
* XXXX = the instantiation of the semantic entity. So if xxx = 'Diagnosis' and XXXX = 'C61', then <Diagnosis>C61</Diagnosis> is a consistent structure
* the angle brackets (<, >) and forward slash (/) are arranged to form the opening tag <xxx> and the closing tag </xxx>.
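As a minimal sketch of the points in this list, Python's standard `xml.etree.ElementTree` module can build and read the `<Diagnosis>C61</Diagnosis>` element mentioned above. The variable names are arbitrary; only the tag `Diagnosis` and the value `C61` come from the text.

```python
import xml.etree.ElementTree as ET

# Build the element: the tag is the semantic entity (xxx),
# the text is its instantiation (XXXX).
diagnosis = ET.Element("Diagnosis")
diagnosis.text = "C61"
xml_text = ET.tostring(diagnosis, encoding="unicode")
print(xml_text)  # <Diagnosis>C61</Diagnosis>

# Read it back: the parser separates the entity from its value.
parsed = ET.fromstring(xml_text)
print(parsed.tag)   # Diagnosis
print(parsed.text)  # C61
```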

XML in use

When this last thing is put into XML, you should be able to see the points above:
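The sketch below shows one way such a record could be written, and how a program might read it. The element names (`Patient`, `GivenName`, `MiddleName`, `FamilyName`, `Diagnosis`, `Site`, `Morphology`) and the helper `parse_patient` are hypothetical, chosen for illustration rather than taken from any standard schema. The name and codes are the ones discussed in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record; indentation shows what belongs to what.
PATIENT_XML = """
<Patient>
    <GivenName>Michael</GivenName>
    <MiddleName>Kirk</MiddleName>
    <FamilyName>Douglas</FamilyName>
    <Diagnosis>
        <Site>C10.9</Site>
        <Morphology>8070/3</Morphology>
    </Diagnosis>
</Patient>
"""


def parse_patient(xml_text):
    """Pull each semantic entity and its value out of the record."""
    patient = ET.fromstring(xml_text)
    return {
        "given_name": patient.findtext("GivenName"),
        "middle_name": patient.findtext("MiddleName"),
        "family_name": patient.findtext("FamilyName"),
        "site_code": patient.findtext("Diagnosis/Site"),
        "morphology_code": patient.findtext("Diagnosis/Morphology"),
    }


record = parse_patient(PATIENT_XML)
for entity, value in record.items():
    print(entity, "=", value)
```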


You can parse this (this is an Informatics term meaning "read" or "process") to understand that

there is a patient called Michael Kirk Douglas who has been diagnosed with a cancer of the oropharynx of the squamous cell carcinoma, NOS type.

You might now ask "why can't I just write this and then let a computer work out the code?", and that would mark you out as someone who doesn't know or think a lot about data! I shall demonstrate the problems!

there is a patient called Micheal Doiglss with a SCV of the iripharunz.

All I have done is to substitute letters beside each other on the keyboard. And even when it is typed correctly, what does SCC mean? Small cell cancer? Squamous cell carcinoma? Spinal cord compression? Can a computer be programmed to catch all of these mistakes? Some are easy to fix, but you have to understand that, left to free text, the number of ways you can say something is multitudinous, and they are often at cross purposes. Published work on the extraction of medical data from medical texts rarely does better than 90%, and often doesn't even reach this level.
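The ambiguity can be shown with a toy Python sketch. The two lookup tables are tiny stand-ins written for this example, not real terminology services. The abbreviation expansions are the three listed above, and the single coded entry is the morphology code used in this section.

```python
# Free text: one abbreviation, several possible meanings.
ABBREVIATIONS = {
    "SCC": [
        "small cell cancer",
        "squamous cell carcinoma",
        "spinal cord compression",
    ],
}

# Coded data: one code, one meaning (toy table with a single entry).
MORPHOLOGY_CODES = {
    "8070/3": "squamous cell carcinoma, NOS",
}


def interpret_free_text(term):
    """Return every meaning the abbreviation could have (possibly none)."""
    return ABBREVIATIONS.get(term.upper(), [])


def interpret_code(code):
    """Return the single meaning of a code, or None if it is unknown."""
    return MORPHOLOGY_CODES.get(code)


print(interpret_free_text("SCC"))  # three candidates: the program cannot choose
print(interpret_free_text("SCV"))  # the typo matches nothing at all
print(interpret_code("8070/3"))    # exactly one answer
```

A mistyped code fails in the same way as a mistyped word, but it fails loudly: the lookup returns nothing, so the error can be caught at entry rather than guessed at later.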

But I would like to show you something which helps your workload! If you think that you are maximally efficient, you can leave off here because you won't appreciate the benefit. I have altered the text a little to reveal something:

there is a patient called Michael | Kirk | Douglas who has been diagnosed with a cancer of the [look up what ICD10 C10.9 means and insert>] oropharynx of the [look up what ICD10M 8070/3 means and insert>] squamous cell carcinoma, NOS type.
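The bracketed look-ups above can be sketched in Python. This is an illustration under assumptions: the element names are hypothetical, and the two dictionaries `SITE_TERMS` and `MORPHOLOGY_TERMS` stand in for a real code list that a production system would query.

```python
import xml.etree.ElementTree as ET

# Hypothetical record holding only names and codes, no prose.
RECORD_XML = """
<Patient>
    <GivenName>Michael</GivenName>
    <MiddleName>Kirk</MiddleName>
    <FamilyName>Douglas</FamilyName>
    <Diagnosis>
        <Site>C10.9</Site>
        <Morphology>8070/3</Morphology>
    </Diagnosis>
</Patient>
"""

# Stand-in lookup tables: "look up what the code means and insert".
SITE_TERMS = {"C10.9": "oropharynx"}
MORPHOLOGY_TERMS = {"8070/3": "squamous cell carcinoma, NOS"}


def to_sentence(xml_text):
    """Turn the coded record into the plain-English sentence."""
    patient = ET.fromstring(xml_text)
    name = " ".join(
        patient.findtext(tag)
        for tag in ("GivenName", "MiddleName", "FamilyName")
    )
    site = SITE_TERMS[patient.findtext("Diagnosis/Site")]
    morphology = MORPHOLOGY_TERMS[patient.findtext("Diagnosis/Morphology")]
    return (
        f"there is a patient called {name} who has been diagnosed with "
        f"a cancer of the {site} of the {morphology} type."
    )


sentence = to_sentence(RECORD_XML)
print(sentence)
```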

You see, the XML code reveals a knowledge structure which can be directly translated into plain English. And more than this, other transformations are possible which far exceed the usefulness of your dictation:

c'è un paziente chiamato Michael | Kirk | Douglas che è stato diagnosticato un cancro dell'[look up what ICD10 C10.9 means, send to Google Translate and insert translated term>]orofaringe del [look up what ICD10M 8070/3 means, send to Google Translate and insert translated term>]carcinoma a cellule squamose, tipo NOS

The XML structure can be transformed into any language, presuming that Italian oncologists have the same knowledge as Australian oncologists (I have met them, and yes they do. They read the same literature!)
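A minimal Python sketch of that idea follows. It assumes the values have already been extracted from the XML into a dictionary, and it uses a small hand-written table of terms per language in place of the Google Translate step described above. The Italian wording is copied from the sentence in this section; the names `TEMPLATES`, `TERMS` and `render` are invented for the example.

```python
# One sentence template per language; the record itself never changes.
TEMPLATES = {
    "en": ("there is a patient called {name} who has been diagnosed with "
           "a cancer of the {site} of the {morphology} type"),
    "it": ("c'è un paziente chiamato {name} che è stato diagnosticato "
           "un cancro dell'{site} del {morphology}"),
}

# Stand-in for "look up the code, translate the term": code -> term.
TERMS = {
    "en": {"C10.9": "oropharynx",
           "8070/3": "squamous cell carcinoma, NOS"},
    "it": {"C10.9": "orofaringe",
           "8070/3": "carcinoma a cellule squamose, tipo NOS"},
}

# Values as they might come out of the XML record.
record = {"name": "Michael Kirk Douglas", "site": "C10.9", "morphology": "8070/3"}


def render(record, language):
    """Fill the chosen language's template from the same coded record."""
    terms = TERMS[language]
    return TEMPLATES[language].format(
        name=record["name"],
        site=terms[record["site"]],
        morphology=terms[record["morphology"]],
    )


print(render(record, "en"))
print(render(record, "it"))
```

Adding another language means adding one template and one term table; the patient record is untouched.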