$Revision: 1.4 $
$Date: 2000/11/15 16:09:02 $
INTRODUCTION
1 TABLES
  1.1 Table B
  1.2 Table D
  1.3 Local tables B & D
  1.4 Code figures and flags
  1.5 Descriptor representation
2 DECODING
  2.1 BUFR message structure and decoding strategy
  2.2 Replication
  2.3 Basic BUFR operations and structure of decode
  2.4 Bit manipulation to construct values
  2.5 Output and display
  2.6 Coordinates and instrumentation
  2.7 Increments
3 ENCODING
  3.1 Compression
  3.2 Setting up descriptor sequences
  3.3 Preparation of values to be encoded
  3.4 Run-length encoding
4 QUALITY OPERATIONS
  4.1 Bit maps
  4.2 Bit maps and operators
  4.3 Assumptions made to clarify specification
  4.4 Programming strategy
  4.5 Use of decode output in application programs
  4.6 A comparable UK Met Office extension
5 TO SET UP A BUFR SYSTEM
  5.1 Table access
  5.2 Programs to handle messages
  5.3 Calls to encode & decode

Crown Copyright 1990, 1993, 1995
Meteorological Office, London Road, BRACKNELL, RG12 2SZ

Note: This paper has not been published. Permission to quote from it should be obtained from the Director of Met Office User Services.
BUFR is universal in that it contains a description of the data as well as the values. The description gives a list of the elements whose values follow. It does this in a coded form that requires a set of tables to interpret it. BUFR was developed for meteorological data, but can transmit whatever elements have table entries.
BUFR is binary in that values are not confined to some number of decimal digits, as with a character-based code, or to a machine-dependent word-length, but are coded in a number of bits given by one of the above tables, which can be changed if necessary by appropriate operations.
The simplest BUFR message consists of a number of descriptors followed by the values of the corresponding elements. But not all BUFR descriptors correspond to elements: some descriptors represent operations to change the way a value is coded (as above), others make the description more concise by repeating descriptors or getting sequences of descriptors from a table rather than including them in the message.
So the most essential component of a BUFR system is the table of elements, Table B. Less essential, in that messages can be made without them, are the table of sequences (Table D) and the set of possible operations.
For each element the entry in Table B gives a name, the SI units, the number of bits in which to code a value, a scale factor which can be changed by a power of 10, and a reference value to be subtracted from the original value to leave a positive number to be encoded.
The operations enable the number of bits, the scale and the reference value to be changed. They also make it possible to add quality control flags, values and differences, to skip fields and so on - the list may be further extended.
Space taken up by the description can be saved by replication and use of Table D sequences. Space in the data section can be saved by compression if several similar sets of values are coded together, by expressing the set of values of each element as a minimum, an increment width and a set of increments (in a reduced number of bits) to be added to that minimum.
The sequence of descriptors can arrange data in ways not covered by existing code forms. Space and time (coordinate) elements locate the values that follow them, space and time increments can be defined, so time sequences of regularly occurring values can be encoded. Run-length encoding is provided for images.
The sections which follow describe how the tables have been set up and the various operations encoded at the UK Met Office. These notes are to be read in conjunction with the section on FM 94 BUFR in the WMO Manual on Codes, and concentrate on points which could cause confusion.
Table B has 64 element classes, each with room for 256 entries for elements in the class. A class contains e.g. temperatures, or various humidity elements, or year, month, day... second. No information is conveyed by the choice of class for an element (but see 2.6: the distinction between coordinate classes and others is important): it just groups related or similar elements together to show at a glance what entries already exist in that field.
An entry consists of: descriptor, name, units, scale factor, reference value and number of bits used to encode a value (or rather to encode value*(10^scale)-refval). In the format defined in the WMO Manual on Codes for exchange of Table B each entry takes 95 characters, though this allows only 40 characters for the name, which is elsewhere defined as up to 64 characters. In this form a full Table B would take up more than a megabyte, which may not be a practical proposition, so we use a more compact machinable form to make the information accessible more efficiently.
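The arithmetic value*(10^scale)-refval can be sketched in Python as follows. This is only an illustration, not the operational Fortran: the function names are hypothetical, and the scale, reference value and width used in the example (2, -9000, 15) are illustrative figures rather than values quoted from the printed Table B.

```python
def encode_value(value, scale, refval, width):
    """Return the non-negative integer to be coded, or None if it will not fit."""
    coded = round(value * 10 ** scale) - refval
    # All ones means "missing", except for a one-bit flag, so the largest
    # usable code is 2**width - 2 (or 1 for a one-bit field).
    limit = 1 if width == 1 else 2 ** width - 2
    if coded < 0 or coded > limit:
        return None
    return coded


def decode_value(coded, scale, refval):
    """Invert encode_value: add the reference value back, then undo the scaling."""
    return (coded + refval) / 10 ** scale


# Illustrative entry: scale 2, reference value -9000, 15 bits.
print(encode_value(45.0, 2, -9000, 15))
print(decode_value(13500, 2, -9000))
```

A value outside the range of the field comes back as None here, leaving the caller to decide whether to code it as missing.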
Most names of elements are much less than the maximum length, and the table is still only sparsely filled, although elements are accumulating rapidly. So a packed version of the table is possible, the names and units being replaced by 8-bit lengths followed by that number of characters.
This is the approach adopted by the UK Met Office. Our operational Table B was based on the idea that the most frequently used entries were likely to be for elements 1-64 in classes 1-32, an eighth of the table. (This seemed sensible in the 1980s, but now looks less consistent with the existing entries!) Faster access is provided to these "kernel" entries, by means of pointers which can be located immediately from the class and element number, whereas "local" entries (all the rest) can only be found by sequential searches within classes.
This gives a table with the following components:
length in   description of
octets      field

 1          length of entry (only if "local")
 1          element number (only if "local")
 1          length of name (=LN)
 LN         name
 1          length of units (=LU)
 LU         units
 1          format (R: real, N: numeric, F: flag/code, C: chars)
 1          scale
 4          reference value
 1          field width

(The "format" - not in the printed Table B - was introduced to help in decisions about which elements an operation applied to. But the rules have since been simplified, and the distinction between flags, code figures and characters could be made using the units field.)
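As an illustration of how compact such an entry is, here is a hedged Python sketch that unpacks one "kernel" entry (no leading length or element number) laid out as in the table above. Several details are assumptions not stated in these notes: text is taken to be ASCII, the scale a signed octet, and the reference value a signed big-endian 4-octet integer. The function name is hypothetical.

```python
import struct


def parse_kernel_entry(buf, pos=0):
    """Unpack one packed entry starting at buf[pos]; return (entry, next_pos)."""
    ln = buf[pos]                                   # 1 octet: length of name
    name = buf[pos + 1:pos + 1 + ln].decode("ascii")
    pos += 1 + ln
    lu = buf[pos]                                   # 1 octet: length of units
    units = buf[pos + 1:pos + 1 + lu].decode("ascii")
    pos += 1 + lu
    # Assumed layout: format (1 char), scale (signed octet),
    # reference value (signed 4 octets, big-endian), field width (1 octet).
    fmt, scale, refval, width = struct.unpack_from(">cbiB", buf, pos)
    entry = {"name": name, "units": units, "format": fmt.decode("ascii"),
             "scale": scale, "refval": refval, "width": width}
    return entry, pos + 7


# Build a made-up entry and read it back.
packed = (bytes([11]) + b"TEMPERATURE" + bytes([1]) + b"K" + b"R"
          + struct.pack(">biB", 1, 0, 12))
entry, next_pos = parse_kernel_entry(packed)
print(entry, next_pos)
```

The made-up entry above occupies 21 octets, against 95 characters in the exchange format.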
Table D is merely a form of shorthand to cut down the length of the description in a BUFR message. The fact that a sequence occurs in a particular category should not be taken as providing information about, say, the instruments used (for which appropriate elements should be used). And the words in the right-hand column of the Manual on Codes' list of sequences should not be taken to imply information which is not given explicitly by the descriptors themselves: no words appear in the Manual on Codes' BNF definition of the format of Table D!
Our operational Table D is constructed on the same principles as Table B. Each "kernel" entry is ND descriptors (two octets each) preceded by a length 1+ND*2. A "local" entry has the number of the sequence in the category inserted between its length (1+1+ND*2) and the descriptors. The pointers are set up in the same way.
An arbitrary limit of 16 descriptors in a sequence was suggested in the early stages of BUFR, in the hope that any long descriptor sequence could be broken down into sequences useful in other contexts. We find that sequences of more than 2 or 3 descriptors are seldom useful in more than one context, and that splitting up sequences to restrict the maximum to 16 just fills up the table faster (any change needs a new sub-sequence and a new overall sequence to include it), so we prefer to describe messages in single local sequences not restricted to 16 descriptors.
Class 0 exchange could be used, for instance, to send one centre's local entries, needed in the decoding of the data which follows, to another centre. Local entries from another centre may well clash with our own local entries, so we must let them override our entries for the duration of a particular decoding task but not update Table B or D permanently. Having established this need, we can take advantage of the system to define sequences for internal use, only required while a certain kind of message is being handled.
On the IBM mainframe, the Local B file should be given the DDNAME LOCALELM e.g.
//GO.LOCALELM DD DSN=MDB.BUFR.LOCALB,DISP=SHR

On an HP, T3E or IBMSP unix machine, the Local B file should be given the name or symbolic link LOCALELM in the run directory (the directory the BUFR executable will run in), or if using the environment variable BUFR_LIBRARY, put LOCALELM in the library pointed to by $BUFR_LIBRARY.
Example LOCALELM file:
LOCAL BUFR TABLE B ENTRIES IN EXCHANGE FORMAT, AS DEFINED IN WMO MANUAL ON CODES ON CODES (112-BYTE ENTRIES)
FXXYYYNAME1...........................NAME2...........................UNITS...................SCALREFVAL.....WID
002196satellite classification CODE TABLE 0 0 9
002197satellite channel centre frequency HZ -8 0 26
002198satellite channel band width HZ -8 0 26
002221segment size at nadir in x direction M 0 0 18
002222segment size at nadir in y direction M 0 0 18
002231height assignment method CODE TABLE 0 0 4

IMPORTANT NOTE: There must be at least 1 blank line at the end of LOCALELM. If not, BUFR encoding or decoding will almost certainly fail because TABLEB will not be opened later.
On the IBM mainframe, the Local D file should be given the DDNAME LOCALSEQ e.g.
//GO.LOCALSEQ DD DSN=MDB.BUFR.LOCALD,DISP=SHR

On an HP, T3E or IBMSP unix machine, the Local D file should be given the name or symbolic link LOCALSEQ in the run directory (the directory the BUFR executable will run in), or if using the environment variable BUFR_LIBRARY, put LOCALSEQ in the library pointed to by $BUFR_LIBRARY.
Example LOCALSEQ file:
309255                    UPPER AIR SIGNIFICANT TEMPERATURES AND WINDS
001001, 001002, 001011    STATION NUMBER OR CALL SIGN
005002, 006002, 007001    LATITUDE & LONGITUDE, STATION HEIGHT
004001, 004002, 004003    DATE (YEAR, MONTH, DAY)
004004, 004005            HOUR & MINUTE (IF KNOWN) OF LAUNCH
002011, 002014,           SONDE TYPE, TRACKING SYSTEM
002013                    RADIATION CORRECTION
022042                    WATER TEMPERATURE
104000, 031001, 008002    CLOUD DATA (LOW, MIDDLE, HIGH)
020012, 020011, 020013    CLOUD TYPE, AMOUNT & BASE FOR EACH LEV
008001, 106000, 031001    SEMI-STANDARD LEVELS (775MB ETC)
010004, 010003            PRESSURE & HEIGHT
012001, 012003            TEMPERATURE & DEW POINT
011001, 011002            WIND SPEED & DIRECTION
008001, 103000, 031001    SIGNIFICANT TEMPERATURES
010004, 012001, 012003    PRESSURE, TEMPERATURE & DEW POINT
008001, 103000, 031001    SIGNIFICANT WINDS
010004, 011001, 011002    PRESSURE, WIND SPEED & DIRECTION
But one further table can usefully be made, for decoding purposes only, or rather for displaying data coded in BUFR: it consists of brief descriptions corresponding to the code figures and flags. It seems best to avoid (as far as possible) displaying the code figures themselves: even where these correspond to existing WMO codes, not all users can be expected to know the codes, and many code and flag tables have been made specially for BUFR, either from scratch or by combining existing tables.
The problem is that (unlike element names) descriptions of code figures can be very long, especially where effectively several code figures and flags have been combined, as for present weather. This means that a brief form, say 12 characters, displayable in a table column, is not always easy to find.
But most of the code figures have, despite this, been compressed into a 12-character form which hopefully remains meaningful: those remaining will appear as figures in a display, leaving the user to look them up in the Manual on Codes.
The structure of the code figure table is as follows. Each description of a code figure (maximum 12 octets) is preceded by a length (1 octet), each set of code figures in a table by a count (1 octet). For each table there is an index entry consisting of the descriptor (2 octets) and a pointer (2 octets), these index entries being stored sequentially with a count of code tables at the start.
Descriptors appear as 6-figure numbers in the BUFR documentation. But if F, XX & YYY in FXXYYY are fields of 2, 6 & 8 bits respectively, then the numerical value of a descriptor is not equal to FXXYYY read as a single integer, but F*16384+X*256+Y rather than F*100000+X*1000+Y.
We therefore need several functions for converting from one form to another: from a 16-bit field in section 3 of a BUFR message to separate values of F, X & Y and hence the 6-figure displayable form as above (for, say, error messages), and from a 6-figure form (as in the documentation, and therefore more readable) to the 16-bit form used in encoding and decoding. DESFXY (DESCR,F,X,Y) converts a 16-bit descriptor to values of F, X & Y (all integer) and the function IDES (FXXYYY) converts from 6-figure form F*100000+X*1000+Y to 16-bit form.
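The two conversions can be illustrated in Python. The functions below only mirror the roles of DESFXY and IDES as described above; they are a sketch under that description, not the Fortran routines themselves, and `display` is a hypothetical helper added for error messages.

```python
def desfxy(descr):
    """Split a 16-bit descriptor into F (2 bits), X (6 bits) and Y (8 bits)."""
    f = descr >> 14
    x = (descr >> 8) & 0x3F
    y = descr & 0xFF
    return f, x, y


def ides(fxxyyy):
    """Convert the readable form F*100000+X*1000+Y to the 16-bit form."""
    f, rest = divmod(fxxyyy, 100000)
    x, y = divmod(rest, 1000)
    return f * 16384 + x * 256 + y


def display(descr):
    """Format a 16-bit descriptor as the 6-figure string used in the documentation."""
    f, x, y = desfxy(descr)
    return f"{f}{x:02d}{y:03d}"


print(ides(309255), desfxy(ides(309255)), display(ides(8002)))
```

Note that in Python the readable form must be written without leading zeros (8002, not 008002), since a leading zero is not valid in a decimal literal.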
But note that to find a given meteorological element in a message it is generally not enough to find a single descriptor: to find an element like tropopause temperature means finding two descriptors, not necessarily consecutive: 008002 with a value of 3 for the tropopause and only then a temperature descriptor. So in practice a data base interface is needed between a BUFR decode as described here and a meteorological user.
BUFR
Section 1: length in octets 1-3, originating centre in octets 5-6, flag for section 2 in octet 8, type of data in octets 9-10, date/time in octets 13-17
Section 3: length in octets 1-3, number of reports in octets 5-6, compression flag in octet 7, descriptors in octets 8-9, 10-11 etc
Section 4: length in octets 1-3, bit string starting in octet 5
7777

The task of decoding as defined here is to achieve a correspondence between descriptors and bits in the data section, so that we know how many bits make up a value, what element it is a value of, any scale changes etc, and then return arrays of descriptors and values in such a way that it remains clear to a calling program which value corresponds to which descriptor.
Conceptually this is a matter of taking Table B entries, perhaps with modified scale figure etc, and adding a further column to give the corresponding value. But in fact there is no need to set up the whole of such an array, which could well occupy a megabyte for a large message. If the aim is to display the contents of the message, then lines can be output as they are set up rather than held in core; if not, then what is wanted as output is an array of values with all operations performed and a corresponding array of descriptors to identify the elements (the other columns are only used while an element is being handled and can be discarded when the next element is reached - except when quality operations are possible (see 4.2)).
So, although at first it might seem convenient to separate expansion of the description, that is the process of looking up sequences, performing replications, adding quality control fields etc, from the bit manipulation involved in finding the corresponding values, this is not advisable for reasons of efficiency.
But there are more fundamental reasons for combining expansion of descriptor sequences and bit manipulation. To see why, we need further consideration of the replication operation.
First we must distinguish between explicit and delayed replication. A replication descriptor says how many descriptors to repeat. It may also say how many times to repeat them, but this count may be set to zero, in which case it has to be found in the data. This makes sense where, say, the number of levels in a profile is not known beforehand and may vary from profile to profile: delayed replication enables the same sequence of descriptors to be used for all profiles (though obviously not with compression if the count varies).
A descriptor sequence which includes delayed replication cannot be expanded in isolation from the data. It would be possible to find the replication counts before the values of the elements (by adding up the number of bits to skip) and so keep the two processes more or less separate - but there are further complications.
Replication originally applied only to descriptors: the descriptor sequence was abbreviated to save space and has to be expanded to match the data. But when a replication operator is followed by a data repetition count, rather than an ordinary delayed replication, the data value itself must be repeated the same number of times. This is for run-length encoding of images consisting of a fixed number of values of a given element, the precision being such that many successive values may be the same.
For instance, any line of a radar image can be broken up into segments consisting of identical pixel values and segments where the values vary. The first kind of segment calls for data repetition, a descriptor and a value both encoded once to be repeated N times in the output; the second requires replication, N values to be coded in the message and one descriptor repeated N times in the output to correspond. Clearly such a descriptor sequence cannot be expanded in isolation from the data.
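The two kinds of segment can be made concrete with a small Python sketch. It is an illustration only: the names `split_row` and `expand_segments`, the tuple layout, and the threshold `min_run` (the shortest run judged worth coding as a repetition) are all hypothetical, and no bits are actually packed.

```python
from itertools import groupby


def split_row(row, min_run=3):
    """Break a row of pixel values into "repeat" and "replicate" segments."""
    segments, varying = [], []
    for value, group in groupby(row):
        n = len(list(group))
        if n >= min_run:
            if varying:                      # close the varying segment first
                segments.append(("replicate", varying))
                varying = []
            segments.append(("repeat", n, value))
        else:
            varying.extend([value] * n)
    if varying:
        segments.append(("replicate", varying))
    return segments


def expand_segments(descriptor, segments):
    """Rebuild parallel descriptor and value arrays as a decode would output them."""
    descriptors, values = [], []
    for seg in segments:
        if seg[0] == "repeat":               # one coded value, repeated N times
            _, n, value = seg
            vals = [value] * n
        else:                                # N coded values, one descriptor
            vals = list(seg[1])
        values.extend(vals)
        descriptors.extend([descriptor] * len(vals))
    return descriptors, values


row = [0, 0, 0, 0, 5, 7, 7, 9, 2, 2, 2]
segments = split_row(row)
print(segments)
print(expand_segments("pixel", segments)[1] == row)
```

In both cases the output has N descriptors and N values; what differs is how many values had to be coded in the message.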
The third complication is the replication of coordinate increments. An element in one of the time or place classes immediately before a replication operator is taken to be included in the N-fold replication as an increment to be added N times, but without any further value in the data. There can be increments for more than one coordinate element.
Now consider nested replications, say for coding an image line by line: an outer replication for the number of lines in the image and inner replications to describe each line. The outer replication is preceded by, say, a latitude increment, the inner by a longitude increment; no pixel values occur except inside the inner replication.
Clearly the increment before the outer replication must be distinguished during the decoding process from that before the inner replication, or else it will be replicated again: it must be flagged as already replicated, and only unflagged when the expansion is complete.
In other words, there are descriptor sequences which cannot be reduced to sequences of element descriptors without destroying vital features of their relationship to the data. Hence sections 3 and 4 must be handled together.
F=0: element (class X, element Y in Table B)
     an element can be character or numeric, a numeric element a number, code figure or flag(s), and any element not in Class 31 can have associated fields
F=1: replication (of the following X descriptors Y times)
     Y>0: explicit (count in descriptor)
     Y=0: delayed (count in data, either ordinary replication or data repetition)
F=2: operation
     X=1: change field width (by Y-128 bits)
     X=2: change the scale, i.e. multiply by a power of ten (by 10^(Y-128))
     X=3: change reference values
     X=4: add Y-bit quality control field
     X=5: insert string of Y characters
     X=6: hide local descriptor
     [for quality operations see 4.2]
F=3: sequence (category X, sequence Y in Table D)

F=1
     If replication is delayed, the count is found in the data. Increments immediately before the replication operator are counted and the increment descriptors added to the end of the sequence of descriptors to be replicated. Space is made (as for a sequence) and the replication carried out. The values of any replicated increments will be copied in the output value array. If a count in the data is zero, delete all the descriptors that would have been replicated, including the increments, as well as the replication operator and count. If the count in the data indicates run-length encoding, flag the element descriptor (assuming that only one element at a time can be run-length encoded) and repeat it, leaving the operation to be completed by repeating the values in the value array. We also need a flag to be set when the descriptors are repeated and then unset when the value has been got from the bit string, to avoid looking in the bit string for further values.

F=2,X=1,2,4
     Width increment, scale increment and stacks of Q/C field width and field meanings are set accordingly and used whenever values of an element are found. Each value is then preceded in the output by the meaning of each field and the field itself, for as many pairs of meaning and value as are currently nested.

F=2,X=3
     Changed reference values are listed (in parallel arrays of descriptor and reference value) and the list consulted whenever values of an element are found.

F=2,X=5
     Inserted characters are put in the same string as character values.

F=2,X=6
     The descriptor and value are skipped - unless there is a local Table B entry with the same data width.

F=3
     Insertion of a sequence is simple. Space is made by moving the remaining descriptors down; the inserted descriptors overwrite the sequence descriptor itself, and scanning of the descriptors continues with no adjustment to the pointer, i.e. with the first descriptor in the inserted sequence.
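The insertion of sequences and explicit replication can be sketched in a few lines of Python. This is a deliberately reduced illustration: it works on readable 6-figure descriptors, drops replication operators once used, passes F=2 operators through untouched, and refuses delayed replication, which cannot be expanded without the data section. The function name and the one-entry Table D are hypothetical.

```python
def expand(descriptors, table_d):
    """Expand F=3 sequences and explicit F=1 replications in a descriptor list."""
    work = list(descriptors)
    i = 0
    while i < len(work):
        d = work[i]
        f, x, y = d // 100000, d // 1000 % 100, d % 1000
        if f == 3:
            # Overwrite the sequence descriptor with its contents and carry on
            # scanning from the same position, i.e. the first inserted descriptor.
            work[i:i + 1] = table_d[d]
        elif f == 1 and y > 0:
            # Explicit replication: repeat the next X descriptors Y times.
            block = work[i + 1:i + 1 + x]
            work[i:i + 1 + x] = block * y
        elif f == 1:
            raise NotImplementedError("delayed replication needs the data section")
        else:
            i += 1                           # element or operator: move on
    return work


# Illustrative Table D with a single sequence (leading zeros dropped).
table_d = {301001: [1001, 1002]}
print(expand([301001, 102002, 12001, 12003], table_d))
```

Because scanning resumes at the start of whatever was inserted, nested sequences and nested replications are expanded without any special handling.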
There are several ways of doing this. It can be done a bit at a time, testing whether a bit is set in the bit string and building up the value by doubling and either adding one or not adding accordingly.
Our Fortran program takes a slightly more complicated (but faster?) approach, working an octet at a time. We start in octet N=I/8. In this octet NINIT=I-N*8, i.e. MOD(I,8), bits have already been used. The value will extend over NOCTET=(WIDTH+NINIT+7)/8 octets, and in the last of these octets NLAST=WIDTH+NINIT-(NOCTET-1)*8 bits will be used.
The value is segmented in this way, bits being shifted in an octet by multiplying or dividing by powers of 2. A value that fits into one octet is treated as a special case.
A character value is encoded one octet at a time.
A value which is all ones, i.e. equal to (2^WIDTH)-1, is missing except in the case of a one-bit element or associated field, which is simply a flag set on or off.
Operationally we use an Assembler program which works one 32-bit integer at a time. It cuts encoding/decoding times by a third.
Skip I/32 words, load two words, shift left MOD(I,32) bits to get rid of unwanted bits in previous values and right 32-W bits to align the value, losing any bits of following values.
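The shifts can be imitated in Python to check the reasoning. This sketch is not the Assembler routine: it assumes the bit string has already been split into unsigned 32-bit big-endian words, that the width W is between 1 and 32, and it treats the two loaded words as one 64-bit quantity. The function name is hypothetical.

```python
def extract_by_words(words, i, w):
    """Return the w-bit value starting at bit offset i of a list of 32-bit words."""
    n = i // 32                                  # words to skip
    first = words[n]
    second = words[n + 1] if n + 1 < len(words) else 0
    pair = (first << 32) | second                # the two loaded words
    # Shift left to push bits of previous values off the top of 64 bits.
    shifted = (pair << (i % 32)) & 0xFFFFFFFFFFFFFFFF
    top = shifted >> 32                          # keep the first word only
    # Shift right to align the value, losing bits of following values.
    return top >> (32 - w)


words = [0x12345678, 0x9ABCDEF0]
print(hex(extract_by_words(words, 28, 8)))
```

The example takes 8 bits straddling the word boundary: the last four bits of the first word and the first four of the second.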
The Assembler method assumes that no value will be too big for an integer, in our case 32 bits, and both routines at present output integer values - but it may be that in the future there will be elements with so many bits that this loses precision.
Example: a 13-bit value is split between octets as follows:
=====+++ ++++++++ ++======
octet 1  octet 2  octet 3

NOCTET=3, NINIT=5, NLAST=2

Build up the value V as follows:              in this case:
V1=MOD(OCTET(1),TWOTO(8-NINIT))               V1=MOD(OCTET(1),8)
V2=V1*256+OCTET(2)                            V2=V1*256+OCTET(2)
V =V2*TWOTO(NLAST)+OCTET(3)/TWOTO(8-NLAST)    V =V2*4+OCTET(3)/64

where TWOTO is an array of powers of 2.
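The octet-at-a-time method, including the quantities N, NINIT, NOCTET and NLAST, can be transcribed into Python as a check on the formulae. This is a sketch following the description in these notes, not the Fortran source; the function names are hypothetical and bit offsets are counted from zero.

```python
def get_value(octets, i, width):
    """Extract the width-bit value that starts after i bits of the octet string."""
    n = i // 8                                   # octet in which the value starts
    ninit = i % 8                                # bits of that octet already used
    noctet = (width + ninit + 7) // 8            # octets the value extends over
    nlast = width + ninit - (noctet - 1) * 8     # bits used in the last octet
    if noctet == 1:
        # Special case: drop the used bits, then the bits of following values.
        return (octets[n] % 2 ** (8 - ninit)) // 2 ** (8 - nlast)
    v = octets[n] % 2 ** (8 - ninit)             # remainder of the first octet
    for k in range(1, noctet - 1):
        v = v * 256 + octets[n + k]              # whole octets in the middle
    return v * 2 ** nlast + octets[n + noctet - 1] // 2 ** (8 - nlast)


def is_missing(value, width):
    """All ones means missing, except for a one-bit field, which is just a flag."""
    return width > 1 and value == 2 ** width - 1


# The 13-bit example: 5 bits already used, then 101 10101010 11.
octets = bytes([0b11111101, 0b10101010, 0b11010101])
print(get_value(octets, 5, 13))
```

With these octets V1=5, V2=5*256+170=1450 and V=1450*4+3=5803.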
For character elements the corresponding value points to a character string: the value is length*2^16 plus pointer.
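A small Python illustration of this packing follows. The function names are hypothetical, and the pointer is assumed to be a 1-based position in the character string, as would be natural in Fortran; the notes do not say which convention is used.

```python
def pack_char_ref(length, pointer):
    """Combine string length and pointer into one value: length*2^16 + pointer."""
    return length * 65536 + pointer


def fetch_chars(value, names):
    """Recover the character value from the shared string (1-based pointer assumed)."""
    length, pointer = divmod(value, 65536)
    return names[pointer - 1:pointer - 1 + length]


names = "HEATHROWGATWICK"
ref = pack_char_ref(7, 9)
print(ref, fetch_chars(ref, names))
```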
Ideally the N-th descriptor in the output would correspond to the N-th value or N-th row of values, i.e. all operators would have been used and then deleted, leaving only element descriptors. But unfortunately this is not generally so.
In the expansion of the BUFR descriptor sequence the following aims at first sight seem reasonable: (1) to leave a valid sequence of descriptors after any operation, (2) to end up with a sequence in one-to-one correspondence with the values, i.e. with no operators left in it, (3) to end up with a sequence that can be used to reencode selected subsets of values (reports) from a compressed message, (4) to end up with a sequence which can be used to decode another subset (if there are several subsets in the message with no compression).
Of these aims (3) is questionable, because what is wanted in section 3 of a BUFR message is more likely to be the original than the expanded sequence, (2) requires decisions about whether delayed replication counts are to be put in the output value array and what descriptors should correspond to quality control fields, (1) is unattainable for reasons like those described in 2.2, and (4) is internal to the decoding process, so better abandoned - it's simpler to keep the original sequence and repeat the expansion.
In fact aim (2) is inconsistent with (1) and (3): if our aim is correspondence with the values, and therefore operators are deleted after use, then we're left with replication counts with no replication operators; if the operators were left, then the descriptor count (X) would have to be adjusted during subsequent operations, which would be difficult.
So the best we can aim for is some correspondence between descriptors and values (essential - though some descriptors may have to be skipped) and the possibility of reencoding starting with the original descriptor sequence (though this would depend on the operations used).
So the output descriptor and value arrays depart from one-to-one correspondence and immediate reencodability in the following ways:
Our BUFR decode provides an optional display of the values (one line each: element name, units, value - if the value is a code figure, then if possible it is replaced by a brief description, and flags are handled similarly, a bit at a time).
Example of display:
WMO BLOCK NUMBER             NUMERIC        33
WMO STATION NUMBER           NUMERIC        946
LATITUDE (COARSE ACCURACY)   DEGREES        45.00
LONGITUDE (COARSE ACCURACY)  DEGREES        34.00
HEIGHT OF STATION            M              205
TYPE OF STATION              CODE TABLE     MANNED
YEAR                         YEAR           1996
MONTH                        MONTH          4
DAY                          DAY            21
HOUR                         HOUR           0            3            6            9            12           15
WIND DIRECTION AT 10M        DEGREES TRUE   170          0            30           60           50           230
WIND SPEED AT 10M            NUMERIC M/S    3.1          *********    2.1          4.1          3.1          5.1
CLOUD TYPE                                  NO CL CLOUD  NO CL CLOUD  NO CL CLOUD  CU CAL       NO CL CLOUD  CU CAL
CLOUD TYPE                                  AC TR LEVEL  AC TR LEVEL  AC TR LEVEL  AC TR LEVEL  AC TR LEVEL  AC TR LEVEL
CLOUD TYPE                                  NO CH CLOUD  CI FIB (UNC) CI SPI SHEAF CI SPI SHEAF NO CH CLOUD  NO CH CLOUD
This vagueness has led, for instance, to a disagreement about whether a station height (007001) can be incremented by the height increment 007005 to give heights in a profile. Being in Class 7, a coordinate class, 007001 can reasonably be taken to apply to any data which follows - rather than just giving information about the station, in which case it should be in a non-coordinate class. But even so the combination of 007001 and 007005 has been objected to.
If note 94.5.3.3, about coordinate elements "contradicting" one another, is seriously meant as something programmable, if we must in principle always be able to output from a decode all the coordinates of a given element, then we need up to ten (one for each coordinate class, some of them at present reserved) 256*256 bit tables, specified as part of the BUFR documentation, to allow decisions about contradiction to be made for all possible elements.
Fortunately a looser interpretation is possible in contexts not involving increments, which leaves decisions about contradiction to the user at the data base interface. We can say that the coordinate/value distinction is entirely a matter for retrieval, i.e. selection of data from BUFR messages: the user specifies values of coordinate elements (bearing in mind, for instance, that more than one descriptor is possible for latitude and longitude!) and it is at this stage that decisions must be taken about which of several "contradictory" coordinate elements to use.
But this is only one aspect of the problem. A coordinate can be redefined by a conflicting element in the same class - but there are times when we want to say that a coordinate no longer applies rather than decide between two conflicting coordinate elements. This is especially true of class 2, instrumentation. There is some agreement that an instrumentation "coordinate" can be cancelled by a missing value of the same element, but this leaves problems of interpretation.
Suppose we use the element "sonde type" and then (for comparison, say) have a measurement not made by a sonde. There may be a corresponding instrumentation element which could be interpreted as contradicting "sonde type"; but this is certainly not true for all possible elements. Sonde type implies, among other things, a temperature-measuring instrument; but the instrument used in a screen at the surface is assumed to be known! A missing value for sonde type could mean (if the above convention is accepted) that the coordinate no longer applies: but sonde data for an unknown sonde type is quite conceivable!
(But is Class A meant to rule out such combinations of data? This is another vague feature of the BUFR system: the classification is partly by place and partly by instrumentation. "Surface data: land" appears to cover measurements by satellite and sonde and aircraft on the ground as well as ordinary anemometers and thermometers! Is the category of satellite sea surface temperature data 0, 5 or 31?)
Note also that the proliferation of instrumentation data in BUFR has made some early element names inappropriate: 002003, "type of measuring equipment used", is clearly meant only for PILOTs when the code figures are examined.
Clearly the current position is obtained by adding the increment, if there is one, to the original position. But what if there is more than one increment for the same element? The general BUFR rules would say the second overrides the first, so add the second increment to the original value; but increments before replications are clearly meant to take effect cumulatively, i.e. the value before the replication count is added repeatedly to the original value.
We must then assume that if a new original position is given, any increment is cancelled. If, for instance, we reach the end of a row in scanning an image, restating the original longitude will take us back to the start of the next row. Until the longitude is restated the increments remain in force, even outside the replication which added them, so that a run-length-encoded row, consisting of several segments, each with its own replication, will accumulate increments along the whole row, rather than go back to the original value at the start of each segment.
So we must assume that increments involved in replications always (not just within the replication) take effect cumulatively: that an increment can be cancelled by resetting the original coordinate at the start of a row, but then each step is always added to the current value of the increment, however many segments there are in the row.
Our decode program replicates the increments explicitly if an increment descriptor appears before a replication operator: the increments can then be converted to incremented values of the coordinate in a further pass through the output array.
Increments before replication operators are recognised by the presence of the word 'increment' in the name. The matching up of increments and elements incremented is (fortunately) an operation that can be handled outside the basic decode. We suggest incrementing an element only if an element in the same class (in classes 4-7) and with the same units is found with the same name as far as 'increment', or at least with the word 'increment' in its name, so as not to tie the increment recognition process to the word order of English (other centres may use translated element names, and the equivalent of 'increment' could come at the start rather than the end of a name!). But one day there may be an element with 'increment' in its name which despite that is not an increment in the sense of this section, so this is still not a satisfactory proposal.
One BUFR rule about increments is clearly stated: 94.5.4.3 says that a replicated increment is added the first time to give the coordinate of the first set of replicated data, so the original coordinate in the BUFR message must be the first position or time minus the increment.
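The consequences of this rule, and of the cumulative behaviour assumed earlier in this section, can be shown in a short Python sketch. The function names are hypothetical, and integers are used so that the arithmetic is exact.

```python
def original_for(first, increment):
    """Coordinate to code in the message so that the first replicated set lands on `first`."""
    return first - increment


def replicated_coordinates(original, increment, count):
    """Coordinates of the replicated sets: the increment is added the first time too."""
    return [original + k * increment for k in range(1, count + 1)]


def row_coordinates(original, increment, segment_counts):
    """Accumulate increments along a row made of several replicated segments."""
    current, out = original, []
    for count in segment_counts:
        for _ in range(count):
            current += increment             # never reset between segments
            out.append(current)
    return out


# Six three-hourly values starting at hour 0: the message carries hour -3.
start = original_for(0, 3)
print(start, replicated_coordinates(start, 3, 6))
print(row_coordinates(start, 3, [2, 4]))
```

Splitting the six steps into segments of two and four gives the same coordinates as a single replication, which is the behaviour wanted for a run-length-encoded row.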
One such assumption concerns compression of character values.
Compression in general consists in taking N values of an element, finding the minimum and coding that in the current number of bits for the element, followed by an increment field width and N increments which, when added to the minimum, reconstruct the values.
Compression is done by scanning the values to find the maximum and minimum, allowing for missing data. Find the number of bits needed to code maximum minus minimum plus one (from the next highest power of 2, the smallest M such that max-min+1<2^M). That is the increment width. One is added because all ones would be taken as representing missing data; so if max-min=(2^M)-1 for some M, the number of bits needed is not M but M+1. Missing values are ignored in finding the minimum, but a flag is set if missing values exist: max=min with no values missing means no increments to be coded, but max=min with missing values means one-bit increments, set to 1 if the value is missing.
If a value cannot be encoded in the field width, it is set to missing before it can affect the range of values.
Now consider compression of characters. Character fields are left-aligned, so compression saves nothing even if all the values are short compared with the field width (only a change of the field width itself would save space). It saves a lot of work to assume (when encoding) that the "local reference value" coded before the increment width and increments is not necessarily the minimum as above (i.e. such that at least one increment is zero), but can be simply a convenient value, in this case binary zeros so that the characters are encoded unchanged. (But a check is made to see if all the character values are the same, in which case the value and a zero increment width are coded.)
During decoding, on the other hand, no such assumption can be made: other centres may well have gone through the laborious business of subtracting character strings to arrive at increments which are not in themselves characters.
Examples:

  values to be coded 45, 37, 19, 22, 17
    minimum = 17, max minus min = 28, hence 5 bits

  values to be coded 21, 3, 13, 34, 5, 8
    minimum = 3, max minus min = 31 - but an increment of 31 in 5 bits would have all 5 bits set and therefore mean missing, hence 6 bits are needed
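The scan for minimum and increment width described in this section can be sketched as follows. This is only an illustration under stated assumptions: values are taken to be integers already scaled, missing data is represented by None, and the function name is hypothetical. The case where every value is missing is not covered by the text above, so the sketch simply rejects it.

```python
def compress(values):
    """Return (minimum, increment width, increments) for one element.

    `values` are scaled integers, with None standing for missing data.
    """
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("all values missing: not handled in this sketch")
    lo, hi = min(present), max(present)
    any_missing = len(present) < len(values)
    if hi == lo and not any_missing:
        # All values equal and none missing: no increments are coded.
        return lo, 0, []
    # Smallest M such that max-min+1 < 2^M; the extra one keeps the
    # all-ones pattern free to represent missing data.
    width = (hi - lo + 1).bit_length()
    all_ones = (1 << width) - 1
    increments = [all_ones if v is None else v - lo for v in values]
    return lo, width, increments


print(compress([45, 37, 19, 22, 17]))
print(compress([21, 3, 13, 34, 5, 8]))
print(compress([7, None, 7]))
```

The third call shows the one-bit case: equal values with one missing give a width of 1, with the increment set to 1 only where the value is missing.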
Obviously any set of values can be encoded given a descriptor sequence which is in one-to-one correspondence. But usually a description comparable in length with the data is not acceptable when BUFR provides so many ways of abbreviating it. If data which has just been decoded is re-encoded, and the descriptor sequence in the original message is reusable, then it is reused; but there is no obvious way of making a shorter descriptor sequence automatically when no unexpanded sequence is available. Such a process would be like decompiling machine code into a high-level language.
In other words, descriptor sequences can be expanded but not contracted. We therefore need a way of checking that a sequence chosen by a user from entries in Table D and Table B will expand as expected, a way of showing clearly where values should come in an input array, what scale changes are required and so on.
A program to do this will obviously not be able to use counts in the data, but can for instance inset descriptors which will be replicated. (Because delayed replications can't be carried out, different programming techniques are called for: we need a stack of nested replications, with counts of descriptors at each level.)
One of the features of BUFR more easily overlooked when setting up descriptor sequences is the distinction between coordinate elements and others (see 2.6). Time and place precede values at that time and place, and elements in certain other classes, like instrumentation, likewise apply (until changed) to the values that follow.
This effect is not overridden by replication: if the coordinates in a group of replicated descriptors don't come first, they apply to the first values of the elements which follow in the replicated group and the second values of the elements before - then comes a further coordinate change, and so on.
Of course a user who wants all the data in a message knows how to interpret it and won't connect the values and coordinates wrongly. But a general retrieval program going through data of different kinds might well look for values of a certain element at given places and times, ignoring any other elements, and return wrong data if the coordinates are out of place.
The above-mentioned program (SCRIPT) for showing how a sequence will expand puts a blank line in front of any coordinate element (or sequence of successive coordinates), hoping that an unexpected break will warn a user that the strict interpretation may be not what is intended.
If the input is a real array, then the scale column can usually be ignored. What is required is values in the units specified. The scale can be taken as a warning about what rounding will be done in the course of encoding - but then presumably the precision of the data is reflected by the description chosen by the user at an earlier stage (whether to code temperatures in whole degrees, or tenths, or hundredths - with a change of scale if necessary). The user only needs to ensure at this stage that temperatures are in Kelvin rather than Celsius. (Obviously, if the input is an integer array, then a temperature in tenths is required if the scale factor is 1.)
The reference value in Table B is likewise not the user's concern. For temperature it was possible to choose units (degrees Kelvin) which always give positive values, so no nonzero reference value was needed; for latitude that is not possible, and so the encoding process must subtract a large enough negative number to give always a positive number to encode. But this requires no action by the user.
An example may help. A temperature is normally in degrees Kelvin with a scale factor of 1, i.e. in tenths. So real input requires a value like, say, 287.6; this number will be multiplied during encoding by 10 to give 2876, the value to go into the bit string (unless, of course, there is compression).
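The arithmetic can be sketched in a few lines. The function name is hypothetical, and the latitude figures (scale 2, reference value -9000) are used only as an illustration of a nonzero reference value, not quoted from our tables.

```python
def to_coded_integer(value, scale, reference):
    """Scale a real value in Table B units and subtract the reference value.

    Python's round() rounds exact halves to the nearest even integer;
    a real encoder may round differently.
    """
    return round(value * 10 ** scale) - reference


# Temperature in degrees Kelvin, scale 1, reference value 0.
print(to_coded_integer(287.6, 1, 0))

# A latitude with an assumed scale of 2 and reference value of -9000:
# subtracting the negative reference gives a positive number to encode.
print(to_coded_integer(-12.34, 2, -9000))
```

The user supplies only the value in the units given by Table B; both steps belong to the encoding process.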
Beware that if the scale is changed and the reference value is not zero, then it may be necessary for the user to change the reference value to go with the new scale. (But a change is not essential if the scale change leads to less precision; and the expected range of values may be such that for greater precision no change is needed - the reference value only needs to be a large enough negative number.)
Beware also of scale changes for precipitation, where negative values are really code figures and so the reference value should stay as -1 or -2 regardless of changes. So a trace is always -1 or -2 regardless of scale. The encode and decode both assume that a negative value of any class 13 element with a reference value of -1 or -2 is a trace and therefore not scaled.
For character values we make the corresponding number in the value array a pointer to a character string (see 5.3 for details of the call). There is no need for a length, which is given by Table B with any adjustment. "Inserted characters" (operation 5, which gives the length) simply follow on in the input character string with no pointer in the value array.
The first is for straightforward delayed replication, which is explained clearly enough in the documentation. The second is for "run-length encoding" of images: if the range of pixel values is small, so that, when an image is scanned, many successive values will be the same, it is convenient to give the number of identical values rather than encoding the value that many times.
A descriptor pattern which makes this possible without requiring a different sequence of descriptors for each image is as follows. Any row can be broken up into a set of "parcels" each consisting of a number of strings of identical values followed by a string of different ones. In this way an image can be described by a general sequence of 17 descriptors (see below), to be expanded using the counts in the data.
The basic BUFR software can encode an image in this way if passed the counts and told to use this descriptor pattern. But this is not the only possible approach to image encoding, so the sequence of descriptors is not embedded in the basic programs, and the above outline can be implemented in various ways: for instance, greater compression could be achieved (at the expense of more elaborate programming) by treating values repeated only 2 or 3 times as if they were different (the values themselves take up less space than the extra counts required).
Our method is to provide a preliminary call which takes a 2-dimensional array representing an image and returns a sequence of values with counts inserted, ready to be encoded with the descriptors which are likewise returned by the program (with the element concerned, e.g. pixel value, and increments inserted). This is only one way of run-length encoding an image: the user can, of course, replace the call to RUNLEN by any program which produces valid sequences of values and descriptors to be passed to the encoding program.
 1  005001  initial latitude (minus increment)
 2  005011  latitude increment from row to row
 3  113000  replicate the rows of the image
 4  031002  number of rows
 5  006001  initial longitude (minus increment)
 6  110000  replicate "parcels" of different and same in row
 7  031002  number of parcels in row
 8  006011  longitude increment along row
 9  101000  repeat a string of different values
10  031002  number of different values
11  030001  descriptor for pixel element itself
12  104000  replicate runs of identical values
13  031002  number of runs
14  006011  longitude increment along row
15  101000  replicate a string of identical values
16  031012  number of identical values
17  030001  descriptor for pixel element itself
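The splitting of one row into parcels can be sketched as below. This is not the RUNLEN program: the function name and the output layout are hypothetical, the order within a parcel (different values first, then runs) follows the descriptor listing above, and the longitude increments are left out. A parcel may have an empty string of different values, which the sketch assumes would be coded with a count of zero.

```python
from itertools import groupby


def parcels(row):
    """Split one image row into parcels.

    Each parcel is a pair (different, runs): `different` is a list of
    pixel values that are not repeated, `runs` is a list of
    (count, value) pairs for the runs of identical values that follow.
    """
    result = []
    different, runs = [], []
    for value, group in groupby(row):
        count = len(list(group))
        if count == 1:
            if runs:
                # A single value after some runs starts a new parcel.
                result.append((different, runs))
                different, runs = [], []
            different.append(value)
        else:
            runs.append((count, value))
    if different or runs:
        result.append((different, runs))
    return result


print(parcels([1, 2, 3, 3, 3, 4, 4, 5, 6, 6]))
```

The number of parcels, the length of each `different` list, the number of runs and the count in each run correspond to the four kinds of replication count in the listing.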
Because this proposal was under consideration for so long we had to implement a comparable scheme to meet UK Met Office needs long before acceptance. Its code tables are the same as those accepted and its operators can be distinguished from the new ones, so provision for it has been left in the software as we have no plans for immediate conversion to the new operations in our own data base. This chapter therefore ends with a brief description of our BUFR extension.
A bit map is a set of values of the one-bit flag element 031031 (0 - data present, 1 - data not present). An N-bit map defines a subset of the N elements (elements rather than descriptors!) preceding an operator of the form 2XX000, where XX=22, 23, 24, 25 or 32. Elements here means effectively values in the data section, i.e. any delayed replication counts are included.
If M bits in a bit map are zero, then values of the corresponding M elements will follow in the data section as the result of any operation which uses this bit map. These values will be corrections, original values, differences, statistics etc as indicated by XX (together with 008023 or 008024 if XX is 24 or 25) or Class 33 elements in the case of 222000. But the values may not follow immediately and may not be consecutive; their positions in the data that follows will be shown by M place-holders of the form 2XX255 or M Class 33 descriptors. The I-th place-holder corresponds to a value of the I-th of the M elements with zeros in the bit map, encoded with its scale, data width & reference value as modified by any operations in force for the original value.
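As an illustration, the selection made by a bit map can be sketched like this. The function and the element names are hypothetical; the only rule taken from the text is that a zero bit means data present and that the N bits refer to the N elements preceding the operator.

```python
def elements_selected(elements, bit_map):
    """Return the M elements picked out by an N-bit map.

    `elements` lists the elements (values in the data section) that
    precede the 2XX000 operator; the map applies to the last N of them.
    A bit of 0 means data present, 1 means data not present.
    """
    n = len(bit_map)
    if n > len(elements):
        raise ValueError("bit map is longer than the list of elements")
    referred_to = elements[len(elements) - n:]
    return [e for e, bit in zip(referred_to, bit_map) if bit == 0]


elements = ["pressure", "temperature", "dew point", "wind speed"]
print(elements_selected(elements, [1, 0, 1, 0]))
print(elements_selected(elements, [0, 1, 1]))
```

The I-th item of the returned list is the element whose scale, data width and reference value govern the I-th place-holder.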
The set of operators finally accepted has redundancies resulting from the different versions the proposal went through. Of the four operators added later, 236000, 237000, 237255 & 235000, only 235000 is essential as the proposal now stands, and its definition is too restrictive.
236000 defines a bit map for use later, but a bit map can be recognised without it. 237000 reuses a bit map, but only one bit map can be currently defined, so again the descriptor is unnecessary. 237255 cancels a bit map, but a new bit map, taken to supersede the old one, would have the same effect. Only 235000 is essential: it unsets the end of the set of values referred back to by a bit map, leaving the next 2XX000 (where XX is 22 to 25 or 32) to reset it. Without this all quality operations would refer back to the same point.
Our decode also allows the same bit map to be used for different sets of elements. This possibility is, strictly speaking, ruled out by the operations as currently defined, but taking the least restrictive approach we see no reason why 235000 should cancel the bit map at the same time as changing the set of elements referred back to. If a new bit map follows, it will override the previous one; if not, the previous bit map can be left in force.
The only alternative is to stop the decode because a rule has been broken, whereas it may well be possible to continue successfully. But remember that, while this may be a useful feature, messages should still be encoded to follow the rules as closely as possible, or more restrictive decodes may fail!
Given this log, we need action to carry out quality operations at the following points:
In the case of quality operations, if the message contains several temperatures and a correction to one of them, the decode as described above would print out a temperature but not make it clear which original value was being corrected. Rather than leave higher-level programs with the same manipulation of bit maps to repeat, we need pointers to link original value and correction in the output descriptor array. This array already needs to include scale change and (modified) replication operators as well as element descriptors, because (as explained in 2.3) information which may be needed would otherwise be lost.
As pointers we use the place-holders (because XX gives information about the value added by the quality operation) with numbers set in the top bits. Each place-holder was replaced above by a descriptor; to set these pointers we keep a list of descriptors to be inserted in the sequence before completing the decode. The n-th insertion in this list puts a place-holder with n in the top bits after the original value and an identical place-holder with n set after the correction or whatever value is added. More than one such pointer can follow the original value. We can then get from original value to correction or vice versa by searching for a uniquely identified descriptor.
We need a set of comparable values or differences (observed minus analysed, observed minus forecast etc) to be attached to reported values. As well as the differences themselves we need descriptors to say whether the values are analysed, forecast, statistics, previous or neighbouring values, and also which model produced the forecast, the times of fields and so on.
We introduce an operator 223YYY which puts the Y descriptors which follow before each non-coordinate element to which the operation applies. The Y descriptors (or their expansion) may include 008023 and 008024, which will be taken as markers for added values or differences (respectively) of the element to which the sequence is attached; the descriptor for the element itself will be inserted as the attached sequence is expanded. The last descriptor of the sequence must be 008023 with a value of zero, to indicate that observed data (the original value) follows.
Added values of the element are encoded like the original value, with any changes of data width, scale and reference value in force; differences are encoded (at present) with a data width of N and a reference value of -2^(N-1), where N is the data width for the original descriptor. It would be better to use a width of N+1 and reference value of -2^N (as in the new operations, which had the benefit of our experience!), giving a range twice the original and centred on zero, which can encode all possible differences; at present any large relative humidity difference, for instance, may have to be set to missing.
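The two ranges can be compared with a short calculation. The sketch assumes, as for other values, that the all-ones pattern is reserved for missing data, and it takes a 7-bit element with zero reference value as a stand-in for relative humidity; the function name is hypothetical.

```python
def difference_range(width, reference):
    """Lowest and highest difference that can be coded.

    The all-ones bit pattern is assumed to be reserved for missing data.
    """
    return reference, reference + (1 << width) - 2


n = 7  # assumed data width of the original element
print(difference_range(n, -(1 << (n - 1))))   # present scheme
print(difference_range(n + 1, -(1 << n)))     # width N+1, reference -2^N
```

With original values between 0 and 126, differences can lie anywhere from -126 to 126: the first range loses roughly half of them, while the second covers them all.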
The only changes of data width etc which apply to other elements in the attached sequence are those defined within it, which lapse when the original value is reached; any coordinate changes made within the sequence are likewise assumed to lapse at the end.
So the action taken for a descriptor with F=2, X=23 and Y>0 is as follows:
The start and end (marked by a descriptor 008023 indicating an observed value) of the descriptors defining the sequence of values are kept, the sequence being kept in its place until the operation is cancelled and inserted after any element descriptor, the element descriptor itself being put in the sequence after each 008023/024. An operator 223000 is put in the output descriptor array to mark the start of each occurrence of such a sequence.
Table   Made by    From input     Size   Rough number of
        program    (file name)           entries (Oct 93)

B       NEWTABLB   EDITABLB       44K    450 elements
D       NEWTABLD   EDITABLD       16K    150 sequences
Codes   NEWCODES   EDITCODE       16K    120 code tables

Note: 120 code tables is about 1000 code figures.

The input data (readable) can of course be edited to add new entries. These should be inserted so as to leave the descriptor numbers in sequence (rather than putting new entries at the end; i.e. no sort is done by NEWTABLB etc). Remember that new sequences (in this version of Table D) must have no more than 16 descriptors (i.e. longer sequences should be broken up).
The three tables can be accessed (with the same file names as above) by the following programs:
TABLEB (X,Y,SCALE,REFVAL,WIDTH,FORMAT,NAME,UNITS)
returns the fields of the Table B entry for 0XXYYY,
where X & Y (integers) are input and the rest (3 integers and 3 character strings) are returned.
(WIDTH=0 if there is no entry 0XXYYY in Table B)
TABLED (X,Y,SEQ,NSEQ)
returns the sequence 3XXYYY in Table D,
where X & Y are input and NSEQ is the number of descriptors returned in SEQ
(all arguments are integer, NSEQ=0 if no sequence 3XXYYY in Table D)
CODE (DESCR,VALUE,WORDS)
returns in WORDS a description (not more than 12 characters) corresponding to the code figure VALUE of the descriptor DESCR (both integers)
(WORDS=' ' if no such code figure or value)
VALUE (STRING,IBEFOR,WIDTH)
gets a value in WIDTH bits after the first IBEFOR bits of STRING, where STRING is section 4 of a BUFR message (starting with the length).
VALOUT (STRING,IBEFOR,WIDTH,VALUE)
puts a value in WIDTH bits after the first IBEFOR bits of STRING.
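The bit manipulation behind these two routines can be sketched as follows. This is not the code of VALUE or VALOUT: the names get_value and put_value are hypothetical, the string is taken to be a Python bytearray, and bits are assumed to be numbered from the most significant bit of the first octet.

```python
def get_value(string, ibefor, width):
    """Return the unsigned integer held in `width` bits after `ibefor` bits."""
    value = 0
    for bit in range(ibefor, ibefor + width):
        octet = string[bit // 8]
        # Pick out one bit, most significant bit of each octet first.
        value = (value << 1) | ((octet >> (7 - bit % 8)) & 1)
    return value


def put_value(string, ibefor, width, value):
    """Store `value` in `width` bits after the first `ibefor` bits."""
    for i in range(width):
        bit = ibefor + i
        mask = 1 << (7 - bit % 8)
        if (value >> (width - 1 - i)) & 1:
            string[bit // 8] |= mask
        else:
            string[bit // 8] &= ~mask & 0xFF


buffer = bytearray(4)
put_value(buffer, 5, 12, 2876)   # a 12-bit field that is not octet-aligned
print(get_value(buffer, 5, 12))
```

Working a bit at a time is slow but makes no assumption about word length; a production routine would more likely move whole octets where it can.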
ENCODE A VERSION 2 BUFR MESSAGE
===============================

CALL ENBUFV2(DESCR,VALUES,NDESCR,NELEM,NOBS,NAMES,DATIME,MESAGE,CMP,L,
             EDITION,MASTERTABLE,ORIGCENTRE,DATATYPE,DATASUBTYPE,
             VERMASTAB,VERLOCTAB,EXTRASECT1,CHARSECT1,EXTRASECT2,
             CHARSECT2,SECT3TYPE)

where

DESCR        Integer i/p then o/p : Is an integer list of BUFR descriptors, in an array big enough for any expansion needed. The array is changed following a BUFR encode, so needs to be reset if another encode is to be attempted with the original descriptors.
VALUES       Real i/p : Is a NOBS*NELEM real array of values to be encoded (in the units given by Table B; set missing values to -9999999.0)
NDESCR       Integer i/p then o/p : Is the number of descriptors (if this is zero, the descriptor sequence in MESAGE will be used; if the string needs expansion, NDESCR will be found changed on return).
NELEM        Integer i/p then o/p : Is the number of values implied by the descriptor sequence (not always the final value of NDESCR, because the output descriptors include some operators - see 2.3)
NOBS         Integer i/p : Is the number of sets of values to be encoded together
NAMES        Character i/p : Is a character string containing any character values (for each of which, except those "inserted" by 205YYY, the VALUES array contains a subscript pointing to the start of a field in this string (the length coming from Table B))
DATIME       Integer i/p : Is a 5-integer date/time (year, month, day, hour, minute)
MESAGE       Character o/p : Is a character string for the output BUFR message (i.e. it will consist of binary data)
CMP          Logical i/p : Is TRUE if compression is required, FALSE if not
L            Integer o/p : Is the length of the BUFR message in octets
EDITION      Integer i/p : The BUFR edition number (section 1). Code -99 for the default (=2)
MASTERTABLE  Integer i/p : The BUFR master table (section 1). Code -99 for the default (=0)
ORIGCENTRE   Integer i/p : Originating centre (section 1). Code -99 for the default (=74)
DATATYPE     Integer i/p : Data category type (section 1). Code -99 for the default (=255)
DATASUBTYPE  Integer i/p : Data category subtype (section 1). Code -99 for the default (=0)
VERMASTAB    Integer i/p : Version number of master tables (section 1). Code -99 for the default (=2)
VERLOCTAB    Integer i/p : Version number of local tables (section 1). Code -99 for the default (=1)
EXTRASECT1   Logical i/p : Code TRUE if there is extra data to be added to the end of section 1. If so, the data in CHARSECT1 will be added.
CHARSECT1    Character i/p : Extra data to add to the end of section 1.
EXTRASECT2   Logical i/p : Code TRUE if there is data to be put in section 2. If so, the data in CHARSECT2 will be added.
CHARSECT2    Character i/p : Extra data to put in section 2.
SECT3TYPE    Integer i/p : Section 3, byte 7 (type of data). Code 1 for observed, 0 for other. Code -99 for the default (=1)

(The length of MESAGE cannot be much more than the total length of the three inputs DESCR, VALUES & NAMES. The dimension of DESCR may have to be greater than NELEM, because some manipulations expand before deleting.)
DECODE ANY BUFR MESSAGE
=======================

CALL DEBUFR(DESCR,VALUES,NAMES,NDESCR,NOBS,MESAGE,DSPLAY)

where

DESCR   will be returned as an integer list of descriptors in 16-bit form (see 1.5),
VALUES  will be returned as a NOBS*NDESCR real array of values in the units given by Table B,
NAMES   is a character string for any character values returned (for each of which the VALUES array will contain length*(2^16) plus a subscript pointing to the start of a field in this string, the corresponding descriptor being flagged by adding 2^17),
NDESCR  must be input as the length of DESCR and will be returned as the output descriptor count. This must be at least twice the number of descriptors actually returned, as some workspace is needed by the DECODE routine,
NOBS    must be input as the length of VALUES and will be returned as the number of sets of values (reports, profiles),
MESAGE  is the input BUFR message,
DSPLAY  is set to TRUE for a display of element names and values.

(Unfortunately there is no way of telling how big DESCR, VALUES and NAMES must be without first decoding the message, hence dimensions are passed in NDESCR and NOBS to avoid overwriting.)
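An application program has to unpack the character pointers returned in VALUES. A minimal sketch of that step follows; the function names are hypothetical, and the subscript is assumed to count from 1, as it would in a Fortran character string.

```python
def is_character(descr):
    """True if the descriptor has been flagged by adding 2^17."""
    return bool(descr & (1 << 17))


def character_value(value, names):
    """Return the field of `names` pointed to by an entry of VALUES.

    `value` holds length*(2^16) plus a subscript into `names`;
    the subscript is assumed to start at 1.
    """
    length, start = divmod(int(value), 1 << 16)
    return names[start - 1:start - 1 + length]


names = "EGLLHEATHROW"
print(character_value(4 * 65536 + 1, names))
print(character_value(8 * 65536 + 5, names))
```

Testing the descriptor first avoids treating an ordinary large value as a pointer.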