[Opm] File formats

Torbjørn Skille tskille at equinor.com
Tue May 19 08:25:09 UTC 2020


I have been working for some time now on improving the performance of loading summary data for an ensemble of models. I have been testing with some real full-field models, and these are the typical dimensions of the problems I have looked at:

•Number of summary vectors: 30000
•Number of timesteps: 3000
•Size of ensemble: 100

On a typical machine available to engineers in Equinor, the maximum number of CPUs available is 12.

My conclusion is that the UNSMRY format is not suitable for load on demand with respect to performance, not even when utilizing multithreading.

The LODSMRY file format currently implemented in opm-common is a first approach to solving this. It uses the existing utilities in EclUtil, and the only difference is that the data is stored per vector instead of per time step. This helped a lot in terms of performance.
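To illustrate why this helps (a rough sketch only - the function names are made up, and it ignores the Fortran record markers and the SMSPEC keyword lookup that the real EclUtil code has to handle): loading one vector from the transposed layout is a single contiguous read, while the per-time-step layout needs one small scattered read per report step.

#include <cstdio>
#include <vector>

// Per-vector ("transposed" / LODSMRY-like) layout: all time steps of one
// vector are contiguous, so one seek + one read is enough.
std::vector<float> load_vector_transposed(std::FILE* f,
                                          std::size_t vector_index,
                                          std::size_t n_timesteps)
{
    std::vector<float> values(n_timesteps);
    std::fseek(f, static_cast<long>(vector_index * n_timesteps * sizeof(float)), SEEK_SET);
    std::fread(values.data(), sizeof(float), n_timesteps, f);
    return values;
}

// Per-time-step (UNSMRY-like) layout: one seek + one tiny read per time
// step, i.e. ~3000 scattered reads per vector for the dimensions above.
std::vector<float> load_vector_per_timestep(std::FILE* f,
                                            std::size_t vector_index,
                                            std::size_t n_timesteps,
                                            std::size_t n_vectors)
{
    std::vector<float> values(n_timesteps);
    for (std::size_t t = 0; t < n_timesteps; ++t) {
        std::fseek(f, static_cast<long>((t * n_vectors + vector_index) * sizeof(float)), SEEK_SET);
        std::fread(&values[t], sizeof(float), 1, f);
    }
    return values;
}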

This implementation has limitations, as mentioned by Alf. The most severe, as I see it, is that we have to wait until all simulations are finished and then convert. Monitoring simulation results while the runs are in progress is important.

The vendor of Eclipse has taken a similar approach to what is done with LODSMRY in OPM. When running Eclipse with the eclrun wrapper script from Schlumberger, an h5 file (HDF5 format) is created from SMSPEC + UNSMRY at the end of the simulation. Notice that the h5 file will not be created if the run is stopped before the end of the simulation. Hence, the same limitations as our current implementation. The H5 file from Eclipse is not supported by

I’ve had a look at the h5 file from eclrun (with HDCompass) and I see that this file has one dataset per vector, which means that there would be a large number of datasets (one per vector) that need to be updated for each timestep if we were to use this exact same format for writing an h5 file that is continuously updated as the simulation progresses.

I have been testing HDF5 on my own, and what I have looked into is saving all of the summary vectors in one two-dimensional dataset (number of timesteps x number of vectors). The dataspace is defined as unlimited in one of the dimensions (time steps). It is then easy to add summary data in chunks along that dimension (H5Sselect_hyperslab) during the simulation. It is also easy to extract data along “the opposite” dimension. So far I have only tested this on a small test case (5 x 4), so I don’t know how this will scale or whether the performance on real models is good. I’m planning to find out now.
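What I have been testing looks roughly like the sketch below (plain HDF5 C API, with a made-up file name, dataset name and chunk size, and all error checking left out):

#include <hdf5.h>
#include <vector>

int main()
{
    const hsize_t n_vectors = 4;   // toy size, as in my 5 x 4 test

    hid_t file = H5Fcreate("SUMMARY_TEST.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);

    // 2D dataset (time steps x vectors), unlimited in the time dimension
    hsize_t dims[2]    = {0, n_vectors};
    hsize_t maxdims[2] = {H5S_UNLIMITED, n_vectors};
    hid_t space = H5Screate_simple(2, dims, maxdims);

    // an extendible dataset must be chunked; here one time step per chunk
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[2] = {1, n_vectors};
    H5Pset_chunk(dcpl, 2, chunk);

    hid_t dset = H5Dcreate2(file, "summary_data", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    // append one row (one report step) at a time as the simulation runs
    for (hsize_t step = 0; step < 5; ++step) {
        std::vector<float> row(n_vectors, static_cast<float>(step));

        hsize_t new_dims[2] = {step + 1, n_vectors};
        H5Dset_extent(dset, new_dims);

        hid_t filespace = H5Dget_space(dset);
        hsize_t offset[2] = {step, 0};
        hsize_t count[2]  = {1, n_vectors};
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

        hid_t memspace = H5Screate_simple(2, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, H5P_DEFAULT, row.data());

        H5Sclose(memspace);
        H5Sclose(filespace);
    }

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

Reading a single vector afterwards is then just a hyperslab selection of size (number of timesteps x 1) in the other dimension; the chunk shape will probably need tuning for that read pattern on real models.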

I also believe that the "new" summary file should be self-contained, that is, not dependent on the SMSPEC file (as is the case with the current LODSMRY file in OPM). From inspection of the h5 file from eclrun with HDCompass, I believe that file is self-contained. The h5 file from eclrun is supported in S3graf when used together with a SMSPEC file, but it is not supported alone.


Regards

Torbjørn Skille

-----Original Message-----
From: Opm <opm-bounces at opm-project.org> On Behalf Of Alf Birger Rustad
Sent: tirsdag 19. mai 2020 09:50
To: Joakim Hove <joakim.hove at opm-op.com>; opm at opm-project.org
Subject: Re: [Opm] File formats

> The feasibility of implementing/using said format in post-processing
    tools should therefore be an important criterion.

I would even say a prerequisite. We already have it in opm-common in a shape that can be used without post-processing tools, but if we are to support it within Flow, I believe we must have support in at least ResInsight.

> I *think* Petrel / eclrun / eclipse has some functionality in this
    regard - if this is a file we can be compatible with, that would make
    very much sense.

Thanks for pointing it out. Yes, there is such a format. There are still a number of unknowns related to that format. What I believe is already clear is that it is not supported by Eclipse directly, so it is also of the type that is created after the simulation is done. If anybody knows more about this format, please share.

> In addition to HDF5 I would consider looking into Parquet, which at
    least is a much newer format than HDF5

Thanks for the suggestion! Yes, we should read up on alternatives before deciding. If anybody has any experience or knowledge on any of the containers, please share. I am in deep water here 😊

-----Original Message-----
From: Opm <opm-bounces at opm-project.org> On Behalf Of Joakim Hove
Sent: tirsdag 19. mai 2020 07:23
To: opm at opm-project.org
Subject: Re: [Opm] File formats

My take on this is:

 1. Yes, I see the value of a transposed file format - however, the value
    is quite limited before it is implemented in post-processing tools.
    The feasibility of implementing/using said format in post-processing
    tools should therefore be an important criterion.
 2. I *think* Petrel / eclrun / eclipse has some functionality in this
    regard - if this is a file we can be compatible with, that would make
    very much sense.
 3. In addition to HDF5 I would consider looking into Parquet, which at
    least is a much newer format than HDF5.

Here is an extensive file-format comparison:
https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf



On 5/18/20 5:51 PM, Alf Birger Rustad wrote:
> Dear community,
>
> We are at a crossroads with respect to file formats, and I hope you are motivated to help us arrive at the best solution. We need better load-on-demand performance for summary files than what is currently possible with the default Eclipse format for summary files. Currently you will find an implementation in opm-common that simply transposes the summary vectors, while still using the same Fortran77 binary format. That approach has three main drawbacks. One is that it is not supported by any post-processing application (yet).
> The second is that it can only be created from a finished simulation, so you need to wait for simulations to finish before you get the performant result file.

For a traditional column-oriented file format in any sense, I think you will need to write out the file in full, i.e. I think this limitation will apply anyway. Using a database format might resolve this, or at least handle the appending transparently, but that is maybe a bit of overkill?


> The third being that it is not suited for parallel processing, so forget about each process writing out its part.

For the summary files, that is not so relevant, because the final calculation of summary properties like WWCT = WWPR / (WWPR + WOPR) is only done on the IO rank anyway.


Joakim

