An outline of the HEPData publishing process can be found on the “Policy and Procedure” page. It involves uploading a tar or gzip archive of data files in the HEPData-compliant format (YAML) and having the uploaded material checked by a designated IRC member directly on the HEPData site. Communication between the uploader and the reviewer is arranged by the HEPData coordinator. Once the uploaded material is reviewed and certified by the IRC, it will be made public (“finalized”) by the HEPData coordinator.
The data package prepared for submission to HEPData must conform to the specific format required by the HEPData portal - please see the HEPData documentation for details of the requirements. There is a useful collection of tips on the HEPData site; please peruse it. In addition, the DAP team has created a few simple examples, kept in the PHENIX repository on GitHub, to illustrate the basic features and options of the HEPData format. Beginners are encouraged to experiment with these examples by using the “sandbox” feature of the HEPData Portal (see Appendix B below).
The basic structure of a submission package is as follows. The data content of each item included in the package (e.g. a plot) is described in a corresponding individual file formatted as YAML. For example, if there are 5 plots in the paper, you are expected to provide 5 YAML data files. In addition, a special YAML file, submission.yaml, describes the submission package as a whole, e.g. it provides the names of the data and optional image files, the list of keywords, etc.; a minimal sketch is shown below. It also contains an abstract (typically imported as LaTeX from the publication material); unfortunately, not every LaTeX feature will work correctly on HEPData and the output will need to be checked (see the “sandbox” reference below). Somewhat confusingly, the text of the abstract is contained in an attribute of the YAML structure named “comment”.
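For orientation, here is a minimal mock-up of a submission.yaml describing a single table; all names and descriptions are placeholders, and the examples in the PHENIX repository remain the authoritative reference:
# PPG999 (placeholder; see the comment-line conventions below)
comment: Abstract of the paper goes here, typically imported as LaTeX.
---
name: "Figure 1"
description: Placeholder description of the plotted quantity.
keywords:
- {name: reactions, values: [P P --> GAMMA GAMMA X]}
data_file: figure1.yaml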
Since YAML allows comments - lines starting with a “#” sign - it is very easy to add any sort of extra information to submission.yaml that may be helpful for communication with members of the Collaboration and reviewers, and for the workflow of the submission process in general. For example, it is necessary to provide the Inspire ID of the paper for the HEPData submission; it should be placed in a comment line. Including the PHENIX-internal PPG identifier is also highly recommended, as it reduces the chances of human error and facilitates communication. There can be any number of comment lines, and including information about the designated reviewer (the IRC member for the paper) as an additional comment line is encouraged but not mandatory. Consider the following mock-up comment lines (which would be found at the top of submission.yaml):
# PPG999
# InspireHEP: 99999
# Reviewing IRC member: M.Phenix mphenix@bnl.gov
These comment lines are not to be confused with the comment attribute of the YAML file, which almost always contains the abstract of the published paper, typically typeset in LaTeX:
comment: The PHENIX Collaboration at the Relativistic Heavy Ion Collider has measured
  open heavy-flavor production in minimum bias Au$+$Au collisions at $\sqrt{s_{NN}}=200$ GeV
It is probably easiest to use existing examples of submission.yaml files for guidance; they can be found in the official PHENIX repository. If there are existing data files in an ad-hoc format (plain text, etc.), these can be converted to the HEPData format with some effort, and the DAP team is looking at technical solutions to facilitate this process. For example, if plots are generated using ROOT macros, the code can be instrumented to output the same data in a format compatible with HEPData; a minimal conversion sketch is shown below. There is also a helpful write-up about preparing data for upload.
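As an illustration of such a conversion, the following minimal Python sketch (not an official DAP tool) reads a hypothetical plain-text file with one “x value error” triple per line and writes a HEPData-style data file using the PyYAML package; the file names and column layout are assumptions:
import yaml  # PyYAML: pip install pyyaml

# Hypothetical input layout: one "x value error" triple per line.
rows = []
with open('points.txt') as f:
    for line in f:
        x, value, error = (float(t) for t in line.split())
        rows.append((x, value, error))

# Assemble the dictionary mirroring the HEPData data-file schema.
table = {
    'independent_variables': [
        {'header': {'name': 'x'},
         'values': [{'value': x} for x, _, _ in rows]}],
    'dependent_variables': [
        {'header': {'name': 'y'},
         'values': [{'value': v,
                     'errors': [{'label': 'total', 'symerror': e}]}
                    for _, v, e in rows]}],
}

with open('figure1.yaml', 'w') as f:
    yaml.dump(table, f, default_flow_style=False)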
Both fixed-point and scientific-notation formats for numerical data will be parsed correctly on the HEPData portal, and the visualization will reflect the values properly. The following snippet of YAML will work as expected:
dependent_variables:
- header: {name: '$d\sigma/dM_{\gamma\gamma}$', units: 'mb/MeV'}
  values:
  - errors:
    - {label: total, symerror: 0.005}
    value: 1.5e-1
  - errors:
    - {label: total, symerror: 0.005}
    value: 0.19
  - errors:
    - {label: total, symerror: 0.006}
    value: 0.16
  - errors:
    - {label: total, symerror: 0.03}
    value: 0.12
  - errors:
    - {label: total, symerror: 0.02}
    value: 0.09
independent_variables:
- header: {name: 'Invariant Mass $M_{\gamma\gamma}$', units: 'GeV'}
  values:
  - {value: 1.4e-1, low: 0.08, high: 1.7e-1}
  - {value: 0.2, low: 0.18, high: 0.22}
  - {value: 0.3, low: 0.27, high: 0.33}
  - {value: 0.4, low: 0.35, high: 0.43}
  - {value: 0.5, low: 0.47, high: 0.55}
There is one caveat though: data tables in the corresponding sections of the HEPData page will display these data “as is”, i.e. without further conversion. In the example above, the resulting table on the HEPData page will contain 1.5e-1 and 0.19 in adjacent table cells. The person responsible for an upload package should decide whether it is acceptable or desirable to mix the two different notations in tables.
A separate issue is how many decimal places should be kept in the data points themselves vs. their associated errors. This is currently being formalized as a policy; if in doubt, please contact the DAP team. One common convention is illustrated below.
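While the official policy is pending, a widely used convention is to round the error to one or two significant figures and quote the value to the same decimal place. The short Python sketch below illustrates that convention; the helper name is hypothetical, and the convention itself is an assumption rather than established PHENIX policy:
import math

def round_to_error(value, error, sig_figs=2):
    # Round the error to sig_figs significant figures and
    # quote the value to the same decimal place.
    # Illustrative helper, not an official PHENIX/DAP tool.
    if error == 0:
        return value, error
    exponent = math.floor(math.log10(abs(error)))
    decimals = sig_figs - 1 - exponent
    return round(value, decimals), round(error, decimals)

print(round_to_error(0.15372, 0.00463))  # prints (0.1537, 0.0046)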
As with many other data repositories, consistent use of keywords is essential for data discoverability. HEPData makes it possible to attach a set of keywords and key phrases to the submission package by adding a properly formatted section to the submission.yaml file, as illustrated below:
keywords: # used for searching, possibly multiple values for each keyword
- {name: reactions, values: [P P --> Z0 Z0 X]}
- {name: observables, values: [SIG]}
- {name: cmenergies, values: [7000.0]}
- {name: phrases, values: [Inclusive, Integrated Cross Section, Cross Section, Proton-Proton Scattering, Z Production, Z pair Production]}
The HEPData keywords page contains useful details. The examples created by the PHENIX team as tutorials illustrate how the keywords are placed in the submission.yaml file. These examples use a slightly different YAML notation (block style instead of flow style), which is equally valid:
keywords:
- name: observables
  values: [MASS]
- name: cmenergies
  values: [2000.0]
- name: reactions
  values: [P P --> GAMMA GAMMA X]
- name: phrases
  values: ['PHENIX', 'ppg999', 'example 1', 'gamma gamma', 'Proton-Proton Scattering']
In the YAML snippet shown above, “reactions”, “cmenergies” and “observables” are predefined keywords, while the last item (“phrases”) contains optional key phrases supplied by the user (which can in principle contain anything).
If applicable, it is very helpful to include a description of the reaction(s) pertinent to the publication in the keywords section, e.g.
# note that contrary to other notation systems there is no "+" sign anywhere
- {name: reactions, values: [P P --> Z0 Z0 X]}
# alternative notation for the same reaction that can be used interchangeably
- name: reactions
  values: [P P --> Z0 Z0 X]
When describing the reactions it is important to use the particle notation universally used on the HEPData portal. Note that ‘Au’ (the gold nucleus) is conspicuously missing from the list, but Uranium is included. Since particle names are capitalized as per the HEPData convention, it is recommended to use ‘AU’ whenever a reference to the gold nucleus is needed. Quarks are denoted as ‘UQ’, ‘DQ’ and so on. As seen in the example above, in the description of a reaction the particles are separated by spaces, and the arrow is written in plain alphanumeric characters. A mock-up Au+Au entry is shown below.
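For illustration, a reaction entry for a hypothetical Au+Au measurement might look as follows; the final state shown here is a placeholder and should be adapted to the actual measurement:
# mock-up Au+Au reaction entry; note ‘AU’ for the gold nucleus
- {name: reactions, values: [AU AU --> PI0 X]}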
The “phrases” section can contain essentially any sort of key phrases, including completely new ones; however, it is helpful to start with the list of key phrases historically used on HEPData, since these are more likely to appear in queries on the site. Note that the “keywords” page on this PHENIX site contains keywords to be used primarily for the PHENIX materials uploaded to the Zenodo portal, which are mostly distinct from the HEPData conventions. If you have any comments or suggestions regarding the keywords, please contact the DAP Team.
It is strongly recommended to pay close attention to the content of your submission, i.e. the numerical values of the data points, error bars, labels and scales of axes, etc., so that the finalized material is indeed final and will never need any adjustments.
If, despite best efforts, there is an inaccuracy in a finalized submission, it can still be corrected by the HEPData coordinator for PHENIX, who must personally resubmit the package with the corrections included. In that case, a new version of the HEPData entry is created on the portal and becomes available by default under the same HEPData ID, while the old version remains available via a specific reference (“v1” etc.). While this guarantees that the data presented on the site are the most accurate according to our judgement, it also creates a certain ambiguity for people who have already accessed the old data and perhaps performed downloads. While this can be addressed by communicating the fact of the correction, such situations should be avoided, so the quality of the original submission is paramount.
The Python package “hepdata-validator” (available from GitHub) can be used to quickly check the validity of a file. You may need to install it first with a standard Python utility such as pip (e.g. pip install hepdata-validator). A simple example of usage:
>>> from hepdata_validator.data_file_validator import DataFileValidator
>>> df=DataFileValidator()
>>> df.validate(file_path='myGoodFile.yaml')
True
>>> df.validate(file_path='myBadFile.yaml')
False
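When validation fails, the package can also report the reasons. The following continuation of the session above is a sketch assuming a reasonably recent version of hepdata-validator; current releases provide a print_errors method and a companion SubmissionFileValidator class for checking submission.yaml itself, but consult the package documentation for your installed version:
>>> df.print_errors('myBadFile.yaml')   # prints the validation error messages
>>> from hepdata_validator.submission_file_validator import SubmissionFileValidator
>>> sf = SubmissionFileValidator()
>>> sf.validate(file_path='submission.yaml')
True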
The “sandbox” feature of the HEPData portal allows the user to thoroughly validate the submission package in the exact form in which it will eventually be published. It will catch trivial errors in YAML formatting, typos in the provided filenames, missing files, etc. Importantly, it helps to determine whether the LaTeX-formatted abstract is rendered correctly - the LaTeX dialect used on the HEPData site is slightly non-standard, e.g. the “%” character is rendered correctly without a backslash escape and, conversely, it is impossible to include LaTeX comments in the abstract body.
Using the sandbox feature requires an account on HEPData, which is quite trivial to obtain. Once you log into the portal, the “sandbox” option will be clearly marked in the menu in the upper right corner of the Web page. The sandbox entry won’t be visible to anyone without the link generated by the system, so the data are protected in this manner. Uploading material will require the creation of a tar archive of your files (see the example command below). It is a good idea to delete the tar file once you complete validation, to keep the repository folder clean. The sandbox section has a useful “dashboard” feature which allows the user to keep track of their test submissions; you can delete entries you no longer need from your sandbox to make navigation easier.
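For example, assuming submission.yaml and the data files sit in the current directory (the archive and file names here are placeholders):
tar czf ppg999_hepdata.tar.gz submission.yaml figure1.yaml figure2.yaml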
If an upload to the sandbox is successful, the Web page will eventually refresh and the rendered contents will be shown. The time required to render the contents of your submission will depend on its complexity - simple single-table submissions will render in 10-20 seconds, while complex multi-table entries may take minutes. If there is a failure, the user is redirected to the file upload page. Either way, a notification is sent via e-mail telling the user about the status of their submission (i.e. success or failure). The diagnostic messages generated by the sandbox validation system and delivered via e-mail are often useful and help identify problems.
The HEPData materials are kept in a designated folder in the PHENIX Collaboration HEPData repository on GitHub. Adding your HEPData files to the repository is done as follows: