AqSolDB-A-curated-aqueous-solubility-dataset


active · ARFF · CC0: Public Domain · Visibility: public · Uploaded 23-03-2022 by Onur Yildirim
  • Computer Systems Machine Learning
Context

AqSolDB, created by the Autonomous Energy Materials Discovery [AMD] research group, consists of aqueous solubility values for 9,982 unique compounds curated from 9 different publicly available aqueous solubility datasets. This openly accessible dataset, the largest of its kind, is intended to serve not only as a useful reference source of measured solubility data but also as a much improved and generalizable training data source for building data-driven models.

Content

In addition to curated experimental solubility values, AqSolDB contains relevant topological and physico-chemical 2D descriptors calculated by RDKit, as well as validated molecular representations of each compound.

Citation

If you use AqSolDB in your study, please cite the following paper.
Paper: Nature Scientific Data - https://doi.org/10.1038/s41597-019-0151-1
Reproducible code: Code Ocean - https://doi.org/10.24433/CO.1992938.v1

Sources of AqSolDB

  • eChemPortal - The Global Portal to Information on Chemical Substances. https://www.echemportal.org/.
  • Meylan, W. M. Preliminary Report: Water Solubility Estimation by Base Compound Modification. Environmental Science Center, Syracuse, NY (1995).
  • Raevsky, O. A., Grigorev, V. Y., Polianczyk, D. E., Raevskaja, O. E. & Dearden, J. C. Calculation of aqueous solubility of crystalline un-ionized organic chemicals and drugs based on structural similarity and physicochemical descriptors. Journal of Chemical Information and Computer Sciences 54, 683–691 (2014).
  • Meylan, W. M. & Howard, P. H. Upgrade of PCGEMS Water Solubility Estimation Method. Environmental Science Center, Syracuse, NY (1994).
  • Huuskonen, J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. Journal of Chemical Information and Computer Sciences 40, 773–777 (2000).
  • Wang, J., Hou, T. & Xu, X. Aqueous solubility prediction based on weighted atom type counts and solvent accessible surface areas. Journal of Chemical Information and Modeling 49, 571–581 (2009).
  • Delaney, J. S. ESOL: estimating aqueous solubility directly from molecular structure. Journal of Chemical Information and Computer Sciences 44, 1000–1005 (2004).
  • Llinas, A., Glen, R. C. & Goodman, J. M. Solubility challenge: can you predict solubilities of 32 molecules using a database of 100 reliable measurements? Journal of Chemical Information and Modeling 48, 1289–1303 (2008).

25 features

ID (ignore): string, 9982 unique values, 0 missing
Name: string, 9893 unique values, 0 missing
InChI: string, 9982 unique values, 0 missing
InChIKey: string, 9982 unique values, 0 missing
SMILES: string, 9917 unique values, 0 missing
Solubility: numeric, 7872 unique values, 0 missing
SD: numeric, 2169 unique values, 0 missing
Ocurrences: numeric, 15 unique values, 0 missing
Group: string, 5 unique values, 0 missing
MolWt: numeric, 6907 unique values, 0 missing
MolLogP: numeric, 8193 unique values, 0 missing
MolMR: numeric, 8318 unique values, 0 missing
HeavyAtomCount: numeric, 107 unique values, 0 missing
NumHAcceptors: numeric, 40 unique values, 0 missing
NumHDonors: numeric, 19 unique values, 0 missing
NumHeteroatoms: numeric, 56 unique values, 0 missing
NumRotatableBonds: numeric, 61 unique values, 0 missing
NumValenceElectrons: numeric, 307 unique values, 0 missing
NumAromaticRings: numeric, 16 unique values, 0 missing
NumSaturatedRings: numeric, 14 unique values, 0 missing
NumAliphaticRings: numeric, 14 unique values, 0 missing
RingCount: numeric, 20 unique values, 0 missing
TPSA: numeric, 2325 unique values, 0 missing
LabuteASA: numeric, 8119 unique values, 0 missing
BalabanJ: numeric, 7057 unique values, 0 missing
BertzCT: numeric, 7608 unique values, 0 missing

19 properties

Number of instances (rows) of the dataset: 9982
Number of attributes (columns) of the dataset: 25
Number of distinct values of the target attribute (if it is nominal): (no value listed)
Number of missing values in the dataset: 0
Number of instances with at least one value missing: 0
Number of numeric attributes: 20
Number of nominal attributes: 0
Number of attributes divided by the number of instances: 0
Percentage of numeric attributes: 80
Percentage of instances belonging to the most frequent class: (no value listed)
Percentage of nominal attributes: 0
Number of instances belonging to the most frequent class: (no value listed)
Percentage of instances belonging to the least frequent class: (no value listed)
Number of instances belonging to the least frequent class: (no value listed)
Number of binary attributes: 0
Percentage of binary attributes: 0
Percentage of instances having missing values: 0
Average class difference between consecutive instances: (no value listed)
Percentage of missing values: 0
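
The count-based properties above follow from simple arithmetic on the feature list. The following is a minimal sketch of that arithmetic, not the platform's own computation. It assumes that the count of 25 attributes excludes the ID column marked "(ignore)", which is inferred from the numbers and not stated on the page. The names `FEATURE_TYPES`, `N_INSTANCES`, and `summarize` are hypothetical.

```python
# Feature types as listed on this page.
# Assumption: the ID column marked "(ignore)" is not counted among the 25 attributes.
FEATURE_TYPES = {
    "Name": "string", "InChI": "string", "InChIKey": "string",
    "SMILES": "string", "Group": "string",
    "Solubility": "numeric", "SD": "numeric", "Ocurrences": "numeric",
    "MolWt": "numeric", "MolLogP": "numeric", "MolMR": "numeric",
    "HeavyAtomCount": "numeric", "NumHAcceptors": "numeric",
    "NumHDonors": "numeric", "NumHeteroatoms": "numeric",
    "NumRotatableBonds": "numeric", "NumValenceElectrons": "numeric",
    "NumAromaticRings": "numeric", "NumSaturatedRings": "numeric",
    "NumAliphaticRings": "numeric", "RingCount": "numeric",
    "TPSA": "numeric", "LabuteASA": "numeric",
    "BalabanJ": "numeric", "BertzCT": "numeric",
}
N_INSTANCES = 9982  # number of rows reported on the page


def summarize(feature_types, n_instances):
    """Recompute the count-based dataset properties from the feature list."""
    n_attrs = len(feature_types)
    n_numeric = sum(1 for t in feature_types.values() if t == "numeric")
    return {
        "n_attributes": n_attrs,
        "n_numeric": n_numeric,
        "pct_numeric": 100.0 * n_numeric / n_attrs,
        "attrs_per_instance": n_attrs / n_instances,
    }


summary = summarize(FEATURE_TYPES, N_INSTANCES)
print(summary)
```

Under these assumptions the ratio of attributes to instances is roughly 0.0025, which would explain why the page shows it as 0 if the value is rounded for display.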

0 tasks
