Incident reports from the San Francisco Police Department between January 2003 and May 2018, provided by the City and County of San Francisco. The dataset was downloaded on 05.11.2018 from [https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-Historical-2003/tmnf-yvry]. For a description of all variables, see the data provider's homepage. The original data was published under the ODC Public Domain Dedication and Licence (PDDL) [https://opendatacommons.org/licenses/pddl/1.0/]. The binary variable 'ViolentCrime' was created as the target. An incident was labeled as a 'ViolentCrime' if 'Category' %in% c('ASSAULT', 'ROBBERY', 'SEX OFFENSES, FORCIBLE', 'KIDNAPPING') | 'Descript' %in% c('GRAND THEFT PURSESNATCH', 'ATTEMPTED GRAND THEFT PURSESNATCH'). The additional date and time features 'Hour', 'DayOfWeek', 'Month', and 'Year' were created. The original variables 'Category', 'Descript', 'Date', 'Time', 'Resolution', 'Location', and 'PdId' were removed from the dataset. The single record with a missing value in 'PdDistrict', the only missing value in that variable, was also removed. Using this dataset for machine learning was inspired by Nina Zumel's blog post [http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/]. Note that an incident spans multiple rows in the dataset when the crime belongs to more than one 'Category'; such rows share the same value of the ID variable 'IncidntNum' (ignored by default).
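
The target definition and the derived date and time features described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the dataset: the helper names `is_violent_crime` and `date_time_features`, the example rows, the assumed raw formats (`MM/DD/YYYY` for 'Date', `HH:MM` for 'Time'), and the encoding of 'DayOfWeek' as an English weekday name are all assumptions.

```python
from datetime import datetime

# Category and description values taken from the target definition above.
VIOLENT_CATEGORIES = {"ASSAULT", "ROBBERY", "SEX OFFENSES, FORCIBLE", "KIDNAPPING"}
VIOLENT_DESCRIPTIONS = {"GRAND THEFT PURSESNATCH", "ATTEMPTED GRAND THEFT PURSESNATCH"}

# Weekday names indexed by datetime.weekday() (Monday == 0); assumed encoding.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def is_violent_crime(category, descript):
    """Return True if the row matches the 'ViolentCrime' definition."""
    return category in VIOLENT_CATEGORIES or descript in VIOLENT_DESCRIPTIONS


def date_time_features(date_str, time_str):
    """Derive 'Hour', 'DayOfWeek', 'Month', and 'Year' from raw strings.

    Assumes 'Date' is MM/DD/YYYY and 'Time' is HH:MM (hypothetical formats).
    """
    ts = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%Y %H:%M")
    return {
        "Hour": ts.hour,
        "DayOfWeek": DAY_NAMES[ts.weekday()],
        "Month": ts.month,
        "Year": ts.year,
    }


# Hypothetical example rows, not taken from the dataset.
rows = [
    {"Category": "ASSAULT", "Descript": "BATTERY",
     "Date": "01/19/2015", "Time": "14:00"},
    {"Category": "LARCENY/THEFT", "Descript": "GRAND THEFT PURSESNATCH",
     "Date": "05/15/2018", "Time": "23:30"},
    {"Category": "VEHICLE THEFT", "Descript": "STOLEN AUTOMOBILE",
     "Date": "03/02/2003", "Time": "08:05"},
]

labels = [is_violent_crime(r["Category"], r["Descript"]) for r in rows]
features = [date_time_features(r["Date"], r["Time"]) for r in rows]
```

The second example row shows why the definition checks 'Descript' as well as 'Category': a purse snatch is filed under a non-violent category but is still labeled as a violent crime.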