Using R in PostgreSQL

From edegan.com
Revision as of 12:26, 20 March 2018 by JeeminS (talk | contribs) (→‎Sample call of an aggregate function)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Perhaps a Simpler Alternative

Aggregate functions in PostgreSQL:
https://www.postgresql.org/docs/9.0/static/functions-aggregate.html

Aggregate Functions

Table 9-44. Aggregate Functions for Statistics

      Function	        |                    Argument Type	                   |                       Return Type	                             |                                        Description
=======================+==========================================================+=================================================================+=============================================================================================
corr(Y, X)	        | double precision                                         | double precision 	                                             |  correlation coefficient
covar_pop(Y, X)        | double precision                                         | double precision 	                                             |  population covariance
covar_samp(Y, X)       | double precision                                         | double precision		                                     |   sample covariance
regr_avgx(Y, X)        | double precision                                         | double precision  	                                             | average of the independent variable (sum(X)/N)
regr_avgy(Y, X)        | double precision                                         | double precision		                                     |  average of the dependent variable (sum(Y)/N)
regr_count(Y, X)       | double precision                                         | bigint		                                             | number of input rows in which both expressions are nonnull
regr_intercept(Y, X)   | double precision                                         | double precision	                                             | y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs
regr_r2(Y, X)	        | double precision                                         | double precision	                                             | square of the correlation coefficient
regr_slope(Y, X)       | double precision                                         | double precision		                                     | slope of the least-squares-fit linear equation determined by the (X, Y) pairs
regr_sxx(Y, X)	        | double precision                                         | double precision		                                     |	sum(X^2) - sum(X)^2/N ("sum of squares" of the independent variable)
regr_sxy(Y, X)	        | double precision                                         | double precision		                                     |	sum(X*Y) - sum(X) * sum(Y)/N ("sum of products" of independent times dependent variable)
regr_syy(Y, X)	        | double precision                                         | double precision	                                             |sum(Y^2) - sum(Y)^2/N ("sum of squares" of the dependent variable)
stddev(expression)     | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric|	historical alias for stddev_samp
stddev_pop(expression)	| smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric|	population standard deviation of the input values
stddev_samp(expression)| smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric|	sample standard deviation of the input values
variance(expression)   | smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric|	historical alias for var_samp
var_pop(expression)	| smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric|	population variance of the input values (square of the population standard deviation)
var_samp(expression)	| smallint, int, bigint, real, double precision, or numeric| double precision for floating-point arguments, otherwise numeric|	sample variance of the input values (square of the sample standard deviation)

Sample call of an aggregate function

SELECT regr_slope(tothullcount, avghullcount) FROM hcllayerwinv;

Output

   regr_slope    
------------------
 2.31850202053387
(1 row)

Joe Conway's Documentation on pl/r

http://joeconway.com/doc/plr-US.pdf 
http://www.joeconway.com/plr/doc/plr-funcs.html
http://www.joeconway.com/plr/doc/plr-window-funcs.html

Sample Function Code in PostgreSQL

Creating the Sample Function

Inside a database (ex) tigertest):

CREATE OR REPLACE FUNCTION r_max (integer, integer) RETURNS integer AS ’
    if (arg1 > arg2)
        return(arg1)
    else
        return(arg2)
’ LANGUAGE ’plr’ STRICT;

Calling the Sample Function

SELECT * FROM r_max(2,3);

returns

 r_max 
-------
     3
(1 row)

Attempts to call R built-in function from PSQL

Regression

Creating Function:

CREATE OR REPLACE FUNCTION r_reg(y numeric[], x numeric[])
RETURNS numeric AS $$
 res <- lm(y ~ x)
 return(coef(res)[1])
$$ LANGUAGE 'plr' immutable;

Calling Function:

SELECT r_reg(tothullcount, avghullcount) from hcllayerwinv;

Results:

  r_reg 
-------
    0
    0
    3
    0
    5
    4
    0
    0
    5
    4
    0
    0
    6
    4
    4
    0
    0
    4
    0
    0
    4
    0
    0
    4
    0
    0
    3
    0
    5
    3
    0
    0
    7
    5
    3
    3
    0
    0
    6
    4
    0
    0
    0
    5
    4
    0
    0
    7

hcllayerwinv table from tigertest database

                       Table "public.hcllayerwinv"
      Column      |          Type          | Collation | Nullable | Default 
------------------+------------------------+-----------+----------+---------
 place            | character varying(100) |           |          | 
 statecode        | character varying(2)   |           |          | 
 year             | integer                |           |          | 
 layer            | integer                |           |          | 
 lcount           | numeric                |           |          | 
 nosingleton      | bigint                 |           |          | 
 nomultiton       | bigint                 |           |          | 
 nopair           | bigint                 |           |          | 
 nohull           | bigint                 |           |          | 
 totmultitoncount | numeric                |           |          | 
 avgmultitoncount | numeric                |           |          | 
 totpaircount     | numeric                |           |          | 
 avgpaircount     | numeric                |           |          | 
 totpairlength    | double precision       |           |          | 
 avgpairlength    | double precision       |           |          | 
 minpairlength    | double precision       |           |          | 
 maxpairlength    | double precision       |           |          | 
 tothullcount     | numeric                |           |          | 
 avghullcount     | numeric                |           |          | 
 tothullarea      | double precision       |           |          | 
 avghullarea      | double precision       |           |          | 
 minhullarea      | double precision       |           |          | 
 maxhullarea      | double precision       |           |          | 
 tothulldensity   | double precision       |           |          | 
 avghulldensity   | double precision       |           |          | 
 seedearlyinvf    | double precision       |           |          | 
 inflator         | numeric                |           |          | 
 seedearlyinvl16f | double precision       |           |          |

Regress by

--By: place, statecode, layer, --Regress: seedearlyinvl16f ON nosingleton, totmultitoncount, totpaircount, tothullcount, avgpairlength, avghullarea, avghulldensity

Installing PL/R on the dbase server

Check the version of the dbase by doing a

SELECT version();

This gives us 9.5.10

Then as root:

apt-get install postgresql-9.5-plr

Finally enable the extension in a dbase:

create extension plr;

See also: http://www.joeconway.com/presentations/plr-PGConfNYC2014.pdf

Instructions on Use

To use R from pgAdmin III, follow the instructions in this tutorial: choose the database, click SQL, and run "CREATE EXTENSION plr;". This allows R functions to be used in this database from then on. You should be able to run the given examples in the tutorial after this is done. This was run on databases tigertest and template1.

Another possibly useful presentation on PL/R.

How PL/R was installed on the RDP

On 2018-01-18, PL/R was installed to allow PostgreSQL users to use R functions from SQL queries. It was installed using this tutorial. The link from the tutorial for PL/R doesn't work, I used this instead.

It required R version 3.3.0 for PL/R to work with PostgreSQL 9.5. If the environmental variables R_HOME or PATH are edited so that they point to a different version of R, this might not work, and PL/R might have to be reinstalled. If the version of PostgreSQL is updated, both the R version and PL/R version will have to match it as well.